
GOOGLE CLOUD PLATFORM DEEP DIVE PROJECTS

Learn Google Cloud Platform: From Zero to Cloud Architect

Goal: Deeply understand Google Cloud Platform—from basic resource management to building production-grade, globally distributed applications. Master compute, storage, networking, databases, serverless, Kubernetes, security, and infrastructure as code.


Why Google Cloud Platform Matters

Google Cloud Platform powers some of the world’s most demanding workloads—from YouTube’s video streaming to Google Search’s indexing. Unlike other cloud providers, GCP was built on the same infrastructure Google uses internally. Understanding GCP means understanding:

  • Global-first architecture: VPCs span the globe, not just regions
  • Containerization heritage: Kubernetes was born at Google (Borg → K8s)
  • Data at scale: BigQuery processes petabytes in seconds
  • AI/ML integration: TensorFlow, Vertex AI, and TPUs are native
  • Developer experience: gcloud CLI is incredibly powerful

After completing these projects, you will:

  • Architect and deploy applications across GCP’s compute services
  • Design secure, multi-region VPC networks
  • Implement IAM policies following principle of least privilege
  • Build data pipelines with BigQuery, Firestore, and Spanner
  • Deploy containerized applications on GKE and Cloud Run
  • Create CI/CD pipelines with Cloud Build
  • Monitor and observe production systems
  • Manage infrastructure as code with Terraform

Core Concept Analysis

GCP Resource Hierarchy

Organization (your-company.com)
    │
    ├── Folder: Engineering
    │   ├── Project: prod-web-app
    │   │   ├── Compute Engine VMs
    │   │   ├── Cloud SQL instance
    │   │   └── Cloud Storage bucket
    │   └── Project: dev-web-app
    │       └── ...
    │
    ├── Folder: Data
    │   └── Project: analytics-warehouse
    │       ├── BigQuery dataset
    │       └── Dataflow jobs
    │
    └── Folder: Shared Services
        └── Project: shared-vpc-host
            ├── VPC network
            └── Firewall rules

Fundamental Concepts

  1. Projects: The fundamental organizational unit. All GCP resources belong to a project. Projects have:
    • Unique project ID (globally unique, immutable)
    • Project number (GCP-assigned)
    • Billing account association
    • IAM policies
  2. Regions and Zones:
    • Region: Geographic location (us-central1, europe-west1)
    • Zone: Isolated location within a region (us-central1-a)
    • Global and multi-regional resources: VPC networks are global; Cloud Storage buckets and BigQuery datasets can span multiple regions
  3. Compute Options:
    • Compute Engine: IaaS, full VM control
    • GKE: Managed Kubernetes
    • Cloud Run: Serverless containers
    • Cloud Functions: Serverless functions (FaaS)
    • App Engine: PaaS (Standard/Flexible)
  4. Networking:
    • VPC: Global software-defined network
    • Subnets: Regional IP address ranges
    • Firewall rules: Stateful, tag-based
    • Cloud NAT: Outbound NAT for private VMs
    • Cloud Load Balancing: Global/regional, L4/L7
  5. Identity & Access Management (IAM):
    • Members: Users, groups, service accounts
    • Roles: Collections of permissions (Viewer, Editor, Owner, Custom)
    • Policies: Bindings of members to roles
    • Service Accounts: Machine identities
  6. Storage Options:
    • Cloud Storage: Object storage (like S3)
    • Cloud SQL: Managed MySQL/PostgreSQL
    • Cloud Spanner: Globally distributed relational
    • Firestore: NoSQL document database
    • BigQuery: Data warehouse for analytics
    • Bigtable: Wide-column NoSQL (HBase-compatible)

Project List

Projects are ordered from fundamental understanding to advanced implementations.


Project 1: GCP CLI & API Explorer (Understand the Control Plane)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Node.js, Bash
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Cloud APIs / Authentication / CLI
  • Software or Tool: gcloud CLI, REST APIs
  • Main Book: “Google Cloud Platform in Action” by JJ Geewax

What you’ll build: A custom CLI tool that wraps the GCP APIs to manage resources—create projects, list VMs, upload to storage, and query BigQuery—all while logging every API call to show you exactly what’s happening under the hood.

Why it teaches GCP: Before you can build on GCP, you need to understand how the control plane works. Every gcloud command is an API call. This project demystifies authentication, service accounts, API quotas, and the resource hierarchy.

Core challenges you’ll face:

  • Authentication with Application Default Credentials → maps to OAuth2, service accounts, and workload identity
  • Understanding the API structure → maps to resource paths: projects/{project}/zones/{zone}/instances/{instance}
  • Handling pagination and rate limits → maps to API quotas and exponential backoff
  • Parsing API responses → maps to understanding GCP’s resource representations

Key Concepts:

  • Difficulty: Beginner
  • Time estimate: Weekend
  • Prerequisites: Basic Python, understanding of REST APIs, terminal/shell familiarity

Real world outcome:

$ ./gcp-explorer auth status
✓ Authenticated as: developer@myproject.iam.gserviceaccount.com
✓ Default project: my-first-project
✓ Active APIs: compute.googleapis.com, storage.googleapis.com, bigquery.googleapis.com

$ ./gcp-explorer compute list
PROJECT          ZONE           NAME           STATUS    MACHINE_TYPE    IP
my-first-project us-central1-a  web-server-1   RUNNING   e2-medium       10.128.0.2
my-first-project us-central1-b  web-server-2   RUNNING   e2-medium       10.128.0.3

$ ./gcp-explorer storage upload ./data.csv gs://my-bucket/data/
API Call: POST https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o?uploadType=resumable
Response: 200 OK - Object created: data/data.csv (1.2 MB)

$ ./gcp-explorer bigquery query "SELECT COUNT(*) FROM \`project.dataset.table\`"
API Call: POST https://bigquery.googleapis.com/bigquery/v2/projects/my-project/queries
Job ID: bquxjob_12345
Result: 1,234,567 rows

Implementation Hints:

The GCP API follows a consistent pattern:

https://{service}.googleapis.com/{version}/{resource_path}

For authentication, you’ll use the google-auth library which handles:

  1. Checking the GOOGLE_APPLICATION_CREDENTIALS environment variable (path to a service account key file)
  2. Looking for the Application Default Credentials file created by gcloud auth application-default login
  3. Using the metadata server (on GCP VMs, GKE, and Cloud Run)
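
A minimal sketch of that chain in Python, assuming the google-auth and google-api-python-client packages and a placeholder zone; google.auth.default() walks the same order described above:

import google.auth
from googleapiclient import discovery

# Resolves credentials via: env var -> gcloud ADC file -> metadata server.
credentials, project = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
compute = discovery.build("compute", "v1", credentials=credentials)

# Each page of results is one HTTP GET against
# https://compute.googleapis.com/compute/v1/projects/{project}/zones/{zone}/instances
request = compute.instances().list(project=project, zone="us-central1-a")
while request is not None:
    response = request.execute()
    for instance in response.get("items", []):
        print(instance["name"], instance["status"])
    request = compute.instances().list_next(
        previous_request=request, previous_response=response
    )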

Questions to guide your implementation:

  • How does gcloud auth application-default login work?
  • What’s the difference between a user account and a service account?
  • How do you enable an API for a project?
  • What happens when you exceed API quotas?

Start by exploring the Compute Engine API to list instances, then expand to Storage and BigQuery.

Learning milestones:

  1. You authenticate successfully → You understand GCP’s identity model
  2. You make raw API calls → You see what gcloud does under the hood
  3. You handle errors gracefully → You understand quotas and permissions
  4. You chain operations together → You can orchestrate cloud resources

Project 2: Virtual Machine Orchestrator (Understand Compute Engine)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Terraform (HCL), Bash
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: IaaS / Virtual Machines / Instance Groups
  • Software or Tool: Compute Engine, Instance Templates
  • Main Book: “Google Cloud Platform in Action” by JJ Geewax

What you’ll build: A VM orchestration system that creates instance templates, deploys VMs across multiple zones, configures startup scripts, attaches disks, and implements health checking with automatic replacement of unhealthy instances.

Why it teaches Compute Engine: Compute Engine is the foundation of GCP’s compute services. Understanding VMs—their lifecycle, networking, storage, and metadata—is essential before moving to containers or serverless.

Core challenges you’ll face:

  • Creating instance templates → maps to immutable VM configurations
  • Zone vs. regional resources → maps to availability and fault tolerance
  • Startup scripts and metadata → maps to VM bootstrapping and configuration
  • Persistent disk management → maps to storage types, snapshots, images
  • Preemptible/Spot VMs → maps to cost optimization strategies

Key Concepts:

  • Difficulty: Intermediate
  • Time estimate: 1-2 weeks
  • Prerequisites: Project 1 (GCP CLI & API Explorer), Linux basics, SSH

Real world outcome:

$ ./vm-orchestrator create-template \
    --name "web-server-template" \
    --machine-type "e2-medium" \
    --image-family "debian-11" \
    --startup-script "./install-nginx.sh" \
    --tags "http-server,https-server"

Template created: web-server-template

$ ./vm-orchestrator deploy \
    --template "web-server-template" \
    --count 3 \
    --zones "us-central1-a,us-central1-b,us-central1-c"

Deploying 3 instances across 3 zones...
  ✓ web-server-1 (us-central1-a) - RUNNING - 10.128.0.5
  ✓ web-server-2 (us-central1-b) - RUNNING - 10.128.0.6
  ✓ web-server-3 (us-central1-c) - RUNNING - 10.128.0.7

$ ./vm-orchestrator health-check --group "web-server-group"
Instance          Zone            Status    Health    Last Check
web-server-1      us-central1-a   RUNNING   HEALTHY   2s ago
web-server-2      us-central1-b   RUNNING   HEALTHY   2s ago
web-server-3      us-central1-c   RUNNING   UNHEALTHY 2s ago (HTTP 503)
  → Replacing unhealthy instance...
  ✓ web-server-3-new spawned

Implementation Hints:

A Compute Engine instance has several key components:

  • Machine type: CPU/memory configuration (e2-micro, n2-standard-4)
  • Boot disk: OS image (debian-11, ubuntu-2204-lts)
  • Network interfaces: VPC, subnet, internal/external IPs
  • Metadata: Key-value pairs accessible from within the VM
  • Service account: Identity for the VM to call other GCP APIs
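
A sketch of how those components map onto the instances().insert() request body (not a finished orchestrator); the project, zone, image family, and startup script are placeholders:

import google.auth
from googleapiclient import discovery

credentials, project = google.auth.default()
compute = discovery.build("compute", "v1", credentials=credentials)
zone = "us-central1-a"

config = {
    "name": "web-server-1",
    "machineType": f"zones/{zone}/machineTypes/e2-medium",
    "disks": [{
        "boot": True,
        "autoDelete": True,
        "initializeParams": {
            "sourceImage": "projects/debian-cloud/global/images/family/debian-11"
        },
    }],
    "networkInterfaces": [{
        "network": "global/networks/default",
        # An access config requests an ephemeral external IP; omit it for private-only VMs.
        "accessConfigs": [{"type": "ONE_TO_ONE_NAT", "name": "External NAT"}],
    }],
    "tags": {"items": ["http-server"]},
    "metadata": {
        # The guest agent runs startup-script on first boot.
        "items": [{"key": "startup-script", "value": "#!/bin/bash\napt-get update && apt-get install -y nginx"}]
    },
}

operation = compute.instances().insert(project=project, zone=zone, body=config).execute()
print("Started operation:", operation["name"])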

Questions to guide your implementation:

  • How do you pass configuration to a VM at boot time?
  • What’s the difference between an image, snapshot, and instance template?
  • How does GCP’s metadata server work (169.254.169.254)?
  • When should you use preemptible VMs vs. standard VMs?

Start with a single VM, then implement templating, then multi-zone deployment.

Learning milestones:

  1. You create a VM programmatically → You understand the instance creation API
  2. You use startup scripts → You understand bootstrapping patterns
  3. You deploy across zones → You understand availability zones
  4. You implement health checking → You understand managed instance groups

Project 3: Object Storage System (Understand Cloud Storage)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Node.js, Bash
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Object Storage / IAM / Data Lifecycle
  • Software or Tool: Cloud Storage, gsutil
  • Main Book: “Google Cloud Platform in Action” by JJ Geewax

What you’ll build: A file management system that implements resumable uploads, signed URLs for secure sharing, lifecycle policies for cost optimization, object versioning, and cross-region replication.

Why it teaches Cloud Storage: Cloud Storage is GCP’s infinitely scalable object store. Understanding storage classes, IAM policies at the bucket/object level, and lifecycle management is crucial for any cloud architect.

Core challenges you’ll face:

  • Resumable uploads for large files → maps to chunked upload protocol
  • Signed URLs with expiration → maps to temporary access without credentials
  • Storage class selection → maps to cost vs. access time tradeoffs
  • Lifecycle policies → maps to automatic archival and deletion
  • IAM vs. ACLs → maps to uniform vs. fine-grained access control

Key Concepts:

  • Difficulty: Intermediate
  • Time estimate: 1 week
  • Prerequisites: Project 1 (GCP CLI & API Explorer), understanding of HTTP, basic security concepts

Real world outcome:

$ ./gcs-manager create-bucket \
    --name "my-data-lake" \
    --location "US" \
    --storage-class "STANDARD" \
    --lifecycle-rules "./lifecycle.json"

Bucket created: gs://my-data-lake
Location: US (multi-region)
Storage class: STANDARD
Lifecycle rules:
  - Delete objects after 365 days
  - Move to NEARLINE after 30 days
  - Move to COLDLINE after 90 days

$ ./gcs-manager upload ./large-video.mp4 gs://my-data-lake/videos/
Uploading: large-video.mp4 (2.3 GB)
Progress: [████████████████████] 100% (2.3 GB)
Upload complete: gs://my-data-lake/videos/large-video.mp4
  Resumable upload ID: AEnB2UqX... (saved for resume)

$ ./gcs-manager share \
    --object "gs://my-data-lake/videos/large-video.mp4" \
    --expires "1h"

Signed URL (valid for 1 hour):
https://storage.googleapis.com/my-data-lake/videos/large-video.mp4?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=...

$ ./gcs-manager sync ./local-folder gs://my-data-lake/backup/ --delete
Syncing local-folder → gs://my-data-lake/backup/
  ↑ new-file.txt (added)
  ↑ updated-file.txt (modified)
  ✗ old-file.txt (deleted from destination)
Sync complete: 2 uploaded, 1 deleted

Implementation Hints:

Cloud Storage buckets have several important settings:

  • Location type: Region, dual-region, or multi-region
  • Storage class: Affects retrieval cost vs. storage cost
  • Access control: Uniform (IAM only) or fine-grained (ACLs)
  • Versioning: Keep multiple versions of objects

For resumable uploads:

  1. Initiate upload to get a resumable session URI
  2. Upload in chunks (multiples of 256 KB, except the final chunk)
  3. If interrupted, query the session URI for bytes received
  4. Resume from the last successful byte
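
A sketch of the same protocol using the google-cloud-storage client, which handles chunking and resume for you once a chunk size is set; the bucket and object names are placeholders:

from datetime import timedelta
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-data-lake")

# chunk_size must be a multiple of 256 KB; setting it makes the upload resumable.
blob = bucket.blob("videos/large-video.mp4", chunk_size=8 * 1024 * 1024)
blob.upload_from_filename("./large-video.mp4")

# V4 signed URL: temporary read access without handing out credentials.
url = blob.generate_signed_url(version="v4", expiration=timedelta(hours=1), method="GET")
print(url)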

Questions to guide your implementation:

  • When would you use Nearline vs. Coldline storage?
  • How do you secure a bucket so only specific service accounts can access it?
  • What’s the difference between gs:// URIs and HTTPS URLs?
  • How do you handle concurrent access to the same object?

Learning milestones:

  1. You upload/download files → You understand the basic API
  2. You implement signed URLs → You understand temporary credentials
  3. You configure lifecycle policies → You understand cost optimization
  4. You sync directories → You understand object comparison and ETags

Project 4: VPC Network Architect (Understand Networking)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Terraform (HCL), Go, Bash
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Cloud Networking / Firewalls / Load Balancing
  • Software or Tool: VPC, Cloud NAT, Cloud Load Balancing
  • Main Book: “Networking in Google Cloud Platform” by Google Cloud Training

What you’ll build: A network topology manager that creates VPCs with custom subnets, configures firewall rules using network tags, sets up Cloud NAT for private instances, and implements internal and external load balancers.

Why it teaches VPC Networking: GCP’s networking is unique—VPCs are global, subnets are regional, and firewall rules use tags instead of security groups. Understanding this model is essential for secure, scalable architectures.

Core challenges you’ll face:

  • Designing CIDR ranges → maps to IP address planning and avoiding overlaps
  • Tag-based firewall rules → maps to dynamic security based on instance attributes
  • Private Google Access → maps to accessing GCP APIs without public IPs
  • Load balancer types → maps to choosing between HTTP(S), TCP, internal
  • Shared VPC vs. VPC Peering → maps to multi-project networking

Key Concepts:

  • Difficulty: Advanced
  • Time estimate: 2 weeks
  • Prerequisites: Project 2 (VM Orchestrator), TCP/IP networking basics, understanding of CIDR notation

Real world outcome:

$ ./vpc-architect create-network \
    --name "production-vpc" \
    --description "Production network" \
    --mtu 1460

VPC created: production-vpc (global)

$ ./vpc-architect create-subnet \
    --network "production-vpc" \
    --name "web-tier" \
    --region "us-central1" \
    --cidr "10.0.1.0/24" \
    --private-google-access

Subnet created: web-tier
  Region: us-central1
  CIDR: 10.0.1.0/24 (254 usable IPs)
  Private Google Access: enabled

$ ./vpc-architect create-firewall \
    --network "production-vpc" \
    --name "allow-http" \
    --direction INGRESS \
    --target-tags "http-server" \
    --source-ranges "0.0.0.0/0" \
    --allow "tcp:80,tcp:443"

Firewall rule created: allow-http
  Applies to: instances with tag "http-server"
  Allows: TCP ports 80, 443 from any source

$ ./vpc-architect create-nat \
    --network "production-vpc" \
    --region "us-central1" \
    --router "prod-router"

Cloud NAT configured: prod-nat
  All instances in us-central1 can access internet
  No public IPs required

$ ./vpc-architect visualize --network "production-vpc"
┌──────────────────────────────────────────────────────────┐
│ VPC: production-vpc (Global)                             │
├──────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Region: us-central1                                  │ │
│ │ ┌─────────────────┐  ┌─────────────────┐           │ │
│ │ │ web-tier        │  │ db-tier         │           │ │
│ │ │ 10.0.1.0/24     │  │ 10.0.2.0/24     │           │ │
│ │ │ 3 instances     │→→│ 2 instances     │           │ │
│ │ └─────────────────┘  └─────────────────┘           │ │
│ │      ↓                                              │ │
│ │ [Cloud NAT: prod-nat]                              │ │
│ └─────────────────────────────────────────────────────┘ │
│                                                          │
│ Firewall Rules:                                          │
│   ✓ allow-http (tcp:80,443) → tag:http-server           │
│   ✓ allow-internal (all) → 10.0.0.0/16                  │
│   ✗ deny-all (implicit)                                  │
└──────────────────────────────────────────────────────────┘

Implementation Hints:

GCP VPCs have a unique model:

  • VPCs are global: A single VPC spans all regions
  • Subnets are regional: Each subnet exists in one region
  • Firewall rules are global: Apply to entire VPC, filtered by tags
  • Routes are global: Define how traffic flows

Firewall rule priority matters (0-65535, lower = higher priority):

  1. Rules are evaluated from lowest to highest priority
  2. First matching rule wins
  3. Default rules have priority 65534-65535
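
A sketch of creating the allow-http rule shown earlier through the Compute API; the project and network names are placeholders:

import google.auth
from googleapiclient import discovery

credentials, project = google.auth.default()
compute = discovery.build("compute", "v1", credentials=credentials)

firewall_body = {
    "name": "allow-http",
    "network": "global/networks/production-vpc",
    "direction": "INGRESS",
    "priority": 1000,                       # lower number = higher priority
    "targetTags": ["http-server"],          # applies only to instances with this tag
    "sourceRanges": ["0.0.0.0/0"],
    "allowed": [{"IPProtocol": "tcp", "ports": ["80", "443"]}],
}

operation = compute.firewalls().insert(project=project, body=firewall_body).execute()
print("Started operation:", operation["name"])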

Questions to guide your implementation:

  • Why would you use a Shared VPC instead of VPC Peering?
  • How do you allow instances to access Cloud Storage without public IPs?
  • What’s the difference between target tags and source tags?
  • How do you route traffic between on-premises and GCP?

Learning milestones:

  1. You create VPCs and subnets → You understand the network hierarchy
  2. You configure firewall rules → You understand GCP’s security model
  3. You implement Cloud NAT → You understand outbound connectivity
  4. You set up load balancers → You understand traffic distribution

Project 5: IAM Policy Manager (Understand Security)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Terraform (HCL), Bash
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Identity & Access Management / Security
  • Software or Tool: Cloud IAM, Secret Manager
  • Main Book: “Google Cloud Security” by Google Cloud Training

What you’ll build: An IAM auditing and management tool that visualizes effective permissions, identifies over-privileged service accounts, enforces least-privilege policies, and manages secrets securely.

Why it teaches IAM: Security misconfiguration is the #1 cause of cloud breaches. Understanding IAM—roles, bindings, conditions, service accounts, and workload identity—is critical for any cloud architect.

Core challenges you’ll face:

  • Policy inheritance → maps to organization, folder, project hierarchy
  • Predefined vs. custom roles → maps to granular permission control
  • Service account key management → maps to avoiding key sprawl
  • Workload identity → maps to keyless authentication for GKE
  • IAM conditions → maps to context-aware access control

Key Concepts:

  • Difficulty: Advanced
  • Time estimate: 1-2 weeks
  • Prerequisites: Projects 1-3, understanding of authentication/authorization concepts

Real world outcome:

$ ./iam-manager audit --project "production-project"
╔══════════════════════════════════════════════════════════════╗
║ IAM Audit Report: production-project                         ║
╠══════════════════════════════════════════════════════════════╣
║ CRITICAL FINDINGS:                                           ║
║ ⚠️  3 service accounts with roles/owner (over-privileged)    ║
║ ⚠️  2 service accounts with exported keys (security risk)    ║
║ ⚠️  1 user with roles/editor on production (too broad)       ║
╠══════════════════════════════════════════════════════════════╣
║ Service Account: deploy-sa@project.iam.gserviceaccount.com   ║
║   Current role: roles/owner                                  ║
║   Recommended: roles/cloudfunctions.developer,               ║
║                roles/storage.objectAdmin                     ║
║   Reason: Only deploys functions and uploads to storage      ║
╚══════════════════════════════════════════════════════════════╝

$ ./iam-manager effective-permissions \
    --member "user:alice@company.com" \
    --resource "//storage.googleapis.com/projects/prod/buckets/data"

Effective Permissions for alice@company.com on bucket 'data':
├── storage.objects.get (from roles/storage.objectViewer @ project)
├── storage.objects.list (from roles/storage.objectViewer @ project)
└── (inherited from project: production-project)

Missing permissions for common operations:
  ✗ Cannot upload objects (needs storage.objects.create)
  ✗ Cannot delete objects (needs storage.objects.delete)

$ ./iam-manager create-role \
    --project "production-project" \
    --name "CustomDeployer" \
    --permissions "cloudfunctions.functions.create,cloudfunctions.functions.update,storage.objects.create"

Custom role created: projects/production-project/roles/CustomDeployer
Permissions: 3

$ ./iam-manager secrets list
NAME                  CREATED              VERSIONS  ACCESSED
db-password           2024-01-15           3         2 hours ago
api-key-stripe        2024-02-20           1         15 mins ago
jwt-signing-key       2024-03-01           2         1 hour ago

$ ./iam-manager secrets rotate "db-password" --generate
New version created: db-password/versions/4
Old version (3) marked for destruction in 7 days

Implementation Hints:

IAM bindings work at resource levels:

Resource: projects/my-project
  Binding:
    Role: roles/storage.objectViewer
    Members:
      - user:alice@company.com
      - serviceAccount:my-sa@my-project.iam.gserviceaccount.com
    Condition:
      expression: request.time < timestamp("2024-12-31T00:00:00Z")

To find effective permissions:

  1. Get IAM policy at organization level
  2. Get IAM policy at folder level
  3. Get IAM policy at project level
  4. Get IAM policy at resource level
  5. Union all permissions, respecting deny policies
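
A sketch of step 3, reading the project-level policy with the Cloud Resource Manager API (v3); the project ID is a placeholder:

import google.auth
from googleapiclient import discovery

credentials, _ = google.auth.default()
crm = discovery.build("cloudresourcemanager", "v3", credentials=credentials)

policy = crm.projects().getIamPolicy(
    resource="projects/production-project",
    body={"options": {"requestedPolicyVersion": 3}},  # version 3 includes conditions
).execute()

for binding in policy.get("bindings", []):
    print(binding["role"])
    for member in binding.get("members", []):
        print("  ", member)
    if "condition" in binding:
        print("   condition:", binding["condition"]["expression"])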

Questions to guide your implementation:

  • When should you create a custom role vs. use predefined roles?
  • How do you grant a Cloud Function access to a secret without key files?
  • What’s the difference between setIamPolicy and addIamPolicyBinding?
  • How do IAM conditions work (e.g., time-based, IP-based)?

Learning milestones:

  1. You list and analyze IAM policies → You understand the permission model
  2. You implement least privilege → You understand security best practices
  3. You manage service accounts securely → You understand machine identity
  4. You use Secret Manager → You understand secrets management

Project 6: Kubernetes Cluster Manager (Understand GKE)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, Bash, TypeScript
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Kubernetes / Container Orchestration
  • Software or Tool: GKE, kubectl, Autopilot
  • Main Book: “Kubernetes Up & Running” by Brendan Burns, Joe Beda, Kelsey Hightower

What you’ll build: A GKE cluster management tool that provisions Autopilot and Standard clusters, deploys workloads with proper resource requests, configures workload identity, implements network policies, and manages node pools.

Why it teaches GKE: Kubernetes was invented at Google (Borg). GKE is the most mature managed Kubernetes offering. Understanding node pools, Autopilot mode, workload identity, and GKE-specific features prepares you for production container orchestration.

Core challenges you’ll face:

  • Autopilot vs. Standard mode → maps to managed vs. self-managed nodes
  • Node pool configuration → maps to machine types, preemptibility, autoscaling
  • Workload Identity → maps to pod-level IAM without keys
  • Network policies → maps to pod-to-pod traffic control
  • GKE Ingress → maps to Cloud Load Balancing integration

Key Concepts:

  • Difficulty: Expert
  • Time estimate: 2-3 weeks
  • Prerequisites: Projects 4-5, basic Kubernetes knowledge (pods, services, deployments), Docker

Real world outcome:

$ ./gke-manager create-cluster \
    --name "production-cluster" \
    --mode "autopilot" \
    --region "us-central1" \
    --workload-identity

Creating Autopilot cluster...
  ✓ Control plane provisioned
  ✓ Workload Identity enabled
  ✓ Network policy enabled
  ✓ Binary authorization configured

Cluster ready: gke_my-project_us-central1_production-cluster
Run: gcloud container clusters get-credentials production-cluster --region us-central1

$ ./gke-manager deploy \
    --cluster "production-cluster" \
    --manifest "./k8s/app.yaml" \
    --service-account "app-sa@my-project.iam.gserviceaccount.com"

Deploying to production-cluster...
  ✓ Namespace: production
  ✓ Deployment: web-app (3 replicas)
  ✓ Service: web-app (ClusterIP)
  ✓ Ingress: web-app-ingress (pending external IP)
  ✓ Workload Identity binding: app-sa → k8s/production/app

Waiting for external IP...
  ✓ Ingress IP: 34.120.xxx.xxx

$ ./gke-manager workloads --cluster "production-cluster"
NAMESPACE    NAME         READY   CPU      MEMORY   STATUS
production   web-app      3/3     100m     256Mi    Running
production   worker       2/2     500m     1Gi      Running
kube-system  dns          2/2     50m      128Mi    Running

$ ./gke-manager network-policy apply \
    --cluster "production-cluster" \
    --namespace "production" \
    --policy "deny-all-ingress,allow-web-to-worker"

Network policies applied:
  ✓ deny-all-ingress: Default deny all ingress
  ✓ allow-web-to-worker: web-app → worker on port 8080

Implementation Hints:

GKE cluster modes:

  • Autopilot: Google manages nodes, you pay per pod resource
  • Standard: You manage node pools, more control but more ops

Workload Identity flow:

  1. Create GCP service account
  2. Create Kubernetes service account
  3. Bind them: gcloud iam service-accounts add-iam-policy-binding
  4. Annotate K8s SA: iam.gke.io/gcp-service-account
  5. Pods using that K8s SA can access GCP APIs
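
A sketch of step 3, granting roles/iam.workloadIdentityUser on the GCP service account to the Kubernetes service account; project, namespace, and account names are placeholders:

import google.auth
from googleapiclient import discovery

credentials, _ = google.auth.default()
iam = discovery.build("iam", "v1", credentials=credentials)

gsa = "app-sa@my-project.iam.gserviceaccount.com"
resource = f"projects/my-project/serviceAccounts/{gsa}"
member = "serviceAccount:my-project.svc.id.goog[production/app]"  # [namespace/ksa-name]

# Read-modify-write the service account's own IAM policy.
policy = iam.projects().serviceAccounts().getIamPolicy(resource=resource).execute()
policy.setdefault("bindings", []).append({
    "role": "roles/iam.workloadIdentityUser",
    "members": [member],
})
iam.projects().serviceAccounts().setIamPolicy(
    resource=resource, body={"policy": policy}
).execute()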

Questions to guide your implementation:

  • When would you choose Standard over Autopilot?
  • How do you size resource requests/limits for autoscaling?
  • How does GKE Ingress differ from nginx ingress?
  • What’s the difference between cluster-level and namespace-level network policies?

Learning milestones:

  1. You create a GKE cluster → You understand cluster architecture
  2. You deploy with Workload Identity → You understand keyless auth
  3. You implement network policies → You understand pod security
  4. You configure autoscaling → You understand GKE optimization

Project 7: Serverless Container Platform (Understand Cloud Run)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, Node.js, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Serverless Containers / Event-Driven
  • Software or Tool: Cloud Run, Artifact Registry
  • Main Book: “Building Serverless Applications with Google Cloud Run” by Wietse Venema

What you’ll build: A deployment pipeline that builds containers, pushes to Artifact Registry, deploys to Cloud Run with proper configuration (concurrency, min instances, secrets), and sets up traffic splitting for canary deployments.

Why it teaches Cloud Run: Cloud Run is Google’s answer to “containers without Kubernetes complexity.” Understanding request-based autoscaling, cold starts, concurrency limits, and traffic management is essential for modern serverless architecture.

Core challenges you’ll face:

  • Container optimization for cold starts → maps to startup time vs. image size
  • Concurrency settings → maps to requests per instance
  • Minimum instances → maps to cold start mitigation
  • Traffic splitting → maps to canary deployments, blue-green
  • Cloud Run Jobs vs. Services → maps to request-based vs. batch

Key Concepts:

  • Difficulty: Intermediate
  • Time estimate: 1 week
  • Prerequisites: Docker basics, Project 1 (GCP CLI), understanding of HTTP services

Real world outcome:

$ ./cloudrun-deploy build \
    --source "./api" \
    --image "us-central1-docker.pkg.dev/my-project/api/backend:v1.2.0"

Building container...
  ✓ Dockerfile found
  ✓ Built: backend:v1.2.0 (142MB)
  ✓ Pushed to Artifact Registry

$ ./cloudrun-deploy service \
    --name "api-backend" \
    --image "us-central1-docker.pkg.dev/my-project/api/backend:v1.2.0" \
    --region "us-central1" \
    --concurrency 80 \
    --min-instances 1 \
    --max-instances 100 \
    --secrets "DB_PASSWORD=db-secret:latest,API_KEY=api-key:latest"

Deploying to Cloud Run...
  ✓ Service: api-backend
  ✓ Region: us-central1
  ✓ URL: https://api-backend-abc123-uc.a.run.app
  ✓ Concurrency: 80 requests/instance
  ✓ Min instances: 1 (always warm)
  ✓ Secrets mounted

$ ./cloudrun-deploy canary \
    --service "api-backend" \
    --new-revision "api-backend-00002" \
    --traffic-split "90:api-backend-00001,10:api-backend-00002"

Traffic split configured:
  api-backend-00001 (stable): 90%
  api-backend-00002 (canary): 10%

$ ./cloudrun-deploy metrics --service "api-backend"
╭─────────────────────────────────────────────────────────────╮
│ Service: api-backend                                        │
├─────────────────────────────────────────────────────────────┤
│ Requests (last hour):     12,456                           │
│ Latency (p50/p95/p99):    45ms / 120ms / 350ms             │
│ Error rate:               0.12%                             │
│ Instance count:           3-8 (autoscaling)                 │
│ Cold starts:              23 (0.18%)                        │
│ Billed container time:    4.2 hours                        │
╰─────────────────────────────────────────────────────────────╯

Implementation Hints:

Cloud Run container requirements:

  • Must listen on $PORT (default 8080)
  • Must respond to HTTP requests
  • Stateless (no persistent local storage)
  • Maximum request timeout: 60 minutes
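
A minimal entrypoint that satisfies these requirements using only the Python standard library; Cloud Run injects PORT at runtime:

import os
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok\n")

if __name__ == "__main__":
    port = int(os.environ.get("PORT", "8080"))  # provided by Cloud Run
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()

In practice you would run a production server (for example gunicorn in front of Flask or FastAPI), but the contract is the same: bind to 0.0.0.0:$PORT and answer HTTP.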

Concurrency model:

  • Default: 80 concurrent requests per instance
  • If CPU-bound, lower concurrency
  • If I/O-bound, raise concurrency
  • Cloud Run scales instances based on concurrent requests / concurrency setting

Questions to guide your implementation:

  • How do you reduce cold start time?
  • When should you use Cloud Run Jobs vs. Services?
  • How do you handle long-running tasks (> 60 mins)?
  • How do you connect Cloud Run to a VPC (for Cloud SQL)?

Learning milestones:

  1. You deploy a container → You understand the Cloud Run model
  2. You tune concurrency → You understand scaling behavior
  3. You implement traffic splitting → You understand deployment strategies
  4. You reduce cold starts → You understand performance optimization

Project 8: Event-Driven Data Pipeline (Understand Pub/Sub & Cloud Functions)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Node.js, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Event-Driven Architecture / Serverless
  • Software or Tool: Pub/Sub, Cloud Functions, Eventarc
  • Main Book: “Google Cloud Platform in Action” by JJ Geewax

What you’ll build: An event-driven system that uses Pub/Sub for message queuing, Cloud Functions for processing, dead-letter queues for failed messages, and Eventarc for routing Cloud Events from GCP services.

Why it teaches event-driven patterns: Modern cloud applications are event-driven. Understanding publish-subscribe patterns, exactly-once delivery, acknowledgment windows, and event routing is essential for decoupled, scalable architectures.

Core challenges you’ll face:

  • Message acknowledgment timing → maps to at-least-once delivery semantics
  • Dead letter queues → maps to handling poison messages
  • Push vs. pull subscriptions → maps to invocation patterns
  • Eventarc triggers → maps to reacting to GCP events
  • Idempotency → maps to handling duplicate messages

Key Concepts:

  • Difficulty: Intermediate
  • Time estimate: 1-2 weeks
  • Prerequisites: Project 1 (GCP CLI), understanding of async programming, JSON

Real world outcome:

$ ./event-pipeline create-topic \
    --name "orders" \
    --schema "./schemas/order.avsc" \
    --dead-letter-topic "orders-dlq"

Topic created: projects/my-project/topics/orders
Schema: order.avsc (Avro)
Dead letter topic: orders-dlq

$ ./event-pipeline create-subscription \
    --topic "orders" \
    --name "order-processor" \
    --push-endpoint "https://order-processor-abc123.run.app" \
    --ack-deadline 60 \
    --max-retries 5

Subscription created: order-processor
Type: Push
Endpoint: https://order-processor-abc123.run.app
Retry policy: Exponential backoff, max 5 retries
Dead letter: After 5 failures → orders-dlq

$ ./event-pipeline deploy-function \
    --name "process-order" \
    --trigger-topic "orders" \
    --runtime "python311" \
    --source "./functions/order-processor"

Deploying Cloud Function...
  ✓ Function: process-order
  ✓ Trigger: Pub/Sub topic 'orders'
  ✓ Memory: 256MB
  ✓ Timeout: 60s

$ ./event-pipeline publish \
    --topic "orders" \
    --message '{"order_id": "12345", "items": [{"sku": "ABC", "qty": 2}]}'

Message published: 1234567890
Attributes: {}

$ ./event-pipeline monitor --topic "orders"
╭─────────────────────────────────────────────────────────────╮
│ Topic: orders                          Last 1 hour         │
├─────────────────────────────────────────────────────────────┤
│ Published:        1,234 messages (2.3 MB)                  │
│ Delivered:        1,230 messages (99.7%)                   │
│ Acknowledged:     1,228 messages                           │
│ Dead-lettered:    4 messages                               │
│ Oldest unacked:   None (all caught up)                     │
│                                                             │
│ Subscription: order-processor                               │
│   Backlog: 0 messages                                       │
│   Processing rate: 45 msg/sec                              │
╰─────────────────────────────────────────────────────────────╯

Implementation Hints:

Pub/Sub guarantees:

  • At-least-once delivery: Messages may be delivered multiple times
  • Ordering: Optional, within ordering key
  • Exactly-once delivery: Supported for pull subscriptions (regional); end-to-end exactly-once processing still needs idempotent consumers or Dataflow

Message flow:

  1. Publisher sends message to topic
  2. Pub/Sub stores message durably
  3. Subscription pushes to an endpoint OR the subscriber pulls
  4. Processor acknowledges within deadline
  5. If no ack, message redelivered
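
A sketch of this flow with the google-cloud-pubsub client, using a pull subscription for illustration; project, topic, and subscription IDs are placeholders:

from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

project_id = "my-project"

# Steps 1-2: publish; the client batches messages and returns a future per message.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "orders")
future = publisher.publish(topic_path, b'{"order_id": "12345"}', origin="cli")
print("Published message ID:", future.result())

# Steps 3-5: pull and acknowledge before the ack deadline expires.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "order-processor")

def callback(message):
    print("Received:", message.data, dict(message.attributes))
    message.ack()  # unacked messages are redelivered after the deadline

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)
except TimeoutError:
    streaming_pull.cancel()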

Questions to guide your implementation:

  • How do you ensure idempotent processing?
  • When should you use push vs. pull subscriptions?
  • How do you handle backpressure?
  • How do you test event-driven systems locally?

Learning milestones:

  1. You publish and subscribe → You understand the Pub/Sub model
  2. You handle dead letters → You understand error handling
  3. You use Eventarc → You understand GCP event integration
  4. You implement idempotency → You understand reliability patterns

Project 9: Data Warehouse Analytics Platform (Understand BigQuery)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: SQL, Go, dbt
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Data Warehousing / Analytics / SQL
  • Software or Tool: BigQuery, BigQuery ML, Dataform
  • Main Book: “Google BigQuery: The Definitive Guide” by Valliappa Lakshmanan & Jordan Tigani

What you’ll build: An analytics platform that loads data from Cloud Storage, creates partitioned and clustered tables, runs complex analytical queries, trains ML models with BigQuery ML, and exports results.

Why it teaches BigQuery: BigQuery is Google’s serverless data warehouse that can query petabytes in seconds. Understanding its architecture (separation of storage and compute), pricing model, and optimization techniques is essential for data-intensive applications.

Core challenges you’ll face:

  • Partitioning and clustering → maps to query performance and cost
  • Slot management → maps to compute resource allocation
  • External tables → maps to querying data in place
  • BigQuery ML → maps to ML without exporting data
  • Materialized views → maps to query optimization

Key Concepts:

  • Difficulty: Advanced
  • Time estimate: 2 weeks
  • Prerequisites: SQL knowledge, Project 3 (Cloud Storage), understanding of data modeling

Real world outcome:

$ ./bq-analytics create-dataset \
    --name "analytics" \
    --location "US" \
    --default-partition-expiration "365d"

Dataset created: my-project.analytics
Location: US
Default partition expiration: 365 days

$ ./bq-analytics load \
    --source "gs://my-bucket/events/*.parquet" \
    --table "analytics.events" \
    --partition-by "DATE(timestamp)" \
    --cluster-by "user_id,event_type" \
    --schema-detect

Loading data...
  ✓ Files: 1,234 Parquet files (45 GB)
  ✓ Rows loaded: 523,456,789
  ✓ Partitioned by: DATE(timestamp)
  ✓ Clustered by: user_id, event_type
  ✓ Table size: 12.3 GB (compressed)

$ ./bq-analytics query \
    --sql "SELECT event_type, COUNT(*) as count
           FROM analytics.events
           WHERE DATE(timestamp) = '2024-01-15'
           GROUP BY event_type
           ORDER BY count DESC
           LIMIT 10"

Query Statistics:
  Bytes processed: 1.2 GB (of 12.3 GB total)
  Slot milliseconds: 3,450
  Cached: false

Results:
╭─────────────────┬───────────╮
│ event_type      │ count     │
├─────────────────┼───────────┤
│ page_view       │ 2,345,678 │
│ button_click    │ 1,234,567 │
│ form_submit     │   456,789 │
│ ...             │ ...       │
╰─────────────────┴───────────╯

$ ./bq-analytics train-model \
    --name "analytics.churn_predictor" \
    --type "logistic_reg" \
    --training-data "SELECT * FROM analytics.user_features WHERE split = 'train'"

Training BigQuery ML model...
  ✓ Model: analytics.churn_predictor
  ✓ Type: Logistic Regression
  ✓ Training rows: 1,234,567
  ✓ Accuracy: 0.87
  ✓ AUC: 0.92

$ ./bq-analytics cost-analysis --dataset "analytics"
╭─────────────────────────────────────────────────────────────╮
│ Cost Analysis: analytics (Last 30 days)                     │
├─────────────────────────────────────────────────────────────┤
│ Storage:        $12.50 (12.3 GB active, 45 GB long-term)   │
│ Queries:        $245.00 (49 TB processed)                  │
│ Streaming:      $0.00 (no streaming inserts)               │
│                                                             │
│ Top expensive queries:                                      │
│   1. SELECT * FROM events (full scan) - $45                │
│   2. JOIN events ON users (expensive) - $23                │
│                                                             │
│ Optimization suggestions:                                   │
│   ⚠️  Add partition filter to query #1                      │
│   ⚠️  Cluster table by join key for query #2               │
╰─────────────────────────────────────────────────────────────╯

Implementation Hints:

BigQuery architecture:

  • Dremel: Distributed query execution engine
  • Colossus: Distributed file system (stores tables in the Capacitor columnar format)
  • Slots: Units of compute (parallelism)

Partitioning vs. clustering:

  • Partitioning: Divides table into segments (by date, integer range)
  • Clustering: Sorts data within partitions (up to 4 columns)
  • Use both for maximum performance
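
A sketch of creating such a table while loading Parquet from Cloud Storage with google-cloud-bigquery; project, dataset, and bucket names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    autodetect=True,
    time_partitioning=bigquery.TimePartitioning(field="timestamp"),  # daily partitions
    clustering_fields=["user_id", "event_type"],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.parquet",
    "my-project.analytics.events",
    job_config=job_config,
)
load_job.result()  # blocks until the load job finishes

table = client.get_table("my-project.analytics.events")
print(f"Loaded {table.num_rows} rows into {table.full_table_id}")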

Questions to guide your implementation:

  • When is on-demand pricing vs. flat-rate pricing better?
  • How do you prevent accidental full table scans?
  • When should you use materialized views vs. regular views?
  • How do you export BigQuery results to Cloud Storage?

Learning milestones:

  1. You load and query data → You understand basic BigQuery
  2. You optimize with partitioning → You understand cost control
  3. You use BigQuery ML → You understand in-database ML
  4. You analyze costs → You understand the pricing model

Project 10: Multi-Database Application (Understand Cloud SQL, Firestore, Spanner)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Java, Node.js
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Databases / Data Modeling / Transactions
  • Software or Tool: Cloud SQL, Firestore, Cloud Spanner
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: An application that uses the right database for the right job—Cloud SQL for relational data, Firestore for real-time documents, and Spanner for globally distributed transactions—with proper connection pooling and transaction handling.

Why it teaches GCP databases: GCP offers many database options, each with different tradeoffs. Understanding when to use Cloud SQL vs. Firestore vs. Spanner is critical for architecting scalable applications.

Core challenges you’ll face:

  • Cloud SQL connection management → maps to connection pooling, Cloud SQL Proxy
  • Firestore data modeling → maps to denormalization for read performance
  • Spanner schema design → maps to interleaved tables, avoiding hotspots
  • Cross-database consistency → maps to saga patterns, eventual consistency
  • Performance tuning → maps to indexes, query patterns

Key Concepts:

  • Difficulty: Advanced
  • Time estimate: 2-3 weeks
  • Prerequisites: SQL knowledge, Projects 1-5, understanding of database concepts

Real world outcome:

$ ./db-manager provision-cloudsql \
    --name "orders-db" \
    --engine "postgres" \
    --tier "db-custom-2-4096" \
    --high-availability \
    --private-ip

Cloud SQL instance created: orders-db
  Engine: PostgreSQL 15
  vCPUs: 2, Memory: 4 GB
  High Availability: enabled (standby in us-central1-b)
  Connection: Private IP (10.20.30.40)

$ ./db-manager create-firestore \
    --mode "native" \
    --location "nam5"

Firestore database created
  Mode: Native
  Location: nam5 (United States multi-region)

$ ./db-manager provision-spanner \
    --name "global-inventory" \
    --config "nam-eur-asia1" \
    --nodes 3

Spanner instance created: global-inventory
  Configuration: nam-eur-asia1 (multi-continent)
  Nodes: 3 (10,000 QPS capacity)

$ ./db-manager benchmark
╭─────────────────────────────────────────────────────────────╮
│ Database Benchmark Results                                  │
├─────────────────────────────────────────────────────────────┤
│ Cloud SQL (PostgreSQL):                                     │
│   Write latency (p99): 12ms                                │
│   Read latency (p99): 3ms                                  │
│   Max connections: 100                                      │
│   Best for: Complex queries, transactions                  │
├─────────────────────────────────────────────────────────────┤
│ Firestore:                                                  │
│   Write latency (p99): 45ms                                │
│   Read latency (p99): 8ms                                  │
│   Max writes/sec: 10,000 (per database)                    │
│   Best for: Real-time sync, mobile apps                    │
├─────────────────────────────────────────────────────────────┤
│ Spanner:                                                    │
│   Write latency (p99): 8ms                                 │
│   Read latency (p99): 3ms                                  │
│   Max TPS: 10,000/node                                      │
│   Best for: Global distribution, strong consistency        │
╰─────────────────────────────────────────────────────────────╯

$ ./db-manager migrate \
    --source "cloudsql:orders-db" \
    --target "spanner:global-inventory" \
    --table "orders" \
    --strategy "dual-write"

Migration started: orders → global-inventory
  Phase 1: Backfill historical data
  Phase 2: Enable dual-write
  Phase 3: Switch reads to Spanner
  Phase 4: Disable Cloud SQL writes

Current: Phase 2 (dual-write active)
Progress: 67% complete

Implementation Hints:

Database selection guide:

  • Cloud SQL: Traditional RDBMS, managed MySQL/PostgreSQL/SQL Server
  • Firestore: Document database, real-time sync, mobile SDKs
  • Spanner: Globally distributed, relational, strong consistency
  • Bigtable: Wide-column, HBase API, high throughput

Cloud SQL connection patterns:

  1. Cloud SQL Proxy: Secure tunnel, automatic SSL
  2. Private IP: VPC peering, no public exposure
  3. Connection pooling: Use for serverless (Cloud Run, Functions)
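
A sketch of pattern 1 plus pooling, using the Cloud SQL Python Connector with SQLAlchemy; the instance connection name and credentials are placeholders (assumes the cloud-sql-python-connector[pg8000] and sqlalchemy packages):

import sqlalchemy
from google.cloud.sql.connector import Connector

connector = Connector()

def getconn():
    # Opens an encrypted tunnel to the instance; no IP allow-listing required.
    return connector.connect(
        "my-project:us-central1:orders-db",  # instance connection name
        "pg8000",
        user="app-user",
        password="change-me",
        db="orders",
    )

# A small pool matters on serverless platforms where many instances come and go.
pool = sqlalchemy.create_engine("postgresql+pg8000://", creator=getconn, pool_size=5)

with pool.connect() as conn:
    print(conn.execute(sqlalchemy.text("SELECT version()")).scalar())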

Questions to guide your implementation:

  • When would you choose Spanner over Cloud SQL?
  • How do you handle the 1 write/second/document limit in Firestore?
  • How do you connect Cloud Run to Cloud SQL (hint: Cloud SQL Connector)?
  • What’s the cost difference between these databases?

Learning milestones:

  1. You provision each database → You understand GCP database options
  2. You model data appropriately → You understand each database’s strengths
  3. You handle transactions → You understand consistency models
  4. You optimize queries → You understand performance tuning

Project 11: CI/CD Pipeline (Understand Cloud Build & Artifact Registry)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: YAML (Cloud Build config)
  • Alternative Programming Languages: Bash, Python, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: CI/CD / DevOps / Container Registry
  • Software or Tool: Cloud Build, Artifact Registry, Cloud Deploy
  • Main Book: “Continuous Delivery” by Jez Humble & David Farley

What you’ll build: A complete CI/CD pipeline that triggers on git push, runs tests, builds containers, scans for vulnerabilities, pushes to Artifact Registry, and deploys to Cloud Run or GKE with approval gates.

Why it teaches CI/CD on GCP: Cloud Build is GCP’s native CI/CD service. Understanding build triggers, substitutions, secrets, and integration with other GCP services is essential for DevOps on GCP.

Core challenges you’ll face:

  • Build configuration → maps to cloudbuild.yaml syntax and steps
  • Secret injection → maps to using Secret Manager in builds
  • Artifact management → maps to container and language package registries
  • Deployment strategies → maps to Cloud Deploy pipelines
  • Vulnerability scanning → maps to on-demand and automatic scanning

Key Concepts:

  • Difficulty: Intermediate
  • Time estimate: 1-2 weeks
  • Prerequisites: Git, Docker basics, Project 7 (Cloud Run) or Project 6 (GKE)

Real world outcome:

$ ./cicd-manager create-trigger \
    --name "main-build" \
    --repo "github.com/myorg/myapp" \
    --branch "^main$" \
    --config "./cloudbuild.yaml"

Trigger created: main-build
  Repository: github.com/myorg/myapp
  Branch pattern: ^main$
  Config: ./cloudbuild.yaml

$ ./cicd-manager create-pipeline \
    --name "production-deploy" \
    --targets "dev,staging,prod" \
    --approval-required "prod"

Cloud Deploy pipeline created: production-deploy
  Targets:
    1. dev (auto-promote)
    2. staging (auto-promote)
    3. prod (requires approval)

$ git push origin main
# Triggers build automatically

$ ./cicd-manager builds list
BUILD_ID      STATUS     DURATION  TRIGGER       IMAGES
abc123        SUCCESS    2m 34s    main-build    us-central1-docker.pkg.dev/...
def456        FAILURE    1m 12s    main-build    (none - test failed)
ghi789        RUNNING    0m 45s    main-build    ...

$ ./cicd-manager builds logs abc123
Step 1/6: Checkout code
  ✓ Cloned: github.com/myorg/myapp@abc123

Step 2/6: Run tests
  ✓ 234 tests passed, 0 failed

Step 3/6: Build container
  ✓ Built: myapp:abc123 (142MB)

Step 4/6: Scan for vulnerabilities
  ✓ Critical: 0, High: 0, Medium: 2

Step 5/6: Push to Artifact Registry
  ✓ Pushed: us-central1-docker.pkg.dev/myproject/repo/myapp:abc123

Step 6/6: Deploy to dev
  ✓ Released: myapp-dev → abc123

$ ./cicd-manager promote \
    --release "myapp-abc123" \
    --to "prod" \
    --approve

Promoting myapp-abc123 to prod...
  ✓ Approval recorded by: developer@company.com
  ✓ Deployment started
  ✓ Traffic shifted: 100%

Implementation Hints:

cloudbuild.yaml structure:

steps:
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'IMAGE_NAME', '.']
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'IMAGE_NAME']
images:
  - 'IMAGE_NAME'
substitutions:
  _REGION: us-central1

Builder images:

  • gcr.io/cloud-builders/docker: Docker commands
  • gcr.io/cloud-builders/gcloud: gcloud commands
  • gcr.io/cloud-builders/kubectl: Kubernetes commands
  • Community builders for other tools
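
A sketch of submitting the same two Docker steps programmatically with the google-cloud-build client (which is what a trigger does for you); the project and image names are placeholders, and the build source is omitted:

from google.cloud.devtools import cloudbuild_v1

client = cloudbuild_v1.CloudBuildClient()

image = "us-central1-docker.pkg.dev/my-project/repo/myapp:manual"
build = cloudbuild_v1.Build(
    steps=[
        {"name": "gcr.io/cloud-builders/docker", "args": ["build", "-t", image, "."]},
        {"name": "gcr.io/cloud-builders/docker", "args": ["push", image]},
    ],
    images=[image],
    # A real build also needs a source (a GCS archive or a connected repository).
)

operation = client.create_build(project_id="my-project", build=build)
result = operation.result()  # blocks until the build finishes
print("Build status:", result.status.name)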

Questions to guide your implementation:

  • How do you cache dependencies between builds?
  • How do you access secrets during build?
  • How do you run tests in parallel?
  • How do you implement rollbacks?

Learning milestones:

  1. You create a basic build → You understand Cloud Build
  2. You add triggers → You understand automation
  3. You implement scanning → You understand security
  4. You create deployment pipelines → You understand progressive delivery

Project 12: Observability Platform (Understand Monitoring, Logging, Tracing)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Node.js, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Observability / SRE / Monitoring
  • Software or Tool: Cloud Monitoring, Cloud Logging, Cloud Trace
  • Main Book: “Site Reliability Engineering” by Google

What you’ll build: A comprehensive observability system with custom metrics, structured logging, distributed tracing, alerting policies, and dashboards that give you full visibility into a distributed application.

Why it teaches observability: You can’t manage what you can’t measure. Understanding Google’s approach to observability—SLIs, SLOs, error budgets—is essential for running reliable systems.

Core challenges you’ll face:

  • Custom metrics → maps to defining and exporting application metrics
  • Structured logging → maps to JSON logs, log-based metrics
  • Distributed tracing → maps to OpenTelemetry, trace context propagation
  • Alerting policies → maps to notification channels, alert conditions
  • SLO monitoring → maps to availability, latency targets

Key Concepts:

  • Difficulty: Advanced
  • Time estimate: 2 weeks
  • Prerequisites: Projects 6-8 (deployed services), understanding of metrics concepts

Real world outcome:

$ ./observability setup \
    --project "production-project" \
    --services "api,worker,web"

Setting up observability...
  ✓ Created metrics scope
  ✓ Enabled required APIs
  ✓ Installed agents on services

$ ./observability create-slo \
    --service "api" \
    --type "availability" \
    --target "99.9" \
    --window "30d"

SLO created: api-availability
  Target: 99.9% availability
  Window: 30 days
  Error budget: 43.2 minutes/month

$ ./observability create-alert \
    --name "high-error-rate" \
    --metric "logging.googleapis.com/log_entry_count" \
    --filter "severity=ERROR" \
    --threshold ">10/min" \
    --channels "email:oncall@company.com,slack:#alerts"

Alert policy created: high-error-rate
  Condition: Error log rate > 10/min
  Notification channels: email, slack

$ ./observability dashboard create \
    --name "API Service Dashboard" \
    --widgets "latency-heatmap,error-rate,request-count,slo-burndown"

Dashboard created: API Service Dashboard
URL: https://console.cloud.google.com/monitoring/dashboards/...

$ ./observability trace analyze --service "api" --last "1h"
╭─────────────────────────────────────────────────────────────╮
│ Trace Analysis: api (last 1 hour)                           │
├─────────────────────────────────────────────────────────────┤
│ Total traces: 12,456                                        │
│ Avg latency: 145ms                                          │
│ P99 latency: 890ms                                          │
│                                                             │
│ Slowest operations:                                         │
│   1. Cloud SQL query (avg 85ms)                            │
│   2. External API call (avg 120ms)                         │
│   3. Cache miss + rebuild (avg 200ms)                      │
│                                                             │
│ Sample slow trace: abc123-def456                            │
│   └─ api-handler (890ms)                                    │
│       ├─ auth-check (5ms)                                   │
│       ├─ db-query (650ms) ⚠️                               │
│       └─ response-build (10ms)                              │
╰─────────────────────────────────────────────────────────────╯

Implementation Hints:

Three pillars of observability:

  1. Metrics: Numeric measurements over time (CPU, latency, error rate)
  2. Logs: Discrete events with context (requests, errors, audit)
  3. Traces: Causal chain across services (request flow)

OpenTelemetry integration:

  • Cloud Trace and Cloud Monitoring support OpenTelemetry
  • Export metrics/traces with OTLP
  • Use Cloud Logging for logs (JSON structured)
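
A sketch of writing one point of a custom metric with google-cloud-monitoring, which is the raw form of what an OpenTelemetry exporter handles for you; the project ID and metric name are placeholders:

import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/production-project"

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/api/queue_depth"
series.resource.type = "global"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
)
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 42}})
series.points = [point]

# Each request may carry up to 200 time series.
client.create_time_series(name=project_name, time_series=[series])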

Questions to guide your implementation:

  • What’s the difference between metrics and log-based metrics?
  • How do you propagate trace context across services?
  • How do you define an SLO that reflects user experience?
  • How do you calculate error budgets?

Learning milestones:

  1. You create custom metrics → You understand metrics collection
  2. You implement structured logging → You understand log analysis
  3. You add distributed tracing → You understand request flow
  4. You define SLOs and alerts → You understand SRE practices

Project 13: Infrastructure as Code (Understand Terraform + GCP)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform (HCL)
  • Alternative Programming Languages: Pulumi (Python/Go), Deployment Manager (YAML)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Infrastructure as Code / DevOps
  • Software or Tool: Terraform, Google Provider
  • Main Book: “Terraform: Up & Running” by Yevgeniy Brikman

What you’ll build: A complete GCP environment defined in Terraform—VPCs, GKE clusters, Cloud SQL, IAM policies, monitoring—with modules, remote state, and a CI/CD pipeline for infrastructure changes.

Why it teaches IaC: Infrastructure as Code is essential for reproducible, auditable cloud environments. Terraform is the industry standard, and the Google provider is mature and well-documented.

Core challenges you’ll face:

  • Resource dependencies → maps to Terraform’s dependency graph
  • State management → maps to GCS backend, state locking
  • Modules and reusability → maps to DRY infrastructure code
  • Importing existing resources → maps to adopting IaC incrementally
  • Drift detection → maps to terraform plan, preventing manual changes

Key Concepts:

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1-12 (understanding of GCP services), basic Terraform knowledge.

Real world outcome:

$ tree infrastructure/
infrastructure/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── modules/
│   ├── gke-cluster/
│   ├── vpc-network/
│   ├── cloud-sql/
│   └── iam-bindings/
└── README.md

$ cd infrastructure/environments/dev && terraform init
Initializing modules...
- gke-cluster in ../../modules/gke-cluster
- vpc-network in ../../modules/vpc-network
Initializing backend (gcs)...
Terraform initialized!

$ terraform plan
Terraform will perform the following actions:

  # module.vpc.google_compute_network.main will be created
  + resource "google_compute_network" "main" {
      + name                    = "dev-vpc"
      + auto_create_subnetworks = false
      ...
    }

  # module.gke.google_container_cluster.primary will be created
  + resource "google_container_cluster" "primary" {
      + name     = "dev-cluster"
      + location = "us-central1"
      ...
    }

Plan: 15 to add, 0 to change, 0 to destroy.

$ terraform apply
Apply complete! Resources: 15 added, 0 changed, 0 destroyed.

Outputs:
  cluster_endpoint = "35.xxx.xxx.xxx"
  cluster_name = "dev-cluster"
  vpc_id = "projects/my-project/global/networks/dev-vpc"

$ terraform state list
module.gke.google_container_cluster.primary
module.gke.google_container_node_pool.primary_nodes
module.vpc.google_compute_network.main
module.vpc.google_compute_subnetwork.private
module.iam.google_project_iam_binding.gke_admin
...

Implementation Hints:

Terraform GCP best practices:

  • Use separate projects for different environments
  • Store state in GCS with versioning enabled
  • Use google-beta provider for new features
  • Enable required APIs in Terraform (google_project_service)
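
One handy CI guard for drift is terraform plan -detailed-exitcode, which exits 0 when nothing would change, 1 on error, and 2 when live resources have diverged from the code. A minimal Python wrapper sketch (it assumes Terraform is on PATH and the working directory is an initialized environment such as environments/dev):

import subprocess
import sys

# terraform plan -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes/drift
result = subprocess.run(
    ["terraform", "plan", "-detailed-exitcode", "-no-color"],
    capture_output=True,
    text=True,
)
if result.returncode == 2:
    print("Drift detected between Terraform code and live infrastructure:")
    print(result.stdout)
    sys.exit(1)  # fail the CI job so the drift gets reconciled
if result.returncode == 1:
    print(result.stderr)
    sys.exit(1)
print("No drift detected.")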

Module structure:

# modules/gke-cluster/main.tf
resource "google_container_cluster" "primary" {
  name     = var.cluster_name
  location = var.region
  ...
}

# modules/gke-cluster/variables.tf
variable "cluster_name" {
  type = string
}

variable "region" {
  type = string
}

# modules/gke-cluster/outputs.tf
output "endpoint" {
  value = google_container_cluster.primary.endpoint
}

Questions to guide your implementation:

  • How do you manage secrets in Terraform (hint: Secret Manager data source)?
  • How do you prevent state file conflicts in a team?
  • How do you import existing resources?
  • How do you test Terraform modules?

Learning milestones:

  1. You provision resources → You understand Terraform basics
  2. You create reusable modules → You understand modularity
  3. You manage state remotely → You understand team workflows
  4. You integrate with CI/CD → You understand GitOps

Project 14: Multi-Region Global Application (Understand Global Load Balancing)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Terraform (HCL)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Global Infrastructure / CDN / HA
  • Software or Tool: Global HTTP(S) LB, Cloud CDN, Cloud Armor
  • Main Book: “Designing Distributed Systems” by Brendan Burns

What you’ll build: A globally distributed application with instances in multiple regions, global HTTP(S) load balancing, Cloud CDN for static assets, Cloud Armor for WAF/DDoS protection, and automatic failover.

Why it teaches global architecture: GCP’s global load balancing is unique—a single anycast IP routes traffic to the nearest healthy backend. Understanding this model is essential for building truly global applications.

Core challenges you’ll face:

  • Global vs. regional load balancing → maps to choosing the right LB type
  • Backend services and health checks → maps to traffic management
  • Cloud CDN configuration → maps to caching policies, cache invalidation
  • Cloud Armor policies → maps to WAF rules, rate limiting
  • Multi-region failover → maps to disaster recovery

Key Concepts:

Difficulty: Expert. Time estimate: 2-3 weeks. Prerequisites: Projects 4, 6-7, understanding of DNS and HTTP.

Real world outcome:

$ ./global-app deploy \
    --regions "us-central1,europe-west1,asia-east1" \
    --service "web-app"

Deploying to 3 regions...
  ✓ us-central1: 2 instances (RUNNING)
  ✓ europe-west1: 2 instances (RUNNING)
  ✓ asia-east1: 2 instances (RUNNING)

$ ./global-app create-lb \
    --name "global-web-lb" \
    --backends "us-central1,europe-west1,asia-east1" \
    --cdn-enabled \
    --armor-policy "default-policy"

Creating Global HTTP(S) Load Balancer...
  ✓ Reserved global IP: 34.120.xxx.xxx
  ✓ SSL certificate: managed (for example.com)
  ✓ Backend services configured (3 regions)
  ✓ Health checks enabled (HTTP /health)
  ✓ Cloud CDN enabled (cache static assets)
  ✓ Cloud Armor policy attached

Global endpoint: https://34.120.xxx.xxx
Configure DNS: example.com A 34.120.xxx.xxx

$ ./global-app test-latency
╭─────────────────────────────────────────────────────────────╮
│ Latency Test: https://example.com                           │
├─────────────────────────────────────────────────────────────┤
│ Client Location      Latency    Backend Used               │
│ ─────────────────────────────────────────────────────────── │
│ New York, US         45ms       us-central1                │
│ London, UK           38ms       europe-west1               │
│ Tokyo, Japan         42ms       asia-east1                 │
│ São Paulo, Brazil    120ms      us-central1                │
│ Sydney, Australia    95ms       asia-east1                 │
│                                                             │
│ Cache hit rate: 67% (static assets)                        │
╰─────────────────────────────────────────────────────────────╯

$ ./global-app simulate-failover --region "us-central1"
Simulating failure of us-central1...
  ✓ us-central1 backends marked unhealthy
  ✓ Traffic rerouted:
      - US East Coast → europe-west1 (120ms latency)
      - US West Coast → asia-east1 (150ms latency)
  ✓ No dropped requests during failover

Recovering us-central1...
  ✓ Backends healthy
  ✓ Traffic restored to normal routing

Implementation Hints:

Global LB components:

  1. Global External IP: Anycast IP, routed to nearest Google edge
  2. Forwarding rule: Maps IP:port to target proxy
  3. Target HTTP(S) proxy: SSL termination, URL map
  4. URL map: Routes paths to backend services
  5. Backend service: Group of backends in one or more regions
  6. Instance groups: The actual compute resources
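
For orientation, the sketch below shows how two of these pieces (the health check and a CDN-enabled global backend service, component 5) might be created with the google-cloud-compute client library. The names, instance-group URL, and EXTERNAL_MANAGED scheme are assumptions for illustration, and attribute spellings (such as the trailing underscore in type_) follow the generated client's conventions; verify them against the library version you install.

from google.cloud import compute_v1

PROJECT = "my-project"  # placeholder project ID

# Health check the backend service will reference (HTTP GET /health)
health_check = compute_v1.HealthCheck(
    name="web-hc",
    type_="HTTP",
    http_health_check={"port": 80, "request_path": "/health"},
)
compute_v1.HealthChecksClient().insert(
    project=PROJECT, health_check_resource=health_check
).result()  # wait for the global operation to finish

# Backend service: global, CDN-enabled, one backend (instance group) per region
backend_service = compute_v1.BackendService(
    name="global-web-backend",
    protocol="HTTP",
    load_balancing_scheme="EXTERNAL_MANAGED",
    enable_cdn=True,
    health_checks=[
        f"https://www.googleapis.com/compute/v1/projects/{PROJECT}/global/healthChecks/web-hc"
    ],
    backends=[
        compute_v1.Backend(
            group=f"https://www.googleapis.com/compute/v1/projects/{PROJECT}"
                  "/zones/us-central1-a/instanceGroups/web-ig",  # placeholder group
            balancing_mode="UTILIZATION",
            max_utilization=0.8,
        )
    ],
)
compute_v1.BackendServicesClient().insert(
    project=PROJECT, backend_service_resource=backend_service
).result()
# The URL map, target proxy, global IP, and forwarding rule (components 1-4)
# are created the same way with their respective clients.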

Traffic management:

  • Affinity: Session stickiness (client IP, cookie)
  • Balancing mode: Utilization vs. rate vs. connection
  • Capacity: Maximum % of backends to use

Questions to guide your implementation:

  • When would you use regional instead of global load balancing?
  • How do you configure cache-control headers for CDN? (see the sketch after this list)
  • How do you handle cache invalidation?
  • How do you rate-limit by IP with Cloud Armor?
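
On the cache-control question: Cloud CDN decides what to cache from standard HTTP caching headers, so the origin service just sets them per route. A small sketch using Flask (the framework and routes are assumptions, not part of the project spec):

from flask import Flask, send_from_directory

app = Flask(__name__)

@app.route("/static/<path:filename>")
def static_asset(filename):
    # Long-lived, publicly cacheable: Cloud CDN serves repeat hits from the edge
    response = send_from_directory("static", filename)
    response.headers["Cache-Control"] = "public, max-age=3600"
    return response

@app.route("/api/profile")
def profile():
    # Per-user data: must never be cached at the edge
    return {"user": "example"}, 200, {"Cache-Control": "private, no-store"}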

Learning milestones:

  1. You create a global load balancer → You understand GCP’s edge network
  2. You configure CDN → You understand caching
  3. You implement Cloud Armor → You understand security
  4. You test failover → You understand high availability

Project 15: AI/ML Pipeline (Understand Vertex AI)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Jupyter Notebooks, TensorFlow, JAX
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Machine Learning / MLOps
  • Software or Tool: Vertex AI, TensorFlow, Kubeflow Pipelines
  • Main Book: “Machine Learning Engineering” by Andriy Burkov

What you’ll build: An end-to-end ML pipeline using Vertex AI—data preparation, model training on managed compute, hyperparameter tuning, model registry, online/batch prediction endpoints, and A/B testing of models.

Why it teaches AI/ML on GCP: Vertex AI unifies GCP’s ML services. Understanding managed training, AutoML, prediction endpoints, and ML pipelines is essential for production ML systems.

Core challenges you’ll face:

  • Custom training jobs → maps to containerized training on managed compute
  • Hyperparameter tuning → maps to Vertex AI Vizier integration
  • Model registry → maps to versioning and lineage
  • Prediction endpoints → maps to online vs. batch, autoscaling
  • ML Pipelines → maps to Kubeflow Pipelines on Vertex

Key Concepts:

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Python, ML basics (training, evaluation), Project 9 (BigQuery), understanding of containers.

Real world outcome:

$ ./ml-pipeline create-dataset \
    --source "bq://my-project.analytics.training_data" \
    --type "tabular" \
    --split "80:10:10"

Dataset created: projects/my-project/locations/us-central1/datasets/12345
  Source: BigQuery table
  Rows: 1,234,567
  Split: 80% train, 10% validation, 10% test

$ ./ml-pipeline train \
    --dataset "12345" \
    --model-type "custom" \
    --container "us-central1-docker.pkg.dev/my-project/ml/trainer:v1" \
    --machine-type "n1-standard-8" \
    --accelerator "NVIDIA_TESLA_T4:1" \
    --hyperparameters "learning_rate=0.001,batch_size=64"

Training job submitted: train-job-12345
  Machine: n1-standard-8 + 1x T4 GPU
  Container: trainer:v1

Status: RUNNING (epoch 15/100)
Metrics:
  Training loss: 0.234
  Validation accuracy: 0.891

$ ./ml-pipeline tune \
    --base-config "./training_config.yaml" \
    --objective "maximize:accuracy" \
    --parameters "learning_rate[0.0001,0.01],batch_size[32,64,128]" \
    --max-trials 20

Hyperparameter tuning job started: tune-job-12345
  Trials: 20
  Algorithm: Bayesian optimization

Best trial (18/20):
  learning_rate: 0.0032
  batch_size: 64
  accuracy: 0.923

$ ./ml-pipeline deploy \
    --model "model-12345" \
    --endpoint "prediction-endpoint" \
    --machine-type "n1-standard-2" \
    --min-replicas 1 \
    --max-replicas 5

Model deployed: prediction-endpoint
  URL: https://us-central1-aiplatform.googleapis.com/.../endpoints/67890
  Autoscaling: 1-5 replicas

$ ./ml-pipeline predict \
    --endpoint "prediction-endpoint" \
    --instances '[{"feature1": 0.5, "feature2": "value"}]'

Predictions:
  [{"prediction": 1, "probability": 0.934}]

$ ./ml-pipeline monitor --endpoint "prediction-endpoint"
╭─────────────────────────────────────────────────────────────╮
│ Endpoint: prediction-endpoint (last 24h)                    │
├─────────────────────────────────────────────────────────────┤
│ Predictions: 45,678                                         │
│ Latency (p50/p95): 25ms / 85ms                             │
│ Error rate: 0.02%                                           │
│                                                             │
│ Feature drift detected:                                     │
│   ⚠️ feature1: distribution shift (KL divergence: 0.15)    │
│   Consider retraining with recent data                     │
╰─────────────────────────────────────────────────────────────╯

Implementation Hints:

Vertex AI components:

  1. Datasets: Managed data with splits
  2. Training: Custom containers or AutoML
  3. Models: Versioned model artifacts
  4. Endpoints: Serving infrastructure
  5. Pipelines: Orchestrated ML workflows
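
The google-cloud-aiplatform SDK maps almost one-to-one onto these components. A hedged end-to-end sketch (the project, staging bucket, and prebuilt serving image are placeholders; the trainer image matches the CLI example above):

from google.cloud import aiplatform

aiplatform.init(
    project="my-project",                              # placeholder
    location="us-central1",
    staging_bucket="gs://my-project-vertex-staging",   # placeholder
)

# Training: custom container on managed compute with one T4 GPU
job = aiplatform.CustomContainerTrainingJob(
    display_name="tabular-trainer",
    container_uri="us-central1-docker.pkg.dev/my-project/ml/trainer:v1",
    model_serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"  # example prebuilt image
    ),
)
model = job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    args=["--learning_rate=0.001", "--batch_size=64"],
)

# Endpoint: deploy the registered model with autoscaling, then predict online
endpoint = model.deploy(
    machine_type="n1-standard-2",
    min_replica_count=1,
    max_replica_count=5,
)
print(endpoint.predict(instances=[{"feature1": 0.5, "feature2": "value"}]))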

Custom training container:

  • Must read hyperparameters from command line or env vars
  • Must save model to $AIP_MODEL_DIR
  • Can use any ML framework
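
Inside the container the contract is small: parse hyperparameters from the command line (or env vars) and write model artifacts to the Cloud Storage URI Vertex AI passes in AIP_MODEL_DIR. A skeleton of the entry point (the training step itself is elided):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.001)
parser.add_argument("--batch_size", type=int, default=64)
args = parser.parse_args()

# Vertex AI sets AIP_MODEL_DIR to a gs:// URI where artifacts must be written
model_dir = os.environ["AIP_MODEL_DIR"]

# ... build and train the model using args.learning_rate / args.batch_size ...

# Save artifacts to model_dir with your framework's saver,
# e.g. a tf.keras model can write directly to the gs:// path.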

Questions to guide your implementation:

  • When should you use AutoML vs. custom training?
  • How do you handle feature engineering in Vertex AI?
  • How do you implement A/B testing of models?
  • How do you monitor for model drift?

Learning milestones:

  1. You create training jobs → You understand managed training
  2. You tune hyperparameters → You understand Vizier
  3. You deploy endpoints → You understand model serving
  4. You build pipelines → You understand MLOps

Project 16: Cost Optimization System (Understand Billing & Cost Management)

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, SQL (BigQuery)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: FinOps / Cost Management
  • Software or Tool: Billing Export, Recommender API, Budget Alerts
  • Main Book: “Cloud FinOps” by J.R. Storment & Mike Fuller

What you’ll build: A cost optimization tool that analyzes billing exports, identifies waste (idle VMs, oversized instances, unused storage), implements recommendations automatically, and sets up budget alerts with notifications.

Why it teaches cost management: Cloud costs can spiral out of control. Understanding GCP’s billing model, export options, and cost optimization tools is essential for any cloud architect.

Core challenges you’ll face:

  • Billing export to BigQuery → maps to detailed cost analysis
  • Recommender API → maps to automated optimization suggestions
  • Committed use discounts → maps to long-term planning
  • Budget alerts → maps to proactive cost control
  • Resource labeling → maps to cost allocation

Key Concepts:

Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Project 9 (BigQuery), understanding of GCP services.

Real world outcome:

$ ./cost-optimizer analyze --project "production-project" --period "30d"
╭─────────────────────────────────────────────────────────────╮
│ Cost Analysis: production-project (Last 30 days)            │
├─────────────────────────────────────────────────────────────┤
│ Total spend: $4,567.89                                      │
│                                                             │
│ By service:                                                 │
│   Compute Engine:    $2,345.00 (51%)                       │
│   BigQuery:          $890.00 (19%)                         │
│   Cloud Storage:     $456.00 (10%)                         │
│   Cloud SQL:         $345.00 (8%)                          │
│   Other:             $531.89 (12%)                         │
│                                                             │
│ By label (team):                                            │
│   platform:          $2,100.00                             │
│   data:              $1,200.00                             │
│   (unlabeled):       $1,267.89 ⚠️                          │
╰─────────────────────────────────────────────────────────────╯

$ ./cost-optimizer recommendations
╭─────────────────────────────────────────────────────────────╮
│ Cost Optimization Recommendations                            │
├─────────────────────────────────────────────────────────────┤
│ 1. IDLE VMs (save $456/month)                               │
│    - dev-server-1: 0% CPU for 14 days → Stop or delete     │
│    - test-env-3: 2% CPU average → Downsize to e2-small     │
│                                                             │
│ 2. OVERSIZED INSTANCES (save $234/month)                    │
│    - web-server-1: n2-standard-8 → n2-standard-4           │
│    - worker-2: n2-highmem-4 → n2-standard-4                │
│                                                             │
│ 3. UNUSED STORAGE (save $89/month)                          │
│    - old-backups/: 500GB not accessed in 90 days           │
│    - Recommend: Move to Coldline or delete                  │
│                                                             │
│ 4. COMMITTED USE DISCOUNTS (save $1,200/month)              │
│    - Stable workload of 20 vCPUs detected                  │
│    - Recommend: 1-year commitment for 37% discount         │
│                                                             │
│ Total potential savings: $1,979/month (43% reduction)       │
╰─────────────────────────────────────────────────────────────╯

$ ./cost-optimizer apply --recommendation "idle-vms"
Applying recommendations...
  ✓ dev-server-1: Stopped
  ✓ test-env-3: Resize scheduled (requires restart)

Estimated savings: $456/month

$ ./cost-optimizer create-budget \
    --amount 5000 \
    --period monthly \
    --thresholds "50,80,100" \
    --notification-channels "email:finance@company.com,pubsub:budget-alerts"

Budget created: monthly-5000
  Amount: $5,000/month
  Alerts at: 50%, 80%, 100%
  Notifications: email, Pub/Sub

Implementation Hints:

Billing export to BigQuery:

  • Standard export: Daily, summarized by project/service
  • Detailed export: Hourly, includes resource-level data
  • Pricing export: List prices for analysis
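
Once the export is flowing, per-service breakdowns like the analysis above are a single query. A sketch with the google-cloud-bigquery client (the dataset name and billing-account suffix in the table name are placeholders; the fields used are part of the standard export schema):

from google.cloud import bigquery

client = bigquery.Client()

# Export tables are named gcp_billing_export_v1_<BILLING_ACCOUNT_ID>
QUERY = """
SELECT
  service.description AS service_name,
  ROUND(SUM(cost), 2) AS total_cost
FROM `my-project.billing_export.gcp_billing_export_v1_XXXXXX`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY service_name
ORDER BY total_cost DESC
"""

for row in client.query(QUERY).result():
    print(f"{row.service_name:<25} ${row.total_cost:,.2f}")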

Recommender types:

  • google.compute.instance.MachineTypeRecommender
  • google.compute.instance.IdleResourceRecommender
  • google.iam.policy.Recommender
  • Many more…
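
Recommendations are listed per project, location, and recommender ID. A minimal sketch with the google-cloud-recommender client (project and zone are placeholders; note that compute instance recommenders are zonal):

from google.cloud import recommender_v1

client = recommender_v1.RecommenderClient()

parent = (
    "projects/my-project/locations/us-central1-a/"
    "recommenders/google.compute.instance.IdleResourceRecommender"
)

for rec in client.list_recommendations(parent=parent):
    # Each recommendation describes the suggested change and its expected impact
    print(rec.description)
    print("  impact category:", rec.primary_impact.category.name)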

Questions to guide your implementation:

  • How do you attribute costs to teams using labels?
  • When do committed use discounts make sense?
  • How do you forecast future costs?
  • How do you implement a chargeback model?

Learning milestones:

  1. You analyze billing data → You understand cost visibility
  2. You implement recommendations → You understand optimization
  3. You set up alerts → You understand proactive management
  4. You attribute costs → You understand FinOps

Final Project: Production SaaS Platform on GCP

  • File: GOOGLE_CLOUD_PLATFORM_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, TypeScript
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Full-Stack Cloud Architecture
  • Software or Tool: All GCP services integrated
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A complete, production-ready SaaS platform that combines everything you’ve learned—multi-region deployment, GKE with autoscaling, Cloud SQL and Firestore, Pub/Sub for events, BigQuery for analytics, Vertex AI for ML features, global load balancing, comprehensive observability, security best practices, infrastructure as code, and CI/CD.

Why this is the ultimate project: This is where everything comes together. You’ll face real architectural decisions, tradeoffs, and challenges that only emerge in complex systems. Completing this project proves you can architect and operate production GCP infrastructure.

Core challenges you’ll face:

  • Multi-tenancy architecture → maps to data isolation, resource quotas
  • Event-driven microservices → maps to Pub/Sub, eventual consistency
  • Data pipeline for analytics → maps to BigQuery, real-time dashboards
  • ML feature integration → maps to Vertex AI endpoints in production
  • Security and compliance → maps to audit logging, encryption, VPC SC
  • Disaster recovery → maps to backups, multi-region failover

Key Concepts:

  • System Design: “Designing Data-Intensive Applications” - Kleppmann
  • SaaS Architecture: “Building Multi-Tenant Applications” - various Google docs
  • Microservices: “Building Microservices” by Sam Newman
  • SRE: “Site Reliability Engineering” - Google

Difficulty: Master. Time estimate: 2-3 months. Prerequisites: All previous projects (1-16).

Real world outcome:

Production SaaS Platform Architecture:
┌─────────────────────────────────────────────────────────────────────────┐
│                         Global HTTP(S) Load Balancer                     │
│                         + Cloud Armor + Cloud CDN                        │
└──────────────────────────────────┬──────────────────────────────────────┘
                                   │
        ┌──────────────────────────┼──────────────────────────┐
        │                          │                          │
┌───────▼───────┐         ┌────────▼───────┐         ┌────────▼───────┐
│  us-central1  │         │  europe-west1  │         │   asia-east1   │
├───────────────┤         ├────────────────┤         ├────────────────┤
│   GKE Cluster │         │   GKE Cluster  │         │   GKE Cluster  │
│   - API Pods  │         │   - API Pods   │         │   - API Pods   │
│   - Workers   │         │   - Workers    │         │   - Workers    │
│               │         │                │         │                │
│   Cloud SQL   │◄───────►│   Cloud SQL    │◄───────►│   Cloud SQL    │
│   (Primary)   │ Replica │   (Replica)    │ Replica │   (Replica)    │
└───────────────┘         └────────────────┘         └────────────────┘
        │                          │                          │
        └──────────────────────────┼──────────────────────────┘
                                   │
                          ┌────────▼────────┐
                          │     Pub/Sub     │
                          │  (Event Bus)    │
                          └────────┬────────┘
                                   │
        ┌────────────────┬─────────┴─────────┬────────────────┐
        ▼                ▼                   ▼                ▼
┌───────────────┐ ┌─────────────┐ ┌─────────────────┐ ┌──────────────┐
│ Cloud Storage │ │  Firestore  │ │    BigQuery     │ │  Vertex AI   │
│ (Attachments) │ │ (Real-time) │ │  (Analytics)    │ │ (ML Models)  │
└───────────────┘ └─────────────┘ └─────────────────┘ └──────────────┘

Implementation Hints:

Architecture principles:

  1. Twelve-Factor App: Config in env, logs as streams, etc.
  2. Microservices: Single responsibility, API contracts
  3. Event sourcing: Pub/Sub as the source of truth
  4. CQRS: Separate read (BigQuery) and write (Cloud SQL) paths
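
For the event-sourcing and CQRS points, the write path publishes domain events to Pub/Sub and downstream consumers build the read models (BigQuery tables, Firestore documents). A publishing sketch (topic name, event shape, and attributes are assumptions):

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "domain-events")  # placeholders

def publish_event(event_type: str, tenant_id: str, payload: dict) -> str:
    event = {"type": event_type, "tenant_id": tenant_id, "payload": payload}
    # Attributes let subscribers filter messages without decoding the body
    future = publisher.publish(
        topic_path,
        data=json.dumps(event).encode("utf-8"),
        event_type=event_type,
        tenant_id=tenant_id,
    )
    return future.result()  # message ID once Pub/Sub has accepted the publish

publish_event("invoice.created", "tenant-123", {"amount": 42.0})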

Multi-tenancy patterns:

  • Pool model: Shared resources, tenant ID in data
  • Silo model: Separate projects per tenant (high isolation)
  • Bridge model: Shared compute, separate data stores
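
In the pool model, the tenant filter has to live in the data-access layer itself so no query can forget it. A sketch of tenant-scoped reads with the Firestore client (collection and field names are illustrative):

from google.cloud import firestore
from google.cloud.firestore_v1.base_query import FieldFilter

db = firestore.Client()

def list_documents(tenant_id: str) -> list[dict]:
    # Every read is filtered by tenant_id; combine this with per-tenant
    # security rules or IAM conditions for defense in depth.
    query = db.collection("documents").where(
        filter=FieldFilter("tenant_id", "==", tenant_id)
    )
    return [doc.to_dict() for doc in query.stream()]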

High-level implementation phases:

  1. Infrastructure: Terraform modules for entire platform
  2. Core services: Auth, API gateway, user management
  3. Business logic: Tenant-specific features
  4. Data layer: Cloud SQL + Firestore + BigQuery integration
  5. ML features: Vertex AI endpoints for recommendations
  6. Observability: SLOs, dashboards, alerting
  7. CI/CD: Automated deployments with approval gates
  8. Security: VPC SC, audit logging, penetration testing
  9. DR testing: Multi-region failover drills

Questions to guide your implementation:

  • How do you handle database migrations in a multi-tenant system?
  • How do you implement feature flags for gradual rollouts?
  • How do you handle backpressure in the event pipeline?
  • How do you implement rate limiting per tenant?
  • How do you ensure data isolation between tenants?

Learning milestones:

  1. Core platform running → You can integrate multiple GCP services
  2. Multi-tenancy working → You understand data isolation
  3. Analytics pipeline flowing → You understand real-time data
  4. ML features deployed → You understand MLOps
  5. Surviving chaos testing → You understand reliability
  6. Passing security audit → You understand compliance

Project Comparison Table

Project | Difficulty | Time | Depth of Understanding | Fun Factor
1. GCP CLI & API Explorer | Beginner | Weekend | ⭐⭐ | ⭐⭐
2. VM Orchestrator | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐
3. Object Storage System | Intermediate | 1 week | ⭐⭐⭐ | ⭐⭐
4. VPC Network Architect | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐
5. IAM Policy Manager | Advanced | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐
6. GKE Cluster Manager | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐
7. Cloud Run Platform | Intermediate | 1 week | ⭐⭐⭐ | ⭐⭐⭐⭐
8. Event-Driven Pipeline | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐
9. BigQuery Analytics | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐
10. Multi-Database App | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐
11. CI/CD Pipeline | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐
12. Observability Platform | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐
13. Terraform IaC | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐
14. Global Application | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐
15. Vertex AI Pipeline | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐
16. Cost Optimization | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐
Final: SaaS Platform | Master | 2-3 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐

For Complete Beginners to Cloud

  1. Start with Project 1 (CLI & API Explorer) - 1 weekend
  2. Then Project 2 (VM Orchestrator) - 1-2 weeks
  3. Then Project 3 (Cloud Storage) - 1 week
  4. Then Project 7 (Cloud Run) - 1 week

This gives you a foundation in compute, storage, and serverless in about 1 month.

For Developers with Some Cloud Experience

  1. Skim Projects 1-3 quickly
  2. Deep dive into Project 4 (VPC Networking) - 2 weeks
  3. Then Project 5 (IAM) - 1-2 weeks
  4. Then Project 6 (GKE) - 2-3 weeks
  5. Then Project 11 (CI/CD) - 1-2 weeks

This builds your infrastructure and DevOps skills in about 2 months.

For the GCP Certification Path

Follow projects in order 1-16, then attempt the final project. This comprehensive path will prepare you for:

  • Associate Cloud Engineer: Projects 1-7, 11
  • Professional Cloud Architect: All projects
  • Professional Data Engineer: Projects 1, 3, 8, 9, 15
  • Professional Cloud DevOps Engineer: Projects 1, 6-8, 11-12

Summary

# | Project Name | Main Language
1 | GCP CLI & API Explorer | Python
2 | Virtual Machine Orchestrator | Python
3 | Object Storage System | Python
4 | VPC Network Architect | Python
5 | IAM Policy Manager | Python
6 | Kubernetes Cluster Manager | Go
7 | Serverless Container Platform | Go
8 | Event-Driven Data Pipeline | Python
9 | Data Warehouse Analytics Platform | Python
10 | Multi-Database Application | Python
11 | CI/CD Pipeline | YAML (Cloud Build)
12 | Observability Platform | Python
13 | Infrastructure as Code | Terraform (HCL)
14 | Multi-Region Global Application | Python
15 | AI/ML Pipeline | Python
16 | Cost Optimization System | Python
Final | Production SaaS Platform | Go

Essential Resources

Official Documentation

Books

  • “Google Cloud Platform in Action” by JJ Geewax
  • “Google BigQuery: The Definitive Guide” by Valliappa Lakshmanan & Jordan Tigani
  • “Kubernetes Up & Running” by Burns, Beda, Hightower
  • “Site Reliability Engineering” by Google
  • “Designing Data-Intensive Applications” by Martin Kleppmann
  • “Terraform: Up & Running” by Yevgeniy Brikman

Certifications

Community