Deep Understanding of AWS Through Building
Goal: Master AWS cloud infrastructure by building real systems that break, fail, and teach you why networking, security, and scalability patterns exist. You’ll go from clicking through the console to understanding why a private subnet needs a NAT Gateway, how IAM policy evaluation actually works, and when to choose Lambda over ECS over EC2.
Why AWS Mastery Requires Building
You’re tackling a massive ecosystem with over 200 services across compute, storage, networking, databases, machine learning, and more. As of 2025, AWS dominates the cloud infrastructure market with a 30% market share (compared to Azure’s 20% and Google Cloud’s 13%), serving 4.19 million customers worldwide—a 357% increase since 2020.
The market reality: Cloud infrastructure spending hit $102.6 billion in Q3 2025 alone, with AWS earning $29 billion from cloud services in Q1 2025. The cloud market is projected to exceed $400 billion annually for the first time in 2025. This means AWS expertise is not just valuable—it’s increasingly critical as organizations migrate workloads to the cloud.
But here’s the challenge: AWS is not something you can understand by reading documentation—you need to build systems where misconfigured security groups break things, where wrong subnet routing causes timeouts, and where Lambda cold starts ruin your latency (typically 200-400ms for Python/Node.js, but can reach 1-3 seconds for Java/.NET). That’s when the concepts become real.
The AWS Learning Problem
Most developers learn AWS backwards:
- Click through console tutorials
- Copy-paste CloudFormation templates
- Wonder why things break in production
- Panic when asked “why did you choose this architecture?”
The right approach:
- Understand the problem each service solves
- Build it wrong first (feel the pain)
- Fix it using IaC (understand every line)
- Break it intentionally (learn failure modes)
- Explain your architecture to others
The AWS Shared Responsibility Model
Before building anything, understand who is responsible for what:
┌─────────────────────────────────────────────────────────────────────────────┐
│ YOUR RESPONSIBILITY │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Customer Data │ │
│ ├───────────────────────────────────────────────────────────────────────┤ │
│ │ Platform, Applications, Identity & Access Management │ │
│ ├───────────────────────────────────────────────────────────────────────┤ │
│ │ Operating System, Network & Firewall Configuration │ │
│ ├───────────────────────────────────────────────────────────────────────┤ │
│ │ Client-Side Data Encryption │ Server-Side Encryption │ Network Traffic│ │
│ │ & Data Integrity Auth │ (File System/Data) │ Protection │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────────────┤
│ AWS RESPONSIBILITY │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Compute │ Storage │ Database │ Networking │ │
│ ├───────────────────────────────────────────────────────────────────────┤ │
│ │ Hardware/AWS Global Infrastructure │ │
│ │ (Regions, Availability Zones, Edge Locations) │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Key insight: AWS manages the physical infrastructure. YOU manage everything from the OS up. If your EC2 instance gets hacked because of a weak password, that’s on you. If an AWS data center floods, that’s on them.
Understanding AWS Architecture: The Mental Model
The Three Pillars of AWS
Every AWS architecture decision comes down to balancing three concerns:
┌─────────────────┐
│ │
│ RELIABILITY │
│ (Multi-AZ, │
│ Redundancy) │
│ │
└────────┬────────┘
│
┌──────────────┴──────────────┐
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ │ │ │
│ COST │◄─────────►│ PERFORMANCE │
│ (Right-sizing, │ │ (Low latency, │
│ Reserved, │ │ High throughput│
│ Spot) │ │ Scaling) │
│ │ │ │
└─────────────────┘ └─────────────────┘
The AWS Well-Architected Trade-off Triangle
Every architecture decision involves trade-offs:
- Multi-AZ RDS is more reliable but costs 2x
- Spot instances are 70% cheaper but can be terminated anytime
- Lambda has zero idle cost but cold starts add latency
AWS Global Infrastructure
Understanding the hierarchy is critical:
┌─────────────────────────────────────────────────────────────────────────────┐
│ AWS GLOBAL │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ Region: us-east-1 (N. Virginia) ││
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ ││
│ │ │ AZ 1 │ │ AZ 2 │ │ AZ 3 │ ... ││
│ │ │ us-east-1a │ │ us-east-1b │ │ us-east-1c │ ││
│ │ │ │ │ │ │ │ ││
│ │ │ ┌────────────┐ │ │ ┌────────────┐ │ │ ┌────────────┐ │ ││
│ │ │ │ Data Center│ │ │ │ Data Center│ │ │ │ Data Center│ │ ││
│ │ │ │ │ │ │ │ │ │ │ │ │ │ ││
│ │ │ │ • EC2 │ │ │ │ • EC2 │ │ │ │ • EC2 │ │ ││
│ │ │ │ • RDS │ │ │ │ • RDS │ │ │ │ • RDS │ │ ││
│ │ │ │ • EBS │ │ │ │ • EBS │ │ │ │ • EBS │ │ ││
│ │ │ └────────────┘ │ │ └────────────┘ │ │ └────────────┘ │ ││
│ │ └────────────────┘ └────────────────┘ └────────────────┘ ││
│ │ ││
│ │ ← Low latency connections between AZs (< 2ms) → ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ Region: eu-west-1 (Ireland) ││
│ │ ...similar AZ structure... ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │
│ Edge Locations (CloudFront): 400+ worldwide for content delivery │
└─────────────────────────────────────────────────────────────────────────────┘
Key insight for projects:
- Region: Contains all your resources, data residency compliance
- AZ (Availability Zone): Isolated data centers, failure boundary
- Multi-AZ: Deploy across 2+ AZs for high availability (if one AZ fails, others continue)
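Quick check: you can see this hierarchy from your own account with the AWS CLI. A minimal sketch (region and output fields are illustrative; note that AZ names like us-east-1a map to different physical zones per account, which is why ZoneId is included):
# List the AZs your account sees in a region
aws ec2 describe-availability-zones \
  --region us-east-1 \
  --query 'AvailabilityZones[].{Name:ZoneName,Id:ZoneId,State:State}' \
  --output table
# List all regions enabled for your account
aws ec2 describe-regions --query 'Regions[].RegionName' --output table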
VPC Networking: The Foundation of Everything
Every AWS resource you deploy lives in a network. Understanding VPC is non-negotiable.
What is a VPC?
A Virtual Private Cloud (VPC) is your isolated network within AWS. Think of it as your own private data center in the cloud, with complete control over:
- IP address ranges (CIDR blocks)
- Subnets (network segments)
- Route tables (traffic rules)
- Gateways (internet access)
- Security (firewalls)
The Anatomy of a Production VPC
┌─────────────────────────────────────────────────────────────────────────────────┐
│ VPC: 10.0.0.0/16 (65,536 IP addresses) │
│ │
│ ┌─────────────────────────────────────┐ ┌─────────────────────────────────────┐│
│ │ Availability Zone A (us-east-1a) │ │ Availability Zone B (us-east-1b) ││
│ │ │ │ ││
│ │ ┌─────────────────────────────┐ │ │ ┌─────────────────────────────┐ ││
│ │ │ PUBLIC SUBNET: 10.0.1.0/24 │ │ │ │ PUBLIC SUBNET: 10.0.2.0/24 │ ││
│ │ │ (256 IPs) │ │ │ │ (256 IPs) │ ││
│ │ │ │ │ │ │ │ ││
│ │ │ ┌───────┐ ┌───────┐ │ │ │ │ ┌───────┐ ┌───────┐ │ ││
│ │ │ │Bastion│ │ NAT │ │ │ │ │ │Bastion│ │ NAT │ │ ││
│ │ │ │ Host │ │Gateway│ │ │ │ │ │ (HA) │ │Gateway│ │ ││
│ │ │ └───────┘ └───┬───┘ │ │ │ │ └───────┘ └───┬───┘ │ ││
│ │ └───────────────────┼────────┘ │ │ └───────────────────┼────────┘ ││
│ │ │ │ │ │ ││
│ │ ┌───────────────────┼────────┐ │ │ ┌───────────────────┼────────┐ ││
│ │ │ PRIVATE SUBNET: │ │ │ │ │ PRIVATE SUBNET: │ │ ││
│ │ │ 10.0.10.0/24 │ │ │ │ │ 10.0.20.0/24 │ │ ││
│ │ │ (App Tier) ▼ │ │ │ │ (App Tier) ▼ │ ││
│ │ │ │ │ │ │ │ ││
│ │ │ ┌───────┐ ┌───────┐ │ │ │ │ ┌───────┐ ┌───────┐ │ ││
│ │ │ │ EC2 │ │ EC2 │ │ │ │ │ │ EC2 │ │ EC2 │ │ ││
│ │ │ │ (App) │ │ (App) │ │ │ │ │ │ (App) │ │ (App) │ │ ││
│ │ │ └───────┘ └───────┘ │ │ │ │ └───────┘ └───────┘ │ ││
│ │ └────────────────────────────┘ │ │ └────────────────────────────┘ ││
│ │ │ │ ││
│ │ ┌────────────────────────────┐ │ │ ┌────────────────────────────┐ ││
│ │ │ DATA SUBNET: 10.0.100.0/24 │ │ │ │ DATA SUBNET: 10.0.200.0/24 │ ││
│ │ │ (Database Tier) │ │ │ │ (Database Tier) │ ││
│ │ │ │ │ │ │ │ ││
│ │ │ ┌────────────────────┐ │ │ │ │ ┌────────────────────┐ │ ││
│ │ │ │ RDS Primary │───┼───┼─┼──│ │ RDS Standby │ │ ││
│ │ │ │ (PostgreSQL) │ │ │ │ │ │ (Sync Replica) │ │ ││
│ │ │ └────────────────────┘ │ │ │ │ └────────────────────┘ │ ││
│ │ └────────────────────────────┘ │ │ └────────────────────────────┘ ││
│ └─────────────────────────────────────┘ └─────────────────────────────────────┘│
│ │
│ ┌─────────────────┐ │
│ │ Internet Gateway│ ◄──── Allows public subnets to reach internet │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────┐ │
│ │ Internet│ │
│ └─────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
Understanding CIDR Notation
CIDR (Classless Inter-Domain Routing) defines IP address ranges:
IP Address: 10.0.0.0
CIDR: /16
Binary breakdown:
10.0.0.0/16 means:
├── First 16 bits are FIXED (network portion)
│ 10 . 0 (00001010.00000000)
│
└── Last 16 bits are FREE (host portion)
x.x (00000000.00000000 to 11111111.11111111)
Result: 10.0.0.0 to 10.0.255.255 = 65,536 addresses
Common CIDR blocks for VPC design:
┌─────────┬─────────────────┬──────────────────────────────┐
│ CIDR │ # of IPs │ Use Case │
├─────────┼─────────────────┼──────────────────────────────┤
│ /16 │ 65,536 │ Entire VPC │
│ /20 │ 4,096 │ Large subnet (production) │
│ /24 │ 256 │ Standard subnet │
│ /28 │ 16 │ Small subnet (bastion hosts) │
└─────────┴─────────────────┴──────────────────────────────┘
⚠️ AWS reserves 5 IPs per subnet:
.0 Network address
.1 VPC router
.2 DNS server
.3 Reserved for future use
.255 Broadcast (not supported, but reserved)
So a /24 subnet (256 IPs) actually has 251 usable IPs.
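If the subnet math feels abstract, Python's standard-library ipaddress module makes it concrete. A small sketch (the CIDR values are just examples):
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnet = ipaddress.ip_network("10.0.1.0/24")

print(vpc.num_addresses)          # 65536 addresses in the VPC
print(subnet.num_addresses)       # 256 addresses in the subnet
print(subnet.num_addresses - 5)   # 251 usable after AWS's 5 reserved IPs

# Carve the /16 into /24s (first four shown)
for s in list(vpc.subnets(new_prefix=24))[:4]:
    print(s)                      # 10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24

# Overlap checks matter before VPC peering
print(vpc.overlaps(ipaddress.ip_network("10.0.200.0/24")))  # True -> peering would clash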
Public vs Private Subnets: The Critical Difference
PUBLIC SUBNET PRIVATE SUBNET
───────────── ──────────────
Route Table: Route Table:
┌────────────────────────────┐ ┌────────────────────────────┐
│ Destination │ Target │ │ Destination │ Target │
├───────────────┼────────────┤ ├───────────────┼────────────┤
│ 10.0.0.0/16 │ local │ │ 10.0.0.0/16 │ local │
│ 0.0.0.0/0 │ igw-xxxxx │◄──IGW │ 0.0.0.0/0 │ nat-xxxxx │◄──NAT
└───────────────┴────────────┘ └───────────────┴────────────┘
Key differences:
┌─────────────────┬──────────────────────┬──────────────────────┐
│ │ Public Subnet │ Private Subnet │
├─────────────────┼──────────────────────┼──────────────────────┤
│ Internet access │ Via Internet Gateway │ Via NAT Gateway │
│ Inbound traffic │ Can receive from web │ Cannot receive │
│ Public IP │ Can have Elastic IP │ No public IP │
│ Use case │ Load balancers, │ App servers, │
│ │ bastion hosts │ databases │
└─────────────────┴──────────────────────┴──────────────────────┘
Security Groups vs NACLs: Two Layers of Defense
┌─────────────────────────────────────────────────────────────────────────────┐
│ NACL (Subnet Level) │
│ - Stateless │
│ - Explicit allow/deny │
│ - Rule numbers (processed in order) │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Security Group (Instance Level) │ │
│ │ - Stateful (return traffic auto-allowed) │ │
│ │ - Allow rules only (implicit deny) │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ EC2 Instance │ │ │
│ │ │ │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Traffic flow example (HTTP request to web server):
1. Request arrives at VPC
2. NACL checks inbound rules (allow port 80?) → Yes? Continue
3. Security Group checks inbound rules → Yes? Continue
4. Request reaches EC2 instance
5. Response leaves EC2 instance
6. Security Group: AUTOMATICALLY allows response → Stateful!
7. NACL checks outbound rules (allow response?) → Must be explicit
8. NACL checks ephemeral port range → Rule needed!
Security Group (stateful): NACL (stateless):
Inbound: Allow TCP 80 Inbound: Allow TCP 80
Outbound: (not needed for response) Outbound: Allow TCP 1024-65535 (ephemeral)
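In Terraform this difference shows up directly: the security group needs only the inbound rule, while the NACL needs a matching outbound rule for ephemeral ports. A minimal sketch, assuming a VPC (aws_vpc.main) and a NACL (aws_network_acl.public) are defined elsewhere:
# Stateful: allowing inbound 80 is enough; the response is allowed automatically
resource "aws_security_group" "web" {
  vpc_id = aws_vpc.main.id
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
# Stateless: the NACL needs BOTH directions spelled out
resource "aws_network_acl_rule" "http_in" {
  network_acl_id = aws_network_acl.public.id
  rule_number    = 100
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "0.0.0.0/0"
  from_port      = 80
  to_port        = 80
}
resource "aws_network_acl_rule" "ephemeral_out" {
  network_acl_id = aws_network_acl.public.id
  rule_number    = 100
  egress         = true
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "0.0.0.0/0"
  from_port      = 1024
  to_port        = 65535
}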
IAM: The Security Model You Must Master
IAM (Identity and Access Management) controls WHO can do WHAT to WHICH resources.
The IAM Policy Language
Every IAM policy answers these questions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow", ◄─── Allow or Deny?
"Action": [ ◄─── What actions?
"s3:GetObject",
"s3:PutObject"
],
"Resource": [ ◄─── Which resources?
"arn:aws:s3:::my-bucket/*"
],
"Condition": { ◄─── Under what conditions?
"StringEquals": {
"aws:RequestedRegion": "us-east-1"
}
}
}
]
}
IAM Policy Evaluation Logic
When AWS evaluates permissions, it follows this order:
┌─────────────────────────────────────────────────────────────────────────────┐
│ IAM Policy Evaluation │
│ │
│ 1. By default, all requests are DENIED (implicit deny) │
│ │ │
│ ▼ │
│ 2. Check all applicable policies │
│ ┌──────────────┬──────────────┬──────────────┬──────────────┐ │
│ │ Identity │ Resource │ Permission │ Service │ │
│ │ Policies │ Policies │ Boundaries │ Control │ │
│ │ (IAM user/ │ (S3 bucket, │ (IAM) │ Policies │ │
│ │ role) │ SQS queue) │ │ (Org level) │ │
│ └──────────────┴──────────────┴──────────────┴──────────────┘ │
│ │ │
│ ▼ │
│ 3. If ANY policy has explicit DENY ──────────────► DENIED (final) │
│ │ │
│ ▼ │
│ 4. If ANY policy has explicit ALLOW ─────────────► ALLOWED │
│ │ │
│ ▼ │
│ 5. If no explicit allow found ───────────────────► DENIED (implicit) │
│ │
│ Remember: Explicit DENY always wins! │
└─────────────────────────────────────────────────────────────────────────────┘
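You can exercise this evaluation logic without making real requests by using the IAM policy simulator from the CLI. A sketch (the user ARN, actions, and bucket are placeholders). Keep in mind the diagram above is the simplified view: when permission boundaries or SCPs apply, an allow must be present in each applicable layer, not just one of them.
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:user/alice \
  --action-names s3:GetObject s3:PutObject \
  --resource-arns arn:aws:s3:::my-bucket/reports/q1.csv
# Each result's EvalDecision is "allowed", "implicitDeny", or "explicitDeny"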
IAM Roles vs Users vs Groups
┌─────────────────────────────────────────────────────────────────────────────┐
│ IAM USERS │
│ - Permanent credentials (access key + secret key) │
│ - Used for: Human users, CLI access │
│ - Best practice: Use MFA, rotate keys regularly │
│ │
│ ┌─────────┐ │
│ │ User │───► Has access keys, belongs to groups │
│ └─────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ IAM GROUPS │
│ - Collection of users │
│ - Policies attached to group apply to all members │
│ - Used for: Organizing users by job function (Developers, Admins) │
│ │
│ ┌─────────┐ │
│ │ Group │───► Contains users, has policies │
│ │(Devs) │ ┌──────┐ ┌──────┐ ┌──────┐ │
│ └─────────┘ │User A│ │User B│ │User C│ │
│ └──────┘ └──────┘ └──────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ IAM ROLES │
│ - Temporary credentials (auto-rotated by AWS) │
│ - Can be ASSUMED by: EC2, Lambda, ECS, other AWS accounts, SAML users │
│ - Used for: Service-to-service authentication, cross-account access │
│ │
│ ┌─────────┐ ┌─────────────────────────────────────────┐ │
│ │ Role │ │ Trust Policy: WHO can assume this role │ │
│ │(Lambda) │◄────────│ Permissions: WHAT they can do │ │
│ └─────────┘ └─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Lambda function assumes role → Gets temp credentials → Accesses S3 │
└─────────────────────────────────────────────────────────────────────────────┘
Best Practice Hierarchy:
┌──────────────────────────────────────────────────────────────┐
│ PREFER ROLES OVER USERS │
│ │
│ ✓ Roles: Temporary creds, auto-rotated, no keys to leak │
│ ✗ Users: Permanent creds, must rotate manually, can leak │
│ │
│ EC2 instances? Use Instance Profile (role wrapper) │
│ Lambda? Use Execution Role │
│ ECS tasks? Use Task Role │
│ Cross-account? Use AssumeRole │
└──────────────────────────────────────────────────────────────┘
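To make the role/trust-policy split concrete, here is a minimal Terraform sketch of a Lambda execution role: the trust policy says WHO may assume the role (the Lambda service), and the attached policy says WHAT the role may do. The role name is a placeholder; the managed policy ARN is the standard AWSLambdaBasicExecutionRole.
resource "aws_iam_role" "lambda_exec" {
  name = "my-function-exec-role"
  # Trust policy: only the Lambda service can assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}
# Permissions: what the role can do once assumed (write logs to CloudWatch)
resource "aws_iam_role_policy_attachment" "basic_logs" {
  role       = aws_iam_role.lambda_exec.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}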
Serverless Architecture: Lambda, Step Functions & Event-Driven Design
The Lambda Execution Model
Understanding Lambda’s lifecycle is critical for performance:
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAMBDA EXECUTION LIFECYCLE │
│ │
│ COLD START (first invocation or after idle) │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ 1. Download code 2. Start runtime 3. Initialize handler │ │
│ │ from S3 (Python, Node) (your imports) │ │
│ │ ~100ms ~200ms ~500-3000ms │ │
│ │ │ │
│ │ TOTAL COLD START: 800ms - 5s depending on package size │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ WARM START (execution environment reused) │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ Environment already running, just execute handler │ │
│ │ TOTAL WARM START: <100ms │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ EXECUTION CONTEXT REUSE │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ # This code runs ONCE (cold start only) │ │
│ │ import boto3 │ │
│ │ s3_client = boto3.client('s3') # Reused across invocations! │ │
│ │ │ │
│ │ # This code runs EVERY invocation │ │
│ │ def handler(event, context): │ │
│ │ s3_client.get_object(...) # Uses pre-initialized client │ │
│ │ │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │
│ LAMBDA LIMITS: │
│ ┌──────────────────────────┬───────────────────────────────────────────┐ │
│ │ Max execution time │ 15 minutes │ │
│ │ Max memory │ 10 GB │ │
│ │ Max /tmp storage │ 10 GB (ephemeral, NOT persistent) │ │
│ │ Max deployment package │ 50 MB (zipped), 250 MB (unzipped) │ │
│ │ Max concurrent executions│ 1000 (soft limit, can increase) │ │
│ └──────────────────────────┴───────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
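As a complete file, the context-reuse pattern from the box above looks like this. A Python sketch (the bucket and key names are placeholders): the client and any cached configuration are created once per cold start, not once per request.
import json
import os
import boto3

# Cold start only: runs once per execution environment
s3 = boto3.client("s3")
CONFIG_BUCKET = os.environ.get("CONFIG_BUCKET", "my-config-bucket")  # placeholder

def handler(event, context):
    # Warm invocations reuse the client (and its connection pool) created above
    obj = s3.get_object(Bucket=CONFIG_BUCKET, Key="settings.json")
    settings = json.loads(obj["Body"].read())
    return {"statusCode": 200, "body": json.dumps({"loaded_keys": list(settings)})}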
Step Functions: Orchestrating Serverless Workflows
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP FUNCTIONS STATE MACHINE │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ StartState │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌───────────────┐ │ │
│ │ │ ValidateInput │ (Task: Lambda) │ │
│ │ └───────┬───────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌───────────────┐ │ │
│ │ │ Choice │ (Decision point) │ │
│ │ └───────┬───────┘ │ │
│ │ ┌────────┴────────┐ │ │
│ │ ▼ ▼ │ │
│ │ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Valid: Yes │ │ Valid: No │ │ │
│ │ └──────┬──────┘ └──────┬──────┘ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ ProcessData │ │ SendError │ │ │
│ │ │ (Task) │ │ Notification │ │ │
│ │ └───────┬───────┘ └───────┬───────┘ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌───────────────┐ End │ │
│ │ │ Parallel │ (Run multiple branches) │ │
│ │ │ ┌───┬───┐ │ │ │
│ │ │ │ A │ B │ │ │ │
│ │ │ └───┴───┘ │ │ │
│ │ └───────┬───────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌───────────────┐ │ │
│ │ │ SaveResults │ │ │
│ │ │ (Task) │ │ │
│ │ └───────┬───────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ End │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ STATE TYPES: │
│ ┌──────────┬────────────────────────────────────────────────────────────┐ │
│ │ Task │ Execute Lambda, ECS task, API call, etc. │ │
│ │ Choice │ If/else branching based on input │ │
│ │ Parallel │ Execute multiple branches simultaneously │ │
│ │ Map │ Iterate over array, execute state for each item │ │
│ │ Wait │ Delay for specified time │ │
│ │ Pass │ Pass input to output (useful for transformations) │ │
│ │ Succeed │ Terminal state indicating success │ │
│ │ Fail │ Terminal state indicating failure │ │
│ └──────────┴────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
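In Amazon States Language, a trimmed-down version of the diagram above (without the Parallel branch) could look like this. A sketch only: the Lambda and SNS ARNs are placeholders, and the $.valid field is assumed to be set by the validation function.
{
  "Comment": "Simplified sketch of the workflow above (ARNs are placeholders)",
  "StartAt": "ValidateInput",
  "States": {
    "ValidateInput": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-input",
      "Retry": [{ "ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 2, "MaxAttempts": 3, "BackoffRate": 2.0 }],
      "Next": "IsValid"
    },
    "IsValid": {
      "Type": "Choice",
      "Choices": [{ "Variable": "$.valid", "BooleanEquals": true, "Next": "ProcessData" }],
      "Default": "SendErrorNotification"
    },
    "ProcessData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-data",
      "Next": "SaveResults"
    },
    "SendErrorNotification": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": { "TopicArn": "arn:aws:sns:us-east-1:123456789012:errors", "Message.$": "$.error" },
      "End": true
    },
    "SaveResults": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:save-results",
      "End": true
    }
  }
}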
Compute Options: When to Use What
┌─────────────────────────────────────────────────────────────────────────────┐
│ AWS COMPUTE DECISION TREE │
│ │
│ What are you running? │
│ │ │
│ ├──► Event-driven, short tasks (<15 min)? │
│ │ │ │
│ │ └──► Lambda (serverless, pay per invocation) │
│ │ │
│ ├──► Containerized application? │
│ │ │ │
│ │ ├──► Need Kubernetes? ─────► EKS │
│ │ │ │
│ │ └──► AWS-native? ──────────► ECS │
│ │ │ │
│ │ ├──► Don't want to manage servers? ──► Fargate │
│ │ └──► Need GPU/custom instances? ─────► EC2 │
│ │ │
│ └──► Traditional application, full OS control? │
│ │ │
│ └──► EC2 │
│ │ │
│ ├──► Predictable workload? ──► Reserved Instances │
│ ├──► Flexible timing? ───────► Spot Instances │
│ └──► Unknown/variable? ──────► On-Demand │
│ │
│ COST COMPARISON (approximate, varies by region): │
│ ┌──────────────────┬─────────────────────┬────────────────────────────┐ │
│ │ Service │ Pricing Model │ Best For │ │
│ ├──────────────────┼─────────────────────┼────────────────────────────┤ │
│ │ Lambda │ $0.20/1M requests │ Infrequent, event-driven │ │
│ │ │ + compute time │ tasks │ │
│ │ Fargate │ ~$0.04/vCPU-hour │ Containers without EC2 │ │
│ │ EC2 On-Demand │ ~$0.10/hour (t3.md) │ Unknown workloads │ │
│ │ EC2 Reserved │ ~40% cheaper │ Predictable, 1-3 year │ │
│ │ EC2 Spot │ ~70% cheaper │ Flexible, interruptible │ │
│ └──────────────────┴─────────────────────┴────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Core Concept Analysis
AWS services break down into these fundamental building blocks:
| Domain | Core Concepts to Internalize |
|---|---|
| Networking (VPC) | CIDR blocks, public/private subnets, route tables, NAT gateways, Internet gateways, security groups vs NACLs, VPC peering, Transit Gateway |
| Compute (EC2) | Instance types, AMIs, user data, auto-scaling groups, launch templates, Elastic IPs, placement groups |
| Containers (ECS/EKS) | Task definitions, services, clusters, Fargate vs EC2 launch types, service discovery, Kubernetes control plane |
| Serverless (Lambda) | Event sources, execution context, cold starts, layers, concurrency, IAM execution roles |
| Orchestration (Step Functions) | State machines, ASL (Amazon States Language), error handling, retries, parallel/map states |
| Storage (S3) | Buckets, objects, policies, versioning, lifecycle rules, storage classes, presigned URLs |
| Infrastructure as Code | CloudFormation, Terraform, resource dependencies, state management, drift detection |
Concept Summary Table
Before diving into projects, internalize these AWS concept clusters. Each project will force you to understand multiple concepts simultaneously:
| Concept Cluster | What You Need to Internalize |
|---|---|
| VPC & Networking | CIDR blocks and subnet math (how /16, /24 work), public vs private subnets (routing differences), route tables (0.0.0.0/0 means “default route”), Internet Gateway (public subnet requirement), NAT Gateway (private subnet internet access), security groups (stateful, instance-level), NACLs (stateless, subnet-level), VPC peering (connecting VPCs), Transit Gateway (hub-and-spoke networking) |
| Compute (EC2) | Instance types (compute vs memory vs storage optimized), AMIs (golden images), user data (bootstrap scripts on launch), instance profiles (IAM roles for EC2), auto-scaling groups (elasticity), launch templates (instance configuration), Elastic IPs (static public IPs), placement groups (low latency) |
| Serverless (Lambda) | Execution model (stateless, ephemeral), cold starts (initialization penalty), warm starts (reusing execution context), event sources (what triggers Lambda), layers (shared code/dependencies), concurrency (parallel executions), execution roles (IAM permissions), timeout limits (15 min max), memory allocation (128MB-10GB), /tmp storage (512 MB default, configurable up to 10 GB, ephemeral) |
| Containers (ECS/EKS) | Task definitions (container configuration), services (long-running tasks), clusters (logical grouping), Fargate vs EC2 launch types (serverless vs managed instances), awsvpc networking mode (each task gets ENI), service discovery (DNS-based), ALB target groups (container load balancing), EKS control plane (managed Kubernetes), ECR (container registry) |
| Orchestration (Step Functions) | State machines (workflow as code), ASL (Amazon States Language - JSON), states (Task, Choice, Parallel, Map, Wait, Pass, Fail, Succeed), error handling (Retry, Catch), parallel execution, map state (dynamic parallelism), Express vs Standard workflows |
| Storage (S3) | Buckets (global namespace), objects (key-value), S3 API operations (PUT, GET, DELETE, LIST), versioning (immutable history), lifecycle rules (transition/expiration), storage classes (Standard, IA, Glacier), presigned URLs (temporary access), CORS (cross-origin access), bucket policies vs IAM policies, S3 Select (query in place) |
| Security (IAM) | Policies (JSON documents), principals (who), actions (what), resources (where), conditions (when), identity-based policies (attached to users/roles), resource-based policies (attached to resources like S3 buckets), roles (temporary credentials), instance profiles (EC2 role wrapper), service-linked roles, policy evaluation logic (explicit deny wins) |
| Infrastructure as Code | Terraform state (source of truth), state locking (prevent concurrent modifications), CloudFormation stacks (grouped resources), drift detection (config vs reality), resource dependencies (implicit vs explicit), modules/nested stacks (reusability), workspaces/stack sets (multi-environment), import (existing resources into IaC) |
| Observability | CloudWatch Logs (log aggregation), log groups and streams, CloudWatch Metrics (time-series data), custom metrics, CloudWatch Alarms (threshold-based alerts), CloudWatch Dashboards (visualization), X-Ray tracing (distributed request tracking), segments and subsegments, service maps, VPC Flow Logs (network traffic), CloudTrail (API audit logs) |
Why this matters: AWS services don’t exist in isolation. When you build a VPC, you’re also configuring security groups, IAM roles, and CloudWatch logs. When you deploy Lambda, you need to understand IAM execution roles, VPC networking (if accessing RDS), and CloudWatch for debugging. These concepts interconnect constantly.
Deep Dive Reading by Concept
Map your project work to specific reading. Don’t read these books cover-to-cover upfront—use them as references when you hit specific challenges in your projects.
VPC & Networking Fundamentals
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| CIDR and IP Addressing | “The Linux Programming Interface” - Michael Kerrisk | Ch. 59 (Sockets: Internet Domains) | Understand IP addresses, subnetting, CIDR notation at the networking fundamentals level |
| VPC Architecture Patterns | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 3: Networking on AWS | Best comprehensive coverage of VPC design, multi-AZ patterns, and network segmentation |
| Security Groups vs NACLs | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 4: Security on AWS | Stateful vs stateless firewall rules, when to use each |
| NAT Gateway Design | AWS Architecture Blog: VPC Design Evolution | Full Article | Real-world patterns for scaling NAT, costs, and HA |
| Routing Deep Dive | AWS VPC User Guide | Route Tables Section | Official documentation on route table priority, longest prefix matching |
Compute (EC2) Deep Dive
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| EC2 Instance Types | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 6: Compute Services | When to choose compute-optimized vs memory-optimized vs storage-optimized |
| Auto Scaling Architecture | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 6: Compute Services | Scaling policies, health checks, target tracking vs step scaling |
| AMI Best Practices | AWS EC2 User Guide | AMIs Section | Golden images, versioning, cross-region copying |
| Instance Metadata Service | AWS EC2 User Guide | Instance Metadata Section | How user data works, IMDSv2, security implications |
Serverless (Lambda & Step Functions)
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| Lambda Execution Model | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 8: Serverless Architecture | Cold starts, execution context reuse, concurrency models |
| Event-Driven Architecture | “Designing Data-Intensive Applications” - Martin Kleppmann | Ch. 11: Stream Processing | Foundational understanding of event-driven systems, exactly-once processing |
| Step Functions State Machines | AWS Step Functions Developer Guide | ASL Specification | Learn the state machine language, error handling patterns |
| Lambda Best Practices | AWS Lambda Developer Guide | Best Practices Section | Performance optimization, error handling, testing strategies |
Containers (ECS/EKS)
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| ECS Architecture | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 9: Container Services | Task definitions, Fargate vs EC2 launch type, service discovery |
| Kubernetes Fundamentals | Amazon EKS Best Practices Guide | Full Guide | Networking, security, observability for EKS |
| Container Networking | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 9: Container Services | awsvpc mode, VPC integration, load balancer integration |
| Service Mesh Concepts | AWS App Mesh Documentation | What is App Mesh | Service-to-service communication, retries, circuit breakers |
Storage (S3) Deep Dive
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| S3 Data Model | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 5: Storage Services | Object storage concepts, consistency model, versioning |
| S3 Performance | S3 Performance Guidelines | Full Document | Request rate optimization, multipart upload, transfer acceleration |
| S3 Security | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 5: Storage Services | Bucket policies, ACLs, presigned URLs, encryption options |
| Lifecycle Management | S3 Lifecycle Configuration | Lifecycle Section | Storage class transitions, expiration policies, cost optimization |
Security & IAM
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| IAM Policy Fundamentals | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 4: Security on AWS | Policy structure, evaluation logic, least privilege |
| IAM Roles Deep Dive | AWS IAM User Guide | Roles Section | Trust policies, assume role, instance profiles, cross-account access |
| Security Best Practices | AWS Security Best Practices | Full Document | MFA, key rotation, policy conditions, service control policies |
| Secrets Management | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 4: Security on AWS | Secrets Manager vs Parameter Store, rotation strategies |
Infrastructure as Code
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| Terraform Fundamentals | HashiCorp Terraform Tutorials | Get Started on AWS | State management, resource dependencies, modules |
| CloudFormation Deep Dive | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 13: Infrastructure as Code | Stack operations, drift detection, change sets |
| Terraform State Management | Terraform State Documentation | State Section | Remote state, locking, workspaces, state migration |
| IaC Best Practices | Terraform Best Practices | Full Guide | Module structure, naming conventions, security scanning |
Observability & Monitoring
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| CloudWatch Fundamentals | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 12: Monitoring and Logging | Logs, metrics, alarms, dashboards |
| Distributed Tracing | “Designing Data-Intensive Applications” - Martin Kleppmann | Ch. 1: Reliable, Scalable, Maintainable | Understanding observability in distributed systems |
| X-Ray Deep Dive | AWS X-Ray Developer Guide | Full Guide | Trace segments, service maps, sampling rules |
| Log Analysis Patterns | CloudWatch Logs Insights Tutorial | Insights Query Syntax | Query language, aggregation, performance debugging |
Multi-Service Architecture
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| Well-Architected Framework | AWS Well-Architected Framework | All 6 Pillars | Operational Excellence, Security, Reliability, Performance, Cost, Sustainability |
| Distributed Systems Patterns | “Designing Data-Intensive Applications” - Martin Kleppmann | Ch. 8-12 | Replication, partitioning, consistency, consensus, batch vs stream processing |
| SaaS Architecture | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 10-11: Advanced Architectures | Multi-tenancy, isolation models, data partitioning strategies |
| Event-Driven Patterns | AWS Serverless Patterns Collection | Browse Patterns | Real-world architectures, EventBridge patterns, async workflows |
How to use this table: When you hit a specific challenge in a project (e.g., “My Lambda can’t access RDS in my VPC”), consult the relevant sections rather than reading books sequentially. Learn just-in-time based on what you’re building.
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
Before starting these projects, you need:
- Programming Fundamentals
- Comfortable with at least one language (Python, Node.js, or Go recommended)
- Understanding of HTTP, REST APIs, JSON
- Basic command-line proficiency (bash/zsh)
- Networking Basics
- What IP addresses are (public vs private)
- Basic understanding of DNS
- TCP/IP fundamentals (ports, protocols)
- CIDR notation (e.g., what 10.0.0.0/16 means)
- Linux/Unix Basics
- SSH into servers
- File permissions and ownership
- Environment variables
- Package managers (apt, yum)
- Version Control
- Git fundamentals (commit, push, pull, branching)
- GitHub/GitLab experience
Helpful But Not Required
These topics will be learned through the projects:
- Infrastructure as Code (Terraform/CloudFormation)
- Container technologies (Docker, Kubernetes)
- CI/CD pipelines
- AWS-specific services (you’ll learn by building!)
Self-Assessment Questions
Check if you’re ready:
- Can you explain what a subnet is and why it exists?
- Have you deployed a web application (even locally)?
- Do you understand what an API is and how to call one?
- Can you read and understand JSON?
- Have you used the command line to navigate directories and run scripts?
- Do you know the difference between public and private IP addresses?
- Have you worked with environment variables?
- Can you SSH into a remote server?
If you answered “yes” to at least 6 of these, you’re ready to start. If not, spend 1-2 weeks on basic networking and Linux fundamentals first.
Development Environment Setup
Required Tools:
# AWS CLI (for interacting with AWS from terminal)
aws --version # Should be v2.x or higher
# Terraform (for infrastructure as code)
terraform --version # Should be v1.5+
# Git (version control)
git --version
# Code editor (VS Code recommended with AWS/Terraform extensions)
AWS Account Setup:
- Create an AWS Free Tier account (12 months free for many services)
- CRITICAL: Set up billing alerts immediately (prevent surprise charges)
- CloudWatch alarm when estimated charges exceed $10 (a CLI sketch follows this list)
- Budget alert for monthly spending
- Create an IAM user with admin access (don’t use root account)
- Configure AWS CLI:
aws configure
- Enable MFA on both root and IAM user accounts
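One way to wire up the billing alarm mentioned above is the CLI call below. A sketch, not the only way: billing metrics live in us-east-1, they must first be enabled under Billing preferences, and the SNS topic ARN is a placeholder.
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name monthly-bill-over-10-usd \
  --namespace AWS/Billing \
  --metric-name EstimatedCharges \
  --dimensions Name=Currency,Value=USD \
  --statistic Maximum \
  --period 21600 \
  --evaluation-periods 1 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:billing-alerts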
Cost Expectations:
- Most projects fit within Free Tier limits if you’re careful
- Expect $5-20/month if you leave resources running
- Golden rule: Destroy resources when not in use (terraform destroy)
- Projects 3-4 may cost $20-50/month if left running
Time Investment
Realistic time estimates:
| Project | Time to Complete | Why This Long |
|---|---|---|
| Project 1: VPC | 1-2 weeks (15-25 hours) | Learning Terraform, understanding subnet math, debugging routing issues |
| Project 2: Serverless Pipeline | 1-2 weeks (15-20 hours) | Event-driven architecture is a paradigm shift, Step Functions syntax learning curve |
| Project 3: Auto-Scaling Web App | 2-3 weeks (25-35 hours) | Most complex, integrates 6+ services, debugging distributed systems |
| Project 4: Containers (ECS/EKS) | 2-3 weeks (25-40 hours) | Kubernetes has steep learning curve, EKS networking is intricate |
| Project 5: Full SaaS Platform | 3-4 weeks (40-60 hours) | Combines all previous projects, adds authentication, multi-tenancy |
Total time for all 5 projects: 3-4 months working 10 hours/week
These are learning estimates, not “I already know this” estimates. If you’re building while understanding from first principles, it takes time.
Important Reality Check
What makes AWS hard to learn:
- Interconnectedness: You can’t fully understand Lambda without understanding IAM roles, VPC networking, and CloudWatch logs
- Hidden complexity: Many services have 50+ configuration options, only 5 of which you actually need
- Cost anxiety: Fear of surprise bills slows down experimentation
- Documentation overload: AWS docs are comprehensive but not tutorial-friendly
- Rapid change: Services get updated, new features added constantly
Success strategies:
- Build one project at a time, fully complete it before moving on
- Use billing alarms religiously
- Join AWS communities (r/aws, AWS Discord servers)
- Use terraform destroy as a daily habit
- Don’t try to memorize—focus on understanding patterns
Quick Start: First 48 Hours for Overwhelmed Learners
Feeling overwhelmed by the 200+ services and 3000+ lines of documentation? Here’s your focused start.
Day 1: Get Your Hands Dirty (4 hours)
Goal: Deploy something to AWS and see it work.
- Hour 1: Set up your AWS account
- Create Free Tier account
- Set up billing alert for $10
- Create IAM user with admin access
- Configure the AWS CLI with your access keys (aws configure)
- Hour 2: Manual console exploration
- Launch an EC2 instance (t2.micro, free tier)
- SSH into it
- Install nginx:
sudo apt install nginx - See it running on public IP
- Then destroy it (terminate instance)
- Hour 3: Your first S3 bucket
# Create bucket (use unique name)
aws s3 mb s3://my-learning-bucket-123456
# Upload file
echo "Hello AWS" > test.txt
aws s3 cp test.txt s3://my-learning-bucket-123456/
# List objects
aws s3 ls s3://my-learning-bucket-123456/
# Clean up
aws s3 rb s3://my-learning-bucket-123456/ --force
- Hour 4: Your first Lambda function
- Go to AWS Console → Lambda
- Create function (Python 3.12, default settings)
- Replace code with:
def lambda_handler(event, context):
    return {'statusCode': 200, 'body': 'My first Lambda!'}
- Click "Test" and see it work
- Check CloudWatch logs (see your function’s output)
- Delete the function
What you just learned: Basic compute (EC2), object storage (S3), serverless (Lambda), and observability (CloudWatch). These are the foundations.
Day 2: Infrastructure as Code (4 hours)
Goal: Build the same S3 bucket, but with Terraform.
- Install Terraform (if not done)
brew install terraform   # Mac
# or download from terraform.io
- Create your first Terraform file (main.tf):
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
provider "aws" {
  region = "us-east-1"
}
resource "aws_s3_bucket" "learning" {
  bucket = "my-terraform-bucket-123456"  # Use unique name
}
- Run Terraform:
terraform init     # Download AWS provider
terraform plan     # See what will be created
terraform apply    # Create the bucket (type "yes")
- Verify it worked:
aws s3 ls | grep terraform
- Destroy it:
terraform destroy  # Type "yes"
What you just learned: Infrastructure as Code means your infrastructure configuration is versioned, reproducible, and reviewable. You’ll never click through the console again.
After 48 Hours: Choose Your Path
Now you’re ready to start Project 1: VPC. This is where real learning begins.
Recommended Learning Paths
Choose the path that matches your background and goals.
Path 1: “I’m a backend developer who wants to deploy my apps to AWS”
Focus: Learn just enough AWS to deploy and scale applications confidently.
Projects order:
- Start with Project 1 (VPC) → Understand networking basics
- Jump to Project 3 (Auto-Scaling Web App) → Deploy a real application with RDS, load balancing
- Optional: Project 2 (Serverless Pipeline) → If you work with data/event-driven systems
- Skip Project 4 (unless you’re using containers)
- Optional: Project 5 → If building SaaS products
Time: 6-8 weeks
Why this path: You’ll quickly get to deploying real applications (Project 3) after understanding networking fundamentals. Most backend developers need EC2/RDS/ALB more urgently than Kubernetes.
Path 2: “I’m a DevOps engineer who needs to architect AWS infrastructure”
Focus: Understand every layer, from VPC to Kubernetes.
Projects order:
- Project 1 (VPC) → Foundation
- Project 2 (Serverless Pipeline) → Event-driven architecture
- Project 4 (Containers - ECS then EKS) → Modern compute platforms
- Project 3 (Auto-Scaling Web App) → Traditional web architecture
- Project 5 (Full SaaS) → Pulling it all together
Time: 12-16 weeks
Why this path: You need deep understanding of all compute models (EC2, Lambda, containers) and how they fit into different architectural patterns. You’ll be making these trade-off decisions daily.
Path 3: “I’m coming from on-prem infrastructure and need to understand cloud”
Focus: Map your existing knowledge to AWS equivalents, understand the cloud-native differences.
Projects order:
- Project 1 (VPC) → This is your “data center” in AWS
- Project 3 (Auto-Scaling Web App) → Traditional architecture with cloud benefits (auto-scaling, managed databases)
- Project 2 (Serverless Pipeline) → The paradigm shift to event-driven, serverless
- Project 4 (ECS, not EKS) → Containers without Kubernetes complexity
- Optional: Project 5 → Multi-tenancy patterns
Time: 10-12 weeks
Why this path: Starts with familiar concepts (networking, VMs, load balancers) then progressively introduces cloud-native patterns. ECS is easier than EKS for container beginners.
Path 4: “I want to build serverless/event-driven systems”
Focus: Lambda, Step Functions, EventBridge, S3, DynamoDB.
Projects order:
- Project 1 (VPC) → Even serverless needs networking knowledge
- Project 2 (Serverless Pipeline) → Your core competency
- Project 3 (Auto-Scaling Web App) → Hybrid: Lambda + traditional components
- Skip Project 4 (unless curious about containers)
- Project 5 (Full SaaS) → Build it serverless-first
Time: 8-10 weeks
Why this path: Optimizes for Lambda/Step Functions mastery. You’ll learn RDS in Project 3, but can substitute DynamoDB if preferred.
Path 5: “I need AWS certification (Solutions Architect Associate)”
Focus: Breadth over depth, understand all core services.
Projects order:
- Project 1 (VPC) → 25% of exam questions
- Project 3 (Auto-Scaling Web App) → Most services in one project
- Project 2 (Serverless Pipeline) → Lambda/Step Functions coverage
- Project 4 (ECS path, skip EKS) → Container basics, EKS rarely on Associate exam
- Read AWS Well-Architected Framework → Critical for exam
Time: 8 weeks + 2 weeks exam prep
Why this path: Covers VPC (critical), EC2/RDS/ALB (core), Lambda (serverless), S3 (storage), IAM (security), CloudWatch (monitoring). These are 80% of the exam.
Project 1: Production-Ready VPC from Scratch (with Terraform)
| Attribute | Value |
|---|---|
| Language | HCL (Terraform) |
| Difficulty | Intermediate |
| Time | 1–2 weeks |
| Knowledge Area | Cloud Networking / Infrastructure as Code |
| Coolness | ★☆☆☆☆ Pure Corporate Snoozefest |
| Portfolio Value | Resume Gold |
What you’ll build: A multi-AZ VPC with public/private subnets, NAT gateways, bastion host, and a deployed web application—all defined in Terraform that you can destroy and recreate at will.
Why it matters: You cannot understand VPCs by clicking through the console. You need to see what happens when a private subnet has no route to a NAT gateway, when a security group blocks outbound traffic, or when your CIDR blocks overlap with a peered VPC. Building with IaC forces you to explicitly declare every component and understand their relationships.
Core challenges:
- CIDR planning and non-overlapping IP ranges that allow future growth and peering
- Routing logic: understanding why private subnets route to NAT vs public subnets route to IGW
- Security layers: configuring security groups (stateful) vs NACLs (stateless) and knowing when to use each
- High availability: deploying across multiple AZs with redundant NAT gateways
- Bastion access: SSH tunneling through a jump host to reach private instances
Key concepts to master:
- VPC networking and subnetting (CIDR blocks)
- Route tables and internet connectivity patterns
- Security group vs NACL design
- Multi-AZ architecture for high availability
- Infrastructure as Code with Terraform
Prerequisites: Basic AWS console navigation, understanding of IP addresses, familiarity with any IaC tool.
Deliverable: A fully functional multi-AZ VPC with public/private subnets, NAT gateways, bastion host, and deployed web application—all managed by Terraform code that can be destroyed and recreated deterministically.
Implementation hints:
- Start with a single public subnet and IGW before adding complexity
- Use Terraform modules for reusable VPC components
- Test network connectivity at each stage (ping, curl, traceroute)
- Enable VPC Flow Logs from the start for debugging
Milestones:
- Deploy VPC with public subnet only, web server reachable from internet → you understand IGW + route tables
- Add private subnet with NAT, move app server there, bastion-only access → you understand public vs private architecture
- Multi-AZ deployment with ALB → you understand high availability patterns
- Add VPC Flow Logs and analyze traffic → you understand network observability
Real World Outcome
This is what your working VPC infrastructure looks like when you complete this project:
# 1. Deploy the infrastructure with Terraform
$ cd terraform/vpc-project
$ terraform init
Initializing provider plugins...
- Downloading hashicorp/aws v5.31.0...
Terraform has been successfully initialized!
$ terraform apply
Plan: 23 to add, 0 to change, 0 to destroy.
Do you want to perform these actions?
Enter a value: yes
aws_vpc.main: Creating...
aws_vpc.main: Creation complete after 2s [id=vpc-0abc123def456789]
aws_internet_gateway.main: Creating...
aws_subnet.public_a: Creating...
aws_subnet.public_b: Creating...
aws_subnet.private_a: Creating...
aws_subnet.private_b: Creating...
...
Apply complete! Resources: 23 added, 0 changed, 0 destroyed.
Outputs:
vpc_id = "vpc-0abc123def456789"
public_subnet_ids = ["subnet-0aaa111", "subnet-0bbb222"]
private_subnet_ids = ["subnet-0ccc333", "subnet-0ddd444"]
bastion_public_ip = "54.123.45.67"
web_server_private_ip = "10.0.10.50"
nat_gateway_ip = "52.87.123.45"
# 2. SSH into the bastion host
$ ssh -i ~/.ssh/aws-bastion.pem ec2-user@54.123.45.67
The authenticity of host '54.123.45.67' can't be established.
Are you sure you want to continue connecting (yes/no)? yes
__| __|_ )
_| ( / Amazon Linux 2
___|\___|___|
[ec2-user@bastion ~]$ hostname
ip-10-0-1-25.ec2.internal
# 3. From bastion, SSH into private web server (SSH agent forwarding)
[ec2-user@bastion ~]$ ssh 10.0.10.50
[ec2-user@webserver ~]$ hostname
ip-10-0-10-50.ec2.internal
# 4. Verify the web server can reach internet via NAT
[ec2-user@webserver ~]$ curl -s https://ifconfig.me
52.87.123.45 # This is the NAT Gateway's IP, NOT the web server's private IP!
# 5. Verify the web server cannot be reached directly from internet
$ curl http://10.0.10.50 # From your local machine
curl: (7) Failed to connect to 10.0.10.50 port 80: No route to host
# CORRECT! Private subnet is not reachable from internet
# 6. Check VPC Flow Logs in CloudWatch
$ aws logs filter-log-events \
--log-group-name /aws/vpc/flowlogs \
--filter-pattern "[version, account, eni, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log_status]" \
--limit 5 \
--profile douglascorrea_io --no-cli-pager
{
"events": [
{
"message": "2 123456789012 eni-0abc123 10.0.1.25 54.123.45.67 22 52341 6 25 4500 1703264400 1703264460 ACCEPT OK",
"ingestionTime": 1703264465000
},
{
"message": "2 123456789012 eni-0def456 10.0.10.50 52.87.123.45 443 32145 6 10 1500 1703264400 1703264460 ACCEPT OK",
"ingestionTime": 1703264465000
}
]
}
# Flow logs show: SSH to bastion (ACCEPT), HTTPS from private instance via NAT (ACCEPT)
# 7. Describe your VPC and subnets
$ aws ec2 describe-vpcs --vpc-ids vpc-0abc123def456789 \
--query 'Vpcs[0].{VpcId:VpcId,CIDR:CidrBlock,State:State}' \
--output table --no-cli-pager
-----------------------------------------
| DescribeVpcs |
+-------+------------------+------------+
| CIDR | State | VpcId |
+-------+------------------+------------+
|10.0.0.0/16| available |vpc-0abc123 |
+-------+------------------+------------+
$ aws ec2 describe-subnets --filters "Name=vpc-id,Values=vpc-0abc123def456789" \
--query 'Subnets[*].{SubnetId:SubnetId,AZ:AvailabilityZone,CIDR:CidrBlock,Public:MapPublicIpOnLaunch}' \
--output table --no-cli-pager
------------------------------------------------------------
| DescribeSubnets |
+--------------+---------------+------------+--------------+
| AZ | CIDR | Public | SubnetId |
+--------------+---------------+------------+--------------+
|us-east-1a |10.0.1.0/24 | True |subnet-0aaa111|
|us-east-1b |10.0.2.0/24 | True |subnet-0bbb222|
|us-east-1a |10.0.10.0/24 | False |subnet-0ccc333|
|us-east-1b |10.0.20.0/24 | False |subnet-0ddd444|
+--------------+---------------+------------+--------------+
# 8. View route tables to understand routing
$ aws ec2 describe-route-tables --filters "Name=vpc-id,Values=vpc-0abc123def456789" \
--query 'RouteTables[*].{RouteTableId:RouteTableId,Routes:Routes[*].{Destination:DestinationCidrBlock,Target:GatewayId||NatGatewayId}}' \
--no-cli-pager
# Public route table: 0.0.0.0/0 → igw-xxxxx (Internet Gateway)
# Private route table: 0.0.0.0/0 → nat-xxxxx (NAT Gateway)
# 9. Clean up when done (saves money!)
$ terraform destroy
Plan: 0 to add, 0 to change, 23 to destroy.
Do you want to perform these actions?
Enter a value: yes
aws_instance.web_server: Destroying...
aws_nat_gateway.main: Destroying...
...
Destroy complete! Resources: 23 destroyed.
AWS Console View:
- VPC Dashboard showing your VPC with proper CIDR block
- Subnet map showing public subnets with Internet Gateway icon, private subnets with NAT icon
- Route tables tab showing explicit routes for each subnet type
- Security Groups showing inbound/outbound rules
- VPC Flow Logs showing real network traffic in CloudWatch
The Core Question You’re Answering
“What IS a VPC, and how does network traffic actually flow between the internet, public subnets, private subnets, and databases?”
Before you write any Terraform code, sit with this question. Most developers have a vague sense that “VPCs are networks” but can’t explain:
- Why a public subnet can receive traffic from the internet but a private subnet cannot
- What the difference between a route table entry and a security group rule is
- Why you need a NAT Gateway (and why it costs money) for private instances to download packages
- How data flows when a user hits your website: ALB → EC2 → RDS and back
This project forces you to confront these questions because misconfiguration means your infrastructure doesn’t work.
Concepts You Must Understand First
Stop and research these before coding:
- CIDR Blocks and Subnetting
- What does 10.0.0.0/16 actually mean in binary?
- How many IP addresses are in a /24 subnet?
- Why does AWS reserve 5 IPs per subnet?
- What happens if two VPCs have overlapping CIDR blocks and you try to peer them?
- Book Reference: “Computer Systems: A Programmer’s Perspective” Ch. 11 (Network Programming) - Bryant & O’Hallaron
- Routing and the Default Gateway (0.0.0.0/0)
- What does 0.0.0.0/0 mean in a route table?
- When a packet leaves an EC2 instance, how does the VPC know where to send it?
- What’s the difference between
localroute andigw-xxxxxroute? - Why does longest prefix match matter?
- Book Reference: “TCP/IP Illustrated, Volume 1” Ch. 9 (IP Routing) - W. Richard Stevens
- Internet Gateway vs NAT Gateway
- What does “stateful NAT” mean?
- Why can’t you just put an Internet Gateway on a private subnet?
- How does a NAT Gateway translate private IPs to public IPs?
- Why does the NAT Gateway need to be in a public subnet?
- Book Reference: “AWS for Solutions Architects” Ch. 3 - Saurabh Shrivastava
- Security Groups (Stateful) vs NACLs (Stateless)
- What does “stateful” mean in terms of network connections?
- If you allow inbound port 80, do you need an outbound rule for the response?
- Why do NACLs need explicit rules for ephemeral ports?
- When would you use a NACL instead of just security groups?
- Book Reference: “AWS for Solutions Architects” Ch. 4 (Security on AWS) - Saurabh Shrivastava
- High Availability and Availability Zones
- What is an Availability Zone physically?
- Why do you need subnets in multiple AZs?
- What happens if us-east-1a fails but you only deployed to that AZ?
- How does an ALB distribute traffic across AZs?
- Book Reference: “AWS for Solutions Architects” Ch. 2 (AWS Global Infrastructure)
Questions to Guide Your Design
Before implementing, think through these:
- CIDR Planning
- How many IP addresses will you need now? In 5 years?
- What if you need to add a third AZ later?
- What if you need to peer with another VPC that uses 10.0.0.0/16?
- How will you segment: web tier, app tier, database tier?
- Public vs Private Decisions
- What resources MUST be in public subnets? (Hint: very few)
- What resources should NEVER be in public subnets? (Hint: databases!)
- How will private resources get software updates from the internet?
- Bastion Host Design
- Why use a bastion instead of putting EC2 in public subnet with SSH?
- How do you secure the bastion itself?
- Should you use Session Manager instead of SSH? (Yes, probably)
- Security Group Strategy
- Can you reference one security group from another?
- How do you allow app servers to talk to databases without opening the database to everything?
- What’s the principle of least privilege in security group terms?
- Cost Considerations
- How much does a NAT Gateway cost per hour? Per GB transferred?
- Do you need one NAT Gateway per AZ or can you share?
- What’s the trade-off between cost and availability?
Thinking Exercise
Before coding, draw this diagram on paper:
Your Task: Fill in the missing pieces
Internet
│
▼
┌───────────────────────────────────────────────────────────────┐
│ VPC: 10.0.0.0/16 │
│ │
│ ┌─────────────────────────┐ ┌─────────────────────────────┐ │
│ │ Public Subnet A │ │ Public Subnet B │ │
│ │ CIDR: ____________ │ │ CIDR: ____________ │ │
│ │ │ │ │ │
│ │ What goes here? │ │ What goes here? │ │
│ │ ___________________ │ │ ___________________ │ │
│ └─────────────────────────┘ └─────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────┐ ┌─────────────────────────────┐ │
│ │ Private Subnet A │ │ Private Subnet B │ │
│ │ CIDR: ____________ │ │ CIDR: ____________ │ │
│ │ │ │ │ │
│ │ What goes here? │ │ What goes here? │ │
│ │ ___________________ │ │ ___________________ │ │
│ └─────────────────────────┘ └─────────────────────────────┘ │
│ │
│ Route Table (Public): │
│ ___________________ → ___________________ │
│ ___________________ → ___________________ │
│ │
│ Route Table (Private): │
│ ___________________ → ___________________ │
│ ___________________ → ___________________ │
└────────────────────────────────────────────────────────────────┘
Questions while drawing:
- Why does each public subnet need a different CIDR?
- What AWS resource creates the connection to the internet?
- How does traffic from the private subnet reach the internet?
- What happens to the response traffic?
The Interview Questions They’ll Ask
Prepare to answer these:
- “Walk me through how a request from a user’s browser reaches your web server in a private subnet.”
- Expected: User → Internet → ALB (public subnet) → EC2 (private subnet) → Response reverses
- “Why is your database in a private subnet? How does it get software updates?”
- Expected: Security - no direct internet access. Updates via NAT Gateway or VPC endpoints.
- “Your app in us-east-1a can’t reach the database in us-east-1b. What do you check?”
- Expected: Security groups (allow from app SG?), NACLs (if modified), Route tables, VPC peering (if different VPCs)
- “What’s the difference between an Internet Gateway and a NAT Gateway?”
- Expected: IGW provides two-way communication (public IPs). NAT provides outbound-only for private resources.
- “Your NAT Gateway bill is $500/month. How do you reduce it?”
- Expected: Check if you have one per AZ (maybe share), use VPC endpoints for AWS services (S3, DynamoDB), check for excessive data transfer (a gateway endpoint sketch follows this list)
- “Explain security groups vs NACLs. When would you use each?”
- Expected: SG = stateful, instance-level, allow rules only. NACL = stateless, subnet-level, allow/deny rules. Use SG for app logic, NACL for subnet-wide rules or explicit denies.
- “What CIDR block would you use for a VPC? Why?”
- Expected: RFC 1918 private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). Consider future growth and peering requirements, and plan non-overlapping blocks up front (see the sketch below).
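To make the CIDR math concrete, here is a minimal Python sketch using only the standard-library `ipaddress` module. It carves a 10.0.0.0/16 VPC into /24 subnets (the same arithmetic as Terraform's `cidrsubnet`) and checks that two candidate VPC ranges don't overlap; the specific ranges are illustrative, not prescriptive.

```python
import ipaddress

# Candidate VPC CIDRs (illustrative RFC 1918 ranges)
vpc_a = ipaddress.ip_network("10.0.0.0/16")
vpc_b = ipaddress.ip_network("10.1.0.0/16")

# Equivalent of Terraform's cidrsubnet(vpc, 8, index): /16 + 8 bits = /24 subnets
subnets = list(vpc_a.subnets(new_prefix=24))
print(subnets[0], subnets[1])    # 10.0.0.0/24, 10.0.1.0/24  (e.g., public A/B)
print(subnets[10], subnets[11])  # 10.0.10.0/24, 10.0.11.0/24 (e.g., private A/B)

# Overlap check before setting up VPC peering
print(vpc_a.overlaps(vpc_b))                                   # False -> peering-safe
print(vpc_a.overlaps(ipaddress.ip_network("10.0.128.0/17")))   # True  -> would collide
```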
Hints in Layers
Hint 1: Start with the VPC resource
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags = {
    Name = "production-vpc"
  }
}
Hint 2: Create subnets with data source for AZs
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true # This makes it "public"
  tags = {
    Name = "public-${count.index + 1}"
  }
}
Hint 3: Internet Gateway requires explicit route
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public" {
  count          = length(aws_subnet.public)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}
Hint 4: NAT Gateway needs Elastic IP first
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id # NAT must be in a PUBLIC subnet
  depends_on    = [aws_internet_gateway.main]
}
Hint 5: Security Group allowing SSH from bastion only
resource "aws_security_group" "web" {
  name   = "web-server-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port       = 22
    to_port         = 22
    protocol        = "tcp"
    security_groups = [aws_security_group.bastion.id] # Only from the bastion!
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
Books That Will Help
| Topic | Book | Specific Chapters | Why It Helps |
|---|---|---|---|
| VPC Fundamentals | “AWS for Solutions Architects” by Saurabh Shrivastava | Ch. 3: Networking on AWS | Best comprehensive coverage of VPC architecture patterns and design decisions |
| Security Groups & IAM | “AWS for Solutions Architects” by Saurabh Shrivastava | Ch. 4: Security on AWS | Security group vs NACL differences, IAM roles for EC2 |
| TCP/IP Fundamentals | “TCP/IP Illustrated, Volume 1” by W. Richard Stevens | Ch. 1-3, 9 (IP Routing) | Deep understanding of how packets flow and routing works |
| CIDR & IP Addressing | “Computer Networks” by Tanenbaum & Wetherall | Ch. 5: Network Layer | Mathematical foundation of IP addressing and subnetting |
| Terraform Basics | “Terraform: Up & Running” by Yevgeniy Brikman | Ch. 2-3: Terraform State | Managing infrastructure as code, state management |
| Network Security | “The Linux Programming Interface” by Michael Kerrisk | Ch. 59-61: Sockets | Understanding network connections at the OS level |
| AWS Well-Architected | AWS Well-Architected Framework (free) | Security Pillar | Official AWS best practices for secure VPC design |
Reading strategy:
- Start with “AWS for Solutions Architects” Ch. 3 (VPC overview)
- Read “TCP/IP Illustrated” Ch. 9 if routing concepts are unclear
- Refer to “Terraform: Up & Running” as you write your code
- Use AWS Well-Architected Security Pillar as a checklist
Common Pitfalls & Debugging
Problem 1: “My Terraform apply fails with ‘InvalidVPCID.NotFound’“
- Why: Terraform is trying to create a resource before its dependency (the VPC) exists
- Fix: Ensure resources explicitly reference the VPC: `vpc_id = aws_vpc.main.id` (not a string)
- Quick test: Run `terraform plan` and verify resource dependencies are correct
Problem 2: “I can SSH to the bastion but can’t reach the private EC2 instance”
- Why: Security group on private instance doesn’t allow inbound SSH from bastion’s security group
- Fix:
  resource "aws_security_group" "private_ec2" {
    ingress {
      from_port       = 22
      to_port         = 22
      protocol        = "tcp"
      security_groups = [aws_security_group.bastion.id] # Not CIDR!
    }
  }
- Quick test: `aws ec2 describe-security-groups --group-ids <private-sg-id> --query 'SecurityGroups[0].IpPermissions'`
Problem 3: “My private EC2 instance can’t access the internet (yum/apt fails)”
- Why: Route table for private subnet isn’t routing 0.0.0.0/0 to NAT Gateway, or NAT Gateway isn’t in a public subnet
- Fix:
- Verify NAT Gateway is in a public subnet
- Verify the private route table has: 0.0.0.0/0 → nat-xxxxx (a sketch of adding this route follows below)
- Verify the NAT Gateway’s Elastic IP exists
- Quick test:
  # SSH to the private instance via the bastion
  curl -v https://www.google.com   # Should work
  # Check the route table
  aws ec2 describe-route-tables --filters "Name=vpc-id,Values=<vpc-id>"
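If the 0.0.0.0/0 route is missing, declare it in Terraform or, for a quick manual fix, add it with boto3. A minimal sketch, assuming placeholder IDs (the `rtb-...` and `nat-...` values are hypothetical and must be replaced with your private route table and NAT Gateway):

```python
import boto3

ec2 = boto3.client("ec2")

# Send all non-VPC traffic from the private subnet through the NAT Gateway.
# IDs are placeholders; use your own private route table and NAT Gateway IDs.
ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId="nat-0123456789abcdef0",
)

# Confirm the route landed where you expect
routes = ec2.describe_route_tables(RouteTableIds=["rtb-0123456789abcdef0"])
for r in routes["RouteTables"][0]["Routes"]:
    print(r.get("DestinationCidrBlock"), "->", r.get("NatGatewayId") or r.get("GatewayId"))
```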
Problem 4: “terraform destroy hangs when destroying NAT Gateway”
- Why: ENIs (Elastic Network Interfaces) attached to NAT Gateway take time to detach
- Fix: Wait 3-5 minutes. If it’s still hanging, manually delete the NAT Gateway in the console, then re-run `terraform destroy`
- Prevention: Always run `terraform destroy` when done to avoid hourly charges
Problem 5: “My CIDR blocks overlap and VPC peering fails”
- Why: You used 10.0.0.0/16 for both VPCs
- Fix: Plan CIDR blocks in advance:
- VPC A: 10.0.0.0/16
- VPC B: 10.1.0.0/16
- VPC C: 10.2.0.0/16
- Quick test: Use the `cidrsubnet()` Terraform function to derive subnet ranges mathematically and avoid overlaps
Problem 6: “AWS bill is $50 after 1 day - I thought this was free tier!”
- Why: NAT Gateway costs $0.045/hour ($32/month) + data transfer fees. NOT covered by free tier.
- Fix:
  - Run `terraform destroy` when you’re not using the environment
  - Use VPC Endpoints for S3/DynamoDB (free, avoids NAT)
  - Consider one NAT Gateway instead of one per AZ for learning
- Cost check:
  aws ce get-cost-and-usage \
    --time-period Start=2024-12-01,End=2024-12-21 \
    --granularity DAILY \
    --metrics BlendedCost \
    --group-by Type=SERVICE
Problem 7: “terraform apply creates subnets in the same AZ”
- Why: Using `count.index` directly on a potentially unordered AZ list
- Fix:
  data "aws_availability_zones" "available" {
    state = "available"
  }

  # Ensure you're using different indices
  resource "aws_subnet" "public" {
    count             = 2
    availability_zone = data.aws_availability_zones.available.names[count.index]
    # ...
  }
- Quick test: `aws ec2 describe-subnets --filters "Name=vpc-id,Values=<vpc-id>" --query 'Subnets[*].AvailabilityZone'`
Problem 8: “VPC Flow Logs show all REJECT - but connections work fine”
- Why: You’re looking at ephemeral port traffic that NACLs might be blocking (if you modified default NACL)
- Fix: Default NACLs allow all traffic. If you customized NACLs, ensure ephemeral ports (1024-65535) are allowed for responses
- Debugging flow logs:
  # Filter for REJECTs only
  aws logs filter-log-events \
    --log-group-name /aws/vpc/flowlogs \
    --filter-pattern "REJECT" \
    --limit 20
Project 2: Serverless Data Pipeline (Lambda + Step Functions + S3)
| Attribute | Value |
|---|---|
| Language | Python (alt: TypeScript, Go, Java) |
| Difficulty | Intermediate |
| Time | 1–2 weeks |
| Knowledge Area | Serverless, Event-Driven Architecture |
| Coolness | ★★☆☆☆ Practical but Forgettable |
| Portfolio Value | Resume Gold |
What you’ll build: An automated data processing pipeline that triggers when files land in S3, orchestrates multiple Lambda functions through Step Functions, handles errors gracefully, and outputs processed results—all without a single server to manage.
Why it matters: Serverless is not “just deploy a function.” It’s about understanding event-driven architecture, dealing with cold starts, managing state across stateless functions, and designing for failure. Step Functions force you to think about workflow as a first-class concept.
Core challenges:
- Event-driven triggers: configuring S3 event notifications to invoke Lambda
- State machine design: modeling your pipeline as explicit states with transitions
- Error handling: implementing retries, catch blocks, and fallback logic in Step Functions
- Lambda limits: working within 15-minute timeout, /tmp storage, memory limits
- IAM execution roles: granting Lambda only the permissions it needs
Key concepts to master:
- Event-driven architecture and decoupled systems
- Step Functions state machine design
- Lambda execution model and cold start optimization
- S3 event notifications
- Error handling and retry strategies
Prerequisites: Basic Python/Node.js, understanding of JSON, Project 1 completed (VPC knowledge).
Deliverable: An automated data processing pipeline that triggers on S3 upload, orchestrates multiple Lambda functions through Step Functions with proper error handling, and outputs processed results to a destination bucket—all observable through CloudWatch logs and metrics.
Implementation hints:
- Start with a single Lambda function before adding Step Functions
- Use Step Functions visual designer to prototype your workflow
- Test error handling by intentionally failing Lambda functions
- Use S3 event notification filters to avoid infinite loops
Milestones:
- Single Lambda triggered by S3 upload → you understand event sources
- Chain 3 Lambdas via Step Functions → you understand state machines
- Add parallel processing and error handling → you understand resilience patterns
- Add SNS notifications and CloudWatch alarms → you understand observability in serverless
Real World Outcome
This is what your working pipeline looks like in action:
# Upload a CSV file to trigger the pipeline
$ aws s3 cp sales_data.csv s3://my-pipeline-bucket/input/ --profile douglascorrea_io --no-cli-pager
upload: ./sales_data.csv to s3://my-pipeline-bucket/input/sales_data.csv
# Check Step Functions execution status
$ aws stepfunctions list-executions \
--state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:DataProcessingPipeline \
--profile douglascorrea_io --no-cli-pager
{
"executions": [
{
"executionArn": "arn:aws:states:us-east-1:123456789012:execution:DataProcessingPipeline:execution-2024-12-20-15-30-45",
"name": "execution-2024-12-20-15-30-45",
"status": "SUCCEEDED",
"startDate": "2024-12-20T15:30:45.123Z",
"stopDate": "2024-12-20T15:32:10.456Z"
}
]
}
# View CloudWatch logs for the validation Lambda
$ aws logs tail /aws/lambda/validate-data --since 5m --profile douglascorrea_io --no-cli-pager
2024-12-20T15:30:46.234Z START RequestId: abc123-def456 Version: $LATEST
2024-12-20T15:30:46.567Z INFO Validating CSV structure for sales_data.csv
2024-12-20T15:30:46.789Z INFO Found 1247 records, all valid
2024-12-20T15:30:47.012Z END RequestId: abc123-def456
2024-12-20T15:30:47.045Z REPORT Duration: 811ms Billed Duration: 900ms Memory: 512MB Max Memory Used: 128MB
# View the processed output
$ aws s3 ls s3://my-pipeline-bucket/output/ --profile douglascorrea_io --no-cli-pager
2024-12-20 15:32:10 45632 processed_sales_data.json
2024-12-20 15:32:10 1248 processing_summary.json
Step Functions Visual Workflow (the AWS Console shows the execution in real time):
- ValidateData Lambda: SUCCESS (Duration: 811ms)
- TransformData Lambda: SUCCESS (Duration: 58.2s)
- Parallel Processing: Both branches SUCCESS
- CalculateStatistics: 2.3s
- GenerateReport: 1.8s
- WriteOutput Lambda: SUCCESS (Duration: 245ms)
- SNS Notification: Delivered
CloudWatch Logs Insights showing complete pipeline flow with timestamps, error-free execution, and performance metrics for each Lambda invocation.
The Core Question You’re Answering
“How do I build systems that react to events without managing servers, and how do I coordinate multiple functions reliably?”
More specifically:
- How do I design a system where dropping a file automatically triggers processing without a cron job?
- How do I chain multiple processing steps when each Lambda is stateless and limited to 15 minutes?
- How do I handle failures gracefully when any step might time out or throw an error?
- How do I pass data between Lambdas without coupling them tightly?
- How do I make my pipeline idempotent so reprocessing the same file produces the same result?
Concepts You Must Understand First
1. Event-Driven Architecture vs Request-Response
Request-Response (traditional):
Client → Server → Database → Server → Client
(synchronous, client waits for entire operation)
Event-Driven (serverless):
Event Producer → Event → Event Consumer
(asynchronous, producer fires and forgets)
Key: Your S3 upload doesn’t “call” your Lambda. It emits an event. This decoupling makes serverless scalable but harder to debug.
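To see what that event actually looks like from the consumer's side, here is a minimal handler sketch that pulls the bucket and key out of the standard S3 event notification payload (the print statements are just for illustration):

```python
import urllib.parse

def handler(event, context):
    # An S3 event notification delivers a list of records, one per object event
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded (spaces become '+', etc.)
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        print(f"Object created: s3://{bucket}/{key} ({size} bytes)")
    return {"recordsSeen": len(event["Records"])}
```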
2. Lambda Execution Model (Cold Starts, Execution Context Reuse)
Lambda lifecycle:
- INIT Phase (cold start): Download code, start environment, initialize runtime (100ms-3s)
- INVOKE Phase: Run your handler function
- SHUTDOWN Phase: Environment terminated after idle timeout
Context reuse:
# Runs ONCE per execution environment (cold start)
import boto3
s3_client = boto3.client('s3')
# Runs EVERY invocation (warm or cold)
def handler(event, context):
    s3_client.get_object(Bucket='my-bucket', Key='file.csv')
Why it matters: Cold starts add latency. Initialize connections outside the handler to reuse them.
3. State Machines and Workflow Orchestration
With Step Functions:
- Visual workflow in console
- Declarative retry/error handling
- State maintained between steps
- Execution history for debugging
Step Functions is your distributed transaction coordinator.
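To give "declarative retry/error handling" some shape, here is a minimal Amazon States Language definition expressed as a Python dict. The state names and Lambda ARNs are placeholders based on this project's workflow, not a finished pipeline; you would feed the resulting JSON to Terraform or `create_state_machine`.

```python
import json

state_machine = {
    "StartAt": "ValidateData",
    "States": {
        "ValidateData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-data",
            "Retry": [{
                # Retry transient Lambda failures with exponential backoff
                "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
                "IntervalSeconds": 2,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "Catch": [{
                # Anything else falls through to a failure state, keeping the error in $.error
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error",
                "Next": "SendAlert",
            }],
            "Next": "TransformData",
        },
        "TransformData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-data",
            "End": True,
        },
        "SendAlert": {"Type": "Fail", "Error": "ValidationFailed", "Cause": "See $.error"},
    },
}

print(json.dumps(state_machine, indent=2))
```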
4. Idempotency and Exactly-Once Processing
The problem: S3 events can be delivered more than once. Step Functions retries can duplicate processing.
Idempotent design:
# Generate deterministic ID from input
idempotency_key = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()
# Check if already processed
try:
s3_client.head_object(Bucket='results', Key=f'processed/{idempotency_key}.json')
return {"status": "already_processed"}
except s3_client.exceptions.NoSuchKey:
# Not processed yet, continue
pass
5. IAM Execution Roles vs Resource-Based Policies
- Execution Role (what Lambda can do): s3:GetObject, s3:PutObject
- Resource-Based Policy (who can invoke Lambda): Allow S3 service to invoke
You need BOTH for S3 event triggers to work.
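A rough sketch of what "both sides" looks like with boto3; the function name, bucket, role name, and policy below are hypothetical, and in this project you would normally declare the same thing in Terraform:

```python
import json
import boto3

lambda_client = boto3.client("lambda")
iam = boto3.client("iam")

# 1) Resource-based policy: WHO may invoke the function (the S3 service, for one bucket)
lambda_client.add_permission(
    FunctionName="validate-data",
    StatementId="AllowS3Invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn="arn:aws:s3:::my-pipeline-bucket",
)

# 2) Execution role policy: WHAT the function may do once it runs
execution_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject"],
         "Resource": "arn:aws:s3:::my-pipeline-bucket/input/*"},
        {"Effect": "Allow", "Action": ["s3:PutObject"],
         "Resource": "arn:aws:s3:::my-pipeline-bucket/output/*"},
    ],
}
iam.put_role_policy(
    RoleName="validate-data-execution-role",
    PolicyName="pipeline-s3-access",
    PolicyDocument=json.dumps(execution_policy),
)
```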
Questions to Guide Your Design
- What triggers your first Lambda? S3 event notification? SNS? EventBridge?
- How do you pass data between Lambda functions? Via Step Functions state? S3? SQS?
- What happens when a Lambda times out mid-execution? Does Step Functions retry? From scratch or resume?
- How do you handle partial failures? If 3 out of 5 parallel tasks succeed, proceed or fail?
- What errors are transient vs permanent? ThrottlingException (retry) vs InvalidDataFormat (alert)?
- What if the input file is 5GB? Lambda has 10GB /tmp max, 15-minute timeout.
- How do you maintain data lineage? Which output came from which input?
- How do you debug a failed execution? Step Functions history? CloudWatch Logs?
Thinking Exercise
Draw your event flow diagram:
S3 Bucket (input/)
|
| S3 Event Notification
↓
Lambda: ValidateData
|
↓
Step Functions: DataProcessingStateMachine
|
├→ Choice: Valid data?
| ├─ NO → Lambda: SendAlert → END
| └─ YES ↓
|
├→ Lambda: TransformData
↓
├→ Parallel State:
| ├─ Lambda: CalculateStatistics
| └─ Lambda: GenerateReport
↓
├→ Lambda: WriteOutput
↓
└→ SNS: SendSuccessNotification
Now answer:
- Where does data live at each stage?
- What happens if execution is paused for 1 year? (max execution time)
- Maximum file size your pipeline can handle?
- Cost for processing 1000 files?
The Interview Questions They’ll Ask
Q: “Explain synchronous vs asynchronous Lambda invocations.”
Expected answer:
- Synchronous: Caller waits, gets return value (API Gateway, ALB)
- Asynchronous: Caller gets 202 Accepted, Lambda processes later (S3, SNS)
- Async has built-in retry (2x) and dead-letter queue support
- In your pipeline: S3 invokes ValidateData async, Step Functions invokes Lambdas synchronously (see the sketch below)
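The difference is just the InvocationType flag on the Invoke call. A minimal boto3 sketch (function name and payload are placeholders):

```python
import json
import boto3

lambda_client = boto3.client("lambda")
payload = json.dumps({"bucket": "my-pipeline-bucket", "key": "input/sales_data.csv"})

# Synchronous: caller blocks until the function returns, and gets the result back
sync = lambda_client.invoke(
    FunctionName="validate-data",
    InvocationType="RequestResponse",
    Payload=payload,
)
print(sync["StatusCode"], json.load(sync["Payload"]))  # 200 + the handler's return value

# Asynchronous: Lambda queues the event and replies immediately
async_resp = lambda_client.invoke(
    FunctionName="validate-data",
    InvocationType="Event",
    Payload=payload,
)
print(async_resp["StatusCode"])  # 202, no function result returned
```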
Q: “How do you handle Lambda timeout mid-processing?”
Expected answer:
- Chunked processing: Read in chunks, maintain cursor in DynamoDB
- Recursive invocation: Lambda invokes itself with updated state
- Step Functions Map state: Split into chunks, process in parallel
- Alternative: Use Fargate for long-running tasks (no 15min limit)
Q: “What are Lambda cold starts and how did they affect your pipeline?”
Expected answer:
- Cold start = INIT phase (100ms-3s) when scaling up or after idle
- Mitigation: Provisioned concurrency, smaller packages, init code outside handler
- In your pipeline: “Measured ~800ms cold starts, acceptable for batch processing”
Q: “Why use Step Functions instead of chaining Lambdas?”
Expected answer:
- Visual workflow, declarative error handling, state management
- Execution history shows inputs/outputs/durations
- Parallel, Map, Wait states built-in
Q: “How do you pass large datasets between Step Functions states?”
Expected answer:
- Step Functions has 256KB limit per state
- Store data in S3, pass S3 URI in state
- Use ResultPath to merge results without duplicating data
Q: “IAM permissions for S3 to trigger Lambda?”
Expected answer:
- Lambda execution role (what Lambda can do): s3:GetObject, logs:CreateLogGroup
- Lambda resource-based policy (who can invoke): Allow S3 service, condition on bucket ARN
- S3 bucket notification config pointing to Lambda ARN
Q: “Cost for running pipeline 1000 times/day?”
Expected answer breakdown:
- Lambda: Invocations ($0.20/1M) + Duration ($0.0000166667/GB-sec)
- Step Functions: State transitions ($25/1M)
- S3: Requests + Storage
- Show the calculation (a sketch follows below)
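A back-of-the-envelope sketch using the on-demand rates listed above; the per-execution assumptions (5 Lambda invocations, 512 MB, ~10 s of total compute, 7 state transitions) are made up for illustration only:

```python
# Assumptions (hypothetical workload): 1,000 executions/day for 30 days
executions = 1_000 * 30

lambda_invocations = executions * 5          # 5 Lambdas per pipeline run
gb_seconds = executions * (512 / 1024) * 10  # 512 MB for ~10 s of total compute per run
state_transitions = executions * 7           # 7 transitions per Standard execution

lambda_request_cost = lambda_invocations / 1_000_000 * 0.20
lambda_duration_cost = gb_seconds * 0.0000166667
step_functions_cost = state_transitions / 1_000_000 * 25.00

total = lambda_request_cost + lambda_duration_cost + step_functions_cost
print(f"Lambda requests:  ${lambda_request_cost:.2f}")
print(f"Lambda duration:  ${lambda_duration_cost:.2f}")
print(f"Step Functions:   ${step_functions_cost:.2f}")
print(f"Monthly total:    ${total:.2f}  (plus S3 requests/storage)")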
Hints in Layers
Hint 1.1: What triggers the pipeline? Use S3 Event Notifications with object created events. Configure filter by prefix (input/) and suffix (.csv).
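A minimal boto3 sketch of the notification filter Hint 1.1 describes; the bucket name and function ARN are placeholders, and the same configuration can be declared with `aws_s3_bucket_notification` in Terraform:

```python
import boto3

s3 = boto3.client("s3")

# Trigger the validation Lambda only for CSVs landing under input/
s3.put_bucket_notification_configuration(
    Bucket="my-pipeline-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:validate-data",
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {
                "Key": {
                    "FilterRules": [
                        {"Name": "prefix", "Value": "input/"},
                        {"Name": "suffix", "Value": ".csv"},
                    ]
                }
            },
        }]
    },
)
```

Writing results to a different prefix (e.g., output/) is what keeps the pipeline from re-triggering itself on its own output.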
Hint 1.2: How should Lambdas communicate? Use Step Functions to orchestrate. Pass small metadata in state, store large data in S3.
Hint 2.1: Lambda function structure
import boto3
from aws_lambda_powertools import Logger, Tracer
logger = Logger()
tracer = Tracer()
s3_client = boto3.client('s3') # Initialize outside handler
@logger.inject_lambda_context
@tracer.capture_lambda_handler
def handler(event, context):
    bucket = event['bucket']
    key = event['key']
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # Process...
    return {"valid": True, "rowCount": 1247}
Hint 3.1: S3 event not triggering Lambda? Check:
- CloudWatch Logs for Lambda
- S3 notification config:
aws s3api get-bucket-notification-configuration - Lambda resource-based policy:
aws lambda get-policy - Event filter matches your file
Hint 3.2: Lambda cold starts too slow?
- Check package size
- Lazy import heavy libraries
- Use Lambda Layers
- Consider Provisioned Concurrency
Books That Will Help
| Topic | Book | Specific Chapters | Why It Helps |
|---|---|---|---|
| Serverless Architecture | “AWS for Solutions Architects” by Saurabh Shrivastava | Ch. 8-9: Serverless, Containers vs Lambda | When to use Lambda vs Fargate, event-driven patterns |
| Distributed Systems | “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 11-12: Stream Processing | Event-driven architecture, idempotency, exactly-once semantics |
| Lambda Deep Dive | “AWS Lambda in Action” by Danilo Poccia | Ch. 2, 4, 8: First Lambda, Data Streams, Optimization | Cold start optimization, event source integration |
| Step Functions | AWS Step Functions Developer Guide | Error Handling, ASL Specification | State machine syntax, retry/catch patterns |
| IAM Security | “AWS Security” by Dylan Shields | Ch. 3, 5: IAM, Data Protection | Lambda execution roles, resource-based policies |
| Terraform | “Terraform: Up & Running” by Yevgeniy Brikman | Ch. 3, 7: State, Multiple Providers | Deploy Step Functions + Lambda as code |
| Observability | “Practical Monitoring” by Mike Julian | Ch. 4, 6: Applications, Alerting | Metrics and alerts for serverless |
| Cost Optimization | “AWS Cost Optimization” by Brandon Carroll | Ch. 5, 7: Lambda, Storage | Memory tuning, S3 lifecycle policies |
Reading strategy:
- Start with “AWS for Solutions Architects” Ch. 8 (serverless overview)
- Read “Designing Data-Intensive Applications” Ch. 11 (event-driven concepts)
- Dive into “AWS Lambda in Action” Ch. 4 (event sources) and Ch. 8 (optimization)
- Refer to Step Functions Developer Guide when writing state machine
- Use “Terraform: Up & Running” Ch. 3 as you build infrastructure
Common Pitfalls & Debugging
Problem 1: “Lambda timing out without entering handler or logging output”
- Why: Initialization (Init) timeouts occur before handler execution, often due to heavy dependencies loading, VPC cold starts, or DNS resolution issues
- Fix: Check CloudWatch Logs for INIT_REPORT entries, reduce package size by removing unused dependencies, use Lambda Layers for shared code
- Quick test: Run `aws logs tail /aws/lambda/YOUR_FUNCTION_NAME --follow` and invoke the function to see initialization logs
- Deep dive: If using VPC, ensure Lambda has ENI capacity and consider using VPC endpoints for AWS services to avoid internet routing
Problem 2: “S3 events not triggering Lambda function”
- Why: Missing Lambda resource-based policy, incorrect event filter configuration, or S3 notification not properly configured
- Fix: Verify the S3 notification configuration: `aws s3api get-bucket-notification-configuration --bucket YOUR_BUCKET`
- Quick test: Check that Lambda’s resource policy allows S3 invocation: `aws lambda get-policy --function-name YOUR_FUNCTION`
- Verification: Test the event filter by uploading a file matching your prefix/suffix pattern and checking CloudWatch Logs
Problem 3: “Step Functions execution fails with ‘States.TaskFailed’ or ‘States.Timeout’“
- Why: Lambda function errors not caught, incorrect error handling in state machine, or timeout settings too restrictive
- Fix: Add Catch blocks in state machine definition for error handling, increase TimeoutSeconds in task states, check Lambda logs for actual errors
- Quick test: View execution history in Step Functions console, examine each state’s input/output to identify where it failed
- Best practice: Use `$.Cause` and `$.Error` in Catch blocks to preserve error context for debugging
Problem 4: “Cold starts causing unacceptable latency (>3 seconds)”
- Why: Large deployment package (>50MB), VPC configuration overhead (adds 1-10 seconds), or heavy runtime initialization (.NET/Java have longer cold starts than Python/Node.js)
- Fix: Reduce package size (use Lambda Layers for dependencies), lazy-load heavy libraries only when needed, consider Provisioned Concurrency for critical paths ($0.000004 per GB-second)
- Quick test: Check Lambda Insights metrics in CloudWatch for init duration and billed duration breakdown
- Cost vs performance: Provisioned Concurrency keeps 1+ instances warm but costs ~$13/month per instance (1GB memory) - calculate if cold start impact justifies cost
Problem 5: “Step Functions execution fails with ‘Exceeded maximum allowed execution time’“
- Why: Step Functions has a maximum execution time of 1 year for Standard workflows, but individual states might exceed Lambda’s 15-minute timeout
- Fix: For long-running tasks, break them into smaller Lambda invocations or use Fargate tasks instead
- Alternative: Use Step Functions Express workflows for short-lived, high-volume executions (<5 minutes)
- Pattern: Implement a “chunking” pattern where Lambda processes batches and Step Functions Map state iterates
Problem 6: “Concurrent execution limit reached - throttling errors”
- Why: AWS accounts have default concurrent execution limit of 1,000 across all Lambda functions in a region; sudden traffic spike can exhaust this
- Fix: Request limit increase via AWS Support, or set reserved concurrency on critical functions to guarantee capacity
- Quick test: Check the CloudWatch metrics `ConcurrentExecutions` and `Throttles`
Problem 7: “Lambda consuming more memory than expected - high costs”
- Why: Memory setting too high for actual usage, memory leaks in function code, or inefficient data processing
- Fix: Use Lambda Power Tuning tool to find optimal memory/cost balance: https://github.com/alexcasalboni/aws-lambda-power-tuning
- Quick test: Check CloudWatch Logs for “Max Memory Used” vs “Memory Size” - if consistently using <50%, you’re over-provisioned
- Reality check: Lambda is billed by GB-second; 128MB function running 100ms costs less than 1GB function running 100ms (6.4x cost difference)
Problem 8: “Step Functions state machine JSON is too large (>256KB)”
- Why: Passing large data payloads directly in Step Functions state instead of using S3 references
- Fix: Store large data in S3/DynamoDB, pass only S3 URI or keys in state machine
- Pattern: Lambda 1 writes to S3 → passes `{"s3Key": "data/processed/file.json"}` → Lambda 2 reads from S3 (see the sketch below)
- Best practice: Step Functions state should contain metadata only, not actual data
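A minimal sketch of that handoff, assuming a hypothetical pipeline bucket: the first handler returns only the S3 key, and the next state's handler reads the object back.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-pipeline-bucket"  # hypothetical

def transform_handler(event, context):
    processed = {"rows": 1247, "source": event["key"]}
    out_key = f"data/processed/{event['key'].rsplit('/', 1)[-1]}.json"
    s3.put_object(Bucket=BUCKET, Key=out_key, Body=json.dumps(processed).encode())
    # Keep the Step Functions state tiny: return a pointer, not the payload
    return {"s3Key": out_key}

def report_handler(event, context):
    # The previous state's output arrives as this state's input
    obj = s3.get_object(Bucket=BUCKET, Key=event["s3Key"])
    data = json.loads(obj["Body"].read())
    return {"rowCount": data["rows"]}
```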
Debugging Tools & Techniques:
- CloudWatch Logs Insights: Query Lambda logs across multiple invocations to find patterns:
  fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20
- AWS X-Ray: Enable active tracing on Lambda to see the full request path through services (Lambda → S3 → DynamoDB)
- Lambda Insights: Enhanced monitoring showing memory, CPU, network stats (requires CloudWatch agent layer)
- Step Functions Graph View: Visual execution flow showing which states succeeded/failed with exact error messages
- Remote Debugging (2025): Use AWS Toolkit for VS Code to set breakpoints and debug Lambda functions executing in the cloud
Sources:
- [Issues to Avoid When Implementing Serverless Architecture with AWS Lambda (AWS Architecture Blog)](https://aws.amazon.com/blogs/architecture/mistakes-to-avoid-when-implementing-serverless-architecture-with-lambda/)
- AVOID these AWS Lambda MISTAKES (checklist: symptoms and solutions)
- Accelerating local serverless development with console to IDE and remote debugging for AWS Lambda
- 4 AWS Serverless Security Traps & How to Fix Them
Project 3: Auto-Scaling Web Application (EC2 + ALB + RDS + S3)
| Attribute | Value |
|---|---|
| Language | HCL (Terraform) |
| Difficulty | Intermediate |
| Time | 2–3 weeks |
| Knowledge Area | Cloud Infrastructure / Scalability |
| Coolness | ★☆☆☆☆ Pure Corporate Snoozefest |
| Portfolio Value | Resume Gold |
What you’ll build: A traditional multi-tier web application with load-balanced EC2 instances that scale based on CPU/request metrics, backed by RDS (Aurora or PostgreSQL), with static assets served from S3/CloudFront.
Why it matters: This is the “classic” AWS architecture. Understanding auto-scaling groups, launch templates, health checks, and how ALB distributes traffic is foundational. You’ll also learn why RDS simplifies database ops and how S3+CloudFront offloads static content.
Core challenges:
- Launch templates: defining AMI, instance type, user data scripts, IAM instance profiles
- Auto scaling policies: configuring scaling based on CloudWatch metrics (CPU, request count, custom)
- Health checks: understanding EC2 vs ELB health checks and how they affect scaling
- Database connectivity: RDS in private subnet, security group rules from app tier only
- Static asset optimization: S3 origin with CloudFront distribution, cache behaviors
Key concepts to master:
- Auto Scaling Groups and launch templates
- Application Load Balancer and target groups
- RDS Multi-AZ deployments
- CloudFront CDN and cache behaviors
- Health checks and instance lifecycle
Prerequisites: Project 1 (VPC), basic web development, familiarity with databases.
Deliverable: A production-ready multi-tier web application with auto-scaling EC2 instances behind an ALB, RDS database in private subnets, and S3/CloudFront for static assets—all managed with Terraform and observable through CloudWatch metrics and dashboards.
Implementation hints:
- Start with a single EC2 instance before adding auto-scaling
- Use user data scripts to bootstrap instances with your application
- Test health checks by manually stopping your web server
- Use CloudWatch alarms to trigger scaling events
Milestones:
- Single EC2 with user data script serving a web app → you understand instance bootstrapping
- Add ALB + 2 instances in target group → you understand load balancing
- Add Auto Scaling with CPU-based policy → you understand elasticity
- Add RDS in private subnet → you understand data tier security
- Add S3 + CloudFront for static assets → you understand CDN patterns
Real World Outcome
Here’s what success looks like when you complete this project:
# 1. Deploy the infrastructure
$ terraform apply
...
Apply complete! Resources: 23 added, 0 changed, 0 destroyed.
Outputs:
alb_dns_name = "my-app-alb-123456789.us-east-1.elb.amazonaws.com"
rds_endpoint = "myapp-db.c9akl1.us-east-1.rds.amazonaws.com:5432"
cloudfront_domain = "d111111abcdef8.cloudfront.net"
# 2. Verify the application is running
$ curl http://my-app-alb-123456789.us-east-1.elb.amazonaws.com
<html>
<head><title>My Scalable App</title></head>
<body>
<h1>Hello from instance i-0abc123def456!</h1>
<p>Current time: 2025-12-22 14:32:15</p>
<p>Database connection: OK</p>
</body>
</html>
# 3. Load test to trigger auto-scaling (using 'hey' tool)
$ hey -n 10000 -c 100 http://my-app-alb-123456789.us-east-1.elb.amazonaws.com
Summary:
Total: 45.3216 secs
Slowest: 2.3451 secs
Fastest: 0.0234 secs
Average: 0.4532 secs
Requests/sec: 220.75
Status code distribution:
[200] 10000 responses
# 4. Watch instances scale up in real-time (in another terminal)
$ watch "aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names my-app-asg \
--query 'AutoScalingGroups[0].[DesiredCapacity,MinSize,MaxSize,Instances[*].[InstanceId,LifecycleState]]' \
--output table --no-cli-pager"
Every 2.0s: aws autoscaling...
---------------------------------------------
| DescribeAutoScalingGroups |
+-------------------------------------------+
|| 2 | 1 | 5 || # Started with 2 instances
+-------------------------------------------+
||| i-0abc123def456 | InService |||
||| i-0def789ghi012 | InService |||
+-------------------------------------------+
# After load increases, watch it scale to 4 instances:
+-------------------------------------------+
|| 4 | 1 | 5 || # Scaled up!
+-------------------------------------------+
||| i-0abc123def456 | InService |||
||| i-0def789ghi012 | InService |||
||| i-0jkl345mno678 | InService |||
||| i-0pqr901stu234 | InService |||
+-------------------------------------------+
# 5. Check CloudWatch metrics showing scaling activity
$ aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB \
--metric-name TargetResponseTime \
--dimensions Name=LoadBalancer,Value=app/my-app-alb/50dc6c495c0c9188 \
--start-time 2025-12-22T14:00:00Z \
--end-time 2025-12-22T15:00:00Z \
--period 300 \
--statistics Average \
--no-cli-pager
Datapoints:
- Timestamp: 2025-12-22T14:00:00Z, Average: 0.032
- Timestamp: 2025-12-22T14:05:00Z, Average: 0.125 # Load increasing
- Timestamp: 2025-12-22T14:10:00Z, Average: 0.456 # Scaling triggered
- Timestamp: 2025-12-22T14:15:00Z, Average: 0.089 # Back to normal after scale-up
# 6. Check ALB target health
$ aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-app-tg/50dc6c495c0c9188 \
--no-cli-pager
TargetHealthDescriptions:
- Target:
Id: i-0abc123def456
Port: 80
HealthCheckPort: '80'
TargetHealth:
State: healthy
Reason: Target passed health checks
- Target:
Id: i-0def789ghi012
Port: 80
HealthCheckPort: '80'
TargetHealth:
State: healthy
Reason: Target passed health checks
# 7. View Auto Scaling activity history
$ aws autoscaling describe-scaling-activities \
--auto-scaling-group-name my-app-asg \
--max-records 5 \
--no-cli-pager
Activities:
- ActivityId: 1234abcd-5678-90ef-gh12-ijklmnopqrst
Description: "Launching a new EC2 instance: i-0jkl345mno678"
Cause: "At 2025-12-22T14:12:00Z a monitor alarm TargetTracking-my-app-asg-AlarmHigh-... in state ALARM triggered policy my-app-scaling-policy causing a change to the desired capacity from 2 to 4."
StartTime: 2025-12-22T14:12:15Z
EndTime: 2025-12-22T14:13:42Z
StatusCode: Successful
# 8. Check RDS database connection
$ psql -h myapp-db.c9akl1.us-east-1.rds.amazonaws.com -U dbadmin -d myappdb
Password:
myappdb=> SELECT current_database(), current_user, inet_server_addr();
current_database | current_user | inet_server_addr
------------------+--------------+------------------
myappdb | dbadmin | 10.0.3.45
(1 row)
myappdb=> SELECT COUNT(*) FROM app_requests;
count
-------
10247 # Showing all the requests that were logged
(1 row)
# 9. View CloudFront cache statistics
$ aws cloudfront get-distribution-config \
--id E1234ABCDEFGH \
--query 'DistributionConfig.Origins[0].DomainName' \
--output text --no-cli-pager
my-app-static-assets.s3.us-east-1.amazonaws.com
# Test CloudFront delivery
$ curl -I https://d111111abcdef8.cloudfront.net/images/logo.png
HTTP/2 200
content-type: image/png
content-length: 45678
x-cache: Hit from cloudfront # Cache HIT - fast delivery!
x-amz-cf-pop: SFO5-C1
x-amz-cf-id: abc123...
# 10. After load subsides, watch instances scale down
$ watch "aws autoscaling describe-auto-scaling-groups ..."
+-------------------------------------------+
|| 1 | 1 | 5 || # Scaled back down to minimum
+-------------------------------------------+
||| i-0abc123def456 | InService |||
+-------------------------------------------+
CloudWatch Dashboard View:
- CPU Utilization graph showing spike from 20% → 75% → back to 25%
- Request Count showing 50 req/sec → 800 req/sec → 60 req/sec
- Target Response Time showing latency spike then recovery
- Healthy Host Count showing 2 → 4 → 1 instances over time
- RDS connections showing stable connection pool usage
- ALB HTTP 200 responses at 99.8% (a few timeouts during initial spike)
What You’ll See in AWS Console:
- EC2 Auto Scaling Groups showing scaling activities
- CloudWatch alarms transitioning: OK → ALARM → OK
- ALB target groups with health check status
- RDS Performance Insights showing query patterns
- S3 bucket with static assets
- CloudFront distribution with cache statistics
The Core Question You’re Answering
“How do I build applications that automatically handle 10x traffic spikes without falling over, and intelligently shrink when traffic subsides to save costs?”
This is the fundamental problem that auto-scaling solves. Traditional infrastructure requires you to provision for peak load—meaning you’re paying for idle capacity 90% of the time. Auto-scaling lets you provision for average load and dynamically expand when needed.
Why this matters in the real world:
- E-commerce sites during Black Friday sales
- News sites during breaking news events
- SaaS platforms during business hours (scale down at night)
- API backends handling unpredictable mobile app traffic
- Gaming servers during new release launches
Without auto-scaling, you have two bad options:
- Over-provision: Run 10 servers 24/7 even though you only need them for 2 hours a day → wasted money
- Under-provision: Run 2 servers and accept that your site crashes during traffic spikes → lost revenue
Auto-scaling gives you the best of both worlds: cost efficiency during normal periods, reliability during peaks.
Concepts You Must Understand First
Before diving into implementation, you need to internalize these foundational concepts:
1. Horizontal vs Vertical Scaling
Vertical Scaling (Scale Up): Make your server bigger
- EC2 instance: t3.medium → t3.large → t3.xlarge
- More CPU, more RAM, same single instance
- Limitation: Hard ceiling (largest EC2 instance), requires downtime, single point of failure
Horizontal Scaling (Scale Out): Add more servers
- 1 instance → 3 instances → 10 instances
- Distribute load across multiple machines
- Advantage: Theoretically unlimited, no downtime, fault-tolerant
This project teaches horizontal scaling, which is how modern cloud applications achieve massive scale.
2. Stateless Application Design
Stateless: Each request is independent, no memory of previous requests
# Stateless - GOOD for auto-scaling
@app.route('/api/user/<user_id>')
def get_user(user_id):
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    return jsonify(user)
# Any instance can handle this request
Stateful: Application remembers information between requests
# Stateful - BAD for auto-scaling
user_sessions = {} # In-memory storage
@app.route('/api/cart/add')
def add_to_cart(item_id):
    session_id = request.cookies.get('session')
    user_sessions[session_id].append(item_id) # Only works if same instance
    return "Added"
# BREAKS when the load balancer sends the next request to a different instance
Why stateless matters for auto-scaling:
- Load balancer can send requests to any instance
- Instances can be terminated without data loss
- New instances can immediately serve traffic
How to handle state:
- Store sessions in Redis/ElastiCache (shared across instances)
- Store sessions in DynamoDB
- Use JWT tokens (state stored client-side)
- Use sticky sessions (not recommended, defeats auto-scaling benefits)
3. Load Balancer Algorithms
Round Robin (default for ALB):
Request 1 → Instance A
Request 2 → Instance B
Request 3 → Instance C
Request 4 → Instance A (back to start)
Pro: Simple, fair distribution
Con: Doesn’t account for instance load
Least Outstanding Requests (ALB option):
Instance A: 5 active requests
Instance B: 2 active requests ← Send here
Instance C: 8 active requests
Next request goes to Instance B (least busy)
Pro: Better for variable request durations
Con: Requires tracking active connections
Weighted Target Groups:
Instance A: weight 100 (gets 50% of traffic)
Instance B: weight 50 (gets 25% of traffic)
Instance C: weight 50 (gets 25% of traffic)
Use case: Blue/green deployments, A/B testing
4. Health Checks: EC2 vs ELB Health Checks
EC2 Health Check (Auto Scaling Group):
- Checks: Is the instance running? Is the status “OK”?
- Fails if: Instance stopped, hardware failure, status check failures
- Does NOT check: Is the application responding?
ELB Health Check (Application Load Balancer):
- Checks: HTTP GET to the /health endpoint, expects a 200 response
- Fails if: Application crashed, database unreachable, timeout (default 5 sec)
- More comprehensive than EC2 check
Critical difference:
Scenario: EC2 instance is running, but your web app crashed
EC2 Health Check: PASS (instance is running)
ELB Health Check: FAIL (app not responding)
Result: Auto Scaling thinks instance is healthy, keeps it running
Load balancer marks it unhealthy, stops sending traffic
Solution: Configure ASG to use ELB health checks!
Best practice configuration:
resource "aws_autoscaling_group" "app" {
  health_check_type         = "ELB" # Use ELB checks, not EC2
  health_check_grace_period = 300   # Wait 5 min for app to start
  # ... other config
}

resource "aws_lb_target_group" "app" {
  health_check {
    path                = "/health"
    interval            = 30 # Check every 30 seconds
    timeout             = 5  # Fail if no response in 5 sec
    healthy_threshold   = 2  # 2 consecutive passes = healthy
    unhealthy_threshold = 3  # 3 consecutive fails = unhealthy
  }
}
5. Launch Templates vs Launch Configurations
Launch Configuration (DEPRECATED, but you’ll see it in old code):
- Cannot be modified (must create new version)
- Limited instance type options
- No instance metadata service v2 support
Launch Template (USE THIS):
- Versioned (can modify and rollback)
- Supports mixed instance types
- Supports Spot instances
- Supports newer EC2 features
resource "aws_launch_template" "app" {
name_prefix = "my-app-"
image_id = "ami-0c55b159cbfafe1f0" # Your AMI
instance_type = "t3.medium"
# User data script to configure instance on boot
user_data = base64encode(<<-EOF
#!/bin/bash
cd /opt/myapp
export DB_HOST=${aws_db_instance.main.endpoint}
./start-app.sh
EOF
)
# IAM role for instance
iam_instance_profile {
arn = aws_iam_instance_profile.app.arn
}
# Security group
vpc_security_group_ids = [aws_security_group.app.id]
# Enable detailed monitoring
monitoring {
enabled = true
}
}
Key understanding: Launch template is the blueprint for every instance that auto-scaling creates.
Questions to Guide Your Design
Ask yourself these questions as you build. If you can’t answer them, you don’t understand the architecture yet.
1. How does the ALB know which instances are healthy?
Answer: The ALB continuously sends HTTP requests to each instance’s health check endpoint (e.g., GET /health). If it receives a 200 OK response within the timeout period (default 5 seconds), the instance is marked healthy. If it fails the unhealthy threshold number of times (default 3 consecutive failures), it’s marked unhealthy and removed from rotation.
Follow-up: What should your /health endpoint check?
- Database connectivity?
- Disk space?
- Memory available?
- Dependency services reachable?
Best practice: Start simple (just return 200), then add checks for critical dependencies.
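A minimal Flask sketch of that progression; the `db` object is an assumption standing in for your real connection pool, and the endpoint path must match the target group's health check path:

```python
from flask import Flask, jsonify

app = Flask(__name__)
db = None  # assumed: swap in your real connection/pool when you add the DB check

@app.route("/health")
def health():
    checks = {"app": "ok"}  # Step 1: the process is up and serving HTTP

    # Step 2 (optional): verify critical dependencies, but keep the check cheap and fast
    if db is not None:
        try:
            db.execute("SELECT 1")
            checks["database"] = "ok"
        except Exception:
            checks["database"] = "unreachable"
            return jsonify(checks), 503  # non-200 -> ALB marks the target unhealthy

    return jsonify(checks), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)  # bind to 0.0.0.0 so the ALB can reach the check
```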
2. What metric should trigger scaling?
Common options:
CPU Utilization (most common starting point):
resource "aws_autoscaling_policy" "scale_up" {
name = "cpu-scale-up"
scaling_adjustment = 1
adjustment_type = "ChangeInCapacity"
cooldown = 300
autoscaling_group_name = aws_autoscaling_group.app.name
}
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
alarm_name = "cpu-utilization-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
period = 120
statistic = "Average"
threshold = 70 # Scale up when CPU > 70%
alarm_actions = [aws_autoscaling_policy.scale_up.arn]
}
Problems with CPU-based scaling:
- I/O-bound applications may never hit CPU threshold
- Doesn’t capture actual user experience (latency)
Request Count Per Target (better for web apps):
resource "aws_cloudwatch_metric_alarm" "requests_high" {
metric_name = "RequestCountPerTarget"
namespace = "AWS/ApplicationELB"
threshold = 1000 # 1000 requests/target in 5 min
# ... more config
}
Target Tracking Scaling (recommended - AWS manages it):
resource "aws_autoscaling_policy" "target_tracking" {
name = "target-tracking-policy"
policy_type = "TargetTrackingScaling"
autoscaling_group_name = aws_autoscaling_group.app.name
target_tracking_configuration {
predefined_metric_specification {
predefined_metric_type = "ASGAverageCPUUtilization"
}
target_value = 50.0 # Keep average CPU at 50%
}
}
Decision criteria:
- Web apps with predictable load per request: Request count
- CPU-intensive tasks: CPU utilization
- Most cases: Target tracking with CPU, let AWS handle it
- Advanced: Custom metrics (latency from your app logs)
3. How do you handle session state with multiple instances?
Option 1: Don’t use sessions (best for APIs)
- Use JWT tokens
- All state in the token payload
- Stateless, works with any instance
Option 2: Shared session store (for traditional web apps)
from flask import Flask, session
from flask_session import Session
import redis
app = Flask(__name__)
app.config['SESSION_TYPE'] = 'redis'
app.config['SESSION_REDIS'] = redis.from_url('redis://elasticache-endpoint:6379')
Session(app)
# Now sessions are stored in Redis, not in-memory
# Any instance can read/write session data
Option 3: Sticky sessions (not recommended)
- ALB always routes user to same instance
- Defeats purpose of auto-scaling
- Lose sessions when instance terminates
Real-world recommendation: Use ElastiCache (Redis) for sessions if you need them.
4. What’s the difference between desired, minimum, and maximum capacity?
resource "aws_autoscaling_group" "app" {
min_size = 2 # Never go below 2 (for high availability)
max_size = 10 # Never exceed 10 (cost protection)
desired_capacity = 4 # Start with 4, let scaling adjust
# ... more config
}
How they work together:
Scenario 1: Normal operation
Current: 4 instances (= desired)
Traffic increases, CPU hits 80%
Scaling policy triggers: desired_capacity = 4 + 2 = 6
Result: Launch 2 new instances
Scenario 2: Traffic spike
Current: 8 instances
Traffic spike continues, CPU still high
Scaling policy wants: desired_capacity = 8 + 2 = 10
Result: Launch 2 more instances (at max_size, stops)
Scenario 3: Traffic drops
Current: 6 instances
CPU drops to 20%, scale-down alarm triggers
Scaling policy: desired_capacity = 6 - 2 = 4
Result: Terminate 2 instances
Scenario 4: Major incident, all instances failing
Current: 0 healthy instances
Auto Scaling: "I need to meet min_size!"
Result: Launches 2 new instances (even if health checks failing)
Best practices:
- min_size: Enough for high availability (at least 2 in different AZs)
- max_size: Based on budget and realistic peak load
- desired_capacity: Let auto-scaling manage it, or set an initial value
5. What happens during a deployment to a running auto-scaled application?
Bad approach (causes downtime):
# Update launch template
# Terminate all instances
# New instances launch with new code
# → Downtime while instances come up
Good approach (rolling update):
resource "aws_autoscaling_group" "app" {
# ...
instance_refresh {
strategy = "Rolling"
preferences {
min_healthy_percentage = 50 # Keep at least 50% healthy during update
}
}
}
Process:
- Update launch template with new AMI/user data
- Trigger instance refresh
- Auto Scaling terminates one instance
- New instance launches with new config
- Wait for health checks to pass
- Repeat until all instances replaced
Blue/Green deployment (zero downtime):
- Create new ASG with new version
- Attach to same target group
- Gradually shift traffic (weighted target groups)
- Monitor error rates
- Fully switch or rollback
Thinking Exercise
Problem: You’re building a web application where each EC2 instance can reliably handle 100 requests per second. You expect peak traffic of 500 requests per second during business hours (9 AM - 5 PM), and only 50 requests per second during off-hours.
Question 1: How many instances do you need at minimum for peak load?
Answer: 500 req/sec ÷ 100 req/sec per instance = 5 instances minimum at peak
But you should add buffer for:
- Instance failures
- Traffic spikes beyond expected peak
- Performance degradation under load
Recommended:
- max_size = 8 (160% of the calculated minimum, allows for a 60% spike)
- min_size = 2 (high availability during off-hours)
- desired_capacity = 6 (20% buffer above the minimum requirement)
Question 2: If each instance costs $0.05/hour, how much do you save per month with auto-scaling vs. running peak capacity 24/7?
Without auto-scaling (running 5 instances 24/7):
- Cost = 5 instances × $0.05/hour × 24 hours × 30 days = $180/month
With auto-scaling:
- Peak hours (9 AM - 5 PM = 8 hours): 5 instances
- Off-hours (16 hours): 2 instances
Daily cost:
- Peak: 5 instances × $0.05 × 8 hours = $2.00
- Off-hours: 2 instances × $0.05 × 16 hours = $1.60
- Total per day = $3.60
Monthly cost:
- $3.60 × 30 days = $108/month
Savings: $180 - $108 = $72/month (40% reduction)
With realistic scaling (more granular adjustments), savings are often 50-70%.
Question 3: Your health check interval is 30 seconds, unhealthy threshold is 3. An instance’s application crashes. How long until the ALB stops sending traffic to it?
Answer:
- Health check every 30 seconds
- Need 3 consecutive failures
- Time = 30 seconds × 3 = 90 seconds minimum
During this time, users may experience:
- Timeout errors (if health check timeout is 5 sec, user requests timeout too)
- Failed requests sent to unhealthy instance
Optimization: Reduce interval to 10 seconds, unhealthy threshold to 2:
- Time to detect = 10 × 2 = 20 seconds
- Trade-off: More frequent health checks = slightly more traffic
Question 4: You set scale-up to trigger at 70% CPU, scale-down at 30% CPU. What problem might you encounter?
Problem: Flapping (constant scale-up/scale-down oscillation)
Scenario:
- 3 instances at 75% CPU → scale up to 4 instances
- Load distributed across 4 → CPU drops to 56% (still above 30%)
- Stays stable… then one instance terminates unexpectedly
- Load redistributes to 3 instances → CPU back to 75% → scale up again
- Repeat forever
Or worse:
- 3 instances at 75% CPU → scale up to 4
- CPU drops to 28% → scale down to 3
- CPU jumps to 75% → scale up to 4
- Infinite loop, costs money, destabilizes app
Solution: Add cooldown periods and wider gap:
resource "aws_autoscaling_policy" "scale_up" {
cooldown = 300 # Wait 5 minutes before scaling again
# ...
}
# Use thresholds with buffer:
# Scale up at 70% CPU
# Scale down at 20% CPU (50 point gap prevents flapping)
Better solution: Use target tracking scaling:
target_tracking_configuration {
target_value = 50.0 # AWS automatically manages scale-up/down to maintain this
}
AWS’s algorithm is smarter, prevents flapping, considers cooldowns automatically.
The Interview Questions They’ll Ask
When you claim AWS auto-scaling experience on your resume, expect these questions:
1. Basic Understanding
Q: “Explain the difference between vertical and horizontal scaling. When would you use each?”
Expected answer:
- Vertical = bigger instance (limited by hardware, downtime required)
- Horizontal = more instances (unlimited, no downtime, requires stateless design)
- Use vertical for: Legacy apps that can’t distribute, databases (until you move to RDS)
- Use horizontal for: Web apps, APIs, stateless services
Q: “What’s the difference between an Application Load Balancer and a Network Load Balancer?”
Expected answer:
- ALB: Layer 7 (HTTP/HTTPS), content-based routing, WebSockets, host/path routing
- NLB: Layer 4 (TCP/UDP), ultra-low latency, static IP, millions of req/sec
- Use ALB for: Web applications, microservices with path routing
- Use NLB for: Non-HTTP protocols, extreme performance requirements, static IP needed
2. Design Scenarios
Q: “You’re seeing 5xx errors from your ALB. How do you troubleshoot?”
Expected approach:
- Check target health in target group (unhealthy instances?)
- Check ALB access logs (which endpoints returning 5xx?)
- Check application logs on instances (app crashes? database timeouts?)
- Check security group rules (instances can reach database?)
- Check CloudWatch metrics (CPU/memory maxed out?)
Q: “Your application needs to maintain user sessions. How do you architect this with auto-scaling?”
Expected answer:
- Option 1: ElastiCache (Redis/Memcached) as shared session store
- Option 2: DynamoDB for session storage
- Option 3: JWT tokens (no server-side sessions)
- NOT sticky sessions (defeats auto-scaling benefits, data loss on instance termination)
3. Scaling Logic
Q: “You set your ASG to min=2, max=10, desired=5. You manually terminate an instance. What happens?”
Expected answer:
- Current instances: 4 (after termination)
- Desired capacity: still 5
- Auto Scaling detects current < desired
- Launches 1 new instance to reach desired=5
Q: “What’s the difference between target tracking scaling and step scaling?”
Expected answer:
- Target tracking: Set a target (e.g., “maintain 50% CPU”), AWS automatically scales up/down to maintain it. Simpler, recommended for most use cases.
- Step scaling: Define explicit rules (e.g., “if CPU > 70%, add 2 instances; if CPU > 85%, add 4 instances”). More control, more complex, use for non-linear scaling needs.
4. Real-World Problem Solving
Q: “Your auto-scaling isn’t triggering when you expect. How do you debug?”
Expected approach:
- Check CloudWatch alarms (are they in ALARM state?)
- Check alarm history (has threshold actually been crossed?)
- Check alarm configuration (right metric? right threshold? evaluation periods?)
- Check ASG configuration (is policy attached? cooldown preventing scale?)
- Check instance metrics (is data actually being reported?)
Q: “You deployed a new version and now instances are failing health checks. What do you check?”
Expected approach:
- SSH to instance, check application logs
- Test the health check endpoint manually: `curl localhost:80/health`
- Check if the app started correctly (check user data script logs)
- Check security group (does instance allow traffic on health check port?)
- Check health check configuration (path correct? timeout too short?)
- Check grace period (is app given enough time to start before checks?)
5. Cost Optimization
Q: “How would you reduce costs for an auto-scaled application?”
Expected strategies:
- Right-size instances (use smaller instance types if CPU consistently low)
- Use Spot instances for fault-tolerant workloads (70-90% cheaper)
- Implement aggressive scale-down (reduce min_size during known low-traffic periods)
- Use scheduled scaling (scale down automatically at night/weekends)
- Reserved Instances or Savings Plans for baseline capacity
- Monitor and optimize unhealthy instance replacement (failing fast vs. retrying)
Q: “Explain Spot instances in the context of auto-scaling. What are the risks?”
Expected answer:
- Spot = unused EC2 capacity at up to 90% discount
- Risk: AWS can terminate with 2-minute notice if capacity needed
- Use in ASG with mixed instance types (Spot + On-Demand)
- Configure Spot allocation strategy (price-capacity-optimized)
- Not suitable for: Stateful apps, databases, single-instance workloads
- Perfect for: Batch processing, web front-ends (with On-Demand baseline)
Hints in Layers
When you get stuck, reveal hints progressively instead of jumping to the solution:
Problem: Instances launching but failing health checks immediately
Hint 1 (First check)
Check the health check grace period. Your application might need time to start up.
resource "aws_autoscaling_group" "app" {
health_check_grace_period = 300 # Seconds to wait before health checks
}
If your app takes 3 minutes to initialize but grace period is 30 seconds, instances will be terminated before they’re ready.
Hint 2 (If still failing)
SSH into a failing instance and test the health check endpoint manually:
# From within the instance
curl -v http://localhost:80/health
# Check if the application is actually running
ps aux | grep your-app-name
# Check application logs
tail -f /var/log/your-app/app.log
Is the application even starting? Is it listening on the correct port?
Hint 3 (Security check)
Verify security group rules allow health checks:
aws ec2 describe-security-groups --group-ids sg-xxxxx --no-cli-pager
# Look for:
# - Inbound rule allowing ALB security group on port 80
# - Or inbound rule allowing the VPC CIDR on port 80
The ALB needs network access to perform health checks.
Hint 4 (Application check)
Check your user data script logs:
# On Amazon Linux/Ubuntu
cat /var/log/cloud-init-output.log
# Look for errors in your bootstrap script
# Did database connection fail?
# Did dependencies install correctly?
A failing user data script means your app never starts.
Solution (Last resort)
Common causes and fixes:
- Application takes too long to start:
  health_check_grace_period = 600 # Increase to 10 minutes
- Wrong health check path:
  resource "aws_lb_target_group" "app" {
    health_check {
      path = "/health" # Make sure this endpoint exists!
    }
  }
- Health check endpoint requires the database, and the database is unreachable:
  - Fix security group rules to allow the app tier → database tier
  - Or simplify the health check so it doesn’t require the database
- Application listening on the wrong port:
  # Your app
  app.run(host='0.0.0.0', port=80) # Must match the target group port
- User data script has errors, so the app never starts:
  - Test the user data script locally first
  - Add error handling: `set -e` to fail fast
  - Check logs: /var/log/cloud-init-output.log
Problem: Auto-scaling not triggering when CPU is high
Hint 1
Check if your CloudWatch alarm is actually in ALARM state:
aws cloudwatch describe-alarms --alarm-names "cpu-high-alarm" --no-cli-pager
Look at StateValue. If it’s OK, the threshold hasn’t been crossed.
Hint 2
Check your alarm configuration:
aws cloudwatch describe-alarms --alarm-names "cpu-high-alarm" --no-cli-pager
# Verify:
# - Threshold: Is it too high? (e.g., 99% vs 70%)
# - EvaluationPeriods: Does CPU need to be high for multiple periods?
# - Period: Is it too long? (e.g., 5 minutes vs 1 minute)
# - Statistic: Average vs Maximum vs Minimum
Example: If EvaluationPeriods=3 and Period=300, CPU must be high for 15 minutes before alarm triggers.
Hint 3
Check if scaling is in cooldown:
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names my-asg --no-cli-pager
# Look for recent scaling activities
aws autoscaling describe-scaling-activities --auto-scaling-group-name my-asg --max-records 5 --no-cli-pager
If a scaling action just happened, cooldown period prevents another one (default 300 seconds).
Solution
Common fixes:
- Alarm threshold too high:
  threshold = 70 # Not 90
- Evaluation period too long:
  evaluation_periods = 2  # Not 5
  period             = 60 # 1 minute, not 5
- Cooldown preventing scaling:
  resource "aws_autoscaling_policy" "scale_up" {
    cooldown = 60 # Reduce from 300
  }
- Alarm not attached to the scaling policy:
  resource "aws_cloudwatch_metric_alarm" "cpu_high" {
    # ...
    alarm_actions = [aws_autoscaling_policy.scale_up.arn] # Must be set!
  }
- Use target tracking instead:
  resource "aws_autoscaling_policy" "target_tracking" {
    policy_type = "TargetTrackingScaling"
    target_tracking_configuration {
      predefined_metric_specification {
        predefined_metric_type = "ASGAverageCPUUtilization"
      }
      target_value = 50.0
    }
  }
  # AWS handles everything automatically
Books That Will Help
| Book | Author | What It Teaches | Best Sections for This Project |
|---|---|---|---|
| AWS for Solutions Architects | Saurabh Shrivastava et al. | Comprehensive AWS architecture patterns | Ch. 6 (Auto Scaling Groups), Ch. 7 (RDS), Ch. 10 (High Availability) |
| AWS Certified Solutions Architect Study Guide | Ben Piper, David Clinton | Exam-focused AWS fundamentals | Ch. 4 (EC2), Ch. 5 (ELB), Ch. 7 (CloudWatch) |
| Designing Data-Intensive Applications | Martin Kleppmann | Distributed systems theory (applies to auto-scaling) | Ch. 1 (Scalability), Ch. 8 (Distributed Systems) |
| Amazon Web Services in Action | Michael Wittig, Andreas Wittig | Hands-on AWS with practical examples | Ch. 3 (Infrastructure Automation), Ch. 6 (Scaling Up and Down) |
| The Phoenix Project | Gene Kim et al. | DevOps principles (why auto-scaling matters) | Part 2 (First Way - Flow), Part 3 (Second Way - Feedback) |
| Site Reliability Engineering | Google SRE Team | Operational practices for scaled systems | Ch. 6 (Monitoring), Ch. 22 (Cascading Failures - why you need auto-scaling) |
| Terraform: Up & Running | Yevgeniy Brikman | Infrastructure as Code for AWS | Ch. 2 (Terraform Syntax), Ch. 5 (State Management), Ch. 7 (Modules) |
Reading strategy:
- Start with “AWS for Solutions Architects” Ch. 6-7 for AWS-specific patterns
- Read “Designing Data-Intensive Applications” Ch. 1 to understand why systems need to scale
- Use “Terraform: Up & Running” Ch. 2 as a reference while coding
- Read “SRE” Ch. 22 after completing the project to understand failure modes you just protected against
Common Pitfalls & Debugging
Problem 1: “Auto Scaling not triggering when CPU is high”
- Why: CloudWatch alarm incorrectly configured, insufficient evaluation periods, or metric not being published
- Fix: Verify alarm state: `aws cloudwatch describe-alarms --alarm-names YOUR_ALARM_NAME`
- Quick test: Check alarm history: `aws cloudwatch describe-alarm-history --alarm-name YOUR_ALARM_NAME --max-records 5`
- Common cause: Using average CPU across all instances; if one instance is at 100% but the others are idle, the average may never cross the threshold
- Best practice: Use target tracking policies instead of step scaling for more responsive behavior
Problem 2: “Instances launching but failing health checks immediately”
- Why: Health check grace period too short, application taking longer to start than expected, or wrong health check endpoint configured
- Fix: Increase `health_check_grace_period` in the Auto Scaling Group (the default 300 seconds is often too short for complex apps)
- Quick test: SSH into a newly launched instance and manually curl the health check endpoint to see the actual response
- Debugging: Check ALB target group health checks vs ASG health checks - they can conflict
- Common issue: Application binds to `localhost` instead of `0.0.0.0`, so ALB health checks from outside the instance fail
Problem 3: “RDS connection pool exhausted - ‘too many connections’ error”
- Why: Each EC2 instance creates its own connection pool; auto-scaling adds instances which multiply connections to RDS
- Fix: Implement connection pooling properly - use PgBouncer/ProxySQL or RDS Proxy to multiplex connections
- Quick test: Check the RDS CloudWatch metric `DatabaseConnections` and compare it to the `max_connections` parameter
- Calculation: If each instance uses 20 connections and you scale to 50 instances, that is 1,000 connections (most RDS instances max out at 150-500)
- Solution: Use RDS Proxy (AWS managed connection pooler) or reduce per-instance connection pool size
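As a sketch of the per-instance pool sizing mentioned in Problem 3 above (SQLAlchemy and all numbers here are illustrative assumptions; any pooling library works the same way):

```python
# Sketch: budget connections so (pool_size + max_overflow) * max ASG instances
# stays below the database's max_connections.
from sqlalchemy import create_engine

RDS_MAX_CONNECTIONS = 200   # from the DB parameter group (assumption)
MAX_ASG_INSTANCES = 10      # ASG max capacity (assumption)
HEADROOM = 20               # reserve connections for admin tools and migrations

per_instance = (RDS_MAX_CONNECTIONS - HEADROOM) // MAX_ASG_INSTANCES  # -> 18

engine = create_engine(
    "postgresql+psycopg2://app:password@my-db.example.us-east-1.rds.amazonaws.com/appdb",
    pool_size=per_instance,
    max_overflow=0,       # never exceed the budgeted connections
    pool_pre_ping=True,   # discard dead connections after failovers
)
```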
Problem 4: “Instances can’t reach RDS database - connection timeout”
- Why: Security group on RDS not allowing traffic from EC2 security group, or RDS in different VPC/subnet
- Fix: RDS security group must have inbound rule allowing port 5432 (PostgreSQL) or 3306 (MySQL) from EC2 security group ID
- Quick test: From an EC2 instance, run `nc -zv RDS_ENDPOINT 5432` to test TCP connectivity
- Network debugging: Check route tables - the private subnet route table needs a NAT gateway route only for outbound internet traffic; RDS is internal to the VPC, so no NAT is needed to reach it
- Common mistake: Allowing RDS access by IP range instead of security group ID - IPs change with auto-scaling
Problem 5: “ALB returning 502 Bad Gateway errors intermittently”
- Why: Application crashing on some instances, instances deregistering during traffic, or connection timeout misconfiguration
- Fix: Check ALB target health in console, examine CloudWatch Logs for unhealthy target count
- Quick test: `aws elbv2 describe-target-health --target-group-arn YOUR_TG_ARN` shows which instances are unhealthy
- Common cause: Application takes longer to respond than the ALB idle timeout allows (default 60 seconds) - raise the timeout to cover your slowest responses
- Debugging: Enable ALB access logs to S3 to see exact error codes and target IP addresses
Problem 6: “Auto Scaling stuck - won’t scale in (decrease instances)”
- Why: Scale-in protection enabled on instances, cooldown period preventing scale-in, or Application Auto Scaling suspended during ECS deployments
- Fix: Check if “Turn off scale-in” option is enabled in scaling policy (should be disabled unless you have specific reason)
- Quick test: Verify Auto Scaling Group activities: `aws autoscaling describe-scaling-activities --auto-scaling-group-name YOUR_ASG --max-records 10`
- For ECS: Application Auto Scaling suspends scale-in during deployments and resumes it after completion - this is by design
- Cooldown: Default 300-second cooldown between scaling activities prevents rapid fluctuations - be patient or reduce cooldown
Problem 7: “Session state lost when traffic moves to different instance”
- Why: Application stores session data in memory (not stateless), new instance doesn’t have session info
- Fix: Use ElastiCache (Redis/Memcached) or DynamoDB for shared session storage across instances
- Alternative: Enable ALB sticky sessions (session affinity) - but this defeats auto-scaling benefits if sessions are long-lived
- Best practice: Make application truly stateless - session data in Redis, user uploads in S3, not on instance filesystem
- Quick test: Hit ALB endpoint twice from browser - check if you’re logged out (lost session) between requests
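A minimal sketch of the shared-session fix for Problem 7, using ElastiCache Redis (the endpoint, key scheme, and TTL are assumptions):

```python
# Sketch: keep session data in Redis so any instance behind the ALB can serve any user.
import json
import redis

# Placeholder ElastiCache endpoint - substitute your cluster's primary endpoint.
r = redis.Redis(host="my-sessions.abc123.0001.use1.cache.amazonaws.com", port=6379)

SESSION_TTL_SECONDS = 3600

def save_session(session_id: str, data: dict) -> None:
    r.setex(f"session:{session_id}", SESSION_TTL_SECONDS, json.dumps(data))

def load_session(session_id: str):
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```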
Problem 8: “CloudWatch alarms not triggering - metrics show no data”
- Why: CloudWatch agent not installed or configured on EC2 instances, IAM role missing CloudWatch permissions
- Fix: Ensure the EC2 IAM role has the `CloudWatchAgentServerPolicy` managed policy attached
- Quick test: SSH to the instance and check CloudWatch agent status: `sudo systemctl status amazon-cloudwatch-agent`
- Manual test: Push a custom metric: `aws cloudwatch put-metric-data --namespace MyApp --metric-name TestMetric --value 1`
- Common issue: The alarm's namespace or dimension name doesn't match what the application actually publishes
Problem 9: “Deployment causes downtime - all instances replaced simultaneously”
- Why: Launch template updated, Auto Scaling Group performs rolling replacement, but replacement too fast
- Fix: Configure deployment settings: set `max_unavailable` to 1 and `min_healthy_percentage` to 90% to ensure a gradual rollout
- For Blue/Green: Use separate target groups and weighted ALB routing to shift traffic gradually
- Best practice: Use instance refresh with checkpoint delays to validate new instances before continuing
- Quick test: During deployment, watch the target group with `aws elbv2 describe-target-health --target-group-arn YOUR_TG_ARN` every 10 seconds
Problem 10: “Costs skyrocketing - Auto Scaling scaling out but never in”
- Why: Scale-in threshold not configured, metric stuck high, or zombie instances in ASG
- Fix: Set both scale-out AND scale-in policies with appropriate thresholds (e.g., scale out at >70% CPU, scale in at <30% CPU)
- Cost check: Run `aws ce get-cost-and-usage` to see EC2 costs vs other services
- Quick audit: Run `aws autoscaling describe-auto-scaling-groups` and compare current capacity vs desired vs max
- Reality check: Auto Scaling should oscillate around desired capacity; if it's always at max, your scale-out threshold is too sensitive
Debugging Tools & Techniques:
- Auto Scaling Activity History: Shows why scaling happened and if it succeeded/failed
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name YOUR_ASG \
    --max-records 20 \
    --query 'Activities[*].[StartTime,Description,Cause]' \
    --output table
- ALB Target Health Dashboard: Real-time view of which instances are healthy/unhealthy
- CloudWatch Container Insights: If using ECS, shows task-level CPU/memory metrics
- RDS Performance Insights: Identifies slow queries causing connection pool exhaustion
- VPC Flow Logs: Trace network traffic between ALB ↔ EC2 ↔ RDS to find security group issues
Sources:
- [Troubleshoot auto scaling issues in Amazon ECS - AWS re:Post](https://repost.aws/knowledge-center/ecs-service-auto-scaling-issues) - Troubleshooting service auto scaling in Amazon ECS
- Troubleshoot issues in Amazon EC2 Auto Scaling
- Avoiding Common Pitfalls with ECS Capacity Providers and Auto Scaling
Project 4: Container Platform (ECS Fargate or EKS)
- File: AWS_DEEP_DIVE_LEARNING_PROJECTS.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, TypeScript, Terraform HCL
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: Level 3: The “Service & Support” Model
- Difficulty: Level 3: Advanced (The Engineer)
- Knowledge Area: Containers, Kubernetes, Orchestration
- Software or Tool: Docker, ECS, EKS, Kubernetes
- Main Book: “AWS for Solutions Architects” by Shrivastava et al.
What you’ll build: A containerized microservices application deployed on either ECS with Fargate (serverless containers) or EKS (managed Kubernetes), complete with service discovery, load balancing, and auto-scaling.
Why it teaches Containers on AWS: Containers are between EC2 and Lambda—more portable than EC2, more control than Lambda. ECS teaches you AWS’s native container orchestration; EKS teaches you Kubernetes. Both force you to understand task definitions, networking modes, and service mesh concepts.
Core challenges you’ll face:
- Task Definitions (maps to container configuration): CPU/memory allocation, environment variables, port mappings, IAM task roles
- Networking Modes (maps to container networking): awsvpc mode, service discovery, ALB integration
- Service Scaling (maps to container orchestration): Target tracking, step scaling based on metrics
- ECR Integration (maps to image management): Building, tagging, pushing container images
- For EKS: Kubernetes Fundamentals (maps to orchestration): Pods, Deployments, Services, Ingress
Resources for key challenges:
- Learn Amazon ECS over a Weekend - Fast workshop-style course
- Provision an EKS cluster with Terraform - HashiCorp official tutorial
- Terraform EKS Tutorial - Spacelift comprehensive guide
- Build secure application networks with VPC Lattice - AWS Containers Blog
Key Concepts:
- ECS Task Definitions: Amazon ECS Developer Guide
- Fargate vs EC2 Launch Type: “AWS for Solutions Architects” Ch. 9
- Kubernetes on AWS: Amazon EKS User Guide
- Terraform EKS Module: terraform-aws-eks - GitHub
Difficulty: Advanced Time estimate: 2-4 weeks Prerequisites: Docker basics, Projects 1-2 completed, some Kubernetes knowledge for EKS path
Real world outcome:
- Multiple containerized services communicating with each other
- A working application accessible via ALB
- CloudWatch Container Insights showing metrics
- Ability to deploy new versions with zero downtime
- For EKS: `kubectl` commands working against your cluster
Learning milestones:
- First milestone: Single container task running on Fargate → you understand task definitions
- Second milestone: Add ALB target group with service → you understand container load balancing
- Third milestone: Add second service with service discovery → you understand microservices communication
- Fourth milestone: Add auto-scaling based on CPU → you understand container elasticity
- Final milestone: CI/CD pipeline deploying to ECS/EKS → you understand container DevOps
Real World Outcome
When you complete this project, you’ll have tangible proof of a working container platform:
For ECS Fargate:
# View your running services
$ aws ecs list-services --cluster my-cluster --no-cli-pager
{
"serviceArns": [
"arn:aws:ecs:us-east-1:123456789:service/my-cluster/api-service",
"arn:aws:ecs:us-east-1:123456789:service/my-cluster/worker-service"
]
}
# Check service health
$ aws ecs describe-services --cluster my-cluster --services api-service --no-cli-pager
{
"services": [{
"serviceName": "api-service",
"runningCount": 2,
"desiredCount": 2,
"launchType": "FARGATE",
"networkConfiguration": {
"awsvpcConfiguration": {
"subnets": ["subnet-abc123", "subnet-def456"],
"securityGroups": ["sg-xyz789"]
}
}
}]
}
# Your application responds via ALB
$ curl http://my-app-alb-1234567890.us-east-1.elb.amazonaws.com/health
{"status": "healthy", "service": "api", "task_id": "abc123def456", "version": "1.2.0"}
# Service discovery working (containers finding each other by DNS)
$ curl http://my-app-alb-1234567890.us-east-1.elb.amazonaws.com/api/worker-status
{"worker_service": "reachable", "tasks": 3, "queue_depth": 42}
# View container images in ECR
$ aws ecr describe-images --repository-name my-app --no-cli-pager
{
"imageDetails": [
{
"imageDigest": "sha256:abc123...",
"imageTags": ["latest", "v1.2.0"],
"imagePushedAt": "2025-12-20T10:30:00+00:00"
}
]
}
For EKS:
# Your Kubernetes cluster is accessible
$ kubectl cluster-info
Kubernetes control plane is running at https://ABC123.gr7.us-east-1.eks.amazonaws.com
# View your running workloads
$ kubectl get pods -n production
NAME READY STATUS RESTARTS AGE
api-deployment-7d8f9c-abc12 1/1 Running 0 2h
api-deployment-7d8f9c-def34 1/1 Running 0 2h
worker-deployment-5e6f7-xyz 1/1 Running 0 1h
$ kubectl get services -n production
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
api-service LoadBalancer 10.100.50.25 a1b2c3.us-east-1.elb.amazonaws.com 80:30001/TCP
worker-svc ClusterIP 10.100.75.10 <none> 8080/TCP
# Application accessible via Kubernetes LoadBalancer
$ curl http://a1b2c3.us-east-1.elb.amazonaws.com/health
{"status": "healthy", "pod": "api-deployment-7d8f9c-abc12", "node": "ip-10-0-1-50.ec2.internal"}
# Rolling deployment with zero downtime
$ kubectl set image deployment/api-deployment api=my-repo/api:v1.3.0 -n production
deployment.apps/api-deployment image updated
$ kubectl rollout status deployment/api-deployment -n production
Waiting for deployment "api-deployment" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "api-deployment" rollout to finish: 1 old replicas are pending termination...
deployment "api-deployment" successfully rolled out
Container Insights Dashboard:
- CPU/Memory utilization per service/pod
- Network throughput and connections
- Task/pod startup time and failure rates
- Container-level logs aggregated in CloudWatch
Service Discovery Working:
- ECS: Cloud Map DNS names (api-service.local, worker-service.local)
- EKS: Kubernetes DNS (api-service.production.svc.cluster.local)
- Containers automatically discover each other without hardcoded IPs
Rolling Deployments:
- Deploy new version without downtime
- Watch old tasks drain and new tasks become healthy
- Automatic rollback if health checks fail
The Core Question You’re Answering
“When should I use containers vs Lambda vs EC2, and what’s the difference between ECS and EKS?”
This is THE fundamental architectural decision on AWS:
- Lambda: Event-driven, sub-15-minute executions, no state, extreme auto-scaling. Use when you have unpredictable spiky traffic and stateless operations.
- Containers (ECS/EKS): Long-running processes, stateful applications, need specific runtime control, want portability. Use when you need more than 15 minutes, WebSocket connections, background workers, or existing Dockerized apps.
- EC2: Full OS control, legacy apps, specialized hardware needs, licensed software. Use when containers/Lambda don’t fit.
ECS vs EKS:
- ECS: AWS-native, simpler, less operational overhead, great for teams new to containers. Task definitions are AWS-specific JSON.
- EKS: Standard Kubernetes, portable across clouds, richer ecosystem, more complex. Use when you need Kubernetes features or multi-cloud portability.
Fargate vs EC2 Launch Type:
- Fargate: Serverless containers—you define task CPU/memory, AWS manages infrastructure. Higher per-task cost, zero operational overhead.
- EC2 Launch Type: You manage the EC2 instances in the cluster. Lower per-task cost if you have steady baseline load, more control, more ops work.
Concepts You Must Understand First
- Container Fundamentals:
- Docker images are layered filesystems (each Dockerfile instruction = layer)
- Container registries store images (ECR, Docker Hub)
- Containers are isolated processes sharing a kernel (not VMs)
- Port mapping: container internal port → host port
- Environment variables and secrets injection
- Container Orchestration:
- Scheduling: Placing containers on available compute resources based on CPU/memory requirements
- Service Discovery: How containers find each other (DNS-based)
- Load Balancing: Distributing traffic across container replicas
- Health Checks: Determining if a container is ready to receive traffic
- Auto-Scaling: Adjusting container count based on metrics
- ECS Concepts:
- Task Definition: Blueprint for your container (image, CPU, memory, ports, environment)
- Task: Running instance of a task definition (one or more containers running together)
- Service: Maintains desired count of tasks, integrates with ALB, handles deployments
- Cluster: Logical grouping of tasks/services
- Task Role: IAM permissions for your application (what the container can do)
- Execution Role: IAM permissions for ECS agent (pulling ECR images, writing logs)
- Kubernetes Concepts (for EKS):
- Pod: Smallest unit (one or more containers, shared network/storage)
- Deployment: Declarative pod management (replicas, rolling updates)
- Service: Stable endpoint for pods (ClusterIP, LoadBalancer, NodePort)
- Ingress: HTTP routing rules (maps URLs to services)
- Namespace: Logical cluster subdivision
- ConfigMap/Secret: Configuration and sensitive data injection
- Fargate vs EC2 Launch Types:
- Fargate: Specify vCPU/memory, AWS provisions infrastructure, pay per task resource/time
- EC2: You provision EC2 instances, ECS schedules tasks on them, pay for instances (can be cheaper at scale)
- Trade-off: Fargate = simplicity, EC2 = control + potential cost savings with Reserved Instances
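To make the task definition vocabulary above concrete, here is a hedged boto3 sketch that registers a minimal Fargate task definition; every ARN, name, and image below is a placeholder, and Terraform or the console achieve the same result:

```python
# Sketch: minimal Fargate task definition showing where the execution role
# (pull image, write logs) and the task role (your app's AWS permissions) plug in.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="api-service",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",   # required for Fargate
    cpu="256",              # 0.25 vCPU
    memory="512",           # MiB
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    taskRoleArn="arn:aws:iam::123456789012:role/apiServiceTaskRole",
    containerDefinitions=[{
        "name": "api",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.2.0",
        "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
        "environment": [{"name": "ENV", "value": "dev"}],
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": "/ecs/api-service",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "api",
            },
        },
    }],
)
```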
Questions to Guide Your Design
Architecture Decisions:
- When would you choose Fargate over EC2 launch type? (Hint: variable workload vs steady baseline, ops overhead tolerance)
- Should you use ECS or EKS? (Hint: team Kubernetes experience, multi-cloud needs, ecosystem requirements)
- How many containers should run in a single task/pod? (Hint: tightly coupled = same task, independent = separate tasks)
Networking:
- How do containers in the same task communicate? (Hint: localhost, shared network namespace)
- How do containers in different tasks communicate? (Hint: service discovery via DNS)
- What’s the difference between ECS service discovery (Cloud Map) and an ALB? (Hint: internal service-to-service vs external clients)
- Which networking mode should you use (awsvpc, bridge, host)? (Hint: Fargate requires awsvpc)
Security:
- How do you handle secrets in containerized applications? (Hint: Secrets Manager/Parameter Store + task definition secrets, NOT environment variables in plaintext)
- What’s the difference between task role and execution role? (Hint: execution = ECS needs it, task = your app needs it)
- How do you restrict which services can talk to each other? (Hint: security groups with awsvpc mode)
Scaling & Performance:
- Should you scale based on CPU, memory, or request count? (Hint: depends on bottleneck—CPU-bound vs I/O-bound workloads)
- How do you handle database connection pooling with auto-scaling containers? (Hint: RDS Proxy or application-level pooling)
- What happens during a deployment? (Hint: rolling update drains old tasks while starting new ones)
Operational:
- How do you get logs from containers? (Hint: awslogs driver → CloudWatch)
- How do you debug a failing container startup? (Hint: CloudWatch logs, check execution role for ECR pull permissions)
- How do you do zero-downtime deployments? (Hint: ALB health checks + rolling update strategy)
Thinking Exercise
Map Kubernetes concepts to ECS equivalents:
| Kubernetes | ECS Equivalent | Notes |
|---|---|---|
| Pod | Task | Both are one or more containers with shared resources |
| Deployment | Service | Both maintain desired count and handle updates |
| Service (ClusterIP) | Service Discovery (Cloud Map) | Internal DNS-based discovery |
| Service (LoadBalancer) | ALB Target Group | External load balancing |
| Container in Pod spec | Container Definition in Task Definition | Both define image, ports, env vars |
| ConfigMap | SSM Parameter Store / Secrets Manager | Both inject configuration |
| Secret | Secrets Manager | Both handle sensitive data |
| Namespace | Cluster (loosely) | Logical separation (but ECS clusters are less strict) |
| Ingress | ALB Listener Rules | HTTP routing rules |
| HorizontalPodAutoscaler | Service Auto Scaling | Both scale based on metrics |
Key differences:
- Kubernetes is more declarative (desired state via YAML)
- ECS is more imperative (API calls to create/update services)
- Kubernetes has richer networking (network policies, service mesh)
- ECS is simpler for AWS-only deployments
Design exercise: If you have a microservices app with 5 services, should they all go in one task definition or separate services?
- Answer: Separate ECS services (or Kubernetes deployments). Each microservice should scale independently.
- Exception: If two containers are tightly coupled (app + sidecar proxy), same task/pod makes sense.
The Interview Questions They’ll Ask
Basic Level:
- Q: What’s the difference between a Docker image and a container?
- A: An image is a read-only template (layers of filesystem changes). A container is a running instance of an image with a writable layer on top. Analogy: image = class, container = object instance.
- Q: Explain ECS task vs service.
- A: A task is a running instantiation of a task definition (one or more containers). A service maintains a desired count of tasks, integrates with load balancers, and handles deployments. Tasks are ephemeral; services ensure they keep running.
- Q: What is Fargate?
- A: Serverless compute for containers. You specify CPU/memory in your task definition, AWS provisions and manages the underlying infrastructure. No EC2 instances to manage.
- Q: How do containers in the same ECS task communicate?
- A: Via localhost. Containers in the same task share a network namespace, so they can reach each other on 127.0.0.1 using different ports.
Intermediate Level:
- Q: When would you use ECS over Lambda?
- A: When you need: (1) longer than 15-minute execution, (2) WebSocket/long-lived connections, (3) stateful processing, (4) specific runtime dependencies not available in Lambda layers, (5) existing Dockerized applications.
- Q: Explain the difference between task role and execution role in ECS.
- A: Execution role: permissions ECS needs to set up your task (pull ECR images, write CloudWatch logs). Task role: permissions your application code needs (read S3, query DynamoDB). Never confuse these—execution role is infrastructure, task role is application.
- Q: How does ECS service discovery work?
- A: Uses AWS Cloud Map to create DNS records for tasks. When you enable service discovery, each task gets a DNS entry (e.g., api-service.local). Other services query this DNS name and get IPs of healthy tasks. Updates automatically as tasks start/stop.
- Q: How do you implement zero-downtime deployments in ECS?
- A: Use rolling update deployment type with ALB. Configure minimum healthy percent (e.g., 100%) and maximum percent (e.g., 200%). ECS starts new tasks, waits for ALB health checks to pass, then drains and stops old tasks. If health checks fail, deployment rolls back.
Advanced Level:
- Q: You have an ECS service that keeps failing health checks and restarting. How do you debug?
- A: (1) Check CloudWatch logs for application errors. (2) Verify health check endpoint in ALB target group matches application. (3) Check security groups allow ALB → tasks traffic. (4) Verify task role has permissions app needs. (5) Check container startup time vs health check interval (may need longer initial delay). (6) Exec into a running task to test manually:
aws ecs execute-command --cluster X --task Y --container Z --interactive --command "/bin/sh".
- Q: How would you handle database connection pooling with auto-scaling ECS services?
- A: Options: (1) Use RDS Proxy (connection pooling/multiplexing at AWS layer). (2) Application-level pooling with conservative pool size per container (max_connections / expected_task_count). (3) Use connection pool libraries that handle connection reuse. Problem: Each task creates its own pool, so 10 tasks with 10 connections each = 100 DB connections. RDS Proxy solves this.
- Q: Compare ECS awsvpc networking mode with bridge mode.
- A: awsvpc: Each task gets its own ENI with private IP from VPC subnet. Security groups apply at task level. Required for Fargate. Better isolation. Bridge: Containers share host’s network via Docker bridge. Port mapping required (host port != container port). Multiple tasks on same host need different host ports. awsvpc is recommended for new deployments.
- Q: When would you choose EKS over ECS?
- A: When you need: (1) Kubernetes-specific features (CRDs, operators, Helm charts), (2) multi-cloud portability (same K8s manifests work on GKE/AKS), (3) existing Kubernetes expertise on team, (4) vendor-neutral orchestration. ECS is simpler and AWS-native; EKS is more complex but standard.
- Q: How do you handle secrets in containerized apps on AWS?
- A: Store secrets in AWS Secrets Manager or SSM Parameter Store (SecureString). Reference them in task definition secrets field (NOT environment variables). ECS retrieves secrets at task startup using execution role permissions, injects them as environment variables into container. Secrets are encrypted at rest and in transit. Never hardcode secrets in Dockerfile or pass as plaintext env vars.
- Q: Explain a scenario where you’d use the EC2 launch type instead of Fargate.
- A: When you have: (1) steady baseline load (Reserved Instances cheaper than Fargate per-task pricing), (2) need for specific EC2 instance types (GPU, high memory), (3) tasks require privileged mode or host networking, (4) need to run your own AMI with custom configs. Trade-off: lower cost and more control, but you manage cluster capacity and OS patching.
- Q: Your containerized application experiences cold start delays. How do you optimize?
- A: (1) Reduce image size (multi-stage builds, minimal base images like Alpine). (2) Optimize Dockerfile layer caching (COPY dependencies before code). (3) Pre-warm containers if predictable traffic spikes. (4) Use smaller task CPU/memory if over-provisioned (faster scheduling). (5) For EKS: Use cluster autoscaler with appropriate scaling configs. (6) Consider keeping minimum task count > 0 to avoid cold starts entirely.
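For the secrets question above, a hedged sketch of the runtime alternative: fetching a secret directly with boto3 (the secret name is a placeholder). The task definition `secrets` field avoids even this code by injecting the value at container startup.

```python
# Sketch: read a secret from AWS Secrets Manager at runtime.
# The task role (not the execution role) needs secretsmanager:GetSecretValue.
import json
import boto3

secrets = boto3.client("secretsmanager")

def get_db_credentials(secret_name: str = "prod/app/db") -> dict:  # placeholder name
    resp = secrets.get_secret_value(SecretId=secret_name)
    return json.loads(resp["SecretString"])

creds = get_db_credentials()
# creds -> {"username": "...", "password": "...", "host": "...", ...}
```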
Hints in Layers
Level 1: Getting Started
- Start with a single-container task definition running nginx or a simple Python/Node.js app
- Use Fargate to avoid EC2 cluster management
- Deploy to public subnet first (simpler than private with NAT)
- Use AWS Console to create your first task definition—you’ll see all the options
- Put “latest” tag on your first image (iterate fast)
Level 2: Adding Realism
- Move to private subnets with NAT gateway (production pattern)
- Add ALB in front of your service for stable endpoint
- Create second container that calls the first (understand inter-service communication)
- Use ECR instead of Docker Hub (AWS-native, faster pulls)
- Start using specific version tags (v1.0.0, not “latest”)
Level 3: Production Patterns
- Enable service discovery (Cloud Map) for service-to-service DNS
- Configure auto-scaling based on ALB request count or custom CloudWatch metrics
- Set up proper health checks (readiness vs liveness)
- Add Container Insights for metrics
- Implement rolling deployments with deployment circuit breaker (auto-rollback on failure)
Level 4: Advanced Scenarios
- Multi-container task with sidecar pattern (app + logging agent)
- Task role granting only necessary permissions (least privilege)
- Secrets injection from Secrets Manager (no plaintext env vars)
- Blue/green deployments using CodeDeploy
- For EKS: Implement Horizontal Pod Autoscaler and Cluster Autoscaler together
Level 5: Mastery
- CI/CD pipeline: GitHub Actions → build image → push to ECR → update ECS service
- Canary deployments (send 10% traffic to new version, monitor, then 100%)
- Service mesh (App Mesh for ECS or Istio for EKS) for advanced routing/observability
- Cross-region replication for DR (replicate ECR images, deploy to multiple regions)
- Custom metrics from containers to CloudWatch for scaling (e.g., queue depth)
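For the custom-metric scaling item above, a hedged sketch of publishing queue depth to CloudWatch (queue URL, namespace, and dimensions are assumptions); a target tracking or step scaling policy can then act on this metric:

```python
# Sketch: publish SQS queue depth as a custom CloudWatch metric for scaling.
import boto3

sqs = boto3.client("sqs")
cloudwatch = boto3.client("cloudwatch")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/report-jobs"  # placeholder

def publish_queue_depth() -> None:
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    depth = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    cloudwatch.put_metric_data(
        Namespace="MyApp/Workers",
        MetricData=[{
            "MetricName": "QueueDepth",
            "Dimensions": [{"Name": "Service", "Value": "report-processor"}],
            "Value": depth,
            "Unit": "Count",
        }],
    )

if __name__ == "__main__":
    publish_queue_depth()  # run on a schedule (e.g., EventBridge + Lambda)
```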
Books That Will Help
| Book Title | Author(s) | Relevant Chapters | Why It Helps |
|---|---|---|---|
| AWS for Solutions Architects | Saurabh Shrivastava | Ch. 9: Containers on AWS | Best coverage of ECS architecture patterns, Fargate vs EC2 decisions |
| Docker Deep Dive | Nigel Poulton | Ch. 3-5, 8-9 | Container fundamentals, image layers, networking modes |
| Kubernetes in Action | Marko Luksa | Ch. 3-7 | Core K8s concepts (pods, services, deployments) for EKS path |
| Kubernetes Up & Running | Kelsey Hightower et al. | Ch. 5-6, 9-10 | Practical K8s patterns, service discovery, load balancing |
| Amazon Web Services in Action | Andreas Wittig & Michael Wittig | Ch. 14: Containers | Step-by-step ECS tutorial with CloudFormation examples |
| The DevOps Handbook | Gene Kim et al. | Part IV: Technical Practices | CI/CD for containers, deployment strategies (blue/green, canary) |
| Site Reliability Engineering | Google SRE Team | Ch. 7, 21 | Load balancing, monitoring containerized systems at scale |
| Production Kubernetes | Josh Rosso et al. | Ch. 4-6, 11 | Production-grade EKS: networking, security, observability |
| Container Security | Liz Rice | Ch. 2-4, 7 | Securing container images, runtime, orchestrator (critical for prod) |
| Designing Data-Intensive Applications | Martin Kleppmann | Ch. 11: Stream Processing | Understanding when to use containers for stateful vs stateless workloads |
Quick Reference:
- New to containers? Start with “Docker Deep Dive” Ch. 3-5
- Choosing ECS? Focus on “AWS for Solutions Architects” Ch. 9
- Choosing EKS? Read “Kubernetes Up & Running” Ch. 5-6 first
- Ready for production? “Container Security” is mandatory
- Need CI/CD? “The DevOps Handbook” Part IV
Common Pitfalls & Debugging
Problem 1: “ECS tasks fail to start - stuck in PENDING or PROVISIONING state”
- Why: Insufficient resources in cluster (EC2 launch type), ENI limit reached in VPC (Fargate), or invalid task definition
- Fix: Check ECS events tab in AWS Console for exact error message
- Quick test: `aws ecs describe-tasks --cluster YOUR_CLUSTER --tasks TASK_ARN --query 'tasks[0].stoppedReason'`
- Common causes:
- ENI limit: Each Fargate task needs its own ENI - VPC subnet exhausted available IPs or hit ENI quota
- Resource constraints: For EC2 launch type, container instances don’t have enough CPU/memory
- Image pull failure: ECR permissions missing on execution role or image doesn’t exist
- Solution: For ENI issues, use larger CIDR block for private subnets or request ENI quota increase
Problem 2: “Cannot pull image from ECR - ‘CannotPullContainerError’”
- Why: ECS execution role lacks ECR permissions, or VPC doesn’t have route to ECR
- Fix: Ensure the execution role has `AmazonECSTaskExecutionRolePolicy` attached
- Quick test: Manually pull the image from an EC2 instance in the same VPC using the task's IAM role: `aws ecr get-login-password | docker login --username AWS --password-stdin ECR_URI`
- VPC issue: If using private subnets without NAT, create VPC endpoints for `com.amazonaws.REGION.ecr.dkr`, `com.amazonaws.REGION.ecr.api`, and `com.amazonaws.REGION.s3`
- Cost savings: VPC endpoints eliminate NAT Gateway data processing charges ($0.045/GB) for ECR pulls
Problem 3: “ECS service keeps restarting - tasks fail health checks repeatedly”
- Why: Health check configuration mismatch, application not binding to correct port, or startup time exceeds grace period
- Fix: Verify ALB target group health check settings match application endpoint
- Quick test: Exec into a running task and curl the health check endpoint: `aws ecs execute-command --cluster X --task Y --container Z --interactive --command "/bin/sh"`
- Common mistakes:
  - Application binds to `localhost` instead of `0.0.0.0`, so the ALB can't reach it
  - Health check path wrong (e.g., `/health` vs `/healthz`)
  - Health check interval too aggressive: 5 seconds with a 30-second app startup = guaranteed failure
- Fix: Increase `healthCheckGracePeriodSeconds` in the service definition to 60-120 seconds for slow-starting apps
Problem 4: “ECS task can’t connect to RDS database in same VPC”
- Why: Security group misconfiguration - RDS security group doesn’t allow inbound from ECS task security group
- Fix: The RDS security group must have an inbound rule: Type: PostgreSQL/MySQL, Source: ECS_TASK_SECURITY_GROUP_ID
- Quick test: From the ECS task, run `nc -zv RDS_ENDPOINT 5432` to test connectivity
- Network debugging: Check that the task's ENI is in the same VPC as RDS and route tables allow local VPC traffic
- Common mistake: Using CIDR block instead of security group ID - tasks get dynamic IPs with awsvpc mode
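If your container image doesn't ship netcat, a pure-Python equivalent of the connectivity test (the endpoint is a placeholder):

```python
# Sketch: Python stand-in for `nc -zv RDS_ENDPOINT 5432` on minimal images.
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(can_connect("my-db.cluster-xyz.us-east-1.rds.amazonaws.com", 5432))
```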
Problem 5: “ECS service auto-scaling not working - tasks not scaling up/down”
- Why: CloudWatch alarm not configured correctly, Application Auto Scaling policies not attached, or metrics not being published
- Fix: Verify a target tracking policy exists: `aws application-autoscaling describe-scaling-policies --service-namespace ecs`
- Quick test: Check the CloudWatch metric is being published: `aws cloudwatch get-metric-statistics --namespace AWS/ECS --metric-name CPUUtilization --dimensions Name=ServiceName,Value=YOUR_SERVICE`
- ECS-specific issue: Application Auto Scaling turns OFF scale-in during deployments and resumes it afterward - this is expected behavior
- Capacity provider issue: If using EC2 launch type with capacity providers, ensure managed scaling is enabled and target capacity is set (typically 100%)
Problem 6: “Container logs not appearing in CloudWatch Logs”
- Why: Execution role lacks CloudWatch Logs permissions, log group doesn’t exist, or awslogs driver not configured in task definition
- Fix: Check the execution role has `logs:CreateLogStream` and `logs:PutLogEvents` permissions
- Quick test: Verify the log group exists: `aws logs describe-log-groups --log-group-name-prefix /ecs/YOUR_TASK_FAMILY`
- Task definition fix: Ensure `logConfiguration` uses the `awslogs` driver with the correct region and log group name
- Auto-create logs: Set `awslogs-create-group: true` in the task definition to auto-create log groups
Problem 7: “ECS task ‘CannotStartContainerError’ - container keeps crashing on startup”
- Why: Application error in container code, missing environment variables, incorrect entrypoint/command, or resource limits too restrictive
- Fix: Check the stopped task reason: `aws ecs describe-tasks --tasks TASK_ARN --query 'tasks[0].containers[0].reason'`
- Debug locally: Run the same image locally with the same env vars: `docker run -e VAR=value YOUR_IMAGE`
- Memory issue: If you see "Essential container in task exited", increase task memory; JVM applications often need 2-4x more than you think
- ECS Exec: For running tasks, use `aws ecs execute-command` to exec into the container and debug live
Problem 8: “EKS pods pending - ‘Insufficient cpu’ or ‘Insufficient memory’”
- Why: Node group doesn’t have capacity for pod resource requests, cluster autoscaler not configured, or resource requests too high
- Fix: Check pending pod events: `kubectl describe pod POD_NAME | grep -A 10 Events`
- Quick fix: Reduce pod resource requests or add more nodes to the cluster
- Long-term solution: Configure Cluster Autoscaler or Karpenter for automatic node provisioning based on pending pods
- Cost optimization: Use spot instances in node group for non-critical workloads (70% cost savings)
Problem 9: “Fargate tasks costing more than expected - bill is huge”
- Why: Over-provisioned CPU/memory, tasks running 24/7 when not needed, or data transfer charges from NAT Gateway
- Fix: Analyze CloudWatch Container Insights to see actual CPU/memory usage vs allocated
- Right-sizing: If consistently using <50% of allocated resources, reduce task size (Fargate charges per vCPU-hour and GB-hour)
- Cost comparison: Fargate at 0.25 vCPU + 0.5 GB costs roughly $0.012/hour (us-east-1); an EC2 t3.micro costs roughly $0.01/hour - but with EC2 you manage the instances yourself
- Data transfer: Use VPC endpoints for ECR/S3/other AWS services to eliminate NAT Gateway charges ($0.045/GB)
Problem 10: “Docker image push to ECR failing - ‘denied: Your authorization token has expired’”
- Why: ECR login token is valid for only 12 hours and expired
- Fix: Re-authenticate: `aws ecr get-login-password --region REGION | docker login --username AWS --password-stdin ECR_URI`
- CI/CD automation: In pipelines, always run `aws ecr get-login-password` before pushing images
- Alternative: Use the Docker credential helper to refresh tokens automatically: install `amazon-ecr-credential-helper`
Problem 11: “ECS service deployment stuck - new tasks start but old tasks don’t drain”
- Why: ALB connection draining taking too long, application not handling SIGTERM gracefully, or deployment configuration issues
- Fix: Check deployment circuit breaker events in ECS console
- Quick test: View service events: `aws ecs describe-services --cluster X --services Y --query 'services[0].events[:10]'`
- Graceful shutdown: Ensure the application handles SIGTERM and closes connections within the stop timeout (default 30 seconds) - see the sketch after this list
- Deployment config: Verify `minimumHealthyPercent` (e.g., 100%) and `maximumPercent` (e.g., 200%) allow room for new tasks
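A minimal sketch of the SIGTERM handling mentioned above (the worker loop is illustrative; web frameworks usually expose their own shutdown hooks):

```python
# Sketch: stop taking new work on SIGTERM so ECS can drain the task cleanly
# before the stop timeout (default 30 seconds) expires.
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True  # stop accepting new work; finish in-flight work

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    # ... poll a queue or serve requests ...
    time.sleep(1)

# Close DB connections, flush logs, then exit before stopTimeout elapses.
sys.exit(0)
```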
Problem 12: “EKS nodes unable to join cluster - nodes in NotReady state”
- Why: IAM role for nodes missing policies, security groups blocking kubelet communication, or wrong AMI/userdata
- Fix: Check the node IAM role has `AmazonEKSWorkerNodePolicy`, `AmazonEC2ContainerRegistryReadOnly`, and `AmazonEKS_CNI_Policy`
- Quick test: SSH to the node and check kubelet status: `systemctl status kubelet`
- Security groups: The cluster security group must allow inbound from the node security group on ports 443 (API) and 10250 (kubelet)
- Logs: Check `/var/log/cloud-init-output.log` on the node for userdata errors
Debugging Tools & Techniques:
- ECS Exec: Interactive debugging in running containers without SSH
  aws ecs execute-command \
    --cluster my-cluster \
    --task TASK_ID \
    --container app \
    --interactive \
    --command "/bin/bash"
- CloudWatch Container Insights: CPU, memory, network, disk metrics at task/pod level
- AWS X-Ray: Distributed tracing for containerized microservices (requires X-Ray daemon sidecar)
- kubectl logs/describe: Essential EKS debugging
  kubectl logs POD_NAME -f --tail=50
  kubectl describe pod POD_NAME
  kubectl get events --sort-by='.lastTimestamp'
- ECS Service Events: Check the ECS console Events tab - it shows deployment progress and errors in real time
- VPC Flow Logs: Diagnose network connectivity issues between containers and databases/services
Sources:
- [Troubleshoot scaling issues with Amazon ECS capacity providers - AWS re:Post](https://repost.aws/knowledge-center/ecs-capacity-provider-scaling) - Troubleshooting capacity provider scaling in Amazon ECS
- Amazon ECS troubleshooting
- Amazon EKS troubleshooting
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| VPC from Scratch | Intermediate | 1-2 weeks | ⭐⭐⭐⭐⭐ (Foundational) | ⭐⭐⭐ |
| Serverless Pipeline | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ (Event-driven) | ⭐⭐⭐⭐ |
| Auto-Scaling Web App | Intermediate | 2-3 weeks | ⭐⭐⭐⭐⭐ (Classic AWS) | ⭐⭐⭐ |
| Container Platform | Advanced | 2-4 weeks | ⭐⭐⭐⭐⭐ (Modern AWS) | ⭐⭐⭐⭐⭐ |
Recommendation
Start with Project 1 (VPC from Scratch). Everything else depends on understanding networking. A misconfigured VPC will break Lambda VPC access, ECS service discovery, RDS connectivity—everything. Once you can confidently explain why your private subnet routes to a NAT gateway and your public subnet routes to an IGW, move on.
Then do Project 2 (Serverless Pipeline) to understand event-driven architecture—this is where modern AWS shines. Step Functions + Lambda is incredibly powerful once you “get it.”
Then Project 3 (Auto-Scaling Web App) for the “traditional” AWS architecture that many existing systems use. This teaches you fundamentals that apply everywhere.
Finally Project 4 (Containers) brings it all together—you need VPC knowledge, you can integrate with Lambda/Step Functions, and you’ll build on auto-scaling concepts.
Project 5: Full-Stack SaaS Platform
What you’ll build: A complete SaaS application with:
- Multi-tenant architecture with isolated VPCs per environment (dev/staging/prod)
- API Gateway + Lambda for REST/GraphQL endpoints
- ECS Fargate running background workers
- Step Functions orchestrating complex business workflows
- RDS Aurora for relational data
- DynamoDB for high-speed session/cache data
- S3 + CloudFront for static assets and file uploads
- Cognito for authentication
- CI/CD with CodePipeline or GitHub Actions
- Infrastructure defined entirely in Terraform
- CloudWatch dashboards, X-Ray tracing, and alarms
Why this is the ultimate AWS learning project: This forces you to make real architectural decisions—when to use Lambda vs ECS, how to structure your VPCs for multi-environment deployments, how services talk to each other across boundaries. You’ll hit every AWS gotcha: Lambda cold starts affecting user experience, ECS task role permissions, RDS connection pooling, S3 CORS issues, CloudFront cache invalidation timing.
Core challenges you’ll face:
- Multi-Environment Architecture: Terraform workspaces or separate state files, environment-specific configurations
- API Design: REST vs GraphQL, API Gateway vs ALB, authentication flows
- Data Architecture: When to use RDS vs DynamoDB, cross-service data access patterns
- Async Processing: SQS queues, dead-letter queues, exactly-once processing
- Security: IAM policies, secrets management (Secrets Manager/Parameter Store), encryption at rest and in transit
- Observability: Distributed tracing, centralized logging, meaningful metrics
- Cost Optimization: Right-sizing, Reserved Instances vs Savings Plans, S3 lifecycle policies
Key Concepts:
- Well-Architected Framework: AWS Well-Architected - AWS Official
- SaaS Architecture: “AWS for Solutions Architects” Ch. 10-12 - Shrivastava et al.
- Infrastructure as Code at Scale: Terraform Best Practices - HashiCorp
- Serverless Patterns: “Designing Data-Intensive Applications” Ch. 11-12 - Martin Kleppmann (for async patterns)
Difficulty: Advanced Time estimate: 1-2 months Prerequisites: Projects 1-4 completed
Real world outcome:
- A working SaaS application you can demo to employers
- User registration, login, and authenticated API calls
- Background job processing visible in Step Functions console
- Multi-environment deployment from a single codebase
- Cost monitoring dashboard showing your AWS spend
- A portfolio piece that demonstrates comprehensive AWS knowledge
Learning milestones:
- First milestone: Auth working (Cognito + API Gateway) → you understand identity on AWS
- Second milestone: Core API + database working → you understand data tier patterns
- Third milestone: Background processing working → you understand async architecture
- Fourth milestone: Multi-environment deployment → you understand infrastructure management
- Final milestone: Observability + alerting working → you understand production operations
Real World Outcome
When you complete this project, you’ll have a production-ready SaaS application that demonstrates mastery of AWS:
# 1. Multi-environment infrastructure deployment
$ cd terraform
$ terraform workspace list
default
dev
* staging
production
$ terraform workspace select production
Switched to workspace "production".
$ terraform apply
Plan: 47 to add, 0 to change, 0 to destroy.
Outputs:
api_gateway_url = "https://api.yourapp.com"
cloudfront_url = "https://d1234abcd.cloudfront.net"
cognito_user_pool_id = "us-east-1_AbCd1234"
rds_endpoint = "prod-db.cluster-xyz.us-east-1.rds.amazonaws.com"
alb_dns = "prod-alb-123456789.us-east-1.elb.amazonaws.com"
# 2. User registration and authentication flow
$ curl -X POST https://api.yourapp.com/auth/signup \
-H "Content-Type: application/json" \
-d '{"email": "user@example.com", "password": "SecurePass123!"}'
{
"message": "User created successfully",
"userId": "a1b2c3d4-5678-90ab-cdef-1234567890ab",
"confirmationRequired": true
}
# 3. Confirm user and get JWT token
$ curl -X POST https://api.yourapp.com/auth/confirm \
-d '{"email": "user@example.com", "code": "123456"}'
{
"accessToken": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
"idToken": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
"expiresIn": 3600
}
# 4. Make authenticated API call
$ TOKEN="eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9..."
$ curl -H "Authorization: Bearer $TOKEN" \
https://api.yourapp.com/v1/users/profile
{
"userId": "a1b2c3d4-5678-90ab-cdef-1234567890ab",
"email": "user@example.com",
"createdAt": "2025-12-28T10:30:00Z",
"subscription": "free"
}
# 5. Trigger background job processing
$ curl -X POST -H "Authorization: Bearer $TOKEN" \
https://api.yourapp.com/v1/reports/generate \
-d '{"reportType": "analytics", "dateRange": "last_30_days"}'
{
"jobId": "job-9876-5432-1234",
"status": "QUEUED",
"message": "Report generation started. Check status at /v1/jobs/job-9876-5432-1234"
}
# 6. Monitor Step Functions execution
$ aws stepfunctions list-executions \
--state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:ReportGenerator \
--max-results 5 \
--no-cli-pager
{
"executions": [
{
"executionArn": "arn:aws:states:us-east-1:123456789012:execution:ReportGenerator:job-9876-5432-1234",
"stateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:ReportGenerator",
"name": "job-9876-5432-1234",
"status": "RUNNING",
"startDate": "2025-12-28T10:35:00.000Z"
}
]
}
# 7. Check ECS Fargate background workers
$ aws ecs list-tasks --cluster production-workers --service-name report-processor
{
"taskArns": [
"arn:aws:ecs:us-east-1:123456789012:task/production-workers/abc123def456"
]
}
$ aws ecs describe-tasks --cluster production-workers --tasks abc123def456 \
--query 'tasks[0].{Status:lastStatus,Health:healthStatus,CPU:cpu,Memory:memory}'
{
"Status": "RUNNING",
"Health": "HEALTHY",
"CPU": "512",
"Memory": "1024"
}
# 8. View CloudWatch metrics and logs
$ aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name Count \
--dimensions Name=ApiName,Value=YourAppAPI \
--start-time 2025-12-28T00:00:00Z \
--end-time 2025-12-28T23:59:59Z \
--period 3600 \
--statistics Sum
# API Gateway handled 1,247 requests today
$ aws logs tail /aws/lambda/api-handler --follow
2025-12-28T10:35:12 START RequestId: abc-123-def-456
2025-12-28T10:35:12 [INFO] User a1b2c3d4 requested report generation
2025-12-28T10:35:13 [INFO] Job queued to SQS: job-9876-5432-1234
2025-12-28T10:35:14 END RequestId: abc-123-def-456 Duration: 342ms
# 9. Cost monitoring dashboard
$ aws ce get-cost-and-usage \
--time-period Start=2025-12-01,End=2025-12-28 \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=SERVICE
{
"ResultsByTime": [
{
"Groups": [
{"Keys": ["Amazon RDS"], "Metrics": {"BlendedCost": {"Amount": "45.23"}}},
{"Keys": ["AWS Lambda"], "Metrics": {"BlendedCost": {"Amount": "12.67"}}},
{"Keys": ["Amazon S3"], "Metrics": {"BlendedCost": {"Amount": "8.34"}}},
{"Keys": ["Amazon CloudFront"], "Metrics": {"BlendedCost": {"Amount": "5.12"}}},
{"Keys": ["Amazon ECS"], "Metrics": {"BlendedCost": {"Amount": "22.89"}}}
]
}
]
}
# Total monthly cost: ~$94 for a production SaaS platform
# 10. X-Ray distributed tracing
$ aws xray get-trace-summaries \
--start-time 2025-12-28T10:00:00 \
--end-time 2025-12-28T11:00:00 \
--filter-expression 'service("api-handler")'
# Shows complete request path: API Gateway → Lambda → DynamoDB → Step Functions → SQS
What you’ll see in the AWS Console:
- API Gateway: REST API with Cognito authorizer, request/response logs, throttling metrics
- Cognito: User pool with users, MFA settings, identity providers
- Lambda Functions: Multiple functions (auth, API handlers, processors), invocation metrics, error rates
- Step Functions: State machine executions with visual workflow, input/output at each step
- ECS Fargate: Running tasks processing background jobs, CloudWatch Container Insights showing CPU/memory
- RDS Aurora: Multi-AZ PostgreSQL cluster with read replicas, Performance Insights showing query performance
- DynamoDB: Tables with on-demand billing, global secondary indexes, point-in-time recovery enabled
- S3 + CloudFront: Buckets for uploads/static assets, CloudFront distribution with HTTPS, cache hit ratio
- CloudWatch: Custom dashboards showing API latency, error rates, Lambda cold starts, database connections
- X-Ray: Service map showing all interconnections, trace timeline showing bottlenecks
The Core Question You’re Answering
“How do I architect a production-ready, multi-service AWS application that is secure, observable, cost-effective, and can scale from 10 to 10,000 users?”
This is the question AWS Solutions Architects answer daily. By building this project, you’ll make every decision they make: compute choices, data storage patterns, security boundaries, monitoring strategies.
Concepts You Must Understand First
Before starting this capstone project, ensure you’ve mastered these concepts from Projects 1-4:
1. VPC Networking (from Project 1)
- Why you need separate VPCs for dev/staging/prod (or VPC peering if shared)
- How to design CIDR blocks that don’t overlap between environments
- Book Reference: Review Project 1 concepts before starting
2. Serverless Architecture (from Project 2)
- Lambda cold starts and how they affect user-facing APIs
- Step Functions orchestration patterns for complex workflows
- S3 event-driven processing
- Book Reference: “AWS Lambda in Action” by Danilo Poccia, Ch. 4, 8
3. Auto-Scaling and Load Balancing (from Project 3)
- When to use Lambda vs ECS for different workloads
- RDS connection pooling with auto-scaled services
- Book Reference: “Designing Data-Intensive Applications” Ch. 1
4. Container Orchestration (from Project 4)
- ECS Fargate for background workers vs Lambda for short tasks
- Container security and IAM task roles
- Book Reference: “AWS for Solutions Architects” Ch. 9
5. IAM Security Model
- Principle of least privilege for service-to-service communication
- Resource-based policies vs identity-based policies
- Book Reference: “AWS Security” by Dylan Shield, Ch. 3
6. Cognito Authentication
- User pools vs identity pools
- JWT token validation in API Gateway
- Reference: Cognito Developer Guide
Questions to Guide Your Design
Architecture Decisions:
- When should you use Lambda vs ECS Fargate?
- Hint: Lambda for < 15 min, stateless, event-driven. ECS for long-running, WebSockets, background workers
- How do you handle secrets across multiple services?
- Hint: Secrets Manager with automatic rotation, reference in Lambda/ECS via ARN
- Should you use RDS or DynamoDB for your data tier?
- Hint: RDS for relational queries (users, orders), DynamoDB for high-throughput key-value (sessions, caching)
- How do you structure Terraform for multi-environment deployments?
- Option 1: Terraform workspaces (shared code, different state)
- Option 2: Separate directories per environment (more explicit)
Security Questions:
- How do you prevent API Gateway from being called without authentication?
- Hint: Cognito authorizer on API Gateway, validates JWT on every request
- What IAM permissions does Lambda need to call Step Functions?
- Hint: `states:StartExecution` on the state machine ARN
- How do you secure S3 buckets while allowing CloudFront access?
- Hint: Origin Access Identity (OAI) or Origin Access Control (OAC), bucket policy allowing only CloudFront
Cost Questions:
- What’s the most expensive component in your architecture?
- Likely answer: Multi-AZ RDS Aurora (can be $100+/month), NAT Gateway ($0.045/GB), Fargate tasks running 24/7
- How do you reduce Lambda costs?
- Right-size memory (Lambda Power Tuning), reduce package size, use ARM64 Graviton2 processors (20% cost savings)
- Should you use on-demand or provisioned capacity for DynamoDB?
- On-demand for unpredictable traffic, provisioned for steady predictable loads (cheaper at scale)
Thinking Exercise
The “Multi-Tenant Data Isolation” Problem
Your SaaS has 100 customers. Should you:
- A) Use one RDS database with a `tenant_id` column in every table?
- B) Create a separate RDS instance per customer?
- C) Create a separate database per customer on a shared RDS cluster?
- D) Use DynamoDB with partition keys including `tenant_id`?
Trace the trade-offs: Cost, isolation, backup complexity, query performance, schema changes.
Most teams choose: C for high-value customers (compliance/isolation), A for small customers (cost-effective). This is called a “tiered” multi-tenant architecture.
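For option D, a hedged sketch of tenant-scoped keys in DynamoDB (the table name and key schema are assumptions):

```python
# Sketch: prefix the partition key with the tenant ID so every query is
# naturally scoped to a single tenant.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app-data")  # placeholder table name

def put_order(tenant_id: str, order_id: str, item: dict) -> None:
    table.put_item(Item={
        "pk": f"TENANT#{tenant_id}",
        "sk": f"ORDER#{order_id}",
        **item,
    })

def list_orders(tenant_id: str):
    resp = table.query(
        KeyConditionExpression=Key("pk").eq(f"TENANT#{tenant_id}")
        & Key("sk").begins_with("ORDER#")
    )
    return resp["Items"]
```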
The Interview Questions They’ll Ask
Basic Level:
- “Explain the difference between Cognito User Pools and Identity Pools.”
- User Pools: Authentication (sign up, sign in, get JWTs). Identity Pools: Authorization (exchange tokens for AWS credentials to access S3/DynamoDB directly)
- “What’s the maximum Lambda execution time and how do you handle longer tasks?”
- 15 minutes. For longer: Use ECS Fargate, or chain multiple Lambda invocations via Step Functions
- “How does API Gateway integrate with Lambda?”
- Proxy integration: API Gateway passes HTTP request as event to Lambda. Lambda returns HTTP response format
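A minimal sketch of that proxy-integration contract (the claim lookup assumes a Cognito authorizer on a REST API; field values are illustrative):

```python
# Sketch: Lambda proxy integration - API Gateway passes the HTTP request as
# `event`, and the function must return this response shape.
import json

def handler(event, context):
    claims = (event.get("requestContext", {})
                   .get("authorizer", {})
                   .get("claims", {}))
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": "hello", "userId": claims.get("sub", "anonymous")}),
    }
```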
Intermediate Level:
- “Your API has 1000 req/sec. Should you use Lambda or ECS?”
- Lambda can handle it (burst concurrency 1000-3000), but if consistent: consider Provisioned Concurrency or ECS for cost-effectiveness
- “How do you implement zero-downtime database migrations in RDS?”
- Blue/Green deployments with RDS, or use Aurora Serverless v2 with instant scaling
- “Explain how CloudFront caching works with S3 origins.”
- CloudFront caches at edge locations based on TTL. S3 bucket policy allows only CloudFront OAI. Cache invalidations cost $0.005 per path
- “What’s the difference between Step Functions Standard and Express workflows?”
- Standard: Long-running (up to 1 year), exactly-once execution, audit history. Express: Short-lived (<5 min), high-volume (100k/sec), at-least-once, cheaper
Advanced Level:
- “Your Lambda functions in VPC have 10-second cold starts. Debug and fix.”
- VPC cold starts = ENI provisioning. Solutions: Use VPC endpoints for AWS services, increase Lambda memory, consider Hyperplane ENIs (modern approach), or move Lambda out of VPC if possible
- “You have 100 concurrent Lambda functions connecting to RDS. What breaks?”
- RDS max_connections exhausted (typically 150-500). Fix: Use RDS Proxy for connection pooling/multiplexing
- “Design a disaster recovery strategy for this SaaS platform (RTO < 1 hour, RPO < 15 minutes).”
- Multi-region: Aurora Global Database (replicates to secondary region), S3 cross-region replication, Route53 health checks for failover. RPO: Aurora replication lag typically <1 second. RTO: Automated failover + Terraform apply in DR region
- “How do you handle API versioning (v1, v2) without breaking existing clients?”
- API Gateway stages (v1, v2) pointing to different Lambda versions, or path-based routing (/v1/, /v2/). Use Lambda aliases for gradual rollout
- “Your CloudFront distribution has a 20% cache hit ratio (should be 80%+). Why?”
- Query strings/headers breaking cache keys, dynamic content not cacheable, TTL too short, incorrect Cache-Control headers from origin
Hints in Layers
Hint 1: Start with one environment (dev) in one region. Don’t build all 3 environments at once; get dev working end-to-end first.
Hint 2: Use Terraform modules for reusable components
module "vpc" {
source = "./modules/vpc"
environment = var.environment
cidr_block = var.vpc_cidr
}
module "api" {
source = "./modules/api-gateway"
cognito_user_pool_id = module.auth.user_pool_id
}
Hint 3: Enable CORS on API Gateway for web apps. If your frontend calls the API from a browser, CORS is required:
responses:
default:
headers:
Access-Control-Allow-Origin: "'*'"
Access-Control-Allow-Headers: "'Content-Type,Authorization'"
Hint 4: Use Lambda Powertools for structured logging
from aws_lambda_powertools import Logger, Tracer

logger = Logger()
tracer = Tracer()

@logger.inject_lambda_context
@tracer.capture_lambda_handler
def handler(event, context):
    logger.info("Processing request", extra={"userId": event.get("userId")})
    return {"statusCode": 200}
Hint 5: Implement health checks for all services
- Lambda: Create a /health endpoint that checks database connectivity (see the sketch after this list)
- ECS: Configure ALB health checks with reasonable intervals (30s) and unhealthy threshold (3)
- RDS: Use CloudWatch Enhanced Monitoring
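For the Lambda /health endpoint, a minimal sketch (get_connection() is a hypothetical helper that returns your module-level database connection; the exact check depends on your driver):
import json

def health_handler(event, context):
    try:
        conn = get_connection()  # hypothetical helper returning a reused, module-level connection
        with conn.cursor() as cur:
            cur.execute("SELECT 1")  # cheapest possible connectivity check
        return {"statusCode": 200, "body": json.dumps({"status": "healthy"})}
    except Exception as exc:
        # 503 lets ALB/API Gateway health checks mark the target unhealthy
        return {"statusCode": 503, "body": json.dumps({"status": "unhealthy", "error": str(exc)})}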
Books That Will Help
| Book | Author(s) | Relevant Chapters | Why It Helps |
|---|---|---|---|
| AWS for Solutions Architects | Saurabh Shrivastava | Ch. 8 (Serverless), Ch. 9 (Containers), Ch. 10-12 (SaaS Architecture) | Complete AWS architecture patterns for multi-service apps |
| Designing Data-Intensive Applications | Martin Kleppmann | Ch. 1 (Scalability), Ch. 5-7 (Replication, Partitioning), Ch. 11 (Streams) | Foundational understanding of distributed systems, data architecture decisions |
| Terraform: Up & Running | Yevgeniy Brikman | Ch. 3 (State), Ch. 5 (Modules), Ch. 7 (Multi-Environment) | Infrastructure as Code best practices, Terraform workspaces vs directories |
| AWS Security | Dylan Shields | Ch. 3 (IAM), Ch. 5 (Data Protection), Ch. 7 (Monitoring) | Security architecture, secrets management, least privilege principles |
| AWS Lambda in Action | Danilo Poccia | Ch. 4 (Event Sources), Ch. 8 (Performance), Ch. 10 (Deployment) | Lambda optimization, API Gateway integration patterns |
| Site Reliability Engineering | Betsy Beyer et al. (Google) | Ch. 4 (Service Level Objectives), Ch. 6 (Monitoring), Ch. 22 (Cascading Failures) | Production operations mindset, what to monitor and why |
| The DevOps Handbook | Gene Kim et al. | Part III (Feedback), Part IV (CI/CD) | Continuous delivery, deployment strategies, monitoring and observability |
| Production-Ready Microservices | Susan Fowler | Ch. 2-4 (Stability, Reliability, Scalability) | Production readiness checklist for each microservice |
Reading Strategy:
- Start with “AWS for Solutions Architects” Ch. 10-12 for SaaS architecture overview
- Read “Terraform: Up & Running” Ch. 5-7 while building infrastructure
- Reference “AWS Security” Ch. 3 when implementing IAM policies
- Use “Designing Data-Intensive Applications” Ch. 5-7 when designing data tier
- Review “SRE” Ch. 4, 6 after deployment for observability insights
Common Pitfalls & Debugging
Problem 1: “Terraform apply fails with ‘Error creating VPC: VpcLimitExceeded’”
- Why: AWS default limit is 5 VPCs per region, you’ve hit the limit
- Fix: Request VPC quota increase via AWS Service Quotas, or delete unused VPCs
- Quick test: aws ec2 describe-vpcs --query 'length(Vpcs)' shows current VPC count
- Best practice: Use VPC peering or Transit Gateway instead of creating a VPC per environment if hitting limits
Problem 2: “API Gateway returns 403 Forbidden with Cognito authorizer”
- Why: JWT token expired (default 1 hour), invalid token, or authorizer misconfigured
- Fix: Verify token hasn’t expired, check Cognito authorizer is pointing to correct user pool
- Quick test: Decode the JWT at jwt.io and check the exp (expiration) claim
- Common mistake: Sending the access token when the authorizer expects the ID token (or vice versa) - by default a Cognito authorizer validates the ID token; access tokens only work when OAuth scopes are configured
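To check expiration programmatically instead of pasting tokens into jwt.io, a minimal sketch that decodes the payload only (no signature verification - fine for debugging, never for authorization):
import base64, json, time

def is_token_expired(jwt_token: str) -> bool:
    # JWT = header.payload.signature; only the payload claims are needed here
    payload_b64 = jwt_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["exp"] < time.time()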
Problem 3: “Lambda in VPC can’t access DynamoDB - timeout errors”
- Why: No route to DynamoDB from VPC, or NAT Gateway not configured for outbound internet
- Fix: Create a gateway VPC endpoint for DynamoDB: com.amazonaws.REGION.dynamodb (free, no NAT needed) - see the sketch below
- Quick test: From Lambda CloudWatch Logs, check whether the timeout occurs at ~3 seconds (DNS) or ~30 seconds (connectivity)
- Cost savings: VPC endpoints eliminate NAT Gateway charges for AWS service access
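A sketch of creating that gateway endpoint with boto3 (VPC ID, region, and route table ID are placeholders; in this learning path you would normally declare it in Terraform instead):
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoint adds a DynamoDB route to the given route tables - no NAT Gateway needed
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # placeholder
    ServiceName="com.amazonaws.us-east-1.dynamodb",
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder: private subnets' route table
)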
Problem 4: “RDS connection pool exhausted - ‘too many connections’”
- Why: Each Lambda invocation creates new connection, doesn’t reuse across invocations
- Fix: Use RDS Proxy (connection pooling service) between Lambda and RDS
- Alternative: Initialize the database connection OUTSIDE the handler function so it is reused across warm starts (see the sketch below)
- Math: 100 concurrent Lambdas × 5 connections each = 500 connections (exceeds most RDS instance limits)
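The "initialize outside the handler" fix as a minimal sketch (assumes pymysql and credentials in environment variables; with RDS Proxy you would simply point DB_HOST at the proxy endpoint):
import os
import pymysql

# Created once per execution environment and reused across warm invocations
connection = pymysql.connect(
    host=os.environ["DB_HOST"],        # RDS or RDS Proxy endpoint
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
    connect_timeout=5,
)

def handler(event, context):
    with connection.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM users")  # placeholder query
        (count,) = cur.fetchone()
    return {"statusCode": 200, "body": str(count)}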
Problem 5: “S3 CORS errors when uploading from browser - ‘No Access-Control-Allow-Origin header’”
- Why: S3 bucket CORS policy not configured for your frontend domain
- Fix: Add CORS configuration to S3 bucket allowing your CloudFront distribution or API Gateway domain
- Quick test: Check browser console Network tab for preflight OPTIONS request response
- CORS configuration (note: this is a CORS rule set, not a bucket policy):
  {
    "CORSRules": [{
      "AllowedOrigins": ["https://yourapp.com"],
      "AllowedMethods": ["GET", "PUT", "POST"],
      "AllowedHeaders": ["*"],
      "MaxAgeSeconds": 3000
    }]
  }
Problem 6: “CloudFront not serving updated files - still showing old version”
- Why: CloudFront edge caches haven’t expired (default TTL can be 24 hours)
- Fix: Create a CloudFront invalidation: aws cloudfront create-invalidation --distribution-id ID --paths "/*"
- Cost: $0.005 per path invalidated (first 1,000 paths free per month)
- Best practice: Use versioned file names (app.v123.js) instead of invalidations for deployments
Problem 7: “Step Functions execution fails - ‘Lambda function returned unexpected result’”
- Why: Lambda return value doesn’t match expected JSON format, or Lambda threw unhandled exception
- Fix: Check Lambda CloudWatch Logs for actual error, ensure Lambda returns proper JSON
- Quick test: Test Lambda independently via AWS Console Test function
- Step Functions debugging: View execution graph in AWS Console, check input/output of failed state
Problem 8: “Cognito sign-up fails - ‘UsernameExistsException’ but user doesn’t exist in pool”
- Why: User exists in unconfirmed state (signed up but didn’t verify email/phone)
- Fix: Resend confirmation code or delete unconfirmed user via Cognito Console
- Prevention: Implement auto-confirmation for development environments, or send reminder emails for unconfirmed users
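Both remedies are single boto3 calls; a hedged sketch (user pool ID, app client ID, and username are placeholders, and admin_delete_user needs IAM permissions on the pool):
import boto3

cognito = boto3.client("cognito-idp")

# Option 1: resend the confirmation code so the user can finish verification
cognito.resend_confirmation_code(ClientId="YOUR_APP_CLIENT_ID", Username="user@example.com")

# Option 2 (dev environments): remove the stuck, unconfirmed user so they can sign up again
cognito.admin_delete_user(UserPoolId="us-east-1_EXAMPLE", Username="user@example.com")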
Problem 9: “Multi-environment deployment overwrites production accidentally”
- Why: Wrong Terraform workspace selected, or statefile conflict
- Fix: Always verify the workspace before apply: terraform workspace show
- Safety: Use a Terraform backend with a workspace prefix in the S3 key, and enable versioning on the state bucket
- Best practice: Require approval for production deploys, implement CI/CD gates
Problem 10: “AWS bill is $500/month - expected $50”
- Why: Common culprits: NAT Gateway ($32/month + $0.045/GB), Multi-AZ RDS running 24/7, CloudWatch Logs retention
- Fix: Analyze with AWS Cost Explorer, check top 5 services
- Quick wins:
- Use VPC endpoints instead of NAT for AWS services
- Reduce RDS instance size or use Aurora Serverless v2
- Set CloudWatch Logs retention to 7 days (default is never expire)
- Use S3 lifecycle policies to transition old data to Glacier
- Monitoring: Set up billing alarms in CloudWatch for early warning
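A minimal sketch of that billing alarm (billing metrics live only in us-east-1 and require "Receive Billing Alerts" to be enabled on the account; the SNS topic ARN is a placeholder):
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # billing metrics exist only here

cloudwatch.put_metric_alarm(
    AlarmName="monthly-bill-over-50-usd",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,                    # 6 hours; billing data updates a few times a day
    EvaluationPeriods=1,
    Threshold=50.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder SNS topic
)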
Debugging Tools & Techniques:
- AWS X-Ray Service Map: Visual graph showing all service dependencies and error rates
- CloudWatch Logs Insights: Query logs across all services
fields @timestamp, @message | filter @message like /ERROR/ | stats count() by bin(5m)
- CloudWatch Dashboards: Custom dashboard showing API latency, Lambda errors, RDS connections, Step Functions executions
- Terraform Plan: Always run terraform plan before apply to see what will change
- AWS CloudFormation StackSets: For true multi-account/multi-region deployments beyond this project
Key Learning Resources Summary
| Resource | Best For |
|---|---|
| “AWS for Solutions Architects” - Shrivastava et al. | Comprehensive coverage of all services |
| AWS DevOps Zero to Hero | Free hands-on projects |
| HashiCorp Terraform Tutorials | Infrastructure as Code |
| AWS Official Tutorials | Service-specific guides |
| “Designing Data-Intensive Applications” - Kleppmann | Distributed systems concepts |
Sources
Market Statistics & Trends (2025)
- AWS Market Share 2025: Insights into the Buyer Landscape - HG Insights
- AWS Stats 2025: Cloud Market Share & Growth Insights - eSparkInfo
- Global Cloud Market Share Report & Statistics 2025 - TekRevol
- Cloud Market Share Trends to Watch in 2026 - Emma
- Cloud Market Share: A Look at the Cloud Ecosystem in 2025 - Kinsta
- Cloud infrastructure spending hit $102.6 billion in Q3 2025 - IT Pro
- AWS marketplace number of services 2025 - Statista
- Top AWS Stats You Should Know About in 2026 - Simplilearn
- AWS Statistics, User Count and Facts for 2025 - Expanded Ramblings
AWS Lambda Performance & Cold Starts
- AWS Lambda Cold Starts in 2025: When They Matter and What They Cost - EdgeDelta
- Lambda Cold Starts benchmark - maxday
- Understanding and Remediating Cold Starts: An AWS Lambda Perspective - AWS Blog
- AWS Lambda Cold Start Optimization in 2025: What Actually Works - Zircon Tech
- Operating Lambda: Performance optimization – Part 1 - AWS Compute Blog
AWS Learning Resources & Projects
- AWS Practice Projects - Coursera
- AWS DevOps Zero to Hero - GitHub
- Top AWS Project Ideas - KnowledgeHut
- AWS Step Functions Tutorials - AWS Docs
- Provision EKS with Terraform - HashiCorp
- Terraform EKS Tutorial - Spacelift
- 13 Best AWS Books - Hackr.io
AWS Best Practices & Architecture
- AWS VPC Security Best Practices - AWS Docs
- VPC Design Evolution - AWS Architecture Blog
- AWS Well-Architected Framework - AWS Docs
Summary
This learning path covers AWS cloud infrastructure mastery through 5 comprehensive hands-on projects. Here’s the complete list:
| # | Project Name | Main Language | Difficulty | Time Estimate | Key Services |
|---|---|---|---|---|---|
| 1 | Production-Ready VPC from Scratch | HCL (Terraform) | Intermediate | 1-2 weeks | VPC, Subnets, NAT Gateway, Security Groups |
| 2 | Serverless Data Pipeline | Python | Intermediate | 1-2 weeks | Lambda, Step Functions, S3, CloudWatch |
| 3 | Auto-Scaling Web Application | Python/Node.js | Intermediate | 2-3 weeks | EC2, ALB, RDS, Auto Scaling, S3 |
| 4 | Container Platform | YAML/Docker | Advanced | 2-3 weeks | ECS Fargate/EKS, ECR, Service Mesh |
| 5 | Full-Stack SaaS Platform | Python/TypeScript | Advanced | 1-2 months | Multi-service architecture combining all above |
Recommended Learning Paths
For AWS beginners (Cloud-native path):
- Start with Project 1 (VPC) - Foundation of everything
- Move to Project 2 (Serverless) - Modern AWS patterns
- Try Project 3 (Auto-Scaling Web App) - Traditional architectures
- Advance to Project 4 (Containers) - Industry-standard deployments
- Build Project 5 (Full-Stack SaaS) - Comprehensive capstone
For infrastructure engineers (Traditional to cloud):
- Project 1 (VPC) - Translate network knowledge to cloud
- Project 3 (Auto-Scaling Web App) - Familiar EC2-based patterns
- Project 4 (Containers) - Containerization in cloud
- Project 2 (Serverless) - Event-driven paradigm shift
- Project 5 (Full-Stack SaaS) - Architecting at scale
For developers (Application-focused):
- Project 2 (Serverless) - Code without infrastructure
- Project 1 (VPC) - Networking essentials (can’t skip)
- Project 4 (Containers) - Package and deploy apps
- Project 3 (Auto-Scaling Web App) - Full-stack deployment
- Project 5 (Full-Stack SaaS) - Production-ready system
For solutions architects (Design-focused):
- Project 1 (VPC) - Network architecture decisions
- Project 3 (Auto-Scaling Web App) - Scalability patterns
- Project 4 (Containers) - Orchestration trade-offs
- Project 2 (Serverless) - Event-driven design
- Project 5 (Full-Stack SaaS) - Well-Architected Framework application
Expected Outcomes
After completing these projects, you will:
Technical Skills
- Design and deploy production VPCs with proper subnet segmentation, routing, and security layers
- Build serverless event-driven pipelines using Lambda, Step Functions, and S3 with comprehensive error handling
- Architect auto-scaling web applications across multiple availability zones with load balancing and database replication
- Deploy containerized workloads on ECS Fargate or EKS with service discovery and monitoring
- Construct full-stack SaaS platforms integrating multiple AWS services with proper security and observability
Conceptual Understanding
- VPC networking fundamentals: CIDR blocks, route tables, Internet Gateways, NAT Gateways, security groups vs NACLs
- IAM security model: Policy evaluation logic, least privilege, role-based access, cross-account access patterns
- Serverless execution model: Cold starts, concurrent limits, event sources, state management, orchestration
- Compute decision framework: When to use EC2 vs Lambda vs Fargate vs EKS - trade-offs and appropriate use cases
- Data architecture patterns: RDS vs DynamoDB, read replicas, caching strategies, backup and disaster recovery
- Infrastructure as Code: Terraform workflow, state management, module design, multi-environment deployments
- AWS Well-Architected Framework: Operational excellence, security, reliability, performance efficiency, cost optimization
Real-World Capabilities
- Troubleshoot networking issues: Understand VPC Flow Logs, trace packets, diagnose connectivity problems
- Optimize AWS costs: Right-size resources, leverage spot instances, implement lifecycle policies, monitor spending
- Design for failure: Multi-AZ deployments, retry logic, circuit breakers, graceful degradation
- Implement comprehensive observability: CloudWatch metrics/logs/alarms, X-Ray tracing, custom dashboards
- Secure cloud infrastructure: Defense in depth, encryption at rest/in transit, secrets management, audit logging
- Deploy with confidence: CI/CD pipelines, blue-green deployments, canary releases, rollback strategies
Interview & Career Readiness
- Answer AWS architecture questions confidently with real implementation experience
- Discuss trade-offs intelligently: Cost vs performance vs reliability, managed services vs self-managed
- Explain real-world failures you’ve debugged and lessons learned
- Demonstrate hands-on expertise through portfolio projects that employers can review
- Navigate AWS console and CLI efficiently for both development and troubleshooting
- Read and write CloudFormation/Terraform to understand infrastructure definitions in job codebases
Time Investment & Commitment
Total estimated time: 3-6 months depending on:
- Your current AWS/cloud experience (more if starting from zero)
- How deep you go into each project (minimum viable vs production-grade)
- Whether you build all 5 projects or focus on 2-3 most relevant to your goals
- Time spent on debugging, research, and experimentation (the real learning!)
Realistic weekly commitment:
- 10 hours/week: 4-6 months to complete all 5 projects
- 20 hours/week: 2-3 months for comprehensive coverage
- Full-time (40+ hours/week): 1-2 months intensive learning
Remember: The goal isn’t speed—it’s deep understanding. You’re not just deploying resources; you’re internalizing AWS patterns, failure modes, and architectural decision-making that takes most engineers years to develop.
What Makes This Learning Path Different
Unlike tutorials that show you how to click through the AWS console or copy-paste CloudFormation templates:
- You understand the “why” - Every architectural decision is explained with real-world context
- You experience failure - Projects include common pitfalls and debugging sections because breaking things teaches more than success
- You build from scratch - Infrastructure as Code forces you to understand every component explicitly
- You verify understanding - Interview questions, thinking exercises, and self-assessment ensure you’re not just following steps
- You connect to fundamentals - Each project maps to computer science concepts (networking, distributed systems, security)
- You use real tools - Terraform, AWS CLI, CloudWatch—the same stack professional engineers use daily
The AWS Shared Responsibility in Practice
By the end of these projects, you’ll viscerally understand what “shared responsibility model” means:
- AWS manages: Physical infrastructure, hypervisor, availability zones, service APIs
- YOU manage: Everything from the OS up—patching, security groups, IAM policies, data encryption, application code, network topology
When an EC2 instance is compromised because of a weak SSH key, that’s your responsibility. When an Availability Zone or an entire region experiences an outage, the infrastructure failure is AWS’s responsibility; designing the multi-AZ (or multi-region) architecture that survives it is YOUR responsibility.
Your Next Steps
- Set up AWS account with MFA enabled, create IAM user (never use root), configure AWS CLI
- Enable billing alerts to avoid surprise charges ($10 threshold for learning projects)
- Start with Project 1 - Build your first production VPC with Terraform
- Document your journey - Take notes on what you learn, capture screenshots, write blog posts
- Join the community - AWS subreddit, Discord servers, local meetups to discuss challenges and solutions
- Build in public - Push code to GitHub, write README files explaining your architecture decisions
- Iterate and improve - Come back to early projects after finishing later ones and refactor with new knowledge
The Reality Check
AWS is vast. You won’t master all 200+ services, and you don’t need to. These 5 projects give you:
- Deep understanding of 15-20 core services (the 80/20 rule applies)
- Mental models that apply to the other 180+ services
- Confidence to learn new AWS services as needed
- Battle-tested debugging skills from building real systems
Most importantly: After these projects, you won’t be intimidated by AWS. You’ll know how to:
- Read AWS documentation efficiently
- Design architectures that match business requirements
- Debug production issues systematically
- Estimate costs and optimize spending
- Make informed build-vs-buy decisions
You’ll have built 5 working systems you can demo, explain, and expand. That’s infinitely more valuable than “I completed an AWS certification course.”
Now go build something.
This learning path represents approximately 150-200 hours of hands-on building, debugging, and learning. The projects are ordered by dependency, but feel free to adjust based on your interests and career goals. Remember: breaking things is part of the process. Every error message is a teacher.