AWS DEEP DIVE LEARNING PROJECTS

Deep Understanding of AWS Through Building

Goal: Master AWS cloud infrastructure by building real systems that break, fail, and teach you why networking, security, and scalability patterns exist. You’ll go from clicking through the console to understanding why a private subnet needs a NAT Gateway, how IAM policy evaluation actually works, and when to choose Lambda over ECS over EC2.


Why AWS Mastery Requires Building

You’re tackling a massive ecosystem with 200+ services. AWS is not something you can understand by reading documentation—you need to build systems where misconfigured security groups break things, where wrong subnet routing causes timeouts, and where Lambda cold starts ruin your latency. That’s when the concepts become real.

The AWS Learning Problem

Most developers learn AWS backwards:

  1. Click through console tutorials
  2. Copy-paste CloudFormation templates
  3. Wonder why things break in production
  4. Panic when asked “why did you choose this architecture?”

The right approach:

  1. Understand the problem each service solves
  2. Build it wrong first (feel the pain)
  3. Fix it using IaC (understand every line)
  4. Break it intentionally (learn failure modes)
  5. Explain your architecture to others

The AWS Shared Responsibility Model

Before building anything, understand who is responsible for what:

┌─────────────────────────────────────────────────────────────────────────────┐
│                        YOUR RESPONSIBILITY                                   │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  Customer Data                                                         │  │
│  ├───────────────────────────────────────────────────────────────────────┤  │
│  │  Platform, Applications, Identity & Access Management                  │  │
│  ├───────────────────────────────────────────────────────────────────────┤  │
│  │  Operating System, Network & Firewall Configuration                   │  │
│  ├───────────────────────────────────────────────────────────────────────┤  │
│  │  Client-Side Data Encryption │ Server-Side Encryption │ Network Traffic│  │
│  │  & Data Integrity Auth       │ (File System/Data)     │ Protection      │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
├─────────────────────────────────────────────────────────────────────────────┤
│                        AWS RESPONSIBILITY                                    │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  Compute │ Storage │ Database │ Networking                            │  │
│  ├───────────────────────────────────────────────────────────────────────┤  │
│  │  Hardware/AWS Global Infrastructure                                   │  │
│  │  (Regions, Availability Zones, Edge Locations)                        │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘

Key insight: AWS manages the physical infrastructure. For IaaS services like EC2, YOU manage everything from the OS up. If your EC2 instance gets hacked because of a weak password, that's on you. If an AWS data center floods, that's on them. (Managed services like RDS or Lambda shift more of the stack to AWS's side of the line.)


Understanding AWS Architecture: The Mental Model

The Three Pillars of AWS

Every AWS architecture decision comes down to balancing three concerns:

                    ┌─────────────────┐
                    │                 │
                    │   RELIABILITY   │
                    │  (Multi-AZ,     │
                    │   Redundancy)   │
                    │                 │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
    ┌─────────────────┐           ┌─────────────────┐
    │                 │           │                 │
    │      COST       │◄─────────►│   PERFORMANCE   │
    │  (Right-sizing, │           │  (Low latency,  │
    │   Reserved,     │           │ High throughput,│
    │   Spot)         │           │   Scaling)      │
    │                 │           │                 │
    └─────────────────┘           └─────────────────┘

    The AWS Well-Architected Trade-off Triangle

Every architecture decision involves trade-offs:

  • Multi-AZ RDS is more reliable but costs 2x
  • Spot instances are 70% cheaper but can be terminated anytime
  • Lambda has zero idle cost but cold starts add latency

AWS Global Infrastructure

Understanding the hierarchy is critical:

┌─────────────────────────────────────────────────────────────────────────────┐
│                              AWS GLOBAL                                      │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │  Region: us-east-1 (N. Virginia)                                        ││
│  │  ┌────────────────┐ ┌────────────────┐ ┌────────────────┐               ││
│  │  │     AZ 1       │ │     AZ 2       │ │     AZ 3       │  ...          ││
│  │  │  us-east-1a    │ │  us-east-1b    │ │  us-east-1c    │               ││
│  │  │                │ │                │ │                │               ││
│  │  │ ┌────────────┐ │ │ ┌────────────┐ │ │ ┌────────────┐ │               ││
│  │  │ │ Data Center│ │ │ │ Data Center│ │ │ │ Data Center│ │               ││
│  │  │ │            │ │ │ │            │ │ │ │            │ │               ││
│  │  │ │ • EC2      │ │ │ │ • EC2      │ │ │ │ • EC2      │ │               ││
│  │  │ │ • RDS      │ │ │ │ • RDS      │ │ │ │ • RDS      │ │               ││
│  │  │ │ • EBS      │ │ │ │ • EBS      │ │ │ │ • EBS      │ │               ││
│  │  │ └────────────┘ │ │ └────────────┘ │ │ └────────────┘ │               ││
│  │  └────────────────┘ └────────────────┘ └────────────────┘               ││
│  │                                                                          ││
│  │  ← Low latency connections between AZs (< 2ms) →                        ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │  Region: eu-west-1 (Ireland)                                            ││
│  │  ...similar AZ structure...                                              ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                              │
│  Edge Locations (CloudFront): 400+ worldwide for content delivery           │
└─────────────────────────────────────────────────────────────────────────────┘

Key insight for projects:

  • Region: Geographic area containing your resources; choose it for latency and data residency compliance
  • AZ (Availability Zone): Isolated data centers, failure boundary
  • Multi-AZ: Deploy across 2+ AZs for high availability (if one AZ fails, others continue)

VPC Networking: The Foundation of Everything

Every AWS resource you deploy lives in a network. Understanding VPC is non-negotiable.

What is a VPC?

A Virtual Private Cloud (VPC) is your isolated network within AWS. Think of it as your own private data center in the cloud, with complete control over:

  • IP address ranges (CIDR blocks)
  • Subnets (network segments)
  • Route tables (traffic rules)
  • Gateways (internet access)
  • Security (firewalls)

The Anatomy of a Production VPC

┌─────────────────────────────────────────────────────────────────────────────────┐
│  VPC: 10.0.0.0/16 (65,536 IP addresses)                                         │
│                                                                                  │
│  ┌─────────────────────────────────────┐ ┌─────────────────────────────────────┐│
│  │  Availability Zone A (us-east-1a)   │ │  Availability Zone B (us-east-1b)   ││
│  │                                     │ │                                     ││
│  │  ┌─────────────────────────────┐   │ │  ┌─────────────────────────────┐   ││
│  │  │ PUBLIC SUBNET: 10.0.1.0/24  │   │ │  │ PUBLIC SUBNET: 10.0.2.0/24  │   ││
│  │  │ (256 IPs)                   │   │ │  │ (256 IPs)                   │   ││
│  │  │                             │   │ │  │                             │   ││
│  │  │  ┌───────┐    ┌───────┐    │   │ │  │  ┌───────┐    ┌───────┐    │   ││
│  │  │  │Bastion│    │  NAT  │    │   │ │  │  │Bastion│    │  NAT  │    │   ││
│  │  │  │ Host  │    │Gateway│    │   │ │  │  │ (HA)  │    │Gateway│    │   ││
│  │  │  └───────┘    └───┬───┘    │   │ │  │  └───────┘    └───┬───┘    │   ││
│  │  └───────────────────┼────────┘   │ │  └───────────────────┼────────┘   ││
│  │                      │            │ │                      │            ││
│  │  ┌───────────────────┼────────┐   │ │  ┌───────────────────┼────────┐   ││
│  │  │ PRIVATE SUBNET:   │        │   │ │  │ PRIVATE SUBNET:   │        │   ││
│  │  │ 10.0.10.0/24      │        │   │ │  │ 10.0.20.0/24      │        │   ││
│  │  │ (App Tier)        ▼        │   │ │  │ (App Tier)        ▼        │   ││
│  │  │                            │   │ │  │                            │   ││
│  │  │  ┌───────┐    ┌───────┐   │   │ │  │  ┌───────┐    ┌───────┐   │   ││
│  │  │  │  EC2  │    │  EC2  │   │   │ │  │  │  EC2  │    │  EC2  │   │   ││
│  │  │  │ (App) │    │ (App) │   │   │ │  │  │ (App) │    │ (App) │   │   ││
│  │  │  └───────┘    └───────┘   │   │ │  │  └───────┘    └───────┘   │   ││
│  │  └────────────────────────────┘   │ │  └────────────────────────────┘   ││
│  │                                     │ │                                     ││
│  │  ┌────────────────────────────┐   │ │  ┌────────────────────────────┐   ││
│  │  │ DATA SUBNET: 10.0.100.0/24 │   │ │  │ DATA SUBNET: 10.0.200.0/24 │   ││
│  │  │ (Database Tier)            │   │ │  │ (Database Tier)            │   ││
│  │  │                            │   │ │  │                            │   ││
│  │  │  ┌────────────────────┐   │   │ │  │  ┌────────────────────┐   │   ││
│  │  │  │  RDS Primary       │───┼───┼─┼──│  │  RDS Standby      │   │   ││
│  │  │  │  (PostgreSQL)      │   │   │ │  │  │  (Sync Replica)    │   │   ││
│  │  │  └────────────────────┘   │   │ │  │  └────────────────────┘   │   ││
│  │  └────────────────────────────┘   │ │  └────────────────────────────┘   ││
│  └─────────────────────────────────────┘ └─────────────────────────────────────┘│
│                                                                                  │
│  ┌─────────────────┐                                                            │
│  │ Internet Gateway│ ◄──── Allows public subnets to reach internet             │
│  └────────┬────────┘                                                            │
│           │                                                                      │
│           ▼                                                                      │
│      ┌─────────┐                                                                │
│      │ Internet│                                                                │
│      └─────────┘                                                                │
└─────────────────────────────────────────────────────────────────────────────────┘

Understanding CIDR Notation

CIDR (Classless Inter-Domain Routing) defines IP address ranges:

IP Address:  10.0.0.0
CIDR:        /16

Binary breakdown:
10.0.0.0/16 means:
├── First 16 bits are FIXED (network portion)
│   10      .    0       (00001010.00000000)
│
└── Last 16 bits are FREE (host portion)
    x.x (00000000.00000000 to 11111111.11111111)

Result: 10.0.0.0 to 10.0.255.255 = 65,536 addresses

Common CIDR blocks for VPC design:
┌─────────┬─────────────────┬──────────────────────────────┐
│ CIDR    │ # of IPs        │ Use Case                     │
├─────────┼─────────────────┼──────────────────────────────┤
│ /16     │ 65,536          │ Entire VPC                   │
│ /20     │ 4,096           │ Large subnet (production)    │
│ /24     │ 256             │ Standard subnet              │
│ /28     │ 16              │ Small subnet (bastion hosts) │
└─────────┴─────────────────┴──────────────────────────────┘

⚠️  AWS reserves 5 IPs per subnet:
   .0   Network address
   .1   VPC router
   .2   DNS server
   .3   Reserved for future use
   .255 Broadcast (not supported, but reserved)

So a /24 subnet (256 IPs) actually has 251 usable IPs.
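
If you manage subnets with Terraform (as in Project 1 below), the built-in cidrsubnet() function does this math for you. A minimal sketch (the local names are illustrative):

# Carve /24 subnets out of a /16 VPC block.
# cidrsubnet(prefix, newbits, netnum): /16 plus 8 new bits = /24.
locals {
  vpc_cidr = "10.0.0.0/16"

  public_a  = cidrsubnet(local.vpc_cidr, 8, 1)   # 10.0.1.0/24
  public_b  = cidrsubnet(local.vpc_cidr, 8, 2)   # 10.0.2.0/24
  private_a = cidrsubnet(local.vpc_cidr, 8, 10)  # 10.0.10.0/24
  private_b = cidrsubnet(local.vpc_cidr, 8, 20)  # 10.0.20.0/24
}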

Public vs Private Subnets: The Critical Difference

PUBLIC SUBNET                          PRIVATE SUBNET
─────────────                          ──────────────
Route Table:                           Route Table:
┌────────────────────────────┐        ┌────────────────────────────┐
│ Destination   │ Target     │        │ Destination   │ Target     │
├───────────────┼────────────┤        ├───────────────┼────────────┤
│ 10.0.0.0/16   │ local      │        │ 10.0.0.0/16   │ local      │
│ 0.0.0.0/0     │ igw-xxxxx  │◄──IGW │ 0.0.0.0/0     │ nat-xxxxx  │◄──NAT
└───────────────┴────────────┘        └───────────────┴────────────┘

Key differences:
┌─────────────────┬──────────────────────┬──────────────────────┐
│                 │ Public Subnet        │ Private Subnet       │
├─────────────────┼──────────────────────┼──────────────────────┤
│ Internet access │ Via Internet Gateway │ Via NAT Gateway      │
│ Inbound traffic │ Can receive from web │ Cannot receive       │
│ Public IP       │ Can have Elastic IP  │ No public IP         │
│ Use case        │ Load balancers,      │ App servers,         │
│                 │ bastion hosts        │ databases            │
└─────────────────┴──────────────────────┴──────────────────────┘
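
In Terraform, the entire public/private distinction comes down to where each route table's default route points. A minimal sketch, assuming aws_vpc.main, aws_internet_gateway.main, and aws_nat_gateway.main are defined elsewhere:

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  # Default route via the Internet Gateway is what makes a subnet "public"
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  # Default route via the NAT Gateway: outbound-only internet access
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }
}

# AWS adds the implicit "local" route for 10.0.0.0/16 automatically.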

Security Groups vs NACLs: Two Layers of Defense

┌─────────────────────────────────────────────────────────────────────────────┐
│                            NACL (Subnet Level)                               │
│                            - Stateless                                       │
│                            - Explicit allow/deny                             │
│                            - Rule numbers (processed in order)               │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                         Security Group (Instance Level)                │  │
│  │                         - Stateful (return traffic auto-allowed)       │  │
│  │                         - Allow rules only (implicit deny)             │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │                           EC2 Instance                          │  │  │
│  │  │                                                                  │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘

Traffic flow example (HTTP request to web server):

1. Request arrives at VPC
2. NACL checks inbound rules (allow port 80?)     → Yes? Continue
3. Security Group checks inbound rules            → Yes? Continue
4. Request reaches EC2 instance
5. Response leaves EC2 instance
6. Security Group: AUTOMATICALLY allows response  → Stateful!
7. NACL checks outbound rules (allow response?)   → Must be explicit
8. NACL checks ephemeral port range               → Rule needed!

Security Group (stateful):          NACL (stateless):
Inbound: Allow TCP 80               Inbound: Allow TCP 80
Outbound: (not needed for response) Outbound: Allow TCP 1024-65535 (ephemeral)
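
The same asymmetry expressed in Terraform: the security group needs only the inbound rule, while the NACL needs an explicit egress rule for ephemeral ports. A minimal sketch, assuming aws_vpc.main and aws_subnet.public_a exist:

resource "aws_security_group" "web" {
  vpc_id = aws_vpc.main.id

  # Stateful: response traffic for this connection is allowed automatically
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_network_acl" "web" {
  vpc_id     = aws_vpc.main.id
  subnet_ids = [aws_subnet.public_a.id]

  ingress {
    rule_no    = 100
    protocol   = "tcp"
    action     = "allow"
    cidr_block = "0.0.0.0/0"
    from_port  = 80
    to_port    = 80
  }

  # Stateless: without this egress rule, response packets are silently dropped
  egress {
    rule_no    = 100
    protocol   = "tcp"
    action     = "allow"
    cidr_block = "0.0.0.0/0"
    from_port  = 1024
    to_port    = 65535
  }
}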

IAM: The Security Model You Must Master

IAM (Identity and Access Management) controls WHO can do WHAT to WHICH resources.

The IAM Policy Language

Every IAM policy answers these questions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",              ◄─── Allow or Deny?
            "Action": [                     ◄─── What actions?
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [                   ◄─── Which resources?
                "arn:aws:s3:::my-bucket/*"
            ],
            "Condition": {                  ◄─── Under what conditions?
                "StringEquals": {
                    "aws:RequestedRegion": "us-east-1"
                }
            }
        }
    ]
}

IAM Policy Evaluation Logic

When AWS evaluates permissions, it follows this order:

┌─────────────────────────────────────────────────────────────────────────────┐
│                          IAM Policy Evaluation                               │
│                                                                              │
│  1. By default, all requests are DENIED (implicit deny)                      │
│                     │                                                        │
│                     ▼                                                        │
│  2. Check all applicable policies                                            │
│     ┌──────────────┬──────────────┬──────────────┬──────────────┐           │
│     │ Identity     │ Resource     │ Permission   │ Service      │           │
│     │ Policies     │ Policies     │ Boundaries   │ Control      │           │
│     │ (IAM user/   │ (S3 bucket,  │ (IAM)        │ Policies     │           │
│     │  role)       │  SQS queue)  │              │ (Org level)  │           │
│     └──────────────┴──────────────┴──────────────┴──────────────┘           │
│                     │                                                        │
│                     ▼                                                        │
│  3. If ANY policy has explicit DENY ──────────────► DENIED (final)          │
│                     │                                                        │
│                     ▼                                                        │
│  4. If ANY policy has explicit ALLOW ─────────────► ALLOWED                 │
│                     │                                                        │
│                     ▼                                                        │
│  5. If no explicit allow found ───────────────────► DENIED (implicit)       │
│                                                                              │
│  Remember: Explicit DENY always wins!                                        │
└─────────────────────────────────────────────────────────────────────────────┘
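
You can watch "explicit deny wins" play out in a policy like the following sketch, written with Terraform's policy document data source (the name is illustrative). Both statements match an S3 call outside us-east-1, and the Deny overrides the Allow:

data "aws_iam_policy_document" "s3_us_east_1_only" {
  # Broad allow...
  statement {
    effect    = "Allow"
    actions   = ["s3:*"]
    resources = ["*"]
  }

  # ...trumped by an explicit deny for any other region
  statement {
    effect    = "Deny"
    actions   = ["s3:*"]
    resources = ["*"]

    condition {
      test     = "StringNotEquals"
      variable = "aws:RequestedRegion"
      values   = ["us-east-1"]
    }
  }
}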

IAM Roles vs Users vs Groups

┌─────────────────────────────────────────────────────────────────────────────┐
│  IAM USERS                                                                   │
│  - Permanent credentials (access key + secret key)                          │
│  - Used for: Human users, CLI access                                        │
│  - Best practice: Use MFA, rotate keys regularly                            │
│                                                                              │
│  ┌─────────┐                                                                │
│  │  User   │───► Has access keys, belongs to groups                        │
│  └─────────┘                                                                │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│  IAM GROUPS                                                                  │
│  - Collection of users                                                       │
│  - Policies attached to group apply to all members                          │
│  - Used for: Organizing users by job function (Developers, Admins)          │
│                                                                              │
│  ┌─────────┐                                                                │
│  │  Group  │───► Contains users, has policies                              │
│  │(Devs)   │     ┌──────┐ ┌──────┐ ┌──────┐                                │
│  └─────────┘     │User A│ │User B│ │User C│                                │
│                  └──────┘ └──────┘ └──────┘                                │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│  IAM ROLES                                                                   │
│  - Temporary credentials (auto-rotated by AWS)                              │
│  - Can be ASSUMED by: EC2, Lambda, ECS, other AWS accounts, SAML users     │
│  - Used for: Service-to-service authentication, cross-account access        │
│                                                                              │
│  ┌─────────┐         ┌─────────────────────────────────────────┐           │
│  │  Role   │         │  Trust Policy: WHO can assume this role │           │
│  │(Lambda) │◄────────│  Permissions: WHAT they can do          │           │
│  └─────────┘         └─────────────────────────────────────────┘           │
│       │                                                                      │
│       ▼                                                                      │
│  Lambda function assumes role → Gets temp credentials → Accesses S3         │
└─────────────────────────────────────────────────────────────────────────────┘

Best Practice Hierarchy:
   ┌──────────────────────────────────────────────────────────────┐
   │  PREFER ROLES OVER USERS                                     │
   │                                                               │
   │  ✓ Roles: Temporary creds, auto-rotated, no keys to leak    │
   │  ✗ Users: Permanent creds, must rotate manually, can leak    │
   │                                                               │
   │  EC2 instances? Use Instance Profile (role wrapper)          │
   │  Lambda? Use Execution Role                                   │
   │  ECS tasks? Use Task Role                                     │
   │  Cross-account? Use AssumeRole                                │
   └──────────────────────────────────────────────────────────────┘
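
In Terraform, a role is two documents: a trust policy (WHO may assume it) and a permissions policy (WHAT they may do). A minimal sketch of a Lambda execution role, with illustrative names:

resource "aws_iam_role" "lambda_exec" {
  name = "my-function-exec-role"

  # Trust policy: only the Lambda service can assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# Permissions: what the function may do once it has assumed the role
resource "aws_iam_role_policy_attachment" "logs" {
  role       = aws_iam_role.lambda_exec.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}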

Serverless Architecture: Lambda, Step Functions & Event-Driven Design

The Lambda Execution Model

Understanding Lambda’s lifecycle is critical for performance:

┌─────────────────────────────────────────────────────────────────────────────┐
│                        LAMBDA EXECUTION LIFECYCLE                            │
│                                                                              │
│  COLD START (first invocation or after idle)                                │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  1. Download code       2. Start runtime    3. Initialize handler      │ │
│  │     from S3                (Python, Node)      (your imports)          │ │
│  │     ~100ms                 ~200ms              ~500-3000ms             │ │
│  │                                                                         │ │
│  │  TOTAL COLD START: 800ms - 5s depending on package size               │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                          │                                                   │
│                          ▼                                                   │
│  WARM START (execution environment reused)                                  │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │  Environment already running, just execute handler                      │ │
│  │  TOTAL WARM START: <100ms                                              │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                          │                                                   │
│                          ▼                                                   │
│  EXECUTION CONTEXT REUSE                                                    │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                                                                         │ │
│  │  # This code runs ONCE (cold start only)                               │ │
│  │  import boto3                                                           │ │
│  │  s3_client = boto3.client('s3')  # Reused across invocations!         │ │
│  │                                                                         │ │
│  │  # This code runs EVERY invocation                                     │ │
│  │  def handler(event, context):                                           │ │
│  │      s3_client.get_object(...)  # Uses pre-initialized client          │ │
│  │                                                                         │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  LAMBDA LIMITS:                                                              │
│  ┌──────────────────────────┬───────────────────────────────────────────┐  │
│  │ Max execution time       │ 15 minutes                                │  │
│  │ Max memory               │ 10 GB                                     │  │
│  │ Max /tmp storage         │ 512 MB default, up to 10 GB (ephemeral)   │  │
│  │ Max deployment package   │ 50 MB (zipped), 250 MB (unzipped)         │  │
│  │ Max concurrent executions│ 1000 (soft limit, can increase)           │  │
│  └──────────────────────────┴───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
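
The knobs that drive cold-start behavior (package size, memory, timeout) all live on the function resource. A minimal Terraform sketch, assuming a handler.zip artifact and an execution role like the one sketched in the IAM section:

resource "aws_lambda_function" "example" {
  function_name = "example-handler"
  role          = aws_iam_role.lambda_exec.arn

  filename         = "handler.zip"   # keep the package small: less to download on cold start
  source_code_hash = filebase64sha256("handler.zip")

  runtime     = "python3.12"
  handler     = "handler.handler"    # module.function
  memory_size = 512                  # MB; more memory also means proportionally more CPU
  timeout     = 30                   # seconds; hard maximum is 900 (15 minutes)
}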

Step Functions: Orchestrating Serverless Workflows

┌─────────────────────────────────────────────────────────────────────────────┐
│                     STEP FUNCTIONS STATE MACHINE                             │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                           StartState                                 │   │
│  │                               │                                      │   │
│  │                               ▼                                      │   │
│  │                      ┌───────────────┐                              │   │
│  │                      │  ValidateInput │ (Task: Lambda)              │   │
│  │                      └───────┬───────┘                              │   │
│  │                              │                                       │   │
│  │                              ▼                                       │   │
│  │                      ┌───────────────┐                              │   │
│  │                      │    Choice     │ (Decision point)             │   │
│  │                      └───────┬───────┘                              │   │
│  │                     ┌────────┴────────┐                             │   │
│  │                     ▼                 ▼                              │   │
│  │             ┌─────────────┐   ┌─────────────┐                       │   │
│  │             │  Valid: Yes │   │  Valid: No  │                       │   │
│  │             └──────┬──────┘   └──────┬──────┘                       │   │
│  │                    │                 │                               │   │
│  │                    ▼                 ▼                               │   │
│  │           ┌───────────────┐  ┌───────────────┐                      │   │
│  │           │  ProcessData  │  │  SendError    │                      │   │
│  │           │   (Task)      │  │  Notification │                      │   │
│  │           └───────┬───────┘  └───────┬───────┘                      │   │
│  │                   │                  │                               │   │
│  │                   ▼                  ▼                               │   │
│  │           ┌───────────────┐         End                             │   │
│  │           │   Parallel    │ (Run multiple branches)                 │   │
│  │           │   ┌───┬───┐   │                                         │   │
│  │           │   │ A │ B │   │                                         │   │
│  │           │   └───┴───┘   │                                         │   │
│  │           └───────┬───────┘                                         │   │
│  │                   │                                                  │   │
│  │                   ▼                                                  │   │
│  │           ┌───────────────┐                                         │   │
│  │           │  SaveResults  │                                         │   │
│  │           │   (Task)      │                                         │   │
│  │           └───────┬───────┘                                         │   │
│  │                   │                                                  │   │
│  │                   ▼                                                  │   │
│  │                  End                                                 │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  STATE TYPES:                                                                │
│  ┌──────────┬────────────────────────────────────────────────────────────┐ │
│  │ Task     │ Execute Lambda, ECS task, API call, etc.                   │ │
│  │ Choice   │ If/else branching based on input                           │ │
│  │ Parallel │ Execute multiple branches simultaneously                    │ │
│  │ Map      │ Iterate over array, execute state for each item            │ │
│  │ Wait     │ Delay for specified time                                   │ │
│  │ Pass     │ Pass input to output (useful for transformations)          │ │
│  │ Succeed  │ Terminal state indicating success                          │ │
│  │ Fail     │ Terminal state indicating failure                          │ │
│  └──────────┴────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
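
Under the hood, a state machine is an ASL JSON document attached to an IAM role. A minimal Terraform sketch of the ValidateInput → Choice fragment from the diagram (function and role names are placeholders):

resource "aws_sfn_state_machine" "pipeline" {
  name     = "data-pipeline"
  role_arn = aws_iam_role.sfn_exec.arn  # a role trusted by states.amazonaws.com

  definition = jsonencode({
    StartAt = "ValidateInput"
    States = {
      ValidateInput = {
        Type     = "Task"
        Resource = aws_lambda_function.validate.arn
        Next     = "IsValid"
      }
      IsValid = {
        Type    = "Choice"
        Choices = [{
          Variable      = "$.valid"
          BooleanEquals = true
          Next          = "ProcessData"
        }]
        Default = "SendErrorNotification"
      }
      ProcessData = {
        Type     = "Task"
        Resource = aws_lambda_function.process.arn
        End      = true
      }
      SendErrorNotification = {
        Type  = "Fail"
        Error = "ValidationFailed"
      }
    }
  })
}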

Compute Options: When to Use What

┌─────────────────────────────────────────────────────────────────────────────┐
│                     AWS COMPUTE DECISION TREE                                │
│                                                                              │
│  What are you running?                                                       │
│        │                                                                     │
│        ├──► Event-driven, short tasks (<15 min)?                            │
│        │         │                                                           │
│        │         └──► Lambda (serverless, pay per invocation)               │
│        │                                                                     │
│        ├──► Containerized application?                                       │
│        │         │                                                           │
│        │         ├──► Need Kubernetes? ─────► EKS                           │
│        │         │                                                           │
│        │         └──► AWS-native? ──────────► ECS                           │
│        │                   │                                                 │
│        │                   ├──► Don't want to manage servers? ──► Fargate   │
│        │                   └──► Need GPU/custom instances? ─────► EC2       │
│        │                                                                     │
│        └──► Traditional application, full OS control?                        │
│                  │                                                           │
│                  └──► EC2                                                    │
│                         │                                                    │
│                         ├──► Predictable workload? ──► Reserved Instances   │
│                         ├──► Flexible timing? ───────► Spot Instances       │
│                         └──► Unknown/variable? ──────► On-Demand            │
│                                                                              │
│  COST COMPARISON (approximate, varies by region):                           │
│  ┌──────────────────┬─────────────────────┬────────────────────────────┐   │
│  │ Service          │ Pricing Model       │ Best For                   │   │
│  ├──────────────────┼─────────────────────┼────────────────────────────┤   │
│  │ Lambda           │ $0.20/1M requests   │ Infrequent, event-driven   │   │
│  │                  │ + compute time      │ tasks                      │   │
│  │ Fargate          │ ~$0.04/vCPU-hour    │ Containers without EC2     │   │
│  │ EC2 On-Demand    │ ~$0.04/hour (t3.med)│ Unknown workloads          │   │
│  │ EC2 Reserved     │ ~40% cheaper        │ Predictable, 1-3 year      │   │
│  │ EC2 Spot         │ ~70% cheaper        │ Flexible, interruptible    │   │
│  └──────────────────┴─────────────────────┴────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘

Core Concept Analysis

AWS services break down into these fundamental building blocks:

  • Networking (VPC): CIDR blocks, public/private subnets, route tables, NAT gateways, Internet gateways, security groups vs NACLs, VPC peering, Transit Gateway
  • Compute (EC2): Instance types, AMIs, user data, auto-scaling groups, launch templates, Elastic IPs, placement groups
  • Containers (ECS/EKS): Task definitions, services, clusters, Fargate vs EC2 launch types, service discovery, Kubernetes control plane
  • Serverless (Lambda): Event sources, execution context, cold starts, layers, concurrency, IAM execution roles
  • Orchestration (Step Functions): State machines, ASL (Amazon States Language), error handling, retries, parallel/map states
  • Storage (S3): Buckets, objects, policies, versioning, lifecycle rules, storage classes, presigned URLs
  • Infrastructure as Code: CloudFormation, Terraform, resource dependencies, state management, drift detection

Concept Summary Table

Before diving into projects, internalize these AWS concept clusters. Each project will force you to understand multiple concepts simultaneously:

  • VPC & Networking: CIDR blocks and subnet math (how /16, /24 work), public vs private subnets (routing differences), route tables (0.0.0.0/0 means "default route"), Internet Gateway (public subnet requirement), NAT Gateway (private subnet internet access), security groups (stateful, instance-level), NACLs (stateless, subnet-level), VPC peering (connecting VPCs), Transit Gateway (hub-and-spoke networking)
  • Compute (EC2): Instance types (compute vs memory vs storage optimized), AMIs (golden images), user data (bootstrap scripts on launch), instance profiles (IAM roles for EC2), auto-scaling groups (elasticity), launch templates (instance configuration), Elastic IPs (static public IPs), placement groups (low latency)
  • Serverless (Lambda): Execution model (stateless, ephemeral), cold starts (initialization penalty), warm starts (reusing execution context), event sources (what triggers Lambda), layers (shared code/dependencies), concurrency (parallel executions), execution roles (IAM permissions), timeout limits (15 min max), memory allocation (128 MB-10 GB), /tmp storage (512 MB default, up to 10 GB, ephemeral)
  • Containers (ECS/EKS): Task definitions (container configuration), services (long-running tasks), clusters (logical grouping), Fargate vs EC2 launch types (serverless vs managed instances), awsvpc networking mode (each task gets an ENI), service discovery (DNS-based), ALB target groups (container load balancing), EKS control plane (managed Kubernetes), ECR (container registry)
  • Orchestration (Step Functions): State machines (workflow as code), ASL (Amazon States Language, JSON), states (Task, Choice, Parallel, Map, Wait, Pass, Fail, Succeed), error handling (Retry, Catch), parallel execution, Map state (dynamic parallelism), Express vs Standard workflows
  • Storage (S3): Buckets (global namespace), objects (key-value), S3 API operations (PUT, GET, DELETE, LIST), versioning (immutable history), lifecycle rules (transition/expiration), storage classes (Standard, IA, Glacier), presigned URLs (temporary access), CORS (cross-origin access), bucket policies vs IAM policies, S3 Select (query in place)
  • Security (IAM): Policies (JSON documents), principals (who), actions (what), resources (where), conditions (when), identity-based policies (attached to users/roles), resource-based policies (attached to resources like S3 buckets), roles (temporary credentials), instance profiles (EC2 role wrapper), service-linked roles, policy evaluation logic (explicit deny wins)
  • Infrastructure as Code: Terraform state (source of truth), state locking (prevent concurrent modifications), CloudFormation stacks (grouped resources), drift detection (config vs reality), resource dependencies (implicit vs explicit), modules/nested stacks (reusability), workspaces/stack sets (multi-environment), import (existing resources into IaC)
  • Observability: CloudWatch Logs (log aggregation), log groups and streams, CloudWatch Metrics (time-series data), custom metrics, CloudWatch Alarms (threshold-based alerts), CloudWatch Dashboards (visualization), X-Ray tracing (distributed request tracking), segments and subsegments, service maps, VPC Flow Logs (network traffic), CloudTrail (API audit logs)

Why this matters: AWS services don’t exist in isolation. When you build a VPC, you’re also configuring security groups, IAM roles, and CloudWatch logs. When you deploy Lambda, you need to understand IAM execution roles, VPC networking (if accessing RDS), and CloudWatch for debugging. These concepts interconnect constantly.


Deep Dive Reading by Concept

Map your project work to specific reading. Don’t read these books cover-to-cover upfront—use them as references when you hit specific challenges in your projects.

VPC & Networking Fundamentals

  • CIDR and IP Addressing: "The Linux Programming Interface" by Michael Kerrisk, Ch. 59 (Sockets: Internet Domains). Understand IP addresses, subnetting, and CIDR notation at the networking-fundamentals level.
  • VPC Architecture Patterns: "AWS for Solutions Architects" by Saurabh Shrivastava, Ch. 3 (Networking on AWS). The best comprehensive coverage of VPC design, multi-AZ patterns, and network segmentation.
  • Security Groups vs NACLs: "AWS for Solutions Architects" by Saurabh Shrivastava, Ch. 4 (Security on AWS). Stateful vs stateless firewall rules, and when to use each.
  • NAT Gateway Design: AWS Architecture Blog, "VPC Design Evolution" (full article). Real-world patterns for scaling NAT, costs, and HA.
  • Routing Deep Dive: AWS VPC User Guide, Route Tables section. Official documentation on route table priority and longest-prefix matching.

Compute (EC2) Deep Dive

  • EC2 Instance Types: "AWS for Solutions Architects" by Saurabh Shrivastava, Ch. 6 (Compute Services). When to choose compute-optimized vs memory-optimized vs storage-optimized instances.
  • Auto Scaling Architecture: "AWS for Solutions Architects" by Saurabh Shrivastava, Ch. 6 (Compute Services). Scaling policies, health checks, target tracking vs step scaling.
  • AMI Best Practices: AWS EC2 User Guide, AMIs section. Golden images, versioning, cross-region copying.
  • Instance Metadata Service: AWS EC2 User Guide, Instance Metadata section. How user data works, IMDSv2, security implications.

Serverless (Lambda & Step Functions)

  • Lambda Execution Model: "AWS for Solutions Architects" by Saurabh Shrivastava, Ch. 8 (Serverless Architecture). Cold starts, execution context reuse, concurrency models.
  • Event-Driven Architecture: "Designing Data-Intensive Applications" by Martin Kleppmann, Ch. 11 (Stream Processing). Foundational understanding of event-driven systems and exactly-once processing.
  • Step Functions State Machines: AWS Step Functions Developer Guide, ASL Specification. Learn the state machine language and error-handling patterns.
  • Lambda Best Practices: AWS Lambda Developer Guide, Best Practices section. Performance optimization, error handling, testing strategies.

Containers (ECS/EKS)

  • ECS Architecture: "AWS for Solutions Architects" by Saurabh Shrivastava, Ch. 9 (Container Services). Task definitions, Fargate vs EC2 launch type, service discovery.
  • Kubernetes Fundamentals: Amazon EKS Best Practices Guide (full guide). Networking, security, and observability for EKS.
  • Container Networking: "AWS for Solutions Architects" by Saurabh Shrivastava, Ch. 9 (Container Services). awsvpc mode, VPC integration, load balancer integration.
  • Service Mesh Concepts: AWS App Mesh Documentation, "What is App Mesh". Service-to-service communication, retries, circuit breakers.

Storage (S3) Deep Dive

  • S3 Data Model: "AWS for Solutions Architects" by Saurabh Shrivastava, Ch. 5 (Storage Services). Object storage concepts, consistency model, versioning.
  • S3 Performance: S3 Performance Guidelines (full document). Request rate optimization, multipart upload, transfer acceleration.
  • S3 Security: "AWS for Solutions Architects" by Saurabh Shrivastava, Ch. 5 (Storage Services). Bucket policies, ACLs, presigned URLs, encryption options.
  • Lifecycle Management: S3 Lifecycle Configuration, Lifecycle section. Storage class transitions, expiration policies, cost optimization.

Security & IAM

  • IAM Policy Fundamentals: "AWS for Solutions Architects" by Saurabh Shrivastava, Ch. 4 (Security on AWS). Policy structure, evaluation logic, least privilege.
  • IAM Roles Deep Dive: AWS IAM User Guide, Roles section. Trust policies, AssumeRole, instance profiles, cross-account access.
  • Security Best Practices: AWS Security Best Practices (full document). MFA, key rotation, policy conditions, service control policies.
  • Secrets Management: "AWS for Solutions Architects" by Saurabh Shrivastava, Ch. 4 (Security on AWS). Secrets Manager vs Parameter Store, rotation strategies.

Infrastructure as Code

  • Terraform Fundamentals: HashiCorp Terraform Tutorials, "Get Started on AWS". State management, resource dependencies, modules.
  • CloudFormation Deep Dive: "AWS for Solutions Architects" by Saurabh Shrivastava, Ch. 13 (Infrastructure as Code). Stack operations, drift detection, change sets.
  • Terraform State Management: Terraform State Documentation, State section. Remote state, locking, workspaces, state migration.
  • IaC Best Practices: Terraform Best Practices (full guide). Module structure, naming conventions, security scanning.

Observability & Monitoring

  • CloudWatch Fundamentals: "AWS for Solutions Architects" by Saurabh Shrivastava, Ch. 12 (Monitoring and Logging). Logs, metrics, alarms, dashboards.
  • Distributed Tracing: "Designing Data-Intensive Applications" by Martin Kleppmann, Ch. 1 (Reliable, Scalable, Maintainable Applications). Understanding observability in distributed systems.
  • X-Ray Deep Dive: AWS X-Ray Developer Guide (full guide). Trace segments, service maps, sampling rules.
  • Log Analysis Patterns: CloudWatch Logs Insights Tutorial, Insights Query Syntax. Query language, aggregation, performance debugging.

Multi-Service Architecture

  • Well-Architected Framework: AWS Well-Architected Framework (all 6 pillars). Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability.
  • Distributed Systems Patterns: "Designing Data-Intensive Applications" by Martin Kleppmann, Ch. 8-12. Replication, partitioning, consistency, consensus, batch vs stream processing.
  • SaaS Architecture: "AWS for Solutions Architects" by Saurabh Shrivastava, Ch. 10-11 (Advanced Architectures). Multi-tenancy, isolation models, data partitioning strategies.
  • Event-Driven Patterns: AWS Serverless Patterns Collection (browse patterns). Real-world architectures, EventBridge patterns, async workflows.

How to use this table: When you hit a specific challenge in a project (e.g., “My Lambda can’t access RDS in my VPC”), consult the relevant sections rather than reading books sequentially. Learn just-in-time based on what you’re building.


Project 1: Production-Ready VPC from Scratch (with Terraform)

  • File: AWS_DEEP_DIVE_LEARNING_PROJECTS.md
  • Programming Language: HCL (Terraform)
  • Coolness Level: Level 1: Pure Corporate Snoozefest
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Cloud Networking / Infrastructure as Code
  • Software or Tool: AWS / Terraform
  • Main Book: “AWS for Solutions Architects” by Saurabh Shrivastava

What you’ll build: A multi-AZ VPC with public/private subnets, NAT gateways, bastion host, and a deployed web application—all defined in Terraform that you can destroy and recreate at will.

Why it teaches AWS Networking: You cannot understand VPCs by clicking through the console. You need to see what happens when a private subnet has no route to a NAT gateway, when a security group blocks outbound traffic, or when your CIDR blocks overlap with a peered VPC. Building with IaC forces you to explicitly declare every component and understand their relationships.

Core challenges you’ll face:

  • CIDR Planning (maps to IP addressing): Designing non-overlapping IP ranges that allow future growth and peering
  • Routing Logic (maps to network architecture): Understanding why private subnets route to NAT vs public subnets route to IGW
  • Security Layers (maps to defense in depth): Configuring security groups (stateful) vs NACLs (stateless) and knowing when to use each
  • High Availability (maps to AZ architecture): Deploying across multiple AZs with redundant NAT gateways
  • Bastion Access (maps to secure access patterns): SSH tunneling through a jump host to reach private instances

  • Difficulty: Intermediate
  • Time estimate: 1-2 weeks
  • Prerequisites: Basic AWS console navigation, understanding of IP addresses, familiarity with any IaC tool

Real world outcome:

  • A fully functional VPC you can SSH into via bastion host
  • A simple web server accessible via public IP
  • Terraform code you can terraform destroy and terraform apply to recreate everything
  • VPC Flow Logs showing actual traffic patterns in CloudWatch

Learning milestones:

  1. First milestone: Deploy VPC with public subnet only, web server reachable from internet → you understand IGW + route tables
  2. Second milestone: Add private subnet with NAT, move app server there, bastion-only access → you understand public vs private architecture
  3. Third milestone: Multi-AZ deployment with ALB → you understand high availability patterns
  4. Final milestone: Add VPC Flow Logs and analyze traffic → you understand network observability

Real World Outcome

This is what your working VPC infrastructure looks like when you complete this project:

# 1. Deploy the infrastructure with Terraform
$ cd terraform/vpc-project
$ terraform init
Initializing provider plugins...
- Downloading hashicorp/aws v5.31.0...
Terraform has been successfully initialized!

$ terraform apply
Plan: 23 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Enter a value: yes

aws_vpc.main: Creating...
aws_vpc.main: Creation complete after 2s [id=vpc-0abc123def456789]
aws_internet_gateway.main: Creating...
aws_subnet.public_a: Creating...
aws_subnet.public_b: Creating...
aws_subnet.private_a: Creating...
aws_subnet.private_b: Creating...
...
Apply complete! Resources: 23 added, 0 changed, 0 destroyed.

Outputs:
vpc_id = "vpc-0abc123def456789"
public_subnet_ids = ["subnet-0aaa111", "subnet-0bbb222"]
private_subnet_ids = ["subnet-0ccc333", "subnet-0ddd444"]
bastion_public_ip = "54.123.45.67"
web_server_private_ip = "10.0.10.50"
nat_gateway_ip = "52.87.123.45"

# 2. SSH into the bastion host
$ ssh -i ~/.ssh/aws-bastion.pem ec2-user@54.123.45.67
The authenticity of host '54.123.45.67' can't be established.
Are you sure you want to continue connecting (yes/no)? yes

       __|  __|_  )
       _|  (     /   Amazon Linux 2
      ___|\___|___|

[ec2-user@bastion ~]$ hostname
ip-10-0-1-25.ec2.internal

# 3. From bastion, SSH into private web server (SSH agent forwarding)
[ec2-user@bastion ~]$ ssh 10.0.10.50
[ec2-user@webserver ~]$ hostname
ip-10-0-10-50.ec2.internal

# 4. Verify the web server can reach internet via NAT
[ec2-user@webserver ~]$ curl -s https://ifconfig.me
52.87.123.45   # The NAT Gateway's public IP! The web server has no public IP of its own

# 5. Verify the web server cannot be reached directly from internet
$ curl http://10.0.10.50  # From your local machine
curl: (7) Failed to connect to 10.0.10.50 port 80: No route to host
# CORRECT! Private subnet is not reachable from internet

# 6. Check VPC Flow Logs in CloudWatch
$ aws logs filter-log-events \
    --log-group-name /aws/vpc/flowlogs \
    --filter-pattern "[version, account, eni, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log_status]" \
    --limit 5 \
    --profile douglascorrea_io --no-cli-pager

{
    "events": [
        {
            "message": "2 123456789012 eni-0abc123 10.0.1.25 54.123.45.67 22 52341 6 25 4500 1703264400 1703264460 ACCEPT OK",
            "ingestionTime": 1703264465000
        },
        {
            "message": "2 123456789012 eni-0def456 10.0.10.50 52.87.123.45 443 32145 6 10 1500 1703264400 1703264460 ACCEPT OK",
            "ingestionTime": 1703264465000
        }
    ]
}
# Flow logs show: SSH to bastion (ACCEPT), HTTPS from private instance via NAT (ACCEPT)

# 7. Describe your VPC and subnets
$ aws ec2 describe-vpcs --vpc-ids vpc-0abc123def456789 \
    --query 'Vpcs[0].{VpcId:VpcId,CIDR:CidrBlock,State:State}' \
    --output table --no-cli-pager

-------------------------------------------
|              DescribeVpcs               |
+--------------+-------------+------------+
|     CIDR     |    State    |   VpcId    |
+--------------+-------------+------------+
| 10.0.0.0/16  |  available  | vpc-0abc123|
+--------------+-------------+------------+

$ aws ec2 describe-subnets --filters "Name=vpc-id,Values=vpc-0abc123def456789" \
    --query 'Subnets[*].{SubnetId:SubnetId,AZ:AvailabilityZone,CIDR:CidrBlock,Public:MapPublicIpOnLaunch}' \
    --output table --no-cli-pager

------------------------------------------------------------
|                    DescribeSubnets                       |
+--------------+---------------+------------+--------------+
|     AZ       |     CIDR      |   Public   |   SubnetId   |
+--------------+---------------+------------+--------------+
|us-east-1a    |10.0.1.0/24    |   True     |subnet-0aaa111|
|us-east-1b    |10.0.2.0/24    |   True     |subnet-0bbb222|
|us-east-1a    |10.0.10.0/24   |   False    |subnet-0ccc333|
|us-east-1b    |10.0.20.0/24   |   False    |subnet-0ddd444|
+--------------+---------------+------------+--------------+

# 8. View route tables to understand routing
$ aws ec2 describe-route-tables --filters "Name=vpc-id,Values=vpc-0abc123def456789" \
    --query 'RouteTables[*].{RouteTableId:RouteTableId,Routes:Routes[*].{Destination:DestinationCidrBlock,Target:GatewayId||NatGatewayId}}' \
    --no-cli-pager

# Public route table: 0.0.0.0/0 → igw-xxxxx (Internet Gateway)
# Private route table: 0.0.0.0/0 → nat-xxxxx (NAT Gateway)

# 9. Clean up when done (saves money!)
$ terraform destroy
Plan: 0 to add, 0 to change, 23 to destroy.

Do you want to perform these actions?
  Enter a value: yes

aws_instance.web_server: Destroying...
aws_nat_gateway.main: Destroying...
...
Destroy complete! Resources: 23 destroyed.

AWS Console View:

  • VPC Dashboard showing your VPC with proper CIDR block
  • Subnet map showing public subnets with Internet Gateway icon, private subnets with NAT icon
  • Route tables tab showing explicit routes for each subnet type
  • Security Groups showing inbound/outbound rules
  • VPC Flow Logs showing real network traffic in CloudWatch

The Core Question You’re Answering

“What IS a VPC, and how does network traffic actually flow between the internet, public subnets, private subnets, and databases?”

Before you write any Terraform code, sit with this question. Most developers have a vague sense that “VPCs are networks” but can’t explain:

  • Why a public subnet can receive traffic from the internet but a private subnet cannot
  • What the difference between a route table entry and a security group rule is
  • Why you need a NAT Gateway (and why it costs money) for private instances to download packages
  • How data flows when a user hits your website: ALB → EC2 → RDS and back

This project forces you to confront these questions because misconfiguration means your infrastructure doesn’t work.

Concepts You Must Understand First

Stop and research these before coding:

  1. CIDR Blocks and Subnetting
    • What does 10.0.0.0/16 actually mean in binary?
    • How many IP addresses are in a /24 subnet?
    • Why does AWS reserve 5 IPs per subnet?
    • What happens if two VPCs have overlapping CIDR blocks and you try to peer them?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” Ch. 11 (Network Programming) - Bryant & O’Hallaron
  2. Routing and the Default Gateway (0.0.0.0/0)
    • What does 0.0.0.0/0 mean in a route table?
    • When a packet leaves an EC2 instance, how does the VPC know where to send it?
    • What’s the difference between local route and igw-xxxxx route?
    • Why does longest prefix match matter?
    • Book Reference: “TCP/IP Illustrated, Volume 1” Ch. 9 (IP Routing) - W. Richard Stevens
  3. Internet Gateway vs NAT Gateway
    • What does “stateful NAT” mean?
    • Why can’t you just put an Internet Gateway on a private subnet?
    • How does a NAT Gateway translate private IPs to public IPs?
    • Why does the NAT Gateway need to be in a public subnet?
    • Book Reference: “AWS for Solutions Architects” Ch. 3 - Saurabh Shrivastava
  4. Security Groups (Stateful) vs NACLs (Stateless)
    • What does “stateful” mean in terms of network connections?
    • If you allow inbound port 80, do you need an outbound rule for the response?
    • Why do NACLs need explicit rules for ephemeral ports?
    • When would you use a NACL instead of just security groups?
    • Book Reference: “AWS for Solutions Architects” Ch. 4 (Security on AWS) - Saurabh Shrivastava
  5. High Availability and Availability Zones
    • What is an Availability Zone physically?
    • Why do you need subnets in multiple AZs?
    • What happens if us-east-1a fails but you only deployed to that AZ?
    • How does an ALB distribute traffic across AZs?
    • Book Reference: “AWS for Solutions Architects” Ch. 2 (AWS Global Infrastructure)

Questions to Guide Your Design

Before implementing, think through these:

  1. CIDR Planning
    • How many IP addresses will you need now? In 5 years?
    • What if you need to add a third AZ later?
    • What if you need to peer with another VPC that uses 10.0.0.0/16?
    • How will you segment: web tier, app tier, database tier?
  2. Public vs Private Decisions
    • What resources MUST be in public subnets? (Hint: very few)
    • What resources should NEVER be in public subnets? (Hint: databases!)
    • How will private resources get software updates from the internet?
  3. Bastion Host Design
    • Why use a bastion instead of putting EC2 in public subnet with SSH?
    • How do you secure the bastion itself?
    • Should you use Session Manager instead of SSH? (Yes, probably)
  4. Security Group Strategy
    • Can you reference one security group from another?
    • How do you allow app servers to talk to databases without opening the database to everything?
    • What’s the principle of least privilege in security group terms?
  5. Cost Considerations
    • How much does a NAT Gateway cost per hour? Per GB transferred?
    • Do you need one NAT Gateway per AZ or can you share?
    • What’s the trade-off between cost and availability?

Thinking Exercise

Before coding, draw this diagram on paper:

Your Task: Fill in the missing pieces

Internet
    │
    ▼
┌───────────────────────────────────────────────────────────────┐
│  VPC: 10.0.0.0/16                                              │
│                                                                │
│  ┌─────────────────────────┐  ┌─────────────────────────────┐ │
│  │ Public Subnet A         │  │ Public Subnet B              │ │
│  │ CIDR: ____________      │  │ CIDR: ____________           │ │
│  │                         │  │                              │ │
│  │ What goes here?         │  │ What goes here?              │ │
│  │ ___________________     │  │ ___________________          │ │
│  └─────────────────────────┘  └─────────────────────────────┘ │
│              │                            │                    │
│              ▼                            ▼                    │
│  ┌─────────────────────────┐  ┌─────────────────────────────┐ │
│  │ Private Subnet A        │  │ Private Subnet B             │ │
│  │ CIDR: ____________      │  │ CIDR: ____________           │ │
│  │                         │  │                              │ │
│  │ What goes here?         │  │ What goes here?              │ │
│  │ ___________________     │  │ ___________________          │ │
│  └─────────────────────────┘  └─────────────────────────────┘ │
│                                                                │
│  Route Table (Public):                                         │
│  ___________________ → ___________________                     │
│  ___________________ → ___________________                     │
│                                                                │
│  Route Table (Private):                                        │
│  ___________________ → ___________________                     │
│  ___________________ → ___________________                     │
└────────────────────────────────────────────────────────────────┘

Questions while drawing:

  • Why does each public subnet need a different CIDR?
  • What AWS resource creates the connection to the internet?
  • How does traffic from the private subnet reach the internet?
  • What happens to the response traffic?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Walk me through how a request from a user’s browser reaches your web server in a private subnet.”
    • Expected: User → Internet → ALB (public subnet) → EC2 (private subnet) → Response reverses
  2. “Why is your database in a private subnet? How does it get software updates?”
    • Expected: Security - no direct internet access. Updates via NAT Gateway or VPC endpoints.
  3. “Your app in us-east-1a can’t reach the database in us-east-1b. What do you check?”
    • Expected: Security groups (allow from app SG?), NACLs (if modified), Route tables, VPC peering (if different VPCs)
  4. “What’s the difference between an Internet Gateway and a NAT Gateway?”
    • Expected: IGW provides two-way communication (public IPs). NAT provides outbound-only for private resources.
  5. “Your NAT Gateway bill is $500/month. How do you reduce it?”
    • Expected: Check if you have one per AZ (maybe share), use VPC endpoints for AWS services (S3, DynamoDB), check for excessive data transfer
  6. “Explain security groups vs NACLs. When would you use each?”
    • Expected: SG = stateful, instance-level, allow rules only. NACL = stateless, subnet-level, allow/deny rules. Use SG for app logic, NACL for subnet-wide rules or explicit denies.
  7. “What CIDR block would you use for a VPC? Why?”
    • Expected: RFC 1918 private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). Consider future growth and peering requirements.

Hints in Layers

Hint 1: Start with the VPC resource

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "production-vpc"
  }
}

Hint 2: Create subnets with data source for AZs

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "public" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  map_public_ip_on_launch = true  # Auto-assign public IPs (the IGW route in Hint 3 is what actually makes a subnet "public")

  tags = {
    Name = "public-${count.index + 1}"
  }
}

Hint 3: Internet Gateway requires explicit route

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public" {
  count          = length(aws_subnet.public)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

Hint 4: NAT Gateway needs Elastic IP first

resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id  # NAT must be in PUBLIC subnet

  depends_on = [aws_internet_gateway.main]
}
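
The hints above stop just short of the piece that makes the NAT Gateway useful: the private subnets' route table. A minimal sketch, assuming private subnets defined with the same count pattern as Hint 2 (the aws_subnet.private name is an illustrative assumption):

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id  # NAT Gateway, not the IGW
  }
}

resource "aws_route_table_association" "private" {
  count          = length(aws_subnet.private)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}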

Hint 5: Security Group allowing SSH from bastion only

resource "aws_security_group" "web" {
  name   = "web-server-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port       = 22
    to_port         = 22
    protocol        = "tcp"
    security_groups = [aws_security_group.bastion.id]  # Only from bastion!
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Books That Will Help

  • VPC Fundamentals: “AWS for Solutions Architects” by Saurabh Shrivastava, Ch. 3 (Networking on AWS). Best comprehensive coverage of VPC architecture patterns and design decisions.
  • Security Groups & IAM: “AWS for Solutions Architects” by Saurabh Shrivastava, Ch. 4 (Security on AWS). Security group vs NACL differences, IAM roles for EC2.
  • TCP/IP Fundamentals: “TCP/IP Illustrated, Volume 1” by W. Richard Stevens, Ch. 1-3 and Ch. 9 (IP Routing). Deep understanding of how packets flow and how routing works.
  • CIDR & IP Addressing: “Computer Networks” by Tanenbaum & Wetherall, Ch. 5 (Network Layer). Mathematical foundation of IP addressing and subnetting.
  • Terraform Basics: “Terraform: Up & Running” by Yevgeniy Brikman, Ch. 2-3 (Terraform State). Managing infrastructure as code and its state.
  • Network Security: “The Linux Programming Interface” by Michael Kerrisk, Ch. 59-61 (Sockets). Understanding network connections at the OS level.
  • AWS Well-Architected: AWS Well-Architected Framework (free), Security Pillar. Official AWS best practices for secure VPC design.

Reading strategy:

  1. Start with “AWS for Solutions Architects” Ch. 3 (VPC overview)
  2. Read “TCP/IP Illustrated” Ch. 9 if routing concepts are unclear
  3. Refer to “Terraform: Up & Running” as you write your code
  4. Use AWS Well-Architected Security Pillar as a checklist

Project 2: Serverless Data Pipeline (Lambda + Step Functions + S3)

  • File: AWS_DEEP_DIVE_LEARNING_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: TypeScript, Go, Java
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: Level 3: The “Service & Support” Model
  • Difficulty: Level 2: Intermediate (The Developer)
  • Knowledge Area: Serverless, Event-Driven Architecture
  • Software or Tool: AWS Lambda, Step Functions, S3
  • Main Book: “AWS for Solutions Architects” by Shrivastava et al.

What you’ll build: An automated data processing pipeline that triggers when files land in S3, orchestrates multiple Lambda functions through Step Functions, handles errors gracefully, and outputs processed results—all without a single server to manage.

Why it teaches Serverless: Serverless is not “just deploy a function.” It’s about understanding event-driven architecture, dealing with cold starts, managing state across stateless functions, and designing for failure. Step Functions force you to think about workflow as a first-class concept.

Core challenges you’ll face:

  • Event-Driven Triggers (maps to decoupled architecture): Configuring S3 event notifications to invoke Lambda
  • State Machine Design (maps to workflow orchestration): Modeling your pipeline as explicit states with transitions
  • Error Handling (maps to resilience): Implementing retries, catch blocks, and fallback logic in Step Functions
  • Lambda Limits (maps to serverless constraints): Working within 15-minute timeout, /tmp storage, memory limits
  • IAM Execution Roles (maps to least-privilege security): Granting Lambda only the permissions it needs

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: Basic Python/Node.js, understanding of JSON, Project 1 completed (VPC knowledge)

Real world outcome:

  • Drop a CSV file into an S3 bucket and watch it get processed automatically
  • Step Functions visual console showing your workflow executing in real-time
  • Processed output appearing in a destination bucket
  • CloudWatch logs showing each Lambda invocation
  • Error handling that sends notifications when things fail

Learning milestones:

  1. First milestone: Single Lambda triggered by S3 upload → you understand event sources
  2. Second milestone: Chain 3 Lambdas via Step Functions → you understand state machines
  3. Third milestone: Add parallel processing and error handling → you understand resilience patterns
  4. Final milestone: Add SNS notifications and CloudWatch alarms → you understand observability in serverless

Real World Outcome

This is what your working pipeline looks like in action:

# Upload a CSV file to trigger the pipeline
$ aws s3 cp sales_data.csv s3://my-pipeline-bucket/input/ --profile douglascorrea_io --no-cli-pager
upload: ./sales_data.csv to s3://my-pipeline-bucket/input/sales_data.csv

# Check Step Functions execution status
$ aws stepfunctions list-executions \
    --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:DataProcessingPipeline \
    --profile douglascorrea_io --no-cli-pager
{
    "executions": [
        {
            "executionArn": "arn:aws:states:us-east-1:123456789012:execution:DataProcessingPipeline:execution-2024-12-20-15-30-45",
            "name": "execution-2024-12-20-15-30-45",
            "status": "SUCCEEDED",
            "startDate": "2024-12-20T15:30:45.123Z",
            "stopDate": "2024-12-20T15:32:10.456Z"
        }
    ]
}

# View CloudWatch logs for the validation Lambda
$ aws logs tail /aws/lambda/validate-data --since 5m --profile douglascorrea_io --no-cli-pager
2024-12-20T15:30:46.234Z START RequestId: abc123-def456 Version: $LATEST
2024-12-20T15:30:46.567Z INFO Validating CSV structure for sales_data.csv
2024-12-20T15:30:46.789Z INFO Found 1247 records, all valid
2024-12-20T15:30:47.012Z END RequestId: abc123-def456
2024-12-20T15:30:47.045Z REPORT Duration: 811ms Billed Duration: 900ms Memory: 512MB Max Memory Used: 128MB

# View the processed output
$ aws s3 ls s3://my-pipeline-bucket/output/ --profile douglascorrea_io --no-cli-pager
2024-12-20 15:32:10      45632 processed_sales_data.json
2024-12-20 15:32:10       1248 processing_summary.json

Step Functions visual workflow (the AWS Console shows the execution in real time):

  • ValidateData Lambda: SUCCESS (Duration: 811ms)
  • TransformData Lambda: SUCCESS (Duration: 58.2s)
  • Parallel Processing: Both branches SUCCESS
    • CalculateStatistics: 2.3s
    • GenerateReport: 1.8s
  • WriteOutput Lambda: SUCCESS (Duration: 245ms)
  • SNS Notification: Delivered

CloudWatch Logs Insights showing complete pipeline flow with timestamps, error-free execution, and performance metrics for each Lambda invocation.

The Core Question You’re Answering

“How do I build systems that react to events without managing servers, and how do I coordinate multiple functions reliably?”

More specifically:

  • How do I design a system where dropping a file automatically triggers processing without a cron job?
  • How do I chain multiple processing steps when each Lambda is stateless and limited to 15 minutes?
  • How do I handle failures gracefully when any step might time out or throw an error?
  • How do I pass data between Lambdas without coupling them tightly?
  • How do I make my pipeline idempotent so reprocessing the same file produces the same result?

Concepts You Must Understand First

1. Event-Driven Architecture vs Request-Response

Request-Response (traditional):

Client → Server → Database → Server → Client
(synchronous, client waits for entire operation)

Event-Driven (serverless):

Event Producer → Event → Event Consumer
(asynchronous, producer fires and forgets)

Key: Your S3 upload doesn’t “call” your Lambda. It emits an event. This decoupling makes serverless scalable but harder to debug.

2. Lambda Execution Model (Cold Starts, Execution Context Reuse)

Lambda lifecycle:

  1. INIT Phase (cold start): Download code, start environment, initialize runtime (100ms-3s)
  2. INVOKE Phase: Run your handler function
  3. SHUTDOWN Phase: Environment terminated after idle timeout

Context reuse:

# Runs ONCE per execution environment (cold start)
import boto3
s3_client = boto3.client('s3')

# Runs EVERY invocation (warm or cold)
def handler(event, context):
    s3_client.get_object(Bucket='my-bucket', Key='file.csv')

Why it matters: Cold starts add latency. Initialize connections outside the handler to reuse them.

3. State Machines and Workflow Orchestration

With Step Functions:

  • Visual workflow in console
  • Declarative retry/error handling
  • State maintained between steps
  • Execution history for debugging

Step Functions is your distributed transaction coordinator.
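
To make "declarative retry/error handling" concrete, here is a minimal state machine sketch in Terraform. The referenced resources (aws_lambda_function.validate, aws_lambda_function.transform, aws_sns_topic.alerts, aws_iam_role.sfn) are illustrative assumptions, not part of the project's given code:

resource "aws_sfn_state_machine" "pipeline" {
  name     = "DataProcessingPipeline"
  role_arn = aws_iam_role.sfn.arn

  definition = jsonencode({
    StartAt = "ValidateData"
    States = {
      ValidateData = {
        Type     = "Task"
        Resource = aws_lambda_function.validate.arn
        Retry = [{
          ErrorEquals     = ["Lambda.TooManyRequestsException"]  # Transient: retry with backoff
          IntervalSeconds = 2
          MaxAttempts     = 3
          BackoffRate     = 2.0
        }]
        Catch = [{
          ErrorEquals = ["States.ALL"]  # Anything else: route to the failure handler
          Next        = "NotifyFailure"
        }]
        Next = "TransformData"
      }
      TransformData = {
        Type     = "Task"
        Resource = aws_lambda_function.transform.arn
        End      = true
      }
      NotifyFailure = {
        Type     = "Task"
        Resource = "arn:aws:states:::sns:publish"
        Parameters = {
          TopicArn    = aws_sns_topic.alerts.arn
          "Message.$" = "$.Cause"  # Catch places {Error, Cause} in the state input
        }
        End = true
      }
    }
  })
}

Notice that the retry and failure-notification logic lives in the workflow definition, not in the function code. That separation is the point.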

4. Idempotency and Exactly-Once Processing

The problem: S3 events can be delivered more than once. Step Functions retries can duplicate processing.

Idempotent design:

import hashlib
import json

import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client('s3')

# Inside the handler: generate a deterministic ID from the input event
idempotency_key = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()

# Check if already processed. Note: head_object raises ClientError with a
# 404 code (not s3_client.exceptions.NoSuchKey) when the object is absent.
try:
    s3_client.head_object(Bucket='results', Key=f'processed/{idempotency_key}.json')
    return {"status": "already_processed"}
except ClientError as e:
    if e.response['Error']['Code'] != '404':
        raise
    # Not processed yet, continue

5. IAM Execution Roles vs Resource-Based Policies

  • Execution Role (what Lambda can do): s3:GetObject, s3:PutObject
  • Resource-Based Policy (who can invoke Lambda): Allow S3 service to invoke

You need BOTH for S3 event triggers to work.
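
In Terraform terms, the two halves look roughly like this (a sketch; aws_iam_role.lambda_exec, aws_lambda_function.validate, and aws_s3_bucket.input are assumed to exist elsewhere in your configuration):

# Half 1: execution role policy - what the function is allowed to do
resource "aws_iam_role_policy" "lambda_s3_read" {
  role = aws_iam_role.lambda_exec.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = "${aws_s3_bucket.input.arn}/*"
    }]
  })
}

# Half 2: resource-based policy - who is allowed to invoke the function
resource "aws_lambda_permission" "allow_s3" {
  statement_id  = "AllowS3Invoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.validate.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.input.arn  # Only this bucket may invoke
}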

Questions to Guide Your Design

  1. What triggers your first Lambda? S3 event notification? SNS? EventBridge?
  2. How do you pass data between Lambda functions? Via Step Functions state? S3? SQS?
  3. What happens when a Lambda times out mid-execution? Does Step Functions retry? From scratch or resume?
  4. How do you handle partial failures? If 3 out of 5 parallel tasks succeed, proceed or fail?
  5. What errors are transient vs permanent? ThrottlingException (retry) vs InvalidDataFormat (alert)?
  6. What if the input file is 5GB? Lambda has 10GB /tmp max, 15-minute timeout.
  7. How do you maintain data lineage? Which output came from which input?
  8. How do you debug a failed execution? Step Functions history? CloudWatch Logs?

Thinking Exercise

Draw your event flow diagram:

S3 Bucket (input/)
     |
     | S3 Event Notification
     ↓
Lambda: ValidateData
     |
     ↓
Step Functions: DataProcessingStateMachine
     |
     ├→ Choice: Valid data?
     |   ├─ NO → Lambda: SendAlert → END
     |   └─ YES ↓
     |
     ├→ Lambda: TransformData
     ↓
     ├→ Parallel State:
     |   ├─ Lambda: CalculateStatistics
     |   └─ Lambda: GenerateReport
     ↓
     ├→ Lambda: WriteOutput
     ↓
     └→ SNS: SendSuccessNotification

Now answer:

  • Where does data live at each stage?
  • What happens if execution is paused for 1 year? (max execution time)
  • Maximum file size your pipeline can handle?
  • Cost for processing 1000 files?

The Interview Questions They’ll Ask

Q: “Explain synchronous vs asynchronous Lambda invocations.”

Expected answer:

  • Synchronous: Caller waits, gets return value (API Gateway, ALB)
  • Asynchronous: Caller gets 202 Accepted, Lambda processes later (S3, SNS)
  • Async has built-in retry (2x) and dead-letter queue support
  • In your pipeline: S3 invokes ValidateData async, Step Functions invokes Lambdas synchronously

Q: “How do you handle Lambda timeout mid-processing?”

Expected answer:

  • Chunked processing: Read in chunks, maintain cursor in DynamoDB
  • Recursive invocation: Lambda invokes itself with updated state
  • Step Functions Map state: Split into chunks, process in parallel
  • Alternative: Use Fargate for long-running tasks (no 15min limit)

Q: “What are Lambda cold starts and how did they affect your pipeline?”

Expected answer:

  • Cold start = INIT phase (100ms-3s) when scaling up or after idle
  • Mitigation: Provisioned concurrency, smaller packages, init code outside handler
  • In your pipeline: “Measured ~800ms cold starts, acceptable for batch processing”

Q: “Why use Step Functions instead of chaining Lambdas?”

Expected answer:

  • Visual workflow, declarative error handling, state management
  • Execution history shows inputs/outputs/durations
  • Parallel, Map, Wait states built-in

Q: “How do you pass large datasets between Step Functions states?”

Expected answer:

  • Step Functions has 256KB limit per state
  • Store data in S3, pass S3 URI in state
  • Use ResultPath to merge results without duplicating data

Q: “IAM permissions for S3 to trigger Lambda?”

Expected answer:

  • Lambda execution role (what Lambda can do): s3:GetObject, logs:CreateLogGroup
  • Lambda resource-based policy (who can invoke): Allow S3 service, condition on bucket ARN
  • S3 bucket notification config pointing to Lambda ARN

Q: “Cost for running pipeline 1000 times/day?”

Expected answer breakdown:

  • Lambda: Invocations ($0.20/1M) + Duration ($0.0000166667/GB-sec)
  • Step Functions: State transitions ($25/1M)
  • S3: Requests + Storage
  • Show calculation

Hints in Layers

Hint 1.1: What triggers the pipeline? Use S3 Event Notifications with object created events. Configure filter by prefix (input/) and suffix (.csv).
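
A sketch of that configuration in Terraform, reusing the illustrative names from the concept section above (the notification cannot be created until the aws_lambda_permission exists, hence the depends_on):

resource "aws_s3_bucket_notification" "pipeline_trigger" {
  bucket = aws_s3_bucket.input.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.validate.arn
    events              = ["s3:ObjectCreated:*"]
    filter_prefix       = "input/"
    filter_suffix       = ".csv"
  }

  depends_on = [aws_lambda_permission.allow_s3]
}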

Hint 1.2: How should Lambdas communicate? Use Step Functions to orchestrate. Pass small metadata in state, store large data in S3.

Hint 2.1: Lambda function structure

import boto3
from aws_lambda_powertools import Logger, Tracer

logger = Logger()
tracer = Tracer()
s3_client = boto3.client('s3')  # Initialize outside handler

@logger.inject_lambda_context
@tracer.capture_lambda_handler
def handler(event, context):
    bucket = event['bucket']
    key = event['key']

    response = s3_client.get_object(Bucket=bucket, Key=key)
    # Process...

    return {"valid": True, "rowCount": 1247}

Hint 3.1: S3 event not triggering Lambda? Check:

  1. CloudWatch Logs for Lambda
  2. S3 notification config: aws s3api get-bucket-notification-configuration
  3. Lambda resource-based policy: aws lambda get-policy
  4. Event filter matches your file

Hint 3.2: Lambda cold starts too slow?

  • Check package size
  • Lazy import heavy libraries
  • Use Lambda Layers
  • Consider Provisioned Concurrency

Books That Will Help

  • Serverless Architecture: “AWS for Solutions Architects” by Saurabh Shrivastava, Ch. 8-9 (Serverless; Containers vs Lambda). When to use Lambda vs Fargate, event-driven patterns.
  • Distributed Systems: “Designing Data-Intensive Applications” by Martin Kleppmann, Ch. 11-12 (Stream Processing). Event-driven architecture, idempotency, exactly-once semantics.
  • Lambda Deep Dive: “AWS Lambda in Action” by Danilo Poccia, Ch. 2, 4, 8 (First Lambda; Data Streams; Optimization). Cold start optimization, event source integration.
  • Step Functions: AWS Step Functions Developer Guide, the Error Handling and ASL Specification sections. State machine syntax, retry/catch patterns.
  • IAM Security: “AWS Security” by Dylan Shields, Ch. 3, 5 (IAM; Data Protection). Lambda execution roles, resource-based policies.
  • Terraform: “Terraform: Up & Running” by Yevgeniy Brikman, Ch. 3, 7 (State; Multiple Providers). Deploy Step Functions + Lambda as code.
  • Observability: “Practical Monitoring” by Mike Julian, Ch. 4, 6 (Applications; Alerting). Metrics and alerts for serverless.
  • Cost Optimization: “AWS Cost Optimization” by Brandon Carroll, Ch. 5, 7 (Lambda; Storage). Memory tuning, S3 lifecycle policies.

Reading strategy:

  1. Start with “AWS for Solutions Architects” Ch. 8 (serverless overview)
  2. Read “Designing Data-Intensive Applications” Ch. 11 (event-driven concepts)
  3. Dive into “AWS Lambda in Action” Ch. 4 (event sources) and Ch. 8 (optimization)
  4. Refer to Step Functions Developer Guide when writing state machine
  5. Use “Terraform: Up & Running” Ch. 3 as you build infrastructure

Project 3: Auto-Scaling Web Application (EC2 + ALB + RDS + S3)

  • File: AWS_DEEP_DIVE_LEARNING_PROJECTS.md
  • Programming Language: HCL (Terraform)
  • Coolness Level: Level 1: Pure Corporate Snoozefest
  • Business Potential: Level 3: The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Cloud Infrastructure / Scalability
  • Software or Tool: AWS (EC2, ALB, RDS)
  • Main Book: “AWS for Solutions Architects” by Saurabh Shrivastava

What you’ll build: A traditional multi-tier web application with load-balanced EC2 instances that scale based on CPU/request metrics, backed by RDS (Aurora or PostgreSQL), with static assets served from S3/CloudFront.

Why it teaches EC2 & Core AWS: This is the “classic” AWS architecture. Understanding auto-scaling groups, launch templates, health checks, and how ALB distributes traffic is foundational. You’ll also learn why RDS simplifies database ops and how S3+CloudFront offloads static content.

Core challenges you’ll face:

  • Launch Templates (maps to instance configuration): Defining AMI, instance type, user data scripts, IAM instance profiles
  • Auto Scaling Policies (maps to elasticity): Configuring scaling based on CloudWatch metrics (CPU, request count, custom)
  • Health Checks (maps to reliability): Understanding EC2 vs ELB health checks and how they affect scaling
  • Database Connectivity (maps to networking): RDS in private subnet, security group rules from app tier only
  • Static Asset Optimization (maps to performance): S3 origin with CloudFront distribution, cache behaviors

Difficulty: Intermediate
Time estimate: 2-3 weeks
Prerequisites: Project 1 (VPC), basic web development, familiarity with databases

Real world outcome:

  • A working web application URL you can share
  • Watch instances spin up when you load test (use hey or Apache Bench)
  • See instances terminate when load decreases
  • Access your database via RDS console or psql through bastion
  • Fast-loading static assets via CloudFront with cache hits

Learning milestones:

  1. First milestone: Single EC2 with user data script serving a web app → you understand instance bootstrapping
  2. Second milestone: Add ALB + 2 instances in target group → you understand load balancing
  3. Third milestone: Add Auto Scaling with CPU-based policy → you understand elasticity
  4. Fourth milestone: Add RDS in private subnet → you understand data tier security
  5. Final milestone: Add S3 + CloudFront for static assets → you understand CDN patterns

Real World Outcome

Here’s what success looks like when you complete this project:

# 1. Deploy the infrastructure
$ terraform apply
...
Apply complete! Resources: 23 added, 0 changed, 0 destroyed.

Outputs:
alb_dns_name = "my-app-alb-123456789.us-east-1.elb.amazonaws.com"
rds_endpoint = "myapp-db.c9akl1.us-east-1.rds.amazonaws.com:5432"
cloudfront_domain = "d111111abcdef8.cloudfront.net"

# 2. Verify the application is running
$ curl http://my-app-alb-123456789.us-east-1.elb.amazonaws.com
<html>
  <head><title>My Scalable App</title></head>
  <body>
    <h1>Hello from instance i-0abc123def456!</h1>
    <p>Current time: 2025-12-22 14:32:15</p>
    <p>Database connection: OK</p>
  </body>
</html>

# 3. Load test to trigger auto-scaling (using 'hey' tool)
$ hey -n 10000 -c 100 http://my-app-alb-123456789.us-east-1.elb.amazonaws.com

Summary:
  Total:        45.3216 secs
  Slowest:      2.3451 secs
  Fastest:      0.0234 secs
  Average:      0.4532 secs
  Requests/sec: 220.75

Status code distribution:
  [200] 10000 responses

# 4. Watch instances scale up in real-time (in another terminal)
$ watch "aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names my-app-asg \
    --query 'AutoScalingGroups[0].[DesiredCapacity,MinSize,MaxSize,Instances[*].[InstanceId,LifecycleState]]' \
    --output table --no-cli-pager"

Every 2.0s: aws autoscaling...
---------------------------------------------
|  DescribeAutoScalingGroups                |
+-------------------------------------------+
||  2  | 1 | 5                             ||  # Started with 2 instances
+-------------------------------------------+
|||  i-0abc123def456  |  InService        |||
|||  i-0def789ghi012  |  InService        |||
+-------------------------------------------+

# After load increases, watch it scale to 4 instances:
+-------------------------------------------+
||  4  | 1 | 5                             ||  # Scaled up!
+-------------------------------------------+
|||  i-0abc123def456  |  InService        |||
|||  i-0def789ghi012  |  InService        |||
|||  i-0jkl345mno678  |  InService        |||
|||  i-0pqr901stu234  |  InService        |||
+-------------------------------------------+

# 5. Check CloudWatch metrics showing scaling activity
$ aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name TargetResponseTime \
    --dimensions Name=LoadBalancer,Value=app/my-app-alb/50dc6c495c0c9188 \
    --start-time 2025-12-22T14:00:00Z \
    --end-time 2025-12-22T15:00:00Z \
    --period 300 \
    --statistics Average \
    --no-cli-pager

Datapoints:
- Timestamp: 2025-12-22T14:00:00Z, Average: 0.032
- Timestamp: 2025-12-22T14:05:00Z, Average: 0.125  # Load increasing
- Timestamp: 2025-12-22T14:10:00Z, Average: 0.456  # Scaling triggered
- Timestamp: 2025-12-22T14:15:00Z, Average: 0.089  # Back to normal after scale-up

# 6. Check ALB target health
$ aws elbv2 describe-target-health \
    --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-app-tg/50dc6c495c0c9188 \
    --no-cli-pager

TargetHealthDescriptions:
- Target:
    Id: i-0abc123def456
    Port: 80
  HealthCheckPort: '80'
  TargetHealth:
    State: healthy
    Reason: Target passed health checks
- Target:
    Id: i-0def789ghi012
    Port: 80
  HealthCheckPort: '80'
  TargetHealth:
    State: healthy
    Reason: Target passed health checks

# 7. View Auto Scaling activity history
$ aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name my-app-asg \
    --max-records 5 \
    --no-cli-pager

Activities:
- ActivityId: 1234abcd-5678-90ef-gh12-ijklmnopqrst
  Description: "Launching a new EC2 instance: i-0jkl345mno678"
  Cause: "At 2025-12-22T14:12:00Z a monitor alarm TargetTracking-my-app-asg-AlarmHigh-... in state ALARM triggered policy my-app-scaling-policy causing a change to the desired capacity from 2 to 4."
  StartTime: 2025-12-22T14:12:15Z
  EndTime: 2025-12-22T14:13:42Z
  StatusCode: Successful

# 8. Check RDS database connection
$ psql -h myapp-db.c9akl1.us-east-1.rds.amazonaws.com -U dbadmin -d myappdb
Password:
myappdb=> SELECT current_database(), current_user, inet_server_addr();
 current_database | current_user | inet_server_addr
------------------+--------------+------------------
 myappdb          | dbadmin      | 10.0.3.45
(1 row)

myappdb=> SELECT COUNT(*) FROM app_requests;
 count
-------
 10247  # Showing all the requests that were logged
(1 row)

# 9. Check the CloudFront origin configuration
$ aws cloudfront get-distribution-config \
    --id E1234ABCDEFGH \
    --query 'DistributionConfig.Origins.Items[0].DomainName' \
    --output text --no-cli-pager

my-app-static-assets.s3.us-east-1.amazonaws.com

# Test CloudFront delivery
$ curl -I https://d111111abcdef8.cloudfront.net/images/logo.png
HTTP/2 200
content-type: image/png
content-length: 45678
x-cache: Hit from cloudfront  # Cache HIT - fast delivery!
x-amz-cf-pop: SFO5-C1
x-amz-cf-id: abc123...

# 10. After load subsides, watch instances scale down
$ watch "aws autoscaling describe-auto-scaling-groups ..."

+-------------------------------------------+
||  1  | 1 | 5                             ||  # Scaled back down to minimum
+-------------------------------------------+
|||  i-0abc123def456  |  InService        |||
+-------------------------------------------+

CloudWatch Dashboard View:

  • CPU Utilization graph showing spike from 20% → 75% → back to 25%
  • Request Count showing 50 req/sec → 800 req/sec → 60 req/sec
  • Target Response Time showing latency spike then recovery
  • Healthy Host Count showing 2 → 4 → 1 instances over time
  • RDS connections showing stable connection pool usage
  • ALB HTTP 200 responses at 99.8% (a few timeouts during initial spike)

What You’ll See in AWS Console:

  1. EC2 Auto Scaling Groups showing scaling activities
  2. CloudWatch alarms transitioning: OK → ALARM → OK
  3. ALB target groups with health check status
  4. RDS Performance Insights showing query patterns
  5. S3 bucket with static assets
  6. CloudFront distribution with cache statistics

The Core Question You’re Answering

“How do I build applications that automatically handle 10x traffic spikes without falling over, and intelligently shrink when traffic subsides to save costs?”

This is the fundamental problem that auto-scaling solves. Traditional infrastructure requires you to provision for peak load—meaning you’re paying for idle capacity 90% of the time. Auto-scaling lets you provision for average load and dynamically expand when needed.

Why this matters in the real world:

  • E-commerce sites during Black Friday sales
  • News sites during breaking news events
  • SaaS platforms during business hours (scale down at night)
  • API backends handling unpredictable mobile app traffic
  • Gaming servers during new release launches

Without auto-scaling, you have two bad options:

  1. Over-provision: Run 10 servers 24/7 even though you only need them for 2 hours a day → wasted money
  2. Under-provision: Run 2 servers and accept that your site crashes during traffic spikes → lost revenue

Auto-scaling gives you the best of both worlds: cost efficiency during normal periods, reliability during peaks.

Concepts You Must Understand First

Before diving into implementation, you need to internalize these foundational concepts:

1. Horizontal vs Vertical Scaling

Vertical Scaling (Scale Up): Make your server bigger

  • EC2 instance: t3.medium → t3.large → t3.xlarge
  • More CPU, more RAM, same single instance
  • Limitation: Hard ceiling (largest EC2 instance), requires downtime, single point of failure

Horizontal Scaling (Scale Out): Add more servers

  • 1 instance → 3 instances → 10 instances
  • Distribute load across multiple machines
  • Advantage: Theoretically unlimited, no downtime, fault-tolerant

This project teaches horizontal scaling, which is how modern cloud applications achieve massive scale.

2. Stateless Application Design

Stateless: Each request is independent, no memory of previous requests

# Stateless - GOOD for auto-scaling
@app.route('/api/user/<user_id>')
def get_user(user_id):
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    return jsonify(user)
# Any instance can handle this request

Stateful: Application remembers information between requests

# Stateful - BAD for auto-scaling
user_sessions = {}  # In-memory storage

@app.route('/api/cart/add')
def add_to_cart(item_id):
    session_id = request.cookies.get('session')
    user_sessions[session_id].append(item_id)  # Only works if same instance
    return "Added"
# BREAKS when load balancer sends next request to different instance

Why stateless matters for auto-scaling:

  • Load balancer can send requests to any instance
  • Instances can be terminated without data loss
  • New instances can immediately serve traffic

How to handle state:

  • Store sessions in Redis/ElastiCache (shared across instances)
  • Store sessions in DynamoDB
  • Use JWT tokens (state stored client-side)
  • Use sticky sessions (not recommended, defeats auto-scaling benefits)

3. Load Balancer Algorithms

Round Robin (default for ALB):

Request 1 → Instance A
Request 2 → Instance B
Request 3 → Instance C
Request 4 → Instance A (back to start)

Pro: Simple, fair distribution Con: Doesn’t account for instance load

Least Outstanding Requests (ALB option):

Instance A: 5 active requests
Instance B: 2 active requests  ← Send here
Instance C: 8 active requests

Next request goes to Instance B (least busy)

Pro: Better for variable request durations Con: Requires tracking active connections

Weighted Target Groups:

Instance A: weight 100 (gets 50% of traffic)
Instance B: weight 50  (gets 25% of traffic)
Instance C: weight 50  (gets 25% of traffic)

Use case: Blue/green deployments, A/B testing
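
Both choices are a few lines of Terraform. A sketch with illustrative names (the listener and the blue/green target groups are assumed to be defined elsewhere):

# Opt in to least outstanding requests on a target group
resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  load_balancing_algorithm_type = "least_outstanding_requests"  # Default is round_robin
}

# Weighted forwarding across two target groups (blue/green, A/B testing)
resource "aws_lb_listener_rule" "split" {
  listener_arn = aws_lb_listener.http.arn
  priority     = 10

  action {
    type = "forward"

    forward {
      target_group {
        arn    = aws_lb_target_group.blue.arn
        weight = 90  # 90% of traffic stays on blue
      }
      target_group {
        arn    = aws_lb_target_group.green.arn
        weight = 10  # 10% canary on green
      }
    }
  }

  condition {
    path_pattern {
      values = ["/*"]
    }
  }
}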

4. Health Checks: EC2 vs ELB Health Checks

EC2 Health Check (Auto Scaling Group):

  • Checks: Is the instance running? Is the status “OK”?
  • Fails if: Instance stopped, hardware failure, status check failures
  • Does NOT check: Is the application responding?

ELB Health Check (Application Load Balancer):

  • Checks: HTTP GET to /health endpoint, expects 200 response
  • Fails if: Application crashed, database unreachable, timeout (default 5 sec)
  • More comprehensive than EC2 check

Critical difference:

Scenario: EC2 instance is running, but your web app crashed

EC2 Health Check: PASS (instance is running)
ELB Health Check: FAIL (app not responding)

Result: Auto Scaling thinks instance is healthy, keeps it running
         Load balancer marks it unhealthy, stops sending traffic

Solution: Configure ASG to use ELB health checks!

Best practice configuration:

resource "aws_autoscaling_group" "app" {
  health_check_type         = "ELB"  # Use ELB checks, not EC2
  health_check_grace_period = 300    # Wait 5 min for app to start

  # ... other config
}

resource "aws_lb_target_group" "app" {
  health_check {
    path                = "/health"
    interval            = 30      # Check every 30 seconds
    timeout             = 5       # Fail if no response in 5 sec
    healthy_threshold   = 2       # 2 consecutive passes = healthy
    unhealthy_threshold = 3       # 3 consecutive fails = unhealthy
  }
}

5. Launch Templates vs Launch Configurations

Launch Configuration (DEPRECATED, but you’ll see it in old code):

  • Cannot be modified (must create new version)
  • Limited instance type options
  • No instance metadata service v2 support

Launch Template (USE THIS):

  • Versioned (can modify and rollback)
  • Supports mixed instance types
  • Supports Spot instances
  • Supports newer EC2 features
resource "aws_launch_template" "app" {
  name_prefix   = "my-app-"
  image_id      = "ami-0c55b159cbfafe1f0"  # Your AMI
  instance_type = "t3.medium"

  # User data script to configure instance on boot
  user_data = base64encode(<<-EOF
    #!/bin/bash
    cd /opt/myapp
    export DB_HOST=${aws_db_instance.main.endpoint}
    ./start-app.sh
  EOF
  )

  # IAM role for instance
  iam_instance_profile {
    arn = aws_iam_instance_profile.app.arn
  }

  # Security group
  vpc_security_group_ids = [aws_security_group.app.id]

  # Enable detailed monitoring
  monitoring {
    enabled = true
  }
}

Key understanding: Launch template is the blueprint for every instance that auto-scaling creates.

Questions to Guide Your Design

Ask yourself these questions as you build. If you can’t answer them, you don’t understand the architecture yet.

1. How does the ALB know which instances are healthy?

Answer: The ALB continuously sends HTTP requests to each instance’s health check endpoint (e.g., GET /health). If it receives a 200 OK response within the timeout period (default 5 seconds), the instance is marked healthy. If it fails the unhealthy threshold number of times (default 3 consecutive failures), it’s marked unhealthy and removed from rotation.

Follow-up: What should your /health endpoint check?

  • Database connectivity?
  • Disk space?
  • Memory available?
  • Dependency services reachable?

Best practice: Start simple (just return 200), then add checks for critical dependencies.

2. What metric should trigger scaling?

Common options:

CPU Utilization (most common starting point):

resource "aws_autoscaling_policy" "scale_up" {
  name                   = "cpu-scale-up"
  scaling_adjustment     = 1
  adjustment_type        = "ChangeInCapacity"
  cooldown               = 300
  autoscaling_group_name = aws_autoscaling_group.app.name
}

resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "cpu-utilization-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 70  # Scale up when CPU > 70%

  alarm_actions = [aws_autoscaling_policy.scale_up.arn]
}

Problems with CPU-based scaling:

  • I/O-bound applications may never hit CPU threshold
  • Doesn’t capture actual user experience (latency)

Request Count Per Target (better for web apps):

resource "aws_cloudwatch_metric_alarm" "requests_high" {
  metric_name         = "RequestCountPerTarget"
  namespace           = "AWS/ApplicationELB"
  threshold           = 1000  # 1000 requests/target in 5 min
  # ... more config
}

Target Tracking Scaling (recommended - AWS manages it):

resource "aws_autoscaling_policy" "target_tracking" {
  name                   = "target-tracking-policy"
  policy_type            = "TargetTrackingScaling"
  autoscaling_group_name = aws_autoscaling_group.app.name

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 50.0  # Keep average CPU at 50%
  }
}

Decision criteria:

  • Web apps with predictable load per request: Request count
  • CPU-intensive tasks: CPU utilization
  • Most cases: Target tracking with CPU, let AWS handle it
  • Advanced: Custom metrics (latency from your app logs)

3. How do you handle session state with multiple instances?

Option 1: Don’t use sessions (best for APIs)

  • Use JWT tokens
  • All state in the token payload
  • Stateless, works with any instance

Option 2: Shared session store (for traditional web apps)

from flask import Flask, session
from flask_session import Session
import redis

app = Flask(__name__)
app.config['SESSION_TYPE'] = 'redis'
app.config['SESSION_REDIS'] = redis.from_url('redis://elasticache-endpoint:6379')
Session(app)

# Now sessions are stored in Redis, not in-memory
# Any instance can read/write session data

Option 3: Sticky sessions (not recommended)

  • ALB always routes user to same instance
  • Defeats purpose of auto-scaling
  • Lose sessions when instance terminates

Real-world recommendation: Use ElastiCache (Redis) for sessions if you need them.

4. What’s the difference between desired, minimum, and maximum capacity?

resource "aws_autoscaling_group" "app" {
  min_size         = 2   # Never go below 2 (for high availability)
  max_size         = 10  # Never exceed 10 (cost protection)
  desired_capacity = 4   # Start with 4, let scaling adjust

  # ... more config
}

How they work together:

Scenario 1: Normal operation
Current: 4 instances (= desired)
Traffic increases, CPU hits 80%
Scaling policy triggers: desired_capacity = 4 + 2 = 6
Result: Launch 2 new instances

Scenario 2: Traffic spike
Current: 8 instances
Traffic spike continues, CPU still high
Scaling policy wants: desired_capacity = 8 + 2 = 10
Result: Launch 2 more instances (at max_size, stops)

Scenario 3: Traffic drops
Current: 6 instances
CPU drops to 20%, scale-down alarm triggers
Scaling policy: desired_capacity = 6 - 2 = 4
Result: Terminate 2 instances

Scenario 4: Major incident, all instances failing
Current: 0 healthy instances
Auto Scaling: "I need to meet min_size!"
Result: Launches 2 new instances (even if health checks failing)

Best practices:

  • min_size: Enough for high availability (at least 2 in different AZs)
  • max_size: Based on budget and realistic peak load
  • desired_capacity: Let auto-scaling manage it, or set initial value

5. What happens during a deployment to a running auto-scaled application?

Bad approach (causes downtime):

# Update launch template
# Terminate all instances
# New instances launch with new code
# → Downtime while instances come up

Good approach (rolling update):

resource "aws_autoscaling_group" "app" {
  # ...

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 50  # Keep at least 50% healthy during update
    }
  }
}

Process:

  1. Update launch template with new AMI/user data
  2. Trigger instance refresh
  3. Auto Scaling terminates one instance
  4. New instance launches with new config
  5. Wait for health checks to pass
  6. Repeat until all instances replaced

Blue/Green deployment (zero downtime):

  1. Create new ASG with new version
  2. Attach to same target group
  3. Gradually shift traffic (weighted target groups)
  4. Monitor error rates
  5. Fully switch or rollback

Thinking Exercise

Problem: You’re building a web application where each EC2 instance can reliably handle 100 requests per second. You expect peak traffic of 500 requests per second during business hours (9 AM - 5 PM), and only 50 requests per second during off-hours.

Question 1: How many instances do you need at minimum for peak load?

Answer: 500 req/sec ÷ 100 req/sec per instance = 5 instances minimum at peak.

But you should add buffer for:
  • Instance failures
  • Traffic spikes beyond the expected peak
  • Performance degradation under load

Recommended:
  • max_size = 8 (160% of the calculated minimum, allows for a 60% spike)
  • min_size = 2 (high availability during off-hours)
  • desired_capacity = 6 (20% buffer above the minimum requirement)

Question 2: If each instance costs $0.05/hour, how much do you save per month with auto-scaling vs. running peak capacity 24/7?

Answer:

Without auto-scaling (running 5 instances 24/7):
  • Cost = 5 instances × $0.05/hour × 24 hours × 30 days = $180/month

With auto-scaling:
  • Peak hours (9 AM - 5 PM = 8 hours): 5 instances
  • Off-hours (16 hours): 2 instances

Daily cost:
  • Peak: 5 instances × $0.05 × 8 hours = $2.00
  • Off-hours: 2 instances × $0.05 × 16 hours = $1.60
  • Total per day = $3.60

Monthly cost: $3.60 × 30 days = $108/month

Savings: $180 - $108 = $72/month (a 40% reduction). With more granular scaling adjustments, savings are often 50-70%.

Question 3: Your health check interval is 30 seconds, unhealthy threshold is 3. An instance’s application crashes. How long until the ALB stops sending traffic to it?

Answer:
  • Health check every 30 seconds
  • Need 3 consecutive failures
  • Time = 30 seconds × 3 = 90 seconds minimum

During this time, users may experience:
  • Timeout errors (if the health check timeout is 5 seconds, user requests time out too)
  • Failed requests sent to the unhealthy instance

Optimization: Reduce the interval to 10 seconds and the unhealthy threshold to 2:
  • Time to detect = 10 × 2 = 20 seconds
  • Trade-off: more frequent health checks mean slightly more traffic

Question 4: You set scale-up to trigger at 70% CPU, scale-down at 30% CPU. What problem might you encounter?

Answer: Flapping (constant scale-up/scale-down oscillation).

Scenario:

  1. 3 instances at 75% CPU → scale up to 4 instances
  2. Load distributed across 4 → CPU drops to 56% (still above 30%)
  3. Stays stable... then one instance terminates unexpectedly
  4. Load redistributes to 3 instances → CPU back to 75% → scale up again
  5. Repeat forever

Or worse:

  1. 3 instances at 75% CPU → scale up to 4
  2. CPU drops to 28% → scale down to 3
  3. CPU jumps to 75% → scale up to 4
  4. Infinite loop: costs money, destabilizes the app

Solution: Add cooldown periods and a wider gap between thresholds:

resource "aws_autoscaling_policy" "scale_up" {
  cooldown = 300  # Wait 5 minutes before scaling again
  # ...
}

# Use thresholds with buffer:
# Scale up at 70% CPU
# Scale down at 20% CPU (a 50-point gap prevents flapping)

Better solution: Use target tracking scaling:

target_tracking_configuration {
  target_value = 50.0  # AWS automatically manages scale-up/down to maintain this
}

AWS's algorithm is smarter: it prevents flapping and handles cooldowns automatically.

The Interview Questions They’ll Ask

When you claim AWS auto-scaling experience on your resume, expect these questions:

1. Basic Understanding

Q: “Explain the difference between vertical and horizontal scaling. When would you use each?”

Expected answer:

  • Vertical = bigger instance (limited by hardware, downtime required)
  • Horizontal = more instances (unlimited, no downtime, requires stateless design)
  • Use vertical for: Legacy apps that can’t distribute, databases (until you move to RDS)
  • Use horizontal for: Web apps, APIs, stateless services

Q: “What’s the difference between an Application Load Balancer and a Network Load Balancer?”

Expected answer:

  • ALB: Layer 7 (HTTP/HTTPS), content-based routing, WebSockets, host/path routing
  • NLB: Layer 4 (TCP/UDP), ultra-low latency, static IP, millions of req/sec
  • Use ALB for: Web applications, microservices with path routing
  • Use NLB for: Non-HTTP protocols, extreme performance requirements, static IP needed

2. Design Scenarios

Q: “You’re seeing 5xx errors from your ALB. How do you troubleshoot?”

Expected approach:

  1. Check target health in target group (unhealthy instances?)
  2. Check ALB access logs (which endpoints returning 5xx?)
  3. Check application logs on instances (app crashes? database timeouts?)
  4. Check security group rules (instances can reach database?)
  5. Check CloudWatch metrics (CPU/memory maxed out?)

Q: “Your application needs to maintain user sessions. How do you architect this with auto-scaling?”

Expected answer:

  • Option 1: ElastiCache (Redis/Memcached) as shared session store
  • Option 2: DynamoDB for session storage
  • Option 3: JWT tokens (no server-side sessions)
  • NOT sticky sessions (defeats auto-scaling benefits, data loss on instance termination)

3. Scaling Logic

Q: “You set your ASG to min=2, max=10, desired=5. You manually terminate an instance. What happens?”

Expected answer:

  • Current instances: 4 (after termination)
  • Desired capacity: still 5
  • Auto Scaling detects current < desired
  • Launches 1 new instance to reach desired=5

Q: “What’s the difference between target tracking scaling and step scaling?”

Expected answer:

  • Target tracking: Set a target (e.g., “maintain 50% CPU”), AWS automatically scales up/down to maintain it. Simpler, recommended for most use cases.
  • Step scaling: Define explicit rules (e.g., “if CPU > 70%, add 2 instances; if CPU > 85%, add 4 instances”). More control, more complex, use for non-linear scaling needs.

4. Real-World Problem Solving

Q: “Your auto-scaling isn’t triggering when you expect. How do you debug?”

Expected approach:

  1. Check CloudWatch alarms (are they in ALARM state?)
  2. Check alarm history (has threshold actually been crossed?)
  3. Check alarm configuration (right metric? right threshold? evaluation periods?)
  4. Check ASG configuration (is policy attached? cooldown preventing scale?)
  5. Check instance metrics (is data actually being reported?)

Q: “You deployed a new version and now instances are failing health checks. What do you check?”

Expected approach:

  1. SSH to instance, check application logs
  2. Test health check endpoint manually: curl localhost:80/health
  3. Check if app started correctly (check user data script logs)
  4. Check security group (does instance allow traffic on health check port?)
  5. Check health check configuration (path correct? timeout too short?)
  6. Check grace period (is app given enough time to start before checks?)

5. Cost Optimization

Q: “How would you reduce costs for an auto-scaled application?”

Expected strategies:

  • Right-size instances (use smaller instance types if CPU consistently low)
  • Use Spot instances for fault-tolerant workloads (70-90% cheaper)
  • Implement aggressive scale-down (reduce min_size during known low-traffic periods)
  • Use scheduled scaling (scale down automatically at night/weekends; see the sketch after this list)
  • Reserved Instances or Savings Plans for baseline capacity
  • Monitor and optimize unhealthy instance replacement (failing fast vs. retrying)
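
Scheduled scaling is often the easiest win. A sketch, assuming the aws_autoscaling_group.app from earlier examples (times are UTC and purely illustrative):

resource "aws_autoscaling_schedule" "scale_down_nightly" {
  scheduled_action_name  = "scale-down-nightly"
  autoscaling_group_name = aws_autoscaling_group.app.name
  recurrence             = "0 22 * * MON-FRI"  # 10 PM, cron syntax
  min_size               = 1
  max_size               = 10
  desired_capacity       = 1
}

resource "aws_autoscaling_schedule" "scale_up_morning" {
  scheduled_action_name  = "scale-up-morning"
  autoscaling_group_name = aws_autoscaling_group.app.name
  recurrence             = "0 8 * * MON-FRI"   # 8 AM
  min_size               = 2
  max_size               = 10
  desired_capacity       = 4
}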

Q: “Explain Spot instances in the context of auto-scaling. What are the risks?”

Expected answer:

  • Spot = unused EC2 capacity at up to 90% discount
  • Risk: AWS can terminate with 2-minute notice if capacity needed
  • Use in ASG with mixed instance types (Spot + On-Demand); see the sketch below
  • Configure Spot allocation strategy (price-capacity-optimized)
  • Not suitable for: Stateful apps, databases, single-instance workloads
  • Perfect for: Batch processing, web front-ends (with On-Demand baseline)
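
A sketch of the mixed-instances setup described above, assuming the launch template from this project (the fragment omits the size and subnet settings shown earlier):

resource "aws_autoscaling_group" "app" {
  # ... min_size, max_size, vpc_zone_identifier as before

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2   # Always keep 2 On-Demand instances
      on_demand_percentage_above_base_capacity = 25  # Beyond the base, 75% Spot
      spot_allocation_strategy                 = "price-capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.app.id
        version            = "$Latest"
      }

      # Several instance types improve the odds of finding Spot capacity
      override {
        instance_type = "t3.medium"
      }
      override {
        instance_type = "t3a.medium"
      }
    }
  }
}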

Hints in Layers

When you get stuck, reveal hints progressively instead of jumping to the solution:

Problem: Instances launching but failing health checks immediately

Hint 1 (First check)

Check the health check grace period. Your application might need time to start up.

resource "aws_autoscaling_group" "app" {
  health_check_grace_period = 300  # Seconds to wait before health checks
}

If your app takes 3 minutes to initialize but the grace period is 30 seconds, instances will be terminated before they're ready.

Hint 2 (If still failing)

SSH into a failing instance and test the health check endpoint manually:

# From within the instance
curl -v http://localhost:80/health

# Check if the application is actually running
ps aux | grep your-app-name

# Check application logs
tail -f /var/log/your-app/app.log

Is the application even starting? Is it listening on the correct port?

Hint 3 (Security check)

Verify security group rules allow health checks:

aws ec2 describe-security-groups --group-ids sg-xxxxx --no-cli-pager

# Look for:
# - An inbound rule allowing the ALB security group on port 80
# - Or an inbound rule allowing the VPC CIDR on port 80

The ALB needs network access to perform health checks.

Hint 4 (Application check)

Check your user data script logs:

# On Amazon Linux/Ubuntu
cat /var/log/cloud-init-output.log

# Look for errors in your bootstrap script
# Did the database connection fail?
# Did dependencies install correctly?

A failing user data script means your app never starts.

Solution (Last resort)

Common causes and fixes:

1. Application takes too long to start:

health_check_grace_period = 600  # Increase to 10 minutes

2. Wrong health check path:

resource "aws_lb_target_group" "app" {
  health_check {
    path = "/health"  # Make sure this endpoint exists!
  }
}

3. Health check endpoint requires the database, and the database is unreachable:
  • Fix security group rules allowing app tier → database tier
  • Or simplify the health check so it doesn't require the database

4. Application listening on the wrong port:

# Your app
app.run(host='0.0.0.0', port=80)  # Must match the target group port

5. User data script has errors, so the app never starts:
  • Test the user data script locally first
  • Add error handling: set -e to fail fast
  • Check logs: /var/log/cloud-init-output.log

Problem: Auto-scaling not triggering when CPU is high

Hint 1

Check if your CloudWatch alarm is actually in ALARM state:

aws cloudwatch describe-alarms --alarm-names "cpu-high-alarm" --no-cli-pager

Look at StateValue. If it's OK, the threshold hasn't been crossed.

Hint 2

Check your alarm configuration:

aws cloudwatch describe-alarms --alarm-names "cpu-high-alarm" --no-cli-pager

# Verify:
# - Threshold: Is it too high? (e.g., 99% vs 70%)
# - EvaluationPeriods: Does CPU need to be high for multiple periods?
# - Period: Is it too long? (e.g., 5 minutes vs 1 minute)
# - Statistic: Average vs Maximum vs Minimum

Example: If EvaluationPeriods=3 and Period=300, CPU must stay high for 15 minutes before the alarm triggers.

Hint 3

Check if scaling is in cooldown:

aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names my-asg --no-cli-pager

# Look for recent scaling activities
aws autoscaling describe-scaling-activities --auto-scaling-group-name my-asg --max-records 5 --no-cli-pager

If a scaling action just happened, the cooldown period prevents another one (default 300 seconds).

Solution

Common fixes:

1. Alarm threshold too high:

threshold = 70  # Not 90

2. Evaluation period too long:

evaluation_periods = 2   # Not 5
period             = 60  # 1 minute, not 5

3. Cooldown preventing scaling:

resource "aws_autoscaling_policy" "scale_up" {
  cooldown = 60  # Reduce from 300
}

4. Alarm not attached to the scaling policy:

resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  # ...
  alarm_actions = [aws_autoscaling_policy.scale_up.arn]  # Must be set!
}

5. Use target tracking instead:

resource "aws_autoscaling_policy" "target_tracking" {
  policy_type = "TargetTrackingScaling"
  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 50.0
  }
}
# AWS handles everything automatically

Books That Will Help

  • “AWS for Solutions Architects” by Saurabh Shrivastava et al.: comprehensive AWS architecture patterns. Best sections: Ch. 6 (Auto Scaling Groups), Ch. 7 (RDS), Ch. 10 (High Availability).
  • “AWS Certified Solutions Architect Study Guide” by Ben Piper and David Clinton: exam-focused AWS fundamentals. Best sections: Ch. 4 (EC2), Ch. 5 (ELB), Ch. 7 (CloudWatch).
  • “Designing Data-Intensive Applications” by Martin Kleppmann: distributed systems theory that applies to auto-scaling. Best sections: Ch. 1 (Scalability), Ch. 8 (Distributed Systems).
  • “Amazon Web Services in Action” by Michael Wittig and Andreas Wittig: hands-on AWS with practical examples. Best sections: Ch. 3 (Infrastructure Automation), Ch. 6 (Scaling Up and Down).
  • “The Phoenix Project” by Gene Kim et al.: DevOps principles (why auto-scaling matters). Best sections: Part 2 (First Way: Flow), Part 3 (Second Way: Feedback).
  • “Site Reliability Engineering” by the Google SRE Team: operational practices for scaled systems. Best sections: Ch. 6 (Monitoring), Ch. 22 (Cascading Failures, i.e., why you need auto-scaling).
  • “Terraform: Up & Running” by Yevgeniy Brikman: Infrastructure as Code for AWS. Best sections: Ch. 2 (Terraform Syntax), Ch. 5 (State Management), Ch. 7 (Modules).

Reading strategy:

  1. Start with “AWS for Solutions Architects” Ch. 6-7 for AWS-specific patterns
  2. Read “Designing Data-Intensive Applications” Ch. 1 to understand why systems need to scale
  3. Use “Terraform: Up & Running” Ch. 2 as a reference while coding
  4. Read “SRE” Ch. 22 after completing the project to understand failure modes you just protected against

Project 4: Container Platform (ECS Fargate or EKS)

  • File: AWS_DEEP_DIVE_LEARNING_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, TypeScript, Terraform HCL
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: Level 3: The “Service & Support” Model
  • Difficulty: Level 3: Advanced (The Engineer)
  • Knowledge Area: Containers, Kubernetes, Orchestration
  • Software or Tool: Docker, ECS, EKS, Kubernetes
  • Main Book: “AWS for Solutions Architects” by Shrivastava et al.

What you’ll build: A containerized microservices application deployed on either ECS with Fargate (serverless containers) or EKS (managed Kubernetes), complete with service discovery, load balancing, and auto-scaling.

Why it teaches Containers on AWS: Containers are between EC2 and Lambda—more portable than EC2, more control than Lambda. ECS teaches you AWS’s native container orchestration; EKS teaches you Kubernetes. Both force you to understand task definitions, networking modes, and service mesh concepts.

Core challenges you’ll face:

  • Task Definitions (maps to container configuration): CPU/memory allocation, environment variables, port mappings, IAM task roles (see the sketch after this list)
  • Networking Modes (maps to container networking): awsvpc mode, service discovery, ALB integration
  • Service Scaling (maps to container orchestration): Target tracking, step scaling based on metrics
  • ECR Integration (maps to image management): Building, tagging, pushing container images
  • For EKS: Kubernetes Fundamentals (maps to orchestration): Pods, Deployments, Services, Ingress
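
To make the first two challenges concrete, here is a minimal Fargate task definition and service sketch in Terraform. Every referenced name (cluster, ECR image, roles, subnets, security group) is an illustrative assumption:

resource "aws_ecs_task_definition" "api" {
  family                   = "api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"  # Each task gets its own ENI in your VPC
  cpu                      = "256"     # 0.25 vCPU
  memory                   = "512"     # MiB
  execution_role_arn       = aws_iam_role.ecs_exec.arn  # Pull the image, write logs
  task_role_arn            = aws_iam_role.api_task.arn  # What the app itself may call

  container_definitions = jsonencode([{
    name         = "api"
    image        = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.2.0"
    portMappings = [{ containerPort = 8080, protocol = "tcp" }]
    environment  = [{ name = "ENV", value = "production" }]
  }])
}

resource "aws_ecs_service" "api" {
  name            = "api-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = aws_subnet.private[*].id  # Tasks live in private subnets
    security_groups = [aws_security_group.api.id]
  }
}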

Difficulty: Advanced
Time estimate: 2-4 weeks
Prerequisites: Docker basics, Projects 1-2 completed, some Kubernetes knowledge for the EKS path

Real world outcome:

  • Multiple containerized services communicating with each other
  • A working application accessible via ALB
  • CloudWatch Container Insights showing metrics
  • Ability to deploy new versions with zero downtime
  • For EKS: kubectl commands working against your cluster

Learning milestones:

  1. First milestone: Single container task running on Fargate → you understand task definitions
  2. Second milestone: Add ALB target group with service → you understand container load balancing
  3. Third milestone: Add second service with service discovery → you understand microservices communication
  4. Fourth milestone: Add auto-scaling based on CPU → you understand container elasticity
  5. Final milestone: CI/CD pipeline deploying to ECS/EKS → you understand container DevOps

Real World Outcome

When you complete this project, you’ll have tangible proof of a working container platform:

For ECS Fargate:

```bash
# View your running services
$ aws ecs list-services --cluster my-cluster --no-cli-pager
{
    "serviceArns": [
        "arn:aws:ecs:us-east-1:123456789:service/my-cluster/api-service",
        "arn:aws:ecs:us-east-1:123456789:service/my-cluster/worker-service"
    ]
}

# Check service health
$ aws ecs describe-services --cluster my-cluster --services api-service --no-cli-pager
{
    "services": [{
        "serviceName": "api-service",
        "runningCount": 2,
        "desiredCount": 2,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": ["subnet-abc123", "subnet-def456"],
                "securityGroups": ["sg-xyz789"]
            }
        }
    }]
}

# Your application responds via ALB
$ curl http://my-app-alb-1234567890.us-east-1.elb.amazonaws.com/health
{"status": "healthy", "service": "api", "task_id": "abc123def456", "version": "1.2.0"}

# Service discovery working (containers finding each other by DNS)
$ curl http://my-app-alb-1234567890.us-east-1.elb.amazonaws.com/api/worker-status
{"worker_service": "reachable", "tasks": 3, "queue_depth": 42}

# View container images in ECR
$ aws ecr describe-images --repository-name my-app --no-cli-pager
{
    "imageDetails": [
        {
            "imageDigest": "sha256:abc123...",
            "imageTags": ["latest", "v1.2.0"],
            "imagePushedAt": "2025-12-20T10:30:00+00:00"
        }
    ]
}
```

For EKS:

```bash
# Your Kubernetes cluster is accessible
$ kubectl cluster-info
Kubernetes control plane is running at https://ABC123.gr7.us-east-1.eks.amazonaws.com

# View your running workloads
$ kubectl get pods -n production
NAME                          READY   STATUS    RESTARTS   AGE
api-deployment-7d8f9c-abc12   1/1     Running   0          2h
api-deployment-7d8f9c-def34   1/1     Running   0          2h
worker-deployment-5e6f7-xyz   1/1     Running   0          1h

$ kubectl get services -n production
NAME          TYPE           CLUSTER-IP      EXTERNAL-IP                           PORT(S)
api-service   LoadBalancer   10.100.50.25    a1b2c3.us-east-1.elb.amazonaws.com   80:30001/TCP
worker-svc    ClusterIP      10.100.75.10    <none>                                8080/TCP

# Application accessible via Kubernetes LoadBalancer
$ curl http://a1b2c3.us-east-1.elb.amazonaws.com/health
{"status": "healthy", "pod": "api-deployment-7d8f9c-abc12", "node": "ip-10-0-1-50.ec2.internal"}

# Rolling deployment with zero downtime
$ kubectl set image deployment/api-deployment api=my-repo/api:v1.3.0 -n production
deployment.apps/api-deployment image updated

$ kubectl rollout status deployment/api-deployment -n production
Waiting for deployment "api-deployment" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "api-deployment" rollout to finish: 1 old replicas are pending termination...
deployment "api-deployment" successfully rolled out

Container Insights Dashboard:

  • CPU/Memory utilization per service/pod
  • Network throughput and connections
  • Task/pod startup time and failure rates
  • Container-level logs aggregated in CloudWatch

Service Discovery Working:

  • ECS: Cloud Map DNS names (api-service.local, worker-service.local), sketched in Terraform below
  • EKS: Kubernetes DNS (api-service.production.svc.cluster.local)
  • Containers automatically discover each other without hardcoded IPs
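
A minimal Terraform sketch of the Cloud Map side of ECS service discovery, assuming an existing VPC (aws_vpc.main); the namespace and service names are illustrative.

```hcl
# Private DNS namespace: gives tasks names like api-service.local
resource "aws_service_discovery_private_dns_namespace" "local" {
  name = "local"
  vpc  = aws_vpc.main.id # assumed to exist
}

resource "aws_service_discovery_service" "api" {
  name = "api-service"

  dns_config {
    namespace_id = aws_service_discovery_private_dns_namespace.local.id
    dns_records {
      ttl  = 10
      type = "A" # resolves to each task ENI's private IP (awsvpc mode)
    }
  }
}

# On the ECS service, add this block so tasks register/deregister automatically:
#   service_registries {
#     registry_arn = aws_service_discovery_service.api.arn
#   }
```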

Rolling Deployments:

  • Deploy new version without downtime
  • Watch old tasks drain and new tasks become healthy
  • Automatic rollback if health checks fail

The Core Question You’re Answering

“When should I use containers vs Lambda vs EC2, and what’s the difference between ECS and EKS?”

This is THE fundamental architectural decision on AWS:

  • Lambda: Event-driven, sub-15-minute executions, no state, extreme auto-scaling. Use when you have unpredictable spiky traffic and stateless operations.
  • Containers (ECS/EKS): Long-running processes, stateful applications, need specific runtime control, want portability. Use when you need more than 15 minutes, WebSocket connections, background workers, or existing Dockerized apps.
  • EC2: Full OS control, legacy apps, specialized hardware needs, licensed software. Use when containers/Lambda don’t fit.

ECS vs EKS:

  • ECS: AWS-native, simpler, less operational overhead, great for teams new to containers. Task definitions are AWS-specific JSON.
  • EKS: Standard Kubernetes, portable across clouds, richer ecosystem, more complex. Use when you need Kubernetes features or multi-cloud portability.

Fargate vs EC2 Launch Type:

  • Fargate: Serverless containers—you define task CPU/memory, AWS manages infrastructure. Higher per-task cost, zero operational overhead.
  • EC2 Launch Type: You manage the EC2 instances in the cluster. Lower per-task cost if you have steady baseline load, more control, more ops work.

Concepts You Must Understand First

  1. Container Fundamentals:
    • Docker images are layered filesystems (each Dockerfile instruction = layer)
    • Container registries store images (ECR, Docker Hub)
    • Containers are isolated processes sharing a kernel (not VMs)
    • Port mapping: container internal port → host port
    • Environment variables and secrets injection
  2. Container Orchestration:
    • Scheduling: Placing containers on available compute resources based on CPU/memory requirements
    • Service Discovery: How containers find each other (DNS-based)
    • Load Balancing: Distributing traffic across container replicas
    • Health Checks: Determining if a container is ready to receive traffic
    • Auto-Scaling: Adjusting container count based on metrics
  3. ECS Concepts:
    • Task Definition: Blueprint for your container (image, CPU, memory, ports, environment)
    • Task: Running instance of a task definition (one or more containers running together)
    • Service: Maintains desired count of tasks, integrates with ALB, handles deployments
    • Cluster: Logical grouping of tasks/services
    • Task Role: IAM permissions for your application (what the container can do)
    • Execution Role: IAM permissions for the ECS agent (pulling ECR images, writing logs); see the IAM sketch after this list
  4. Kubernetes Concepts (for EKS):
    • Pod: Smallest unit (one or more containers, shared network/storage)
    • Deployment: Declarative pod management (replicas, rolling updates)
    • Service: Stable endpoint for pods (ClusterIP, LoadBalancer, NodePort)
    • Ingress: HTTP routing rules (maps URLs to services)
    • Namespace: Logical cluster subdivision
    • ConfigMap/Secret: Configuration and sensitive data injection
  5. Fargate vs EC2 Launch Types:
    • Fargate: Specify vCPU/memory, AWS provisions infrastructure, pay per task resource/time
    • EC2: You provision EC2 instances, ECS schedules tasks on them, pay for instances (can be cheaper at scale)
    • Trade-off: Fargate = simplicity, EC2 = control + potential cost savings with Reserved Instances
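
The task-role/execution-role split trips up almost everyone, so here is a Terraform sketch of both roles. Role names are illustrative; only the execution role's AWS-managed policy is attached, and the task role would get your own least-privilege policies.

```hcl
# Both roles are assumable by the ECS tasks service principal
data "aws_iam_policy_document" "ecs_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Execution role: what ECS itself needs (pull ECR images, write CloudWatch logs)
resource "aws_iam_role" "execution" {
  name               = "ecs-execution-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_assume.json
}

resource "aws_iam_role_policy_attachment" "execution" {
  role       = aws_iam_role.execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Task role: what YOUR application code needs (S3 reads, DynamoDB queries, ...)
# Attach only least-privilege policies here.
resource "aws_iam_role" "task" {
  name               = "ecs-task-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_assume.json
}
```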

Questions to Guide Your Design

Architecture Decisions:

  • When would you choose Fargate over EC2 launch type? (Hint: variable workload vs steady baseline, ops overhead tolerance)
  • Should you use ECS or EKS? (Hint: team Kubernetes experience, multi-cloud needs, ecosystem requirements)
  • How many containers should run in a single task/pod? (Hint: tightly coupled = same task, independent = separate tasks)

Networking:

  • How do containers in the same task communicate? (Hint: localhost, shared network namespace)
  • How do containers in different tasks communicate? (Hint: service discovery via DNS)
  • What’s the difference between ECS service discovery (Cloud Map) and an ALB? (Hint: internal service-to-service vs external clients)
  • Which networking mode should you use (awsvpc, bridge, host)? (Hint: Fargate requires awsvpc)

Security:

  • How do you handle secrets in containerized applications? (Hint: Secrets Manager/Parameter Store + task definition secrets, NOT environment variables in plaintext; sketched after this list)
  • What’s the difference between task role and execution role? (Hint: execution = ECS needs it, task = your app needs it)
  • How do you restrict which services can talk to each other? (Hint: security groups with awsvpc mode)
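
A sketch of the secrets pattern in Terraform; the secret name, env var name, and image URI are placeholders. ECS fetches the value at task startup using the execution role (which needs secretsmanager:GetSecretValue on this ARN) and injects it as an environment variable.

```hcl
resource "aws_secretsmanager_secret" "db_password" {
  name = "prod/api/db-password" # placeholder name
}

# Container definition fragment (goes inside jsonencode([...]) in the
# task definition): "secrets" injects the value at startup, so the
# plaintext never appears in the task definition or Terraform state output.
locals {
  api_container = {
    name  = "api"
    image = "123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.2.0" # placeholder
    secrets = [{
      name      = "DB_PASSWORD" # env var name the app sees
      valueFrom = aws_secretsmanager_secret.db_password.arn
    }]
  }
}
```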

Scaling & Performance:

  • Should you scale based on CPU, memory, or request count? (Hint: depends on bottleneck—CPU-bound vs I/O-bound workloads; see the scaling sketch after this list)
  • How do you handle database connection pooling with auto-scaling containers? (Hint: RDS Proxy or application-level pooling)
  • What happens during a deployment? (Hint: rolling update drains old tasks while starting new ones)
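
As one concrete option, here is a Terraform sketch of target-tracking scaling on ALB requests per target, assuming an existing ALB (aws_lb.main) and target group (aws_lb_target_group.api); the cluster/service names and target value are illustrative.

```hcl
resource "aws_appautoscaling_target" "api" {
  service_namespace  = "ecs"
  resource_id        = "service/my-cluster/api-service" # cluster/service names assumed
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 10
}

resource "aws_appautoscaling_policy" "requests" {
  name               = "scale-on-requests"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.api.service_namespace
  resource_id        = aws_appautoscaling_target.api.resource_id
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      # Ties the metric to your ALB + target group (both assumed to exist)
      resource_label = "${aws_lb.main.arn_suffix}/${aws_lb_target_group.api.arn_suffix}"
    }
    target_value = 1000 # average requests per target before scaling out
  }
}
```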

Operational:

  • How do you get logs from containers? (Hint: awslogs driver → CloudWatch)
  • How do you debug a failing container startup? (Hint: CloudWatch logs, check execution role for ECR pull permissions)
  • How do you do zero-downtime deployments? (Hint: ALB health checks + rolling update strategy)

Thinking Exercise

Map Kubernetes concepts to ECS equivalents:

| Kubernetes | ECS Equivalent | Notes |
| --- | --- | --- |
| Pod | Task | Both are one or more containers with shared resources |
| Deployment | Service | Both maintain desired count and handle updates |
| Service (ClusterIP) | Service Discovery (Cloud Map) | Internal DNS-based discovery |
| Service (LoadBalancer) | ALB Target Group | External load balancing |
| Container in Pod spec | Container Definition in Task Definition | Both define image, ports, env vars |
| ConfigMap | SSM Parameter Store / Secrets Manager | Both inject configuration |
| Secret | Secrets Manager | Both handle sensitive data |
| Namespace | Cluster (loosely) | Logical separation (but ECS clusters are less strict) |
| Ingress | ALB Listener Rules | HTTP routing rules |
| HorizontalPodAutoscaler | Service Auto Scaling | Both scale based on metrics |

Key differences:

  • Kubernetes is more declarative (desired state via YAML)
  • ECS is more imperative (API calls to create/update services)
  • Kubernetes has richer networking (network policies, service mesh)
  • ECS is simpler for AWS-only deployments

Design exercise: If you have a microservices app with 5 services, should they all go in one task definition or separate services?

  • Answer: Separate ECS services (or Kubernetes deployments). Each microservice should scale independently.
  • Exception: If two containers are tightly coupled (app + sidecar proxy), same task/pod makes sense.

The Interview Questions They’ll Ask

Basic Level:

  1. Q: What’s the difference between a Docker image and a container?
    • A: An image is a read-only template (layers of filesystem changes). A container is a running instance of an image with a writable layer on top. Analogy: image = class, container = object instance.
  2. Q: Explain ECS task vs service.
    • A: A task is a running instantiation of a task definition (one or more containers). A service maintains a desired count of tasks, integrates with load balancers, and handles deployments. Tasks are ephemeral; services ensure they keep running.
  3. Q: What is Fargate?
    • A: Serverless compute for containers. You specify CPU/memory in your task definition, AWS provisions and manages the underlying infrastructure. No EC2 instances to manage.
  4. Q: How do containers in the same ECS task communicate?
    • A: Via localhost. Containers in the same task share a network namespace, so they can reach each other on 127.0.0.1 using different ports.

Intermediate Level:

  1. Q: When would you use ECS over Lambda?
    • A: When you need: (1) longer than 15-minute execution, (2) WebSocket/long-lived connections, (3) stateful processing, (4) specific runtime dependencies not available in Lambda layers, (5) existing Dockerized applications.
  2. Q: Explain the difference between task role and execution role in ECS.
    • A: Execution role: permissions ECS needs to set up your task (pull ECR images, write CloudWatch logs). Task role: permissions your application code needs (read S3, query DynamoDB). Never confuse these—execution role is infrastructure, task role is application.
  3. Q: How does ECS service discovery work?
    • A: Uses AWS Cloud Map to create DNS records for tasks. When you enable service discovery, each task gets a DNS entry (e.g., api-service.local). Other services query this DNS name and get IPs of healthy tasks. Updates automatically as tasks start/stop.
  4. Q: How do you implement zero-downtime deployments in ECS?
    • A: Use rolling update deployment type with ALB. Configure minimum healthy percent (e.g., 100%) and maximum percent (e.g., 200%). ECS starts new tasks, waits for ALB health checks to pass, then drains and stops old tasks. If health checks fail, the deployment rolls back. See the Terraform sketch below.
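
A Terraform sketch of the service settings behind that answer; the percentages are one common choice, and the referenced cluster/task definition are assumed to exist.

```hcl
resource "aws_ecs_service" "api" {
  name            = "api-service"
  cluster         = aws_ecs_cluster.main.id         # assumed
  task_definition = aws_ecs_task_definition.api.arn # assumed
  desired_count   = 2
  launch_type     = "FARGATE"

  deployment_minimum_healthy_percent = 100 # never drop below desired_count
  deployment_maximum_percent         = 200 # allow up to 2x during rollout

  deployment_circuit_breaker {
    enable   = true
    rollback = true # auto-rollback when new tasks keep failing health checks
  }
  # network_configuration and load_balancer blocks omitted for brevity
}
```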

Advanced Level:

  1. Q: You have an ECS service that keeps failing health checks and restarting. How do you debug?
    • A: (1) Check CloudWatch logs for application errors. (2) Verify the ALB target group’s health check path matches what the application serves. (3) Check security groups allow ALB → task traffic. (4) Verify the task role has the permissions the app needs. (5) Compare container startup time with the health check interval (you may need a health check grace period). (6) Exec into a running task to test manually: aws ecs execute-command --cluster X --task Y --container Z --interactive --command "/bin/sh" (requires ECS Exec enabled on the service).
  2. Q: How would you handle database connection pooling with auto-scaling ECS services?
    • A: Options: (1) Use RDS Proxy (connection pooling/multiplexing at AWS layer). (2) Application-level pooling with conservative pool size per container (max_connections / expected_task_count). (3) Use connection pool libraries that handle connection reuse. Problem: Each task creates its own pool, so 10 tasks with 10 connections each = 100 DB connections. RDS Proxy solves this.
  3. Q: Compare ECS awsvpc networking mode with bridge mode.
    • A: awsvpc: Each task gets its own ENI with private IP from VPC subnet. Security groups apply at task level. Required for Fargate. Better isolation. Bridge: Containers share host’s network via Docker bridge. Port mapping required (host port != container port). Multiple tasks on same host need different host ports. awsvpc is recommended for new deployments.
  4. Q: When would you choose EKS over ECS?
    • A: When you need: (1) Kubernetes-specific features (CRDs, operators, Helm charts), (2) multi-cloud portability (same K8s manifests work on GKE/AKS), (3) existing Kubernetes expertise on team, (4) vendor-neutral orchestration. ECS is simpler and AWS-native; EKS is more complex but standard.
  5. Q: How do you handle secrets in containerized apps on AWS?
    • A: Store secrets in AWS Secrets Manager or SSM Parameter Store (SecureString). Reference them in task definition secrets field (NOT environment variables). ECS retrieves secrets at task startup using execution role permissions, injects them as environment variables into container. Secrets are encrypted at rest and in transit. Never hardcode secrets in Dockerfile or pass as plaintext env vars.
  6. Q: Explain a scenario where you’d use the EC2 launch type instead of Fargate.
    • A: When you have: (1) steady baseline load (Reserved Instances cheaper than Fargate per-task pricing), (2) need for specific EC2 instance types (GPU, high memory), (3) tasks require privileged mode or host networking, (4) need to run your own AMI with custom configs. Trade-off: lower cost and more control, but you manage cluster capacity and OS patching.
  7. Q: Your containerized application experiences cold start delays. How do you optimize?
    • A: (1) Reduce image size (multi-stage builds, minimal base images like Alpine). (2) Optimize Dockerfile layer caching (COPY dependencies before code). (3) Pre-warm containers if predictable traffic spikes. (4) Use smaller task CPU/memory if over-provisioned (faster scheduling). (5) For EKS: Use cluster autoscaler with appropriate scaling configs. (6) Consider keeping minimum task count > 0 to avoid cold starts entirely.

Hints in Layers

Level 1: Getting Started

  • Start with a single-container task definition running nginx or a simple Python/Node.js app
  • Use Fargate to avoid EC2 cluster management
  • Deploy to public subnet first (simpler than private with NAT)
  • Use AWS Console to create your first task definition—you’ll see all the options
  • Put “latest” tag on your first image (iterate fast)

Level 2: Adding Realism

  • Move to private subnets with NAT gateway (production pattern)
  • Add ALB in front of your service for stable endpoint
  • Create second container that calls the first (understand inter-service communication)
  • Use ECR instead of Docker Hub (AWS-native, faster pulls)
  • Start using specific version tags (v1.0.0, not “latest”)

Level 3: Production Patterns

  • Enable service discovery (Cloud Map) for service-to-service DNS
  • Configure auto-scaling based on ALB request count or custom CloudWatch metrics
  • Set up proper health checks (readiness vs liveness)
  • Add Container Insights for metrics
  • Implement rolling deployments with deployment circuit breaker (auto-rollback on failure)

Level 4: Advanced Scenarios

  • Multi-container task with sidecar pattern (app + logging agent)
  • Task role granting only necessary permissions (least privilege)
  • Secrets injection from Secrets Manager (no plaintext env vars)
  • Blue/green deployments using CodeDeploy
  • For EKS: Implement Horizontal Pod Autoscaler and Cluster Autoscaler together

Level 5: Mastery

  • CI/CD pipeline: GitHub Actions → build image → push to ECR → update ECS service
  • Canary deployments (send 10% traffic to new version, monitor, then 100%)
  • Service mesh (App Mesh for ECS or Istio for EKS) for advanced routing/observability
  • Cross-region replication for DR (replicate ECR images, deploy to multiple regions)
  • Custom metrics from containers to CloudWatch for scaling (e.g., queue depth)

Books That Will Help

| Book Title | Author(s) | Relevant Chapters | Why It Helps |
| --- | --- | --- | --- |
| AWS for Solutions Architects | Saurabh Shrivastava | Ch. 9: Containers on AWS | Best coverage of ECS architecture patterns, Fargate vs EC2 decisions |
| Docker Deep Dive | Nigel Poulton | Ch. 3-5, 8-9 | Container fundamentals, image layers, networking modes |
| Kubernetes in Action | Marko Luksa | Ch. 3-7 | Core K8s concepts (pods, services, deployments) for EKS path |
| Kubernetes Up & Running | Kelsey Hightower et al. | Ch. 5-6, 9-10 | Practical K8s patterns, service discovery, load balancing |
| Amazon Web Services in Action | Andreas Wittig & Michael Wittig | Ch. 14: Containers | Step-by-step ECS tutorial with CloudFormation examples |
| The DevOps Handbook | Gene Kim et al. | Part IV: Technical Practices | CI/CD for containers, deployment strategies (blue/green, canary) |
| Site Reliability Engineering | Google | Ch. 7, 21 | Load balancing, monitoring containerized systems at scale |
| Production Kubernetes | Josh Rosso et al. | Ch. 4-6, 11 | Production-grade EKS: networking, security, observability |
| Container Security | Liz Rice | Ch. 2-4, 7 | Securing container images, runtime, orchestrator (critical for prod) |
| Designing Data-Intensive Applications | Martin Kleppmann | Ch. 11: Stream Processing | Understanding when to use containers for stateful vs stateless workloads |

Quick Reference:

  • New to containers? Start with “Docker Deep Dive” Ch. 3-5
  • Choosing ECS? Focus on “AWS for Solutions Architects” Ch. 9
  • Choosing EKS? Read “Kubernetes Up & Running” Ch. 5-6 first
  • Ready for production? “Container Security” is mandatory
  • Need CI/CD? “The DevOps Handbook” Part IV

Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
| --- | --- | --- | --- | --- |
| VPC from Scratch | Intermediate | 1-2 weeks | ⭐⭐⭐⭐⭐ (Foundational) | ⭐⭐⭐ |
| Serverless Pipeline | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ (Event-driven) | ⭐⭐⭐⭐ |
| Auto-Scaling Web App | Intermediate | 2-3 weeks | ⭐⭐⭐⭐⭐ (Classic AWS) | ⭐⭐⭐ |
| Container Platform | Advanced | 2-4 weeks | ⭐⭐⭐⭐⭐ (Modern AWS) | ⭐⭐⭐⭐⭐ |

Recommendation

Start with Project 1 (VPC from Scratch). Everything else depends on understanding networking. A misconfigured VPC will break Lambda VPC access, ECS service discovery, RDS connectivity—everything. Once you can confidently explain why your private subnet routes to a NAT gateway and your public subnet routes to an IGW, move on.

Then do Project 2 (Serverless Pipeline) to understand event-driven architecture—this is where modern AWS shines. Step Functions + Lambda is incredibly powerful once you “get it.”

Then Project 3 (Auto-Scaling Web App) for the “traditional” AWS architecture that many existing systems use. This teaches you fundamentals that apply everywhere.

Finally Project 4 (Containers) brings it all together—you need VPC knowledge, you can integrate with Lambda/Step Functions, and you’ll build on auto-scaling concepts.


Final Overall Project: Full-Stack SaaS Platform

What you’ll build: A complete SaaS application with:

  • Multi-tenant architecture with isolated VPCs per environment (dev/staging/prod)
  • API Gateway + Lambda for REST/GraphQL endpoints
  • ECS Fargate running background workers
  • Step Functions orchestrating complex business workflows
  • RDS Aurora for relational data
  • DynamoDB for high-speed session/cache data
  • S3 + CloudFront for static assets and file uploads
  • Cognito for authentication
  • CI/CD with CodePipeline or GitHub Actions
  • Infrastructure defined entirely in Terraform
  • CloudWatch dashboards, X-Ray tracing, and alarms

Why this is the ultimate AWS learning project: This forces you to make real architectural decisions—when to use Lambda vs ECS, how to structure your VPCs for multi-environment deployments, how services talk to each other across boundaries. You’ll hit every AWS gotcha: Lambda cold starts affecting user experience, ECS task role permissions, RDS connection pooling, S3 CORS issues, CloudFront cache invalidation timing.

Core challenges you’ll face:

  • Multi-Environment Architecture: Terraform workspaces or separate state files, environment-specific configurations (see the workspace sketch after this list)
  • API Design: REST vs GraphQL, API Gateway vs ALB, authentication flows
  • Data Architecture: When to use RDS vs DynamoDB, cross-service data access patterns
  • Async Processing: SQS queues, dead-letter queues, exactly-once processing
  • Security: IAM policies, secrets management (Secrets Manager/Parameter Store), encryption at rest and in transit
  • Observability: Distributed tracing, centralized logging, meaningful metrics
  • Cost Optimization: Right-sizing, Reserved Instances vs Savings Plans, S3 lifecycle policies
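
A minimal sketch of the workspace approach to multi-environment configuration; the workspace names and CIDRs are illustrative, and separate state files per environment are an equally valid alternative.

```hcl
locals {
  env = terraform.workspace # "dev", "staging", or "prod"

  settings = {
    dev     = { cidr = "10.10.0.0/16" }
    staging = { cidr = "10.20.0.0/16" }
    prod    = { cidr = "10.30.0.0/16" }
  }
}

# One VPC definition, parameterized per environment
resource "aws_vpc" "main" {
  cidr_block = local.settings[local.env].cidr
  tags       = { Environment = local.env }
}
```

Switch environments with `terraform workspace select prod` (or `terraform workspace new dev` the first time) before running `terraform apply`.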

Key Concepts:

  • Well-Architected Framework: AWS Well-Architected - AWS Official
  • SaaS Architecture: “AWS for Solutions Architects” Ch. 10-12 - Shrivastava et al.
  • Infrastructure as Code at Scale: Terraform Best Practices - HashiCorp
  • Serverless Patterns: “Designing Data-Intensive Applications” Ch. 11-12 - Martin Kleppmann (for async patterns)

Difficulty: Advanced
Time estimate: 1-2 months
Prerequisites: Projects 1-4 completed

Real world outcome:

  • A working SaaS application you can demo to employers
  • User registration, login, and authenticated API calls
  • Background job processing visible in Step Functions console
  • Multi-environment deployment from a single codebase
  • Cost monitoring dashboard showing your AWS spend
  • A portfolio piece that demonstrates comprehensive AWS knowledge

Learning milestones:

  1. First milestone: Auth working (Cognito + API Gateway) → you understand identity on AWS
  2. Second milestone: Core API + database working → you understand data tier patterns
  3. Third milestone: Background processing working → you understand async architecture
  4. Fourth milestone: Multi-environment deployment → you understand infrastructure management
  5. Final milestone: Observability + alerting working → you understand production operations

Key Learning Resources Summary

| Resource | Best For |
| --- | --- |
| “AWS for Solutions Architects” - Shrivastava et al. | Comprehensive coverage of all services |
| AWS DevOps Zero to Hero | Free hands-on projects |
| HashiCorp Terraform Tutorials | Infrastructure as Code |
| AWS Official Tutorials | Service-specific guides |
| “Designing Data-Intensive Applications” - Kleppmann | Distributed systems concepts |
