Deep Understanding of AWS Through Building
Goal: Master AWS cloud infrastructure by building real systems that break, fail, and teach you why networking, security, and scalability patterns exist. You’ll go from clicking through the console to understanding why a private subnet needs a NAT Gateway, how IAM policy evaluation actually works, and when to choose Lambda over ECS over EC2.
Why AWS Mastery Requires Building
You’re tackling a massive ecosystem with over 200 services across compute, storage, networking, databases, machine learning, and more. As of 2025, AWS dominates the cloud infrastructure market with a 30% market share (compared to Azure’s 20% and Google Cloud’s 13%), serving 4.19 million customers worldwide—a 357% increase since 2020.
The market reality: Cloud infrastructure spending hit $102.6 billion in Q3 2025 alone, with AWS earning $29 billion from cloud services in Q1 2025. The cloud market is projected to exceed $400 billion annually for the first time in 2025. This means AWS expertise is not just valuable—it’s increasingly critical as organizations migrate workloads to the cloud.
But here’s the challenge: AWS is not something you can understand by reading documentation—you need to build systems where misconfigured security groups break things, where wrong subnet routing causes timeouts, and where Lambda cold starts ruin your latency (typically 200-400ms for Python/Node.js, but can reach 1-3 seconds for Java/.NET). That’s when the concepts become real.
The AWS Learning Problem
Most developers learn AWS backwards:
- Click through console tutorials
- Copy-paste CloudFormation templates
- Wonder why things break in production
- Panic when asked “why did you choose this architecture?”
The right approach:
- Understand the problem each service solves
- Build it wrong first (feel the pain)
- Fix it using IaC (understand every line)
- Break it intentionally (learn failure modes)
- Explain your architecture to others
The AWS Shared Responsibility Model
Before building anything, understand who is responsible for what:
┌─────────────────────────────────────────────────────────────────────────────┐
│ YOUR RESPONSIBILITY │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Customer Data │ │
│ ├───────────────────────────────────────────────────────────────────────┤ │
│ │ Platform, Applications, Identity & Access Management │ │
│ ├───────────────────────────────────────────────────────────────────────┤ │
│ │ Operating System, Network & Firewall Configuration │ │
│ ├───────────────────────────────────────────────────────────────────────┤ │
│ │ Client-Side Data Encryption │ Server-Side Encryption │ Network Traffic│ │
│ │ & Data Integrity Auth │ (File System/Data) │ Protection │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────────────┤
│ AWS RESPONSIBILITY │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Compute │ Storage │ Database │ Networking │ │
│ ├───────────────────────────────────────────────────────────────────────┤ │
│ │ Hardware/AWS Global Infrastructure │ │
│ │ (Regions, Availability Zones, Edge Locations) │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Key insight: AWS manages the physical infrastructure. YOU manage everything from the OS up. If your EC2 instance gets hacked because of a weak password, that’s on you. If an AWS data center floods, that’s on them.
Understanding AWS Architecture: The Mental Model
The Three Pillars of AWS
Every AWS architecture decision comes down to balancing three concerns:
┌─────────────────┐
│ │
│ RELIABILITY │
│ (Multi-AZ, │
│ Redundancy) │
│ │
└────────┬────────┘
│
┌──────────────┴──────────────┐
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ │ │ │
│ COST │◄─────────►│ PERFORMANCE │
│ (Right-sizing, │ │ (Low latency, │
│ Reserved, │ │ High throughput│
│ Spot) │ │ Scaling) │
│ │ │ │
└─────────────────┘ └─────────────────┘
The AWS Well-Architected Trade-off Triangle
Every architecture decision involves trade-offs:
- Multi-AZ RDS is more reliable but costs 2x
- Spot instances are 70% cheaper but can be terminated anytime
- Lambda has zero idle cost but cold starts add latency
AWS Global Infrastructure
Understanding the hierarchy is critical:
┌─────────────────────────────────────────────────────────────────────────────┐
│ AWS GLOBAL │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ Region: us-east-1 (N. Virginia) ││
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ ││
│ │ │ AZ 1 │ │ AZ 2 │ │ AZ 3 │ ... ││
│ │ │ us-east-1a │ │ us-east-1b │ │ us-east-1c │ ││
│ │ │ │ │ │ │ │ ││
│ │ │ ┌────────────┐ │ │ ┌────────────┐ │ │ ┌────────────┐ │ ││
│ │ │ │ Data Center│ │ │ │ Data Center│ │ │ │ Data Center│ │ ││
│ │ │ │ │ │ │ │ │ │ │ │ │ │ ││
│ │ │ │ • EC2 │ │ │ │ • EC2 │ │ │ │ • EC2 │ │ ││
│ │ │ │ • RDS │ │ │ │ • RDS │ │ │ │ • RDS │ │ ││
│ │ │ │ • EBS │ │ │ │ • EBS │ │ │ │ • EBS │ │ ││
│ │ │ └────────────┘ │ │ └────────────┘ │ │ └────────────┘ │ ││
│ │ └────────────────┘ └────────────────┘ └────────────────┘ ││
│ │ ││
│ │ ← Low latency connections between AZs (< 2ms) → ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ Region: eu-west-1 (Ireland) ││
│ │ ...similar AZ structure... ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │
│ Edge Locations (CloudFront): 400+ worldwide for content delivery │
└─────────────────────────────────────────────────────────────────────────────┘
Key insight for projects:
- Region: Contains all your resources, data residency compliance
- AZ (Availability Zone): Isolated data centers, failure boundary
- Multi-AZ: Deploy across 2+ AZs for high availability (if one AZ fails, others continue)
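Quick check: you can see this hierarchy from your own account with the AWS CLI. A minimal sketch (region and output fields are illustrative; note that AZ names like us-east-1a map to different physical zones per account, which is why ZoneId is included):
# List the AZs your account sees in a region
aws ec2 describe-availability-zones \
  --region us-east-1 \
  --query 'AvailabilityZones[].{Name:ZoneName,Id:ZoneId,State:State}' \
  --output table
# List all regions enabled for your account
aws ec2 describe-regions --query 'Regions[].RegionName' --output table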
VPC Networking: The Foundation of Everything
Every AWS resource you deploy lives in a network. Understanding VPC is non-negotiable.
What is a VPC?
A Virtual Private Cloud (VPC) is your isolated network within AWS. Think of it as your own private data center in the cloud, with complete control over:
- IP address ranges (CIDR blocks)
- Subnets (network segments)
- Route tables (traffic rules)
- Gateways (internet access)
- Security (firewalls)
The Anatomy of a Production VPC
┌─────────────────────────────────────────────────────────────────────────────────┐
│ VPC: 10.0.0.0/16 (65,536 IP addresses) │
│ │
│ ┌─────────────────────────────────────┐ ┌─────────────────────────────────────┐│
│ │ Availability Zone A (us-east-1a) │ │ Availability Zone B (us-east-1b) ││
│ │ │ │ ││
│ │ ┌─────────────────────────────┐ │ │ ┌─────────────────────────────┐ ││
│ │ │ PUBLIC SUBNET: 10.0.1.0/24 │ │ │ │ PUBLIC SUBNET: 10.0.2.0/24 │ ││
│ │ │ (256 IPs) │ │ │ │ (256 IPs) │ ││
│ │ │ │ │ │ │ │ ││
│ │ │ ┌───────┐ ┌───────┐ │ │ │ │ ┌───────┐ ┌───────┐ │ ││
│ │ │ │Bastion│ │ NAT │ │ │ │ │ │Bastion│ │ NAT │ │ ││
│ │ │ │ Host │ │Gateway│ │ │ │ │ │ (HA) │ │Gateway│ │ ││
│ │ │ └───────┘ └───┬───┘ │ │ │ │ └───────┘ └───┬───┘ │ ││
│ │ └───────────────────┼────────┘ │ │ └───────────────────┼────────┘ ││
│ │ │ │ │ │ ││
│ │ ┌───────────────────┼────────┐ │ │ ┌───────────────────┼────────┐ ││
│ │ │ PRIVATE SUBNET: │ │ │ │ │ PRIVATE SUBNET: │ │ ││
│ │ │ 10.0.10.0/24 │ │ │ │ │ 10.0.20.0/24 │ │ ││
│ │ │ (App Tier) ▼ │ │ │ │ (App Tier) ▼ │ ││
│ │ │ │ │ │ │ │ ││
│ │ │ ┌───────┐ ┌───────┐ │ │ │ │ ┌───────┐ ┌───────┐ │ ││
│ │ │ │ EC2 │ │ EC2 │ │ │ │ │ │ EC2 │ │ EC2 │ │ ││
│ │ │ │ (App) │ │ (App) │ │ │ │ │ │ (App) │ │ (App) │ │ ││
│ │ │ └───────┘ └───────┘ │ │ │ │ └───────┘ └───────┘ │ ││
│ │ └────────────────────────────┘ │ │ └────────────────────────────┘ ││
│ │ │ │ ││
│ │ ┌────────────────────────────┐ │ │ ┌────────────────────────────┐ ││
│ │ │ DATA SUBNET: 10.0.100.0/24 │ │ │ │ DATA SUBNET: 10.0.200.0/24 │ ││
│ │ │ (Database Tier) │ │ │ │ (Database Tier) │ ││
│ │ │ │ │ │ │ │ ││
│ │ │ ┌────────────────────┐ │ │ │ │ ┌────────────────────┐ │ ││
│ │ │ │ RDS Primary │───┼───┼─┼──│ │ RDS Standby │ │ ││
│ │ │ │ (PostgreSQL) │ │ │ │ │ │ (Sync Replica) │ │ ││
│ │ │ └────────────────────┘ │ │ │ │ └────────────────────┘ │ ││
│ │ └────────────────────────────┘ │ │ └────────────────────────────┘ ││
│ └─────────────────────────────────────┘ └─────────────────────────────────────┘│
│ │
│ ┌─────────────────┐ │
│ │ Internet Gateway│ ◄──── Allows public subnets to reach internet │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────┐ │
│ │ Internet│ │
│ └─────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
Understanding CIDR Notation
CIDR (Classless Inter-Domain Routing) defines IP address ranges:
IP Address: 10.0.0.0
CIDR: /16
Binary breakdown:
10.0.0.0/16 means:
├── First 16 bits are FIXED (network portion)
│ 10 . 0 (00001010.00000000)
│
└── Last 16 bits are FREE (host portion)
x.x (00000000.00000000 to 11111111.11111111)
Result: 10.0.0.0 to 10.0.255.255 = 65,536 addresses
Common CIDR blocks for VPC design:
┌─────────┬─────────────────┬──────────────────────────────┐
│ CIDR │ # of IPs │ Use Case │
├─────────┼─────────────────┼──────────────────────────────┤
│ /16 │ 65,536 │ Entire VPC │
│ /20 │ 4,096 │ Large subnet (production) │
│ /24 │ 256 │ Standard subnet │
│ /28 │ 16 │ Small subnet (bastion hosts) │
└─────────┴─────────────────┴──────────────────────────────┘
⚠️ AWS reserves 5 IPs per subnet:
.0 Network address
.1 VPC router
.2 DNS server
.3 Reserved for future use
.255 Broadcast (not supported, but reserved)
So a /24 subnet (256 IPs) actually has 251 usable IPs.
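If the subnet math feels abstract, Python's standard-library ipaddress module makes it concrete. A small sketch (the CIDR values are just examples):
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnet = ipaddress.ip_network("10.0.1.0/24")

print(vpc.num_addresses)          # 65536 addresses in the VPC
print(subnet.num_addresses)       # 256 addresses in the subnet
print(subnet.num_addresses - 5)   # 251 usable after AWS's 5 reserved IPs

# Carve the /16 into /24s (first four shown)
for s in list(vpc.subnets(new_prefix=24))[:4]:
    print(s)                      # 10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24

# Overlap checks matter before VPC peering
print(vpc.overlaps(ipaddress.ip_network("10.0.200.0/24")))  # True -> peering would clash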
Public vs Private Subnets: The Critical Difference
PUBLIC SUBNET PRIVATE SUBNET
───────────── ──────────────
Route Table: Route Table:
┌────────────────────────────┐ ┌────────────────────────────┐
│ Destination │ Target │ │ Destination │ Target │
├───────────────┼────────────┤ ├───────────────┼────────────┤
│ 10.0.0.0/16 │ local │ │ 10.0.0.0/16 │ local │
│ 0.0.0.0/0 │ igw-xxxxx │◄──IGW │ 0.0.0.0/0 │ nat-xxxxx │◄──NAT
└───────────────┴────────────┘ └───────────────┴────────────┘
Key differences:
┌─────────────────┬──────────────────────┬──────────────────────┐
│ │ Public Subnet │ Private Subnet │
├─────────────────┼──────────────────────┼──────────────────────┤
│ Internet access │ Via Internet Gateway │ Via NAT Gateway │
│ Inbound traffic │ Can receive from web │ Cannot receive │
│ Public IP │ Can have Elastic IP │ No public IP │
│ Use case │ Load balancers, │ App servers, │
│ │ bastion hosts │ databases │
└─────────────────┴──────────────────────┴──────────────────────┘
Security Groups vs NACLs: Two Layers of Defense
┌─────────────────────────────────────────────────────────────────────────────┐
│ NACL (Subnet Level) │
│ - Stateless │
│ - Explicit allow/deny │
│ - Rule numbers (processed in order) │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Security Group (Instance Level) │ │
│ │ - Stateful (return traffic auto-allowed) │ │
│ │ - Allow rules only (implicit deny) │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ EC2 Instance │ │ │
│ │ │ │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Traffic flow example (HTTP request to web server):
1. Request arrives at VPC
2. NACL checks inbound rules (allow port 80?) → Yes? Continue
3. Security Group checks inbound rules → Yes? Continue
4. Request reaches EC2 instance
5. Response leaves EC2 instance
6. Security Group: AUTOMATICALLY allows response → Stateful!
7. NACL checks outbound rules (allow response?) → Must be explicit
8. NACL checks ephemeral port range → Rule needed!
Security Group (stateful): NACL (stateless):
Inbound: Allow TCP 80 Inbound: Allow TCP 80
Outbound: (not needed for response) Outbound: Allow TCP 1024-65535 (ephemeral)
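In Terraform this difference shows up directly: the security group needs only the inbound rule, while the NACL needs a matching outbound rule for ephemeral ports. A minimal sketch, assuming a VPC (aws_vpc.main) and a NACL (aws_network_acl.public) are defined elsewhere:
# Stateful: allowing inbound 80 is enough; the response is allowed automatically
resource "aws_security_group" "web" {
  vpc_id = aws_vpc.main.id
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
# Stateless: the NACL needs BOTH directions spelled out
resource "aws_network_acl_rule" "http_in" {
  network_acl_id = aws_network_acl.public.id
  rule_number    = 100
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "0.0.0.0/0"
  from_port      = 80
  to_port        = 80
}
resource "aws_network_acl_rule" "ephemeral_out" {
  network_acl_id = aws_network_acl.public.id
  rule_number    = 100
  egress         = true
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "0.0.0.0/0"
  from_port      = 1024
  to_port        = 65535
}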
IAM: The Security Model You Must Master
IAM (Identity and Access Management) controls WHO can do WHAT to WHICH resources.
The IAM Policy Language
Every IAM policy answers these questions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow", ◄─── Allow or Deny?
"Action": [ ◄─── What actions?
"s3:GetObject",
"s3:PutObject"
],
"Resource": [ ◄─── Which resources?
"arn:aws:s3:::my-bucket/*"
],
"Condition": { ◄─── Under what conditions?
"StringEquals": {
"aws:RequestedRegion": "us-east-1"
}
}
}
]
}
IAM Policy Evaluation Logic
When AWS evaluates permissions, it follows this order:
┌─────────────────────────────────────────────────────────────────────────────┐
│ IAM Policy Evaluation │
│ │
│ 1. By default, all requests are DENIED (implicit deny) │
│ │ │
│ ▼ │
│ 2. Check all applicable policies │
│ ┌──────────────┬──────────────┬──────────────┬──────────────┐ │
│ │ Identity │ Resource │ Permission │ Service │ │
│ │ Policies │ Policies │ Boundaries │ Control │ │
│ │ (IAM user/ │ (S3 bucket, │ (IAM) │ Policies │ │
│ │ role) │ SQS queue) │ │ (Org level) │ │
│ └──────────────┴──────────────┴──────────────┴──────────────┘ │
│ │ │
│ ▼ │
│ 3. If ANY policy has explicit DENY ──────────────► DENIED (final) │
│ │ │
│ ▼ │
│ 4. If ANY policy has explicit ALLOW ─────────────► ALLOWED │
│ │ │
│ ▼ │
│ 5. If no explicit allow found ───────────────────► DENIED (implicit) │
│ │
│ Remember: Explicit DENY always wins! │
└─────────────────────────────────────────────────────────────────────────────┘
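You can exercise this evaluation logic without making real requests by using the IAM policy simulator from the CLI. A sketch (the user ARN, actions, and bucket are placeholders). Keep in mind the diagram above is the simplified view: when permission boundaries or SCPs apply, an allow must be present in each applicable layer, not just one of them.
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:user/alice \
  --action-names s3:GetObject s3:PutObject \
  --resource-arns arn:aws:s3:::my-bucket/reports/q1.csv
# Each result's EvalDecision is "allowed", "implicitDeny", or "explicitDeny"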
IAM Roles vs Users vs Groups
┌─────────────────────────────────────────────────────────────────────────────┐
│ IAM USERS │
│ - Permanent credentials (access key + secret key) │
│ - Used for: Human users, CLI access │
│ - Best practice: Use MFA, rotate keys regularly │
│ │
│ ┌─────────┐ │
│ │ User │───► Has access keys, belongs to groups │
│ └─────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ IAM GROUPS │
│ - Collection of users │
│ - Policies attached to group apply to all members │
│ - Used for: Organizing users by job function (Developers, Admins) │
│ │
│ ┌─────────┐ │
│ │ Group │───► Contains users, has policies │
│ │(Devs) │ ┌──────┐ ┌──────┐ ┌──────┐ │
│ └─────────┘ │User A│ │User B│ │User C│ │
│ └──────┘ └──────┘ └──────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ IAM ROLES │
│ - Temporary credentials (auto-rotated by AWS) │
│ - Can be ASSUMED by: EC2, Lambda, ECS, other AWS accounts, SAML users │
│ - Used for: Service-to-service authentication, cross-account access │
│ │
│ ┌─────────┐ ┌─────────────────────────────────────────┐ │
│ │ Role │ │ Trust Policy: WHO can assume this role │ │
│ │(Lambda) │◄────────│ Permissions: WHAT they can do │ │
│ └─────────┘ └─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Lambda function assumes role → Gets temp credentials → Accesses S3 │
└─────────────────────────────────────────────────────────────────────────────┘
Best Practice Hierarchy:
┌──────────────────────────────────────────────────────────────┐
│ PREFER ROLES OVER USERS │
│ │
│ ✓ Roles: Temporary creds, auto-rotated, no keys to leak │
│ ✗ Users: Permanent creds, must rotate manually, can leak │
│ │
│ EC2 instances? Use Instance Profile (role wrapper) │
│ Lambda? Use Execution Role │
│ ECS tasks? Use Task Role │
│ Cross-account? Use AssumeRole │
└──────────────────────────────────────────────────────────────┘
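To make the role/trust-policy split concrete, here is a minimal Terraform sketch of a Lambda execution role: the trust policy says WHO may assume the role (the Lambda service), and the attached policy says WHAT the role may do. The role name is a placeholder; the managed policy ARN is the standard AWSLambdaBasicExecutionRole.
resource "aws_iam_role" "lambda_exec" {
  name = "my-function-exec-role"
  # Trust policy: only the Lambda service can assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}
# Permissions: what the role can do once assumed (write logs to CloudWatch)
resource "aws_iam_role_policy_attachment" "basic_logs" {
  role       = aws_iam_role.lambda_exec.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}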
Serverless Architecture: Lambda, Step Functions & Event-Driven Design
The Lambda Execution Model
Understanding Lambda’s lifecycle is critical for performance:
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAMBDA EXECUTION LIFECYCLE │
│ │
│ COLD START (first invocation or after idle) │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ 1. Download code 2. Start runtime 3. Initialize handler │ │
│ │ from S3 (Python, Node) (your imports) │ │
│ │ ~100ms ~200ms ~500-3000ms │ │
│ │ │ │
│ │ TOTAL COLD START: 800ms - 5s depending on package size │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ WARM START (execution environment reused) │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ Environment already running, just execute handler │ │
│ │ TOTAL WARM START: <100ms │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ EXECUTION CONTEXT REUSE │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ # This code runs ONCE (cold start only) │ │
│ │ import boto3 │ │
│ │ s3_client = boto3.client('s3') # Reused across invocations! │ │
│ │ │ │
│ │ # This code runs EVERY invocation │ │
│ │ def handler(event, context): │ │
│ │ s3_client.get_object(...) # Uses pre-initialized client │ │
│ │ │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │
│ LAMBDA LIMITS: │
│ ┌──────────────────────────┬───────────────────────────────────────────┐ │
│ │ Max execution time │ 15 minutes │ │
│ │ Max memory │ 10 GB │ │
│ │ Max /tmp storage │ 10 GB (ephemeral, NOT persistent) │ │
│ │ Max deployment package │ 50 MB (zipped), 250 MB (unzipped) │ │
│ │ Max concurrent executions│ 1000 (soft limit, can increase) │ │
│ └──────────────────────────┴───────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
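As a complete file, the context-reuse pattern from the box above looks like this. A Python sketch (the bucket and key names are placeholders): the client and any cached configuration are created once per cold start, not once per request.
import json
import os
import boto3

# Cold start only: runs once per execution environment
s3 = boto3.client("s3")
CONFIG_BUCKET = os.environ.get("CONFIG_BUCKET", "my-config-bucket")  # placeholder

def handler(event, context):
    # Warm invocations reuse the client (and its connection pool) created above
    obj = s3.get_object(Bucket=CONFIG_BUCKET, Key="settings.json")
    settings = json.loads(obj["Body"].read())
    return {"statusCode": 200, "body": json.dumps({"loaded_keys": list(settings)})}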
Step Functions: Orchestrating Serverless Workflows
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP FUNCTIONS STATE MACHINE │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ StartState │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌───────────────┐ │ │
│ │ │ ValidateInput │ (Task: Lambda) │ │
│ │ └───────┬───────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌───────────────┐ │ │
│ │ │ Choice │ (Decision point) │ │
│ │ └───────┬───────┘ │ │
│ │ ┌────────┴────────┐ │ │
│ │ ▼ ▼ │ │
│ │ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Valid: Yes │ │ Valid: No │ │ │
│ │ └──────┬──────┘ └──────┬──────┘ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ ProcessData │ │ SendError │ │ │
│ │ │ (Task) │ │ Notification │ │ │
│ │ └───────┬───────┘ └───────┬───────┘ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌───────────────┐ End │ │
│ │ │ Parallel │ (Run multiple branches) │ │
│ │ │ ┌───┬───┐ │ │ │
│ │ │ │ A │ B │ │ │ │
│ │ │ └───┴───┘ │ │ │
│ │ └───────┬───────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌───────────────┐ │ │
│ │ │ SaveResults │ │ │
│ │ │ (Task) │ │ │
│ │ └───────┬───────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ End │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ STATE TYPES: │
│ ┌──────────┬────────────────────────────────────────────────────────────┐ │
│ │ Task │ Execute Lambda, ECS task, API call, etc. │ │
│ │ Choice │ If/else branching based on input │ │
│ │ Parallel │ Execute multiple branches simultaneously │ │
│ │ Map │ Iterate over array, execute state for each item │ │
│ │ Wait │ Delay for specified time │ │
│ │ Pass │ Pass input to output (useful for transformations) │ │
│ │ Succeed │ Terminal state indicating success │ │
│ │ Fail │ Terminal state indicating failure │ │
│ └──────────┴────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
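In Amazon States Language, a trimmed-down version of the diagram above (without the Parallel branch) could look like this. A sketch only: the Lambda and SNS ARNs are placeholders, and the $.valid field is assumed to be set by the validation function.
{
  "Comment": "Simplified sketch of the workflow above (ARNs are placeholders)",
  "StartAt": "ValidateInput",
  "States": {
    "ValidateInput": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-input",
      "Retry": [{ "ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 2, "MaxAttempts": 3, "BackoffRate": 2.0 }],
      "Next": "IsValid"
    },
    "IsValid": {
      "Type": "Choice",
      "Choices": [{ "Variable": "$.valid", "BooleanEquals": true, "Next": "ProcessData" }],
      "Default": "SendErrorNotification"
    },
    "ProcessData": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-data",
      "Next": "SaveResults"
    },
    "SendErrorNotification": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": { "TopicArn": "arn:aws:sns:us-east-1:123456789012:errors", "Message.$": "$.error" },
      "End": true
    },
    "SaveResults": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:save-results",
      "End": true
    }
  }
}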
Compute Options: When to Use What
┌─────────────────────────────────────────────────────────────────────────────┐
│ AWS COMPUTE DECISION TREE │
│ │
│ What are you running? │
│ │ │
│ ├──► Event-driven, short tasks (<15 min)? │
│ │ │ │
│ │ └──► Lambda (serverless, pay per invocation) │
│ │ │
│ ├──► Containerized application? │
│ │ │ │
│ │ ├──► Need Kubernetes? ─────► EKS │
│ │ │ │
│ │ └──► AWS-native? ──────────► ECS │
│ │ │ │
│ │ ├──► Don't want to manage servers? ──► Fargate │
│ │ └──► Need GPU/custom instances? ─────► EC2 │
│ │ │
│ └──► Traditional application, full OS control? │
│ │ │
│ └──► EC2 │
│ │ │
│ ├──► Predictable workload? ──► Reserved Instances │
│ ├──► Flexible timing? ───────► Spot Instances │
│ └──► Unknown/variable? ──────► On-Demand │
│ │
│ COST COMPARISON (approximate, varies by region): │
│ ┌──────────────────┬─────────────────────┬────────────────────────────┐ │
│ │ Service │ Pricing Model │ Best For │ │
│ ├──────────────────┼─────────────────────┼────────────────────────────┤ │
│ │ Lambda │ $0.20/1M requests │ Infrequent, event-driven │ │
│ │ │ + compute time │ tasks │ │
│ │ Fargate │ ~$0.04/vCPU-hour │ Containers without EC2 │ │
│ │ EC2 On-Demand │ ~$0.10/hour (t3.md) │ Unknown workloads │ │
│ │ EC2 Reserved │ ~40% cheaper │ Predictable, 1-3 year │ │
│ │ EC2 Spot │ ~70% cheaper │ Flexible, interruptible │ │
│ └──────────────────┴─────────────────────┴────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Core Concept Analysis
AWS services break down into these fundamental building blocks:
| Domain | Core Concepts to Internalize |
|---|---|
| Networking (VPC) | CIDR blocks, public/private subnets, route tables, NAT gateways, Internet gateways, security groups vs NACLs, VPC peering, Transit Gateway |
| Compute (EC2) | Instance types, AMIs, user data, auto-scaling groups, launch templates, Elastic IPs, placement groups |
| Containers (ECS/EKS) | Task definitions, services, clusters, Fargate vs EC2 launch types, service discovery, Kubernetes control plane |
| Serverless (Lambda) | Event sources, execution context, cold starts, layers, concurrency, IAM execution roles |
| Orchestration (Step Functions) | State machines, ASL (Amazon States Language), error handling, retries, parallel/map states |
| Storage (S3) | Buckets, objects, policies, versioning, lifecycle rules, storage classes, presigned URLs |
| Infrastructure as Code | CloudFormation, Terraform, resource dependencies, state management, drift detection |
Concept Summary Table
Before diving into projects, internalize these AWS concept clusters. Each project will force you to understand multiple concepts simultaneously:
| Concept Cluster | What You Need to Internalize |
|---|---|
| VPC & Networking | CIDR blocks and subnet math (how /16, /24 work), public vs private subnets (routing differences), route tables (0.0.0.0/0 means “default route”), Internet Gateway (public subnet requirement), NAT Gateway (private subnet internet access), security groups (stateful, instance-level), NACLs (stateless, subnet-level), VPC peering (connecting VPCs), Transit Gateway (hub-and-spoke networking) |
| Compute (EC2) | Instance types (compute vs memory vs storage optimized), AMIs (golden images), user data (bootstrap scripts on launch), instance profiles (IAM roles for EC2), auto-scaling groups (elasticity), launch templates (instance configuration), Elastic IPs (static public IPs), placement groups (low latency) |
| Serverless (Lambda) | Execution model (stateless, ephemeral), cold starts (initialization penalty), warm starts (reusing execution context), event sources (what triggers Lambda), layers (shared code/dependencies), concurrency (parallel executions), execution roles (IAM permissions), timeout limits (15 min max), memory allocation (128MB-10GB), /tmp storage (512 MB default, configurable up to 10 GB, ephemeral) |
| Containers (ECS/EKS) | Task definitions (container configuration), services (long-running tasks), clusters (logical grouping), Fargate vs EC2 launch types (serverless vs managed instances), awsvpc networking mode (each task gets ENI), service discovery (DNS-based), ALB target groups (container load balancing), EKS control plane (managed Kubernetes), ECR (container registry) |
| Orchestration (Step Functions) | State machines (workflow as code), ASL (Amazon States Language - JSON), states (Task, Choice, Parallel, Map, Wait, Pass, Fail, Succeed), error handling (Retry, Catch), parallel execution, map state (dynamic parallelism), Express vs Standard workflows |
| Storage (S3) | Buckets (global namespace), objects (key-value), S3 API operations (PUT, GET, DELETE, LIST), versioning (immutable history), lifecycle rules (transition/expiration), storage classes (Standard, IA, Glacier), presigned URLs (temporary access), CORS (cross-origin access), bucket policies vs IAM policies, S3 Select (query in place) |
| Security (IAM) | Policies (JSON documents), principals (who), actions (what), resources (where), conditions (when), identity-based policies (attached to users/roles), resource-based policies (attached to resources like S3 buckets), roles (temporary credentials), instance profiles (EC2 role wrapper), service-linked roles, policy evaluation logic (explicit deny wins) |
| Infrastructure as Code | Terraform state (source of truth), state locking (prevent concurrent modifications), CloudFormation stacks (grouped resources), drift detection (config vs reality), resource dependencies (implicit vs explicit), modules/nested stacks (reusability), workspaces/stack sets (multi-environment), import (existing resources into IaC) |
| Observability | CloudWatch Logs (log aggregation), log groups and streams, CloudWatch Metrics (time-series data), custom metrics, CloudWatch Alarms (threshold-based alerts), CloudWatch Dashboards (visualization), X-Ray tracing (distributed request tracking), segments and subsegments, service maps, VPC Flow Logs (network traffic), CloudTrail (API audit logs) |
Why this matters: AWS services don’t exist in isolation. When you build a VPC, you’re also configuring security groups, IAM roles, and CloudWatch logs. When you deploy Lambda, you need to understand IAM execution roles, VPC networking (if accessing RDS), and CloudWatch for debugging. These concepts interconnect constantly.
Deep Dive Reading by Concept
Map your project work to specific reading. Don’t read these books cover-to-cover upfront—use them as references when you hit specific challenges in your projects.
VPC & Networking Fundamentals
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| CIDR and IP Addressing | “The Linux Programming Interface” - Michael Kerrisk | Ch. 59 (Sockets: Internet Domains) | Understand IP addresses, subnetting, CIDR notation at the networking fundamentals level |
| VPC Architecture Patterns | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 3: Networking on AWS | Best comprehensive coverage of VPC design, multi-AZ patterns, and network segmentation |
| Security Groups vs NACLs | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 4: Security on AWS | Stateful vs stateless firewall rules, when to use each |
| NAT Gateway Design | AWS Architecture Blog: VPC Design Evolution | Full Article | Real-world patterns for scaling NAT, costs, and HA |
| Routing Deep Dive | AWS VPC User Guide | Route Tables Section | Official documentation on route table priority, longest prefix matching |
Compute (EC2) Deep Dive
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| EC2 Instance Types | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 6: Compute Services | When to choose compute-optimized vs memory-optimized vs storage-optimized |
| Auto Scaling Architecture | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 6: Compute Services | Scaling policies, health checks, target tracking vs step scaling |
| AMI Best Practices | AWS EC2 User Guide | AMIs Section | Golden images, versioning, cross-region copying |
| Instance Metadata Service | AWS EC2 User Guide | Instance Metadata Section | How user data works, IMDSv2, security implications |
Serverless (Lambda & Step Functions)
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| Lambda Execution Model | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 8: Serverless Architecture | Cold starts, execution context reuse, concurrency models |
| Event-Driven Architecture | “Designing Data-Intensive Applications” - Martin Kleppmann | Ch. 11: Stream Processing | Foundational understanding of event-driven systems, exactly-once processing |
| Step Functions State Machines | AWS Step Functions Developer Guide | ASL Specification | Learn the state machine language, error handling patterns |
| Lambda Best Practices | AWS Lambda Developer Guide | Best Practices Section | Performance optimization, error handling, testing strategies |
Containers (ECS/EKS)
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| ECS Architecture | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 9: Container Services | Task definitions, Fargate vs EC2 launch type, service discovery |
| Kubernetes Fundamentals | Amazon EKS Best Practices Guide | Full Guide | Networking, security, observability for EKS |
| Container Networking | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 9: Container Services | awsvpc mode, VPC integration, load balancer integration |
| Service Mesh Concepts | AWS App Mesh Documentation | What is App Mesh | Service-to-service communication, retries, circuit breakers |
Storage (S3) Deep Dive
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| S3 Data Model | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 5: Storage Services | Object storage concepts, consistency model, versioning |
| S3 Performance | S3 Performance Guidelines | Full Document | Request rate optimization, multipart upload, transfer acceleration |
| S3 Security | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 5: Storage Services | Bucket policies, ACLs, presigned URLs, encryption options |
| Lifecycle Management | S3 Lifecycle Configuration | Lifecycle Section | Storage class transitions, expiration policies, cost optimization |
Security & IAM
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| IAM Policy Fundamentals | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 4: Security on AWS | Policy structure, evaluation logic, least privilege |
| IAM Roles Deep Dive | AWS IAM User Guide | Roles Section | Trust policies, assume role, instance profiles, cross-account access |
| Security Best Practices | AWS Security Best Practices | Full Document | MFA, key rotation, policy conditions, service control policies |
| Secrets Management | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 4: Security on AWS | Secrets Manager vs Parameter Store, rotation strategies |
Infrastructure as Code
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| Terraform Fundamentals | HashiCorp Terraform Tutorials | Get Started on AWS | State management, resource dependencies, modules |
| CloudFormation Deep Dive | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 13: Infrastructure as Code | Stack operations, drift detection, change sets |
| Terraform State Management | Terraform State Documentation | State Section | Remote state, locking, workspaces, state migration |
| IaC Best Practices | Terraform Best Practices | Full Guide | Module structure, naming conventions, security scanning |
Observability & Monitoring
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| CloudWatch Fundamentals | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 12: Monitoring and Logging | Logs, metrics, alarms, dashboards |
| Distributed Tracing | “Designing Data-Intensive Applications” - Martin Kleppmann | Ch. 1: Reliable, Scalable, Maintainable | Understanding observability in distributed systems |
| X-Ray Deep Dive | AWS X-Ray Developer Guide | Full Guide | Trace segments, service maps, sampling rules |
| Log Analysis Patterns | CloudWatch Logs Insights Tutorial | Insights Query Syntax | Query language, aggregation, performance debugging |
Multi-Service Architecture
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| Well-Architected Framework | AWS Well-Architected Framework | All 6 Pillars | Operational Excellence, Security, Reliability, Performance, Cost, Sustainability |
| Distributed Systems Patterns | “Designing Data-Intensive Applications” - Martin Kleppmann | Ch. 8-12 | Replication, partitioning, consistency, consensus, batch vs stream processing |
| SaaS Architecture | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 10-11: Advanced Architectures | Multi-tenancy, isolation models, data partitioning strategies |
| Event-Driven Patterns | AWS Serverless Patterns Collection | Browse Patterns | Real-world architectures, EventBridge patterns, async workflows |
How to use this table: When you hit a specific challenge in a project (e.g., “My Lambda can’t access RDS in my VPC”), consult the relevant sections rather than reading books sequentially. Learn just-in-time based on what you’re building.
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
Before starting these projects, you need:
- Programming Fundamentals
- Comfortable with at least one language (Python, Node.js, or Go recommended)
- Understanding of HTTP, REST APIs, JSON
- Basic command-line proficiency (bash/zsh)
- Networking Basics
- What IP addresses are (public vs private)
- Basic understanding of DNS
- TCP/IP fundamentals (ports, protocols)
- CIDR notation (e.g., what 10.0.0.0/16 means)
- Linux/Unix Basics
- SSH into servers
- File permissions and ownership
- Environment variables
- Package managers (apt, yum)
- Version Control
- Git fundamentals (commit, push, pull, branching)
- GitHub/GitLab experience
Helpful But Not Required
These topics will be learned through the projects:
- Infrastructure as Code (Terraform/CloudFormation)
- Container technologies (Docker, Kubernetes)
- CI/CD pipelines
- AWS-specific services (you’ll learn by building!)
Self-Assessment Questions
Check if you’re ready:
- Can you explain what a subnet is and why it exists?
- Have you deployed a web application (even locally)?
- Do you understand what an API is and how to call one?
- Can you read and understand JSON?
- Have you used the command line to navigate directories and run scripts?
- Do you know the difference between public and private IP addresses?
- Have you worked with environment variables?
- Can you SSH into a remote server?
If you answered “yes” to at least 6 of these, you’re ready to start. If not, spend 1-2 weeks on basic networking and Linux fundamentals first.
Development Environment Setup
Required Tools:
# AWS CLI (for interacting with AWS from terminal)
aws --version # Should be v2.x or higher
# Terraform (for infrastructure as code)
terraform --version # Should be v1.5+
# Git (version control)
git --version
# Code editor (VS Code recommended with AWS/Terraform extensions)
AWS Account Setup:
- Create an AWS Free Tier account (12 months free for many services)
- CRITICAL: Set up billing alerts immediately (prevent surprise charges)
- CloudWatch alarm when estimated charges exceed $10 (a CLI sketch follows this list)
- Budget alert for monthly spending
- Create an IAM user with admin access (don’t use root account)
- Configure AWS CLI:
aws configure
- Enable MFA on both root and IAM user accounts
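One way to wire up the billing alarm mentioned above is the CLI call below. A sketch, not the only way: billing metrics live in us-east-1, they must first be enabled under Billing preferences, and the SNS topic ARN is a placeholder.
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name monthly-bill-over-10-usd \
  --namespace AWS/Billing \
  --metric-name EstimatedCharges \
  --dimensions Name=Currency,Value=USD \
  --statistic Maximum \
  --period 21600 \
  --evaluation-periods 1 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:billing-alerts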
Cost Expectations:
- Most projects fit within Free Tier limits if you’re careful
- Expect $5-20/month if you leave resources running
- Golden rule: Destroy resources when not in use (terraform destroy)
- Projects 3-4 may cost $20-50/month if left running
Time Investment
Realistic time estimates:
| Project | Time to Complete | Why This Long |
|---|---|---|
| Project 1: VPC | 1-2 weeks (15-25 hours) | Learning Terraform, understanding subnet math, debugging routing issues |
| Project 2: Serverless Pipeline | 1-2 weeks (15-20 hours) | Event-driven architecture is a paradigm shift, Step Functions syntax learning curve |
| Project 3: Auto-Scaling Web App | 2-3 weeks (25-35 hours) | Most complex, integrates 6+ services, debugging distributed systems |
| Project 4: Containers (ECS/EKS) | 2-3 weeks (25-40 hours) | Kubernetes has steep learning curve, EKS networking is intricate |
| Project 5: Full SaaS Platform | 3-4 weeks (40-60 hours) | Combines all previous projects, adds authentication, multi-tenancy |
Total time for all 5 projects: 3-4 months working 10 hours/week
These are learning estimates, not “I already know this” estimates. If you’re building while understanding from first principles, it takes time.
Important Reality Check
What makes AWS hard to learn:
- Interconnectedness: You can’t fully understand Lambda without understanding IAM roles, VPC networking, and CloudWatch logs
- Hidden complexity: Many services have 50+ configuration options, only 5 of which you actually need
- Cost anxiety: Fear of surprise bills slows down experimentation
- Documentation overload: AWS docs are comprehensive but not tutorial-friendly
- Rapid change: Services get updated, new features added constantly
Success strategies:
- Build one project at a time, fully complete it before moving on
- Use billing alarms religiously
- Join AWS communities (r/aws, AWS Discord servers)
- Use terraform destroy as a daily habit
- Don’t try to memorize—focus on understanding patterns
Quick Start: First 48 Hours for Overwhelmed Learners
Feeling overwhelmed by the 200+ services and 3000+ lines of documentation? Here’s your focused start.
Day 1: Get Your Hands Dirty (4 hours)
Goal: Deploy something to AWS and see it work.
- Hour 1: Set up your AWS account
- Create Free Tier account
- Set up billing alert for $10
- Create IAM user with admin access
- Configure the AWS CLI with your access keys (aws configure)
- Hour 2: Manual console exploration
- Launch an EC2 instance (t2.micro, free tier)
- SSH into it
- Install nginx:
sudo apt install nginx - See it running on public IP
- Then destroy it (terminate instance)
- Hour 3: Your first S3 bucket
# Create bucket (use unique name)
aws s3 mb s3://my-learning-bucket-123456
# Upload file
echo "Hello AWS" > test.txt
aws s3 cp test.txt s3://my-learning-bucket-123456/
# List objects
aws s3 ls s3://my-learning-bucket-123456/
# Clean up
aws s3 rb s3://my-learning-bucket-123456/ --force
- Hour 4: Your first Lambda function
- Go to AWS Console → Lambda
- Create function (Python 3.12, default settings)
- Replace code with:
def lambda_handler(event, context):
    return {'statusCode': 200, 'body': 'My first Lambda!'}
- Click "Test" and see it work
- Check CloudWatch logs (see your function’s output)
- Delete the function
What you just learned: Basic compute (EC2), object storage (S3), serverless (Lambda), and observability (CloudWatch). These are the foundations.
Day 2: Infrastructure as Code (4 hours)
Goal: Build the same S3 bucket, but with Terraform.
- Install Terraform (if not done)
brew install terraform   # Mac
# or download from terraform.io
- Create your first Terraform file (main.tf):
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
provider "aws" {
  region = "us-east-1"
}
resource "aws_s3_bucket" "learning" {
  bucket = "my-terraform-bucket-123456"  # Use unique name
}
- Run Terraform:
terraform init     # Download AWS provider
terraform plan     # See what will be created
terraform apply    # Create the bucket (type "yes")
- Verify it worked:
aws s3 ls | grep terraform
- Destroy it:
terraform destroy  # Type "yes"
What you just learned: Infrastructure as Code means your infrastructure configuration is versioned, reproducible, and reviewable. You’ll never click through the console again.
After 48 Hours: Choose Your Path
Now you’re ready to start Project 1: VPC. This is where real learning begins.
Recommended Learning Paths
Choose the path that matches your background and goals.
Path 1: “I’m a backend developer who wants to deploy my apps to AWS”
Focus: Learn just enough AWS to deploy and scale applications confidently.
Projects order:
- Start with Project 1 (VPC) → Understand networking basics
- Jump to Project 3 (Auto-Scaling Web App) → Deploy a real application with RDS, load balancing
- Optional: Project 2 (Serverless Pipeline) → If you work with data/event-driven systems
- Skip Project 4 (unless you’re using containers)
- Optional: Project 5 → If building SaaS products
Time: 6-8 weeks
Why this path: You’ll quickly get to deploying real applications (Project 3) after understanding networking fundamentals. Most backend developers need EC2/RDS/ALB more urgently than Kubernetes.
Path 2: “I’m a DevOps engineer who needs to architect AWS infrastructure”
Focus: Understand every layer, from VPC to Kubernetes.
Projects order:
- Project 1 (VPC) → Foundation
- Project 2 (Serverless Pipeline) → Event-driven architecture
- Project 4 (Containers - ECS then EKS) → Modern compute platforms
- Project 3 (Auto-Scaling Web App) → Traditional web architecture
- Project 5 (Full SaaS) → Pulling it all together
Time: 12-16 weeks
Why this path: You need deep understanding of all compute models (EC2, Lambda, containers) and how they fit into different architectural patterns. You’ll be making these trade-off decisions daily.
Path 3: “I’m coming from on-prem infrastructure and need to understand cloud”
Focus: Map your existing knowledge to AWS equivalents, understand the cloud-native differences.
Projects order:
- Project 1 (VPC) → This is your “data center” in AWS
- Project 3 (Auto-Scaling Web App) → Traditional architecture with cloud benefits (auto-scaling, managed databases)
- Project 2 (Serverless Pipeline) → The paradigm shift to event-driven, serverless
- Project 4 (ECS, not EKS) → Containers without Kubernetes complexity
- Optional: Project 5 → Multi-tenancy patterns
Time: 10-12 weeks
Why this path: Starts with familiar concepts (networking, VMs, load balancers) then progressively introduces cloud-native patterns. ECS is easier than EKS for container beginners.
Path 4: “I want to build serverless/event-driven systems”
Focus: Lambda, Step Functions, EventBridge, S3, DynamoDB.
Projects order:
- Project 1 (VPC) → Even serverless needs networking knowledge
- Project 2 (Serverless Pipeline) → Your core competency
- Project 3 (Auto-Scaling Web App) → Hybrid: Lambda + traditional components
- Skip Project 4 (unless curious about containers)
- Project 5 (Full SaaS) → Build it serverless-first
Time: 8-10 weeks
Why this path: Optimizes for Lambda/Step Functions mastery. You’ll learn RDS in Project 3, but can substitute DynamoDB if preferred.
Path 5: “I need AWS certification (Solutions Architect Associate)”
Focus: Breadth over depth, understand all core services.
Projects order:
- Project 1 (VPC) → 25% of exam questions
- Project 3 (Auto-Scaling Web App) → Most services in one project
- Project 2 (Serverless Pipeline) → Lambda/Step Functions coverage
- Project 4 (ECS path, skip EKS) → Container basics, EKS rarely on Associate exam
- Read AWS Well-Architected Framework → Critical for exam
Time: 8 weeks + 2 weeks exam prep
Why this path: Covers VPC (critical), EC2/RDS/ALB (core), Lambda (serverless), S3 (storage), IAM (security), CloudWatch (monitoring). These are 80% of the exam.
Project 1: Production-Ready VPC from Scratch (with Terraform)
| Attribute | Value |
|---|---|
| Language | HCL (Terraform) |
| Difficulty | Intermediate |
| Time | 1–2 weeks |
| Knowledge Area | Cloud Networking / Infrastructure as Code |
| Coolness | ★☆☆☆☆ Pure Corporate Snoozefest |
| Portfolio Value | Resume Gold |
What you’ll build: A multi-AZ VPC with public/private subnets, NAT gateways, bastion host, and a deployed web application—all defined in Terraform that you can destroy and recreate at will.
Why it matters: You cannot understand VPCs by clicking through the console. You need to see what happens when a private subnet has no route to a NAT gateway, when a security group blocks outbound traffic, or when your CIDR blocks overlap with a peered VPC. Building with IaC forces you to explicitly declare every component and understand their relationships.
Core challenges:
- CIDR planning and non-overlapping IP ranges that allow future growth and peering
- Routing logic: understanding why private subnets route to NAT vs public subnets route to IGW
- Security layers: configuring security groups (stateful) vs NACLs (stateless) and knowing when to use each
- High availability: deploying across multiple AZs with redundant NAT gateways
- Bastion access: SSH tunneling through a jump host to reach private instances
Key concepts to master:
- VPC networking and subnetting (CIDR blocks)
- Route tables and internet connectivity patterns
- Security group vs NACL design
- Multi-AZ architecture for high availability
- Infrastructure as Code with Terraform
Prerequisites: Basic AWS console navigation, understanding of IP addresses, familiarity with any IaC tool.
Deliverable: A fully functional multi-AZ VPC with public/private subnets, NAT gateways, bastion host, and deployed web application—all managed by Terraform code that can be destroyed and recreated deterministically.
Implementation hints:
- Start with a single public subnet and IGW before adding complexity
- Use Terraform modules for reusable VPC components
- Test network connectivity at each stage (ping, curl, traceroute)
- Enable VPC Flow Logs from the start for debugging
Milestones:
- Deploy VPC with public subnet only, web server reachable from internet → you understand IGW + route tables
- Add private subnet with NAT, move app server there, bastion-only access → you understand public vs private architecture
- Multi-AZ deployment with ALB → you understand high availability patterns
- Add VPC Flow Logs and analyze traffic → you understand network observability
Real World Outcome
This is what your working VPC infrastructure looks like when you complete this project:
# 1. Deploy the infrastructure with Terraform
$ cd terraform/vpc-project
$ terraform init
Initializing provider plugins...
- Downloading hashicorp/aws v5.31.0...
Terraform has been successfully initialized!
$ terraform apply
Plan: 23 to add, 0 to change, 0 to destroy.
Do you want to perform these actions?
Enter a value: yes
aws_vpc.main: Creating...
aws_vpc.main: Creation complete after 2s [id=vpc-0abc123def456789]
aws_internet_gateway.main: Creating...
aws_subnet.public_a: Creating...
aws_subnet.public_b: Creating...
aws_subnet.private_a: Creating...
aws_subnet.private_b: Creating...
...
Apply complete! Resources: 23 added, 0 changed, 0 destroyed.
Outputs:
vpc_id = "vpc-0abc123def456789"
public_subnet_ids = ["subnet-0aaa111", "subnet-0bbb222"]
private_subnet_ids = ["subnet-0ccc333", "subnet-0ddd444"]
bastion_public_ip = "54.123.45.67"
web_server_private_ip = "10.0.10.50"
nat_gateway_ip = "52.87.123.45"
# 2. SSH into the bastion host
$ ssh -i ~/.ssh/aws-bastion.pem ec2-user@54.123.45.67
The authenticity of host '54.123.45.67' can't be established.
Are you sure you want to continue connecting (yes/no)? yes
__| __|_ )
_| ( / Amazon Linux 2
___|\___|___|
[ec2-user@bastion ~]$ hostname
ip-10-0-1-25.ec2.internal
# 3. From bastion, SSH into private web server (SSH agent forwarding)
[ec2-user@bastion ~]$ ssh 10.0.10.50
[ec2-user@webserver ~]$ hostname
ip-10-0-10-50.ec2.internal
# 4. Verify the web server can reach internet via NAT
[ec2-user@webserver ~]$ curl -s https://ifconfig.me
52.87.123.45 # This is the NAT Gateway's IP, NOT the web server's private IP!
# 5. Verify the web server cannot be reached directly from internet
$ curl http://10.0.10.50 # From your local machine
curl: (7) Failed to connect to 10.0.10.50 port 80: No route to host
# CORRECT! Private subnet is not reachable from internet
# 6. Check VPC Flow Logs in CloudWatch
$ aws logs filter-log-events \
--log-group-name /aws/vpc/flowlogs \
--filter-pattern "[version, account, eni, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log_status]" \
--limit 5 \
--profile douglascorrea_io --no-cli-pager
{
"events": [
{
"message": "2 123456789012 eni-0abc123 10.0.1.25 54.123.45.67 22 52341 6 25 4500 1703264400 1703264460 ACCEPT OK",
"ingestionTime": 1703264465000
},
{
"message": "2 123456789012 eni-0def456 10.0.10.50 52.87.123.45 443 32145 6 10 1500 1703264400 1703264460 ACCEPT OK",
"ingestionTime": 1703264465000
}
]
}
# Flow logs show: SSH to bastion (ACCEPT), HTTPS from private instance via NAT (ACCEPT)
# 7. Describe your VPC and subnets
$ aws ec2 describe-vpcs --vpc-ids vpc-0abc123def456789 \
--query 'Vpcs[0].{VpcId:VpcId,CIDR:CidrBlock,State:State}' \
--output table --no-cli-pager
-----------------------------------------
| DescribeVpcs |
+-------+------------------+------------+
| CIDR | State | VpcId |
+-------+------------------+------------+
|10.0.0.0/16| available |vpc-0abc123 |
+-------+------------------+------------+
$ aws ec2 describe-subnets --filters "Name=vpc-id,Values=vpc-0abc123def456789" \
--query 'Subnets[*].{SubnetId:SubnetId,AZ:AvailabilityZone,CIDR:CidrBlock,Public:MapPublicIpOnLaunch}' \
--output table --no-cli-pager
------------------------------------------------------------
| DescribeSubnets |
+--------------+---------------+------------+--------------+
| AZ | CIDR | Public | SubnetId |
+--------------+---------------+------------+--------------+
|us-east-1a |10.0.1.0/24 | True |subnet-0aaa111|
|us-east-1b |10.0.2.0/24 | True |subnet-0bbb222|
|us-east-1a |10.0.10.0/24 | False |subnet-0ccc333|
|us-east-1b |10.0.20.0/24 | False |subnet-0ddd444|
+--------------+---------------+------------+--------------+
# 8. View route tables to understand routing
$ aws ec2 describe-route-tables --filters "Name=vpc-id,Values=vpc-0abc123def456789" \
--query 'RouteTables[*].{RouteTableId:RouteTableId,Routes:Routes[*].{Destination:DestinationCidrBlock,Target:GatewayId||NatGatewayId}}' \
--no-cli-pager
# Public route table: 0.0.0.0/0 → igw-xxxxx (Internet Gateway)
# Private route table: 0.0.0.0/0 → nat-xxxxx (NAT Gateway)
# 9. Clean up when done (saves money!)
$ terraform destroy
Plan: 0 to add, 0 to change, 23 to destroy.
Do you want to perform these actions?
Enter a value: yes
aws_instance.web_server: Destroying...
aws_nat_gateway.main: Destroying...
...
Destroy complete! Resources: 23 destroyed.
AWS Console View:
- VPC Dashboard showing your VPC with proper CIDR block
- Subnet map showing public subnets with Internet Gateway icon, private subnets with NAT icon
- Route tables tab showing explicit routes for each subnet type
- Security Groups showing inbound/outbound rules
- VPC Flow Logs showing real network traffic in CloudWatch
The Core Question You’re Answering
“What IS a VPC, and how does network traffic actually flow between the internet, public subnets, private subnets, and databases?”
Before you write any Terraform code, sit with this question. Most developers have a vague sense that “VPCs are networks” but can’t explain:
- Why a public subnet can receive traffic from the internet but a private subnet cannot
- What the difference between a route table entry and a security group rule is
- Why you need a NAT Gateway (and why it costs money) for private instances to download packages
- How data flows when a user hits your website: ALB → EC2 → RDS and back
This project forces you to confront these questions because misconfiguration means your infrastructure doesn’t work.
Concepts You Must Understand First
Stop and research these before coding:
- CIDR Blocks and Subnetting
- What does 10.0.0.0/16 actually mean in binary?
- How many IP addresses are in a /24 subnet?
- Why does AWS reserve 5 IPs per subnet?
- What happens if two VPCs have overlapping CIDR blocks and you try to peer them?
- Book Reference: “Computer Systems: A Programmer’s Perspective” Ch. 11 (Network Programming) - Bryant & O’Hallaron
- Routing and the Default Gateway (0.0.0.0/0)
- What does 0.0.0.0/0 mean in a route table?
- When a packet leaves an EC2 instance, how does the VPC know where to send it?
- What’s the difference between
localroute andigw-xxxxxroute? - Why does longest prefix match matter?
- Book Reference: “TCP/IP Illustrated, Volume 1” Ch. 9 (IP Routing) - W. Richard Stevens
- Internet Gateway vs NAT Gateway
- What does “stateful NAT” mean?
- Why can’t you just put an Internet Gateway on a private subnet?
- How does a NAT Gateway translate private IPs to public IPs?
- Why does the NAT Gateway need to be in a public subnet?
- Book Reference: “AWS for Solutions Architects” Ch. 3 - Saurabh Shrivastava
- Security Groups (Stateful) vs NACLs (Stateless)
- What does “stateful” mean in terms of network connections?
- If you allow inbound port 80, do you need an outbound rule for the response?
- Why do NACLs need explicit rules for ephemeral ports?
- When would you use a NACL instead of just security groups?
- Book Reference: “AWS for Solutions Architects” Ch. 4 (Security on AWS) - Saurabh Shrivastava
- High Availability and Availability Zones
- What is an Availability Zone physically?
- Why do you need subnets in multiple AZs?
- What happens if us-east-1a fails but you only deployed to that AZ?
- How does an ALB distribute traffic across AZs?
- Book Reference: “AWS for Solutions Architects” Ch. 2 (AWS Global Infrastructure)
Questions to Guide Your Design
Before implementing, think through these:
- CIDR Planning
- How many IP addresses will you need now? In 5 years?
- What if you need to add a third AZ later?
- What if you need to peer with another VPC that uses 10.0.0.0/16?
- How will you segment: web tier, app tier, database tier?
- Public vs Private Decisions
- What resources MUST be in public subnets? (Hint: very few)
- What resources should NEVER be in public subnets? (Hint: databases!)
- How will private resources get software updates from the internet?
- Bastion Host Design
- Why use a bastion instead of putting EC2 in public subnet with SSH?
- How do you secure the bastion itself?
- Should you use Session Manager instead of SSH? (Yes, probably)
- Security Group Strategy
- Can you reference one security group from another?
- How do you allow app servers to talk to databases without opening the database to everything?
- What’s the principle of least privilege in security group terms?
- Cost Considerations
- How much does a NAT Gateway cost per hour? Per GB transferred?
- Do you need one NAT Gateway per AZ or can you share?
- What’s the trade-off between cost and availability?
Thinking Exercise
Before coding, draw this diagram on paper:
Your Task: Fill in the missing pieces
Internet
│
▼
┌───────────────────────────────────────────────────────────────┐
│ VPC: 10.0.0.0/16 │
│ │
│ ┌─────────────────────────┐ ┌─────────────────────────────┐ │
│ │ Public Subnet A │ │ Public Subnet B │ │
│ │ CIDR: ____________ │ │ CIDR: ____________ │ │
│ │ │ │ │ │
│ │ What goes here? │ │ What goes here? │ │
│ │ ___________________ │ │ ___________________ │ │
│ └─────────────────────────┘ └─────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────┐ ┌─────────────────────────────┐ │
│ │ Private Subnet A │ │ Private Subnet B │ │
│ │ CIDR: ____________ │ │ CIDR: ____________ │ │
│ │ │ │ │ │
│ │ What goes here? │ │ What goes here? │ │
│ │ ___________________ │ │ ___________________ │ │
│ └─────────────────────────┘ └─────────────────────────────┘ │
│ │
│ Route Table (Public): │
│ ___________________ → ___________________ │
│ ___________________ → ___________________ │
│ │
│ Route Table (Private): │
│ ___________________ → ___________________ │
│ ___________________ → ___________________ │
└────────────────────────────────────────────────────────────────┘
Questions while drawing:
- Why does each public subnet need a different CIDR?
- What AWS resource creates the connection to the internet?
- How does traffic from the private subnet reach the internet?
- What happens to the response traffic?
The Interview Questions They’ll Ask
Prepare to answer these:
- “Walk me through how a request from a user’s browser reaches your web server in a private subnet.”
- Expected: User → Internet → ALB (public subnet) → EC2 (private subnet) → Response reverses
- “Why is your database in a private subnet? How does it get software updates?”
- Expected: Security - no direct internet access. Updates via NAT Gateway or VPC endpoints.
- “Your app in us-east-1a can’t reach the database in us-east-1b. What do you check?”
- Expected: Security groups (allow from app SG?), NACLs (if modified), Route tables, VPC peering (if different VPCs)
- “What’s the difference between an Internet Gateway and a NAT Gateway?”
- Expected: IGW provides two-way communication (public IPs). NAT provides outbound-only for private resources.
- “Your NAT Gateway bill is $500/month. How do you reduce it?”
- Expected: Check if you have one per AZ (maybe share), use VPC endpoints for AWS services (S3, DynamoDB), check for excessive data transfer (a gateway endpoint sketch follows this list)
- “Explain security groups vs NACLs. When would you use each?”
- Expected: SG = stateful, instance-level, allow rules only. NACL = stateless, subnet-level, allow/deny rules. Use SG for app logic, NACL for subnet-wide rules or explicit denies.
- “What CIDR block would you use for a VPC? Why?”
- Expected: RFC 1918 private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). Consider future growth and peering requirements, and plan non-overlapping blocks up front (see the sketch below).
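To make the CIDR math concrete, here is a minimal Python sketch using only the standard-library `ipaddress` module. It carves a 10.0.0.0/16 VPC into /24 subnets (the same arithmetic as Terraform's `cidrsubnet`) and checks that two candidate VPC ranges don't overlap; the specific ranges are illustrative, not prescriptive.

```python
import ipaddress

# Candidate VPC CIDRs (illustrative RFC 1918 ranges)
vpc_a = ipaddress.ip_network("10.0.0.0/16")
vpc_b = ipaddress.ip_network("10.1.0.0/16")

# Equivalent of Terraform's cidrsubnet(vpc, 8, index): /16 + 8 bits = /24 subnets
subnets = list(vpc_a.subnets(new_prefix=24))
print(subnets[0], subnets[1])    # 10.0.0.0/24, 10.0.1.0/24  (e.g., public A/B)
print(subnets[10], subnets[11])  # 10.0.10.0/24, 10.0.11.0/24 (e.g., private A/B)

# Overlap check before setting up VPC peering
print(vpc_a.overlaps(vpc_b))                                   # False -> peering-safe
print(vpc_a.overlaps(ipaddress.ip_network("10.0.128.0/17")))   # True  -> would collide
```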
Hints in Layers
Hint 1: Start with the VPC resource
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags = {
    Name = "production-vpc"
  }
}
Hint 2: Create subnets with data source for AZs
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true # This makes it "public"
  tags = {
    Name = "public-${count.index + 1}"
  }
}
Hint 3: Internet Gateway requires explicit route
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public" {
  count          = length(aws_subnet.public)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}
Hint 4: NAT Gateway needs Elastic IP first
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id # NAT must be in a PUBLIC subnet
  depends_on    = [aws_internet_gateway.main]
}
Hint 5: Security Group allowing SSH from bastion only
resource "aws_security_group" "web" {
  name   = "web-server-sg"
  vpc_id = aws_vpc.main.id

  ingress {
    from_port       = 22
    to_port         = 22
    protocol        = "tcp"
    security_groups = [aws_security_group.bastion.id] # Only from the bastion!
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
Books That Will Help
| Topic | Book | Specific Chapters | Why It Helps |
|---|---|---|---|
| VPC Fundamentals | “AWS for Solutions Architects” by Saurabh Shrivastava | Ch. 3: Networking on AWS | Best comprehensive coverage of VPC architecture patterns and design decisions |
| Security Groups & IAM | “AWS for Solutions Architects” by Saurabh Shrivastava | Ch. 4: Security on AWS | Security group vs NACL differences, IAM roles for EC2 |
| TCP/IP Fundamentals | “TCP/IP Illustrated, Volume 1” by W. Richard Stevens | Ch. 1-3, 9 (IP Routing) | Deep understanding of how packets flow and routing works |
| CIDR & IP Addressing | “Computer Networks” by Tanenbaum & Wetherall | Ch. 5: Network Layer | Mathematical foundation of IP addressing and subnetting |
| Terraform Basics | “Terraform: Up & Running” by Yevgeniy Brikman | Ch. 2-3: Terraform State | Managing infrastructure as code, state management |
| Network Security | “The Linux Programming Interface” by Michael Kerrisk | Ch. 59-61: Sockets | Understanding network connections at the OS level |
| AWS Well-Architected | AWS Well-Architected Framework (free) | Security Pillar | Official AWS best practices for secure VPC design |
Reading strategy:
- Start with “AWS for Solutions Architects” Ch. 3 (VPC overview)
- Read “TCP/IP Illustrated” Ch. 9 if routing concepts are unclear
- Refer to “Terraform: Up & Running” as you write your code
- Use AWS Well-Architected Security Pillar as a checklist
Common Pitfalls & Debugging
Problem 1: “My Terraform apply fails with ‘InvalidVPCID.NotFound’“
- Why: Terraform is trying to create a resource before its dependency (the VPC) exists
- Fix: Ensure resources explicitly reference the VPC: `vpc_id = aws_vpc.main.id` (not a string)
- Quick test: Run `terraform plan` and verify resource dependencies are correct
Problem 2: “I can SSH to the bastion but can’t reach the private EC2 instance”
- Why: Security group on private instance doesn’t allow inbound SSH from bastion’s security group
- Fix:
  resource "aws_security_group" "private_ec2" {
    ingress {
      from_port       = 22
      to_port         = 22
      protocol        = "tcp"
      security_groups = [aws_security_group.bastion.id] # Not CIDR!
    }
  }
- Quick test: `aws ec2 describe-security-groups --group-ids <private-sg-id> --query 'SecurityGroups[0].IpPermissions'`
Problem 3: “My private EC2 instance can’t access the internet (yum/apt fails)”
- Why: Route table for private subnet isn’t routing 0.0.0.0/0 to NAT Gateway, or NAT Gateway isn’t in a public subnet
- Fix:
- Verify NAT Gateway is in a public subnet
- Verify the private route table has: 0.0.0.0/0 → nat-xxxxx (a sketch of adding this route follows below)
- Verify the NAT Gateway’s Elastic IP exists
- Quick test:
  # SSH to the private instance via the bastion
  curl -v https://www.google.com   # Should work
  # Check the route table
  aws ec2 describe-route-tables --filters "Name=vpc-id,Values=<vpc-id>"
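If the 0.0.0.0/0 route is missing, declare it in Terraform or, for a quick manual fix, add it with boto3. A minimal sketch, assuming placeholder IDs (the `rtb-...` and `nat-...` values are hypothetical and must be replaced with your private route table and NAT Gateway):

```python
import boto3

ec2 = boto3.client("ec2")

# Send all non-VPC traffic from the private subnet through the NAT Gateway.
# IDs are placeholders; use your own private route table and NAT Gateway IDs.
ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId="nat-0123456789abcdef0",
)

# Confirm the route landed where you expect
routes = ec2.describe_route_tables(RouteTableIds=["rtb-0123456789abcdef0"])
for r in routes["RouteTables"][0]["Routes"]:
    print(r.get("DestinationCidrBlock"), "->", r.get("NatGatewayId") or r.get("GatewayId"))
```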
Problem 4: “terraform destroy hangs when destroying NAT Gateway”
- Why: ENIs (Elastic Network Interfaces) attached to NAT Gateway take time to detach
- Fix: Wait 3-5 minutes. If it’s still hanging, manually delete the NAT Gateway in the console, then re-run `terraform destroy`
- Prevention: Always run `terraform destroy` when done to avoid hourly charges
Problem 5: “My CIDR blocks overlap and VPC peering fails”
- Why: You used 10.0.0.0/16 for both VPCs
- Fix: Plan CIDR blocks in advance:
- VPC A: 10.0.0.0/16
- VPC B: 10.1.0.0/16
- VPC C: 10.2.0.0/16
- Quick test: Use the `cidrsubnet()` Terraform function to derive subnet ranges mathematically and avoid overlaps
Problem 6: “AWS bill is $50 after 1 day - I thought this was free tier!”
- Why: NAT Gateway costs $0.045/hour ($32/month) + data transfer fees. NOT covered by free tier.
- Fix:
  - Run `terraform destroy` when you’re not using the environment
  - Use VPC Endpoints for S3/DynamoDB (free, avoids NAT)
  - Consider one NAT Gateway instead of one per AZ for learning
- Cost check:
  aws ce get-cost-and-usage \
    --time-period Start=2024-12-01,End=2024-12-21 \
    --granularity DAILY \
    --metrics BlendedCost \
    --group-by Type=SERVICE
Problem 7: “terraform apply creates subnets in the same AZ”
- Why: Using `count.index` directly on a potentially unordered AZ list
- Fix:
  data "aws_availability_zones" "available" {
    state = "available"
  }

  # Ensure you're using different indices
  resource "aws_subnet" "public" {
    count             = 2
    availability_zone = data.aws_availability_zones.available.names[count.index]
    # ...
  }
- Quick test: `aws ec2 describe-subnets --filters "Name=vpc-id,Values=<vpc-id>" --query 'Subnets[*].AvailabilityZone'`
Problem 8: “VPC Flow Logs show all REJECT - but connections work fine”
- Why: You’re looking at ephemeral port traffic that NACLs might be blocking (if you modified default NACL)
- Fix: Default NACLs allow all traffic. If you customized NACLs, ensure ephemeral ports (1024-65535) are allowed for responses
- Debugging flow logs:
  # Filter for REJECTs only
  aws logs filter-log-events \
    --log-group-name /aws/vpc/flowlogs \
    --filter-pattern "REJECT" \
    --limit 20
Project 2: Serverless Data Pipeline (Lambda + Step Functions + S3)
| Attribute | Value |
|---|---|
| Language | Python (alt: TypeScript, Go, Java) |
| Difficulty | Intermediate |
| Time | 1–2 weeks |
| Knowledge Area | Serverless, Event-Driven Architecture |
| Coolness | ★★☆☆☆ Practical but Forgettable |
| Portfolio Value | Resume Gold |
What you’ll build: An automated data processing pipeline that triggers when files land in S3, orchestrates multiple Lambda functions through Step Functions, handles errors gracefully, and outputs processed results—all without a single server to manage.
Why it matters: Serverless is not “just deploy a function.” It’s about understanding event-driven architecture, dealing with cold starts, managing state across stateless functions, and designing for failure. Step Functions force you to think about workflow as a first-class concept.
Core challenges:
- Event-driven triggers: configuring S3 event notifications to invoke Lambda
- State machine design: modeling your pipeline as explicit states with transitions
- Error handling: implementing retries, catch blocks, and fallback logic in Step Functions
- Lambda limits: working within 15-minute timeout, /tmp storage, memory limits
- IAM execution roles: granting Lambda only the permissions it needs
Key concepts to master:
- Event-driven architecture and decoupled systems
- Step Functions state machine design
- Lambda execution model and cold start optimization
- S3 event notifications
- Error handling and retry strategies
Prerequisites: Basic Python/Node.js, understanding of JSON, Project 1 completed (VPC knowledge).
Deliverable: An automated data processing pipeline that triggers on S3 upload, orchestrates multiple Lambda functions through Step Functions with proper error handling, and outputs processed results to a destination bucket—all observable through CloudWatch logs and metrics.
Implementation hints:
- Start with a single Lambda function before adding Step Functions
- Use Step Functions visual designer to prototype your workflow
- Test error handling by intentionally failing Lambda functions
- Use S3 event notification filters to avoid infinite loops
Milestones:
- Single Lambda triggered by S3 upload → you understand event sources
- Chain 3 Lambdas via Step Functions → you understand state machines
- Add parallel processing and error handling → you understand resilience patterns
- Add SNS notifications and CloudWatch alarms → you understand observability in serverless
Real World Outcome
This is what your working pipeline looks like in action:
# Upload a CSV file to trigger the pipeline
$ aws s3 cp sales_data.csv s3://my-pipeline-bucket/input/ --profile douglascorrea_io --no-cli-pager
upload: ./sales_data.csv to s3://my-pipeline-bucket/input/sales_data.csv
# Check Step Functions execution status
$ aws stepfunctions list-executions \
--state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:DataProcessingPipeline \
--profile douglascorrea_io --no-cli-pager
{
"executions": [
{
"executionArn": "arn:aws:states:us-east-1:123456789012:execution:DataProcessingPipeline:execution-2024-12-20-15-30-45",
"name": "execution-2024-12-20-15-30-45",
"status": "SUCCEEDED",
"startDate": "2024-12-20T15:30:45.123Z",
"stopDate": "2024-12-20T15:32:10.456Z"
}
]
}
# View CloudWatch logs for the validation Lambda
$ aws logs tail /aws/lambda/validate-data --since 5m --profile douglascorrea_io --no-cli-pager
2024-12-20T15:30:46.234Z START RequestId: abc123-def456 Version: $LATEST
2024-12-20T15:30:46.567Z INFO Validating CSV structure for sales_data.csv
2024-12-20T15:30:46.789Z INFO Found 1247 records, all valid
2024-12-20T15:30:47.012Z END RequestId: abc123-def456
2024-12-20T15:30:47.045Z REPORT Duration: 811ms Billed Duration: 900ms Memory: 512MB Max Memory Used: 128MB
# View the processed output
$ aws s3 ls s3://my-pipeline-bucket/output/ --profile douglascorrea_io --no-cli-pager
2024-12-20 15:32:10 45632 processed_sales_data.json
2024-12-20 15:32:10 1248 processing_summary.json
Step Functions Visual Workflow (the AWS Console shows the execution in real time):
- ValidateData Lambda: SUCCESS (Duration: 811ms)
- TransformData Lambda: SUCCESS (Duration: 58.2s)
- Parallel Processing: Both branches SUCCESS
- CalculateStatistics: 2.3s
- GenerateReport: 1.8s
- WriteOutput Lambda: SUCCESS (Duration: 245ms)
- SNS Notification: Delivered
CloudWatch Logs Insights showing complete pipeline flow with timestamps, error-free execution, and performance metrics for each Lambda invocation.
The Core Question You’re Answering
“How do I build systems that react to events without managing servers, and how do I coordinate multiple functions reliably?”
More specifically:
- How do I design a system where dropping a file automatically triggers processing without a cron job?
- How do I chain multiple processing steps when each Lambda is stateless and limited to 15 minutes?
- How do I handle failures gracefully when any step might time out or throw an error?
- How do I pass data between Lambdas without coupling them tightly?
- How do I make my pipeline idempotent so reprocessing the same file produces the same result?
Concepts You Must Understand First
1. Event-Driven Architecture vs Request-Response
Request-Response (traditional):
Client → Server → Database → Server → Client
(synchronous, client waits for entire operation)
Event-Driven (serverless):
Event Producer → Event → Event Consumer
(asynchronous, producer fires and forgets)
Key: Your S3 upload doesn’t “call” your Lambda. It emits an event. This decoupling makes serverless scalable but harder to debug.
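To see what that event actually looks like from the consumer's side, here is a minimal handler sketch that pulls the bucket and key out of the standard S3 event notification payload (the print statements are just for illustration):

```python
import urllib.parse

def handler(event, context):
    # An S3 event notification delivers a list of records, one per object event
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded (spaces become '+', etc.)
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        print(f"Object created: s3://{bucket}/{key} ({size} bytes)")
    return {"recordsSeen": len(event["Records"])}
```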
2. Lambda Execution Model (Cold Starts, Execution Context Reuse)
Lambda lifecycle:
- INIT Phase (cold start): Download code, start environment, initialize runtime (100ms-3s)
- INVOKE Phase: Run your handler function
- SHUTDOWN Phase: Environment terminated after idle timeout
Context reuse:
# Runs ONCE per execution environment (cold start)
import boto3
s3_client = boto3.client('s3')
# Runs EVERY invocation (warm or cold)
def handler(event, context):
    s3_client.get_object(Bucket='my-bucket', Key='file.csv')
Why it matters: Cold starts add latency. Initialize connections outside the handler to reuse them.
3. State Machines and Workflow Orchestration
With Step Functions:
- Visual workflow in console
- Declarative retry/error handling
- State maintained between steps
- Execution history for debugging
Step Functions is your distributed transaction coordinator.
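To give "declarative retry/error handling" some shape, here is a minimal Amazon States Language definition expressed as a Python dict. The state names and Lambda ARNs are placeholders based on this project's workflow, not a finished pipeline; you would feed the resulting JSON to Terraform or `create_state_machine`.

```python
import json

state_machine = {
    "StartAt": "ValidateData",
    "States": {
        "ValidateData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-data",
            "Retry": [{
                # Retry transient Lambda failures with exponential backoff
                "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
                "IntervalSeconds": 2,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "Catch": [{
                # Anything else falls through to a failure state, keeping the error in $.error
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error",
                "Next": "SendAlert",
            }],
            "Next": "TransformData",
        },
        "TransformData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-data",
            "End": True,
        },
        "SendAlert": {"Type": "Fail", "Error": "ValidationFailed", "Cause": "See $.error"},
    },
}

print(json.dumps(state_machine, indent=2))
```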
4. Idempotency and Exactly-Once Processing
The problem: S3 events can be delivered more than once. Step Functions retries can duplicate processing.
Idempotent design:
# Generate deterministic ID from input
idempotency_key = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()
# Check if already processed
try:
s3_client.head_object(Bucket='results', Key=f'processed/{idempotency_key}.json')
return {"status": "already_processed"}
except s3_client.exceptions.NoSuchKey:
# Not processed yet, continue
pass
5. IAM Execution Roles vs Resource-Based Policies
- Execution Role (what Lambda can do): s3:GetObject, s3:PutObject
- Resource-Based Policy (who can invoke Lambda): Allow S3 service to invoke
You need BOTH for S3 event triggers to work.
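A rough sketch of what "both sides" looks like with boto3; the function name, bucket, role name, and policy below are hypothetical, and in this project you would normally declare the same thing in Terraform:

```python
import json
import boto3

lambda_client = boto3.client("lambda")
iam = boto3.client("iam")

# 1) Resource-based policy: WHO may invoke the function (the S3 service, for one bucket)
lambda_client.add_permission(
    FunctionName="validate-data",
    StatementId="AllowS3Invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn="arn:aws:s3:::my-pipeline-bucket",
)

# 2) Execution role policy: WHAT the function may do once it runs
execution_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject"],
         "Resource": "arn:aws:s3:::my-pipeline-bucket/input/*"},
        {"Effect": "Allow", "Action": ["s3:PutObject"],
         "Resource": "arn:aws:s3:::my-pipeline-bucket/output/*"},
    ],
}
iam.put_role_policy(
    RoleName="validate-data-execution-role",
    PolicyName="pipeline-s3-access",
    PolicyDocument=json.dumps(execution_policy),
)
```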
Questions to Guide Your Design
- What triggers your first Lambda? S3 event notification? SNS? EventBridge?
- How do you pass data between Lambda functions? Via Step Functions state? S3? SQS?
- What happens when a Lambda times out mid-execution? Does Step Functions retry? From scratch or resume?
- How do you handle partial failures? If 3 out of 5 parallel tasks succeed, proceed or fail?
- What errors are transient vs permanent? ThrottlingException (retry) vs InvalidDataFormat (alert)?
- What if the input file is 5GB? Lambda has 10GB /tmp max, 15-minute timeout.
- How do you maintain data lineage? Which output came from which input?
- How do you debug a failed execution? Step Functions history? CloudWatch Logs?
Thinking Exercise
Draw your event flow diagram:
S3 Bucket (input/)
|
| S3 Event Notification
↓
Lambda: ValidateData
|
↓
Step Functions: DataProcessingStateMachine
|
├→ Choice: Valid data?
| ├─ NO → Lambda: SendAlert → END
| └─ YES ↓
|
├→ Lambda: TransformData
↓
├→ Parallel State:
| ├─ Lambda: CalculateStatistics
| └─ Lambda: GenerateReport
↓
├→ Lambda: WriteOutput
↓
└→ SNS: SendSuccessNotification
Now answer:
- Where does data live at each stage?
- What happens if execution is paused for 1 year? (max execution time)
- Maximum file size your pipeline can handle?
- Cost for processing 1000 files?
The Interview Questions They’ll Ask
Q: “Explain synchronous vs asynchronous Lambda invocations.”
Expected answer:
- Synchronous: Caller waits, gets return value (API Gateway, ALB)
- Asynchronous: Caller gets 202 Accepted, Lambda processes later (S3, SNS)
- Async has built-in retry (2x) and dead-letter queue support
- In your pipeline: S3 invokes ValidateData async, Step Functions invokes Lambdas synchronously (see the sketch below)
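The difference is just the InvocationType flag on the Invoke call. A minimal boto3 sketch (function name and payload are placeholders):

```python
import json
import boto3

lambda_client = boto3.client("lambda")
payload = json.dumps({"bucket": "my-pipeline-bucket", "key": "input/sales_data.csv"})

# Synchronous: caller blocks until the function returns, and gets the result back
sync = lambda_client.invoke(
    FunctionName="validate-data",
    InvocationType="RequestResponse",
    Payload=payload,
)
print(sync["StatusCode"], json.load(sync["Payload"]))  # 200 + the handler's return value

# Asynchronous: Lambda queues the event and replies immediately
async_resp = lambda_client.invoke(
    FunctionName="validate-data",
    InvocationType="Event",
    Payload=payload,
)
print(async_resp["StatusCode"])  # 202, no function result returned
```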
Q: “How do you handle Lambda timeout mid-processing?”
Expected answer:
- Chunked processing: Read in chunks, maintain cursor in DynamoDB
- Recursive invocation: Lambda invokes itself with updated state
- Step Functions Map state: Split into chunks, process in parallel
- Alternative: Use Fargate for long-running tasks (no 15min limit)
Q: “What are Lambda cold starts and how did they affect your pipeline?”
Expected answer:
- Cold start = INIT phase (100ms-3s) when scaling up or after idle
- Mitigation: Provisioned concurrency, smaller packages, init code outside handler
- In your pipeline: “Measured ~800ms cold starts, acceptable for batch processing”
Q: “Why use Step Functions instead of chaining Lambdas?”
Expected answer:
- Visual workflow, declarative error handling, state management
- Execution history shows inputs/outputs/durations
- Parallel, Map, Wait states built-in
Q: “How do you pass large datasets between Step Functions states?”
Expected answer:
- Step Functions has 256KB limit per state
- Store data in S3, pass S3 URI in state
- Use ResultPath to merge results without duplicating data
Q: “IAM permissions for S3 to trigger Lambda?”
Expected answer:
- Lambda execution role (what Lambda can do): s3:GetObject, logs:CreateLogGroup
- Lambda resource-based policy (who can invoke): Allow S3 service, condition on bucket ARN
- S3 bucket notification config pointing to Lambda ARN
Q: “Cost for running pipeline 1000 times/day?”
Expected answer breakdown:
- Lambda: Invocations ($0.20/1M) + Duration ($0.0000166667/GB-sec)
- Step Functions: State transitions ($25/1M)
- S3: Requests + Storage
- Show the calculation (a sketch follows below)
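A back-of-the-envelope sketch using the on-demand rates listed above; the per-execution assumptions (5 Lambda invocations, 512 MB, ~10 s of total compute, 7 state transitions) are made up for illustration only:

```python
# Assumptions (hypothetical workload): 1,000 executions/day for 30 days
executions = 1_000 * 30

lambda_invocations = executions * 5          # 5 Lambdas per pipeline run
gb_seconds = executions * (512 / 1024) * 10  # 512 MB for ~10 s of total compute per run
state_transitions = executions * 7           # 7 transitions per Standard execution

lambda_request_cost = lambda_invocations / 1_000_000 * 0.20
lambda_duration_cost = gb_seconds * 0.0000166667
step_functions_cost = state_transitions / 1_000_000 * 25.00

total = lambda_request_cost + lambda_duration_cost + step_functions_cost
print(f"Lambda requests:  ${lambda_request_cost:.2f}")
print(f"Lambda duration:  ${lambda_duration_cost:.2f}")
print(f"Step Functions:   ${step_functions_cost:.2f}")
print(f"Monthly total:    ${total:.2f}  (plus S3 requests/storage)")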
Hints in Layers
Hint 1.1: What triggers the pipeline? Use S3 Event Notifications with object created events. Configure filter by prefix (input/) and suffix (.csv).
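A minimal boto3 sketch of the notification filter Hint 1.1 describes; the bucket name and function ARN are placeholders, and the same configuration can be declared with `aws_s3_bucket_notification` in Terraform:

```python
import boto3

s3 = boto3.client("s3")

# Trigger the validation Lambda only for CSVs landing under input/
s3.put_bucket_notification_configuration(
    Bucket="my-pipeline-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:validate-data",
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {
                "Key": {
                    "FilterRules": [
                        {"Name": "prefix", "Value": "input/"},
                        {"Name": "suffix", "Value": ".csv"},
                    ]
                }
            },
        }]
    },
)
```

Writing results to a different prefix (e.g., output/) is what keeps the pipeline from re-triggering itself on its own output.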
Hint 1.2: How should Lambdas communicate? Use Step Functions to orchestrate. Pass small metadata in state, store large data in S3.
Hint 2.1: Lambda function structure
import boto3
from aws_lambda_powertools import Logger, Tracer
logger = Logger()
tracer = Tracer()
s3_client = boto3.client('s3') # Initialize outside handler
@logger.inject_lambda_context
@tracer.capture_lambda_handler
def handler(event, context):
    bucket = event['bucket']
    key = event['key']
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # Process...
    return {"valid": True, "rowCount": 1247}
Hint 3.1: S3 event not triggering Lambda? Check:
- CloudWatch Logs for Lambda
- S3 notification config:
aws s3api get-bucket-notification-configuration - Lambda resource-based policy:
aws lambda get-policy - Event filter matches your file
Hint 3.2: Lambda cold starts too slow?
- Check package size
- Lazy import heavy libraries
- Use Lambda Layers
- Consider Provisioned Concurrency
Books That Will Help
| Topic | Book | Specific Chapters | Why It Helps |
|---|---|---|---|
| Serverless Architecture | “AWS for Solutions Architects” by Saurabh Shrivastava | Ch. 8-9: Serverless, Containers vs Lambda | When to use Lambda vs Fargate, event-driven patterns |
| Distributed Systems | “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 11-12: Stream Processing | Event-driven architecture, idempotency, exactly-once semantics |
| Lambda Deep Dive | “AWS Lambda in Action” by Danilo Poccia | Ch. 2, 4, 8: First Lambda, Data Streams, Optimization | Cold start optimization, event source integration |
| Step Functions | AWS Step Functions Developer Guide | Error Handling, ASL Specification | State machine syntax, retry/catch patterns |
| IAM Security | “AWS Security” by Dylan Shields | Ch. 3, 5: IAM, Data Protection | Lambda execution roles, resource-based policies |
| Terraform | “Terraform: Up & Running” by Yevgeniy Brikman | Ch. 3, 7: State, Multiple Providers | Deploy Step Functions + Lambda as code |
| Observability | “Practical Monitoring” by Mike Julian | Ch. 4, 6: Applications, Alerting | Metrics and alerts for serverless |
| Cost Optimization | “AWS Cost Optimization” by Brandon Carroll | Ch. 5, 7: Lambda, Storage | Memory tuning, S3 lifecycle policies |
Reading strategy:
- Start with “AWS for Solutions Architects” Ch. 8 (serverless overview)
- Read “Designing Data-Intensive Applications” Ch. 11 (event-driven concepts)
- Dive into “AWS Lambda in Action” Ch. 4 (event sources) and Ch. 8 (optimization)
- Refer to Step Functions Developer Guide when writing state machine
- Use “Terraform: Up & Running” Ch. 3 as you build infrastructure
Common Pitfalls & Debugging
Problem 1: “Lambda timing out without entering handler or logging output”
- Why: Initialization (Init) timeouts occur before handler execution, often due to heavy dependencies loading, VPC cold starts, or DNS resolution issues
- Fix: Check CloudWatch Logs for INIT_REPORT entries, reduce package size by removing unused dependencies, use Lambda Layers for shared code
- Quick test: Run `aws logs tail /aws/lambda/YOUR_FUNCTION_NAME --follow` and invoke the function to see initialization logs
- Deep dive: If using VPC, ensure Lambda has ENI capacity and consider using VPC endpoints for AWS services to avoid internet routing
Problem 2: “S3 events not triggering Lambda function”
- Why: Missing Lambda resource-based policy, incorrect event filter configuration, or S3 notification not properly configured
- Fix: Verify the S3 notification configuration: `aws s3api get-bucket-notification-configuration --bucket YOUR_BUCKET`
- Quick test: Check that Lambda’s resource policy allows S3 invocation: `aws lambda get-policy --function-name YOUR_FUNCTION`
- Verification: Test the event filter by uploading a file matching your prefix/suffix pattern and checking CloudWatch Logs
Problem 3: “Step Functions execution fails with ‘States.TaskFailed’ or ‘States.Timeout’“
- Why: Lambda function errors not caught, incorrect error handling in state machine, or timeout settings too restrictive
- Fix: Add Catch blocks in state machine definition for error handling, increase TimeoutSeconds in task states, check Lambda logs for actual errors
- Quick test: View execution history in Step Functions console, examine each state’s input/output to identify where it failed
- Best practice: Use `$.Cause` and `$.Error` in Catch blocks to preserve error context for debugging
Problem 4: “Cold starts causing unacceptable latency (>3 seconds)”
- Why: Large deployment package (>50MB), VPC configuration overhead (adds 1-10 seconds), or heavy runtime initialization (.NET/Java have longer cold starts than Python/Node.js)
- Fix: Reduce package size (use Lambda Layers for dependencies), lazy-load heavy libraries only when needed, consider Provisioned Concurrency for critical paths ($0.000004 per GB-second)
- Quick test: Check Lambda Insights metrics in CloudWatch for init duration and billed duration breakdown
- Cost vs performance: Provisioned Concurrency keeps 1+ instances warm but costs ~$13/month per instance (1GB memory) - calculate if cold start impact justifies cost
Problem 5: “Step Functions execution fails with ‘Exceeded maximum allowed execution time’“
- Why: Step Functions has a maximum execution time of 1 year for Standard workflows, but individual states might exceed Lambda’s 15-minute timeout
- Fix: For long-running tasks, break them into smaller Lambda invocations or use Fargate tasks instead
- Alternative: Use Step Functions Express workflows for short-lived, high-volume executions (<5 minutes)
- Pattern: Implement a “chunking” pattern where Lambda processes batches and Step Functions Map state iterates
Problem 6: “Concurrent execution limit reached - throttling errors”
- Why: AWS accounts have default concurrent execution limit of 1,000 across all Lambda functions in a region; sudden traffic spike can exhaust this
- Fix: Request limit increase via AWS Support, or set reserved concurrency on critical functions to guarantee capacity
- Quick test: Check the CloudWatch metrics `ConcurrentExecutions` and `Throttles`
Problem 7: “Lambda consuming more memory than expected - high costs”
- Why: Memory setting too high for actual usage, memory leaks in function code, or inefficient data processing
- Fix: Use Lambda Power Tuning tool to find optimal memory/cost balance: https://github.com/alexcasalboni/aws-lambda-power-tuning
- Quick test: Check CloudWatch Logs for “Max Memory Used” vs “Memory Size” - if consistently using <50%, you’re over-provisioned
- Reality check: Lambda is billed by GB-second; 128MB function running 100ms costs less than 1GB function running 100ms (6.4x cost difference)
Problem 8: “Step Functions state machine JSON is too large (>256KB)”
- Why: Passing large data payloads directly in Step Functions state instead of using S3 references
- Fix: Store large data in S3/DynamoDB, pass only S3 URI or keys in state machine
- Pattern: Lambda 1 writes to S3 → passes `{"s3Key": "data/processed/file.json"}` → Lambda 2 reads from S3 (see the sketch below)
- Best practice: Step Functions state should contain metadata only, not actual data
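A minimal sketch of that handoff, assuming a hypothetical pipeline bucket: the first handler returns only the S3 key, and the next state's handler reads the object back.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-pipeline-bucket"  # hypothetical

def transform_handler(event, context):
    processed = {"rows": 1247, "source": event["key"]}
    out_key = f"data/processed/{event['key'].rsplit('/', 1)[-1]}.json"
    s3.put_object(Bucket=BUCKET, Key=out_key, Body=json.dumps(processed).encode())
    # Keep the Step Functions state tiny: return a pointer, not the payload
    return {"s3Key": out_key}

def report_handler(event, context):
    # The previous state's output arrives as this state's input
    obj = s3.get_object(Bucket=BUCKET, Key=event["s3Key"])
    data = json.loads(obj["Body"].read())
    return {"rowCount": data["rows"]}
```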
Debugging Tools & Techniques:
- CloudWatch Logs Insights: Query Lambda logs across multiple invocations to find patterns:
  fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20
- AWS X-Ray: Enable active tracing on Lambda to see the full request path through services (Lambda → S3 → DynamoDB)
- Lambda Insights: Enhanced monitoring showing memory, CPU, network stats (requires CloudWatch agent layer)
- Step Functions Graph View: Visual execution flow showing which states succeeded/failed with exact error messages
- Remote Debugging (2025): Use AWS Toolkit for VS Code to set breakpoints and debug Lambda functions executing in the cloud
Sources:
- [Issues to Avoid When Implementing Serverless Architecture with AWS Lambda (AWS Architecture Blog)](https://aws.amazon.com/blogs/architecture/mistakes-to-avoid-when-implementing-serverless-architecture-with-lambda/)
- AVOID these AWS Lambda MISTAKES (checklist: symptoms and solutions)
- Accelerating local serverless development with console to IDE and remote debugging for AWS Lambda
- 4 AWS Serverless Security Traps & How to Fix Them
Project 3: Auto-Scaling Web Application (EC2 + ALB + RDS + S3)
| Attribute | Value |
|---|---|
| Language | HCL (Terraform) |
| Difficulty | Intermediate |
| Time | 2–3 weeks |
| Knowledge Area | Cloud Infrastructure / Scalability |
| Coolness | ★☆☆☆☆ Pure Corporate Snoozefest |
| Portfolio Value | Resume Gold |
What you’ll build: A traditional multi-tier web application with load-balanced EC2 instances that scale based on CPU/request metrics, backed by RDS (Aurora or PostgreSQL), with static assets served from S3/CloudFront.
Why it matters: This is the “classic” AWS architecture. Understanding auto-scaling groups, launch templates, health checks, and how ALB distributes traffic is foundational. You’ll also learn why RDS simplifies database ops and how S3+CloudFront offloads static content.
Core challenges:
- Launch templates: defining AMI, instance type, user data scripts, IAM instance profiles
- Auto scaling policies: configuring scaling based on CloudWatch metrics (CPU, request count, custom)
- Health checks: understanding EC2 vs ELB health checks and how they affect scaling
- Database connectivity: RDS in private subnet, security group rules from app tier only
- Static asset optimization: S3 origin with CloudFront distribution, cache behaviors
Key concepts to master:
- Auto Scaling Groups and launch templates
- Application Load Balancer and target groups
- RDS Multi-AZ deployments
- CloudFront CDN and cache behaviors
- Health checks and instance lifecycle
Prerequisites: Project 1 (VPC), basic web development, familiarity with databases.
Deliverable: A production-ready multi-tier web application with auto-scaling EC2 instances behind an ALB, RDS database in private subnets, and S3/CloudFront for static assets—all managed with Terraform and observable through CloudWatch metrics and dashboards.
Implementation hints:
- Start with a single EC2 instance before adding auto-scaling
- Use user data scripts to bootstrap instances with your application
- Test health checks by manually stopping your web server
- Use CloudWatch alarms to trigger scaling events
Milestones:
- Single EC2 with user data script serving a web app → you understand instance bootstrapping
- Add ALB + 2 instances in target group → you understand load balancing
- Add Auto Scaling with CPU-based policy → you understand elasticity
- Add RDS in private subnet → you understand data tier security
- Add S3 + CloudFront for static assets → you understand CDN patterns
Real World Outcome
Here’s what success looks like when you complete this project:
# 1. Deploy the infrastructure
$ terraform apply
...
Apply complete! Resources: 23 added, 0 changed, 0 destroyed.
Outputs:
alb_dns_name = "my-app-alb-123456789.us-east-1.elb.amazonaws.com"
rds_endpoint = "myapp-db.c9akl1.us-east-1.rds.amazonaws.com:5432"
cloudfront_domain = "d111111abcdef8.cloudfront.net"
# 2. Verify the application is running
$ curl http://my-app-alb-123456789.us-east-1.elb.amazonaws.com
<html>
<head><title>My Scalable App</title></head>
<body>
<h1>Hello from instance i-0abc123def456!</h1>
<p>Current time: 2025-12-22 14:32:15</p>
<p>Database connection: OK</p>
</body>
</html>
# 3. Load test to trigger auto-scaling (using 'hey' tool)
$ hey -n 10000 -c 100 http://my-app-alb-123456789.us-east-1.elb.amazonaws.com
Summary:
Total: 45.3216 secs
Slowest: 2.3451 secs
Fastest: 0.0234 secs
Average: 0.4532 secs
Requests/sec: 220.75
Status code distribution:
[200] 10000 responses
# 4. Watch instances scale up in real-time (in another terminal)
$ watch "aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names my-app-asg \
--query 'AutoScalingGroups[0].[DesiredCapacity,MinSize,MaxSize,Instances[*].[InstanceId,LifecycleState]]' \
--output table --no-cli-pager"
Every 2.0s: aws autoscaling...
---------------------------------------------
| DescribeAutoScalingGroups |
+-------------------------------------------+
|| 2 | 1 | 5 || # Started with 2 instances
+-------------------------------------------+
||| i-0abc123def456 | InService |||
||| i-0def789ghi012 | InService |||
+-------------------------------------------+
# After load increases, watch it scale to 4 instances:
+-------------------------------------------+
|| 4 | 1 | 5 || # Scaled up!
+-------------------------------------------+
||| i-0abc123def456 | InService |||
||| i-0def789ghi012 | InService |||
||| i-0jkl345mno678 | InService |||
||| i-0pqr901stu234 | InService |||
+-------------------------------------------+
# 5. Check CloudWatch metrics showing scaling activity
$ aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB \
--metric-name TargetResponseTime \
--dimensions Name=LoadBalancer,Value=app/my-app-alb/50dc6c495c0c9188 \
--start-time 2025-12-22T14:00:00Z \
--end-time 2025-12-22T15:00:00Z \
--period 300 \
--statistics Average \
--no-cli-pager
Datapoints:
- Timestamp: 2025-12-22T14:00:00Z, Average: 0.032
- Timestamp: 2025-12-22T14:05:00Z, Average: 0.125 # Load increasing
- Timestamp: 2025-12-22T14:10:00Z, Average: 0.456 # Scaling triggered
- Timestamp: 2025-12-22T14:15:00Z, Average: 0.089 # Back to normal after scale-up
# 6. Check ALB target health
$ aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-app-tg/50dc6c495c0c9188 \
--no-cli-pager
TargetHealthDescriptions:
- Target:
Id: i-0abc123def456
Port: 80
HealthCheckPort: '80'
TargetHealth:
State: healthy
Reason: Target passed health checks
- Target:
Id: i-0def789ghi012
Port: 80
HealthCheckPort: '80'
TargetHealth:
State: healthy
Reason: Target passed health checks
# 7. View Auto Scaling activity history
$ aws autoscaling describe-scaling-activities \
--auto-scaling-group-name my-app-asg \
--max-records 5 \
--no-cli-pager
Activities:
- ActivityId: 1234abcd-5678-90ef-gh12-ijklmnopqrst
Description: "Launching a new EC2 instance: i-0jkl345mno678"
Cause: "At 2025-12-22T14:12:00Z a monitor alarm TargetTracking-my-app-asg-AlarmHigh-... in state ALARM triggered policy my-app-scaling-policy causing a change to the desired capacity from 2 to 4."
StartTime: 2025-12-22T14:12:15Z
EndTime: 2025-12-22T14:13:42Z
StatusCode: Successful
# 8. Check RDS database connection
$ psql -h myapp-db.c9akl1.us-east-1.rds.amazonaws.com -U dbadmin -d myappdb
Password:
myappdb=> SELECT current_database(), current_user, inet_server_addr();
current_database | current_user | inet_server_addr
------------------+--------------+------------------
myappdb | dbadmin | 10.0.3.45
(1 row)
myappdb=> SELECT COUNT(*) FROM app_requests;
count
-------
10247 # Showing all the requests that were logged
(1 row)
# 9. View CloudFront cache statistics
$ aws cloudfront get-distribution-config \
--id E1234ABCDEFGH \
--query 'DistributionConfig.Origins[0].DomainName' \
--output text --no-cli-pager
my-app-static-assets.s3.us-east-1.amazonaws.com
# Test CloudFront delivery
$ curl -I https://d111111abcdef8.cloudfront.net/images/logo.png
HTTP/2 200
content-type: image/png
content-length: 45678
x-cache: Hit from cloudfront # Cache HIT - fast delivery!
x-amz-cf-pop: SFO5-C1
x-amz-cf-id: abc123...
# 10. After load subsides, watch instances scale down
$ watch "aws autoscaling describe-auto-scaling-groups ..."
+-------------------------------------------+
|| 1 | 1 | 5 || # Scaled back down to minimum
+-------------------------------------------+
||| i-0abc123def456 | InService |||
+-------------------------------------------+
CloudWatch Dashboard View:
- CPU Utilization graph showing spike from 20% → 75% → back to 25%
- Request Count showing 50 req/sec → 800 req/sec → 60 req/sec
- Target Response Time showing latency spike then recovery
- Healthy Host Count showing 2 → 4 → 1 instances over time
- RDS connections showing stable connection pool usage
- ALB HTTP 200 responses at 99.8% (a few timeouts during initial spike)
What You’ll See in AWS Console:
- EC2 Auto Scaling Groups showing scaling activities
- CloudWatch alarms transitioning: OK → ALARM → OK
- ALB target groups with health check status
- RDS Performance Insights showing query patterns
- S3 bucket with static assets
- CloudFront distribution with cache statistics
The Core Question You’re Answering
“How do I build applications that automatically handle 10x traffic spikes without falling over, and intelligently shrink when traffic subsides to save costs?”
This is the fundamental problem that auto-scaling solves. Traditional infrastructure requires you to provision for peak load—meaning you’re paying for idle capacity 90% of the time. Auto-scaling lets you provision for average load and dynamically expand when needed.
Why this matters in the real world:
- E-commerce sites during Black Friday sales
- News sites during breaking news events
- SaaS platforms during business hours (scale down at night)
- API backends handling unpredictable mobile app traffic
- Gaming servers during new release launches
Without auto-scaling, you have two bad options:
- Over-provision: Run 10 servers 24/7 even though you only need them for 2 hours a day → wasted money
- Under-provision: Run 2 servers and accept that your site crashes during traffic spikes → lost revenue
Auto-scaling gives you the best of both worlds: cost efficiency during normal periods, reliability during peaks.
Concepts You Must Understand First
Before diving into implementation, you need to internalize these foundational concepts:
1. Horizontal vs Vertical Scaling
Vertical Scaling (Scale Up): Make your server bigger
- EC2 instance: t3.medium → t3.large → t3.xlarge
- More CPU, more RAM, same single instance
- Limitation: Hard ceiling (largest EC2 instance), requires downtime, single point of failure
Horizontal Scaling (Scale Out): Add more servers
- 1 instance → 3 instances → 10 instances
- Distribute load across multiple machines
- Advantage: Theoretically unlimited, no downtime, fault-tolerant
This project teaches horizontal scaling, which is how modern cloud applications achieve massive scale.
2. Stateless Application Design
Stateless: Each request is independent, no memory of previous requests
# Stateless - GOOD for auto-scaling
@app.route('/api/user/<user_id>')
def get_user(user_id):
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    return jsonify(user)
# Any instance can handle this request
Stateful: Application remembers information between requests
# Stateful - BAD for auto-scaling
user_sessions = {} # In-memory storage
@app.route('/api/cart/add')
def add_to_cart(item_id):
    session_id = request.cookies.get('session')
    user_sessions[session_id].append(item_id) # Only works if same instance
    return "Added"
# BREAKS when the load balancer sends the next request to a different instance
Why stateless matters for auto-scaling:
- Load balancer can send requests to any instance
- Instances can be terminated without data loss
- New instances can immediately serve traffic
How to handle state:
- Store sessions in Redis/ElastiCache (shared across instances)
- Store sessions in DynamoDB
- Use JWT tokens (state stored client-side)
- Use sticky sessions (not recommended, defeats auto-scaling benefits)
3. Load Balancer Algorithms
Round Robin (default for ALB):
Request 1 → Instance A
Request 2 → Instance B
Request 3 → Instance C
Request 4 → Instance A (back to start)
Pro: Simple, fair distribution
Con: Doesn’t account for instance load
Least Outstanding Requests (ALB option):
Instance A: 5 active requests
Instance B: 2 active requests ← Send here
Instance C: 8 active requests
Next request goes to Instance B (least busy)
Pro: Better for variable request durations
Con: Requires tracking active connections
Weighted Target Groups:
Instance A: weight 100 (gets 50% of traffic)
Instance B: weight 50 (gets 25% of traffic)
Instance C: weight 50 (gets 25% of traffic)
Use case: Blue/green deployments, A/B testing
4. Health Checks: EC2 vs ELB Health Checks
EC2 Health Check (Auto Scaling Group):
- Checks: Is the instance running? Is the status “OK”?
- Fails if: Instance stopped, hardware failure, status check failures
- Does NOT check: Is the application responding?
ELB Health Check (Application Load Balancer):
- Checks: HTTP GET to the /health endpoint, expects a 200 response
- Fails if: Application crashed, database unreachable, timeout (default 5 sec)
- More comprehensive than EC2 check
Critical difference:
Scenario: EC2 instance is running, but your web app crashed
EC2 Health Check: PASS (instance is running)
ELB Health Check: FAIL (app not responding)
Result: Auto Scaling thinks instance is healthy, keeps it running
Load balancer marks it unhealthy, stops sending traffic
Solution: Configure ASG to use ELB health checks!
Best practice configuration:
resource "aws_autoscaling_group" "app" {
  health_check_type         = "ELB" # Use ELB checks, not EC2
  health_check_grace_period = 300   # Wait 5 min for app to start
  # ... other config
}

resource "aws_lb_target_group" "app" {
  health_check {
    path                = "/health"
    interval            = 30 # Check every 30 seconds
    timeout             = 5  # Fail if no response in 5 sec
    healthy_threshold   = 2  # 2 consecutive passes = healthy
    unhealthy_threshold = 3  # 3 consecutive fails = unhealthy
  }
}
5. Launch Templates vs Launch Configurations
Launch Configuration (DEPRECATED, but you’ll see it in old code):
- Cannot be modified (must create new version)
- Limited instance type options
- No instance metadata service v2 support
Launch Template (USE THIS):
- Versioned (can modify and rollback)
- Supports mixed instance types
- Supports Spot instances
- Supports newer EC2 features
resource "aws_launch_template" "app" {
name_prefix = "my-app-"
image_id = "ami-0c55b159cbfafe1f0" # Your AMI
instance_type = "t3.medium"
# User data script to configure instance on boot
user_data = base64encode(<<-EOF
#!/bin/bash
cd /opt/myapp
export DB_HOST=${aws_db_instance.main.endpoint}
./start-app.sh
EOF
)
# IAM role for instance
iam_instance_profile {
arn = aws_iam_instance_profile.app.arn
}
# Security group
vpc_security_group_ids = [aws_security_group.app.id]
# Enable detailed monitoring
monitoring {
enabled = true
}
}
Key understanding: Launch template is the blueprint for every instance that auto-scaling creates.
Questions to Guide Your Design
Ask yourself these questions as you build. If you can’t answer them, you don’t understand the architecture yet.
1. How does the ALB know which instances are healthy?
Answer: The ALB continuously sends HTTP requests to each instance’s health check endpoint (e.g., GET /health). If it receives a 200 OK response within the timeout period (default 5 seconds), the instance is marked healthy. If it fails the unhealthy threshold number of times (default 3 consecutive failures), it’s marked unhealthy and removed from rotation.
Follow-up: What should your /health endpoint check?
- Database connectivity?
- Disk space?
- Memory available?
- Dependency services reachable?
Best practice: Start simple (just return 200), then add checks for critical dependencies.
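A minimal Flask sketch of that progression; the `db` object is an assumption standing in for your real connection pool, and the endpoint path must match the target group's health check path:

```python
from flask import Flask, jsonify

app = Flask(__name__)
db = None  # assumed: swap in your real connection/pool when you add the DB check

@app.route("/health")
def health():
    checks = {"app": "ok"}  # Step 1: the process is up and serving HTTP

    # Step 2 (optional): verify critical dependencies, but keep the check cheap and fast
    if db is not None:
        try:
            db.execute("SELECT 1")
            checks["database"] = "ok"
        except Exception:
            checks["database"] = "unreachable"
            return jsonify(checks), 503  # non-200 -> ALB marks the target unhealthy

    return jsonify(checks), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)  # bind to 0.0.0.0 so the ALB can reach the check
```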
2. What metric should trigger scaling?
Common options:
CPU Utilization (most common starting point):
resource "aws_autoscaling_policy" "scale_up" {
name = "cpu-scale-up"
scaling_adjustment = 1
adjustment_type = "ChangeInCapacity"
cooldown = 300
autoscaling_group_name = aws_autoscaling_group.app.name
}
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
alarm_name = "cpu-utilization-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
period = 120
statistic = "Average"
threshold = 70 # Scale up when CPU > 70%
alarm_actions = [aws_autoscaling_policy.scale_up.arn]
}
Problems with CPU-based scaling:
- I/O-bound applications may never hit CPU threshold
- Doesn’t capture actual user experience (latency)
Request Count Per Target (better for web apps):
resource "aws_cloudwatch_metric_alarm" "requests_high" {
metric_name = "RequestCountPerTarget"
namespace = "AWS/ApplicationELB"
threshold = 1000 # 1000 requests/target in 5 min
# ... more config
}
Target Tracking Scaling (recommended - AWS manages it):
resource "aws_autoscaling_policy" "target_tracking" {
name = "target-tracking-policy"
policy_type = "TargetTrackingScaling"
autoscaling_group_name = aws_autoscaling_group.app.name
target_tracking_configuration {
predefined_metric_specification {
predefined_metric_type = "ASGAverageCPUUtilization"
}
target_value = 50.0 # Keep average CPU at 50%
}
}
Decision criteria:
- Web apps with predictable load per request: Request count
- CPU-intensive tasks: CPU utilization
- Most cases: Target tracking with CPU, let AWS handle it
- Advanced: Custom metrics (latency from your app logs)
3. How do you handle session state with multiple instances?
Option 1: Don’t use sessions (best for APIs)
- Use JWT tokens
- All state in the token payload
- Stateless, works with any instance
Option 2: Shared session store (for traditional web apps)
from flask import Flask, session
from flask_session import Session
import redis
app = Flask(__name__)
app.config['SESSION_TYPE'] = 'redis'
app.config['SESSION_REDIS'] = redis.from_url('redis://elasticache-endpoint:6379')
Session(app)
# Now sessions are stored in Redis, not in-memory
# Any instance can read/write session data
Option 3: Sticky sessions (not recommended)
- ALB always routes user to same instance
- Defeats purpose of auto-scaling
- Lose sessions when instance terminates
Real-world recommendation: Use ElastiCache (Redis) for sessions if you need them.
4. What’s the difference between desired, minimum, and maximum capacity?
resource "aws_autoscaling_group" "app" {
min_size = 2 # Never go below 2 (for high availability)
max_size = 10 # Never exceed 10 (cost protection)
desired_capacity = 4 # Start with 4, let scaling adjust
# ... more config
}
How they work together:
Scenario 1: Normal operation
Current: 4 instances (= desired)
Traffic increases, CPU hits 80%
Scaling policy triggers: desired_capacity = 4 + 2 = 6
Result: Launch 2 new instances
Scenario 2: Traffic spike
Current: 8 instances
Traffic spike continues, CPU still high
Scaling policy wants: desired_capacity = 8 + 2 = 10
Result: Launch 2 more instances (at max_size, stops)
Scenario 3: Traffic drops
Current: 6 instances
CPU drops to 20%, scale-down alarm triggers
Scaling policy: desired_capacity = 6 - 2 = 4
Result: Terminate 2 instances
Scenario 4: Major incident, all instances failing
Current: 0 healthy instances
Auto Scaling: "I need to meet min_size!"
Result: Launches 2 new instances (even if health checks failing)
Best practices:
- min_size: Enough for high availability (at least 2 in different AZs)
- max_size: Based on budget and realistic peak load
- desired_capacity: Let auto-scaling manage it, or set an initial value
5. What happens during a deployment to a running auto-scaled application?
Bad approach (causes downtime):
# Update launch template
# Terminate all instances
# New instances launch with new code
# → Downtime while instances come up
Good approach (rolling update):
resource "aws_autoscaling_group" "app" {
# ...
instance_refresh {
strategy = "Rolling"
preferences {
min_healthy_percentage = 50 # Keep at least 50% healthy during update
}
}
}
Process:
- Update launch template with new AMI/user data
- Trigger instance refresh
- Auto Scaling terminates one instance
- New instance launches with new config
- Wait for health checks to pass
- Repeat until all instances replaced
Blue/Green deployment (zero downtime):
- Create new ASG with new version
- Attach to same target group
- Gradually shift traffic (weighted target groups)
- Monitor error rates
- Fully switch or rollback
Thinking Exercise
Problem: You’re building a web application where each EC2 instance can reliably handle 100 requests per second. You expect peak traffic of 500 requests per second during business hours (9 AM - 5 PM), and only 50 requests per second during off-hours.
Question 1: How many instances do you need at minimum for peak load?
Answer: 500 req/sec ÷ 100 req/sec per instance = 5 instances minimum at peak
But you should add buffer for:
- Instance failures
- Traffic spikes beyond expected peak
- Performance degradation under load
Recommended:
- max_size = 8 (160% of the calculated minimum, allows for a 60% spike)
- min_size = 2 (high availability during off-hours)
- desired_capacity = 6 (20% buffer above the minimum requirement)
Question 2: If each instance costs $0.05/hour, how much do you save per month with auto-scaling vs. running peak capacity 24/7?
Without auto-scaling (running 5 instances 24/7):
- Cost = 5 instances × $0.05/hour × 24 hours × 30 days = $180/month
With auto-scaling:
- Peak hours (9 AM - 5 PM = 8 hours): 5 instances
- Off-hours (16 hours): 2 instances
Daily cost:
- Peak: 5 instances × $0.05 × 8 hours = $2.00
- Off-hours: 2 instances × $0.05 × 16 hours = $1.60
- Total per day = $3.60
Monthly cost:
- $3.60 × 30 days = $108/month
Savings: $180 - $108 = $72/month (40% reduction)
With realistic scaling (more granular adjustments), savings are often 50-70%.
Question 3: Your health check interval is 30 seconds, unhealthy threshold is 3. An instance’s application crashes. How long until the ALB stops sending traffic to it?
Answer:
- Health check every 30 seconds
- Need 3 consecutive failures
- Time = 30 seconds × 3 = 90 seconds minimum
During this time, users may experience:
- Timeout errors (if health check timeout is 5 sec, user requests timeout too)
- Failed requests sent to unhealthy instance
Optimization: Reduce interval to 10 seconds, unhealthy threshold to 2:
- Time to detect = 10 × 2 = 20 seconds
- Trade-off: More frequent health checks = slightly more traffic
Question 4: You set scale-up to trigger at 70% CPU, scale-down at 30% CPU. What problem might you encounter?
Problem: Flapping (constant scale-up/scale-down oscillation)
Scenario:
- 3 instances at 75% CPU → scale up to 4 instances
- Load distributed across 4 → CPU drops to 56% (still above 30%)
- Stays stable… then one instance terminates unexpectedly
- Load redistributes to 3 instances → CPU back to 75% → scale up again
- Repeat forever
Or worse:
- 3 instances at 75% CPU → scale up to 4
- CPU drops to 28% → scale down to 3
- CPU jumps to 75% → scale up to 4
- Infinite loop, costs money, destabilizes app
Solution: Add cooldown periods and wider gap:
resource "aws_autoscaling_policy" "scale_up" {
cooldown = 300 # Wait 5 minutes before scaling again
# ...
}
# Use thresholds with buffer:
# Scale up at 70% CPU
# Scale down at 20% CPU (50 point gap prevents flapping)
Better solution: Use target tracking scaling:
target_tracking_configuration {
target_value = 50.0 # AWS automatically manages scale-up/down to maintain this
}
AWS’s algorithm is smarter, prevents flapping, considers cooldowns automatically.
The Interview Questions They’ll Ask
When you claim AWS auto-scaling experience on your resume, expect these questions:
1. Basic Understanding
Q: “Explain the difference between vertical and horizontal scaling. When would you use each?”
Expected answer:
- Vertical = bigger instance (limited by hardware, downtime required)
- Horizontal = more instances (unlimited, no downtime, requires stateless design)
- Use vertical for: Legacy apps that can’t distribute, databases (until you move to RDS)
- Use horizontal for: Web apps, APIs, stateless services
Q: “What’s the difference between an Application Load Balancer and a Network Load Balancer?”
Expected answer:
- ALB: Layer 7 (HTTP/HTTPS), content-based routing, WebSockets, host/path routing
- NLB: Layer 4 (TCP/UDP), ultra-low latency, static IP, millions of req/sec
- Use ALB for: Web applications, microservices with path routing
- Use NLB for: Non-HTTP protocols, extreme performance requirements, static IP needed
2. Design Scenarios
Q: “You’re seeing 5xx errors from your ALB. How do you troubleshoot?”
Expected approach:
- Check target health in target group (unhealthy instances?)
- Check ALB access logs (which endpoints returning 5xx?)
- Check application logs on instances (app crashes? database timeouts?)
- Check security group rules (instances can reach database?)
- Check CloudWatch metrics (CPU/memory maxed out?)
Q: “Your application needs to maintain user sessions. How do you architect this with auto-scaling?”
Expected answer:
- Option 1: ElastiCache (Redis/Memcached) as shared session store
- Option 2: DynamoDB for session storage
- Option 3: JWT tokens (no server-side sessions)
- NOT sticky sessions (defeats auto-scaling benefits, data loss on instance termination)
3. Scaling Logic
Q: “You set your ASG to min=2, max=10, desired=5. You manually terminate an instance. What happens?”
Expected answer:
- Current instances: 4 (after termination)
- Desired capacity: still 5
- Auto Scaling detects current < desired
- Launches 1 new instance to reach desired=5
Q: “What’s the difference between target tracking scaling and step scaling?”
Expected answer:
- Target tracking: Set a target (e.g., “maintain 50% CPU”), AWS automatically scales up/down to maintain it. Simpler, recommended for most use cases.
- Step scaling: Define explicit rules (e.g., “if CPU > 70%, add 2 instances; if CPU > 85%, add 4 instances”). More control, more complex, use for non-linear scaling needs.
4. Real-World Problem Solving
Q: “Your auto-scaling isn’t triggering when you expect. How do you debug?”
Expected approach:
- Check CloudWatch alarms (are they in ALARM state?)
- Check alarm history (has threshold actually been crossed?)
- Check alarm configuration (right metric? right threshold? evaluation periods?)
- Check ASG configuration (is policy attached? cooldown preventing scale?)
- Check instance metrics (is data actually being reported?)
Q: “You deployed a new version and now instances are failing health checks. What do you check?”
Expected approach:
- SSH to instance, check application logs
- Test the health check endpoint manually: `curl localhost:80/health`
- Check if the app started correctly (check user data script logs)
- Check security group (does instance allow traffic on health check port?)
- Check health check configuration (path correct? timeout too short?)
- Check grace period (is app given enough time to start before checks?)
5. Cost Optimization
Q: “How would you reduce costs for an auto-scaled application?”
Expected strategies:
- Right-size instances (use smaller instance types if CPU consistently low)
- Use Spot instances for fault-tolerant workloads (70-90% cheaper)
- Implement aggressive scale-down (reduce min_size during known low-traffic periods)
- Use scheduled scaling (scale down automatically at night/weekends)
- Reserved Instances or Savings Plans for baseline capacity
- Monitor and optimize unhealthy instance replacement (failing fast vs. retrying)
Q: “Explain Spot instances in the context of auto-scaling. What are the risks?”
Expected answer:
- Spot = unused EC2 capacity at up to 90% discount
- Risk: AWS can terminate with 2-minute notice if capacity needed
- Use in ASG with mixed instance types (Spot + On-Demand)
- Configure Spot allocation strategy (price-capacity-optimized)
- Not suitable for: Stateful apps, databases, single-instance workloads
- Perfect for: Batch processing, web front-ends (with On-Demand baseline)
Hints in Layers
When you get stuck, reveal hints progressively instead of jumping to the solution:
Problem: Instances launching but failing health checks immediately
Hint 1 (First check)
Check the health check grace period. Your application might need time to start up.
resource "aws_autoscaling_group" "app" {
health_check_grace_period = 300 # Seconds to wait before health checks
}
If your app takes 3 minutes to initialize but grace period is 30 seconds, instances will be terminated before they’re ready.
Hint 2 (If still failing)
SSH into a failing instance and test the health check endpoint manually:
# From within the instance
curl -v http://localhost:80/health
# Check if the application is actually running
ps aux | grep your-app-name
# Check application logs
tail -f /var/log/your-app/app.log
Is the application even starting? Is it listening on the correct port?
Hint 3 (Security check)
Verify security group rules allow health checks:
aws ec2 describe-security-groups --group-ids sg-xxxxx --no-cli-pager
# Look for:
# - Inbound rule allowing ALB security group on port 80
# - Or inbound rule allowing the VPC CIDR on port 80
The ALB needs network access to perform health checks.
Hint 4 (Application check)
Check your user data script logs:
# On Amazon Linux/Ubuntu
cat /var/log/cloud-init-output.log
# Look for errors in your bootstrap script
# Did database connection fail?
# Did dependencies install correctly?
A failing user data script means your app never starts.
Solution (Last resort)
Common causes and fixes:
- Application takes too long to start:
  health_check_grace_period = 600 # Increase to 10 minutes
- Wrong health check path:
  resource "aws_lb_target_group" "app" {
    health_check {
      path = "/health" # Make sure this endpoint exists!
    }
  }
- Health check endpoint requires the database, and the database is unreachable:
  - Fix security group rules to allow the app tier → database tier
  - Or simplify the health check so it doesn’t require the database
- Application listening on the wrong port:
  # Your app
  app.run(host='0.0.0.0', port=80) # Must match the target group port
- User data script has errors, so the app never starts:
  - Test the user data script locally first
  - Add error handling: `set -e` to fail fast
  - Check logs: /var/log/cloud-init-output.log
Problem: Auto-scaling not triggering when CPU is high
Hint 1
Check if your CloudWatch alarm is actually in ALARM state:
aws cloudwatch describe-alarms --alarm-names "cpu-high-alarm" --no-cli-pager
Look at StateValue. If it’s OK, the threshold hasn’t been crossed.
Hint 2
Check your alarm configuration:
aws cloudwatch describe-alarms --alarm-names "cpu-high-alarm" --no-cli-pager
# Verify:
# - Threshold: Is it too high? (e.g., 99% vs 70%)
# - EvaluationPeriods: Does CPU need to be high for multiple periods?
# - Period: Is it too long? (e.g., 5 minutes vs 1 minute)
# - Statistic: Average vs Maximum vs Minimum
Example: If EvaluationPeriods=3 and Period=300, CPU must be high for 15 minutes before alarm triggers.
Hint 3
Check if scaling is in cooldown:
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names my-asg --no-cli-pager
# Look for recent scaling activities
aws autoscaling describe-scaling-activities --auto-scaling-group-name my-asg --max-records 5 --no-cli-pager
If a scaling action just happened, cooldown period prevents another one (default 300 seconds).
Solution
Common fixes:
- Alarm threshold too high:
  threshold = 70 # Not 90
- Evaluation period too long:
  evaluation_periods = 2  # Not 5
  period             = 60 # 1 minute, not 5
- Cooldown preventing scaling:
  resource "aws_autoscaling_policy" "scale_up" {
    cooldown = 60 # Reduce from 300
  }
- Alarm not attached to the scaling policy:
  resource "aws_cloudwatch_metric_alarm" "cpu_high" {
    # ...
    alarm_actions = [aws_autoscaling_policy.scale_up.arn] # Must be set!
  }
- Use target tracking instead:
  resource "aws_autoscaling_policy" "target_tracking" {
    policy_type = "TargetTrackingScaling"
    target_tracking_configuration {
      predefined_metric_specification {
        predefined_metric_type = "ASGAverageCPUUtilization"
      }
      target_value = 50.0
    }
  }
  # AWS handles everything automatically
Books That Will Help
| Book | Author | What It Teaches | Best Sections for This Project |
|---|---|---|---|
| AWS for Solutions Architects | Saurabh Shrivastava et al. | Comprehensive AWS architecture patterns | Ch. 6 (Auto Scaling Groups), Ch. 7 (RDS), Ch. 10 (High Availability) |
| AWS Certified Solutions Architect Study Guide | Ben Piper, David Clinton | Exam-focused AWS fundamentals | Ch. 4 (EC2), Ch. 5 (ELB), Ch. 7 (CloudWatch) |
| Designing Data-Intensive Applications | Martin Kleppmann | Distributed systems theory (applies to auto-scaling) | Ch. 1 (Scalability), Ch. 8 (Distributed Systems) |
| Amazon Web Services in Action | Michael Wittig, Andreas Wittig | Hands-on AWS with practical examples | Ch. 3 (Infrastructure Automation), Ch. 6 (Scaling Up and Down) |
| The Phoenix Project | Gene Kim et al. | DevOps principles (why auto-scaling matters) | Part 2 (First Way - Flow), Part 3 (Second Way - Feedback) |
| Site Reliability Engineering | Google SRE Team | Operational practices for scaled systems | Ch. 6 (Monitoring), Ch. 22 (Cascading Failures - why you need auto-scaling) |
| Terraform: Up & Running | Yevgeniy Brikman | Infrastructure as Code for AWS | Ch. 2 (Terraform Syntax), Ch. 5 (State Management), Ch. 7 (Modules) |
Reading strategy:
- Start with “AWS for Solutions Architects” Ch. 6-7 for AWS-specific patterns
- Read “Designing Data-Intensive Applications” Ch. 1 to understand why systems need to scale
- Use “Terraform: Up & Running” Ch. 2 as a reference while coding
- Read “SRE” Ch. 22 after completing the project to understand failure modes you just protected against
Common Pitfalls & Debugging
Problem 1: “Auto Scaling not triggering when CPU is high”
- Why: CloudWatch alarm incorrectly configured, insufficient evaluation periods, or metric not being published
- Fix: Verify alarm state: `aws cloudwatch describe-alarms --alarm-names YOUR_ALARM_NAME`
- Quick test: Check alarm history: `aws cloudwatch describe-alarm-history --alarm-name YOUR_ALARM_NAME --max-records 5`
- Common cause: Using average CPU across all instances; if one instance is at 100% but the others are idle, the average may never cross the threshold
- Best practice: Use target tracking policies instead of step scaling for more responsive behavior
Problem 2: “Instances launching but failing health checks immediately”
- Why: Health check grace period too short, application taking longer to start than expected, or wrong health check endpoint configured
- Fix: Increase `health_check_grace_period` in the Auto Scaling Group (the default 300 seconds is often too short for complex apps)
- Quick test: SSH into a newly launched instance and manually curl the health check endpoint to see the actual response
- Debugging: Check ALB target group health checks vs ASG health checks - they can conflict
- Common issue: Application binds to `localhost` instead of `0.0.0.0`, so ALB health checks from outside the instance fail
Problem 3: “RDS connection pool exhausted - ‘too many connections’ error”
- Why: Each EC2 instance creates its own connection pool; auto-scaling adds instances which multiply connections to RDS
- Fix: Implement connection pooling properly - use PgBouncer/ProxySQL or RDS Proxy to multiplex connections
- Quick test: Check the RDS CloudWatch metric `DatabaseConnections` and compare it to the `max_connections` parameter
- Calculation: If each instance uses 20 connections and you scale to 50 instances, that is 1,000 connections (most RDS instances max out at 150-500)
- Solution: Use RDS Proxy (AWS managed connection pooler) or reduce per-instance connection pool size
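As a sketch of the per-instance pool sizing mentioned in Problem 3 above (SQLAlchemy and all numbers here are illustrative assumptions; any pooling library works the same way):

```python
# Sketch: budget connections so (pool_size + max_overflow) * max ASG instances
# stays below the database's max_connections.
from sqlalchemy import create_engine

RDS_MAX_CONNECTIONS = 200   # from the DB parameter group (assumption)
MAX_ASG_INSTANCES = 10      # ASG max capacity (assumption)
HEADROOM = 20               # reserve connections for admin tools and migrations

per_instance = (RDS_MAX_CONNECTIONS - HEADROOM) // MAX_ASG_INSTANCES  # -> 18

engine = create_engine(
    "postgresql+psycopg2://app:password@my-db.example.us-east-1.rds.amazonaws.com/appdb",
    pool_size=per_instance,
    max_overflow=0,       # never exceed the budgeted connections
    pool_pre_ping=True,   # discard dead connections after failovers
)
```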
Problem 4: “Instances can’t reach RDS database - connection timeout”
- Why: Security group on RDS not allowing traffic from EC2 security group, or RDS in different VPC/subnet
- Fix: RDS security group must have inbound rule allowing port 5432 (PostgreSQL) or 3306 (MySQL) from EC2 security group ID
- Quick test: From an EC2 instance, run `nc -zv RDS_ENDPOINT 5432` to test TCP connectivity
- Network debugging: Check route tables - the private subnet route table needs a NAT gateway route only for outbound internet traffic; RDS is internal to the VPC, so no NAT is needed to reach it
- Common mistake: Allowing RDS access by IP range instead of security group ID - IPs change with auto-scaling
Problem 5: “ALB returning 502 Bad Gateway errors intermittently”
- Why: Application crashing on some instances, instances deregistering during traffic, or connection timeout misconfiguration
- Fix: Check ALB target health in console, examine CloudWatch Logs for unhealthy target count
- Quick test: `aws elbv2 describe-target-health --target-group-arn YOUR_TG_ARN` shows which instances are unhealthy
- Common cause: Application takes longer to respond than the ALB idle timeout allows (default 60 seconds) - raise the timeout to cover your slowest responses
- Debugging: Enable ALB access logs to S3 to see exact error codes and target IP addresses
Problem 6: “Auto Scaling stuck - won’t scale in (decrease instances)”
- Why: Scale-in protection enabled on instances, cooldown period preventing scale-in, or Application Auto Scaling suspended during ECS deployments
- Fix: Check if “Turn off scale-in” option is enabled in scaling policy (should be disabled unless you have specific reason)
- Quick test: Verify Auto Scaling Group activities: `aws autoscaling describe-scaling-activities --auto-scaling-group-name YOUR_ASG --max-records 10`
- For ECS: Application Auto Scaling suspends scale-in during deployments and resumes it after completion - this is by design
- Cooldown: Default 300-second cooldown between scaling activities prevents rapid fluctuations - be patient or reduce cooldown
Problem 7: “Session state lost when traffic moves to different instance”
- Why: Application stores session data in memory (not stateless), new instance doesn’t have session info
- Fix: Use ElastiCache (Redis/Memcached) or DynamoDB for shared session storage across instances
- Alternative: Enable ALB sticky sessions (session affinity) - but this defeats auto-scaling benefits if sessions are long-lived
- Best practice: Make application truly stateless - session data in Redis, user uploads in S3, not on instance filesystem
- Quick test: Hit ALB endpoint twice from browser - check if you’re logged out (lost session) between requests
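A minimal sketch of the shared-session fix for Problem 7, using ElastiCache Redis (the endpoint, key scheme, and TTL are assumptions):

```python
# Sketch: keep session data in Redis so any instance behind the ALB can serve any user.
import json
import redis

# Placeholder ElastiCache endpoint - substitute your cluster's primary endpoint.
r = redis.Redis(host="my-sessions.abc123.0001.use1.cache.amazonaws.com", port=6379)

SESSION_TTL_SECONDS = 3600

def save_session(session_id: str, data: dict) -> None:
    r.setex(f"session:{session_id}", SESSION_TTL_SECONDS, json.dumps(data))

def load_session(session_id: str):
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```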
Problem 8: “CloudWatch alarms not triggering - metrics show no data”
- Why: CloudWatch agent not installed or configured on EC2 instances, IAM role missing CloudWatch permissions
- Fix: Ensure the EC2 IAM role has the `CloudWatchAgentServerPolicy` managed policy attached
- Quick test: SSH to the instance and check CloudWatch agent status: `sudo systemctl status amazon-cloudwatch-agent`
- Manual test: Push a custom metric: `aws cloudwatch put-metric-data --namespace MyApp --metric-name TestMetric --value 1`
- Common issue: The alarm's namespace or dimension name doesn't match what the application actually publishes
Problem 9: “Deployment causes downtime - all instances replaced simultaneously”
- Why: Launch template updated, Auto Scaling Group performs rolling replacement, but replacement too fast
- Fix: Configure deployment settings: set `max_unavailable` to 1 and `min_healthy_percentage` to 90% to ensure a gradual rollout
- For Blue/Green: Use separate target groups and weighted ALB routing to shift traffic gradually
- Best practice: Use instance refresh with checkpoint delays to validate new instances before continuing
- Quick test: During deployment, watch the target group with `aws elbv2 describe-target-health --target-group-arn YOUR_TG_ARN` every 10 seconds
Problem 10: “Costs skyrocketing - Auto Scaling scaling out but never in”
- Why: Scale-in threshold not configured, metric stuck high, or zombie instances in ASG
- Fix: Set both scale-out AND scale-in policies with appropriate thresholds (e.g., scale out at >70% CPU, scale in at <30% CPU)
- Cost check: Run `aws ce get-cost-and-usage` to see EC2 costs vs other services
- Quick audit: Run `aws autoscaling describe-auto-scaling-groups` and compare current capacity vs desired vs max
- Reality check: Auto Scaling should oscillate around desired capacity; if it's always at max, your scale-out threshold is too sensitive
Debugging Tools & Techniques:
- Auto Scaling Activity History: Shows why scaling happened and if it succeeded/failed
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name YOUR_ASG \
    --max-records 20 \
    --query 'Activities[*].[StartTime,Description,Cause]' \
    --output table
- ALB Target Health Dashboard: Real-time view of which instances are healthy/unhealthy
- CloudWatch Container Insights: If using ECS, shows task-level CPU/memory metrics
- RDS Performance Insights: Identifies slow queries causing connection pool exhaustion
- VPC Flow Logs: Trace network traffic between ALB ↔ EC2 ↔ RDS to find security group issues
Sources:
- [Troubleshoot auto scaling issues in Amazon ECS - AWS re:Post](https://repost.aws/knowledge-center/ecs-service-auto-scaling-issues) - Troubleshooting service auto scaling in Amazon ECS
- Troubleshoot issues in Amazon EC2 Auto Scaling
- Avoiding Common Pitfalls with ECS Capacity Providers and Auto Scaling
Project 4: Container Platform (ECS Fargate or EKS)
- File: AWS_DEEP_DIVE_LEARNING_PROJECTS.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, TypeScript, Terraform HCL
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: Level 3: The “Service & Support” Model
- Difficulty: Level 3: Advanced (The Engineer)
- Knowledge Area: Containers, Kubernetes, Orchestration
- Software or Tool: Docker, ECS, EKS, Kubernetes
- Main Book: “AWS for Solutions Architects” by Shrivastava et al.
What you’ll build: A containerized microservices application deployed on either ECS with Fargate (serverless containers) or EKS (managed Kubernetes), complete with service discovery, load balancing, and auto-scaling.
Why it teaches Containers on AWS: Containers are between EC2 and Lambda—more portable than EC2, more control than Lambda. ECS teaches you AWS’s native container orchestration; EKS teaches you Kubernetes. Both force you to understand task definitions, networking modes, and service mesh concepts.
Core challenges you’ll face:
- Task Definitions (maps to container configuration): CPU/memory allocation, environment variables, port mappings, IAM task roles
- Networking Modes (maps to container networking): awsvpc mode, service discovery, ALB integration
- Service Scaling (maps to container orchestration): Target tracking, step scaling based on metrics
- ECR Integration (maps to image management): Building, tagging, pushing container images
- For EKS: Kubernetes Fundamentals (maps to orchestration): Pods, Deployments, Services, Ingress
Resources for key challenges:
- Learn Amazon ECS over a Weekend - Fast workshop-style course
- Provision an EKS cluster with Terraform - HashiCorp official tutorial
- Terraform EKS Tutorial - Spacelift comprehensive guide
- Build secure application networks with VPC Lattice - AWS Containers Blog
Key Concepts:
- ECS Task Definitions: Amazon ECS Developer Guide
- Fargate vs EC2 Launch Type: “AWS for Solutions Architects” Ch. 9
- Kubernetes on AWS: Amazon EKS User Guide
- Terraform EKS Module: terraform-aws-eks - GitHub
Difficulty: Advanced Time estimate: 2-4 weeks Prerequisites: Docker basics, Projects 1-2 completed, some Kubernetes knowledge for EKS path
Real world outcome:
- Multiple containerized services communicating with each other
- A working application accessible via ALB
- CloudWatch Container Insights showing metrics
- Ability to deploy new versions with zero downtime
- For EKS: `kubectl` commands working against your cluster
Learning milestones:
- First milestone: Single container task running on Fargate → you understand task definitions
- Second milestone: Add ALB target group with service → you understand container load balancing
- Third milestone: Add second service with service discovery → you understand microservices communication
- Fourth milestone: Add auto-scaling based on CPU → you understand container elasticity
- Final milestone: CI/CD pipeline deploying to ECS/EKS → you understand container DevOps
Real World Outcome
When you complete this project, you’ll have tangible proof of a working container platform:
For ECS Fargate:
# View your running services
$ aws ecs list-services --cluster my-cluster --no-cli-pager
{
"serviceArns": [
"arn:aws:ecs:us-east-1:123456789:service/my-cluster/api-service",
"arn:aws:ecs:us-east-1:123456789:service/my-cluster/worker-service"
]
}
# Check service health
$ aws ecs describe-services --cluster my-cluster --services api-service --no-cli-pager
{
"services": [{
"serviceName": "api-service",
"runningCount": 2,
"desiredCount": 2,
"launchType": "FARGATE",
"networkConfiguration": {
"awsvpcConfiguration": {
"subnets": ["subnet-abc123", "subnet-def456"],
"securityGroups": ["sg-xyz789"]
}
}
}]
}
# Your application responds via ALB
$ curl http://my-app-alb-1234567890.us-east-1.elb.amazonaws.com/health
{"status": "healthy", "service": "api", "task_id": "abc123def456", "version": "1.2.0"}
# Service discovery working (containers finding each other by DNS)
$ curl http://my-app-alb-1234567890.us-east-1.elb.amazonaws.com/api/worker-status
{"worker_service": "reachable", "tasks": 3, "queue_depth": 42}
# View container images in ECR
$ aws ecr describe-images --repository-name my-app --no-cli-pager
{
"imageDetails": [
{
"imageDigest": "sha256:abc123...",
"imageTags": ["latest", "v1.2.0"],
"imagePushedAt": "2025-12-20T10:30:00+00:00"
}
]
}
For EKS:
# Your Kubernetes cluster is accessible
$ kubectl cluster-info
Kubernetes control plane is running at https://ABC123.gr7.us-east-1.eks.amazonaws.com
# View your running workloads
$ kubectl get pods -n production
NAME READY STATUS RESTARTS AGE
api-deployment-7d8f9c-abc12 1/1 Running 0 2h
api-deployment-7d8f9c-def34 1/1 Running 0 2h
worker-deployment-5e6f7-xyz 1/1 Running 0 1h
$ kubectl get services -n production
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
api-service LoadBalancer 10.100.50.25 a1b2c3.us-east-1.elb.amazonaws.com 80:30001/TCP
worker-svc ClusterIP 10.100.75.10 <none> 8080/TCP
# Application accessible via Kubernetes LoadBalancer
$ curl http://a1b2c3.us-east-1.elb.amazonaws.com/health
{"status": "healthy", "pod": "api-deployment-7d8f9c-abc12", "node": "ip-10-0-1-50.ec2.internal"}
# Rolling deployment with zero downtime
$ kubectl set image deployment/api-deployment api=my-repo/api:v1.3.0 -n production
deployment.apps/api-deployment image updated
$ kubectl rollout status deployment/api-deployment -n production
Waiting for deployment "api-deployment" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "api-deployment" rollout to finish: 1 old replicas are pending termination...
deployment "api-deployment" successfully rolled out
Container Insights Dashboard:
- CPU/Memory utilization per service/pod
- Network throughput and connections
- Task/pod startup time and failure rates
- Container-level logs aggregated in CloudWatch
Service Discovery Working:
- ECS: Cloud Map DNS names (api-service.local, worker-service.local)
- EKS: Kubernetes DNS (api-service.production.svc.cluster.local)
- Containers automatically discover each other without hardcoded IPs
Rolling Deployments:
- Deploy new version without downtime
- Watch old tasks drain and new tasks become healthy
- Automatic rollback if health checks fail
The Core Question You’re Answering
“When should I use containers vs Lambda vs EC2, and what’s the difference between ECS and EKS?”
This is THE fundamental architectural decision on AWS:
- Lambda: Event-driven, sub-15-minute executions, no state, extreme auto-scaling. Use when you have unpredictable spiky traffic and stateless operations.
- Containers (ECS/EKS): Long-running processes, stateful applications, need specific runtime control, want portability. Use when you need more than 15 minutes, WebSocket connections, background workers, or existing Dockerized apps.
- EC2: Full OS control, legacy apps, specialized hardware needs, licensed software. Use when containers/Lambda don’t fit.
ECS vs EKS:
- ECS: AWS-native, simpler, less operational overhead, great for teams new to containers. Task definitions are AWS-specific JSON.
- EKS: Standard Kubernetes, portable across clouds, richer ecosystem, more complex. Use when you need Kubernetes features or multi-cloud portability.
Fargate vs EC2 Launch Type:
- Fargate: Serverless containers—you define task CPU/memory, AWS manages infrastructure. Higher per-task cost, zero operational overhead.
- EC2 Launch Type: You manage the EC2 instances in the cluster. Lower per-task cost if you have steady baseline load, more control, more ops work.
Concepts You Must Understand First
- Container Fundamentals:
- Docker images are layered filesystems (each Dockerfile instruction = layer)
- Container registries store images (ECR, Docker Hub)
- Containers are isolated processes sharing a kernel (not VMs)
- Port mapping: container internal port → host port
- Environment variables and secrets injection
- Container Orchestration:
- Scheduling: Placing containers on available compute resources based on CPU/memory requirements
- Service Discovery: How containers find each other (DNS-based)
- Load Balancing: Distributing traffic across container replicas
- Health Checks: Determining if a container is ready to receive traffic
- Auto-Scaling: Adjusting container count based on metrics
- ECS Concepts:
- Task Definition: Blueprint for your container (image, CPU, memory, ports, environment)
- Task: Running instance of a task definition (one or more containers running together)
- Service: Maintains desired count of tasks, integrates with ALB, handles deployments
- Cluster: Logical grouping of tasks/services
- Task Role: IAM permissions for your application (what the container can do)
- Execution Role: IAM permissions for ECS agent (pulling ECR images, writing logs)
- Kubernetes Concepts (for EKS):
- Pod: Smallest unit (one or more containers, shared network/storage)
- Deployment: Declarative pod management (replicas, rolling updates)
- Service: Stable endpoint for pods (ClusterIP, LoadBalancer, NodePort)
- Ingress: HTTP routing rules (maps URLs to services)
- Namespace: Logical cluster subdivision
- ConfigMap/Secret: Configuration and sensitive data injection
- Fargate vs EC2 Launch Types:
- Fargate: Specify vCPU/memory, AWS provisions infrastructure, pay per task resource/time
- EC2: You provision EC2 instances, ECS schedules tasks on them, pay for instances (can be cheaper at scale)
- Trade-off: Fargate = simplicity, EC2 = control + potential cost savings with Reserved Instances
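To make the task definition vocabulary above concrete, here is a hedged boto3 sketch that registers a minimal Fargate task definition; every ARN, name, and image below is a placeholder, and Terraform or the console achieve the same result:

```python
# Sketch: minimal Fargate task definition showing where the execution role
# (pull image, write logs) and the task role (your app's AWS permissions) plug in.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="api-service",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",   # required for Fargate
    cpu="256",              # 0.25 vCPU
    memory="512",           # MiB
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    taskRoleArn="arn:aws:iam::123456789012:role/apiServiceTaskRole",
    containerDefinitions=[{
        "name": "api",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.2.0",
        "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
        "environment": [{"name": "ENV", "value": "dev"}],
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": "/ecs/api-service",
                "awslogs-region": "us-east-1",
                "awslogs-stream-prefix": "api",
            },
        },
    }],
)
```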
Questions to Guide Your Design
Architecture Decisions:
- When would you choose Fargate over EC2 launch type? (Hint: variable workload vs steady baseline, ops overhead tolerance)
- Should you use ECS or EKS? (Hint: team Kubernetes experience, multi-cloud needs, ecosystem requirements)
- How many containers should run in a single task/pod? (Hint: tightly coupled = same task, independent = separate tasks)
Networking:
- How do containers in the same task communicate? (Hint: localhost, shared network namespace)
- How do containers in different tasks communicate? (Hint: service discovery via DNS)
- What’s the difference between ECS service discovery (Cloud Map) and an ALB? (Hint: internal service-to-service vs external clients)
- Which networking mode should you use (awsvpc, bridge, host)? (Hint: Fargate requires awsvpc)
Security:
- How do you handle secrets in containerized applications? (Hint: Secrets Manager/Parameter Store + task definition secrets, NOT environment variables in plaintext)
- What’s the difference between task role and execution role? (Hint: execution = ECS needs it, task = your app needs it)
- How do you restrict which services can talk to each other? (Hint: security groups with awsvpc mode)
Scaling & Performance:
- Should you scale based on CPU, memory, or request count? (Hint: depends on bottleneck—CPU-bound vs I/O-bound workloads)
- How do you handle database connection pooling with auto-scaling containers? (Hint: RDS Proxy or application-level pooling)
- What happens during a deployment? (Hint: rolling update drains old tasks while starting new ones)
Operational:
- How do you get logs from containers? (Hint: awslogs driver → CloudWatch)
- How do you debug a failing container startup? (Hint: CloudWatch logs, check execution role for ECR pull permissions)
- How do you do zero-downtime deployments? (Hint: ALB health checks + rolling update strategy)
Thinking Exercise
Map Kubernetes concepts to ECS equivalents:
| Kubernetes | ECS Equivalent | Notes |
|---|---|---|
| Pod | Task | Both are one or more containers with shared resources |
| Deployment | Service | Both maintain desired count and handle updates |
| Service (ClusterIP) | Service Discovery (Cloud Map) | Internal DNS-based discovery |
| Service (LoadBalancer) | ALB Target Group | External load balancing |
| Container in Pod spec | Container Definition in Task Definition | Both define image, ports, env vars |
| ConfigMap | SSM Parameter Store / Secrets Manager | Both inject configuration |
| Secret | Secrets Manager | Both handle sensitive data |
| Namespace | Cluster (loosely) | Logical separation (but ECS clusters are less strict) |
| Ingress | ALB Listener Rules | HTTP routing rules |
| HorizontalPodAutoscaler | Service Auto Scaling | Both scale based on metrics |
Key differences:
- Kubernetes is more declarative (desired state via YAML)
- ECS is more imperative (API calls to create/update services)
- Kubernetes has richer networking (network policies, service mesh)
- ECS is simpler for AWS-only deployments
Design exercise: If you have a microservices app with 5 services, should they all go in one task definition or separate services?
- Answer: Separate ECS services (or Kubernetes deployments). Each microservice should scale independently.
- Exception: If two containers are tightly coupled (app + sidecar proxy), same task/pod makes sense.
The Interview Questions They’ll Ask
Basic Level:
- Q: What’s the difference between a Docker image and a container?
- A: An image is a read-only template (layers of filesystem changes). A container is a running instance of an image with a writable layer on top. Analogy: image = class, container = object instance.
- Q: Explain ECS task vs service.
- A: A task is a running instantiation of a task definition (one or more containers). A service maintains a desired count of tasks, integrates with load balancers, and handles deployments. Tasks are ephemeral; services ensure they keep running.
- Q: What is Fargate?
- A: Serverless compute for containers. You specify CPU/memory in your task definition, AWS provisions and manages the underlying infrastructure. No EC2 instances to manage.
- Q: How do containers in the same ECS task communicate?
- A: Via localhost. Containers in the same task share a network namespace, so they can reach each other on 127.0.0.1 using different ports.
Intermediate Level:
- Q: When would you use ECS over Lambda?
- A: When you need: (1) longer than 15-minute execution, (2) WebSocket/long-lived connections, (3) stateful processing, (4) specific runtime dependencies not available in Lambda layers, (5) existing Dockerized applications.
- Q: Explain the difference between task role and execution role in ECS.
- A: Execution role: permissions ECS needs to set up your task (pull ECR images, write CloudWatch logs). Task role: permissions your application code needs (read S3, query DynamoDB). Never confuse these—execution role is infrastructure, task role is application.
- Q: How does ECS service discovery work?
- A: Uses AWS Cloud Map to create DNS records for tasks. When you enable service discovery, each task gets a DNS entry (e.g., api-service.local). Other services query this DNS name and get IPs of healthy tasks. Updates automatically as tasks start/stop.
- Q: How do you implement zero-downtime deployments in ECS?
- A: Use rolling update deployment type with ALB. Configure minimum healthy percent (e.g., 100%) and maximum percent (e.g., 200%). ECS starts new tasks, waits for ALB health checks to pass, then drains and stops old tasks. If health checks fail, deployment rolls back.
Advanced Level:
- Q: You have an ECS service that keeps failing health checks and restarting. How do you debug?
- A: (1) Check CloudWatch logs for application errors. (2) Verify health check endpoint in ALB target group matches application. (3) Check security groups allow ALB → tasks traffic. (4) Verify task role has permissions app needs. (5) Check container startup time vs health check interval (may need longer initial delay). (6) Exec into a running task to test manually:
aws ecs execute-command --cluster X --task Y --container Z --interactive --command "/bin/sh".
- Q: How would you handle database connection pooling with auto-scaling ECS services?
- A: Options: (1) Use RDS Proxy (connection pooling/multiplexing at AWS layer). (2) Application-level pooling with conservative pool size per container (max_connections / expected_task_count). (3) Use connection pool libraries that handle connection reuse. Problem: Each task creates its own pool, so 10 tasks with 10 connections each = 100 DB connections. RDS Proxy solves this.
- Q: Compare ECS awsvpc networking mode with bridge mode.
- A: awsvpc: Each task gets its own ENI with private IP from VPC subnet. Security groups apply at task level. Required for Fargate. Better isolation. Bridge: Containers share host’s network via Docker bridge. Port mapping required (host port != container port). Multiple tasks on same host need different host ports. awsvpc is recommended for new deployments.
- Q: When would you choose EKS over ECS?
- A: When you need: (1) Kubernetes-specific features (CRDs, operators, Helm charts), (2) multi-cloud portability (same K8s manifests work on GKE/AKS), (3) existing Kubernetes expertise on team, (4) vendor-neutral orchestration. ECS is simpler and AWS-native; EKS is more complex but standard.
- Q: How do you handle secrets in containerized apps on AWS?
- A: Store secrets in AWS Secrets Manager or SSM Parameter Store (SecureString). Reference them in task definition secrets field (NOT environment variables). ECS retrieves secrets at task startup using execution role permissions, injects them as environment variables into container. Secrets are encrypted at rest and in transit. Never hardcode secrets in Dockerfile or pass as plaintext env vars.
- Q: Explain a scenario where you’d use the EC2 launch type instead of Fargate.
- A: When you have: (1) steady baseline load (Reserved Instances cheaper than Fargate per-task pricing), (2) need for specific EC2 instance types (GPU, high memory), (3) tasks require privileged mode or host networking, (4) need to run your own AMI with custom configs. Trade-off: lower cost and more control, but you manage cluster capacity and OS patching.
- Q: Your containerized application experiences cold start delays. How do you optimize?
- A: (1) Reduce image size (multi-stage builds, minimal base images like Alpine). (2) Optimize Dockerfile layer caching (COPY dependencies before code). (3) Pre-warm containers if predictable traffic spikes. (4) Use smaller task CPU/memory if over-provisioned (faster scheduling). (5) For EKS: Use cluster autoscaler with appropriate scaling configs. (6) Consider keeping minimum task count > 0 to avoid cold starts entirely.
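For the secrets question above, a hedged sketch of the runtime alternative: fetching a secret directly with boto3 (the secret name is a placeholder). The task definition `secrets` field avoids even this code by injecting the value at container startup.

```python
# Sketch: read a secret from AWS Secrets Manager at runtime.
# The task role (not the execution role) needs secretsmanager:GetSecretValue.
import json
import boto3

secrets = boto3.client("secretsmanager")

def get_db_credentials(secret_name: str = "prod/app/db") -> dict:  # placeholder name
    resp = secrets.get_secret_value(SecretId=secret_name)
    return json.loads(resp["SecretString"])

creds = get_db_credentials()
# creds -> {"username": "...", "password": "...", "host": "...", ...}
```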
Hints in Layers
Level 1: Getting Started
- Start with a single-container task definition running nginx or a simple Python/Node.js app
- Use Fargate to avoid EC2 cluster management
- Deploy to public subnet first (simpler than private with NAT)
- Use AWS Console to create your first task definition—you’ll see all the options
- Put “latest” tag on your first image (iterate fast)
Level 2: Adding Realism
- Move to private subnets with NAT gateway (production pattern)
- Add ALB in front of your service for stable endpoint
- Create second container that calls the first (understand inter-service communication)
- Use ECR instead of Docker Hub (AWS-native, faster pulls)
- Start using specific version tags (v1.0.0, not “latest”)
Level 3: Production Patterns
- Enable service discovery (Cloud Map) for service-to-service DNS
- Configure auto-scaling based on ALB request count or custom CloudWatch metrics
- Set up proper health checks (readiness vs liveness)
- Add Container Insights for metrics
- Implement rolling deployments with deployment circuit breaker (auto-rollback on failure)
Level 4: Advanced Scenarios
- Multi-container task with sidecar pattern (app + logging agent)
- Task role granting only necessary permissions (least privilege)
- Secrets injection from Secrets Manager (no plaintext env vars)
- Blue/green deployments using CodeDeploy
- For EKS: Implement Horizontal Pod Autoscaler and Cluster Autoscaler together
Level 5: Mastery
- CI/CD pipeline: GitHub Actions → build image → push to ECR → update ECS service
- Canary deployments (send 10% traffic to new version, monitor, then 100%)
- Service mesh (App Mesh for ECS or Istio for EKS) for advanced routing/observability
- Cross-region replication for DR (replicate ECR images, deploy to multiple regions)
- Custom metrics from containers to CloudWatch for scaling (e.g., queue depth)
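For the custom-metric scaling item above, a hedged sketch of publishing queue depth to CloudWatch (queue URL, namespace, and dimensions are assumptions); a target tracking or step scaling policy can then act on this metric:

```python
# Sketch: publish SQS queue depth as a custom CloudWatch metric for scaling.
import boto3

sqs = boto3.client("sqs")
cloudwatch = boto3.client("cloudwatch")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/report-jobs"  # placeholder

def publish_queue_depth() -> None:
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    depth = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    cloudwatch.put_metric_data(
        Namespace="MyApp/Workers",
        MetricData=[{
            "MetricName": "QueueDepth",
            "Dimensions": [{"Name": "Service", "Value": "report-processor"}],
            "Value": depth,
            "Unit": "Count",
        }],
    )

if __name__ == "__main__":
    publish_queue_depth()  # run on a schedule (e.g., EventBridge + Lambda)
```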
Books That Will Help
| Book Title | Author(s) | Relevant Chapters | Why It Helps |
|---|---|---|---|
| AWS for Solutions Architects | Saurabh Shrivastava | Ch. 9: Containers on AWS | Best coverage of ECS architecture patterns, Fargate vs EC2 decisions |
| Docker Deep Dive | Nigel Poulton | Ch. 3-5, 8-9 | Container fundamentals, image layers, networking modes |
| Kubernetes in Action | Marko Luksa | Ch. 3-7 | Core K8s concepts (pods, services, deployments) for EKS path |
| Kubernetes Up & Running | Kelsey Hightower et al. | Ch. 5-6, 9-10 | Practical K8s patterns, service discovery, load balancing |
| Amazon Web Services in Action | Andreas Wittig & Michael Wittig | Ch. 14: Containers | Step-by-step ECS tutorial with CloudFormation examples |
| The DevOps Handbook | Gene Kim et al. | Part IV: Technical Practices | CI/CD for containers, deployment strategies (blue/green, canary) |
| Site Reliability Engineering | Google SRE Team | Ch. 7, 21 | Load balancing, monitoring containerized systems at scale |
| Production Kubernetes | Josh Rosso et al. | Ch. 4-6, 11 | Production-grade EKS: networking, security, observability |
| Container Security | Liz Rice | Ch. 2-4, 7 | Securing container images, runtime, orchestrator (critical for prod) |
| Designing Data-Intensive Applications | Martin Kleppmann | Ch. 11: Stream Processing | Understanding when to use containers for stateful vs stateless workloads |
Quick Reference:
- New to containers? Start with “Docker Deep Dive” Ch. 3-5
- Choosing ECS? Focus on “AWS for Solutions Architects” Ch. 9
- Choosing EKS? Read “Kubernetes Up & Running” Ch. 5-6 first
- Ready for production? “Container Security” is mandatory
- Need CI/CD? “The DevOps Handbook” Part IV
Common Pitfalls & Debugging
Problem 1: “ECS tasks fail to start - stuck in PENDING or PROVISIONING state”
- Why: Insufficient resources in cluster (EC2 launch type), ENI limit reached in VPC (Fargate), or invalid task definition
- Fix: Check ECS events tab in AWS Console for exact error message
- Quick test: `aws ecs describe-tasks --cluster YOUR_CLUSTER --tasks TASK_ARN --query 'tasks[0].stoppedReason'`
- Common causes:
- ENI limit: Each Fargate task needs its own ENI - VPC subnet exhausted available IPs or hit ENI quota
- Resource constraints: For EC2 launch type, container instances don’t have enough CPU/memory
- Image pull failure: ECR permissions missing on execution role or image doesn’t exist
- Solution: For ENI issues, use larger CIDR block for private subnets or request ENI quota increase
Problem 2: “Cannot pull image from ECR - ‘CannotPullContainerError’”
- Why: ECS execution role lacks ECR permissions, or VPC doesn’t have route to ECR
- Fix: Ensure the execution role has `AmazonECSTaskExecutionRolePolicy` attached
- Quick test: Manually pull the image from an EC2 instance in the same VPC using the task's IAM role: `aws ecr get-login-password | docker login --username AWS --password-stdin ECR_URI`
- VPC issue: If using private subnets without NAT, create VPC endpoints for `com.amazonaws.REGION.ecr.dkr`, `com.amazonaws.REGION.ecr.api`, and `com.amazonaws.REGION.s3`
- Cost savings: VPC endpoints eliminate NAT Gateway data processing charges ($0.045/GB) for ECR pulls
Problem 3: “ECS service keeps restarting - tasks fail health checks repeatedly”
- Why: Health check configuration mismatch, application not binding to correct port, or startup time exceeds grace period
- Fix: Verify ALB target group health check settings match application endpoint
- Quick test: Exec into a running task and curl the health check endpoint: `aws ecs execute-command --cluster X --task Y --container Z --interactive --command "/bin/sh"`
- Common mistakes:
  - Application binds to `localhost` instead of `0.0.0.0`, so the ALB can't reach it
  - Health check path wrong (e.g., `/health` vs `/healthz`)
  - Health check interval too aggressive: 5 seconds with a 30-second app startup = guaranteed failure
- Fix: Increase `healthCheckGracePeriodSeconds` in the service definition to 60-120 seconds for slow-starting apps
Problem 4: “ECS task can’t connect to RDS database in same VPC”
- Why: Security group misconfiguration - RDS security group doesn’t allow inbound from ECS task security group
- Fix: The RDS security group must have an inbound rule: Type: PostgreSQL/MySQL, Source: ECS_TASK_SECURITY_GROUP_ID
- Quick test: From the ECS task, run `nc -zv RDS_ENDPOINT 5432` to test connectivity
- Network debugging: Check that the task's ENI is in the same VPC as RDS and route tables allow local VPC traffic
- Common mistake: Using CIDR block instead of security group ID - tasks get dynamic IPs with awsvpc mode
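If your container image doesn't ship netcat, a pure-Python equivalent of the connectivity test (the endpoint is a placeholder):

```python
# Sketch: Python stand-in for `nc -zv RDS_ENDPOINT 5432` on minimal images.
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(can_connect("my-db.cluster-xyz.us-east-1.rds.amazonaws.com", 5432))
```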
Problem 5: “ECS service auto-scaling not working - tasks not scaling up/down”
- Why: CloudWatch alarm not configured correctly, Application Auto Scaling policies not attached, or metrics not being published
- Fix: Verify a target tracking policy exists: `aws application-autoscaling describe-scaling-policies --service-namespace ecs`
- Quick test: Check the CloudWatch metric is being published: `aws cloudwatch get-metric-statistics --namespace AWS/ECS --metric-name CPUUtilization --dimensions Name=ServiceName,Value=YOUR_SERVICE`
- ECS-specific issue: Application Auto Scaling turns OFF scale-in during deployments and resumes it afterward - this is expected behavior
- Capacity provider issue: If using EC2 launch type with capacity providers, ensure managed scaling is enabled and target capacity is set (typically 100%)
Problem 6: “Container logs not appearing in CloudWatch Logs”
- Why: Execution role lacks CloudWatch Logs permissions, log group doesn’t exist, or awslogs driver not configured in task definition
- Fix: Check the execution role has `logs:CreateLogStream` and `logs:PutLogEvents` permissions
- Quick test: Verify the log group exists: `aws logs describe-log-groups --log-group-name-prefix /ecs/YOUR_TASK_FAMILY`
- Task definition fix: Ensure `logConfiguration` uses the `awslogs` driver with the correct region and log group name
- Auto-create logs: Set `awslogs-create-group: true` in the task definition to auto-create log groups
Problem 7: “ECS task ‘CannotStartContainerError’ - container keeps crashing on startup”
- Why: Application error in container code, missing environment variables, incorrect entrypoint/command, or resource limits too restrictive
- Fix: Check the stopped task reason: `aws ecs describe-tasks --tasks TASK_ARN --query 'tasks[0].containers[0].reason'`
- Debug locally: Run the same image locally with the same env vars: `docker run -e VAR=value YOUR_IMAGE`
- Memory issue: If you see "Essential container in task exited", increase task memory; JVM applications often need 2-4x more than you think
- ECS Exec: For running tasks, use `aws ecs execute-command` to exec into the container and debug live
Problem 8: “EKS pods pending - ‘Insufficient cpu’ or ‘Insufficient memory’”
- Why: Node group doesn’t have capacity for pod resource requests, cluster autoscaler not configured, or resource requests too high
- Fix: Check pending pod events: `kubectl describe pod POD_NAME | grep -A 10 Events`
- Quick fix: Reduce pod resource requests or add more nodes to the cluster
- Long-term solution: Configure Cluster Autoscaler or Karpenter for automatic node provisioning based on pending pods
- Cost optimization: Use spot instances in node group for non-critical workloads (70% cost savings)
Problem 9: “Fargate tasks costing more than expected - bill is huge”
- Why: Over-provisioned CPU/memory, tasks running 24/7 when not needed, or data transfer charges from NAT Gateway
- Fix: Analyze CloudWatch Container Insights to see actual CPU/memory usage vs allocated
- Right-sizing: If consistently using <50% of allocated resources, reduce task size (Fargate charges per vCPU-hour and GB-hour)
- Cost comparison: Fargate at 0.25 vCPU + 0.5 GB costs roughly $0.012/hour (us-east-1); an EC2 t3.micro costs roughly $0.01/hour - but with EC2 you manage the instances yourself
- Data transfer: Use VPC endpoints for ECR/S3/other AWS services to eliminate NAT Gateway charges ($0.045/GB)
Problem 10: “Docker image push to ECR failing - ‘denied: Your authorization token has expired’”
- Why: ECR login token is valid for only 12 hours and expired
- Fix: Re-authenticate: `aws ecr get-login-password --region REGION | docker login --username AWS --password-stdin ECR_URI`
- CI/CD automation: In pipelines, always run `aws ecr get-login-password` before pushing images
- Alternative: Use the Docker credential helper to refresh tokens automatically: install `amazon-ecr-credential-helper`
Problem 11: “ECS service deployment stuck - new tasks start but old tasks don’t drain”
- Why: ALB connection draining taking too long, application not handling SIGTERM gracefully, or deployment configuration issues
- Fix: Check deployment circuit breaker events in ECS console
- Quick test: View service events: `aws ecs describe-services --cluster X --services Y --query 'services[0].events[:10]'`
- Graceful shutdown: Ensure the application handles SIGTERM and closes connections within the stop timeout (default 30 seconds) - see the sketch after this list
- Deployment config: Verify `minimumHealthyPercent` (e.g., 100%) and `maximumPercent` (e.g., 200%) allow room for new tasks
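A minimal sketch of the SIGTERM handling mentioned above (the worker loop is illustrative; web frameworks usually expose their own shutdown hooks):

```python
# Sketch: stop taking new work on SIGTERM so ECS can drain the task cleanly
# before the stop timeout (default 30 seconds) expires.
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True  # stop accepting new work; finish in-flight work

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    # ... poll a queue or serve requests ...
    time.sleep(1)

# Close DB connections, flush logs, then exit before stopTimeout elapses.
sys.exit(0)
```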
Problem 12: “EKS nodes unable to join cluster - nodes in NotReady state”
- Why: IAM role for nodes missing policies, security groups blocking kubelet communication, or wrong AMI/userdata
- Fix: Check the node IAM role has `AmazonEKSWorkerNodePolicy`, `AmazonEC2ContainerRegistryReadOnly`, and `AmazonEKS_CNI_Policy`
- Quick test: SSH to the node and check kubelet status: `systemctl status kubelet`
- Security groups: The cluster security group must allow inbound from the node security group on ports 443 (API) and 10250 (kubelet)
- Logs: Check `/var/log/cloud-init-output.log` on the node for userdata errors
Debugging Tools & Techniques:
- ECS Exec: Interactive debugging in running containers without SSH
  aws ecs execute-command \
    --cluster my-cluster \
    --task TASK_ID \
    --container app \
    --interactive \
    --command "/bin/bash"
- CloudWatch Container Insights: CPU, memory, network, disk metrics at task/pod level
- AWS X-Ray: Distributed tracing for containerized microservices (requires X-Ray daemon sidecar)
- kubectl logs/describe: Essential EKS debugging
  kubectl logs POD_NAME -f --tail=50
  kubectl describe pod POD_NAME
  kubectl get events --sort-by='.lastTimestamp'
- ECS Service Events: Check the ECS console Events tab - it shows deployment progress and errors in real time
- VPC Flow Logs: Diagnose network connectivity issues between containers and databases/services
Sources:
- [Troubleshoot scaling issues with Amazon ECS capacity providers - AWS re:Post](https://repost.aws/knowledge-center/ecs-capacity-provider-scaling) - Troubleshooting capacity provider scaling in Amazon ECS
- Amazon ECS troubleshooting
- Amazon EKS troubleshooting
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| VPC from Scratch | Intermediate | 1-2 weeks | ⭐⭐⭐⭐⭐ (Foundational) | ⭐⭐⭐ |
| Serverless Pipeline | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ (Event-driven) | ⭐⭐⭐⭐ |
| Auto-Scaling Web App | Intermediate | 2-3 weeks | ⭐⭐⭐⭐⭐ (Classic AWS) | ⭐⭐⭐ |
| Container Platform | Advanced | 2-4 weeks | ⭐⭐⭐⭐⭐ (Modern AWS) | ⭐⭐⭐⭐⭐ |
Recommendation
Start with Project 1 (VPC from Scratch). Everything else depends on understanding networking. A misconfigured VPC will break Lambda VPC access, ECS service discovery, RDS connectivity—everything. Once you can confidently explain why your private subnet routes to a NAT gateway and your public subnet routes to an IGW, move on.
Then do Project 2 (Serverless Pipeline) to understand event-driven architecture—this is where modern AWS shines. Step Functions + Lambda is incredibly powerful once you “get it.”
Then Project 3 (Auto-Scaling Web App) for the “traditional” AWS architecture that many existing systems use. This teaches you fundamentals that apply everywhere.
Finally Project 4 (Containers) brings it all together—you need VPC knowledge, you can integrate with Lambda/Step Functions, and you’ll build on auto-scaling concepts.
Project 5: Full-Stack SaaS Platform
What you’ll build: A complete SaaS application with:
- Multi-tenant architecture with isolated VPCs per environment (dev/staging/prod)
- API Gateway + Lambda for REST/GraphQL endpoints
- ECS Fargate running background workers
- Step Functions orchestrating complex business workflows
- RDS Aurora for relational data
- DynamoDB for high-speed session/cache data
- S3 + CloudFront for static assets and file uploads
- Cognito for authentication
- CI/CD with CodePipeline or GitHub Actions
- Infrastructure defined entirely in Terraform
- CloudWatch dashboards, X-Ray tracing, and alarms
Why this is the ultimate AWS learning project: This forces you to make real architectural decisions—when to use Lambda vs ECS, how to structure your VPCs for multi-environment deployments, how services talk to each other across boundaries. You’ll hit every AWS gotcha: Lambda cold starts affecting user experience, ECS task role permissions, RDS connection pooling, S3 CORS issues, CloudFront cache invalidation timing.
Core challenges you’ll face:
- Multi-Environment Architecture: Terraform workspaces or separate state files, environment-specific configurations
- API Design: REST vs GraphQL, API Gateway vs ALB, authentication flows
- Data Architecture: When to use RDS vs DynamoDB, cross-service data access patterns
- Async Processing: SQS queues, dead-letter queues, exactly-once processing
- Security: IAM policies, secrets management (Secrets Manager/Parameter Store), encryption at rest and in transit
- Observability: Distributed tracing, centralized logging, meaningful metrics
- Cost Optimization: Right-sizing, Reserved Instances vs Savings Plans, S3 lifecycle policies
Key Concepts:
- Well-Architected Framework: AWS Well-Architected - AWS Official
- SaaS Architecture: “AWS for Solutions Architects” Ch. 10-12 - Shrivastava et al.
- Infrastructure as Code at Scale: Terraform Best Practices - HashiCorp
- Serverless Patterns: “Designing Data-Intensive Applications” Ch. 11-12 - Martin Kleppmann (for async patterns)
Difficulty: Advanced Time estimate: 1-2 months Prerequisites: Projects 1-4 completed
Real world outcome:
- A working SaaS application you can demo to employers
- User registration, login, and authenticated API calls
- Background job processing visible in Step Functions console
- Multi-environment deployment from a single codebase
- Cost monitoring dashboard showing your AWS spend
- A portfolio piece that demonstrates comprehensive AWS knowledge
Learning milestones:
- First milestone: Auth working (Cognito + API Gateway) → you understand identity on AWS
- Second milestone: Core API + database working → you understand data tier patterns
- Third milestone: Background processing working → you understand async architecture
- Fourth milestone: Multi-environment deployment → you understand infrastructure management
- Final milestone: Observability + alerting working → you understand production operations
Real World Outcome
When you complete this project, you’ll have a production-ready SaaS application that demonstrates mastery of AWS:
# 1. Multi-environment infrastructure deployment
$ cd terraform
$ terraform workspace list
default
dev
* staging
production
$ terraform workspace select production
Switched to workspace "production".
$ terraform apply
Plan: 47 to add, 0 to change, 0 to destroy.
Outputs:
api_gateway_url = "https://api.yourapp.com"
cloudfront_url = "https://d1234abcd.cloudfront.net"
cognito_user_pool_id = "us-east-1_AbCd1234"
rds_endpoint = "prod-db.cluster-xyz.us-east-1.rds.amazonaws.com"
alb_dns = "prod-alb-123456789.us-east-1.elb.amazonaws.com"
# 2. User registration and authentication flow
$ curl -X POST https://api.yourapp.com/auth/signup \
-H "Content-Type: application/json" \
-d '{"email": "user@example.com", "password": "SecurePass123!"}'
{
"message": "User created successfully",
"userId": "a1b2c3d4-5678-90ab-cdef-1234567890ab",
"confirmationRequired": true
}
# 3. Confirm user and get JWT token
$ curl -X POST https://api.yourapp.com/auth/confirm \
-d '{"email": "user@example.com", "code": "123456"}'
{
"accessToken": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
"idToken": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
"expiresIn": 3600
}
# 4. Make authenticated API call
$ TOKEN="eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9..."
$ curl -H "Authorization: Bearer $TOKEN" \
https://api.yourapp.com/v1/users/profile
{
"userId": "a1b2c3d4-5678-90ab-cdef-1234567890ab",
"email": "user@example.com",
"createdAt": "2025-12-28T10:30:00Z",
"subscription": "free"
}
# 5. Trigger background job processing
$ curl -X POST -H "Authorization: Bearer $TOKEN" \
https://api.yourapp.com/v1/reports/generate \
-d '{"reportType": "analytics", "dateRange": "last_30_days"}'
{
"jobId": "job-9876-5432-1234",
"status": "QUEUED",
"message": "Report generation started. Check status at /v1/jobs/job-9876-5432-1234"
}
# 6. Monitor Step Functions execution
$ aws stepfunctions list-executions \
--state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:ReportGenerator \
--max-results 5 \
--no-cli-pager
{
"executions": [
{
"executionArn": "arn:aws:states:us-east-1:123456789012:execution:ReportGenerator:job-9876-5432-1234",
"stateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:ReportGenerator",
"name": "job-9876-5432-1234",
"status": "RUNNING",
"startDate": "2025-12-28T10:35:00.000Z"
}
]
}
# 7. Check ECS Fargate background workers
$ aws ecs list-tasks --cluster production-workers --service-name report-processor
{
"taskArns": [
"arn:aws:ecs:us-east-1:123456789012:task/production-workers/abc123def456"
]
}
$ aws ecs describe-tasks --cluster production-workers --tasks abc123def456 \
--query 'tasks[0].{Status:lastStatus,Health:healthStatus,CPU:cpu,Memory:memory}'
{
"Status": "RUNNING",
"Health": "HEALTHY",
"CPU": "512",
"Memory": "1024"
}
# 8. View CloudWatch metrics and logs
$ aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name Count \
--dimensions Name=ApiName,Value=YourAppAPI \
--start-time 2025-12-28T00:00:00Z \
--end-time 2025-12-28T23:59:59Z \
--period 3600 \
--statistics Sum
# API Gateway handled 1,247 requests today
$ aws logs tail /aws/lambda/api-handler --follow
2025-12-28T10:35:12 START RequestId: abc-123-def-456
2025-12-28T10:35:12 [INFO] User a1b2c3d4 requested report generation
2025-12-28T10:35:13 [INFO] Job queued to SQS: job-9876-5432-1234
2025-12-28T10:35:14 END RequestId: abc-123-def-456 Duration: 342ms
# 9. Cost monitoring dashboard
$ aws ce get-cost-and-usage \
--time-period Start=2025-12-01,End=2025-12-28 \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=SERVICE
{
"ResultsByTime": [
{
"Groups": [
{"Keys": ["Amazon RDS"], "Metrics": {"BlendedCost": {"Amount": "45.23"}}},
{"Keys": ["AWS Lambda"], "Metrics": {"BlendedCost": {"Amount": "12.67"}}},
{"Keys": ["Amazon S3"], "Metrics": {"BlendedCost": {"Amount": "8.34"}}},
{"Keys": ["Amazon CloudFront"], "Metrics": {"BlendedCost": {"Amount": "5.12"}}},
{"Keys": ["Amazon ECS"], "Metrics": {"BlendedCost": {"Amount": "22.89"}}}
]
}
]
}
# Total monthly cost: ~$94 for a production SaaS platform
# 10. X-Ray distributed tracing
$ aws xray get-trace-summaries \
--start-time 2025-12-28T10:00:00 \
--end-time 2025-12-28T11:00:00 \
--filter-expression 'service("api-handler")'
# Shows complete request path: API Gateway → Lambda → DynamoDB → Step Functions → SQS
What you’ll see in the AWS Console:
- API Gateway: REST API with Cognito authorizer, request/response logs, throttling metrics
- Cognito: User pool with users, MFA settings, identity providers
- Lambda Functions: Multiple functions (auth, API handlers, processors), invocation metrics, error rates
- Step Functions: State machine executions with visual workflow, input/output at each step
- ECS Fargate: Running tasks processing background jobs, CloudWatch Container Insights showing CPU/memory
- RDS Aurora: Multi-AZ PostgreSQL cluster with read replicas, Performance Insights showing query performance
- DynamoDB: Tables with on-demand billing, global secondary indexes, point-in-time recovery enabled
- S3 + CloudFront: Buckets for uploads/static assets, CloudFront distribution with HTTPS, cache hit ratio
- CloudWatch: Custom dashboards showing API latency, error rates, Lambda cold starts, database connections
- X-Ray: Service map showing all interconnections, trace timeline showing bottlenecks
The Core Question You’re Answering
“How do I architect a production-ready, multi-service AWS application that is secure, observable, cost-effective, and can scale from 10 to 10,000 users?”
This is the question AWS Solutions Architects answer daily. By building this project, you’ll make every decision they make: compute choices, data storage patterns, security boundaries, monitoring strategies.
Concepts You Must Understand First
Before starting this capstone project, ensure you’ve mastered these concepts from Projects 1-4:
1. VPC Networking (from Project 1)
- Why you need separate VPCs for dev/staging/prod (or VPC peering if shared)
- How to design CIDR blocks that don’t overlap between environments
- Book Reference: Review Project 1 concepts before starting
2. Serverless Architecture (from Project 2)
- Lambda cold starts and how they affect user-facing APIs
- Step Functions orchestration patterns for complex workflows
- S3 event-driven processing
- Book Reference: “AWS Lambda in Action” by Danilo Poccia, Ch. 4, 8
3. Auto-Scaling and Load Balancing (from Project 3)
- When to use Lambda vs ECS for different workloads
- RDS connection pooling with auto-scaled services
- Book Reference: “Designing Data-Intensive Applications” Ch. 1
4. Container Orchestration (from Project 4)
- ECS Fargate for background workers vs Lambda for short tasks
- Container security and IAM task roles
- Book Reference: “AWS for Solutions Architects” Ch. 9
5. IAM Security Model
- Principle of least privilege for service-to-service communication
- Resource-based policies vs identity-based policies
- Book Reference: “AWS Security” by Dylan Shield, Ch. 3
6. Cognito Authentication
- User pools vs identity pools
- JWT token validation in API Gateway
- Reference: Cognito Developer Guide
Questions to Guide Your Design
Architecture Decisions:
- When should you use Lambda vs ECS Fargate?
- Hint: Lambda for < 15 min, stateless, event-driven. ECS for long-running, WebSockets, background workers
- How do you handle secrets across multiple services?
- Hint: Secrets Manager with automatic rotation, reference in Lambda/ECS via ARN
- Should you use RDS or DynamoDB for your data tier?
- Hint: RDS for relational queries (users, orders), DynamoDB for high-throughput key-value (sessions, caching)
- How do you structure Terraform for multi-environment deployments?
- Option 1: Terraform workspaces (shared code, different state)
- Option 2: Separate directories per environment (more explicit)
Security Questions:
- How do you prevent API Gateway from being called without authentication?
- Hint: Cognito authorizer on API Gateway, validates JWT on every request
- What IAM permissions does Lambda need to call Step Functions?
- Hint: `states:StartExecution` on the state machine ARN
- How do you secure S3 buckets while allowing CloudFront access?
- Hint: Origin Access Identity (OAI) or Origin Access Control (OAC), bucket policy allowing only CloudFront
Cost Questions:
- What’s the most expensive component in your architecture?
- Likely answer: Multi-AZ RDS Aurora (can be $100+/month), NAT Gateway ($0.045/GB), Fargate tasks running 24/7
- How do you reduce Lambda costs?
- Right-size memory (Lambda Power Tuning), reduce package size, use ARM64 Graviton2 processors (20% cost savings)
- Should you use on-demand or provisioned capacity for DynamoDB?
- On-demand for unpredictable traffic, provisioned for steady predictable loads (cheaper at scale)
Thinking Exercise
The “Multi-Tenant Data Isolation” Problem
Your SaaS has 100 customers. Should you:
- A) Use one RDS database with a `tenant_id` column in every table?
- B) Create a separate RDS instance per customer?
- C) Create a separate database per customer on a shared RDS cluster?
- D) Use DynamoDB with partition keys including `tenant_id`?
Trace the trade-offs: Cost, isolation, backup complexity, query performance, schema changes.
Most teams choose: C for high-value customers (compliance/isolation), A for small customers (cost-effective). This is called a “tiered” multi-tenant architecture.
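For option D, a hedged sketch of tenant-scoped keys in DynamoDB (the table name and key schema are assumptions):

```python
# Sketch: prefix the partition key with the tenant ID so every query is
# naturally scoped to a single tenant.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app-data")  # placeholder table name

def put_order(tenant_id: str, order_id: str, item: dict) -> None:
    table.put_item(Item={
        "pk": f"TENANT#{tenant_id}",
        "sk": f"ORDER#{order_id}",
        **item,
    })

def list_orders(tenant_id: str):
    resp = table.query(
        KeyConditionExpression=Key("pk").eq(f"TENANT#{tenant_id}")
        & Key("sk").begins_with("ORDER#")
    )
    return resp["Items"]
```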
The Interview Questions They’ll Ask
Basic Level:
- “Explain the difference between Cognito User Pools and Identity Pools.”
- User Pools: Authentication (sign up, sign in, get JWTs). Identity Pools: Authorization (exchange tokens for AWS credentials to access S3/DynamoDB directly)
- “What’s the maximum Lambda execution time and how do you handle longer tasks?”
- 15 minutes. For longer: Use ECS Fargate, or chain multiple Lambda invocations via Step Functions
- “How does API Gateway integrate with Lambda?”
- Proxy integration: API Gateway passes HTTP request as event to Lambda. Lambda returns HTTP response format
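A minimal sketch of that proxy-integration contract (the claim lookup assumes a Cognito authorizer on a REST API; field values are illustrative):

```python
# Sketch: Lambda proxy integration - API Gateway passes the HTTP request as
# `event`, and the function must return this response shape.
import json

def handler(event, context):
    claims = (event.get("requestContext", {})
                   .get("authorizer", {})
                   .get("claims", {}))
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": "hello", "userId": claims.get("sub", "anonymous")}),
    }
```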
Intermediate Level:
- “Your API has 1000 req/sec. Should you use Lambda or ECS?”
- Lambda can handle it (burst concurrency 1000-3000), but if consistent: consider Provisioned Concurrency or ECS for cost-effectiveness
- “How do you implement zero-downtime database migrations in RDS?”
- Blue/Green deployments with RDS, or use Aurora Serverless v2 with instant scaling
- “Explain how CloudFront caching works with S3 origins.”
- CloudFront caches at edge locations based on TTL. S3 bucket policy allows only CloudFront OAI. Cache invalidations cost $0.005 per path
- “What’s the difference between Step Functions Standard and Express workflows?”
- Standard: Long-running (up to 1 year), exactly-once execution, audit history. Express: Short-lived (<5 min), high-volume (100k/sec), at-least-once, cheaper
Advanced Level:
- “Your Lambda functions in VPC have 10-second cold starts. Debug and fix.”
- VPC cold starts = ENI provisioning. Solutions: Use VPC endpoints for AWS services, increase Lambda memory, consider Hyperplane ENIs (modern approach), or move Lambda out of VPC if possible
- “You have 100 concurrent Lambda functions connecting to RDS. What breaks?”
- RDS max_connections exhausted (typically 150-500). Fix: Use RDS Proxy for connection pooling/multiplexing
- “Design a disaster recovery strategy for this SaaS platform (RTO < 1 hour, RPO < 15 minutes).”
- Multi-region: Aurora Global Database (replicates to secondary region), S3 cross-region replication, Route53 health checks for failover. RPO: Aurora replication lag typically <1 second. RTO: Automated failover + Terraform apply in DR region
- “How do you handle API versioning (v1, v2) without breaking existing clients?”
- API Gateway stages (v1, v2) pointing to different Lambda versions, or path-based routing (/v1/, /v2/). Use Lambda aliases for gradual rollout
- “Your CloudFront distribution has a 20% cache hit ratio (should be 80%+). Why?”
- Query strings/headers breaking cache keys, dynamic content not cacheable, TTL too short, incorrect Cache-Control headers from origin
Hints in Layers
Hint 1: Start with one environment (dev) in one region. Don’t build all 3 environments at once; get dev working end-to-end first.
Hint 2: Use Terraform modules for reusable components
module "vpc" {
source = "./modules/vpc"
environment = var.environment
cidr_block = var.vpc_cidr
}
module "api" {
source = "./modules/api-gateway"
cognito_user_pool_id = module.auth.user_pool_id
}
Hint 3: Enable CORS on API Gateway for web apps. If your frontend calls the API from a browser, CORS is required:
responses:
default:
headers:
Access-Control-Allow-Origin: "'*'"
Access-Control-Allow-Headers: "'Content-Type,Authorization'"
Hint 4: Use Lambda Powertools for structured logging
from aws_lambda_powertools import Logger, Tracer

logger = Logger()
tracer = Tracer()

@logger.inject_lambda_context
@tracer.capture_lambda_handler
def handler(event, context):
    logger.info("Processing request", extra={"userId": event.get("userId")})
    return {"statusCode": 200}
Hint 5: Implement health checks for all services
- Lambda: Create a /health endpoint that checks database connectivity (see the sketch after this list)
- ECS: Configure ALB health checks with reasonable intervals (30s) and unhealthy threshold (3)
- RDS: Use CloudWatch Enhanced Monitoring
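For the Lambda /health endpoint, a minimal sketch (get_connection() is a hypothetical helper that returns your module-level database connection; the exact check depends on your driver):
import json

def health_handler(event, context):
    try:
        conn = get_connection()  # hypothetical helper returning a reused, module-level connection
        with conn.cursor() as cur:
            cur.execute("SELECT 1")  # cheapest possible connectivity check
        return {"statusCode": 200, "body": json.dumps({"status": "healthy"})}
    except Exception as exc:
        # 503 lets ALB/API Gateway health checks mark the target unhealthy
        return {"statusCode": 503, "body": json.dumps({"status": "unhealthy", "error": str(exc)})}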
Books That Will Help
| Book | Author(s) | Relevant Chapters | Why It Helps |
|---|---|---|---|
| AWS for Solutions Architects | Saurabh Shrivastava | Ch. 8 (Serverless), Ch. 9 (Containers), Ch. 10-12 (SaaS Architecture) | Complete AWS architecture patterns for multi-service apps |
| Designing Data-Intensive Applications | Martin Kleppmann | Ch. 1 (Scalability), Ch. 5-7 (Replication, Partitioning), Ch. 11 (Streams) | Foundational understanding of distributed systems, data architecture decisions |
| Terraform: Up & Running | Yevgeniy Brikman | Ch. 3 (State), Ch. 5 (Modules), Ch. 7 (Multi-Environment) | Infrastructure as Code best practices, Terraform workspaces vs directories |
| AWS Security | Dylan Shields | Ch. 3 (IAM), Ch. 5 (Data Protection), Ch. 7 (Monitoring) | Security architecture, secrets management, least privilege principles |
| AWS Lambda in Action | Danilo Poccia | Ch. 4 (Event Sources), Ch. 8 (Performance), Ch. 10 (Deployment) | Lambda optimization, API Gateway integration patterns |
| Site Reliability Engineering | Betsy Beyer et al. (Google) | Ch. 4 (Service Level Objectives), Ch. 6 (Monitoring), Ch. 22 (Cascading Failures) | Production operations mindset, what to monitor and why |
| The DevOps Handbook | Gene Kim et al. | Part III (Feedback), Part IV (CI/CD) | Continuous delivery, deployment strategies, monitoring and observability |
| Production-Ready Microservices | Susan Fowler | Ch. 2-4 (Stability, Reliability, Scalability) | Production readiness checklist for each microservice |
Reading Strategy:
- Start with “AWS for Solutions Architects” Ch. 10-12 for SaaS architecture overview
- Read “Terraform: Up & Running” Ch. 5-7 while building infrastructure
- Reference “AWS Security” Ch. 3 when implementing IAM policies
- Use “Designing Data-Intensive Applications” Ch. 5-7 when designing data tier
- Review “SRE” Ch. 4, 6 after deployment for observability insights
Common Pitfalls & Debugging
Problem 1: “Terraform apply fails with ‘Error creating VPC: VpcLimitExceeded’”
- Why: AWS default limit is 5 VPCs per region, you’ve hit the limit
- Fix: Request VPC quota increase via AWS Service Quotas, or delete unused VPCs
- Quick test: aws ec2 describe-vpcs --query 'length(Vpcs)' shows current VPC count
- Best practice: Use VPC peering or Transit Gateway instead of creating a VPC per environment if hitting limits
Problem 2: “API Gateway returns 403 Forbidden with Cognito authorizer”
- Why: JWT token expired (default 1 hour), invalid token, or authorizer misconfigured
- Fix: Verify token hasn’t expired, check Cognito authorizer is pointing to correct user pool
- Quick test: Decode the JWT at jwt.io and check the exp (expiration) claim
- Common mistake: Sending the access token when the authorizer expects the ID token (or vice versa) - by default a Cognito authorizer validates the ID token; access tokens only work when OAuth scopes are configured
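To check expiration programmatically instead of pasting tokens into jwt.io, a minimal sketch that decodes the payload only (no signature verification - fine for debugging, never for authorization):
import base64, json, time

def is_token_expired(jwt_token: str) -> bool:
    # JWT = header.payload.signature; only the payload claims are needed here
    payload_b64 = jwt_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["exp"] < time.time()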
Problem 3: “Lambda in VPC can’t access DynamoDB - timeout errors”
- Why: No route to DynamoDB from VPC, or NAT Gateway not configured for outbound internet
- Fix: Create a gateway VPC endpoint for DynamoDB: com.amazonaws.REGION.dynamodb (free, no NAT needed) - see the sketch below
- Quick test: From Lambda CloudWatch Logs, check whether the timeout occurs at ~3 seconds (DNS) or ~30 seconds (connectivity)
- Cost savings: VPC endpoints eliminate NAT Gateway charges for AWS service access
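A sketch of creating that gateway endpoint with boto3 (VPC ID, region, and route table ID are placeholders; in this learning path you would normally declare it in Terraform instead):
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoint adds a DynamoDB route to the given route tables - no NAT Gateway needed
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # placeholder
    ServiceName="com.amazonaws.us-east-1.dynamodb",
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder: private subnets' route table
)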
Problem 4: “RDS connection pool exhausted - ‘too many connections’”
- Why: Each Lambda invocation creates new connection, doesn’t reuse across invocations
- Fix: Use RDS Proxy (connection pooling service) between Lambda and RDS
- Alternative: Initialize the database connection OUTSIDE the handler function so it is reused across warm starts (see the sketch below)
- Math: 100 concurrent Lambdas × 5 connections each = 500 connections (exceeds most RDS instance limits)
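The "initialize outside the handler" fix as a minimal sketch (assumes pymysql and credentials in environment variables; with RDS Proxy you would simply point DB_HOST at the proxy endpoint):
import os
import pymysql

# Created once per execution environment and reused across warm invocations
connection = pymysql.connect(
    host=os.environ["DB_HOST"],        # RDS or RDS Proxy endpoint
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
    connect_timeout=5,
)

def handler(event, context):
    with connection.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM users")  # placeholder query
        (count,) = cur.fetchone()
    return {"statusCode": 200, "body": str(count)}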
Problem 5: “S3 CORS errors when uploading from browser - ‘No Access-Control-Allow-Origin header’”
- Why: S3 bucket CORS policy not configured for your frontend domain
- Fix: Add CORS configuration to S3 bucket allowing your CloudFront distribution or API Gateway domain
- Quick test: Check browser console Network tab for preflight OPTIONS request response
- CORS configuration (note: this is a CORS rule set, not a bucket policy):
  {
    "CORSRules": [{
      "AllowedOrigins": ["https://yourapp.com"],
      "AllowedMethods": ["GET", "PUT", "POST"],
      "AllowedHeaders": ["*"],
      "MaxAgeSeconds": 3000
    }]
  }
Problem 6: “CloudFront not serving updated files - still showing old version”
- Why: CloudFront edge caches haven’t expired (default TTL can be 24 hours)
- Fix: Create a CloudFront invalidation: aws cloudfront create-invalidation --distribution-id ID --paths "/*"
- Cost: $0.005 per path invalidated (first 1,000 paths free per month)
- Best practice: Use versioned file names (app.v123.js) instead of invalidations for deployments
Problem 7: “Step Functions execution fails - ‘Lambda function returned unexpected result’”
- Why: Lambda return value doesn’t match expected JSON format, or Lambda threw unhandled exception
- Fix: Check Lambda CloudWatch Logs for actual error, ensure Lambda returns proper JSON
- Quick test: Test Lambda independently via AWS Console Test function
- Step Functions debugging: View execution graph in AWS Console, check input/output of failed state
Problem 8: “Cognito sign-up fails - ‘UsernameExistsException’ but user doesn’t exist in pool”
- Why: User exists in unconfirmed state (signed up but didn’t verify email/phone)
- Fix: Resend confirmation code or delete unconfirmed user via Cognito Console
- Prevention: Implement auto-confirmation for development environments, or send reminder emails for unconfirmed users
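Both remedies are single boto3 calls; a hedged sketch (user pool ID, app client ID, and username are placeholders, and admin_delete_user needs IAM permissions on the pool):
import boto3

cognito = boto3.client("cognito-idp")

# Option 1: resend the confirmation code so the user can finish verification
cognito.resend_confirmation_code(ClientId="YOUR_APP_CLIENT_ID", Username="user@example.com")

# Option 2 (dev environments): remove the stuck, unconfirmed user so they can sign up again
cognito.admin_delete_user(UserPoolId="us-east-1_EXAMPLE", Username="user@example.com")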
Problem 9: “Multi-environment deployment overwrites production accidentally”
- Why: Wrong Terraform workspace selected, or statefile conflict
- Fix: Always verify the workspace before apply: terraform workspace show
- Safety: Use a Terraform backend with a workspace prefix in the S3 key, and enable versioning on the state bucket
- Best practice: Require approval for production deploys, implement CI/CD gates
Problem 10: “AWS bill is $500/month - expected $50”
- Why: Common culprits: NAT Gateway ($32/month + $0.045/GB), Multi-AZ RDS running 24/7, CloudWatch Logs retention
- Fix: Analyze with AWS Cost Explorer, check top 5 services
- Quick wins:
- Use VPC endpoints instead of NAT for AWS services
- Reduce RDS instance size or use Aurora Serverless v2
- Set CloudWatch Logs retention to 7 days (default is never expire)
- Use S3 lifecycle policies to transition old data to Glacier
- Monitoring: Set up billing alarms in CloudWatch for early warning
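A minimal sketch of that billing alarm (billing metrics live only in us-east-1 and require "Receive Billing Alerts" to be enabled on the account; the SNS topic ARN is a placeholder):
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # billing metrics exist only here

cloudwatch.put_metric_alarm(
    AlarmName="monthly-bill-over-50-usd",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,                    # 6 hours; billing data updates a few times a day
    EvaluationPeriods=1,
    Threshold=50.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder SNS topic
)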
Debugging Tools & Techniques:
- AWS X-Ray Service Map: Visual graph showing all service dependencies and error rates
- CloudWatch Logs Insights: Query logs across all services
fields @timestamp, @message | filter @message like /ERROR/ | stats count() by bin(5m)
- CloudWatch Dashboards: Custom dashboard showing API latency, Lambda errors, RDS connections, Step Functions executions
- Terraform Plan: Always run terraform plan before apply to see what will change
- AWS CloudFormation StackSets: For true multi-account/multi-region deployments beyond this project
Key Learning Resources Summary
| Resource | Best For |
|---|---|
| “AWS for Solutions Architects” - Shrivastava et al. | Comprehensive coverage of all services |
| AWS DevOps Zero to Hero | Free hands-on projects |
| HashiCorp Terraform Tutorials | Infrastructure as Code |
| AWS Official Tutorials | Service-specific guides |
| “Designing Data-Intensive Applications” - Kleppmann | Distributed systems concepts |
Sources
Market Statistics & Trends (2025)
- AWS Market Share 2025: Insights into the Buyer Landscape - HG Insights
- AWS Stats 2025: Cloud Market Share & Growth Insights - eSparkInfo
- Global Cloud Market Share Report & Statistics 2025 - TekRevol
- Cloud Market Share Trends to Watch in 2026 - Emma
- Cloud Market Share: A Look at the Cloud Ecosystem in 2025 - Kinsta
- Cloud infrastructure spending hit $102.6 billion in Q3 2025 - IT Pro
- AWS marketplace number of services 2025 - Statista
- Top AWS Stats You Should Know About in 2026 - Simplilearn
- AWS Statistics, User Count and Facts for 2025 - Expanded Ramblings
AWS Lambda Performance & Cold Starts
- AWS Lambda Cold Starts in 2025: When They Matter and What They Cost - EdgeDelta
- Lambda Cold Starts benchmark - maxday
- Understanding and Remediating Cold Starts: An AWS Lambda Perspective - AWS Blog
- AWS Lambda Cold Start Optimization in 2025: What Actually Works - Zircon Tech
- Operating Lambda: Performance optimization – Part 1 - AWS Compute Blog
AWS Learning Resources & Projects
- AWS Practice Projects - Coursera
- AWS DevOps Zero to Hero - GitHub
- Top AWS Project Ideas - KnowledgeHut
- AWS Step Functions Tutorials - AWS Docs
- Provision EKS with Terraform - HashiCorp
- Terraform EKS Tutorial - Spacelift
- 13 Best AWS Books - Hackr.io
AWS Best Practices & Architecture
- AWS VPC Security Best Practices - AWS Docs
- VPC Design Evolution - AWS Architecture Blog
- AWS Well-Architected Framework - AWS Docs
Summary
This learning path covers AWS cloud infrastructure mastery through 5 comprehensive hands-on projects. Here’s the complete list:
| # | Project Name | Main Language | Difficulty | Time Estimate | Key Services |
|---|---|---|---|---|---|
| 1 | Production-Ready VPC from Scratch | HCL (Terraform) | Intermediate | 1-2 weeks | VPC, Subnets, NAT Gateway, Security Groups |
| 2 | Serverless Data Pipeline | Python | Intermediate | 1-2 weeks | Lambda, Step Functions, S3, CloudWatch |
| 3 | Auto-Scaling Web Application | Python/Node.js | Intermediate | 2-3 weeks | EC2, ALB, RDS, Auto Scaling, S3 |
| 4 | Container Platform | YAML/Docker | Advanced | 2-3 weeks | ECS Fargate/EKS, ECR, Service Mesh |
| 5 | Full-Stack SaaS Platform | Python/TypeScript | Advanced | 1-2 months | Multi-service architecture combining all above |
Recommended Learning Paths
For AWS beginners (Cloud-native path):
- Start with Project 1 (VPC) - Foundation of everything
- Move to Project 2 (Serverless) - Modern AWS patterns
- Try Project 3 (Auto-Scaling Web App) - Traditional architectures
- Advance to Project 4 (Containers) - Industry-standard deployments
- Build Project 5 (Full-Stack SaaS) - Comprehensive capstone
For infrastructure engineers (Traditional to cloud):
- Project 1 (VPC) - Translate network knowledge to cloud
- Project 3 (Auto-Scaling Web App) - Familiar EC2-based patterns
- Project 4 (Containers) - Containerization in cloud
- Project 2 (Serverless) - Event-driven paradigm shift
- Project 5 (Full-Stack SaaS) - Architecting at scale
For developers (Application-focused):
- Project 2 (Serverless) - Code without infrastructure
- Project 1 (VPC) - Networking essentials (can’t skip)
- Project 4 (Containers) - Package and deploy apps
- Project 3 (Auto-Scaling Web App) - Full-stack deployment
- Project 5 (Full-Stack SaaS) - Production-ready system
For solutions architects (Design-focused):
- Project 1 (VPC) - Network architecture decisions
- Project 3 (Auto-Scaling Web App) - Scalability patterns
- Project 4 (Containers) - Orchestration trade-offs
- Project 2 (Serverless) - Event-driven design
- Project 5 (Full-Stack SaaS) - Well-Architected Framework application
Expected Outcomes
After completing these projects, you will:
Technical Skills
- Design and deploy production VPCs with proper subnet segmentation, routing, and security layers
- Build serverless event-driven pipelines using Lambda, Step Functions, and S3 with comprehensive error handling
- Architect auto-scaling web applications across multiple availability zones with load balancing and database replication
- Deploy containerized workloads on ECS Fargate or EKS with service discovery and monitoring
- Construct full-stack SaaS platforms integrating multiple AWS services with proper security and observability
Conceptual Understanding
- VPC networking fundamentals: CIDR blocks, route tables, Internet Gateways, NAT Gateways, security groups vs NACLs
- IAM security model: Policy evaluation logic, least privilege, role-based access, cross-account access patterns
- Serverless execution model: Cold starts, concurrent limits, event sources, state management, orchestration
- Compute decision framework: When to use EC2 vs Lambda vs Fargate vs EKS - trade-offs and appropriate use cases
- Data architecture patterns: RDS vs DynamoDB, read replicas, caching strategies, backup and disaster recovery
- Infrastructure as Code: Terraform workflow, state management, module design, multi-environment deployments
- AWS Well-Architected Framework: Operational excellence, security, reliability, performance efficiency, cost optimization
Real-World Capabilities
- Troubleshoot networking issues: Understand VPC Flow Logs, trace packets, diagnose connectivity problems
- Optimize AWS costs: Right-size resources, leverage spot instances, implement lifecycle policies, monitor spending
- Design for failure: Multi-AZ deployments, retry logic, circuit breakers, graceful degradation
- Implement comprehensive observability: CloudWatch metrics/logs/alarms, X-Ray tracing, custom dashboards
- Secure cloud infrastructure: Defense in depth, encryption at rest/in transit, secrets management, audit logging
- Deploy with confidence: CI/CD pipelines, blue-green deployments, canary releases, rollback strategies
Interview & Career Readiness
- Answer AWS architecture questions confidently with real implementation experience
- Discuss trade-offs intelligently: Cost vs performance vs reliability, managed services vs self-managed
- Explain real-world failures you’ve debugged and lessons learned
- Demonstrate hands-on expertise through portfolio projects that employers can review
- Navigate AWS console and CLI efficiently for both development and troubleshooting
- Read and write CloudFormation/Terraform to understand infrastructure definitions in job codebases
Time Investment & Commitment
Total estimated time: 3-6 months depending on:
- Your current AWS/cloud experience (more if starting from zero)
- How deep you go into each project (minimum viable vs production-grade)
- Whether you build all 5 projects or focus on 2-3 most relevant to your goals
- Time spent on debugging, research, and experimentation (the real learning!)
Realistic weekly commitment:
- 10 hours/week: 4-6 months to complete all 5 projects
- 20 hours/week: 2-3 months for comprehensive coverage
- Full-time (40+ hours/week): 1-2 months intensive learning
Remember: The goal isn’t speed—it’s deep understanding. You’re not just deploying resources; you’re internalizing AWS patterns, failure modes, and architectural decision-making that takes most engineers years to develop.
What Makes This Learning Path Different
Unlike tutorials that show you how to click through the AWS console or copy-paste CloudFormation templates:
- You understand the “why” - Every architectural decision is explained with real-world context
- You experience failure - Projects include common pitfalls and debugging sections because breaking things teaches more than success
- You build from scratch - Infrastructure as Code forces you to understand every component explicitly
- You verify understanding - Interview questions, thinking exercises, and self-assessment ensure you’re not just following steps
- You connect to fundamentals - Each project maps to computer science concepts (networking, distributed systems, security)
- You use real tools - Terraform, AWS CLI, CloudWatch—the same stack professional engineers use daily
The AWS Shared Responsibility in Practice
By the end of these projects, you’ll viscerally understand what “shared responsibility model” means:
- AWS manages: Physical infrastructure, hypervisor, availability zones, service APIs
- YOU manage: Everything from the OS up—patching, security groups, IAM policies, data encryption, application code, network topology
When an EC2 instance is compromised because of a weak SSH key, that’s your responsibility. When an Availability Zone or an entire region experiences an outage, the infrastructure failure is AWS’s responsibility; designing the multi-AZ (or multi-region) architecture that survives it is YOUR responsibility.
Your Next Steps
- Set up AWS account with MFA enabled, create IAM user (never use root), configure AWS CLI
- Enable billing alerts to avoid surprise charges ($10 threshold for learning projects)
- Start with Project 1 - Build your first production VPC with Terraform
- Document your journey - Take notes on what you learn, capture screenshots, write blog posts
- Join the community - AWS subreddit, Discord servers, local meetups to discuss challenges and solutions
- Build in public - Push code to GitHub, write README files explaining your architecture decisions
- Iterate and improve - Come back to early projects after finishing later ones and refactor with new knowledge
The Reality Check
AWS is vast. You won’t master all 200+ services, and you don’t need to. These 5 projects give you:
- Deep understanding of 15-20 core services (the 80/20 rule applies)
- Mental models that apply to the other 180+ services
- Confidence to learn new AWS services as needed
- Battle-tested debugging skills from building real systems
Most importantly: After these projects, you won’t be intimidated by AWS. You’ll know how to:
- Read AWS documentation efficiently
- Design architectures that match business requirements
- Debug production issues systematically
- Estimate costs and optimize spending
- Make informed build-vs-buy decisions
You’ll have built 5 working systems you can demo, explain, and expand. That’s infinitely more valuable than “I completed an AWS certification course.”
Now go build something.
This learning path represents approximately 150-200 hours of hands-on building, debugging, and learning. The projects are ordered by dependency, but feel free to adjust based on your interests and career goals. Remember: breaking things is part of the process. Every error message is a teacher.