AWS DEEP DIVE LEARNING PROJECTS
Deep Understanding of AWS Through Building
Goal: Master AWS cloud infrastructure by building real systems that break, fail, and teach you why networking, security, and scalability patterns exist. You’ll go from clicking through the console to understanding why a private subnet needs a NAT Gateway, how IAM policy evaluation actually works, and when to choose Lambda over ECS over EC2.
Why AWS Mastery Requires Building
You’re tackling a massive ecosystem with 200+ services. AWS is not something you can understand by reading documentation—you need to build systems where misconfigured security groups break things, where wrong subnet routing causes timeouts, and where Lambda cold starts ruin your latency. That’s when the concepts become real.
The AWS Learning Problem
Most developers learn AWS backwards:
- Click through console tutorials
- Copy-paste CloudFormation templates
- Wonder why things break in production
- Panic when asked “why did you choose this architecture?”
The right approach:
- Understand the problem each service solves
- Build it wrong first (feel the pain)
- Fix it using IaC (understand every line)
- Break it intentionally (learn failure modes)
- Explain your architecture to others
The AWS Shared Responsibility Model
Before building anything, understand who is responsible for what:
┌─────────────────────────────────────────────────────────────────────────────┐
│ YOUR RESPONSIBILITY │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Customer Data │ │
│ ├───────────────────────────────────────────────────────────────────────┤ │
│ │ Platform, Applications, Identity & Access Management │ │
│ ├───────────────────────────────────────────────────────────────────────┤ │
│ │ Operating System, Network & Firewall Configuration │ │
│ ├───────────────────────────────────────────────────────────────────────┤ │
│ │ Client-Side Data Encryption │ Server-Side Encryption │ Network Traffic│ │
│ │ & Data Integrity Auth │ (File System/Data) │ Protection │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────────────┤
│ AWS RESPONSIBILITY │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Compute │ Storage │ Database │ Networking │ │
│ ├───────────────────────────────────────────────────────────────────────┤ │
│ │ Hardware/AWS Global Infrastructure │ │
│ │ (Regions, Availability Zones, Edge Locations) │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Key insight: AWS manages the physical infrastructure. YOU manage everything from the OS up. If your EC2 instance gets hacked because of a weak password, that’s on you. If an AWS data center floods, that’s on them.
Understanding AWS Architecture: The Mental Model
The Three Pillars of AWS
Every AWS architecture decision comes down to balancing three concerns:
┌─────────────────┐
│ │
│ RELIABILITY │
│ (Multi-AZ, │
│ Redundancy) │
│ │
└────────┬────────┘
│
┌──────────────┴──────────────┐
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ │ │ │
│ COST │◄─────────►│ PERFORMANCE │
│ (Right-sizing, │ │ (Low latency, │
│ Reserved, │ │ High throughput│
│ Spot) │ │ Scaling) │
│ │ │ │
└─────────────────┘ └─────────────────┘
The AWS Well-Architected Trade-off Triangle
Every architecture decision involves trade-offs:
- Multi-AZ RDS is more reliable but costs 2x
- Spot instances are 70% cheaper but can be terminated anytime
- Lambda has zero idle cost but cold starts add latency
AWS Global Infrastructure
Understanding the hierarchy is critical:
┌─────────────────────────────────────────────────────────────────────────────┐
│ AWS GLOBAL │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ Region: us-east-1 (N. Virginia) ││
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ ││
│ │ │ AZ 1 │ │ AZ 2 │ │ AZ 3 │ ... ││
│ │ │ us-east-1a │ │ us-east-1b │ │ us-east-1c │ ││
│ │ │ │ │ │ │ │ ││
│ │ │ ┌────────────┐ │ │ ┌────────────┐ │ │ ┌────────────┐ │ ││
│ │ │ │ Data Center│ │ │ │ Data Center│ │ │ │ Data Center│ │ ││
│ │ │ │ │ │ │ │ │ │ │ │ │ │ ││
│ │ │ │ • EC2 │ │ │ │ • EC2 │ │ │ │ • EC2 │ │ ││
│ │ │ │ • RDS │ │ │ │ • RDS │ │ │ │ • RDS │ │ ││
│ │ │ │ • EBS │ │ │ │ • EBS │ │ │ │ • EBS │ │ ││
│ │ │ └────────────┘ │ │ └────────────┘ │ │ └────────────┘ │ ││
│ │ └────────────────┘ └────────────────┘ └────────────────┘ ││
│ │ ││
│ │ ← Low latency connections between AZs (< 2ms) → ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ Region: eu-west-1 (Ireland) ││
│ │ ...similar AZ structure... ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │
│ Edge Locations (CloudFront): 400+ worldwide for content delivery │
└─────────────────────────────────────────────────────────────────────────────┘
Key insight for projects:
- Region: Contains all your resources, data residency compliance
- AZ (Availability Zone): Isolated data centers, failure boundary
- Multi-AZ: Deploy across 2+ AZs for high availability (if one AZ fails, others continue)
VPC Networking: The Foundation of Everything
Every AWS resource you deploy lives in a network. Understanding VPC is non-negotiable.
What is a VPC?
A Virtual Private Cloud (VPC) is your isolated network within AWS. Think of it as your own private data center in the cloud, with complete control over:
- IP address ranges (CIDR blocks)
- Subnets (network segments)
- Route tables (traffic rules)
- Gateways (internet access)
- Security (firewalls)
The Anatomy of a Production VPC
┌─────────────────────────────────────────────────────────────────────────────────┐
│ VPC: 10.0.0.0/16 (65,536 IP addresses) │
│ │
│ ┌─────────────────────────────────────┐ ┌─────────────────────────────────────┐│
│ │ Availability Zone A (us-east-1a) │ │ Availability Zone B (us-east-1b) ││
│ │ │ │ ││
│ │ ┌─────────────────────────────┐ │ │ ┌─────────────────────────────┐ ││
│ │ │ PUBLIC SUBNET: 10.0.1.0/24 │ │ │ │ PUBLIC SUBNET: 10.0.2.0/24 │ ││
│ │ │ (256 IPs) │ │ │ │ (256 IPs) │ ││
│ │ │ │ │ │ │ │ ││
│ │ │ ┌───────┐ ┌───────┐ │ │ │ │ ┌───────┐ ┌───────┐ │ ││
│ │ │ │Bastion│ │ NAT │ │ │ │ │ │Bastion│ │ NAT │ │ ││
│ │ │ │ Host │ │Gateway│ │ │ │ │ │ (HA) │ │Gateway│ │ ││
│ │ │ └───────┘ └───┬───┘ │ │ │ │ └───────┘ └───┬───┘ │ ││
│ │ └───────────────────┼────────┘ │ │ └───────────────────┼────────┘ ││
│ │ │ │ │ │ ││
│ │ ┌───────────────────┼────────┐ │ │ ┌───────────────────┼────────┐ ││
│ │ │ PRIVATE SUBNET: │ │ │ │ │ PRIVATE SUBNET: │ │ ││
│ │ │ 10.0.10.0/24 │ │ │ │ │ 10.0.20.0/24 │ │ ││
│ │ │ (App Tier) ▼ │ │ │ │ (App Tier) ▼ │ ││
│ │ │ │ │ │ │ │ ││
│ │ │ ┌───────┐ ┌───────┐ │ │ │ │ ┌───────┐ ┌───────┐ │ ││
│ │ │ │ EC2 │ │ EC2 │ │ │ │ │ │ EC2 │ │ EC2 │ │ ││
│ │ │ │ (App) │ │ (App) │ │ │ │ │ │ (App) │ │ (App) │ │ ││
│ │ │ └───────┘ └───────┘ │ │ │ │ └───────┘ └───────┘ │ ││
│ │ └────────────────────────────┘ │ │ └────────────────────────────┘ ││
│ │ │ │ ││
│ │ ┌────────────────────────────┐ │ │ ┌────────────────────────────┐ ││
│ │ │ DATA SUBNET: 10.0.100.0/24 │ │ │ │ DATA SUBNET: 10.0.200.0/24 │ ││
│ │ │ (Database Tier) │ │ │ │ (Database Tier) │ ││
│ │ │ │ │ │ │ │ ││
│ │ │ ┌────────────────────┐ │ │ │ │ ┌────────────────────┐ │ ││
│ │ │ │ RDS Primary │───┼───┼─┼──│ │ RDS Standby │ │ ││
│ │ │ │ (PostgreSQL) │ │ │ │ │ │ (Sync Replica) │ │ ││
│ │ │ └────────────────────┘ │ │ │ │ └────────────────────┘ │ ││
│ │ └────────────────────────────┘ │ │ └────────────────────────────┘ ││
│ └─────────────────────────────────────┘ └─────────────────────────────────────┘│
│ │
│ ┌─────────────────┐ │
│ │ Internet Gateway│ ◄──── Allows public subnets to reach internet │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────┐ │
│ │ Internet│ │
│ └─────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
Understanding CIDR Notation
CIDR (Classless Inter-Domain Routing) defines IP address ranges:
IP Address: 10.0.0.0
CIDR: /16
Binary breakdown:
10.0.0.0/16 means:
├── First 16 bits are FIXED (network portion)
│ 10 . 0 (00001010.00000000)
│
└── Last 16 bits are FREE (host portion)
x.x (00000000.00000000 to 11111111.11111111)
Result: 10.0.0.0 to 10.0.255.255 = 65,536 addresses
Common CIDR blocks for VPC design:
┌─────────┬─────────────────┬──────────────────────────────┐
│ CIDR │ # of IPs │ Use Case │
├─────────┼─────────────────┼──────────────────────────────┤
│ /16 │ 65,536 │ Entire VPC │
│ /20 │ 4,096 │ Large subnet (production) │
│ /24 │ 256 │ Standard subnet │
│ /28 │ 16 │ Small subnet (bastion hosts) │
└─────────┴─────────────────┴──────────────────────────────┘
⚠️ AWS reserves 5 IPs per subnet:
.0 Network address
.1 VPC router
.2 DNS server
.3 Reserved for future use
.255 Broadcast (not supported, but reserved)
So a /24 subnet (256 IPs) actually has 251 usable IPs.
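To make the subnet math concrete, here is a short Python sketch using the standard-library ipaddress module (the CIDR blocks match the diagrams above; the 5 reserved addresses are an AWS rule, so they are subtracted by hand):

import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)             # 65536 addresses in the whole VPC

# Carve the VPC into /24 subnets (the "standard subnet" size above)
subnets = list(vpc.subnets(new_prefix=24))
print(len(subnets))                  # 256 possible /24 subnets
print(subnets[1])                    # 10.0.1.0/24 (public subnet A in the diagrams)

# AWS reserves 5 addresses per subnet (.0, .1, .2, .3 and .255 in a /24)
print(subnets[1].num_addresses - 5)  # 251 usable IPs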
Public vs Private Subnets: The Critical Difference
PUBLIC SUBNET PRIVATE SUBNET
───────────── ──────────────
Route Table: Route Table:
┌────────────────────────────┐ ┌────────────────────────────┐
│ Destination │ Target │ │ Destination │ Target │
├───────────────┼────────────┤ ├───────────────┼────────────┤
│ 10.0.0.0/16 │ local │ │ 10.0.0.0/16 │ local │
│ 0.0.0.0/0 │ igw-xxxxx │◄──IGW │ 0.0.0.0/0 │ nat-xxxxx │◄──NAT
└───────────────┴────────────┘ └───────────────┴────────────┘
Key differences:
┌─────────────────┬──────────────────────┬──────────────────────┐
│ │ Public Subnet │ Private Subnet │
├─────────────────┼──────────────────────┼──────────────────────┤
│ Internet access │ Via Internet Gateway │ Via NAT Gateway │
│ Inbound traffic │ Can receive from web │ Cannot receive │
│ Public IP │ Can have Elastic IP │ No public IP │
│ Use case │ Load balancers, │ App servers, │
│ │ bastion hosts │ databases │
└─────────────────┴──────────────────────┴──────────────────────┘
Security Groups vs NACLs: Two Layers of Defense
┌─────────────────────────────────────────────────────────────────────────────┐
│ NACL (Subnet Level) │
│ - Stateless │
│ - Explicit allow/deny │
│ - Rule numbers (processed in order) │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Security Group (Instance Level) │ │
│ │ - Stateful (return traffic auto-allowed) │ │
│ │ - Allow rules only (implicit deny) │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ EC2 Instance │ │ │
│ │ │ │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Traffic flow example (HTTP request to web server):
1. Request arrives at VPC
2. NACL checks inbound rules (allow port 80?) → Yes? Continue
3. Security Group checks inbound rules → Yes? Continue
4. Request reaches EC2 instance
5. Response leaves EC2 instance
6. Security Group: AUTOMATICALLY allows response → Stateful!
7. NACL checks outbound rules (allow response?) → Must be explicit
8. NACL checks ephemeral port range → Rule needed!
Security Group (stateful): NACL (stateless):
Inbound: Allow TCP 80 Inbound: Allow TCP 80
Outbound: (not needed for response) Outbound: Allow TCP 1024-65535 (ephemeral)
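The same rules expressed with boto3, as a hedged sketch (the sg-/acl- IDs are placeholders for resources you have already created). Notice that the security group needs only the inbound rule, while the NACL also needs an explicit outbound rule for the ephemeral port range:

import boto3

ec2 = boto3.client("ec2")

# Security group (stateful): one inbound rule; return traffic is allowed automatically
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",        # placeholder security group ID
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)

# NACL (stateless): inbound port 80 AND outbound ephemeral ports must both be allowed
ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",  # placeholder NACL ID
    RuleNumber=100, Egress=False, Protocol="6", RuleAction="allow",
    CidrBlock="0.0.0.0/0", PortRange={"From": 80, "To": 80},
)
ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",
    RuleNumber=100, Egress=True, Protocol="6", RuleAction="allow",
    CidrBlock="0.0.0.0/0", PortRange={"From": 1024, "To": 65535},  # response traffic
)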
IAM: The Security Model You Must Master
IAM (Identity and Access Management) controls WHO can do WHAT to WHICH resources.
The IAM Policy Language
Every IAM policy answers these questions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow", ◄─── Allow or Deny?
"Action": [ ◄─── What actions?
"s3:GetObject",
"s3:PutObject"
],
"Resource": [ ◄─── Which resources?
"arn:aws:s3:::my-bucket/*"
],
"Condition": { ◄─── Under what conditions?
"StringEquals": {
"aws:RequestedRegion": "us-east-1"
}
}
}
]
}
IAM Policy Evaluation Logic
When AWS evaluates permissions, it follows this order:
┌─────────────────────────────────────────────────────────────────────────────┐
│ IAM Policy Evaluation │
│ │
│ 1. By default, all requests are DENIED (implicit deny) │
│ │ │
│ ▼ │
│ 2. Check all applicable policies │
│ ┌──────────────┬──────────────┬──────────────┬──────────────┐ │
│ │ Identity │ Resource │ Permission │ Service │ │
│ │ Policies │ Policies │ Boundaries │ Control │ │
│ │ (IAM user/ │ (S3 bucket, │ (IAM) │ Policies │ │
│ │ role) │ SQS queue) │ │ (Org level) │ │
│ └──────────────┴──────────────┴──────────────┴──────────────┘ │
│ │ │
│ ▼ │
│ 3. If ANY policy has explicit DENY ──────────────► DENIED (final) │
│ │ │
│ ▼ │
│ 4. If ANY policy has explicit ALLOW ─────────────► ALLOWED │
│ │ │
│ ▼ │
│ 5. If no explicit allow found ───────────────────► DENIED (implicit) │
│ │
│ Remember: Explicit DENY always wins! │
└─────────────────────────────────────────────────────────────────────────────┘
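The decision flow above fits in a few lines of Python. This is a deliberately simplified sketch (resource matching, conditions, permission boundaries, and SCPs are omitted), but it encodes the rule that matters most: an explicit Deny beats any number of Allows.

def evaluate(statements, action):
    """Simplified IAM evaluation: explicit Deny > explicit Allow > implicit Deny."""
    decision = "ImplicitDeny"            # 1. every request starts out denied
    for stmt in statements:              # 2. scan all applicable statements
        if action in stmt["Action"]:
            if stmt["Effect"] == "Deny":
                return "ExplicitDeny"    # 3. explicit deny is final
            decision = "Allow"           # 4. explicit allow, unless a deny shows up
    return decision                      # 5. no match at all -> implicit deny

statements = [
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"]},
    {"Effect": "Deny",  "Action": ["s3:PutObject"]},
]
print(evaluate(statements, "s3:GetObject"))     # Allow
print(evaluate(statements, "s3:PutObject"))     # ExplicitDeny
print(evaluate(statements, "s3:DeleteObject"))  # ImplicitDeny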
IAM Roles vs Users vs Groups
┌─────────────────────────────────────────────────────────────────────────────┐
│ IAM USERS │
│ - Permanent credentials (access key + secret key) │
│ - Used for: Human users, CLI access │
│ - Best practice: Use MFA, rotate keys regularly │
│ │
│ ┌─────────┐ │
│ │ User │───► Has access keys, belongs to groups │
│ └─────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ IAM GROUPS │
│ - Collection of users │
│ - Policies attached to group apply to all members │
│ - Used for: Organizing users by job function (Developers, Admins) │
│ │
│ ┌─────────┐ │
│ │ Group │───► Contains users, has policies │
│ │(Devs) │ ┌──────┐ ┌──────┐ ┌──────┐ │
│ └─────────┘ │User A│ │User B│ │User C│ │
│ └──────┘ └──────┘ └──────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ IAM ROLES │
│ - Temporary credentials (auto-rotated by AWS) │
│ - Can be ASSUMED by: EC2, Lambda, ECS, other AWS accounts, SAML users │
│ - Used for: Service-to-service authentication, cross-account access │
│ │
│ ┌─────────┐ ┌─────────────────────────────────────────┐ │
│ │ Role │ │ Trust Policy: WHO can assume this role │ │
│ │(Lambda) │◄────────│ Permissions: WHAT they can do │ │
│ └─────────┘ └─────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Lambda function assumes role → Gets temp credentials → Accesses S3 │
└─────────────────────────────────────────────────────────────────────────────┘
Best Practice Hierarchy:
┌──────────────────────────────────────────────────────────────┐
│ PREFER ROLES OVER USERS │
│ │
│ ✓ Roles: Temporary creds, auto-rotated, no keys to leak │
│ ✗ Users: Permanent creds, must rotate manually, can leak │
│ │
│ EC2 instances? Use Instance Profile (role wrapper) │
│ Lambda? Use Execution Role │
│ ECS tasks? Use Task Role │
│ Cross-account? Use AssumeRole │
└──────────────────────────────────────────────────────────────┘
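To see why roles are preferred, here is a hedged boto3 sketch of assuming a role through STS (the role ARN and session name are placeholders). The credentials that come back are temporary and expire on their own, which is exactly what instance profiles, execution roles, and task roles do for you behind the scenes:

import boto3

sts = boto3.client("sts")

resp = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/my-app-role",  # placeholder ARN
    RoleSessionName="learning-session",
    DurationSeconds=3600,                                   # credentials expire after 1 hour
)
creds = resp["Credentials"]
print(creds["Expiration"])   # temporary by design: nothing long-lived to leak

# Use the temporary credentials for a scoped client
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)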
Serverless Architecture: Lambda, Step Functions & Event-Driven Design
The Lambda Execution Model
Understanding Lambda’s lifecycle is critical for performance:
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAMBDA EXECUTION LIFECYCLE │
│ │
│ COLD START (first invocation or after idle) │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ 1. Download code 2. Start runtime 3. Initialize handler │ │
│ │ from S3 (Python, Node) (your imports) │ │
│ │ ~100ms ~200ms ~500-3000ms │ │
│ │ │ │
│ │ TOTAL COLD START: 800ms - 5s depending on package size │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ WARM START (execution environment reused) │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ Environment already running, just execute handler │ │
│ │ TOTAL WARM START: <100ms │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ EXECUTION CONTEXT REUSE │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ # This code runs ONCE (cold start only) │ │
│ │ import boto3 │ │
│ │ s3_client = boto3.client('s3') # Reused across invocations! │ │
│ │ │ │
│ │ # This code runs EVERY invocation │ │
│ │ def handler(event, context): │ │
│ │ s3_client.get_object(...) # Uses pre-initialized client │ │
│ │ │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │
│ LAMBDA LIMITS: │
│ ┌──────────────────────────┬───────────────────────────────────────────┐ │
│ │ Max execution time │ 15 minutes │ │
│ │ Max memory │ 10 GB │ │
│ │ Max /tmp storage │ 10 GB (ephemeral, NOT persistent) │ │
│ │ Max deployment package │ 50 MB (zipped), 250 MB (unzipped) │ │
│ │ Max concurrent executions│ 1000 (soft limit, can increase) │ │
│ └──────────────────────────┴───────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
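A small sketch that makes the cold/warm distinction visible in practice: module-level code runs only during the INIT phase, so a module-level flag reveals whether an invocation paid the cold-start penalty (the bucket name is a placeholder):

import time
import boto3

# INIT phase: runs once per execution environment (this is the cold start)
s3_client = boto3.client("s3")   # reused by every warm invocation
_cold_start = True

def handler(event, context):
    global _cold_start
    start = time.time()
    was_cold = _cold_start       # True only for the first invocation in this environment
    _cold_start = False

    # INVOKE phase: runs every time, reusing the pre-initialized client
    s3_client.list_objects_v2(Bucket="my-pipeline-bucket", MaxKeys=1)  # placeholder bucket

    return {"cold_start": was_cold,
            "duration_ms": round((time.time() - start) * 1000, 1)}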
Step Functions: Orchestrating Serverless Workflows
┌─────────────────────────────────────────────────────────────────────────────┐
│ STEP FUNCTIONS STATE MACHINE │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ StartState │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌───────────────┐ │ │
│ │ │ ValidateInput │ (Task: Lambda) │ │
│ │ └───────┬───────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌───────────────┐ │ │
│ │ │ Choice │ (Decision point) │ │
│ │ └───────┬───────┘ │ │
│ │ ┌────────┴────────┐ │ │
│ │ ▼ ▼ │ │
│ │ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Valid: Yes │ │ Valid: No │ │ │
│ │ └──────┬──────┘ └──────┬──────┘ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ ProcessData │ │ SendError │ │ │
│ │ │ (Task) │ │ Notification │ │ │
│ │ └───────┬───────┘ └───────┬───────┘ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌───────────────┐ End │ │
│ │ │ Parallel │ (Run multiple branches) │ │
│ │ │ ┌───┬───┐ │ │ │
│ │ │ │ A │ B │ │ │ │
│ │ │ └───┴───┘ │ │ │
│ │ └───────┬───────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌───────────────┐ │ │
│ │ │ SaveResults │ │ │
│ │ │ (Task) │ │ │
│ │ └───────┬───────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ End │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ STATE TYPES: │
│ ┌──────────┬────────────────────────────────────────────────────────────┐ │
│ │ Task │ Execute Lambda, ECS task, API call, etc. │ │
│ │ Choice │ If/else branching based on input │ │
│ │ Parallel │ Execute multiple branches simultaneously │ │
│ │ Map │ Iterate over array, execute state for each item │ │
│ │ Wait │ Delay for specified time │ │
│ │ Pass │ Pass input to output (useful for transformations) │ │
│ │ Succeed │ Terminal state indicating success │ │
│ │ Fail │ Terminal state indicating failure │ │
│ └──────────┴────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
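State machines are written in ASL (JSON). As a hedged sketch, here is a cut-down version of the workflow above expressed as a Python dict and registered with boto3; the Lambda and IAM role ARNs are placeholders:

import json
import boto3

definition = {
    "StartAt": "ValidateInput",
    "States": {
        "ValidateInput": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-input",
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2, "IntervalSeconds": 5}],
            "Next": "IsValid",
        },
        "IsValid": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.valid", "BooleanEquals": True, "Next": "ProcessData"}],
            "Default": "SendErrorNotification",
        },
        "ProcessData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-data",
            "End": True,
        },
        "SendErrorNotification": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:send-error",
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="DataProcessingPipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-execution-role",  # placeholder
)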
Compute Options: When to Use What
┌─────────────────────────────────────────────────────────────────────────────┐
│ AWS COMPUTE DECISION TREE │
│ │
│ What are you running? │
│ │ │
│ ├──► Event-driven, short tasks (<15 min)? │
│ │ │ │
│ │ └──► Lambda (serverless, pay per invocation) │
│ │ │
│ ├──► Containerized application? │
│ │ │ │
│ │ ├──► Need Kubernetes? ─────► EKS │
│ │ │ │
│ │ └──► AWS-native? ──────────► ECS │
│ │ │ │
│ │ ├──► Don't want to manage servers? ──► Fargate │
│ │ └──► Need GPU/custom instances? ─────► EC2 │
│ │ │
│ └──► Traditional application, full OS control? │
│ │ │
│ └──► EC2 │
│ │ │
│ ├──► Predictable workload? ──► Reserved Instances │
│ ├──► Flexible timing? ───────► Spot Instances │
│ └──► Unknown/variable? ──────► On-Demand │
│ │
│ COST COMPARISON (approximate, varies by region): │
│ ┌──────────────────┬─────────────────────┬────────────────────────────┐ │
│ │ Service │ Pricing Model │ Best For │ │
│ ├──────────────────┼─────────────────────┼────────────────────────────┤ │
│ │ Lambda │ $0.20/1M requests │ Infrequent, event-driven │ │
│ │ │ + compute time │ tasks │ │
│ │ Fargate │ ~$0.04/vCPU-hour │ Containers without EC2 │ │
│ │ EC2 On-Demand │ ~$0.04/hr (t3.medium)│ Unknown workloads │ │
│ │ EC2 Reserved │ ~40% cheaper │ Predictable, 1-3 year │ │
│ │ EC2 Spot │ ~70% cheaper │ Flexible, interruptible │ │
│ └──────────────────┴─────────────────────┴────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
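A back-of-envelope comparison makes the trade-off tangible. The sketch below uses approximate us-east-1 list prices (Lambda at $0.20 per million requests plus roughly $0.0000166667 per GB-second; a t3.medium at roughly $0.0416 per hour); treat the numbers as illustrative and check current pricing:

# Rough monthly cost: Lambda (pay per use) vs a single always-on EC2 instance
requests_per_month = 3_000_000
avg_duration_s = 0.2
memory_gb = 0.5

lambda_cost = (requests_per_month / 1_000_000) * 0.20 \
            + requests_per_month * avg_duration_s * memory_gb * 0.0000166667

ec2_cost = 0.0416 * 24 * 30   # t3.medium running 24/7, busy or idle

print(f"Lambda: ${lambda_cost:.2f}/month")   # ~$5.60 for this workload
print(f"EC2:    ${ec2_cost:.2f}/month")      # ~$29.95 regardless of traffic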
Core Concept Analysis
AWS services break down into these fundamental building blocks:
| Domain | Core Concepts to Internalize |
|---|---|
| Networking (VPC) | CIDR blocks, public/private subnets, route tables, NAT gateways, Internet gateways, security groups vs NACLs, VPC peering, Transit Gateway |
| Compute (EC2) | Instance types, AMIs, user data, auto-scaling groups, launch templates, Elastic IPs, placement groups |
| Containers (ECS/EKS) | Task definitions, services, clusters, Fargate vs EC2 launch types, service discovery, Kubernetes control plane |
| Serverless (Lambda) | Event sources, execution context, cold starts, layers, concurrency, IAM execution roles |
| Orchestration (Step Functions) | State machines, ASL (Amazon States Language), error handling, retries, parallel/map states |
| Storage (S3) | Buckets, objects, policies, versioning, lifecycle rules, storage classes, presigned URLs |
| Infrastructure as Code | CloudFormation, Terraform, resource dependencies, state management, drift detection |
Concept Summary Table
Before diving into projects, internalize these AWS concept clusters. Each project will force you to understand multiple concepts simultaneously:
| Concept Cluster | What You Need to Internalize |
|---|---|
| VPC & Networking | CIDR blocks and subnet math (how /16, /24 work), public vs private subnets (routing differences), route tables (0.0.0.0/0 means “default route”), Internet Gateway (public subnet requirement), NAT Gateway (private subnet internet access), security groups (stateful, instance-level), NACLs (stateless, subnet-level), VPC peering (connecting VPCs), Transit Gateway (hub-and-spoke networking) |
| Compute (EC2) | Instance types (compute vs memory vs storage optimized), AMIs (golden images), user data (bootstrap scripts on launch), instance profiles (IAM roles for EC2), auto-scaling groups (elasticity), launch templates (instance configuration), Elastic IPs (static public IPs), placement groups (low latency) |
| Serverless (Lambda) | Execution model (stateless, ephemeral), cold starts (initialization penalty), warm starts (reusing execution context), event sources (what triggers Lambda), layers (shared code/dependencies), concurrency (parallel executions), execution roles (IAM permissions), timeout limits (15 min max), memory allocation (128MB-10GB), /tmp storage (512 MB by default, configurable up to 10 GB, ephemeral) |
| Containers (ECS/EKS) | Task definitions (container configuration), services (long-running tasks), clusters (logical grouping), Fargate vs EC2 launch types (serverless vs managed instances), awsvpc networking mode (each task gets ENI), service discovery (DNS-based), ALB target groups (container load balancing), EKS control plane (managed Kubernetes), ECR (container registry) |
| Orchestration (Step Functions) | State machines (workflow as code), ASL (Amazon States Language - JSON), states (Task, Choice, Parallel, Map, Wait, Pass, Fail, Succeed), error handling (Retry, Catch), parallel execution, map state (dynamic parallelism), Express vs Standard workflows |
| Storage (S3) | Buckets (global namespace), objects (key-value), S3 API operations (PUT, GET, DELETE, LIST), versioning (immutable history), lifecycle rules (transition/expiration), storage classes (Standard, IA, Glacier), presigned URLs (temporary access), CORS (cross-origin access), bucket policies vs IAM policies, S3 Select (query in place) |
| Security (IAM) | Policies (JSON documents), principals (who), actions (what), resources (where), conditions (when), identity-based policies (attached to users/roles), resource-based policies (attached to resources like S3 buckets), roles (temporary credentials), instance profiles (EC2 role wrapper), service-linked roles, policy evaluation logic (explicit deny wins) |
| Infrastructure as Code | Terraform state (source of truth), state locking (prevent concurrent modifications), CloudFormation stacks (grouped resources), drift detection (config vs reality), resource dependencies (implicit vs explicit), modules/nested stacks (reusability), workspaces/stack sets (multi-environment), import (existing resources into IaC) |
| Observability | CloudWatch Logs (log aggregation), log groups and streams, CloudWatch Metrics (time-series data), custom metrics, CloudWatch Alarms (threshold-based alerts), CloudWatch Dashboards (visualization), X-Ray tracing (distributed request tracking), segments and subsegments, service maps, VPC Flow Logs (network traffic), CloudTrail (API audit logs) |
Why this matters: AWS services don’t exist in isolation. When you build a VPC, you’re also configuring security groups, IAM roles, and CloudWatch logs. When you deploy Lambda, you need to understand IAM execution roles, VPC networking (if accessing RDS), and CloudWatch for debugging. These concepts interconnect constantly.
Deep Dive Reading by Concept
Map your project work to specific reading. Don’t read these books cover-to-cover upfront—use them as references when you hit specific challenges in your projects.
VPC & Networking Fundamentals
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| CIDR and IP Addressing | “The Linux Programming Interface” - Michael Kerrisk | Ch. 59 (Sockets: Internet Domains) | Understand IP addresses, subnetting, CIDR notation at the networking fundamentals level |
| VPC Architecture Patterns | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 3: Networking on AWS | Best comprehensive coverage of VPC design, multi-AZ patterns, and network segmentation |
| Security Groups vs NACLs | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 4: Security on AWS | Stateful vs stateless firewall rules, when to use each |
| NAT Gateway Design | AWS Architecture Blog: VPC Design Evolution | Full Article | Real-world patterns for scaling NAT, costs, and HA |
| Routing Deep Dive | AWS VPC User Guide | Route Tables Section | Official documentation on route table priority, longest prefix matching |
Compute (EC2) Deep Dive
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| EC2 Instance Types | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 6: Compute Services | When to choose compute-optimized vs memory-optimized vs storage-optimized |
| Auto Scaling Architecture | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 6: Compute Services | Scaling policies, health checks, target tracking vs step scaling |
| AMI Best Practices | AWS EC2 User Guide | AMIs Section | Golden images, versioning, cross-region copying |
| Instance Metadata Service | AWS EC2 User Guide | Instance Metadata Section | How user data works, IMDSv2, security implications |
Serverless (Lambda & Step Functions)
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| Lambda Execution Model | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 8: Serverless Architecture | Cold starts, execution context reuse, concurrency models |
| Event-Driven Architecture | “Designing Data-Intensive Applications” - Martin Kleppmann | Ch. 11: Stream Processing | Foundational understanding of event-driven systems, exactly-once processing |
| Step Functions State Machines | AWS Step Functions Developer Guide | ASL Specification | Learn the state machine language, error handling patterns |
| Lambda Best Practices | AWS Lambda Developer Guide | Best Practices Section | Performance optimization, error handling, testing strategies |
Containers (ECS/EKS)
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| ECS Architecture | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 9: Container Services | Task definitions, Fargate vs EC2 launch type, service discovery |
| Kubernetes Fundamentals | Amazon EKS Best Practices Guide | Full Guide | Networking, security, observability for EKS |
| Container Networking | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 9: Container Services | awsvpc mode, VPC integration, load balancer integration |
| Service Mesh Concepts | AWS App Mesh Documentation | What is App Mesh | Service-to-service communication, retries, circuit breakers |
Storage (S3) Deep Dive
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| S3 Data Model | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 5: Storage Services | Object storage concepts, consistency model, versioning |
| S3 Performance | S3 Performance Guidelines | Full Document | Request rate optimization, multipart upload, transfer acceleration |
| S3 Security | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 5: Storage Services | Bucket policies, ACLs, presigned URLs, encryption options |
| Lifecycle Management | S3 Lifecycle Configuration | Lifecycle Section | Storage class transitions, expiration policies, cost optimization |
Security & IAM
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| IAM Policy Fundamentals | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 4: Security on AWS | Policy structure, evaluation logic, least privilege |
| IAM Roles Deep Dive | AWS IAM User Guide | Roles Section | Trust policies, assume role, instance profiles, cross-account access |
| Security Best Practices | AWS Security Best Practices | Full Document | MFA, key rotation, policy conditions, service control policies |
| Secrets Management | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 4: Security on AWS | Secrets Manager vs Parameter Store, rotation strategies |
Infrastructure as Code
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| Terraform Fundamentals | HashiCorp Terraform Tutorials | Get Started on AWS | State management, resource dependencies, modules |
| CloudFormation Deep Dive | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 13: Infrastructure as Code | Stack operations, drift detection, change sets |
| Terraform State Management | Terraform State Documentation | State Section | Remote state, locking, workspaces, state migration |
| IaC Best Practices | Terraform Best Practices | Full Guide | Module structure, naming conventions, security scanning |
Observability & Monitoring
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| CloudWatch Fundamentals | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 12: Monitoring and Logging | Logs, metrics, alarms, dashboards |
| Distributed Tracing | “Designing Data-Intensive Applications” - Martin Kleppmann | Ch. 1: Reliable, Scalable, Maintainable | Understanding observability in distributed systems |
| X-Ray Deep Dive | AWS X-Ray Developer Guide | Full Guide | Trace segments, service maps, sampling rules |
| Log Analysis Patterns | CloudWatch Logs Insights Tutorial | Insights Query Syntax | Query language, aggregation, performance debugging |
Multi-Service Architecture
| Topic | Book/Resource | Chapter/Section | Why Read This |
|---|---|---|---|
| Well-Architected Framework | AWS Well-Architected Framework | All 6 Pillars | Operational Excellence, Security, Reliability, Performance, Cost, Sustainability |
| Distributed Systems Patterns | “Designing Data-Intensive Applications” - Martin Kleppmann | Ch. 8-12 | Replication, partitioning, consistency, consensus, batch vs stream processing |
| SaaS Architecture | “AWS for Solutions Architects” - Saurabh Shrivastava | Ch. 10-11: Advanced Architectures | Multi-tenancy, isolation models, data partitioning strategies |
| Event-Driven Patterns | AWS Serverless Patterns Collection | Browse Patterns | Real-world architectures, EventBridge patterns, async workflows |
How to use this table: When you hit a specific challenge in a project (e.g., “My Lambda can’t access RDS in my VPC”), consult the relevant sections rather than reading books sequentially. Learn just-in-time based on what you’re building.
Project 1: Production-Ready VPC from Scratch (with Terraform)
- File: AWS_DEEP_DIVE_LEARNING_PROJECTS.md
- Programming Language: HCL (Terraform)
- Coolness Level: Level 1: Pure Corporate Snoozefest
- Business Potential: Level 3: The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Cloud Networking / Infrastructure as Code
- Software or Tool: AWS / Terraform
- Main Book: “AWS for Solutions Architects” by Saurabh Shrivastava
What you’ll build: A multi-AZ VPC with public/private subnets, NAT gateways, bastion host, and a deployed web application—all defined in Terraform that you can destroy and recreate at will.
Why it teaches AWS Networking: You cannot understand VPCs by clicking through the console. You need to see what happens when a private subnet has no route to a NAT gateway, when a security group blocks outbound traffic, or when your CIDR blocks overlap with a peered VPC. Building with IaC forces you to explicitly declare every component and understand their relationships.
Core challenges you’ll face:
- CIDR Planning (maps to IP addressing): Designing non-overlapping IP ranges that allow future growth and peering
- Routing Logic (maps to network architecture): Understanding why private subnets route to NAT vs public subnets route to IGW
- Security Layers (maps to defense in depth): Configuring security groups (stateful) vs NACLs (stateless) and knowing when to use each
- High Availability (maps to AZ architecture): Deploying across multiple AZs with redundant NAT gateways
- Bastion Access (maps to secure access patterns): SSH tunneling through a jump host to reach private instances
Resources for key challenges:
- “AWS for Solutions Architects” by Saurabh Shrivastava (Ch. 3-4) - Best coverage of VPC architecture patterns
- AWS VPC Best Practices - Official security recommendations
- One to Many: Evolving VPC Design - AWS Architecture Blog on scaling VPC patterns
Key Concepts:
- Subnets & CIDR: AWS VPC Demystified - MAICOLO
- Security Groups vs NACLs: “AWS for Solutions Architects” Ch. 4 - Shrivastava et al.
- Infrastructure as Code: Provision an EKS cluster - HashiCorp Developer (VPC module patterns)
- NAT Gateway Architecture: AWS Architecture Blog
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Basic AWS console navigation, understanding of IP addresses, familiarity with any IaC tool.
Real world outcome:
- A fully functional VPC you can SSH into via bastion host
- A simple web server accessible via public IP
- Terraform code you can terraform destroy and terraform apply to recreate everything
- VPC Flow Logs showing actual traffic patterns in CloudWatch
Learning milestones:
- First milestone: Deploy VPC with public subnet only, web server reachable from internet → you understand IGW + route tables
- Second milestone: Add private subnet with NAT, move app server there, bastion-only access → you understand public vs private architecture
- Third milestone: Multi-AZ deployment with ALB → you understand high availability patterns
- Final milestone: Add VPC Flow Logs and analyze traffic → you understand network observability
Real World Outcome
This is what your working VPC infrastructure looks like when you complete this project:
# 1. Deploy the infrastructure with Terraform
$ cd terraform/vpc-project
$ terraform init
Initializing provider plugins...
- Downloading hashicorp/aws v5.31.0...
Terraform has been successfully initialized!
$ terraform apply
Plan: 23 to add, 0 to change, 0 to destroy.
Do you want to perform these actions?
Enter a value: yes
aws_vpc.main: Creating...
aws_vpc.main: Creation complete after 2s [id=vpc-0abc123def456789]
aws_internet_gateway.main: Creating...
aws_subnet.public_a: Creating...
aws_subnet.public_b: Creating...
aws_subnet.private_a: Creating...
aws_subnet.private_b: Creating...
...
Apply complete! Resources: 23 added, 0 changed, 0 destroyed.
Outputs:
vpc_id = "vpc-0abc123def456789"
public_subnet_ids = ["subnet-0aaa111", "subnet-0bbb222"]
private_subnet_ids = ["subnet-0ccc333", "subnet-0ddd444"]
bastion_public_ip = "54.123.45.67"
web_server_private_ip = "10.0.10.50"
nat_gateway_ip = "52.87.123.45"
# 2. SSH into the bastion host
$ ssh -i ~/.ssh/aws-bastion.pem ec2-user@54.123.45.67
The authenticity of host '54.123.45.67' can't be established.
Are you sure you want to continue connecting (yes/no)? yes
__| __|_ )
_| ( / Amazon Linux 2
___|\___|___|
[ec2-user@bastion ~]$ hostname
ip-10-0-1-25.ec2.internal
# 3. From bastion, SSH into private web server (SSH agent forwarding)
[ec2-user@bastion ~]$ ssh 10.0.10.50
[ec2-user@webserver ~]$ hostname
ip-10-0-10-50.ec2.internal
# 4. Verify the web server can reach internet via NAT
[ec2-user@webserver ~]$ curl -s https://ifconfig.me
52.87.123.45 # This is the NAT Gateway's IP, NOT the web server's private IP!
# 5. Verify the web server cannot be reached directly from internet
$ curl http://10.0.10.50 # From your local machine
curl: (7) Failed to connect to 10.0.10.50 port 80: No route to host
# CORRECT! Private subnet is not reachable from internet
# 6. Check VPC Flow Logs in CloudWatch
$ aws logs filter-log-events \
--log-group-name /aws/vpc/flowlogs \
--filter-pattern "[version, account, eni, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log_status]" \
--limit 5 \
--profile douglascorrea_io --no-cli-pager
{
"events": [
{
"message": "2 123456789012 eni-0abc123 10.0.1.25 54.123.45.67 22 52341 6 25 4500 1703264400 1703264460 ACCEPT OK",
"ingestionTime": 1703264465000
},
{
"message": "2 123456789012 eni-0def456 10.0.10.50 52.87.123.45 443 32145 6 10 1500 1703264400 1703264460 ACCEPT OK",
"ingestionTime": 1703264465000
}
]
}
# Flow logs show: SSH to bastion (ACCEPT), HTTPS from private instance via NAT (ACCEPT)
# 7. Describe your VPC and subnets
$ aws ec2 describe-vpcs --vpc-ids vpc-0abc123def456789 \
--query 'Vpcs[0].{VpcId:VpcId,CIDR:CidrBlock,State:State}' \
--output table --no-cli-pager
-----------------------------------------
| DescribeVpcs |
+-------+------------------+------------+
| CIDR | State | VpcId |
+-------+------------------+------------+
|10.0.0.0/16| available |vpc-0abc123 |
+-------+------------------+------------+
$ aws ec2 describe-subnets --filters "Name=vpc-id,Values=vpc-0abc123def456789" \
--query 'Subnets[*].{SubnetId:SubnetId,AZ:AvailabilityZone,CIDR:CidrBlock,Public:MapPublicIpOnLaunch}' \
--output table --no-cli-pager
------------------------------------------------------------
| DescribeSubnets |
+--------------+---------------+------------+--------------+
| AZ | CIDR | Public | SubnetId |
+--------------+---------------+------------+--------------+
|us-east-1a |10.0.1.0/24 | True |subnet-0aaa111|
|us-east-1b |10.0.2.0/24 | True |subnet-0bbb222|
|us-east-1a |10.0.10.0/24 | False |subnet-0ccc333|
|us-east-1b |10.0.20.0/24 | False |subnet-0ddd444|
+--------------+---------------+------------+--------------+
# 8. View route tables to understand routing
$ aws ec2 describe-route-tables --filters "Name=vpc-id,Values=vpc-0abc123def456789" \
--query 'RouteTables[*].{RouteTableId:RouteTableId,Routes:Routes[*].{Destination:DestinationCidrBlock,Target:GatewayId||NatGatewayId}}' \
--no-cli-pager
# Public route table: 0.0.0.0/0 → igw-xxxxx (Internet Gateway)
# Private route table: 0.0.0.0/0 → nat-xxxxx (NAT Gateway)
# 9. Clean up when done (saves money!)
$ terraform destroy
Plan: 0 to add, 0 to change, 23 to destroy.
Do you want to perform these actions?
Enter a value: yes
aws_instance.web_server: Destroying...
aws_nat_gateway.main: Destroying...
...
Destroy complete! Resources: 23 destroyed.
AWS Console View:
- VPC Dashboard showing your VPC with proper CIDR block
- Subnet map showing public subnets with Internet Gateway icon, private subnets with NAT icon
- Route tables tab showing explicit routes for each subnet type
- Security Groups showing inbound/outbound rules
- VPC Flow Logs showing real network traffic in CloudWatch
The Core Question You’re Answering
“What IS a VPC, and how does network traffic actually flow between the internet, public subnets, private subnets, and databases?”
Before you write any Terraform code, sit with this question. Most developers have a vague sense that “VPCs are networks” but can’t explain:
- Why a public subnet can receive traffic from the internet but a private subnet cannot
- What the difference between a route table entry and a security group rule is
- Why you need a NAT Gateway (and why it costs money) for private instances to download packages
- How data flows when a user hits your website: ALB → EC2 → RDS and back
This project forces you to confront these questions because misconfiguration means your infrastructure doesn’t work.
Concepts You Must Understand First
Stop and research these before coding:
- CIDR Blocks and Subnetting
- What does 10.0.0.0/16 actually mean in binary?
- How many IP addresses are in a /24 subnet?
- Why does AWS reserve 5 IPs per subnet?
- What happens if two VPCs have overlapping CIDR blocks and you try to peer them?
- Book Reference: “Computer Systems: A Programmer’s Perspective” Ch. 11 (Network Programming) - Bryant & O’Hallaron
- Routing and the Default Gateway (0.0.0.0/0)
- What does 0.0.0.0/0 mean in a route table?
- When a packet leaves an EC2 instance, how does the VPC know where to send it?
- What’s the difference between the local route and an igw-xxxxx route?
- Why does longest prefix match matter? (See the route-matching sketch after this concept list.)
- Book Reference: “TCP/IP Illustrated, Volume 1” Ch. 9 (IP Routing) - W. Richard Stevens
- Internet Gateway vs NAT Gateway
- What does “stateful NAT” mean?
- Why can’t you just put an Internet Gateway on a private subnet?
- How does a NAT Gateway translate private IPs to public IPs?
- Why does the NAT Gateway need to be in a public subnet?
- Book Reference: “AWS for Solutions Architects” Ch. 3 - Saurabh Shrivastava
- Security Groups (Stateful) vs NACLs (Stateless)
- What does “stateful” mean in terms of network connections?
- If you allow inbound port 80, do you need an outbound rule for the response?
- Why do NACLs need explicit rules for ephemeral ports?
- When would you use a NACL instead of just security groups?
- Book Reference: “AWS for Solutions Architects” Ch. 4 (Security on AWS) - Saurabh Shrivastava
- High Availability and Availability Zones
- What is an Availability Zone physically?
- Why do you need subnets in multiple AZs?
- What happens if us-east-1a fails but you only deployed to that AZ?
- How does an ALB distribute traffic across AZs?
- Book Reference: “AWS for Solutions Architects” Ch. 2 (AWS Global Infrastructure)
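Because the default-route question above is where most people get stuck, here is a short Python sketch of how a route table picks a target: every matching route is considered, and the most specific (longest) prefix wins, which is why 0.0.0.0/0 only catches traffic nothing else claims. The NAT Gateway ID is a placeholder.

import ipaddress

# A private subnet's route table, as in the diagrams above
route_table = {
    "10.0.0.0/16": "local",         # traffic for the VPC stays local
    "0.0.0.0/0":   "nat-0abc123",   # everything else: default route to the NAT Gateway
}

def route(destination_ip):
    """Longest-prefix match: the most specific matching route wins."""
    dest = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr).prefixlen, target)
        for cidr, target in route_table.items()
        if dest in ipaddress.ip_network(cidr)
    ]
    return max(matches)[1]           # highest prefix length = most specific

print(route("10.0.20.15"))      # local (the /16 beats the /0)
print(route("151.101.1.69"))    # nat-0abc123 (only the default route matches)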
Questions to Guide Your Design
Before implementing, think through these:
- CIDR Planning
- How many IP addresses will you need now? In 5 years?
- What if you need to add a third AZ later?
- What if you need to peer with another VPC that uses 10.0.0.0/16?
- How will you segment: web tier, app tier, database tier?
- Public vs Private Decisions
- What resources MUST be in public subnets? (Hint: very few)
- What resources should NEVER be in public subnets? (Hint: databases!)
- How will private resources get software updates from the internet?
- Bastion Host Design
- Why use a bastion instead of putting EC2 in public subnet with SSH?
- How do you secure the bastion itself?
- Should you use Session Manager instead of SSH? (Yes, probably)
- Security Group Strategy
- Can you reference one security group from another?
- How do you allow app servers to talk to databases without opening the database to everything?
- What’s the principle of least privilege in security group terms?
- Cost Considerations
- How much does a NAT Gateway cost per hour? Per GB transferred?
- Do you need one NAT Gateway per AZ or can you share?
- What’s the trade-off between cost and availability?
Thinking Exercise
Before coding, draw this diagram on paper:
Your Task: Fill in the missing pieces
Internet
│
▼
┌───────────────────────────────────────────────────────────────┐
│ VPC: 10.0.0.0/16 │
│ │
│ ┌─────────────────────────┐ ┌─────────────────────────────┐ │
│ │ Public Subnet A │ │ Public Subnet B │ │
│ │ CIDR: ____________ │ │ CIDR: ____________ │ │
│ │ │ │ │ │
│ │ What goes here? │ │ What goes here? │ │
│ │ ___________________ │ │ ___________________ │ │
│ └─────────────────────────┘ └─────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────┐ ┌─────────────────────────────┐ │
│ │ Private Subnet A │ │ Private Subnet B │ │
│ │ CIDR: ____________ │ │ CIDR: ____________ │ │
│ │ │ │ │ │
│ │ What goes here? │ │ What goes here? │ │
│ │ ___________________ │ │ ___________________ │ │
│ └─────────────────────────┘ └─────────────────────────────┘ │
│ │
│ Route Table (Public): │
│ ___________________ → ___________________ │
│ ___________________ → ___________________ │
│ │
│ Route Table (Private): │
│ ___________________ → ___________________ │
│ ___________________ → ___________________ │
└────────────────────────────────────────────────────────────────┘
Questions while drawing:
- Why does each public subnet need a different CIDR?
- What AWS resource creates the connection to the internet?
- How does traffic from the private subnet reach the internet?
- What happens to the response traffic?
The Interview Questions They’ll Ask
Prepare to answer these:
- “Walk me through how a request from a user’s browser reaches your web server in a private subnet.”
- Expected: User → Internet → ALB (public subnet) → EC2 (private subnet) → Response reverses
- “Why is your database in a private subnet? How does it get software updates?”
- Expected: Security - no direct internet access. Updates via NAT Gateway or VPC endpoints.
- “Your app in us-east-1a can’t reach the database in us-east-1b. What do you check?”
- Expected: Security groups (allow from app SG?), NACLs (if modified), Route tables, VPC peering (if different VPCs)
- “What’s the difference between an Internet Gateway and a NAT Gateway?”
- Expected: IGW provides two-way communication (public IPs). NAT provides outbound-only for private resources.
- “Your NAT Gateway bill is $500/month. How do you reduce it?”
- Expected: Check if you have one per AZ (maybe share), use VPC endpoints for AWS services (S3, DynamoDB), check for excessive data transfer
- “Explain security groups vs NACLs. When would you use each?”
- Expected: SG = stateful, instance-level, allow rules only. NACL = stateless, subnet-level, allow/deny rules. Use SG for app logic, NACL for subnet-wide rules or explicit denies.
- “What CIDR block would you use for a VPC? Why?”
- Expected: RFC 1918 private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16). Consider future growth and peering requirements.
Hints in Layers
Hint 1: Start with the VPC resource
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "production-vpc"
}
}
Hint 2: Create subnets with data source for AZs
data "aws_availability_zones" "available" {
state = "available"
}
resource "aws_subnet" "public" {
count = 2
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
availability_zone = data.aws_availability_zones.available.names[count.index]
map_public_ip_on_launch = true # This makes it "public"
tags = {
Name = "public-${count.index + 1}"
}
}
Hint 3: Internet Gateway requires explicit route
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
}
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
}
resource "aws_route_table_association" "public" {
count = length(aws_subnet.public)
subnet_id = aws_subnet.public[count.index].id
route_table_id = aws_route_table.public.id
}
Hint 4: NAT Gateway needs Elastic IP first
resource "aws_eip" "nat" {
domain = "vpc"
}
resource "aws_nat_gateway" "main" {
allocation_id = aws_eip.nat.id
subnet_id = aws_subnet.public[0].id # NAT must be in PUBLIC subnet
depends_on = [aws_internet_gateway.main]
}
Hint 5: Security Group allowing SSH from bastion only
resource "aws_security_group" "web" {
name = "web-server-sg"
vpc_id = aws_vpc.main.id
ingress {
from_port = 22
to_port = 22
protocol = "tcp"
security_groups = [aws_security_group.bastion.id] # Only from bastion!
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
Books That Will Help
| Topic | Book | Specific Chapters | Why It Helps |
|---|---|---|---|
| VPC Fundamentals | “AWS for Solutions Architects” by Saurabh Shrivastava | Ch. 3: Networking on AWS | Best comprehensive coverage of VPC architecture patterns and design decisions |
| Security Groups & IAM | “AWS for Solutions Architects” by Saurabh Shrivastava | Ch. 4: Security on AWS | Security group vs NACL differences, IAM roles for EC2 |
| TCP/IP Fundamentals | “TCP/IP Illustrated, Volume 1” by W. Richard Stevens | Ch. 1-3, 9 (IP Routing) | Deep understanding of how packets flow and routing works |
| CIDR & IP Addressing | “Computer Networks” by Tanenbaum & Wetherall | Ch. 5: Network Layer | Mathematical foundation of IP addressing and subnetting |
| Terraform Basics | “Terraform: Up & Running” by Yevgeniy Brikman | Ch. 2-3: Terraform State | Managing infrastructure as code, state management |
| Network Security | “The Linux Programming Interface” by Michael Kerrisk | Ch. 59-61: Sockets | Understanding network connections at the OS level |
| AWS Well-Architected | AWS Well-Architected Framework (free) | Security Pillar | Official AWS best practices for secure VPC design |
Reading strategy:
- Start with “AWS for Solutions Architects” Ch. 3 (VPC overview)
- Read “TCP/IP Illustrated” Ch. 9 if routing concepts are unclear
- Refer to “Terraform: Up & Running” as you write your code
- Use AWS Well-Architected Security Pillar as a checklist
Project 2: Serverless Data Pipeline (Lambda + Step Functions + S3)
- File: AWS_DEEP_DIVE_LEARNING_PROJECTS.md
- Main Programming Language: Python
- Alternative Programming Languages: TypeScript, Go, Java
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: Level 3: The “Service & Support” Model
- Difficulty: Level 2: Intermediate (The Developer)
- Knowledge Area: Serverless, Event-Driven Architecture
- Software or Tool: AWS Lambda, Step Functions, S3
- Main Book: “AWS for Solutions Architects” by Shrivastava et al.
What you’ll build: An automated data processing pipeline that triggers when files land in S3, orchestrates multiple Lambda functions through Step Functions, handles errors gracefully, and outputs processed results—all without a single server to manage.
Why it teaches Serverless: Serverless is not “just deploy a function.” It’s about understanding event-driven architecture, dealing with cold starts, managing state across stateless functions, and designing for failure. Step Functions force you to think about workflow as a first-class concept.
Core challenges you’ll face:
- Event-Driven Triggers (maps to decoupled architecture): Configuring S3 event notifications to invoke Lambda
- State Machine Design (maps to workflow orchestration): Modeling your pipeline as explicit states with transitions
- Error Handling (maps to resilience): Implementing retries, catch blocks, and fallback logic in Step Functions
- Lambda Limits (maps to serverless constraints): Working within 15-minute timeout, /tmp storage, memory limits
- IAM Execution Roles (maps to least-privilege security): Granting Lambda only the permissions it needs
Resources for key challenges:
- Create a Serverless Workflow with Step Functions - AWS Official Tutorial
- Mastering AWS Step Functions - DataCamp comprehensive guide
- AWS Step Functions Deep Dive - Shukla Chandan
Key Concepts:
- Event-Driven Architecture: “AWS for Solutions Architects” Ch. 8 - Shrivastava et al.
- Step Functions State Machine: AWS Step Functions Getting Started - AWS Docs
- Lambda Execution Model: AWS Lambda Developer Guide - AWS Docs
- S3 Event Notifications: “AWS for Solutions Architects” Ch. 5
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Basic Python/Node.js, understanding of JSON, Project 1 completed (VPC knowledge).
Real world outcome:
- Drop a CSV file into an S3 bucket and watch it get processed automatically
- Step Functions visual console showing your workflow executing in real-time
- Processed output appearing in a destination bucket
- CloudWatch logs showing each Lambda invocation
- Error handling that sends notifications when things fail
Learning milestones:
- First milestone: Single Lambda triggered by S3 upload → you understand event sources
- Second milestone: Chain 3 Lambdas via Step Functions → you understand state machines
- Third milestone: Add parallel processing and error handling → you understand resilience patterns
- Final milestone: Add SNS notifications and CloudWatch alarms → you understand observability in serverless
Real World Outcome
This is what your working pipeline looks like in action:
# Upload a CSV file to trigger the pipeline
$ aws s3 cp sales_data.csv s3://my-pipeline-bucket/input/ --profile douglascorrea_io --no-cli-pager
upload: ./sales_data.csv to s3://my-pipeline-bucket/input/sales_data.csv
# Check Step Functions execution status
$ aws stepfunctions list-executions \
--state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:DataProcessingPipeline \
--profile douglascorrea_io --no-cli-pager
{
"executions": [
{
"executionArn": "arn:aws:states:us-east-1:123456789012:execution:DataProcessingPipeline:execution-2024-12-20-15-30-45",
"name": "execution-2024-12-20-15-30-45",
"status": "SUCCEEDED",
"startDate": "2024-12-20T15:30:45.123Z",
"stopDate": "2024-12-20T15:32:10.456Z"
}
]
}
# View CloudWatch logs for the validation Lambda
$ aws logs tail /aws/lambda/validate-data --since 5m --profile douglascorrea_io --no-cli-pager
2024-12-20T15:30:46.234Z START RequestId: abc123-def456 Version: $LATEST
2024-12-20T15:30:46.567Z INFO Validating CSV structure for sales_data.csv
2024-12-20T15:30:46.789Z INFO Found 1247 records, all valid
2024-12-20T15:30:47.012Z END RequestId: abc123-def456
2024-12-20T15:30:47.045Z REPORT Duration: 811ms Billed Duration: 900ms Memory: 512MB Max Memory Used: 128MB
# View the processed output
$ aws s3 ls s3://my-pipeline-bucket/output/ --profile douglascorrea_io --no-cli-pager
2024-12-20 15:32:10 45632 processed_sales_data.json
2024-12-20 15:32:10 1248 processing_summary.json
Step Functions Visual Workflow (in AWS Console shows real-time execution):
- ValidateData Lambda: SUCCESS (Duration: 811ms)
- TransformData Lambda: SUCCESS (Duration: 58.2s)
- Parallel Processing: Both branches SUCCESS
- CalculateStatistics: 2.3s
- GenerateReport: 1.8s
- WriteOutput Lambda: SUCCESS (Duration: 245ms)
- SNS Notification: Delivered
CloudWatch Logs Insights showing complete pipeline flow with timestamps, error-free execution, and performance metrics for each Lambda invocation.
The Core Question You’re Answering
“How do I build systems that react to events without managing servers, and how do I coordinate multiple functions reliably?”
More specifically:
- How do I design a system where dropping a file automatically triggers processing without a cron job?
- How do I chain multiple processing steps when each Lambda is stateless and limited to 15 minutes?
- How do I handle failures gracefully when any step might time out or throw an error?
- How do I pass data between Lambdas without coupling them tightly?
- How do I make my pipeline idempotent so reprocessing the same file produces the same result?
Concepts You Must Understand First
1. Event-Driven Architecture vs Request-Response
Request-Response (traditional):
Client → Server → Database → Server → Client
(synchronous, client waits for entire operation)
Event-Driven (serverless):
Event Producer → Event → Event Consumer
(asynchronous, producer fires and forgets)
Key: Your S3 upload doesn’t “call” your Lambda. It emits an event. This decoupling makes serverless scalable but harder to debug.
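For reference, this is roughly what the consumer side looks like: a minimal, hedged sketch of a handler that unpacks the S3 event notification payload (bucket and key names here are placeholders):

import urllib.parse
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # S3 "fires and forgets"; the Lambda reacts to whatever records arrive
    for record in event["Records"]:                # one event can carry several records
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        obj = s3.get_object(Bucket=bucket, Key=key)
        print(f"New object: s3://{bucket}/{key} ({obj['ContentLength']} bytes)")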
2. Lambda Execution Model (Cold Starts, Execution Context Reuse)
Lambda lifecycle:
- INIT Phase (cold start): Download code, start environment, initialize runtime (100ms-3s)
- INVOKE Phase: Run your handler function
- SHUTDOWN Phase: Environment terminated after idle timeout
Context reuse:
# Runs ONCE per execution environment (cold start)
import boto3
s3_client = boto3.client('s3')

# Runs EVERY invocation (warm or cold)
def handler(event, context):
    s3_client.get_object(Bucket='my-bucket', Key='file.csv')
Why it matters: Cold starts add latency. Initialize connections outside the handler to reuse them.
3. State Machines and Workflow Orchestration
With Step Functions:
- Visual workflow in console
- Declarative retry/error handling
- State maintained between steps
- Execution history for debugging
Step Functions is your distributed transaction coordinator.
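As a rough sketch of what "declarative retry/error handling" looks like, here is a minimal state machine with Retry and Catch blocks, registered via boto3 for brevity (in this project you would declare it with Terraform; all ARNs and names below are placeholders):

```python
# Hedged sketch: one Task state that retries throttling and routes hard failures
# to an alerting state. ARNs are placeholders, not this project's real resources.
import json
import boto3

definition = {
    "StartAt": "TransformData",
    "States": {
        "TransformData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-data",
            "Retry": [
                {   # retry transient throttling with exponential backoff
                    "ErrorEquals": ["Lambda.TooManyRequestsException"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Catch": [
                {   # anything else routes to a failure state
                    "ErrorEquals": ["States.ALL"],
                    "Next": "SendAlert",
                }
            ],
            "End": True,
        },
        "SendAlert": {"Type": "Fail", "Cause": "Transformation failed"},
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="DataProcessingPipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-execution-role",  # placeholder
)
```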
4. Idempotency and Exactly-Once Processing
The problem: S3 events can be delivered more than once. Step Functions retries can duplicate processing.
Idempotent design:
import hashlib
import json
import botocore.exceptions

# (fragment that lives inside your handler)
# Generate deterministic ID from input
idempotency_key = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()

# Check if already processed
try:
    s3_client.head_object(Bucket='results', Key=f'processed/{idempotency_key}.json')
    return {"status": "already_processed"}
except botocore.exceptions.ClientError as err:
    # head_object reports a missing key as a 404 ClientError, not NoSuchKey
    if err.response['Error']['Code'] != '404':
        raise
    # Not processed yet, continue
5. IAM Execution Roles vs Resource-Based Policies
- Execution Role (what Lambda can do): s3:GetObject, s3:PutObject
- Resource-Based Policy (who can invoke Lambda): Allow S3 service to invoke
You need BOTH for S3 event triggers to work.
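A minimal sketch of the resource-based half, assuming a function named validate-data and the pipeline bucket from earlier (the execution role is attached separately through IAM):

```python
# Grant the S3 service permission to invoke the function.
# Function name, bucket, and account ID are placeholders.
import boto3

lambda_client = boto3.client("lambda")
lambda_client.add_permission(
    FunctionName="validate-data",
    StatementId="AllowS3Invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn="arn:aws:s3:::my-pipeline-bucket",   # restrict to one bucket
    SourceAccount="123456789012",                  # guard against confused-deputy access
)
```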
Questions to Guide Your Design
- What triggers your first Lambda? S3 event notification? SNS? EventBridge?
- How do you pass data between Lambda functions? Via Step Functions state? S3? SQS?
- What happens when a Lambda times out mid-execution? Does Step Functions retry? From scratch or resume?
- How do you handle partial failures? If 3 out of 5 parallel tasks succeed, proceed or fail?
- What errors are transient vs permanent? ThrottlingException (retry) vs InvalidDataFormat (alert)?
- What if the input file is 5GB? Lambda has 10GB /tmp max, 15-minute timeout.
- How do you maintain data lineage? Which output came from which input?
- How do you debug a failed execution? Step Functions history? CloudWatch Logs?
Thinking Exercise
Draw your event flow diagram:
S3 Bucket (input/)
|
| S3 Event Notification
↓
Lambda: ValidateData
|
↓
Step Functions: DataProcessingStateMachine
|
├→ Choice: Valid data?
| ├─ NO → Lambda: SendAlert → END
| └─ YES ↓
|
├→ Lambda: TransformData
↓
├→ Parallel State:
| ├─ Lambda: CalculateStatistics
| └─ Lambda: GenerateReport
↓
├→ Lambda: WriteOutput
↓
└→ SNS: SendSuccessNotification
Now answer:
- Where does data live at each stage?
- What happens if execution is paused for 1 year? (max execution time)
- Maximum file size your pipeline can handle?
- Cost for processing 1000 files?
The Interview Questions They’ll Ask
Q: “Explain synchronous vs asynchronous Lambda invocations.”
Expected answer:
- Synchronous: Caller waits, gets return value (API Gateway, ALB)
- Asynchronous: Caller gets 202 Accepted, Lambda processes later (S3, SNS)
- Async has built-in retry (2x) and dead-letter queue support
- In your pipeline: S3 invokes ValidateData async, Step Functions invokes Lambdas synchronously
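A hedged boto3 sketch of the two invocation types; the function name and payload are illustrative, not this project's exact contract:

```python
import json
import boto3

lambda_client = boto3.client("lambda")
payload = json.dumps({"bucket": "my-pipeline-bucket", "key": "input/sales_data.csv"})

# Synchronous: caller blocks until the function returns its result
resp = lambda_client.invoke(
    FunctionName="validate-data",
    InvocationType="RequestResponse",
    Payload=payload,
)
print(json.load(resp["Payload"]))   # the handler's return value

# Asynchronous: Lambda queues the event and returns HTTP 202 immediately
resp = lambda_client.invoke(
    FunctionName="validate-data",
    InvocationType="Event",
    Payload=payload,
)
print(resp["StatusCode"])  # 202
```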
Q: “How do you handle Lambda timeout mid-processing?”
Expected answer:
- Chunked processing: Read in chunks, maintain cursor in DynamoDB
- Recursive invocation: Lambda invokes itself with updated state
- Step Functions Map state: Split into chunks, process in parallel
- Alternative: Use Fargate for long-running tasks (no 15min limit)
Q: “What are Lambda cold starts and how did they affect your pipeline?”
Expected answer:
- Cold start = INIT phase (100ms-3s) when scaling up or after idle
- Mitigation: Provisioned concurrency, smaller packages, init code outside handler
- In your pipeline: “Measured ~800ms cold starts, acceptable for batch processing”
Q: “Why use Step Functions instead of chaining Lambdas?”
Expected answer:
- Visual workflow, declarative error handling, state management
- Execution history shows inputs/outputs/durations
- Parallel, Map, Wait states built-in
Q: “How do you pass large datasets between Step Functions states?”
Expected answer:
- Step Functions has a 256 KB payload limit on state input/output
- Store data in S3, pass S3 URI in state
- Use ResultPath to merge results without duplicating data
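One way to sketch the "pass an S3 URI, not the payload" pattern; the bucket, key, and the transform() helper below are hypothetical placeholders:

```python
# The state output stays tiny while the real data lives in S3.
import json
import uuid
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    records = transform(event)                      # hypothetical helper producing a large list
    key = f"intermediate/{uuid.uuid4()}.json"
    s3.put_object(
        Bucket="my-pipeline-bucket",
        Key=key,
        Body=json.dumps(records).encode(),
    )
    # Return only the reference; stays well under the 256 KB state limit
    return {"resultLocation": {"bucket": "my-pipeline-bucket", "key": key}}
```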
Q: “IAM permissions for S3 to trigger Lambda?”
Expected answer:
- Lambda execution role (what Lambda can do): s3:GetObject, logs:CreateLogGroup
- Lambda resource-based policy (who can invoke): Allow S3 service, condition on bucket ARN
- S3 bucket notification config pointing to Lambda ARN
Q: “Cost for running pipeline 1000 times/day?”
Expected answer breakdown:
- Lambda: Invocations ($0.20/1M) + Duration ($0.0000166667/GB-sec)
- Step Functions: State transitions ($25/1M)
- S3: Requests + Storage
- Show calculation
Hints in Layers
Hint 1.1: What triggers the pipeline? Use S3 Event Notifications with object created events. Configure filter by prefix (input/) and suffix (.csv).
Hint 1.2: How should Lambdas communicate? Use Step Functions to orchestrate. Pass small metadata in state, store large data in S3.
Hint 2.1: Lambda function structure
import boto3
from aws_lambda_powertools import Logger, Tracer
logger = Logger()
tracer = Tracer()
s3_client = boto3.client('s3') # Initialize outside handler
@logger.inject_lambda_context
@tracer.capture_lambda_handler
def handler(event, context):
    bucket = event['bucket']
    key = event['key']
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # Process...
    return {"valid": True, "rowCount": 1247}
Hint 3.1: S3 event not triggering Lambda? Check:
- CloudWatch Logs for Lambda
- S3 notification config: aws s3api get-bucket-notification-configuration
- Lambda resource-based policy: aws lambda get-policy
- Event filter matches your file
Hint 3.2: Lambda cold starts too slow?
- Check package size
- Lazy import heavy libraries
- Use Lambda Layers
- Consider Provisioned Concurrency
Books That Will Help
| Topic | Book | Specific Chapters | Why It Helps |
|---|---|---|---|
| Serverless Architecture | “AWS for Solutions Architects” by Saurabh Shrivastava | Ch. 8-9: Serverless, Containers vs Lambda | When to use Lambda vs Fargate, event-driven patterns |
| Distributed Systems | “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 11-12: Stream Processing | Event-driven architecture, idempotency, exactly-once semantics |
| Lambda Deep Dive | “AWS Lambda in Action” by Danilo Poccia | Ch. 2, 4, 8: First Lambda, Data Streams, Optimization | Cold start optimization, event source integration |
| Step Functions | AWS Step Functions Developer Guide | Error Handling, ASL Specification | State machine syntax, retry/catch patterns |
| IAM Security | “AWS Security” by Dylan Shield | Ch. 3, 5: IAM, Data Protection | Lambda execution roles, resource-based policies |
| Terraform | “Terraform: Up & Running” by Yevgeniy Brikman | Ch. 3, 7: State, Multiple Providers | Deploy Step Functions + Lambda as code |
| Observability | “Practical Monitoring” by Mike Julian | Ch. 4, 6: Applications, Alerting | Metrics and alerts for serverless |
| Cost Optimization | “AWS Cost Optimization” by Brandon Carroll | Ch. 5, 7: Lambda, Storage | Memory tuning, S3 lifecycle policies |
Reading strategy:
- Start with “AWS for Solutions Architects” Ch. 8 (serverless overview)
- Read “Designing Data-Intensive Applications” Ch. 11 (event-driven concepts)
- Dive into “AWS Lambda in Action” Ch. 4 (event sources) and Ch. 8 (optimization)
- Refer to Step Functions Developer Guide when writing state machine
- Use “Terraform: Up & Running” Ch. 3 as you build infrastructure
Project 3: Auto-Scaling Web Application (EC2 + ALB + RDS + S3)
- File: AWS_DEEP_DIVE_LEARNING_PROJECTS.md
- Programming Language: HCL (Terraform)
- Coolness Level: Level 1: Pure Corporate Snoozefest
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Cloud Infrastructure / Scalability
- Software or Tool: AWS (EC2, ALB, RDS)
- Main Book: “AWS for Solutions Architects” by Saurabh Shrivastava
What you’ll build: A traditional multi-tier web application with load-balanced EC2 instances that scale based on CPU/request metrics, backed by RDS (Aurora or PostgreSQL), with static assets served from S3/CloudFront.
Why it teaches EC2 & Core AWS: This is the “classic” AWS architecture. Understanding auto-scaling groups, launch templates, health checks, and how ALB distributes traffic is foundational. You’ll also learn why RDS simplifies database ops and how S3+CloudFront offloads static content.
Core challenges you’ll face:
- Launch Templates (maps to instance configuration): Defining AMI, instance type, user data scripts, IAM instance profiles
- Auto Scaling Policies (maps to elasticity): Configuring scaling based on CloudWatch metrics (CPU, request count, custom)
- Health Checks (maps to reliability): Understanding EC2 vs ELB health checks and how they affect scaling
- Database Connectivity (maps to networking): RDS in private subnet, security group rules from app tier only
- Static Asset Optimization (maps to performance): S3 origin with CloudFront distribution, cache behaviors
Resources for key challenges:
- AWS Hands-On Guide: Build and Deploy Full Cloud Architecture - Udemy step-by-step
- AWS DevOps Zero to Hero - GitHub free resource
- “AWS for Solutions Architects” by Shrivastava (Ch. 6-7) - Multi-tier architecture patterns
Key Concepts:
- Auto Scaling Groups: “AWS for Solutions Architects” Ch. 6
- Application Load Balancer: AWS ALB Documentation
- RDS Multi-AZ: “AWS for Solutions Architects” Ch. 7
- CloudFront CDN: CloudFront Developer Guide
Difficulty: Intermediate Time estimate: 2-3 weeks Prerequisites: Project 1 (VPC), basic web development, familiarity with databases
Real world outcome:
- A working web application URL you can share
- Watch instances spin up when you load test (use hey or Apache Bench)
- See instances terminate when load decreases
- Access your database via RDS console or psql through bastion
- Fast-loading static assets via CloudFront with cache hits
Learning milestones:
- First milestone: Single EC2 with user data script serving a web app → you understand instance bootstrapping
- Second milestone: Add ALB + 2 instances in target group → you understand load balancing
- Third milestone: Add Auto Scaling with CPU-based policy → you understand elasticity
- Fourth milestone: Add RDS in private subnet → you understand data tier security
- Final milestone: Add S3 + CloudFront for static assets → you understand CDN patterns
Real World Outcome
Here’s what success looks like when you complete this project:
# 1. Deploy the infrastructure
$ terraform apply
...
Apply complete! Resources: 23 added, 0 changed, 0 destroyed.
Outputs:
alb_dns_name = "my-app-alb-123456789.us-east-1.elb.amazonaws.com"
rds_endpoint = "myapp-db.c9akl1.us-east-1.rds.amazonaws.com:5432"
cloudfront_domain = "d111111abcdef8.cloudfront.net"
# 2. Verify the application is running
$ curl http://my-app-alb-123456789.us-east-1.elb.amazonaws.com
<html>
<head><title>My Scalable App</title></head>
<body>
<h1>Hello from instance i-0abc123def456!</h1>
<p>Current time: 2025-12-22 14:32:15</p>
<p>Database connection: OK</p>
</body>
</html>
# 3. Load test to trigger auto-scaling (using 'hey' tool)
$ hey -n 10000 -c 100 http://my-app-alb-123456789.us-east-1.elb.amazonaws.com
Summary:
Total: 45.3216 secs
Slowest: 2.3451 secs
Fastest: 0.0234 secs
Average: 0.4532 secs
Requests/sec: 220.75
Status code distribution:
[200] 10000 responses
# 4. Watch instances scale up in real-time (in another terminal)
$ watch "aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names my-app-asg \
--query 'AutoScalingGroups[0].[DesiredCapacity,MinSize,MaxSize,Instances[*].[InstanceId,LifecycleState]]' \
--output table --no-cli-pager"
Every 2.0s: aws autoscaling...
---------------------------------------------
| DescribeAutoScalingGroups |
+-------------------------------------------+
|| 2 | 1 | 5 || # Started with 2 instances
+-------------------------------------------+
||| i-0abc123def456 | InService |||
||| i-0def789ghi012 | InService |||
+-------------------------------------------+
# After load increases, watch it scale to 4 instances:
+-------------------------------------------+
|| 4 | 1 | 5 || # Scaled up!
+-------------------------------------------+
||| i-0abc123def456 | InService |||
||| i-0def789ghi012 | InService |||
||| i-0jkl345mno678 | InService |||
||| i-0pqr901stu234 | InService |||
+-------------------------------------------+
# 5. Check CloudWatch metrics showing scaling activity
$ aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB \
--metric-name TargetResponseTime \
--dimensions Name=LoadBalancer,Value=app/my-app-alb/50dc6c495c0c9188 \
--start-time 2025-12-22T14:00:00Z \
--end-time 2025-12-22T15:00:00Z \
--period 300 \
--statistics Average \
--no-cli-pager
Datapoints:
- Timestamp: 2025-12-22T14:00:00Z, Average: 0.032
- Timestamp: 2025-12-22T14:05:00Z, Average: 0.125 # Load increasing
- Timestamp: 2025-12-22T14:10:00Z, Average: 0.456 # Scaling triggered
- Timestamp: 2025-12-22T14:15:00Z, Average: 0.089 # Back to normal after scale-up
# 6. Check ALB target health
$ aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-app-tg/50dc6c495c0c9188 \
--no-cli-pager
TargetHealthDescriptions:
- Target:
Id: i-0abc123def456
Port: 80
HealthCheckPort: '80'
TargetHealth:
State: healthy
Reason: Target passed health checks
- Target:
Id: i-0def789ghi012
Port: 80
HealthCheckPort: '80'
TargetHealth:
State: healthy
Reason: Target passed health checks
# 7. View Auto Scaling activity history
$ aws autoscaling describe-scaling-activities \
--auto-scaling-group-name my-app-asg \
--max-records 5 \
--no-cli-pager
Activities:
- ActivityId: 1234abcd-5678-90ef-gh12-ijklmnopqrst
Description: "Launching a new EC2 instance: i-0jkl345mno678"
Cause: "At 2025-12-22T14:12:00Z a monitor alarm TargetTracking-my-app-asg-AlarmHigh-... in state ALARM triggered policy my-app-scaling-policy causing a change to the desired capacity from 2 to 4."
StartTime: 2025-12-22T14:12:15Z
EndTime: 2025-12-22T14:13:42Z
StatusCode: Successful
# 8. Check RDS database connection
$ psql -h myapp-db.c9akl1.us-east-1.rds.amazonaws.com -U dbadmin -d myappdb
Password:
myappdb=> SELECT current_database(), current_user, inet_server_addr();
current_database | current_user | inet_server_addr
------------------+--------------+------------------
myappdb | dbadmin | 10.0.3.45
(1 row)
myappdb=> SELECT COUNT(*) FROM app_requests;
count
-------
10247 # Showing all the requests that were logged
(1 row)
# 9. View CloudFront cache statistics
$ aws cloudfront get-distribution-config \
--id E1234ABCDEFGH \
--query 'DistributionConfig.Origins[0].DomainName' \
--output text --no-cli-pager
my-app-static-assets.s3.us-east-1.amazonaws.com
# Test CloudFront delivery
$ curl -I https://d111111abcdef8.cloudfront.net/images/logo.png
HTTP/2 200
content-type: image/png
content-length: 45678
x-cache: Hit from cloudfront # Cache HIT - fast delivery!
x-amz-cf-pop: SFO5-C1
x-amz-cf-id: abc123...
# 10. After load subsides, watch instances scale down
$ watch "aws autoscaling describe-auto-scaling-groups ..."
+-------------------------------------------+
|| 1 | 1 | 5 || # Scaled back down to minimum
+-------------------------------------------+
||| i-0abc123def456 | InService |||
+-------------------------------------------+
CloudWatch Dashboard View:
- CPU Utilization graph showing spike from 20% → 75% → back to 25%
- Request Count showing 50 req/sec → 800 req/sec → 60 req/sec
- Target Response Time showing latency spike then recovery
- Healthy Host Count showing 2 → 4 → 1 instances over time
- RDS connections showing stable connection pool usage
- ALB HTTP 200 responses at 99.8% (a few timeouts during initial spike)
What You’ll See in AWS Console:
- EC2 Auto Scaling Groups showing scaling activities
- CloudWatch alarms transitioning: OK → ALARM → OK
- ALB target groups with health check status
- RDS Performance Insights showing query patterns
- S3 bucket with static assets
- CloudFront distribution with cache statistics
The Core Question You’re Answering
“How do I build applications that automatically handle 10x traffic spikes without falling over, and intelligently shrink when traffic subsides to save costs?”
This is the fundamental problem that auto-scaling solves. Traditional infrastructure requires you to provision for peak load—meaning you’re paying for idle capacity 90% of the time. Auto-scaling lets you provision for average load and dynamically expand when needed.
Why this matters in the real world:
- E-commerce sites during Black Friday sales
- News sites during breaking news events
- SaaS platforms during business hours (scale down at night)
- API backends handling unpredictable mobile app traffic
- Gaming servers during new release launches
Without auto-scaling, you have two bad options:
- Over-provision: Run 10 servers 24/7 even though you only need them for 2 hours a day → wasted money
- Under-provision: Run 2 servers and accept that your site crashes during traffic spikes → lost revenue
Auto-scaling gives you the best of both worlds: cost efficiency during normal periods, reliability during peaks.
Concepts You Must Understand First
Before diving into implementation, you need to internalize these foundational concepts:
1. Horizontal vs Vertical Scaling
Vertical Scaling (Scale Up): Make your server bigger
- EC2 instance: t3.medium → t3.large → t3.xlarge
- More CPU, more RAM, same single instance
- Limitation: Hard ceiling (largest EC2 instance), requires downtime, single point of failure
Horizontal Scaling (Scale Out): Add more servers
- 1 instance → 3 instances → 10 instances
- Distribute load across multiple machines
- Advantage: Theoretically unlimited, no downtime, fault-tolerant
This project teaches horizontal scaling, which is how modern cloud applications achieve massive scale.
2. Stateless Application Design
Stateless: Each request is independent, no memory of previous requests
# Stateless - GOOD for auto-scaling
@app.route('/api/user/<user_id>')
def get_user(user_id):
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    return jsonify(user)

# Any instance can handle this request
Stateful: Application remembers information between requests
# Stateful - BAD for auto-scaling
user_sessions = {} # In-memory storage
@app.route('/api/cart/add')
def add_to_cart(item_id):
    session_id = request.cookies.get('session')
    user_sessions[session_id].append(item_id)  # Only works if same instance
    return "Added"

# BREAKS when load balancer sends next request to different instance
Why stateless matters for auto-scaling:
- Load balancer can send requests to any instance
- Instances can be terminated without data loss
- New instances can immediately serve traffic
How to handle state:
- Store sessions in Redis/ElastiCache (shared across instances)
- Store sessions in DynamoDB
- Use JWT tokens (state stored client-side)
- Use sticky sessions (not recommended, defeats auto-scaling benefits)
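For example, the JWT option can be sketched with the PyJWT library; the secret and claims below are placeholders, and in production the signing key would come from Secrets Manager rather than source code:

```python
# Sketch of stateless sessions: all "session" state lives in a signed token
# the client carries, so any instance can handle any request.
import time
import jwt  # pip install pyjwt

SECRET_KEY = "replace-me"  # placeholder; load from Secrets Manager in production

def issue_token(user_id: str) -> str:
    claims = {"sub": user_id, "cart": [], "exp": int(time.time()) + 3600}
    return jwt.encode(claims, SECRET_KEY, algorithm="HS256")

def read_token(token: str) -> dict:
    # Any instance can verify the token; no shared session store needed
    return jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
```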
3. Load Balancer Algorithms
Round Robin (default for ALB):
Request 1 → Instance A
Request 2 → Instance B
Request 3 → Instance C
Request 4 → Instance A (back to start)
Pro: Simple, fair distribution Con: Doesn’t account for instance load
Least Outstanding Requests (ALB option):
Instance A: 5 active requests
Instance B: 2 active requests ← Send here
Instance C: 8 active requests
Next request goes to Instance B (least busy)
Pro: Better for variable request durations Con: Requires tracking active connections
Weighted Target Groups:
Instance A: weight 100 (gets 50% of traffic)
Instance B: weight 50 (gets 25% of traffic)
Instance C: weight 50 (gets 25% of traffic)
Use case: Blue/green deployments, A/B testing
4. Health Checks: EC2 vs ELB Health Checks
EC2 Health Check (Auto Scaling Group):
- Checks: Is the instance running? Is the status “OK”?
- Fails if: Instance stopped, hardware failure, status check failures
- Does NOT check: Is the application responding?
ELB Health Check (Application Load Balancer):
- Checks: HTTP GET to /health endpoint, expects 200 response
- Fails if: Application crashed, database unreachable, timeout (default 5 sec)
- More comprehensive than EC2 check
Critical difference:
Scenario: EC2 instance is running, but your web app crashed
EC2 Health Check: PASS (instance is running)
ELB Health Check: FAIL (app not responding)
Result: Auto Scaling thinks instance is healthy, keeps it running
Load balancer marks it unhealthy, stops sending traffic
Solution: Configure ASG to use ELB health checks!
Best practice configuration:
resource "aws_autoscaling_group" "app" {
health_check_type = "ELB" # Use ELB checks, not EC2
health_check_grace_period = 300 # Wait 5 min for app to start
# ... other config
}
resource "aws_lb_target_group" "app" {
health_check {
path = "/health"
interval = 30 # Check every 30 seconds
timeout = 5 # Fail if no response in 5 sec
healthy_threshold = 2 # 2 consecutive passes = healthy
unhealthy_threshold = 3 # 3 consecutive fails = unhealthy
}
}
5. Launch Templates vs Launch Configurations
Launch Configuration (DEPRECATED, but you’ll see it in old code):
- Cannot be modified (must create new version)
- Limited instance type options
- No instance metadata service v2 support
Launch Template (USE THIS):
- Versioned (can modify and rollback)
- Supports mixed instance types
- Supports Spot instances
- Supports newer EC2 features
resource "aws_launch_template" "app" {
name_prefix = "my-app-"
image_id = "ami-0c55b159cbfafe1f0" # Your AMI
instance_type = "t3.medium"
# User data script to configure instance on boot
user_data = base64encode(<<-EOF
#!/bin/bash
cd /opt/myapp
export DB_HOST=${aws_db_instance.main.endpoint}
./start-app.sh
EOF
)
# IAM role for instance
iam_instance_profile {
arn = aws_iam_instance_profile.app.arn
}
# Security group
vpc_security_group_ids = [aws_security_group.app.id]
# Enable detailed monitoring
monitoring {
enabled = true
}
}
Key understanding: Launch template is the blueprint for every instance that auto-scaling creates.
Questions to Guide Your Design
Ask yourself these questions as you build. If you can’t answer them, you don’t understand the architecture yet.
1. How does the ALB know which instances are healthy?
Answer: The ALB continuously sends HTTP requests to each instance’s health check endpoint (e.g., GET /health). If it receives a 200 OK response within the timeout period (default 5 seconds), the instance is marked healthy. If it fails the unhealthy threshold number of times (default 3 consecutive failures), it’s marked unhealthy and removed from rotation.
Follow-up: What should your /health endpoint check?
- Database connectivity?
- Disk space?
- Memory available?
- Dependency services reachable?
Best practice: Start simple (just return 200), then add checks for critical dependencies.
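A minimal sketch of that progression using Flask and SQLAlchemy (both assumed here, along with the placeholder connection string): return 200 while the app is up, 503 when a critical dependency is down so the ALB pulls the instance from rotation.

```python
from flask import Flask, jsonify
from sqlalchemy import create_engine, text

app = Flask(__name__)
# Placeholder connection string; in this project it would point at the RDS endpoint
db_engine = create_engine("postgresql://dbadmin:password@myapp-db.example.com:5432/myappdb")

@app.route("/health")
def health():
    checks = {"app": "ok"}
    try:
        # Only check dependencies whose failure should remove the instance from rotation
        with db_engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        checks["database"] = "ok"
        return jsonify(checks), 200
    except Exception:
        checks["database"] = "unreachable"
        return jsonify(checks), 503

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)
```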
2. What metric should trigger scaling?
Common options:
CPU Utilization (most common starting point):
resource "aws_autoscaling_policy" "scale_up" {
name = "cpu-scale-up"
scaling_adjustment = 1
adjustment_type = "ChangeInCapacity"
cooldown = 300
autoscaling_group_name = aws_autoscaling_group.app.name
}
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
alarm_name = "cpu-utilization-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
period = 120
statistic = "Average"
threshold = 70 # Scale up when CPU > 70%
alarm_actions = [aws_autoscaling_policy.scale_up.arn]
}
Problems with CPU-based scaling:
- I/O-bound applications may never hit CPU threshold
- Doesn’t capture actual user experience (latency)
Request Count Per Target (better for web apps):
resource "aws_cloudwatch_metric_alarm" "requests_high" {
metric_name = "RequestCountPerTarget"
namespace = "AWS/ApplicationELB"
threshold = 1000 # 1000 requests/target in 5 min
# ... more config
}
Target Tracking Scaling (recommended - AWS manages it):
resource "aws_autoscaling_policy" "target_tracking" {
name = "target-tracking-policy"
policy_type = "TargetTrackingScaling"
autoscaling_group_name = aws_autoscaling_group.app.name
target_tracking_configuration {
predefined_metric_specification {
predefined_metric_type = "ASGAverageCPUUtilization"
}
target_value = 50.0 # Keep average CPU at 50%
}
}
Decision criteria:
- Web apps with predictable load per request: Request count
- CPU-intensive tasks: CPU utilization
- Most cases: Target tracking with CPU, let AWS handle it
- Advanced: Custom metrics (latency from your app logs)
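If you go the custom-metrics route, the application publishes its own datapoints and an alarm or target-tracking policy scales on them. A hedged sketch (namespace, metric, and ASG name are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def report_latency(latency_ms: float) -> None:
    # Publish an application-level latency datapoint CloudWatch can alarm on
    cloudwatch.put_metric_data(
        Namespace="MyApp",
        MetricData=[
            {
                "MetricName": "RequestLatencyMs",
                "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "my-app-asg"}],
                "Value": latency_ms,
                "Unit": "Milliseconds",
            }
        ],
    )
```

You could then point a CloudWatch alarm or target-tracking policy at MyApp/RequestLatencyMs instead of CPU.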
3. How do you handle session state with multiple instances?
Option 1: Don’t use sessions (best for APIs)
- Use JWT tokens
- All state in the token payload
- Stateless, works with any instance
Option 2: Shared session store (for traditional web apps)
from flask import Flask, session
from flask_session import Session
import redis
app = Flask(__name__)
app.config['SESSION_TYPE'] = 'redis'
app.config['SESSION_REDIS'] = redis.from_url('redis://elasticache-endpoint:6379')
Session(app)
# Now sessions are stored in Redis, not in-memory
# Any instance can read/write session data
Option 3: Sticky sessions (not recommended)
- ALB always routes user to same instance
- Defeats purpose of auto-scaling
- Lose sessions when instance terminates
Real-world recommendation: Use ElastiCache (Redis) for sessions if you need them.
4. What’s the difference between desired, minimum, and maximum capacity?
resource "aws_autoscaling_group" "app" {
min_size = 2 # Never go below 2 (for high availability)
max_size = 10 # Never exceed 10 (cost protection)
desired_capacity = 4 # Start with 4, let scaling adjust
# ... more config
}
How they work together:
Scenario 1: Normal operation
Current: 4 instances (= desired)
Traffic increases, CPU hits 80%
Scaling policy triggers: desired_capacity = 4 + 2 = 6
Result: Launch 2 new instances
Scenario 2: Traffic spike
Current: 8 instances
Traffic spike continues, CPU still high
Scaling policy wants: desired_capacity = 8 + 2 = 10
Result: Launch 2 more instances (at max_size, stops)
Scenario 3: Traffic drops
Current: 6 instances
CPU drops to 20%, scale-down alarm triggers
Scaling policy: desired_capacity = 6 - 2 = 4
Result: Terminate 2 instances
Scenario 4: Major incident, all instances failing
Current: 0 healthy instances
Auto Scaling: "I need to meet min_size!"
Result: Launches 2 new instances (even if health checks failing)
Best practices:
- min_size: Enough for high availability (at least 2 in different AZs)
- max_size: Based on budget and realistic peak load
- desired_capacity: Let auto-scaling manage it, or set initial value
5. What happens during a deployment to a running auto-scaled application?
Bad approach (causes downtime):
# Update launch template
# Terminate all instances
# New instances launch with new code
# → Downtime while instances come up
Good approach (rolling update):
resource "aws_autoscaling_group" "app" {
# ...
instance_refresh {
strategy = "Rolling"
preferences {
min_healthy_percentage = 50 # Keep at least 50% healthy during update
}
}
}
Process:
- Update launch template with new AMI/user data
- Trigger instance refresh
- Auto Scaling terminates one instance
- New instance launches with new config
- Wait for health checks to pass
- Repeat until all instances replaced
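If you would rather kick off the refresh from a deploy script than rely on a Terraform change, a hedged boto3 sketch looks like this (the ASG name is a placeholder; the preferences mirror the instance_refresh block above):

```python
import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.start_instance_refresh(
    AutoScalingGroupName="my-app-asg",
    Strategy="Rolling",
    Preferences={
        "MinHealthyPercentage": 50,   # keep at least half the fleet serving traffic
        "InstanceWarmup": 300,        # seconds before a replacement counts as healthy
    },
)
```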
Blue/Green deployment (zero downtime):
- Create new ASG with new version
- Attach to same target group
- Gradually shift traffic (weighted target groups)
- Monitor error rates
- Fully switch or rollback
Thinking Exercise
Problem: You’re building a web application where each EC2 instance can reliably handle 100 requests per second. You expect peak traffic of 500 requests per second during business hours (9 AM - 5 PM), and only 50 requests per second during off-hours.
Question 1: How many instances do you need at minimum for peak load?
**Answer**: 500 req/sec ÷ 100 req/sec per instance = **5 instances minimum** at peak.

But you should add buffer for:
- Instance failures
- Traffic spikes beyond expected peak
- Performance degradation under load

**Recommended**:
- `max_size = 8` (160% of calculated minimum, allows for a 60% spike)
- `min_size = 2` (high availability during off-hours)
- `desired_capacity = 6` (20% buffer above minimum requirement)

Question 2: If each instance costs $0.05/hour, how much do you save per month with auto-scaling vs. running peak capacity 24/7?

**Without auto-scaling** (running 5 instances 24/7):
- Cost = 5 instances × $0.05/hour × 24 hours × 30 days = **$180/month**

**With auto-scaling**:
- Peak hours (9 AM - 5 PM = 8 hours): 5 instances
- Off-hours (16 hours): 2 instances

Daily cost:
- Peak: 5 instances × $0.05 × 8 hours = $2.00
- Off-hours: 2 instances × $0.05 × 16 hours = $1.60
- Total per day = $3.60

Monthly cost:
- $3.60 × 30 days = **$108/month**

**Savings: $180 - $108 = $72/month (40% reduction)**

With realistic scaling (more granular adjustments), savings are often 50-70%.

Question 3: Your health check interval is 30 seconds, unhealthy threshold is 3. An instance’s application crashes. How long until the ALB stops sending traffic to it?

**Answer**:
- Health check every 30 seconds
- Need 3 consecutive failures
- Time = 30 seconds × 3 = **90 seconds minimum**

During this time, users may experience:
- Timeout errors (if the health check timeout is 5 sec, user requests time out too)
- Failed requests sent to the unhealthy instance

**Optimization**: Reduce interval to 10 seconds, unhealthy threshold to 2:
- Time to detect = 10 × 2 = **20 seconds**
- Trade-off: more frequent health checks = slightly more traffic

Question 4: You set scale-up to trigger at 70% CPU, scale-down at 30% CPU. What problem might you encounter?

**Problem**: **Flapping** (constant scale-up/scale-down oscillation)

**Scenario**:
1. 3 instances at 75% CPU → scale up to 4 instances
2. Load distributed across 4 → CPU drops to 56% (still above 30%)
3. Stays stable... then one instance terminates unexpectedly
4. Load redistributes to 3 instances → CPU back to 75% → scale up again
5. Repeat forever

**Or worse**:
1. 3 instances at 75% CPU → scale up to 4
2. CPU drops to 28% → scale down to 3
3. CPU jumps to 75% → scale up to 4
4. Infinite loop, costs money, destabilizes app

**Solution**: Add **cooldown periods** and a **wider gap**:

```hcl
resource "aws_autoscaling_policy" "scale_up" {
  cooldown = 300  # Wait 5 minutes before scaling again
  # ...
}

# Use thresholds with buffer:
# Scale up at 70% CPU
# Scale down at 20% CPU (50 point gap prevents flapping)
```

**Better solution**: Use **target tracking scaling**:

```hcl
target_tracking_configuration {
  target_value = 50.0  # AWS automatically manages scale-up/down to maintain this
}
```

AWS's algorithm is smarter, prevents flapping, and considers cooldowns automatically.

The Interview Questions They’ll Ask
When you claim AWS auto-scaling experience on your resume, expect these questions:
1. Basic Understanding
Q: “Explain the difference between vertical and horizontal scaling. When would you use each?”
Expected answer:
- Vertical = bigger instance (limited by hardware, downtime required)
- Horizontal = more instances (unlimited, no downtime, requires stateless design)
- Use vertical for: Legacy apps that can’t distribute, databases (until you move to RDS)
- Use horizontal for: Web apps, APIs, stateless services
Q: “What’s the difference between an Application Load Balancer and a Network Load Balancer?”
Expected answer:
- ALB: Layer 7 (HTTP/HTTPS), content-based routing, WebSockets, host/path routing
- NLB: Layer 4 (TCP/UDP), ultra-low latency, static IP, millions of req/sec
- Use ALB for: Web applications, microservices with path routing
- Use NLB for: Non-HTTP protocols, extreme performance requirements, static IP needed
2. Design Scenarios
Q: “You’re seeing 5xx errors from your ALB. How do you troubleshoot?”
Expected approach:
- Check target health in target group (unhealthy instances?)
- Check ALB access logs (which endpoints returning 5xx?)
- Check application logs on instances (app crashes? database timeouts?)
- Check security group rules (instances can reach database?)
- Check CloudWatch metrics (CPU/memory maxed out?)
Q: “Your application needs to maintain user sessions. How do you architect this with auto-scaling?”
Expected answer:
- Option 1: ElastiCache (Redis/Memcached) as shared session store
- Option 2: DynamoDB for session storage
- Option 3: JWT tokens (no server-side sessions)
- NOT sticky sessions (defeats auto-scaling benefits, data loss on instance termination)
3. Scaling Logic
Q: “You set your ASG to min=2, max=10, desired=5. You manually terminate an instance. What happens?”
Expected answer:
- Current instances: 4 (after termination)
- Desired capacity: still 5
- Auto Scaling detects current < desired
- Launches 1 new instance to reach desired=5
Q: “What’s the difference between target tracking scaling and step scaling?”
Expected answer:
- Target tracking: Set a target (e.g., “maintain 50% CPU”), AWS automatically scales up/down to maintain it. Simpler, recommended for most use cases.
- Step scaling: Define explicit rules (e.g., “if CPU > 70%, add 2 instances; if CPU > 85%, add 4 instances”). More control, more complex, use for non-linear scaling needs.
4. Real-World Problem Solving
Q: “Your auto-scaling isn’t triggering when you expect. How do you debug?”
Expected approach:
- Check CloudWatch alarms (are they in ALARM state?)
- Check alarm history (has threshold actually been crossed?)
- Check alarm configuration (right metric? right threshold? evaluation periods?)
- Check ASG configuration (is policy attached? cooldown preventing scale?)
- Check instance metrics (is data actually being reported?)
Q: “You deployed a new version and now instances are failing health checks. What do you check?”
Expected approach:
- SSH to instance, check application logs
- Test health check endpoint manually: curl localhost:80/health
- Check if app started correctly (check user data script logs)
- Check security group (does instance allow traffic on health check port?)
- Check health check configuration (path correct? timeout too short?)
- Check grace period (is app given enough time to start before checks?)
5. Cost Optimization
Q: “How would you reduce costs for an auto-scaled application?”
Expected strategies:
- Right-size instances (use smaller instance types if CPU consistently low)
- Use Spot instances for fault-tolerant workloads (70-90% cheaper)
- Implement aggressive scale-down (reduce min_size during known low-traffic periods)
- Use scheduled scaling (scale down automatically at night/weekends)
- Reserved Instances or Savings Plans for baseline capacity
- Monitor and optimize unhealthy instance replacement (failing fast vs. retrying)
Q: “Explain Spot instances in the context of auto-scaling. What are the risks?”
Expected answer:
- Spot = unused EC2 capacity at up to 90% discount
- Risk: AWS can terminate with 2-minute notice if capacity needed
- Use in ASG with mixed instance types (Spot + On-Demand)
- Configure Spot allocation strategy (price-capacity-optimized)
- Not suitable for: Stateful apps, databases, single-instance workloads
- Perfect for: Batch processing, web front-ends (with On-Demand baseline)
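A hedged sketch of the mixed On-Demand/Spot setup via boto3; the launch template name, instance types, and subnets are placeholders:

```python
import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="my-app-asg-mixed",
    MinSize=2,
    MaxSize=10,
    VPCZoneIdentifier="subnet-abc123,subnet-def456",
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "my-app",
                "Version": "$Latest",
            },
            # Offer several interchangeable types to improve Spot availability
            "Overrides": [{"InstanceType": "t3.medium"}, {"InstanceType": "t3a.medium"}],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 2,                   # always-on baseline
            "OnDemandPercentageAboveBaseCapacity": 25,   # 75% of extra capacity on Spot
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    },
)
```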
Hints in Layers
When you get stuck, reveal hints progressively instead of jumping to the solution:
Problem: Instances launching but failing health checks immediately
Hint 1 (First check)

Check the **health check grace period**. Your application might need time to start up.

```hcl
resource "aws_autoscaling_group" "app" {
  health_check_grace_period = 300  # Seconds to wait before health checks
}
```

If your app takes 3 minutes to initialize but the grace period is 30 seconds, instances will be terminated before they're ready.

Hint 2 (If still failing)

SSH into a failing instance and test the health check endpoint manually:

```bash
# From within the instance
curl -v http://localhost:80/health

# Check if the application is actually running
ps aux | grep your-app-name

# Check application logs
tail -f /var/log/your-app/app.log
```

Is the application even starting? Is it listening on the correct port?

Hint 3 (Security check)

Verify security group rules allow health checks:

```bash
aws ec2 describe-security-groups --group-ids sg-xxxxx --no-cli-pager
# Look for:
# - Inbound rule allowing ALB security group on port 80
# - Or inbound rule allowing the VPC CIDR on port 80
```

The ALB needs network access to perform health checks.

Hint 4 (Application check)

Check your user data script logs:

```bash
# On Amazon Linux/Ubuntu
cat /var/log/cloud-init-output.log
# Look for errors in your bootstrap script
# Did database connection fail?
# Did dependencies install correctly?
```

A failing user data script means your app never starts.

Solution (Last resort)

Common causes and fixes:

1. **Application takes too long to start**:
```hcl
health_check_grace_period = 600  # Increase to 10 minutes
```
2. **Wrong health check path**:
```hcl
resource "aws_lb_target_group" "app" {
  health_check {
    path = "/health"  # Make sure this endpoint exists!
  }
}
```
3. **Health check endpoint requires database, database unreachable**:
   - Fix security group rules allowing app tier → database tier
   - Or simplify health check to not require database
4. **Application listening on wrong port**:
```python
# Your app
app.run(host='0.0.0.0', port=80)  # Must match target group port
```
5. **User data script has errors, app never starts**:
   - Test user data script locally first
   - Add error handling: `set -e` to fail fast
   - Check logs: `/var/log/cloud-init-output.log`

Problem: Auto-scaling not triggering when CPU is high

Hint 1

Check if your CloudWatch alarm is actually in ALARM state:

```bash
aws cloudwatch describe-alarms --alarm-names "cpu-high-alarm" --no-cli-pager
```

Look at `StateValue`. If it's `OK`, the threshold hasn't been crossed.

Hint 2

Check your alarm configuration:

```bash
aws cloudwatch describe-alarms --alarm-names "cpu-high-alarm" --no-cli-pager
# Verify:
# - Threshold: Is it too high? (e.g., 99% vs 70%)
# - EvaluationPeriods: Does CPU need to be high for multiple periods?
# - Period: Is it too long? (e.g., 5 minutes vs 1 minute)
# - Statistic: Average vs Maximum vs Minimum
```

Example: If `EvaluationPeriods=3` and `Period=300`, CPU must be high for **15 minutes** before the alarm triggers.

Hint 3

Check if scaling is in cooldown:

```bash
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names my-asg --no-cli-pager

# Look for recent scaling activities
aws autoscaling describe-scaling-activities --auto-scaling-group-name my-asg --max-records 5 --no-cli-pager
```

If a scaling action just happened, the cooldown period prevents another one (default 300 seconds).

Solution

Common fixes:

1. **Alarm threshold too high**:
```hcl
threshold = 70  # Not 90
```
2. **Evaluation period too long**:
```hcl
evaluation_periods = 2  # Not 5
period             = 60 # 1 minute, not 5
```
3. **Cooldown preventing scaling**:
```hcl
resource "aws_autoscaling_policy" "scale_up" {
  cooldown = 60  # Reduce from 300
}
```
4. **Alarm not attached to scaling policy**:
```hcl
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  # ...
  alarm_actions = [aws_autoscaling_policy.scale_up.arn]  # Must be set!
}
```
5. **Use target tracking instead**:
```hcl
resource "aws_autoscaling_policy" "target_tracking" {
  policy_type = "TargetTrackingScaling"
  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 50.0
  }
}
# AWS handles everything automatically
```

Books That Will Help
| Book | Author | What It Teaches | Best Sections for This Project |
|---|---|---|---|
| AWS for Solutions Architects | Saurabh Shrivastava et al. | Comprehensive AWS architecture patterns | Ch. 6 (Auto Scaling Groups), Ch. 7 (RDS), Ch. 10 (High Availability) |
| AWS Certified Solutions Architect Study Guide | Ben Piper, David Clinton | Exam-focused AWS fundamentals | Ch. 4 (EC2), Ch. 5 (ELB), Ch. 7 (CloudWatch) |
| Designing Data-Intensive Applications | Martin Kleppmann | Distributed systems theory (applies to auto-scaling) | Ch. 1 (Scalability), Ch. 8 (Distributed Systems) |
| Amazon Web Services in Action | Michael Wittig, Andreas Wittig | Hands-on AWS with practical examples | Ch. 3 (Infrastructure Automation), Ch. 6 (Scaling Up and Down) |
| The Phoenix Project | Gene Kim et al. | DevOps principles (why auto-scaling matters) | Part 2 (First Way - Flow), Part 3 (Second Way - Feedback) |
| Site Reliability Engineering | Google SRE Team | Operational practices for scaled systems | Ch. 6 (Monitoring), Ch. 22 (Cascading Failures - why you need auto-scaling) |
| Terraform: Up & Running | Yevgeniy Brikman | Infrastructure as Code for AWS | Ch. 2 (Terraform Syntax), Ch. 5 (State Management), Ch. 7 (Modules) |
Reading strategy:
- Start with “AWS for Solutions Architects” Ch. 6-7 for AWS-specific patterns
- Read “Designing Data-Intensive Applications” Ch. 1 to understand why systems need to scale
- Use “Terraform: Up & Running” Ch. 2 as a reference while coding
- Read “SRE” Ch. 22 after completing the project to understand failure modes you just protected against
Project 4: Container Platform (ECS Fargate or EKS)
- File: AWS_DEEP_DIVE_LEARNING_PROJECTS.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, TypeScript, Terraform HCL
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: Level 3: The “Service & Support” Model
- Difficulty: Level 3: Advanced (The Engineer)
- Knowledge Area: Containers, Kubernetes, Orchestration
- Software or Tool: Docker, ECS, EKS, Kubernetes
- Main Book: “AWS for Solutions Architects” by Shrivastava et al.
What you’ll build: A containerized microservices application deployed on either ECS with Fargate (serverless containers) or EKS (managed Kubernetes), complete with service discovery, load balancing, and auto-scaling.
Why it teaches Containers on AWS: Containers are between EC2 and Lambda—more portable than EC2, more control than Lambda. ECS teaches you AWS’s native container orchestration; EKS teaches you Kubernetes. Both force you to understand task definitions, networking modes, and service mesh concepts.
Core challenges you’ll face:
- Task Definitions (maps to container configuration): CPU/memory allocation, environment variables, port mappings, IAM task roles
- Networking Modes (maps to container networking): awsvpc mode, service discovery, ALB integration
- Service Scaling (maps to container orchestration): Target tracking, step scaling based on metrics
- ECR Integration (maps to image management): Building, tagging, pushing container images
- For EKS: Kubernetes Fundamentals (maps to orchestration): Pods, Deployments, Services, Ingress
Resources for key challenges:
- Learn Amazon ECS over a Weekend - Fast workshop-style course
- Provision an EKS cluster with Terraform - HashiCorp official tutorial
- Terraform EKS Tutorial - Spacelift comprehensive guide
- Build secure application networks with VPC Lattice - AWS Containers Blog
Key Concepts:
- ECS Task Definitions: Amazon ECS Developer Guide
- Fargate vs EC2 Launch Type: “AWS for Solutions Architects” Ch. 9
- Kubernetes on AWS: Amazon EKS User Guide
- Terraform EKS Module: terraform-aws-eks - GitHub
Difficulty: Advanced Time estimate: 2-4 weeks Prerequisites: Docker basics, Projects 1-2 completed, some Kubernetes knowledge for EKS path
Real world outcome:
- Multiple containerized services communicating with each other
- A working application accessible via ALB
- CloudWatch Container Insights showing metrics
- Ability to deploy new versions with zero downtime
- For EKS: kubectl commands working against your cluster
Learning milestones:
- First milestone: Single container task running on Fargate → you understand task definitions
- Second milestone: Add ALB target group with service → you understand container load balancing
- Third milestone: Add second service with service discovery → you understand microservices communication
- Fourth milestone: Add auto-scaling based on CPU → you understand container elasticity
- Final milestone: CI/CD pipeline deploying to ECS/EKS → you understand container DevOps
Real World Outcome
When you complete this project, you’ll have tangible proof of a working container platform:
For ECS Fargate:
# View your running services
$ aws ecs list-services --cluster my-cluster --no-cli-pager
{
"serviceArns": [
"arn:aws:ecs:us-east-1:123456789:service/my-cluster/api-service",
"arn:aws:ecs:us-east-1:123456789:service/my-cluster/worker-service"
]
}
# Check service health
$ aws ecs describe-services --cluster my-cluster --services api-service --no-cli-pager
{
"services": [{
"serviceName": "api-service",
"runningCount": 2,
"desiredCount": 2,
"launchType": "FARGATE",
"networkConfiguration": {
"awsvpcConfiguration": {
"subnets": ["subnet-abc123", "subnet-def456"],
"securityGroups": ["sg-xyz789"]
}
}
}]
}
# Your application responds via ALB
$ curl http://my-app-alb-1234567890.us-east-1.elb.amazonaws.com/health
{"status": "healthy", "service": "api", "task_id": "abc123def456", "version": "1.2.0"}
# Service discovery working (containers finding each other by DNS)
$ curl http://my-app-alb-1234567890.us-east-1.elb.amazonaws.com/api/worker-status
{"worker_service": "reachable", "tasks": 3, "queue_depth": 42}
# View container images in ECR
$ aws ecr describe-images --repository-name my-app --no-cli-pager
{
"imageDetails": [
{
"imageDigest": "sha256:abc123...",
"imageTags": ["latest", "v1.2.0"],
"imagePushedAt": "2025-12-20T10:30:00+00:00"
}
]
}
For EKS:
# Your Kubernetes cluster is accessible
$ kubectl cluster-info
Kubernetes control plane is running at https://ABC123.gr7.us-east-1.eks.amazonaws.com
# View your running workloads
$ kubectl get pods -n production
NAME READY STATUS RESTARTS AGE
api-deployment-7d8f9c-abc12 1/1 Running 0 2h
api-deployment-7d8f9c-def34 1/1 Running 0 2h
worker-deployment-5e6f7-xyz 1/1 Running 0 1h
$ kubectl get services -n production
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
api-service LoadBalancer 10.100.50.25 a1b2c3.us-east-1.elb.amazonaws.com 80:30001/TCP
worker-svc ClusterIP 10.100.75.10 <none> 8080/TCP
# Application accessible via Kubernetes LoadBalancer
$ curl http://a1b2c3.us-east-1.elb.amazonaws.com/health
{"status": "healthy", "pod": "api-deployment-7d8f9c-abc12", "node": "ip-10-0-1-50.ec2.internal"}
# Rolling deployment with zero downtime
$ kubectl set image deployment/api-deployment api=my-repo/api:v1.3.0 -n production
deployment.apps/api-deployment image updated
$ kubectl rollout status deployment/api-deployment -n production
Waiting for deployment "api-deployment" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "api-deployment" rollout to finish: 1 old replicas are pending termination...
deployment "api-deployment" successfully rolled out
Container Insights Dashboard:
- CPU/Memory utilization per service/pod
- Network throughput and connections
- Task/pod startup time and failure rates
- Container-level logs aggregated in CloudWatch
Service Discovery Working:
- ECS: Cloud Map DNS names (api-service.local, worker-service.local)
- EKS: Kubernetes DNS (api-service.production.svc.cluster.local)
- Containers automatically discover each other without hardcoded IPs
Rolling Deployments:
- Deploy new version without downtime
- Watch old tasks drain and new tasks become healthy
- Automatic rollback if health checks fail
The Core Question You’re Answering
“When should I use containers vs Lambda vs EC2, and what’s the difference between ECS and EKS?”
This is THE fundamental architectural decision on AWS:
- Lambda: Event-driven, sub-15-minute executions, no state, extreme auto-scaling. Use when you have unpredictable spiky traffic and stateless operations.
- Containers (ECS/EKS): Long-running processes, stateful applications, need specific runtime control, want portability. Use when you need more than 15 minutes, WebSocket connections, background workers, or existing Dockerized apps.
- EC2: Full OS control, legacy apps, specialized hardware needs, licensed software. Use when containers/Lambda don’t fit.
ECS vs EKS:
- ECS: AWS-native, simpler, less operational overhead, great for teams new to containers. Task definitions are AWS-specific JSON.
- EKS: Standard Kubernetes, portable across clouds, richer ecosystem, more complex. Use when you need Kubernetes features or multi-cloud portability.
Fargate vs EC2 Launch Type:
- Fargate: Serverless containers—you define task CPU/memory, AWS manages infrastructure. Higher per-task cost, zero operational overhead.
- EC2 Launch Type: You manage the EC2 instances in the cluster. Lower per-task cost if you have steady baseline load, more control, more ops work.
Concepts You Must Understand First
- Container Fundamentals:
- Docker images are layered filesystems (each Dockerfile instruction = layer)
- Container registries store images (ECR, Docker Hub)
- Containers are isolated processes sharing a kernel (not VMs)
- Port mapping: container internal port → host port
- Environment variables and secrets injection
- Container Orchestration:
- Scheduling: Placing containers on available compute resources based on CPU/memory requirements
- Service Discovery: How containers find each other (DNS-based)
- Load Balancing: Distributing traffic across container replicas
- Health Checks: Determining if a container is ready to receive traffic
- Auto-Scaling: Adjusting container count based on metrics
- ECS Concepts:
- Task Definition: Blueprint for your container (image, CPU, memory, ports, environment)
- Task: Running instance of a task definition (one or more containers running together)
- Service: Maintains desired count of tasks, integrates with ALB, handles deployments
- Cluster: Logical grouping of tasks/services
- Task Role: IAM permissions for your application (what the container can do)
- Execution Role: IAM permissions for ECS agent (pulling ECR images, writing logs)
- Kubernetes Concepts (for EKS):
- Pod: Smallest unit (one or more containers, shared network/storage)
- Deployment: Declarative pod management (replicas, rolling updates)
- Service: Stable endpoint for pods (ClusterIP, LoadBalancer, NodePort)
- Ingress: HTTP routing rules (maps URLs to services)
- Namespace: Logical cluster subdivision
- ConfigMap/Secret: Configuration and sensitive data injection
- Fargate vs EC2 Launch Types:
- Fargate: Specify vCPU/memory, AWS provisions infrastructure, pay per task resource/time
- EC2: You provision EC2 instances, ECS schedules tasks on them, pay for instances (can be cheaper at scale)
- Trade-off: Fargate = simplicity, EC2 = control + potential cost savings with Reserved Instances
Questions to Guide Your Design
Architecture Decisions:
- When would you choose Fargate over EC2 launch type? (Hint: variable workload vs steady baseline, ops overhead tolerance)
- Should you use ECS or EKS? (Hint: team Kubernetes experience, multi-cloud needs, ecosystem requirements)
- How many containers should run in a single task/pod? (Hint: tightly coupled = same task, independent = separate tasks)
Networking:
- How do containers in the same task communicate? (Hint: localhost, shared network namespace)
- How do containers in different tasks communicate? (Hint: service discovery via DNS)
- What’s the difference between ECS service discovery (Cloud Map) and an ALB? (Hint: internal service-to-service vs external clients)
- Which networking mode should you use (awsvpc, bridge, host)? (Hint: Fargate requires awsvpc)
Security:
- How do you handle secrets in containerized applications? (Hint: Secrets Manager/Parameter Store + task definition secrets, NOT environment variables in plaintext)
- What’s the difference between task role and execution role? (Hint: execution = ECS needs it, task = your app needs it)
- How do you restrict which services can talk to each other? (Hint: security groups with awsvpc mode)
Scaling & Performance:
- Should you scale based on CPU, memory, or request count? (Hint: depends on bottleneck—CPU-bound vs I/O-bound workloads)
- How do you handle database connection pooling with auto-scaling containers? (Hint: RDS Proxy or application-level pooling)
- What happens during a deployment? (Hint: rolling update drains old tasks while starting new ones)
Operational:
- How do you get logs from containers? (Hint: awslogs driver → CloudWatch)
- How do you debug a failing container startup? (Hint: CloudWatch logs, check execution role for ECR pull permissions)
- How do you do zero-downtime deployments? (Hint: ALB health checks + rolling update strategy)
Thinking Exercise
Map Kubernetes concepts to ECS equivalents:
| Kubernetes | ECS Equivalent | Notes |
|---|---|---|
| Pod | Task | Both are one or more containers with shared resources |
| Deployment | Service | Both maintain desired count and handle updates |
| Service (ClusterIP) | Service Discovery (Cloud Map) | Internal DNS-based discovery |
| Service (LoadBalancer) | ALB Target Group | External load balancing |
| Container in Pod spec | Container Definition in Task Definition | Both define image, ports, env vars |
| ConfigMap | SSM Parameter Store / Secrets Manager | Both inject configuration |
| Secret | Secrets Manager | Both handle sensitive data |
| Namespace | Cluster (loosely) | Logical separation (but ECS clusters are less strict) |
| Ingress | ALB Listener Rules | HTTP routing rules |
| HorizontalPodAutoscaler | Service Auto Scaling | Both scale based on metrics |
Key differences:
- Kubernetes is more declarative (desired state via YAML)
- ECS is more imperative (API calls to create/update services)
- Kubernetes has richer networking (network policies, service mesh)
- ECS is simpler for AWS-only deployments
Design exercise: If you have a microservices app with 5 services, should they all go in one task definition or separate services?
- Answer: Separate ECS services (or Kubernetes deployments). Each microservice should scale independently.
- Exception: If two containers are tightly coupled (app + sidecar proxy), same task/pod makes sense.
The Interview Questions They’ll Ask
Basic Level:
- Q: What’s the difference between a Docker image and a container?
- A: An image is a read-only template (layers of filesystem changes). A container is a running instance of an image with a writable layer on top. Analogy: image = class, container = object instance.
- Q: Explain ECS task vs service.
- A: A task is a running instantiation of a task definition (one or more containers). A service maintains a desired count of tasks, integrates with load balancers, and handles deployments. Tasks are ephemeral; services ensure they keep running.
- Q: What is Fargate?
- A: Serverless compute for containers. You specify CPU/memory in your task definition, AWS provisions and manages the underlying infrastructure. No EC2 instances to manage.
- Q: How do containers in the same ECS task communicate?
- A: Via localhost. Containers in the same task share a network namespace, so they can reach each other on 127.0.0.1 using different ports.
Intermediate Level:
- Q: When would you use ECS over Lambda?
- A: When you need: (1) longer than 15-minute execution, (2) WebSocket/long-lived connections, (3) stateful processing, (4) specific runtime dependencies not available in Lambda layers, (5) existing Dockerized applications.
- Q: Explain the difference between task role and execution role in ECS.
- A: Execution role: permissions ECS needs to set up your task (pull ECR images, write CloudWatch logs). Task role: permissions your application code needs (read S3, query DynamoDB). Never confuse these—execution role is infrastructure, task role is application.
- Q: How does ECS service discovery work?
- A: Uses AWS Cloud Map to create DNS records for tasks. When you enable service discovery, each task gets a DNS entry (e.g., api-service.local). Other services query this DNS name and get IPs of healthy tasks. Updates automatically as tasks start/stop.
- Q: How do you implement zero-downtime deployments in ECS?
- A: Use rolling update deployment type with ALB. Configure minimum healthy percent (e.g., 100%) and maximum percent (e.g., 200%). ECS starts new tasks, waits for ALB health checks to pass, then drains and stops old tasks. If health checks fail, deployment rolls back.
Advanced Level:
- Q: You have an ECS service that keeps failing health checks and restarting. How do you debug?
- A: (1) Check CloudWatch logs for application errors. (2) Verify health check endpoint in ALB target group matches application. (3) Check security groups allow ALB → tasks traffic. (4) Verify task role has permissions app needs. (5) Check container startup time vs health check interval (may need longer initial delay). (6) Exec into a running task to test manually:
aws ecs execute-command --cluster X --task Y --container Z --interactive --command /bin/sh.
- Q: How would you handle database connection pooling with auto-scaling ECS services?
- A: Options: (1) Use RDS Proxy (connection pooling/multiplexing at AWS layer). (2) Application-level pooling with conservative pool size per container (max_connections / expected_task_count). (3) Use connection pool libraries that handle connection reuse. Problem: Each task creates its own pool, so 10 tasks with 10 connections each = 100 DB connections. RDS Proxy solves this.
- Q: Compare ECS awsvpc networking mode with bridge mode.
- A: awsvpc: Each task gets its own ENI with private IP from VPC subnet. Security groups apply at task level. Required for Fargate. Better isolation. Bridge: Containers share host’s network via Docker bridge. Port mapping required (host port != container port). Multiple tasks on same host need different host ports. awsvpc is recommended for new deployments.
- Q: When would you choose EKS over ECS?
- A: When you need: (1) Kubernetes-specific features (CRDs, operators, Helm charts), (2) multi-cloud portability (same K8s manifests work on GKE/AKS), (3) existing Kubernetes expertise on team, (4) vendor-neutral orchestration. ECS is simpler and AWS-native; EKS is more complex but standard.
- Q: How do you handle secrets in containerized apps on AWS?
- A: Store secrets in AWS Secrets Manager or SSM Parameter Store (SecureString). Reference them in the task definition's secrets field (NOT plain environment variables). ECS retrieves the secrets at task startup using the execution role's permissions and injects them as environment variables into the container. Secrets are encrypted at rest and in transit. Never hardcode secrets in a Dockerfile or pass them as plaintext env vars. (A task definition sketch follows this list.)
- Q: Explain a scenario where you’d use the EC2 launch type instead of Fargate.
- A: When you have: (1) steady baseline load (Reserved Instances cheaper than Fargate per-task pricing), (2) need for specific EC2 instance types (GPU, high memory), (3) tasks require privileged mode or host networking, (4) need to run your own AMI with custom configs. Trade-off: lower cost and more control, but you manage cluster capacity and OS patching.
- Q: Your containerized application experiences cold start delays. How do you optimize?
- A: (1) Reduce image size (multi-stage builds, minimal base images like Alpine). (2) Optimize Dockerfile layer caching (COPY dependencies before code). (3) Pre-warm containers if predictable traffic spikes. (4) Use smaller task CPU/memory if over-provisioned (faster scheduling). (5) For EKS: Use cluster autoscaler with appropriate scaling configs. (6) Consider keeping minimum task count > 0 to avoid cold starts entirely.
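Tying together the execution-role and secrets answers above, here is a hedged task definition sketch. All ARNs, the account ID, and the image tag are placeholders; the shape of the secrets block and the separation of the two roles are the parts that matter.

```bash
# Hedged sketch: secrets injected from Secrets Manager at task startup.
# ECS resolves "valueFrom" using the EXECUTION role (which therefore needs
# secretsmanager:GetSecretValue on that ARN) and exposes the value to the
# container as the DB_PASSWORD environment variable. The task role stays
# scoped to what the application itself calls.
cat > taskdef.json <<'EOF'
{
  "family": "api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::123456789012:role/api-execution-role",
  "taskRoleArn": "arn:aws:iam::123456789012:role/api-task-role",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:v1.0.0",
      "essential": true,
      "secrets": [
        {
          "name": "DB_PASSWORD",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-password-AbCdEf"
        }
      ]
    }
  ]
}
EOF
aws ecs register-task-definition --cli-input-json file://taskdef.json
```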
Hints in Layers
Level 1: Getting Started
- Start with a single-container task definition running nginx or a simple Python/Node.js app
- Use Fargate to avoid EC2 cluster management
- Deploy to a public subnet first (simpler than private with NAT); see the run-task sketch after this list
- Use AWS Console to create your first task definition—you’ll see all the options
- Use the “latest” tag on your first image (iterate fast)
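If you want a concrete starting point, here is a hedged sketch using the AWS CLI. The cluster name, subnet, and security group IDs are placeholders, and the task definition is assumed to exist already (for example, the two-container sketch from the interview section).

```bash
# Hedged first run: one Fargate task in a public subnet.
# assignPublicIp=ENABLED is what makes the public-subnet shortcut work: the task
# can pull its image from the internet without a NAT gateway.
aws ecs create-cluster --cluster-name learn-ecs
aws ecs run-task \
  --cluster learn-ecs \
  --launch-type FARGATE \
  --task-definition localhost-demo \
  --network-configuration \
    "awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0],assignPublicIp=ENABLED}"
```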
Level 2: Adding Realism
- Move to private subnets with NAT gateway (production pattern)
- Add ALB in front of your service for stable endpoint
- Create second container that calls the first (understand inter-service communication)
- Use ECR instead of Docker Hub (AWS-native, faster pulls)
- Start using specific version tags (v1.0.0, not “latest”)
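A hedged sketch of the ECR plus versioned-tag workflow from the last two hints (account ID, region, and repository name are placeholders):

```bash
# Create a private repo, authenticate Docker against it, then build and push a
# versioned image instead of "latest".
REPO=123456789012.dkr.ecr.us-east-1.amazonaws.com/api
aws ecr create-repository --repository-name api
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker build -t "$REPO:v1.0.0" .
docker push "$REPO:v1.0.0"
```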
Level 3: Production Patterns
- Enable service discovery (Cloud Map) for service-to-service DNS
- Configure auto-scaling based on ALB request count or custom CloudWatch metrics (see the scaling sketch after this list)
- Set up proper health checks (readiness vs liveness)
- Add Container Insights for metrics
- Implement rolling deployments with deployment circuit breaker (auto-rollback on failure)
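For the auto-scaling hint above, here is a hedged sketch using target tracking on ALB request count per target. The cluster, service, target value, and target group resource label are placeholders.

```bash
# Allow Application Auto Scaling to manage the service's desired count.
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/app-cluster/api-service \
  --min-capacity 2 --max-capacity 10

# Target tracking: add tasks when requests per target rise above the target
# value, remove them (after the cooldown) when traffic drops.
cat > policy.json <<'EOF'
{
  "TargetValue": 500.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ALBRequestCountPerTarget",
    "ResourceLabel": "app/my-alb/0123456789abcdef/targetgroup/api-tg/0123456789abcdef"
  },
  "ScaleInCooldown": 120,
  "ScaleOutCooldown": 60
}
EOF
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/app-cluster/api-service \
  --policy-name api-request-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://policy.json
```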
Level 4: Advanced Scenarios
- Multi-container task with sidecar pattern (app + logging agent)
- Task role granting only necessary permissions (least privilege); see the IAM sketch after this list
- Secrets injection from Secrets Manager (no plaintext env vars)
- Blue/green deployments using CodeDeploy
- For EKS: Implement Horizontal Pod Autoscaler and Cluster Autoscaler together
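For the least-privilege hint above, a hedged sketch of a task role that allows only what the application actually does. The role name, bucket prefix, table ARN, and account ID are placeholders.

```bash
# Trust policy: only ECS tasks may assume this role.
cat > trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ecs-tasks.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF
# Permissions policy: one S3 prefix, one DynamoDB table, nothing else.
cat > task-role-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-app-uploads/incoming/*" },
    { "Effect": "Allow", "Action": ["dynamodb:GetItem", "dynamodb:Query"],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/sessions" }
  ]
}
EOF
aws iam create-role --role-name app-task-role --assume-role-policy-document file://trust.json
aws iam put-role-policy --role-name app-task-role --policy-name app-access \
  --policy-document file://task-role-policy.json
# Reference this role as "taskRoleArn" in the task definition; the execution role stays separate.
```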
Level 5: Mastery
- CI/CD pipeline: GitHub Actions → build image → push to ECR → update ECS service (see the deploy-step sketch after this list)
- Canary deployments (send 10% traffic to new version, monitor, then 100%)
- Service mesh (App Mesh for ECS or Istio for EKS) for advanced routing/observability
- Cross-region replication for DR (replicate ECR images, deploy to multiple regions)
- Custom metrics from containers to CloudWatch for scaling (e.g., queue depth)
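For the CI/CD hint above, the deploy step is usually the confusing part. Here is a hedged sketch of it in shell; the cluster, service, task family, and registry values are placeholders, and jq is assumed to be available on the runner.

```bash
# Hypothetical CI deploy step: point the service at a newly pushed image tag.
NEW_IMAGE="123456789012.dkr.ecr.us-east-1.amazonaws.com/api:${GIT_SHA}"

# Fetch the current task definition, swap the image, strip read-only fields,
# and register a new revision.
aws ecs describe-task-definition --task-definition api \
  --query 'taskDefinition' --output json > taskdef.json
jq --arg img "$NEW_IMAGE" '
  .containerDefinitions[0].image = $img
  | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
        .compatibilities, .registeredAt, .registeredBy)' taskdef.json > new-taskdef.json
NEW_ARN=$(aws ecs register-task-definition --cli-input-json file://new-taskdef.json \
  --query 'taskDefinition.taskDefinitionArn' --output text)

# Rolling update: ECS replaces tasks while respecting min/max healthy percent.
aws ecs update-service --cluster app-cluster --service api --task-definition "$NEW_ARN"
```

In GitHub Actions this would run after the build and push steps; ECS then performs the rolling update described in the interview section.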
Books That Will Help
| Book Title | Author(s) | Relevant Chapters | Why It Helps |
|---|---|---|---|
| AWS for Solutions Architects | Saurabh Shrivastava | Ch. 9: Containers on AWS | Best coverage of ECS architecture patterns, Fargate vs EC2 decisions |
| Docker Deep Dive | Nigel Poulton | Ch. 3-5, 8-9 | Container fundamentals, image layers, networking modes |
| Kubernetes in Action | Marko Luksa | Ch. 3-7 | Core K8s concepts (pods, services, deployments) for EKS path |
| Kubernetes Up & Running | Kelsey Hightower et al. | Ch. 5-6, 9-10 | Practical K8s patterns, service discovery, load balancing |
| Amazon Web Services in Action | Andreas Wittig & Michael Wittig | Ch. 14: Containers | Step-by-step ECS tutorial with CloudFormation examples |
| The DevOps Handbook | Gene Kim et al. | Part IV: Technical Practices | CI/CD for containers, deployment strategies (blue/green, canary) |
| Site Reliability Engineering | Betsy Beyer et al. | Ch. 7, 21 | Load balancing, monitoring containerized systems at scale |
| Production Kubernetes | Josh Rosso et al. | Ch. 4-6, 11 | Production-grade EKS: networking, security, observability |
| Container Security | Liz Rice | Ch. 2-4, 7 | Securing container images, runtime, orchestrator (critical for prod) |
| Designing Data-Intensive Applications | Martin Kleppmann | Ch. 11: Stream Processing | Understanding when to use containers for stateful vs stateless workloads |
Quick Reference:
- New to containers? Start with “Docker Deep Dive” Ch. 3-5
- Choosing ECS? Focus on “AWS for Solutions Architects” Ch. 9
- Choosing EKS? Read “Kubernetes Up & Running” Ch. 5-6 first
- Ready for production? “Container Security” is mandatory
- Need CI/CD? “The DevOps Handbook” Part IV
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| VPC from Scratch | Intermediate | 1-2 weeks | ⭐⭐⭐⭐⭐ (Foundational) | ⭐⭐⭐ |
| Serverless Pipeline | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ (Event-driven) | ⭐⭐⭐⭐ |
| Auto-Scaling Web App | Intermediate | 2-3 weeks | ⭐⭐⭐⭐⭐ (Classic AWS) | ⭐⭐⭐ |
| Container Platform | Advanced | 2-4 weeks | ⭐⭐⭐⭐⭐ (Modern AWS) | ⭐⭐⭐⭐⭐ |
Recommendation
Start with Project 1 (VPC from Scratch). Everything else depends on understanding networking. A misconfigured VPC will break Lambda VPC access, ECS service discovery, RDS connectivity—everything. Once you can confidently explain why your private subnet routes to a NAT gateway and your public subnet routes to an IGW, move on.
Then do Project 2 (Serverless Pipeline) to understand event-driven architecture—this is where modern AWS shines. Step Functions + Lambda is incredibly powerful once you “get it.”
Then Project 3 (Auto-Scaling Web App) for the “traditional” AWS architecture that many existing systems use. This teaches you fundamentals that apply everywhere.
Finally Project 4 (Containers) brings it all together—you need VPC knowledge, you can integrate with Lambda/Step Functions, and you’ll build on auto-scaling concepts.
Final Overall Project: Full-Stack SaaS Platform
What you’ll build: A complete SaaS application with:
- Multi-tenant architecture with isolated VPCs per environment (dev/staging/prod)
- API Gateway + Lambda for REST/GraphQL endpoints
- ECS Fargate running background workers
- Step Functions orchestrating complex business workflows
- RDS Aurora for relational data
- DynamoDB for high-speed session/cache data
- S3 + CloudFront for static assets and file uploads
- Cognito for authentication
- CI/CD with CodePipeline or GitHub Actions
- Infrastructure defined entirely in Terraform
- CloudWatch dashboards, X-Ray tracing, and alarms
Why this is the ultimate AWS learning project: This forces you to make real architectural decisions—when to use Lambda vs ECS, how to structure your VPCs for multi-environment deployments, how services talk to each other across boundaries. You’ll hit every AWS gotcha: Lambda cold starts affecting user experience, ECS task role permissions, RDS connection pooling, S3 CORS issues, CloudFront cache invalidation timing.
Core challenges you’ll face:
- Multi-Environment Architecture: Terraform workspaces or separate state files, environment-specific configurations (see the workspace sketch after this list)
- API Design: REST vs GraphQL, API Gateway vs ALB, authentication flows
- Data Architecture: When to use RDS vs DynamoDB, cross-service data access patterns
- Async Processing: SQS queues, dead-letter queues, effectively exactly-once processing via idempotent consumers (standard SQS delivery is at-least-once)
- Security: IAM policies, secrets management (Secrets Manager/Parameter Store), encryption at rest and in transit
- Observability: Distributed tracing, centralized logging, meaningful metrics
- Cost Optimization: Right-sizing, Reserved Instances vs Savings Plans, S3 lifecycle policies
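For the multi-environment challenge above, a hedged sketch of the Terraform workspace flow; the tfvars paths and backend key are placeholders.

```bash
# One state per environment via workspaces.
terraform workspace new staging       # creates and switches to the "staging" state
terraform workspace select staging    # or just switch if it already exists
terraform plan  -var-file=envs/staging.tfvars
terraform apply -var-file=envs/staging.tfvars

# Alternative: keep a separate backend state file per environment instead of
# workspaces, e.g. pass -backend-config="key=saas/prod/terraform.tfstate"
# to `terraform init` for the prod root module.
```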
Key Concepts:
- Well-Architected Framework: AWS Well-Architected - AWS Official
- SaaS Architecture: “AWS for Solutions Architects” Ch. 10-12 - Shrivastava et al.
- Infrastructure as Code at Scale: Terraform Best Practices - HashiCorp
- Serverless Patterns: “Designing Data-Intensive Applications” Ch. 11-12 - Martin Kleppmann (for async patterns)
Difficulty: Advanced
Time estimate: 1-2 months
Prerequisites: Projects 1-4 completed
Real world outcome:
- A working SaaS application you can demo to employers
- User registration, login, and authenticated API calls
- Background job processing visible in Step Functions console
- Multi-environment deployment from a single codebase
- Cost monitoring dashboard showing your AWS spend
- A portfolio piece that demonstrates comprehensive AWS knowledge
Learning milestones:
- First milestone: Auth working (Cognito + API Gateway) → you understand identity on AWS
- Second milestone: Core API + database working → you understand data tier patterns
- Third milestone: Background processing working → you understand async architecture
- Fourth milestone: Multi-environment deployment → you understand infrastructure management
- Final milestone: Observability + alerting working → you understand production operations
Key Learning Resources Summary
| Resource | Best For |
|---|---|
| “AWS for Solutions Architects” - Shrivastava et al. | Comprehensive coverage of all services |
| AWS DevOps Zero to Hero | Free hands-on projects |
| HashiCorp Terraform Tutorials | Infrastructure as Code |
| AWS Official Tutorials | Service-specific guides |
| “Designing Data-Intensive Applications” - Kleppmann | Distributed systems concepts |
Sources
- AWS Practice Projects - Coursera
- AWS DevOps Zero to Hero - GitHub
- Top AWS Project Ideas - KnowledgeHut
- AWS Step Functions Tutorials - AWS Docs
- Provision EKS with Terraform - HashiCorp
- Terraform EKS Tutorial - Spacelift
- AWS VPC Security Best Practices
- VPC Design Evolution - AWS Architecture Blog
- 13 Best AWS Books - Hackr.io