AWS Networking Deep Dive: From Zero to Cloud Network Architect

Goal

By the end of this learning path, you will have a deep, internalized understanding of AWS networking that goes beyond console clicks. You will visualize packet flow, design secure and cost-aware topologies, and make architectural tradeoffs with confidence.

You will be able to:

  • Visualize packet flow through complex AWS network topologies
  • Debug network issues by understanding exactly where traffic can and cannot go
  • Design secure, cost-effective networks that follow AWS best practices
  • Choose connectivity patterns like VPC Peering vs Transit Gateway vs PrivateLink
  • Implement hybrid connectivity that bridges on-premises and cloud

Why This Matters

Every application running in AWS sits on top of a network. If you don’t understand AWS networking, you’re essentially building skyscrapers without understanding the foundation. Misconfigurations in AWS networking are among the most common causes of outages, security breaches, and cost overruns.

The Real-World Impact (2025 Statistics)

The numbers don’t lie—AWS networking misconfigurations are a critical risk:

  • 23% of all cloud security incidents in 2025 stem from misconfigurations (Source)
  • 82% of cloud misconfigurations are caused by human error (Source)
  • 90% of cloud security failures are projected to result from misconfigurations by 2026 (Source)
  • 100% of surveyed organizations experienced a security incident on AWS in the past year (Source)
  • Average detection time for a configuration issue: over 180 days (Source)
  • Unauthorized data access incidents due to misconfiguration increased 22% in 2025 (Source)

Network infrastructure is the most vulnerable component because it’s directly accessible over the internet if VPCs and NACLs aren’t properly configured. In January 2025, the Codefinger ransomware group exploited compromised AWS credentials to encrypt data in S3 buckets—highlighting how a single networking or IAM misconfiguration can lead to catastrophic breaches (Source).

The Evolution: From Data Centers to VPCs

Traditional Data Center Networking           AWS VPC Software-Defined Networking
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━          ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Physical Topology:                          Logical Topology:

┌────────────────────────────────┐          ┌─────────────────────────────────┐
│  Physical Router/Switch Rack   │          │   VPC (10.0.0.0/16)             │
│                                │          │   ┌────────────┬────────────┐   │
│  ┌──────────┐   ┌──────────┐   │          │   │Public Sub  │Private Sub │   │
│  │ VLAN 10  │   │ VLAN 20  │   │          │   │10.0.1.0/24 │10.0.10.0/24│   │
│  │ (Web)    │   │ (App)    │   │          │   └────────────┴────────────┘   │
│  └──────────┘   └──────────┘   │          │   Routes managed via API        │
│  Configured manually via CLI   │          └─────────────────────────────────┘
└────────────────────────────────┘

• Fixed capacity (buy hardware)             • Elastic (spin up on-demand)
• Manual firewall rules (ACLs)              • Software-defined (Security Groups)
• Slow changes (submit tickets)             • Instant changes (API/Terraform)
• Physical failure = downtime               • Multi-AZ = automatic failover
• CapEx heavy ($100K+ for switches)         • OpEx model (pay per hour)

AWS VPCs solved the fundamental problem: how do you give customers isolated networks on shared physical infrastructure? The answer is software-defined networking (SDN), where AWS’s hypervisor enforces logical boundaries that are as strong as physical separation.

The Problems AWS Networking Solves:

  • How do you isolate workloads in a shared cloud environment? → VPCs with private IP ranges
  • How do resources in the cloud talk to each other securely? → Security Groups and NACLs
  • How do you connect your on-premises data center to AWS? → VPN and Direct Connect
  • How do you control who can access what, and from where? → Route tables, gateways, and firewall rules
  • How do you build networks that span regions and accounts? → Transit Gateway and VPC Peering

What You’ll Understand After This Learning Path:

  • Why VPCs exist and how they provide isolation using SDN encapsulation
  • How subnets, route tables, and gateways work together to control traffic flow
  • The difference between Security Groups and NACLs (and when to use each)
  • How VPC Peering vs Transit Gateway decisions are made (spoiler: Peering for 2-3 VPCs, Transit Gateway for 4+)
  • How hybrid connectivity (VPN, Direct Connect) works using BGP and IPSec
  • How to design highly-available, multi-region networks that survive AZ failures
  • How to secure traffic at the network perimeter using AWS Network Firewall
  • How to optimize costs while maintaining security (NAT Gateway alternatives, VPC Endpoints)

VPCs and Subnets: Your Logical Data Center

VPCs give you an isolated IPv4/IPv6 address space, and subnets carve that space into failure domains and routing boundaries. The most important mental model is that a subnet is a routing domain tied to an Availability Zone.

VPC 10.0.0.0/16
  |-- Public Subnet 10.0.1.0/24 (AZ-a)
  |-- Private Subnet 10.0.2.0/24 (AZ-a)
  |-- Private Subnet 10.0.3.0/24 (AZ-b)
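The carving above can be sanity-checked with Python's standard ipaddress module before you commit to a layout. A minimal sketch (the subnet names are illustrative, not AWS identifiers):

```python
from ipaddress import ip_network

vpc = ip_network("10.0.0.0/16")

# The subnets sketched above; names are illustrative labels
subnets = {
    "public-a":  ip_network("10.0.1.0/24"),
    "private-a": ip_network("10.0.2.0/24"),
    "private-b": ip_network("10.0.3.0/24"),
}

for name, net in subnets.items():
    # Every subnet must fall inside the VPC CIDR
    assert net.subnet_of(vpc), f"{name} is outside the VPC"
    print(name, net, f"{net.num_addresses} addresses")

# Subnets must not overlap each other
nets = list(subnets.values())
for i, a in enumerate(nets):
    for b in nets[i + 1:]:
        assert not a.overlaps(b), f"{a} overlaps {b}"
```

Running a check like this in CI catches overlapping or out-of-range CIDRs before Terraform ever talks to AWS.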

Routing and Gateways: Where Packets Can Go

Route tables decide the next hop. Gateways (IGW, NAT, TGW, VGW) are the exits and transit points. A single missing route explains most “it can’t connect” incidents.

Instance -> Subnet Route Table -> NAT Gateway -> IGW -> Internet
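The route-table decision in that chain is longest prefix match: the most specific matching route wins. A minimal sketch, with string targets standing in for real gateway IDs:

```python
from ipaddress import ip_address, ip_network

# A private subnet's route table: anything not local goes to NAT
# (targets here are illustrative labels, not AWS resource IDs)
route_table = {
    ip_network("10.0.0.0/16"): "local",
    ip_network("0.0.0.0/0"):   "nat-gateway",
}

def next_hop(dst: str) -> str:
    """Pick the route with the longest (most specific) matching prefix."""
    addr = ip_address(dst)
    matches = [net for net in route_table if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return route_table[best]

print(next_hop("10.0.2.15"))     # local: /16 beats /0
print(next_hop("140.82.121.4"))  # nat-gateway: only /0 matches
```

If `next_hop` has no match at all, a real VPC router simply drops the packet, which is what a "missing route" incident looks like.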

Security Boundaries: Security Groups vs NACLs

Security Groups are stateful, instance-level firewalls. NACLs are stateless, subnet-level filters. You need both to reason about reachability.

Key rules: SGs allow return traffic automatically, NACLs require explicit inbound and outbound rules.
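A minimal sketch of that asymmetry, with invented rule numbers modeled on NACL conventions (first match in rule-number order wins):

```python
# Outbound rules: allow HTTPS out, deny everything else (rule numbers illustrative)
outbound = [
    {"rule": 100,   "port_range": (443, 443),   "action": "allow"},
    {"rule": 32767, "port_range": (0, 65535),   "action": "deny"},   # the "*" catch-all
]
# Inbound rules: the NACL is stateless, so return traffic on ephemeral
# ports needs its own explicit allow
inbound = [
    {"rule": 100,   "port_range": (1024, 65535), "action": "allow"},
    {"rule": 32767, "port_range": (0, 65535),    "action": "deny"},
]

def evaluate(rules, port):
    # NACLs evaluate rules in ascending rule-number order; first match wins
    for r in sorted(rules, key=lambda r: r["rule"]):
        lo, hi = r["port_range"]
        if lo <= port <= hi:
            return r["action"]
    return "deny"

print(evaluate(outbound, 443))    # allow: the request leaves
print(evaluate(inbound, 49152))   # allow: the response returns on an ephemeral port
print(evaluate(inbound, 80))      # deny: no inbound rule covers port 80 here
```

Delete the inbound rule 100 and outbound HTTPS silently breaks: the request leaves but the response is dropped at the subnet boundary.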

Hybrid Connectivity: VPN and Direct Connect

When you connect on-prem to AWS, BGP becomes the control plane. IPSec VPNs are fast to set up but ride the public internet; Direct Connect gives predictable latency and bandwidth with private circuits.

VPC-to-VPC Connectivity: Peering, Transit Gateway, and PrivateLink

Peering is simple and point-to-point. Transit Gateway is a hub-and-spoke router for many VPCs. PrivateLink exposes services without exposing networks, which is crucial for multi-account security.
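The key behavioral difference is transitivity. A toy model (VPC names and attachment logic are illustrative, not the AWS API):

```python
# Peering is a direct edge with NO transitivity; a Transit Gateway is a hub
# that connects every attached VPC (assuming the default TGW route table).
peering = {("A", "B"), ("B", "C")}   # A<->B and B<->C peered; A and C are not
tgw_attachments = {"A", "B", "C"}

def reachable_via_peering(src, dst):
    # Traffic only flows over the single direct peering connection
    return (src, dst) in peering or (dst, src) in peering

def reachable_via_tgw(src, dst):
    # Any two attached VPCs can talk through the hub
    return src in tgw_attachments and dst in tgw_attachments and src != dst

print(reachable_via_peering("A", "C"))  # False: peering is not transitive
print(reachable_via_tgw("A", "C"))      # True: the TGW routes between spokes
```

This is why full-mesh peering needs n*(n-1)/2 connections while a Transit Gateway needs only n attachments.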

Observability and Troubleshooting

VPC Flow Logs, Reachability Analyzer, and Route 53 Resolver logs let you reconstruct packet paths. If you cannot explain why a flow is allowed or denied, you cannot secure it.


Prerequisites & Background Knowledge

Essential Prerequisites (Must Have)

Before starting these projects, you should have:

1. Basic Networking Fundamentals

  • Understand the OSI model (at least Layers 2-4: Data Link, Network, Transport)
  • Know what IP addresses, subnets, and CIDR notation mean (e.g., what is 192.168.1.0/24?)
  • Understand TCP vs UDP at a high level
  • Know what DNS does (domain names → IP addresses)

2. Linux/Unix Command Line Basics

  • Navigate filesystems (cd, ls, pwd)
  • View and edit files (cat, less, vim or nano)
  • Basic permissions and process management
  • SSH into remote servers

3. AWS Account & Basic Familiarity

  • Have an AWS account (free tier is sufficient for most projects)
  • Know how to navigate the AWS Console
  • Understand what EC2, S3, and IAM are at a high level
  • Have AWS CLI installed and configured (aws configure)

4. Infrastructure as Code Experience (Helpful)

  • Familiarity with Terraform, CloudFormation, or AWS CDK
  • Understanding of declarative vs imperative programming
  • If new to IaC: Complete Terraform’s “Get Started” tutorial first

Helpful But Not Required

You’ll learn these concepts through the projects, but having prior exposure helps:

  • BGP routing basics (you’ll learn this in Direct Connect projects)
  • IPSec VPN fundamentals (covered in hybrid connectivity projects)
  • Wireshark or tcpdump packet capture experience
  • Understanding of stateful vs stateless firewalls
  • Experience with Docker/containers (helpful for ECS networking projects)
  • Python or Bash scripting (for automation and log parsing)

Self-Assessment Questions

Answer these to verify you’re ready:

  • Can you explain what 10.0.0.0/16 means? (How many IPs? What’s the subnet mask?)
  • Do you know the difference between a route and a firewall rule?
  • Can you SSH into a Linux instance and run basic commands?
  • Have you deployed at least one EC2 instance in AWS?
  • Do you understand what a CIDR block overlap means?
  • Can you describe the difference between public and private IP addresses?

If you answered “no” to more than 2 questions, spend 1-2 weeks on networking fundamentals first. Resources:

  • Book: “Computer Networks” by Tanenbaum (Chapters 1, 4, 5)
  • Course: “Networking Fundamentals” on Coursera or similar
  • Hands-on: Set up a local network lab with VirtualBox VMs

Development Environment Setup

Required Tools:

# AWS CLI
$ aws --version
aws-cli/2.x.x Python/3.x.x

# Terraform (if using IaC)
$ terraform version
Terraform v1.6+

# SSH client
$ ssh -V
OpenSSH_8.x or higher

# Basic network utilities
$ which curl ping traceroute dig

Recommended Tools:

  • VS Code with AWS Toolkit extension (for Terraform/CloudFormation editing)
  • AWS Session Manager Plugin (for SSM access to private instances)
  • jq (for parsing AWS CLI JSON output)
  • awslogs or CloudWatch Logs Insights (for log analysis)
  • Draw.io or Lucidchart (for diagramming network architecture)

AWS Cost Management:

Most projects can run in the AWS Free Tier, but be aware of costs:

  • NAT Gateway: ~$0.045/hour ($33/month) - DELETE when not in use!
  • VPN Connections: ~$0.05/hour ($36.50/month)
  • Direct Connect: Expensive ($0.30/hour for 1Gbps port + data transfer)
  • VPC Flow Logs to S3: Minimal (a few cents/GB)

Cost-saving tips:

  1. Use terraform destroy or delete resources after each project
  2. Set up AWS Budgets with $50/month alert
  3. Use NAT Instances (t3.micro) for dev/test instead of NAT Gateway
  4. Enable VPC Flow Logs only when needed, and deliver them to S3 (cheaper than CloudWatch Logs)

Time Investment

Realistic time estimates per project:

Project Complexity            Time Estimate
Foundation projects (1-3)     6-10 hours each (spread over 3-5 days)
Intermediate projects (4-8)   10-15 hours each (1-2 weeks)
Advanced projects (9-13)      15-25 hours each (2-3 weeks)

Total learning path: 3-6 months if doing 1-2 projects per week alongside work.

Important Reality Check

This is NOT a tutorial. These projects are intentionally challenging. You will:

  • ❌ Get stuck and frustrated (this is normal)
  • ❌ Break things and need to rebuild (this is how you learn)
  • ❌ Spend hours debugging why a route isn’t working (welcome to networking)
  • ❌ Delete and recreate VPCs multiple times (iteration is learning)

But you will also:

  • ✅ Build a mental model that makes AWS networking intuitive
  • ✅ Understand WHY things work, not just HOW to configure them
  • ✅ Be able to debug production networking issues confidently
  • ✅ Design secure, scalable networks from scratch
  • ✅ Pass AWS networking certification questions easily

The goal isn’t to finish fast—it’s to understand deeply.


Concept Internalization Map

This table maps each major concept cluster to what you need to deeply internalize—not just memorize, but truly understand.

VPC Fundamentals

  • How AWS implements logical isolation using encapsulation
  • Why CIDR block selection matters (you can’t change the primary CIDR later)
  • How a VPC spans multiple AZs while subnets are AZ-specific
  • The relationship between VPC, subnets, route tables, and gateways
  • Default VPC vs custom VPC tradeoffs

Subnets & Routing

  • Route table longest-prefix-matching algorithm
  • Why “public” vs “private” is just routing, not a subnet setting
  • How packet flow works through route tables step-by-step
  • NAT Gateway vs NAT Instance tradeoffs
  • The implicit router at the VPC base address + 1 (e.g., 10.0.0.1 in a 10.0.0.0/16 VPC)
  • Why you lose 5 IPs per subnet (AWS reserved addresses)

Security Layers

  • Stateful vs stateless packet filtering mechanics
  • Connection tracking in Security Groups (how it actually works)
  • Why NACLs need ephemeral port ranges (1024-65535)
  • Defense in depth: when to use SG, NACL, WAF, Network Firewall
  • Rule evaluation order (NACL) vs all-rules evaluation (SG)
  • How to debug “connection refused” vs “timeout” (SG vs NACL/routing)

VPC Connectivity

  • Peering non-transitivity and why it exists
  • Transit Gateway routing and route table propagation
  • When to use Peering vs TGW vs PrivateLink
  • Cross-region peering limitations
  • How to avoid CIDR overlaps across VPCs
  • Shared VPC vs multi-VPC strategies

Hybrid Networking

  • BGP (Border Gateway Protocol) basics for Direct Connect
  • IPSec tunnel establishment and packet encapsulation
  • How AWS uses VLAN tagging for Direct Connect virtual interfaces
  • VPN vs DX cost structure (port hours vs data transfer)
  • Failover scenarios and routing priority
  • MACsec encryption for Direct Connect

DNS & Service Discovery

  • Route 53 Resolver and VPC DNS (the +2 address in each VPC)
  • Private Hosted Zones for internal DNS
  • How enableDnsHostnames and enableDnsSupport work
  • DNS resolution over VPC peering/TGW
  • Route 53 Resolver endpoints for hybrid DNS

Advanced Patterns

  • VPC Endpoints (Gateway vs Interface)
  • PrivateLink for service exposure without peering
  • IPv6 dual-stack VPCs
  • VPC Flow Logs for network monitoring
  • Network Firewall for perimeter security
  • Transit Gateway Network Manager for multi-region
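The "lose 5 IPs per subnet" rule above can be enumerated with Python's ipaddress module (shown for a hypothetical 10.0.1.0/24 subnet):

```python
from ipaddress import ip_network

subnet = ip_network("10.0.1.0/24")
hosts = list(subnet)   # all 256 addresses, including network and broadcast

# AWS reserves five addresses in every subnet:
reserved = {
    hosts[0]:  "network address",
    hosts[1]:  "VPC router (the implicit router)",
    hosts[2]:  "Amazon-provided DNS",
    hosts[3]:  "reserved for future use",
    hosts[-1]: "broadcast (not supported in VPCs, still reserved)",
}
for addr, why in reserved.items():
    print(addr, "-", why)

usable = subnet.num_addresses - len(reserved)
print("usable:", usable)   # 251 of 256
```

This is why a /28 subnet (16 addresses) yields only 11 usable IPs, which matters when sizing small subnets for endpoints or firewalls.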

Extended Reading Notes

This section expands on the concepts above with longer-form readings and context. Use it when you want deeper background beyond the core project path.

1. VPC & Subnets → Network Fundamentals

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 1: Sections on network layering, internetworking concepts
  • Chapter 5: The Network Layer - IP addressing, subnetting, CIDR notation
    • Section 5.6: IP Addresses (crucial for understanding VPC CIDR blocks)
    • Section 5.6.3: CIDR and route aggregation
    • Section 5.6.4: NAT (directly applicable to AWS NAT Gateways)

Why: AWS VPCs implement standard IP networking. Understanding CIDR, subnetting math, and network masks is essential for proper VPC design.

Key Takeaway: When you create a VPC with 10.0.0.0/16, you’re using CIDR notation that comes from the need to efficiently aggregate routes on the Internet. AWS didn’t invent this—they’re using established networking standards.
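Route aggregation, the motivation behind CIDR mentioned in this takeaway, can be demonstrated directly with the standard library:

```python
from ipaddress import ip_network, collapse_addresses

# Four adjacent /24s aggregate into a single /22 route advertisement,
# which is exactly why CIDR replaced classful addressing
nets = [ip_network(f"10.0.{i}.0/24") for i in range(4)]
print(list(collapse_addresses(nets)))   # [IPv4Network('10.0.0.0/22')]
```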


2. Route Tables & Packet Forwarding → Routing Algorithms

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 5: The Network Layer
    • Section 5.2: Routing algorithms (how routers make forwarding decisions)
    • Section 5.3: Hierarchical routing
    • Section 5.4: Broadcast and multicast routing

Why: AWS route tables use longest prefix matching—a fundamental routing concept. Understanding how routers make forwarding decisions helps you design efficient route tables.

Key Takeaway: The implicit router in every VPC (at the +1 address) uses the same forwarding logic as any Internet router. Route table lookups aren’t AWS magic—they’re standard routing algorithms.


3. Security Groups & NACLs → Firewalls & Packet Filtering

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 8: Network Security
    • Section 8.9: Firewalls (packet filtering, stateful inspection)

TCP/IP Illustrated, Volume 1 by Stevens

  • Chapter 13: TCP Connection Management
    • Understanding the three-way handshake helps you understand why stateful firewalls work
    • Section on connection tracking and state tables

Why: Security Groups are stateful packet filters—they track TCP connections. NACLs are stateless. Understanding the TCP state machine explains why Security Groups can auto-allow return traffic.

Key Takeaway: When you allow inbound port 80 in a Security Group, AWS maintains a connection tracking table (like conntrack in Linux iptables). This is why return traffic on ephemeral ports is automatically allowed.
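The mechanism in this takeaway can be sketched as a toy connection-tracking table. The real table lives in AWS's hypervisor; this is only a model of the idea:

```python
# Toy connection-tracking table: flows the instance initiated outbound
conntrack = set()

def outbound_allowed(flow):
    # Modeling an allow-all egress rule, the default in many SGs
    return True

def inbound_allowed(src, sport, dst, dport, inbound_rules):
    # Return traffic for a tracked flow is allowed regardless of inbound rules;
    # this is the "stateful" behavior of Security Groups
    if (dst, dport, src, sport) in conntrack:
        return True
    return dport in inbound_rules

# Instance opens a connection out to 140.82.121.4:443 from ephemeral port 49152
flow = ("10.0.10.50", 49152, "140.82.121.4", 443)
if outbound_allowed(flow):
    conntrack.add(flow)

# The response arrives on port 49152 with NO inbound rule for it: still allowed
print(inbound_allowed("140.82.121.4", 443, "10.0.10.50", 49152, inbound_rules={80}))
# An unrelated inbound connection to port 22 is still blocked
print(inbound_allowed("1.2.3.4", 50000, "10.0.10.50", 22, inbound_rules={80}))
```

NACLs have no such table, which is exactly why they need explicit ephemeral-port rules.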


4. NAT Gateway → Network Address Translation

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 5: The Network Layer
    • Section 5.6.4: Network Address Translation (NAT)
    • How NAT maintains translation tables
    • Port Address Translation (PAT/NAPT)

Why: AWS NAT Gateways use PAT to allow many private IPs to share one public IP. Understanding the NAT translation table helps you debug connectivity issues.

Key Takeaway: NAT is a workaround for IPv4 address exhaustion. AWS NAT Gateways translate thousands of private IPs to a single Elastic IP using port mapping—not AWS-specific, but an industry-standard technique.
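A sketch of the PAT translation table this takeaway describes. The IP and port numbers are invented for illustration:

```python
import itertools

PUBLIC_IP = "54.123.45.67"          # stands in for the NAT Gateway's Elastic IP
_ports = itertools.count(32001)     # illustrative public source-port allocator
table = {}                          # (private_ip, private_port) -> public_port

def translate_out(private_ip, private_port):
    # Outbound: rewrite the source to the shared public IP and a fresh port
    key = (private_ip, private_port)
    if key not in table:
        table[key] = next(_ports)
    return PUBLIC_IP, table[key]

def translate_in(public_port):
    # Return path: reverse-lookup the mapping to restore the private address
    for (ip, port), pub in table.items():
        if pub == public_port:
            return ip, port
    return None   # no state means unsolicited inbound traffic is dropped

print(translate_out("10.0.10.50", 49152))   # ('54.123.45.67', 32001)
print(translate_in(32001))                  # ('10.0.10.50', 49152)
print(translate_in(40000))                  # None
```

The `None` case is why you cannot initiate connections from the internet to instances behind a NAT Gateway.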


5. VPN & IPSec → Cryptographic Tunnels

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 8: Network Security
    • Section 8.7: IPSec (the protocol AWS VPN uses)
    • Section 8.8: VPNs
    • Tunnel mode vs transport mode

Why: AWS Site-to-Site VPN uses IPSec. Understanding how IPSec encrypts and encapsulates packets helps you troubleshoot VPN connectivity and understand performance characteristics.

Key Takeaway: IPSec adds overhead—encryption/decryption CPU cost and packet size increase. This is why VPN throughput is limited compared to Direct Connect’s raw physical connection.


6. BGP & Direct Connect → Routing Protocols

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 5: The Network Layer
    • Section 5.3.4: Routing in the Internet
    • Border Gateway Protocol (BGP) basics

Why: AWS Direct Connect uses BGP to exchange routes between your network and AWS. Understanding BGP path selection helps you control traffic flow and implement proper failover.

Key Takeaway: BGP is how the entire Internet exchanges routing information. AWS Direct Connect uses the same protocol, so learning BGP fundamentals gives you control over how traffic routes between your data center and AWS.


7. DNS in VPC → Domain Name System

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 7: The Application Layer
    • Section 7.1: DNS
    • Recursive vs iterative queries
    • DNS caching

Why: Every VPC has a DNS resolver at the +2 address (e.g., 10.0.0.2 in a 10.0.0.0/16 VPC). Route 53 Resolver endpoints allow DNS queries across hybrid networks.

Key Takeaway: AWS Route 53 Resolver is just a DNS recursive resolver—same concept as running your own DNS server, but managed by AWS.
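The +2 resolver address is simple arithmetic on the VPC CIDR. A small helper (the function name is mine, not an AWS API):

```python
from ipaddress import ip_network

def vpc_resolver(cidr: str) -> str:
    """The Amazon-provided DNS resolver sits at the VPC base address + 2."""
    net = ip_network(cidr)
    return str(net.network_address + 2)

print(vpc_resolver("10.0.0.0/16"))      # 10.0.0.2
print(vpc_resolver("172.31.0.0/16"))    # 172.31.0.2 (the default VPC's resolver)
```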


8. VPC Flow Logs → Network Monitoring

The Practice of Network Security Monitoring by Bejtlich

  • Chapter 2: Network Traffic Collection
  • Chapter 3: Network Traffic Analysis
    • Flow data vs packet capture
    • NetFlow and similar technologies

Why: VPC Flow Logs capture metadata about traffic (source, destination, ports, protocol, action). Understanding flow data helps you monitor and troubleshoot network security.

Key Takeaway: VPC Flow Logs are AWS’s implementation of NetFlow/IPFIX—industry-standard network monitoring. They don’t capture packet payloads, only metadata, which is why they’re efficient.
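A version-2 (default format) flow record is just a space-separated line of metadata, so it parses trivially. The record below is fabricated for illustration:

```python
# Field order of the default (version 2) Flow Log format
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()

# Fabricated example record: one accepted HTTPS flow (protocol 6 = TCP)
record = ("2 123456789012 eni-0abc1234 10.0.10.50 140.82.121.4 "
          "49152 443 6 10 840 1700000000 1700000060 ACCEPT OK")

flow = dict(zip(FIELDS, record.split()))
print(flow["srcaddr"], "->", flow["dstaddr"] + ":" + flow["dstport"], flow["action"])
# Note what is NOT here: no payload, no TLS details, only flow metadata
```

Filtering on `action == "REJECT"` over these records is often the fastest way to find which SG or NACL is dropping traffic.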


9. TCP/IP Deep Dive → Protocol Understanding

TCP/IP Illustrated, Volume 1 by Stevens

  • Chapter 2: The Internet Protocol
  • Chapter 13: TCP Connection Establishment and Termination
  • Chapter 17: TCP Interactive Data Flow
  • Chapter 20: TCP Bulk Data Flow

Why: Debugging AWS network issues requires understanding TCP behavior—SYN floods, connection timeouts, TCP window scaling, MTU issues.

Key Takeaway: AWS networking doesn’t change how TCP works. If you understand TCP’s three-way handshake, you’ll understand why a Security Group blocking port 80 results in a timeout (SYN never gets ACK) vs a connection refused.
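The timeout-vs-refused distinction can be reproduced locally: a closed port answers the SYN with a RST (immediate "connection refused"), while a firewall drop, like a Security Group, silently swallows the SYN and you see a timeout instead. A sketch against localhost:

```python
import socket

# With a listener, the TCP three-way handshake completes
listener = socket.socket()
listener.bind(("127.0.0.1", 0))          # let the OS pick a free port
listener.listen(1)
open_port = listener.getsockname()[1]

conn = socket.create_connection(("127.0.0.1", open_port), timeout=2)
connected = True
conn.close()
listener.close()

# With the listener gone, the kernel answers the SYN with a RST
# (a Security Group would DROP the SYN instead, producing a timeout)
refused = False
try:
    socket.create_connection(("127.0.0.1", open_port), timeout=2)
except ConnectionRefusedError:
    refused = True

print("connected:", connected, "| refused on closed port:", refused)
```

So in AWS: "connection refused" means the packet reached the host but nothing was listening; a hang-then-timeout points at SG, NACL, or routing.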


10. Linux Network Stack → Practical Implementation

The Linux Programming Interface by Kerrisk

  • Chapter 59: Sockets: Internet Domains
  • Chapter 61: Sockets: Advanced Topics

Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron

  • Chapter 11: Network Programming
    • Section 11.3: The Sockets Interface
    • Section 11.4: Client-Server model

Why: EC2 instances run Linux (or Windows, but Linux dominates). Understanding the socket API, network namespaces, and iptables helps you debug instance-level networking.

Key Takeaway: EC2 networking is just Linux networking with AWS-managed interfaces (ENIs). If you know how Linux network namespaces work, you’ll understand how containers on ECS/EKS get network isolation.


11. TLS/SSL & Certificates → Secure Communications

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 8: Network Security
    • Section 8.6: Communication Security (TLS/SSL)

High Performance Browser Networking by Ilya Grigorik

  • Chapter 4: Transport Layer Security (TLS)
    • TLS handshake
    • Certificate validation
    • Performance implications

Why: AWS Certificate Manager (ACM), Application Load Balancer TLS termination, and VPN encryption all use TLS/SSL. Understanding certificate chains and handshakes helps you troubleshoot HTTPS issues.

Key Takeaway: TLS adds latency (handshake) and CPU cost (encryption). Understanding this helps you make decisions about where to terminate TLS (ALB vs EC2) and when to use HTTP/2.


12. Performance & Latency → Network Optimization

High Performance Browser Networking by Ilya Grigorik

  • Chapter 1: Primer on Latency and Bandwidth
  • Chapter 2: Building Blocks of TCP
    • TCP’s impact on application performance
    • Slow start, congestion control

Why: AWS networking has latency characteristics—cross-AZ, cross-region, to Internet. Understanding TCP’s behavior helps you optimize application performance.

Key Takeaway: Cross-AZ traffic has ~1-2ms latency. Cross-region has 10-100+ms. TCP slow start means the first few KB of a connection are slower. This knowledge helps you design distributed systems on AWS.


How AWS Networking Actually Works: A Mental Model

Before you touch Terraform or the AWS console, you need a mental model of what’s actually happening when you create a VPC. This isn’t about memorizing console clicks—it’s about understanding the underlying mechanisms so deeply that you can debug any networking problem.

The Physical Reality Behind the Abstraction

When you create a VPC, you’re not actually creating a physical network. AWS runs a software-defined network (SDN) on top of its physical infrastructure. Here’s what’s actually happening:

┌─────────────────────────────────────────────────────────────────────────┐
│                        AWS PHYSICAL INFRASTRUCTURE                       │
│                                                                         │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                 │
│  │  Physical   │    │  Physical   │    │  Physical   │                 │
│  │  Server A   │    │  Server B   │    │  Server C   │   ... thousands │
│  │  (Host)     │    │  (Host)     │    │  (Host)     │       more      │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘                 │
│         │                  │                  │                         │
│         └──────────────────┴──────────────────┘                         │
│                            │                                            │
│              ┌─────────────┴─────────────┐                              │
│              │    AWS Physical Network    │                              │
│              │    (High-speed backbone)   │                              │
│              └─────────────┬─────────────┘                              │
└─────────────────────────────┼───────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    AWS SOFTWARE-DEFINED NETWORK LAYER                    │
│                                                                         │
│   Your VPC (10.0.0.0/16) is a LOGICAL construct overlaid on physical    │
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                         YOUR VPC                                 │  │
│   │                      (10.0.0.0/16)                               │  │
│   │                                                                  │  │
│   │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │  │
│   │  │Public Subnet │  │Private Subnet│  │  DB Subnet   │           │  │
│   │  │ 10.0.1.0/24  │  │ 10.0.10.0/24 │  │ 10.0.20.0/24 │           │  │
│   │  │              │  │              │  │              │           │  │
│   │  │  ┌───────┐   │  │  ┌───────┐   │  │  ┌───────┐   │           │  │
│   │  │  │ EC2-1 │   │  │  │ EC2-2 │   │  │  │  RDS  │   │           │  │
│   │  │  │10.0.1.│   │  │  │10.0.10│   │  │  │10.0.20│   │           │  │
│   │  │  │  50   │   │  │  │  50   │   │  │  │  50   │   │           │  │
│   │  │  └───────┘   │  │  └───────┘   │  │  └───────┘   │           │  │
│   │  └──────────────┘  └──────────────┘  └──────────────┘           │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Even though EC2-1 and EC2-2 might be on DIFFERENT physical servers,  │
│   they see each other as if on the same logical 10.0.0.0/16 network    │
└─────────────────────────────────────────────────────────────────────────┘

The Key Insight: Your VPC isn’t a physical network—it’s a set of rules that AWS’s SDN enforces. When EC2-1 sends a packet to EC2-2, the packet might traverse multiple physical switches and routers, but the SDN layer makes it appear as if they’re on the same local network.

How a Packet Flows Through Your VPC

Let’s trace a packet from an EC2 instance in a private subnet trying to reach the internet. This is the single most important thing to understand:

Step-by-Step: EC2 in Private Subnet → Internet

┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 1: EC2 Instance Generates Packet                                    │
│                                                                         │
│   EC2 (10.0.10.50) wants to reach api.github.com (140.82.121.4)         │
│                                                                         │
│   Packet created:                                                        │
│   ┌───────────────────────────────────────────────────────────────┐     │
│   │ Src IP: 10.0.10.50 │ Dst IP: 140.82.121.4 │ Dst Port: 443    │     │
│   └───────────────────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 2: Security Group Check (Outbound)                                  │
│                                                                         │
│   EC2's Security Group is checked for OUTBOUND rules                     │
│                                                                         │
│   sg-app-tier rules:                                                     │
│   ┌────────────────────────────────────────────────────────────────┐    │
│   │ Type     │ Protocol │ Port   │ Destination   │ Action          │    │
│   ├──────────┼──────────┼────────┼───────────────┼─────────────────┤    │
│   │ All      │ All      │ All    │ 0.0.0.0/0     │ ALLOW           │    │
│   └────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│   ✓ Outbound to 140.82.121.4:443 is ALLOWED                             │
│   (Security Groups are stateful - return traffic auto-allowed)          │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 3: Route Table Lookup                                               │
│                                                                         │
│   Private subnet's route table is consulted:                             │
│                                                                         │
│   ┌────────────────────────────────────────────────────────────────┐    │
│   │ Destination    │ Target                                        │    │
│   ├────────────────┼──────────────────────────────────────────────┤    │
│   │ 10.0.0.0/16    │ local (within VPC)                           │    │
│   │ 0.0.0.0/0      │ nat-gateway-id (NAT Gateway)                 │    │
│   └────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│   Destination 140.82.121.4 matches 0.0.0.0/0 → Send to NAT Gateway      │
│   (Longest prefix match: /0 is the catch-all for anything not local)    │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 4: NACL Check (Outbound from Private Subnet)                        │
│                                                                         │
│   Network ACL for private subnet is checked:                             │
│                                                                         │
│   ┌────────────────────────────────────────────────────────────────┐    │
│   │ Rule # │ Type  │ Protocol │ Port    │ Dest      │ Allow/Deny  │    │
│   ├────────┼───────┼──────────┼─────────┼───────────┼─────────────┤    │
│   │ 100    │ All   │ All      │ All     │ 0.0.0.0/0 │ ALLOW       │    │
│   │ *      │ All   │ All      │ All     │ 0.0.0.0/0 │ DENY        │    │
│   └────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│   ✓ Rule 100 matches → ALLOW                                            │
│   (NACLs are stateless - must also allow return traffic separately!)    │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 5: NAT Gateway Performs Translation                                 │
│                                                                         │
│   NAT Gateway in PUBLIC subnet receives packet and:                      │
│   1. Replaces source IP with its Elastic IP                              │
│   2. Tracks the connection in its translation table                      │
│                                                                         │
│   Original packet:                                                       │
│   ┌───────────────────────────────────────────────────────────────┐     │
│   │ Src: 10.0.10.50:49152 │ Dst: 140.82.121.4:443                │     │
│   └───────────────────────────────────────────────────────────────┘     │
│                              ↓                                          │
│   Translated packet:                                                     │
│   ┌───────────────────────────────────────────────────────────────┐     │
│   │ Src: 54.123.45.67:32001 │ Dst: 140.82.121.4:443              │     │
│   └───────────────────────────────────────────────────────────────┘     │
│                                                                         │
│   Translation table entry:                                               │
│   ┌───────────────────────────────────────────────────────────────┐     │
│   │ Internal: 10.0.10.50:49152 ↔ External: 54.123.45.67:32001    │     │
│   │ Destination: 140.82.121.4:443                                 │     │
│   └───────────────────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 6: Route to Internet Gateway                                        │
│                                                                         │
│   Public subnet's route table:                                           │
│   ┌────────────────────────────────────────────────────────────────┐    │
│   │ Destination    │ Target                                        │    │
│   ├────────────────┼──────────────────────────────────────────────┤    │
│   │ 10.0.0.0/16    │ local                                        │    │
│   │ 0.0.0.0/0      │ igw-id (Internet Gateway)                    │    │
│   └────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│   Packet sent to Internet Gateway → AWS backbone → Internet             │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 7: Response Returns (Reverse Path)                                  │
│                                                                         │
│   api.github.com responds:                                               │
│   ┌───────────────────────────────────────────────────────────────┐     │
│   │ Src: 140.82.121.4:443 │ Dst: 54.123.45.67:32001              │     │
│   └───────────────────────────────────────────────────────────────┘     │
│                                                                         │
│   1. IGW receives, routes to NAT Gateway (it's the destination)         │
│   2. NAT Gateway looks up translation table, reverse-translates:        │
│      ┌───────────────────────────────────────────────────────────┐      │
│      │ Src: 140.82.121.4:443 │ Dst: 10.0.10.50:49152             │      │
│      └───────────────────────────────────────────────────────────┘      │
│   3. Route table sends to private subnet (10.0.10.0/24 is local)        │
│   4. NACL inbound check (must ALLOW ephemeral ports 1024-65535!)        │
│   5. Security Group: return traffic auto-allowed (stateful!)            │
│   6. Packet delivered to EC2                                            │
└─────────────────────────────────────────────────────────────────────────┘
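The translation in Steps 5-7 boils down to a lookup table keyed by external port. Here is a minimal Python sketch of that idea (a toy model, not how the NAT Gateway is actually implemented; the Elastic IP and port numbers are the example values from the diagrams):

```python
# Toy model of NAT translation - illustrative only, not the real dataplane.
NAT_EIP = "54.123.45.67"
table = {}  # external port -> (internal ip, internal port, remote endpoint)

def outbound(src_ip, src_port, dst):
    """Rewrite the source of an outbound packet and record the flow."""
    ext_port = 32000 + len(table) + 1        # toy port-allocation scheme
    table[ext_port] = (src_ip, src_port, dst)
    return (NAT_EIP, ext_port, dst)

def inbound(dst_port, src):
    """Reverse-translate a reply; only tracked flows are forwarded."""
    int_ip, int_port, remote = table[dst_port]
    assert src == remote                     # replies must match the flow
    return (int_ip, int_port)

# EC2 10.0.10.50:49152 -> api.github.com (140.82.121.4:443)
print(outbound("10.0.10.50", 49152, ("140.82.121.4", 443)))
# ('54.123.45.67', 32001, ('140.82.121.4', 443))
print(inbound(32001, ("140.82.121.4", 443)))
# ('10.0.10.50', 49152)
```

Note that unsolicited inbound packets have no table entry, which is exactly why a NAT Gateway never accepts connections initiated from the internet.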

The Critical Insight: Notice how Security Groups and NACLs behave differently on the return path:

  • Security Group: Automatically allows return traffic (stateful)
  • NACL: Must explicitly allow inbound traffic on ephemeral ports (stateless)

This is why misconfigured NACLs cause “I can send but not receive” problems.


Security Groups vs NACLs: The Deep Dive

This is the most misunderstood part of AWS networking. Let’s break it down completely.

Connection Tracking: How Security Groups Work

Security Groups are stateful firewalls. This means they track connections. Here’s what happens under the hood:

┌─────────────────────────────────────────────────────────────────────────┐
│                    HOW SECURITY GROUP CONNECTION TRACKING WORKS          │
│                                                                         │
│   When you allow inbound port 443, AWS maintains a connection table:    │
│                                                                         │
│   CLIENT (203.0.113.50:52341) ──── SYN ────► EC2 (10.0.1.50:443)        │
│                                                                         │
│   Security Group sees: "New inbound connection to allowed port 443"     │
│                                                                         │
│   Connection Table Entry Created:                                        │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │ Connection ID: 12345                                             │   │
│   │ Protocol: TCP                                                    │   │
│   │ Local: 10.0.1.50:443                                             │   │
│   │ Remote: 203.0.113.50:52341                                       │   │
│   │ State: ESTABLISHED                                               │   │
│   │ Direction: INBOUND (originally)                                  │   │
│   │ Created: 2024-12-22 14:32:01                                     │   │
│   │ Last Activity: 2024-12-22 14:32:05                               │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│   EC2 (10.0.1.50:443) ──── SYN-ACK ────► CLIENT (203.0.113.50:52341)   │
│                                                                         │
│   Security Group sees: "Outbound to 203.0.113.50:52341"                 │
│   Checks connection table: "This is return traffic for connection 12345"│
│   Result: ALLOWED (no outbound rule needed!)                            │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
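You can capture this behavior in a few lines. The sketch below is a conceptual model, not AWS's implementation: a flow is tracked by its address tuple, and any packet that reverses a tracked tuple is treated as return traffic without consulting the rules at all:

```python
# Conceptual model of stateful filtering - not AWS's implementation.
# A flow is keyed by its (src, sport, dst, dport) tuple; a packet whose
# tuple reverses a tracked flow is return traffic and bypasses the rules.
conn_table = set()
inbound_allow = {443}  # allow inbound TCP 443; no outbound rules at all

def evaluate(direction, src, sport, dst, dport):
    if (dst, dport, src, sport) in conn_table:
        return "ALLOW (return traffic for tracked connection)"
    if direction == "inbound" and dport in inbound_allow:
        conn_table.add((src, sport, dst, dport))  # new flow: start tracking
        return "ALLOW (matched inbound rule)"
    return "DROP"

# Client SYN to 443: matches the inbound rule, flow gets tracked.
print(evaluate("inbound", "203.0.113.50", 52341, "10.0.1.50", 443))
# SYN-ACK back to the client: allowed even though NO outbound rule exists.
print(evaluate("outbound", "10.0.1.50", 443, "203.0.113.50", 52341))
```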

NACLs: Stateless Packet Filtering

NACLs don’t track connections. Each packet is evaluated independently:

┌─────────────────────────────────────────────────────────────────────────┐
│                    HOW NACLs EVALUATE PACKETS (STATELESS)                │
│                                                                         │
│   CLIENT (203.0.113.50:52341) ──── SYN ────► EC2 (10.0.1.50:443)        │
│                                                                         │
│   NACL Inbound Evaluation:                                               │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │ Packet: Src=203.0.113.50:52341, Dst=10.0.1.50:443, TCP          │   │
│   │                                                                  │   │
│   │ Rule 100: Allow TCP 443 from 0.0.0.0/0 → MATCH! → ALLOW         │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│   EC2 (10.0.1.50:443) ──── SYN-ACK ────► CLIENT (203.0.113.50:52341)   │
│                                                                         │
│   NACL Outbound Evaluation:                                              │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │ Packet: Src=10.0.1.50:443, Dst=203.0.113.50:52341, TCP          │   │
│   │                                                                  │   │
│   │ Rule 100: Allow TCP 443 from 0.0.0.0/0                          │   │
│   │          Port 443? NO! Destination port is 52341                │   │
│   │          → NO MATCH                                              │   │
│   │                                                                  │   │
│   │ You need: Allow TCP 1024-65535 to 0.0.0.0/0 (ephemeral ports)   │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│   ⚠️  WITHOUT EPHEMERAL PORT RULE, RETURN TRAFFIC IS BLOCKED!           │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
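The same exchange against a stateless rule set can be sketched as first-match evaluation over numbered rules (again an illustrative toy, not the NACL dataplane):

```python
# Stateless evaluation sketch: rules are checked in number order, first
# match wins, and there is no connection table - every packet stands alone.
def nacl_eval(rules, dport):
    for num in sorted(rules):
        low, high, action = rules[num]
        if low <= dport <= high:
            return f"rule {num}: {action}"
    return "rule *: DENY"  # the implicit deny-all at the end of every NACL

outbound_rules = {100: (443, 443, "ALLOW")}

# SYN-ACK back to the client's ephemeral port 52341:
print(nacl_eval(outbound_rules, 52341))  # rule *: DENY - reply is dropped!

# Add the ephemeral-port rule and the reply gets out:
outbound_rules[110] = (1024, 65535, "ALLOW")
print(nacl_eval(outbound_rules, 52341))  # rule 110: ALLOW
```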

The Complete Security Groups vs NACLs Comparison

| Aspect | Security Groups | NACLs |
|---|---|---|
| State | Stateful (tracks connections) | Stateless (each packet independent) |
| Scope | Instance/ENI level | Subnet level |
| Rules | Allow only | Allow AND Deny |
| Rule Order | All rules evaluated | Evaluated in numbered order |
| Return Traffic | Automatically allowed | Must be explicitly allowed |
| Default | Deny all inbound, allow all outbound | Allow all (default NACL) |
| Use Case | Primary security layer | Block specific IPs, defense-in-depth |

When a Connection Fails: Timeout vs Connection Refused

Understanding error messages helps you debug:

┌─────────────────────────────────────────────────────────────────────────┐
│                    DIAGNOSING NETWORK FAILURES                           │
│                                                                         │
│  SCENARIO 1: Connection Timeout                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│  $ curl https://10.0.2.50:443                                           │
│  curl: (28) Connection timed out after 30001 milliseconds               │
│                                                                         │
│  What happened:                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ CLIENT ──── SYN ────► [DROPPED SILENTLY] ──✗                    │    │
│  │                                                                  │    │
│  │ Packet never reached destination or response never came back    │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│  Causes:                                                                 │
│  • Security Group blocking (drops packet silently)                      │
│  • NACL blocking (drops packet silently)                                │
│  • Route table misconfiguration (packet sent to wrong place)            │
│  • NAT Gateway issue (no route to internet)                             │
│                                                                         │
│  ─────────────────────────────────────────────────────────────────────  │
│  SCENARIO 2: Connection Refused                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│  $ curl https://10.0.2.50:443                                           │
│  curl: (7) Failed to connect to 10.0.2.50 port 443: Connection refused  │
│                                                                         │
│  What happened:                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ CLIENT ──── SYN ────► EC2 ──── RST ────► CLIENT                 │    │
│  │                                                                  │    │
│  │ Packet reached destination, but nothing listening on that port  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│  Causes:                                                                 │
│  • Service not running on target port                                   │
│  • Service bound to localhost only (127.0.0.1)                          │
│  • iptables on EC2 blocking (OS-level firewall)                         │
│                                                                         │
│  KEY INSIGHT: Connection refused means network path is WORKING!         │
│  The problem is at the application level, not network level.            │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
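You can reproduce the two failure signatures locally with a short Python probe. This is a diagnostic sketch (the `probe` helper and its diagnosis strings are made up for illustration):

```python
import socket

def probe(host, port, timeout=2.0):
    """Return a coarse diagnosis for a TCP connect attempt (illustrative)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"                      # handshake completed
    except socket.timeout:
        return "timeout (filtered?)"       # silent drop: SG, NACL, or routing
    except ConnectionRefusedError:
        return "refused (host reachable)"  # RST came back: the path works
    finally:
        s.close()

# Find a port that is certainly closed: bind an ephemeral port, release it.
tmp = socket.socket()
tmp.bind(("127.0.0.1", 0))
closed_port = tmp.getsockname()[1]
tmp.close()

print(probe("127.0.0.1", closed_port))  # refused (host reachable)
```

Against a Security-Group-blocked EC2 you would see the timeout branch instead, because the SYN is dropped silently and no RST ever arrives.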

The VPC Address Space: Reserved IPs and CIDR Math

Every subnet loses 5 IP addresses to AWS. Understanding why helps you plan capacity:

┌─────────────────────────────────────────────────────────────────────────┐
│                    SUBNET: 10.0.1.0/24 (256 IPs theoretically)           │
│                                                                         │
│   Reserved by AWS:                                                       │
│   ┌───────────────┬───────────────────────────────────────────────────┐ │
│   │ 10.0.1.0      │ Network address (standard networking)              │ │
│   │ 10.0.1.1      │ VPC Router (implicit router for the subnet)       │ │
│   │ 10.0.1.2      │ DNS Server (Amazon-provided DNS)                  │ │
│   │ 10.0.1.3      │ Reserved for future use                           │ │
│   │ 10.0.1.255    │ Broadcast address (VPC doesn't support broadcast) │ │
│   └───────────────┴───────────────────────────────────────────────────┘ │
│                                                                         │
│   Usable IPs: 10.0.1.4 through 10.0.1.254 = 251 IPs                     │
│                                                                         │
│   ⚠️  CRITICAL: The VPC router (10.0.1.1) is how traffic leaves the     │
│   subnet. All outbound traffic goes here first, then route table        │
│   determines next hop.                                                   │
│                                                                         │
│   ⚠️  CRITICAL: The DNS server (10.0.1.2) is why enableDnsSupport and   │
│   enableDnsHostnames matter. Without these, EC2 instances can't         │
│   resolve DNS names.                                                     │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
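The arithmetic is easy to verify with Python's standard `ipaddress` module:

```python
import ipaddress

subnet = ipaddress.ip_network("10.0.1.0/24")
AWS_RESERVED = 5  # network addr, VPC router, DNS, "future use", broadcast

usable = subnet.num_addresses - AWS_RESERVED
first_usable = subnet.network_address + 4    # .0 through .3 are reserved
last_usable = subnet.broadcast_address - 1   # .255 is reserved

print(usable, first_usable, last_usable)  # 251 10.0.1.4 10.0.1.254
```

The same -5 applies to every subnet size, which is why a /28 (16 addresses) yields only 11 usable IPs.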

CIDR Block Planning: Common Mistakes

┌─────────────────────────────────────────────────────────────────────────┐
│                    CIDR PLANNING: AVOID THESE MISTAKES                   │
│                                                                         │
│   MISTAKE 1: VPC CIDR too small                                          │
│   ─────────────────────────────────────────────────────────────────────  │
│   Created: 10.0.0.0/24 (256 IPs)                                         │
│   Problem: Only room for ~4 small subnets                                │
│   The primary CIDR is fixed (only secondary CIDRs can be added later)!   │
│                                                                         │
│   Better: 10.0.0.0/16 (65,536 IPs) - room to grow                       │
│                                                                         │
│   ─────────────────────────────────────────────────────────────────────  │
│   MISTAKE 2: Overlapping CIDRs across VPCs                               │
│   ─────────────────────────────────────────────────────────────────────  │
│   VPC-A: 10.0.0.0/16                                                     │
│   VPC-B: 10.0.0.0/16  ← SAME CIDR!                                       │
│                                                                         │
│   Problem: Can NEVER peer these VPCs or connect via Transit Gateway     │
│   Routing would be ambiguous: is 10.0.1.50 in VPC-A or VPC-B?           │
│                                                                         │
│   Better: Plan non-overlapping ranges:                                   │
│   VPC-A: 10.0.0.0/16                                                     │
│   VPC-B: 10.1.0.0/16                                                     │
│   VPC-C: 10.2.0.0/16                                                     │
│                                                                         │
│   ─────────────────────────────────────────────────────────────────────  │
│   MISTAKE 3: Using 172.17.0.0/16 with Docker                             │
│   ─────────────────────────────────────────────────────────────────────  │
│   Docker's default bridge network: 172.17.0.0/16                         │
│                                                                         │
│   If your VPC is 172.17.0.0/16, containers can't reach VPC resources!   │
│   Route goes to Docker bridge instead of VPC router.                     │
│                                                                         │
│   Better: Avoid 172.17.x.x for VPCs if using Docker                     │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
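A quick check with the standard `ipaddress` module catches Mistake 2 before you create anything (the VPC names and CIDRs are the example values above):

```python
import ipaddress
from itertools import combinations

vpcs = {"VPC-A": "10.0.0.0/16", "VPC-B": "10.0.0.0/16", "VPC-C": "10.1.0.0/16"}
nets = {name: ipaddress.ip_network(cidr) for name, cidr in vpcs.items()}

# Every pair of VPCs that could never be peered or attached to one TGW:
conflicts = [
    (a, b)
    for (a, na), (b, nb) in combinations(nets.items(), 2)
    if na.overlaps(nb)
]
print(conflicts)  # [('VPC-A', 'VPC-B')]
```

Running this as a pre-flight check in CI against your full CIDR inventory is a cheap way to enforce the non-overlapping plan.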

VPC Flow Logs: Seeing What’s Actually Happening

VPC Flow Logs capture metadata about every network flow. They’re your primary debugging and security monitoring tool.

Flow Log Record Format

┌─────────────────────────────────────────────────────────────────────────┐
│                    VPC FLOW LOG RECORD (Version 2)                       │
│                                                                         │
│   Raw log entry:                                                         │
│   2 123456789012 eni-abc123 10.0.1.50 10.0.2.100 49152 443 6 25 1234    │
│   1639489200 1639489260 ACCEPT OK                                        │
│                                                                         │
│   Parsed:                                                                │
│   ┌───────────────────────────────────────────────────────────────────┐ │
│   │ version        │ 2                                                │ │
│   │ account-id     │ 123456789012                                     │ │
│   │ interface-id   │ eni-abc123 (which ENI saw this traffic)         │ │
│   │ srcaddr        │ 10.0.1.50 (source IP)                           │ │
│   │ dstaddr        │ 10.0.2.100 (destination IP)                     │ │
│   │ srcport        │ 49152 (source port - ephemeral)                 │ │
│   │ dstport        │ 443 (destination port - HTTPS)                  │ │
│   │ protocol       │ 6 (TCP - see IANA protocol numbers)            │ │
│   │ packets        │ 25                                               │ │
│   │ bytes          │ 1234                                             │ │
│   │ start          │ 1639489200 (Unix timestamp)                     │ │
│   │ end            │ 1639489260                                       │ │
│   │ action         │ ACCEPT (or REJECT)                              │ │
│   │ log-status     │ OK                                               │ │
│   └───────────────────────────────────────────────────────────────────┘ │
│                                                                         │
│   KEY INSIGHT: action=REJECT means Security Group OR NACL blocked it   │
│   Flow logs don't tell you WHICH one - you must investigate both       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
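Parsing a version-2 record is just splitting on whitespace and zipping against the field order (a minimal sketch; the field names follow the default v2 format shown above):

```python
# Field order of the default version-2 flow log format.
V2_FIELDS = [
    "version", "account-id", "interface-id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log-status",
]

def parse_flow_log(line: str) -> dict:
    """Split a space-delimited v2 record into named fields."""
    return dict(zip(V2_FIELDS, line.split()))

record = parse_flow_log(
    "2 123456789012 eni-abc123 10.0.1.50 10.0.2.100 49152 443 6 25 1234 "
    "1639489200 1639489260 ACCEPT OK"
)
print(record["dstport"], record["protocol"], record["action"])  # 443 6 ACCEPT
```

In practice you would run queries like this in CloudWatch Logs Insights or Athena rather than in a script, but the field mapping is identical.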

Using Flow Logs to Detect Attacks

┌─────────────────────────────────────────────────────────────────────────┐
│                    SECURITY PATTERNS IN FLOW LOGS                        │
│                                                                         │
│   PATTERN 1: Port Scan Detection                                         │
│   ─────────────────────────────────────────────────────────────────────  │
│   Same source IP hitting many destination ports in short time:          │
│                                                                         │
│   203.0.113.50 → 10.0.1.100:22   REJECT                                 │
│   203.0.113.50 → 10.0.1.100:23   REJECT                                 │
│   203.0.113.50 → 10.0.1.100:80   REJECT                                 │
│   203.0.113.50 → 10.0.1.100:443  REJECT                                 │
│   203.0.113.50 → 10.0.1.100:3306 REJECT                                 │
│   203.0.113.50 → 10.0.1.100:5432 REJECT                                 │
│                                                                         │
│   Query: Group by srcaddr, count distinct dstport, filter > 10 ports    │
│   Action: Add 203.0.113.50 to NACL deny list                            │
│                                                                         │
│   ─────────────────────────────────────────────────────────────────────  │
│   PATTERN 2: Data Exfiltration                                           │
│   ─────────────────────────────────────────────────────────────────────  │
│   Unusual outbound data volume to external IP:                          │
│                                                                         │
│   10.0.2.50 → 185.143.223.100:443 ACCEPT bytes=2,147,483,648            │
│   (2 GB to unknown external IP - possible data theft!)                  │
│                                                                         │
│   Query: Group by srcaddr+dstaddr, sum bytes, filter external + large   │
│   Action: Investigate instance, check against threat intelligence       │
│                                                                         │
│   ─────────────────────────────────────────────────────────────────────  │
│   PATTERN 3: SSH Brute Force                                             │
│   ─────────────────────────────────────────────────────────────────────  │
│   Many rejected connections to port 22 from same source:                │
│                                                                         │
│   185.234.x.x → 10.0.1.100:22 REJECT (1000+ times in 1 hour)            │
│                                                                         │
│   Query: Filter dstport=22 AND action=REJECT, group by srcaddr          │
│   Action: Block at NACL, consider IP reputation service                 │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
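Pattern 1 translates directly into a group-by over REJECT records. A toy sketch with hypothetical sample data (the threshold is lowered to fit the tiny sample; the text's >10-distinct-ports rule is more realistic against real flow logs):

```python
# Toy port-scan detector over REJECT records - sample data is hypothetical.
from collections import defaultdict

rejects = [  # (srcaddr, dstaddr, dstport) pulled from action=REJECT records
    ("203.0.113.50", "10.0.1.100", p) for p in (22, 23, 80, 443, 3306, 5432)
] + [("198.51.100.7", "10.0.1.100", 22)]

ports_by_src = defaultdict(set)
for src, _dst, port in rejects:
    ports_by_src[src].add(port)

SCAN_THRESHOLD = 5  # lowered for this sample; use >10 in production
scanners = [s for s, ports in ports_by_src.items() if len(ports) > SCAN_THRESHOLD]
print(scanners)  # ['203.0.113.50']
```

Patterns 2 and 3 are the same shape: group by source (or source+destination), aggregate bytes or reject counts, and alert above a threshold.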

Quick Start Guide: Your First 48 Hours

Feeling overwhelmed by the amount of information? Start here. This is your “getting started” roadmap for the first weekend.

Day 1 (Saturday): Foundation Understanding

Morning (3 hours):

  1. Read the “Why This Matters” section (30 minutes)
    • Understand the real-world impact of AWS networking
    • Review the statistics and breach examples
    • Study the “Traditional vs VPC” ASCII diagram
  2. Study “How AWS Networking Actually Works” (1 hour)
    • Focus on the SDN mental model
    • Trace the packet flow diagram step-by-step
    • Draw it yourself on paper
  3. Review Security Groups vs NACLs (1 hour)
    • Understand the statefulness concept
    • Study the comparison table
    • Read the connection tracking explanation

Afternoon (3 hours):

  1. Set up your environment (1 hour)
    • Install AWS CLI and configure credentials
    • Install Terraform
    • Create a dedicated AWS account or use an existing one with budget alerts
  2. Start Project 1: VPC from Scratch (2 hours)
    • Read the entire project description
    • Study the CIDR planning diagram
    • Create your first Terraform file with just the VPC resource
    • Deploy it: terraform init && terraform apply
    • Verify in AWS Console

Evening (1 hour):

  1. Read prerequisite chapters (if needed)
    • Review CIDR notation if still unclear
    • Read about subnet masks and IP addressing

Day 2 (Sunday): Build Your First VPC

Morning (4 hours):

  1. Complete Project 1 (3 hours)
    • Add subnets across two AZs
    • Configure Internet Gateway
    • Set up route tables
    • Deploy and test connectivity
  2. Verification (1 hour)
    • Launch a test EC2 in the public subnet
    • Verify you can SSH to it
    • Verify it can reach the internet
    • Review VPC Flow Logs (even if brief)

Afternoon (3 hours):

  1. Study what you built (1 hour)
    • Use AWS Console to visualize your VPC
    • Run terraform state list to see all resources
    • Understand each resource’s purpose
  2. Experiment and break things (2 hours)
    • Delete a route and observe what breaks
    • Remove the IGW and see what happens
    • Misconfigure a Security Group intentionally
    • Document what you learned from each failure

Evening (1 hour):

  1. Clean up and reflect
    • Run terraform destroy to avoid costs
    • Write down 3 key insights you gained
    • Identify 2 concepts you’re still unclear about

Week 1 Next Steps

After the first weekend, continue with:

  • Monday-Wednesday: Read the Deep Dive Reading chapters for VPC concepts
  • Thursday-Friday: Start Project 2 (Security Group Debugger)
  • Weekend 2: Complete Project 2 and start Project 3 (Flow Logs)

Quick Start Checklist

Use this to track your first 48 hours:

  • Environment set up (AWS CLI, Terraform installed)
  • AWS account configured with budget alerts
  • Read “Why This Matters” and “How It Works” sections
  • Understand Security Groups vs NACLs conceptually
  • Deployed first VPC with Terraform
  • Successfully SSH’d to EC2 in public subnet
  • Verified internet connectivity from EC2
  • Intentionally broke something and fixed it
  • Cleaned up resources (terraform destroy)
  • Identified 2-3 concepts to study deeper

If You Get Stuck

Problem: “I don’t understand CIDR notation” → Solution: Read the network layer chapter of “Computer Networks” by Tanenbaum or watch a 10-minute YouTube video on subnetting

Problem: “Terraform keeps failing” → Solution: Run terraform plan first, read error messages carefully, check AWS region consistency

Problem: “Can’t SSH to my EC2” → Solution: Check: (1) Security Group allows port 22 from your IP, (2) Instance has public IP, (3) Route table has IGW route

Problem: “This is taking too long” → Solution: That’s normal! Networking is complex. Focus on understanding one concept at a time.

What Success Looks Like After 48 Hours

You should be able to:

  • ✅ Explain in your own words what a VPC is
  • ✅ Draw a simple VPC architecture diagram on paper
  • ✅ Deploy a basic VPC using Terraform
  • ✅ Launch an EC2 and verify connectivity
  • ✅ Understand why Security Groups are stateful
  • ✅ Know the difference between public and private subnets

You do NOT need to:

  • ❌ Memorize all AWS networking services
  • ❌ Understand Transit Gateway or Direct Connect yet
  • ❌ Be able to design multi-region architectures
  • ❌ Know every Terraform resource parameter

Remember: You’re building foundational understanding. Speed comes with practice.


Concept Summary Table

| Concept Cluster | What You Need to Internalize |
|---|---|
| VPCs and Subnets | CIDR planning, AZ boundaries, and how route tables define reachability. |
| Routing and Gateways | IGW, NAT, TGW, VGW are traffic exits and transit points with explicit routes. |
| Security Boundaries | Stateful Security Groups vs stateless NACLs and how they combine. |
| Hybrid Connectivity | IPSec VPN vs Direct Connect and BGP route propagation. |
| Multi-VPC Topologies | Peering for small graphs, Transit Gateway for hubs, PrivateLink for service exposure. |
| Observability | Flow Logs and reachability analysis explain why traffic is allowed or dropped. |

Deep Dive Reading by Concept

This section maps each concept to specific book chapters. Read these before or alongside the projects.

| Concept | Book & Chapter |
|---|---|
| VPC Fundamentals | AWS Certified Advanced Networking - Specialty Official Study Guide - Ch. 2: “VPC Fundamentals” |
| Routing and Gateways | AWS Certified Advanced Networking - Specialty Official Study Guide - Ch. 3: “Routing and Connectivity” |
| Security Groups and NACLs | AWS Certified Advanced Networking - Specialty Official Study Guide - Ch. 4: “Network Security” |
| BGP for Hybrid Connectivity | Routing TCP/IP, Volume II by Jeff Doyle and Jennifer Carroll - the BGP chapters |
| NAT and Addressing | Computer Networking: A Top-Down Approach by Kurose and Ross - Ch. 4: “Network Layer” |

Project 1: Build a Production-Ready VPC from Scratch

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: AWS CDK (TypeScript), CloudFormation (YAML), Pulumi
  • Coolness Level: Level 1: Pure Corporate Snoozefest
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Cloud Infrastructure / Networking
  • Software or Tool: AWS VPC, Terraform
  • Main Book: “AWS Certified Solutions Architect Study Guide” or AWS Documentation

What you’ll build: A fully functional VPC with public and private subnets across multiple Availability Zones, Internet Gateway, NAT Gateway, proper route tables, and Security Groups—deployable with a single command.

Why it teaches AWS networking: This is the foundation of EVERYTHING in AWS. By building it from scratch with infrastructure-as-code, you’ll understand every component and how they fit together. You can’t just click through the console—you must explicitly define every relationship.

Core challenges you’ll face:

  • CIDR block planning (avoiding overlaps, sizing for growth) → maps to IP address management
  • Multi-AZ subnet design (high availability, redundancy) → maps to fault tolerance
  • Route table associations (which subnet goes where) → maps to traffic flow control
  • NAT Gateway placement (public subnet, Elastic IP) → maps to outbound internet access
  • Security Group design (least privilege, references between groups) → maps to micro-segmentation

Key Concepts:

  • Difficulty: Intermediate
  • Time estimate: 1 week
  • Prerequisites: Basic AWS console familiarity, CLI setup

Real world outcome:

$ terraform init && terraform apply

Plan: 23 to add, 0 to change, 0 to destroy.

aws_vpc.main: Creating...
aws_vpc.main: Creation complete after 2s [id=vpc-0abc123def456]
aws_subnet.public_a: Creating...
aws_subnet.public_b: Creating...
aws_subnet.private_a: Creating...
aws_subnet.private_b: Creating...
aws_internet_gateway.main: Creating...
aws_nat_gateway.main: Creating...
aws_route_table.public: Creating...
aws_route_table.private: Creating...
...

Apply complete! Resources: 23 added, 0 changed, 0 destroyed.

Outputs:
vpc_id = "vpc-0abc123def456"
public_subnet_ids = ["subnet-pub-a", "subnet-pub-b"]
private_subnet_ids = ["subnet-priv-a", "subnet-priv-b"]
nat_gateway_ip = "54.123.45.67"

# Verify connectivity
$ aws ec2 run-instances --subnet-id subnet-priv-a --image-id ami-xxx
$ ssh -J bastion@54.x.x.x ec2-user@10.0.10.15
[ec2-user@ip-10-0-10-15 ~]$ curl ifconfig.me
54.123.45.67  # Traffic exits via NAT Gateway!

Implementation Hints: The VPC structure should look like:

VPC: 10.0.0.0/16 (65,536 IPs)
├── Public Subnet A: 10.0.1.0/24 (AZ-a) - 256 IPs
├── Public Subnet B: 10.0.2.0/24 (AZ-b) - 256 IPs
├── Private Subnet A: 10.0.10.0/24 (AZ-a) - 256 IPs
├── Private Subnet B: 10.0.11.0/24 (AZ-b) - 256 IPs
├── Database Subnet A: 10.0.20.0/24 (AZ-a) - 256 IPs
└── Database Subnet B: 10.0.21.0/24 (AZ-b) - 256 IPs
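You can derive this plan programmatically with Python's standard `ipaddress` module, which also guarantees the subnets fit inside the VPC and don't overlap:

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
slash24s = list(vpc.subnets(new_prefix=24))  # 256 non-overlapping /24s

plan = {
    "public-a":  slash24s[1],   # 10.0.1.0/24
    "public-b":  slash24s[2],   # 10.0.2.0/24
    "private-a": slash24s[10],  # 10.0.10.0/24
    "private-b": slash24s[11],  # 10.0.11.0/24
    "db-a":      slash24s[20],  # 10.0.20.0/24
    "db-b":      slash24s[21],  # 10.0.21.0/24
}
# The gaps between tiers (.3-.9, .12-.19) leave room to add subnets later.
print(plan["private-a"])  # 10.0.10.0/24
```

Generating the CIDRs this way (e.g. via Terraform's `cidrsubnet()` function, which does the same arithmetic) beats hand-typing them into each subnet resource.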

Key Terraform resources to create:

# Pseudo-Terraform structure
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_subnet" "public" {
  for_each = { "a" = "10.0.1.0/24", "b" = "10.0.2.0/24" }
  vpc_id                  = aws_vpc.main.id
  cidr_block              = each.value
  availability_zone       = "us-east-1${each.key}"
  map_public_ip_on_launch = true
}

resource "aws_nat_gateway" "main" {
  subnet_id     = aws_subnet.public["a"].id
  allocation_id = aws_eip.nat.id
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }
}

Learning milestones:

  1. VPC deploys with all subnets → You understand VPC structure
  2. EC2 in public subnet is reachable → You understand IGW and public routing
  3. EC2 in private subnet can reach internet → You understand NAT Gateway flow
  4. Resources in private subnet are NOT directly reachable → You understand isolation

Real World Outcome

When you complete this project, you will have a fully deployable VPC infrastructure that you can use for any application. Here’s exactly what you’ll see:

# Step 1: Initialize and apply your Terraform configuration
$ cd vpc-project && terraform init

Initializing the backend...
Initializing provider plugins...
- Finding hashicorp/aws versions matching "~> 5.0"...
- Installing hashicorp/aws v5.31.0...

Terraform has been successfully initialized!

$ terraform plan

Terraform will perform the following actions:

  # aws_eip.nat will be created
  + resource "aws_eip" "nat" {
      + allocation_id        = (known after apply)
      + domain               = "vpc"
      + public_ip            = (known after apply)
    }

  # aws_internet_gateway.main will be created
  + resource "aws_internet_gateway" "main" {
      + id      = (known after apply)
      + vpc_id  = (known after apply)
    }

  # aws_nat_gateway.main will be created
  + resource "aws_nat_gateway" "main" {
      + allocation_id        = (known after apply)
      + connectivity_type    = "public"
      + public_ip            = (known after apply)
      + subnet_id            = (known after apply)
    }

  # aws_vpc.main will be created
  + resource "aws_vpc" "main" {
      + cidr_block           = "10.0.0.0/16"
      + enable_dns_hostnames = true
      + enable_dns_support   = true
      + id                   = (known after apply)
    }

  ... (23 resources total)

Plan: 23 to add, 0 to change, 0 to destroy.

$ terraform apply -auto-approve

aws_vpc.main: Creating...
aws_vpc.main: Creation complete after 2s [id=vpc-0abc123def456789]
aws_internet_gateway.main: Creating...
aws_subnet.public["a"]: Creating...
aws_subnet.public["b"]: Creating...
aws_subnet.private["a"]: Creating...
aws_subnet.private["b"]: Creating...
aws_internet_gateway.main: Creation complete after 1s [id=igw-0def456789abc123]
aws_eip.nat: Creating...
aws_eip.nat: Creation complete after 1s [id=eipalloc-0123456789abcdef]
aws_nat_gateway.main: Creating...
aws_nat_gateway.main: Still creating... [1m0s elapsed]
aws_nat_gateway.main: Creation complete after 1m45s [id=nat-0abcdef123456789]
aws_route_table.private: Creating...
aws_route_table.public: Creating...
...

Apply complete! Resources: 23 added, 0 changed, 0 destroyed.

Outputs:

vpc_id = "vpc-0abc123def456789"
vpc_cidr = "10.0.0.0/16"
public_subnet_ids = [
  "subnet-0pub1111111111111",
  "subnet-0pub2222222222222",
]
private_subnet_ids = [
  "subnet-0prv1111111111111",
  "subnet-0prv2222222222222",
]
database_subnet_ids = [
  "subnet-0db11111111111111",
  "subnet-0db22222222222222",
]
nat_gateway_public_ip = "54.123.45.67"
internet_gateway_id = "igw-0def456789abc123"

# Step 2: Verify the infrastructure in AWS Console or CLI
$ aws ec2 describe-vpcs --vpc-ids vpc-0abc123def456789 --no-cli-pager

{
    "Vpcs": [{
        "CidrBlock": "10.0.0.0/16",
        "VpcId": "vpc-0abc123def456789",
        "State": "available"
    }]
}

# Note: DNS attributes don't appear in describe-vpcs output; check them with:
# aws ec2 describe-vpc-attribute --vpc-id vpc-0abc123def456789 --attribute enableDnsHostnames

# Step 3: Test connectivity by launching instances
$ aws ec2 run-instances \
    --image-id ami-0c55b159cbfafe1f0 \
    --instance-type t3.micro \
    --subnet-id subnet-0prv1111111111111 \
    --security-group-ids sg-0app1111111111111 \
    --key-name my-key \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=test-private}]'

# Step 4: SSH via bastion and verify NAT Gateway
$ ssh -J ec2-user@bastion.example.com ec2-user@10.0.10.50

[ec2-user@ip-10-0-10-50 ~]$ curl -s ifconfig.me
54.123.45.67

# Your private instance's traffic exits via the NAT Gateway!
# The public IP shown is the NAT Gateway's Elastic IP, not the instance's

[ec2-user@ip-10-0-10-50 ~]$ curl -s https://api.github.com | head -5
{
  "current_user_url": "https://api.github.com/user",
  "current_user_authorizations_html_url": "https://github.com/...",
  ...
}

# Private subnet instance can reach the internet (outbound only)!

Visual Architecture You’ve Built:

┌─────────────────────────────────────────────────────────────────────────────┐
│                              YOUR VPC (10.0.0.0/16)                          │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                        AVAILABILITY ZONE A                           │   │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐      │   │
│  │  │  Public Subnet  │  │  Private Subnet │  │    DB Subnet    │      │   │
│  │  │  10.0.1.0/24    │  │  10.0.10.0/24   │  │  10.0.20.0/24   │      │   │
│  │  │                 │  │                 │  │                 │      │   │
│  │  │  ┌───────────┐  │  │  ┌───────────┐  │  │  ┌───────────┐  │      │   │
│  │  │  │NAT Gateway│  │  │  │ App Server│  │  │  │    RDS    │  │      │   │
│  │  │  │  + EIP    │  │  │  │           │  │  │  │  Primary  │  │      │   │
│  │  │  └───────────┘  │  │  └───────────┘  │  │  └───────────┘  │      │   │
│  │  │  ┌───────────┐  │  │                 │  │                 │      │   │
│  │  │  │  Bastion  │  │  │                 │  │                 │      │   │
│  │  │  │   Host    │  │  │                 │  │                 │      │   │
│  │  │  └───────────┘  │  │                 │  │                 │      │   │
│  │  └─────────────────┘  └─────────────────┘  └─────────────────┘      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                        AVAILABILITY ZONE B                           │   │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐      │   │
│  │  │  Public Subnet  │  │  Private Subnet │  │    DB Subnet    │      │   │
│  │  │  10.0.2.0/24    │  │  10.0.11.0/24   │  │  10.0.21.0/24   │      │   │
│  │  │                 │  │                 │  │                 │      │   │
│  │  │  ┌───────────┐  │  │  ┌───────────┐  │  │  ┌───────────┐  │      │   │
│  │  │  │    ALB    │  │  │  │ App Server│  │  │  │    RDS    │  │      │   │
│  │  │  │  (spare)  │  │  │  │  (spare)  │  │  │  │  Standby  │  │      │   │
│  │  │  └───────────┘  │  │  └───────────┘  │  │  └───────────┘  │      │   │
│  │  └─────────────────┘  └─────────────────┘  └─────────────────┘      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌──────────────────────┐                                                  │
│  │   Internet Gateway   │◄───── 0.0.0.0/0 route from public subnets        │
│  └──────────────────────┘                                                  │
└─────────────────────────────────────────────────────────────────────────────┘
                │
                ▼
          ┌──────────┐
          │ Internet │
          └──────────┘

The Core Question You’re Answering

“What actually makes a subnet ‘public’ vs ‘private’, and how does traffic flow between them and to the internet?”

This question gets to the heart of VPC design. There’s no checkbox that says “make this subnet public.” A subnet is public because of its route table configuration—specifically, whether it has a route to an Internet Gateway. Understanding this relationship is the foundation of all AWS networking.


Concepts You Must Understand First

Stop and research these before coding:

  1. CIDR Notation and Subnetting
    • What does 10.0.0.0/16 actually mean in binary?
    • How do you calculate how many IPs are in a /24 vs /20 vs /16?
    • Why can’t VPC CIDRs overlap if you want to peer them?
    • What’s the difference between 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16? (RFC 1918)
    • Book Reference: “Computer Networks, Fifth Edition” by Tanenbaum — Ch. 5.6: IP Addresses
  2. Route Tables and Longest Prefix Match
    • How does a router decide where to send a packet?
    • If you have routes for 10.0.0.0/16 and 10.0.1.0/24, which wins for 10.0.1.50?
    • Why does every VPC route table have a “local” route?
    • What happens if there’s no matching route?
    • Book Reference: “Computer Networks, Fifth Edition” by Tanenbaum — Ch. 5.2: Routing Algorithms
  3. NAT (Network Address Translation)
    • Why can’t private IPs (10.x.x.x) be used on the public internet?
    • How does NAT “hide” hundreds of instances behind one public IP?
    • What’s the difference between SNAT (source NAT) and DNAT (destination NAT)?
    • Why does NAT break some protocols (like FTP active mode)?
    • Book Reference: “Computer Networks, Fifth Edition” by Tanenbaum — Ch. 5.6.4: Network Address Translation
  4. Availability Zones and High Availability
    • What actually IS an Availability Zone physically?
    • Why do you need subnets in multiple AZs?
    • What’s the latency between AZs in the same region?
    • What happens if one AZ goes down?
    • Book Reference: AWS Well-Architected Framework — Reliability Pillar
  5. DNS in VPCs
    • What is the .2 address in every VPC (e.g., 10.0.0.2)?
    • What do enableDnsSupport and enableDnsHostnames actually do?
    • Why do some services require DNS hostnames to work?
    • Book Reference: “Computer Networks, Fifth Edition” by Tanenbaum — Ch. 7.1: DNS
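
The CIDR and prefix-match questions above can be explored directly with Python's standard ipaddress module — a quick way to check subnet sizes, overlap, and RFC 1918 membership before committing to a layout:

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnet = ipaddress.ip_network("10.0.1.0/24")

# A /16 holds 65,536 addresses; a /24 holds 256
# (AWS reserves 5 of those per subnet)
print(vpc.num_addresses)     # 65536
print(subnet.num_addresses)  # 256

# 10.0.1.0/24 sits inside 10.0.0.0/16, so the two overlap --
# which is exactly why two peered VPCs must not share address space
print(subnet.subnet_of(vpc))                              # True
print(vpc.overlaps(ipaddress.ip_network("10.1.0.0/16")))  # False

# All three RFC 1918 ranges are "private": not routable on the internet
print(ipaddress.ip_address("10.0.10.50").is_private)      # True
```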

Questions to Guide Your Design

Before implementing, think through these:

  1. CIDR Planning
    • How many IP addresses do you need today? In 5 years?
    • If you use 10.0.0.0/16 for this VPC, what CIDRs will you use for future VPCs?
    • How will you organize subnets? By tier (web/app/db)? By AZ? Both?
    • What if you later need to peer with a VPC that has 10.0.0.0/16?
  2. Subnet Design
    • Why put public and private subnets in separate CIDR ranges (10.0.1.x vs 10.0.10.x)?
    • How many IPs do you need per subnet? (Remember: AWS reserves 5 per subnet)
    • Should database subnets be different from app private subnets?
    • Do you need isolated subnets (no internet access at all)?
  3. Route Table Design
    • How many route tables do you need? (Hint: minimum 2 - public and private)
    • Should each private subnet have its own NAT Gateway for HA?
    • What happens to private subnet traffic if the NAT Gateway fails?
  4. NAT Gateway vs NAT Instance
    • NAT Gateway: Managed, scales automatically, ~$0.045/hour + data processing
    • NAT Instance: Self-managed EC2, cheaper for low traffic, single point of failure
    • When would you choose one over the other?
  5. Cost Considerations
    • NAT Gateway costs: $0.045/hour × 730 hours/month = $32.85/month just to exist
    • Plus $0.045/GB data processed
    • How can you reduce costs? (VPC Endpoints for AWS services, consider NAT Instance for dev)
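
The cost arithmetic above, as a small sketch (using the us-east-1 rates quoted in this section — verify current pricing for your region):

```python
HOURLY_RATE = 0.045  # $/hour per NAT Gateway (us-east-1 rate quoted above)
DATA_RATE = 0.045    # $/GB of data processed

def nat_monthly_cost(gateways: int, gb_processed: float, hours: float = 730) -> float:
    """Baseline existence cost plus data-processing cost for one month."""
    return gateways * hours * HOURLY_RATE + gb_processed * DATA_RATE

# One idle NAT Gateway: ~$32.85/month just to exist
print(round(nat_monthly_cost(1, 0), 2))    # 32.85

# Per-AZ NAT Gateways in three AZs plus 500 GB of traffic
print(round(nat_monthly_cost(3, 500), 2))  # 121.05
```

This is why per-AZ NAT Gateways for high availability roughly triple the baseline cost, and why dev environments often settle for a single gateway or a NAT instance.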

Thinking Exercise

Before coding, trace these scenarios on paper:

Scenario 1: EC2 in private subnet wants to call api.github.com

Draw the packet flow:

1. EC2 (10.0.10.50) creates packet: src=10.0.10.50, dst=140.82.121.4 (github)
2. Route table lookup: 140.82.121.4 matches 0.0.0.0/0 → NAT Gateway
3. NAT Gateway receives packet, performs SNAT:
   - New src IP = NAT Gateway's Elastic IP (54.123.45.67)
   - Stores mapping in translation table
4. Route table in public subnet: 0.0.0.0/0 → Internet Gateway
5. Packet goes to Internet Gateway → Internet → GitHub
6. GitHub responds to 54.123.45.67
7. NAT Gateway receives response, looks up translation table
8. Translates dst back to 10.0.10.50, sends to private subnet
9. EC2 receives response

Question: What happens if NAT Gateway is deleted mid-connection?
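
The SNAT steps in this trace can be sketched as a toy translation table (illustrative only — a real NAT Gateway tracks protocol and ports per connection):

```python
class ToyNat:
    """Minimal SNAT sketch: rewrite the source, remember the mapping."""

    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.table = {}       # public_port -> (private_ip, private_port)
        self.next_port = 1024

    def outbound(self, src_ip, src_port, dst_ip):
        # Pick a fresh public-side port and record the mapping
        public_port = self.next_port
        self.next_port += 1
        self.table[public_port] = (src_ip, src_port)
        return (self.public_ip, public_port, dst_ip)  # translated packet

    def inbound(self, dst_port):
        # Only traffic matching an existing mapping gets translated back --
        # this is why NAT permits outbound-initiated flows only
        return self.table.get(dst_port)

nat = ToyNat("54.123.45.67")
print(nat.outbound("10.0.10.50", 44321, "140.82.121.4"))
# ('54.123.45.67', 1024, '140.82.121.4')
print(nat.inbound(1024))   # ('10.0.10.50', 44321)
print(nat.inbound(9999))   # None - unsolicited packet is dropped
```

This also suggests the answer to the question above: the translation table lives in the NAT Gateway, so deleting the gateway destroys the table and every in-flight connection breaks.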

Scenario 2: Someone on the internet tries to reach 10.0.10.50 directly

Draw what happens:

1. Attacker sends packet: src=attacker, dst=10.0.10.50
2. Packet arrives at... where exactly?
3. Can 10.0.10.50 even be routed on the public internet?

Answer: The packet never arrives. Private IPs (10.x.x.x) are not routable
on the public internet. Routers drop them. This is why private subnets
are "private" - they're literally unreachable from outside.

Scenario 3: EC2 in public subnet vs EC2 in private subnet

What’s actually different?

Public subnet EC2 (10.0.1.50):
- Has route 0.0.0.0/0 → Internet Gateway
- Can be assigned a public IP (Elastic IP or auto-assign)
- Traffic uses its own public IP for outbound
- Can receive inbound traffic from internet (if SG allows)

Private subnet EC2 (10.0.10.50):
- Has route 0.0.0.0/0 → NAT Gateway
- No usable public IP (with no IGW route, even an attached EIP wouldn't work)
- Traffic uses NAT Gateway's IP for outbound
- Cannot receive unsolicited inbound traffic from the internet
  (the NAT Gateway has no matching translation entry)

The ONLY difference is the route table. The subnet itself has no
"public" or "private" property - it's ALL about routing.

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What makes a subnet public vs private in AWS?”
  2. “Can an EC2 instance in a private subnet access the internet? How?”
  3. “What’s the difference between an Internet Gateway and a NAT Gateway?”
  4. “You have a VPC with CIDR 10.0.0.0/16. Can you peer it with another VPC that has 10.0.0.0/24? Why or why not?”
  5. “Your private instances can’t reach the internet. How do you troubleshoot?”
  6. “Why do you need subnets in multiple Availability Zones?”
  7. “What happens to your application if one AZ goes down and you only have a NAT Gateway in that AZ?”
  8. “How would you reduce NAT Gateway costs for a development environment?”
  9. “What’s the maximum CIDR block size for a VPC?”
  10. “How many IP addresses are usable in a /24 subnet in AWS?”

Hints in Layers

Hint 1: Start with the VPC and Subnets

Your first Terraform file should just create the VPC and subnets:

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "production-vpc"
  }
}

# Public subnets - one per AZ
resource "aws_subnet" "public" {
  for_each = {
    "a" = { cidr = "10.0.1.0/24", az = "us-east-1a" }
    "b" = { cidr = "10.0.2.0/24", az = "us-east-1b" }
  }

  vpc_id                  = aws_vpc.main.id
  cidr_block              = each.value.cidr
  availability_zone       = each.value.az
  map_public_ip_on_launch = true  # This is what makes instances get public IPs

  tags = {
    Name = "public-${each.key}"
    Tier = "public"
  }
}

Run terraform apply and verify in the console that your VPC and subnets exist.

Hint 2: Add the Internet Gateway and Public Route Table

Without this, even “public” subnets can’t reach the internet:

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "main-igw"
  }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = {
    Name = "public-rt"
  }
}

# Associate public subnets with public route table
resource "aws_route_table_association" "public" {
  for_each       = aws_subnet.public
  subnet_id      = each.value.id
  route_table_id = aws_route_table.public.id
}

Hint 3: Add NAT Gateway for Private Subnets

NAT Gateway needs an Elastic IP and must be in a PUBLIC subnet:

resource "aws_eip" "nat" {
  domain = "vpc"

  tags = {
    Name = "nat-eip"
  }
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public["a"].id  # Must be in PUBLIC subnet!

  tags = {
    Name = "main-nat"
  }

  depends_on = [aws_internet_gateway.main]
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }

  tags = {
    Name = "private-rt"
  }
}

# Don't forget: associate each private subnet with this route table using
# aws_route_table_association, mirroring the public association in Hint 2.
# A missing association is the most common reason NAT "doesn't work".

Hint 4: Test Your Setup

After deploying, verify everything works:

# Launch a test instance in private subnet
aws ec2 run-instances \
  --image-id ami-0c55b159cbfafe1f0 \
  --instance-type t3.micro \
  --subnet-id <your-private-subnet-id> \
  --key-name <your-key> \
  --no-associate-public-ip-address

# Use Session Manager (no bastion needed) or SSH via bastion
# Then test outbound connectivity:
curl -s ifconfig.me  # Should show NAT Gateway's EIP

# Try to ping the instance from the internet - it should fail
# (because there's no inbound route)

Books That Will Help

Topic Book Chapter
IP addressing & CIDR “Computer Networks, Fifth Edition” by Tanenbaum Ch. 5.6: IP Addresses
Routing fundamentals “Computer Networks, Fifth Edition” by Tanenbaum Ch. 5.2: Routing Algorithms
NAT mechanics “Computer Networks, Fifth Edition” by Tanenbaum Ch. 5.6.4: Network Address Translation
AWS VPC deep dive “AWS Certified Solutions Architect Study Guide” VPC Chapter
High availability design “AWS Well-Architected Framework” Reliability Pillar
Terraform basics “Terraform: Up & Running” by Yevgeniy Brikman Ch. 2-4
Infrastructure as Code “Infrastructure as Code” by Kief Morris Ch. 1-5

Common Pitfalls & Debugging

Problem 1: “Terraform says CIDR blocks overlap, but they look different to me”

  • Why: 10.0.0.0/16 and 10.0.1.0/24 DO overlap. The /16 includes all addresses from 10.0.0.0 to 10.0.255.255, which encompasses 10.0.1.0/24.
  • Fix: Use completely different ranges for different VPCs (e.g., 10.0.0.0/16 for VPC-A, 10.1.0.0/16 for VPC-B)
  • Quick test: Use an online CIDR calculator to visualize the ranges

Problem 2: “My NAT Gateway costs are $100/month!”

  • Why: NAT Gateway charges $0.045/hour ($33/month) plus $0.045 per GB transferred. If you leave it running 24/7 during development, costs add up fast.
  • Fix: Delete NAT Gateway when not actively using it (terraform destroy after each session), or use a NAT Instance (t3.micro) for dev environments
  • Quick test: Check AWS Cost Explorer → Filter by service: “NAT Gateway”

Problem 3: “I can’t SSH to my EC2 in the public subnet”

  • Why: One of these is wrong: (1) No public IP assigned, (2) Security Group blocks port 22, (3) Route table doesn’t point to IGW, (4) Your local IP changed
  • Fix: Verify checklist:
    # 1. Check instance has public IP
    aws ec2 describe-instances --instance-ids i-xxx --query 'Reservations[0].Instances[0].PublicIpAddress'
    
    # 2. Check Security Group allows SSH from your IP
    aws ec2 describe-security-groups --group-ids sg-xxx
    
    # 3. Check route table has IGW route
    aws ec2 describe-route-tables --route-table-ids rtb-xxx
    
    # 4. Check your current public IP
    curl ifconfig.me
    
  • Quick test: Try telnet <public-ip> 22 – if it connects, SSH port is open

Problem 4: “Terraform apply keeps failing with ‘DependencyViolation’”

  • Why: You’re trying to delete resources in the wrong order. AWS won’t delete a route table that’s associated with subnets, or an IGW attached to a VPC.
  • Fix: Explicitly define dependencies in Terraform using depends_on, or use terraform destroy which handles order automatically
  • Quick test: Check Terraform state with terraform state list to see what exists

Problem 5: “Private subnet instances can’t reach the internet even with NAT Gateway”

  • Why: Most common causes: (1) Route table for private subnet doesn’t point to NAT Gateway, (2) NAT Gateway is in the wrong subnet (must be in public), (3) Security Group blocks outbound traffic
  • Fix: Trace the packet path:
    # 1. Verify private route table points to NAT Gateway
    aws ec2 describe-route-tables --filters "Name=association.subnet-id,Values=subnet-private"
    # Should show: 0.0.0.0/0 -> nat-xxx
    
    # 2. Verify NAT Gateway is in public subnet and has Elastic IP
    aws ec2 describe-nat-gateways --nat-gateway-ids nat-xxx
    
    # 3. SSH to instance and test
    ssh ec2-user@10.0.10.50  # via bastion
    curl -I https://google.com --max-time 5
    
  • Quick test: From private instance, ping 8.8.8.8 – if this works but DNS fails, it’s a resolver issue, not routing

Problem 6: “I deployed to the wrong AWS region”

  • Why: Terraform defaults to whatever region is in your AWS CLI config or provider block. Resources created in us-east-1 won’t show up in us-west-2.
  • Fix:
    provider "aws" {
      region = "us-east-1"  # Explicitly set region
    }
    

    Or verify with: aws configure get region

  • Quick test: aws ec2 describe-vpcs --region us-east-1 (specify region)

Problem 7: “My VPC has no available IP addresses”

  • Why: AWS reserves 5 IPs in every subnet (.0, .1, .2, .3, .255). A /28 subnet (16 IPs) only gives you 11 usable IPs. If you launch 12 instances, the 12th will fail.
  • Fix: Use larger subnets for instance-heavy tiers. A /24 gives 251 usable IPs, a /23 gives 507.
  • Quick test: Calculate usable IPs: 2^(32 − prefix length) − 5
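
That quick-test formula, runnable (the "minus 5" is AWS-specific — plain IP networking reserves only 2):

```python
def usable_ips(prefix_len: int) -> int:
    """Usable addresses in an AWS subnet: total minus the 5 AWS reserves."""
    return 2 ** (32 - prefix_len) - 5

print(usable_ips(28))  # 11  -> a /28 can't even hold a dozen instances
print(usable_ips(24))  # 251
print(usable_ips(23))  # 507
```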

Problem 8: “Terraform state is out of sync with reality”

  • Why: Someone made manual changes in the AWS Console, or you deleted resources outside Terraform.
  • Fix:
    # See the drift
    terraform plan
    
    # Import manually created resources
    terraform import aws_vpc.main vpc-xxx
    
    # Or refresh state (on recent Terraform, prefer: terraform apply -refresh-only)
    terraform refresh
    
  • Quick test: Always use terraform plan before apply to see what will change

Project 2: Security Group Traffic Flow Debugger

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Bash with AWS CLI
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Security / Networking
  • Software or Tool: AWS Security Groups, boto3
  • Main Book: “AWS Security” by Dylan Shields

What you’ll build: A CLI tool that analyzes Security Groups and tells you whether traffic can flow between two resources, tracing the path through all relevant security controls.

Why it teaches AWS networking: Security Groups are the most misunderstood AWS feature. People add rules without understanding stateful behavior, or create circular dependencies they can’t debug. This tool forces you to understand exactly how SG evaluation works.

Core challenges you’ll face:

  • Understanding stateful filtering (return traffic is auto-allowed) → maps to connection tracking
  • Tracing SG references (SG-A allows SG-B, SG-B allows SG-C) → maps to graph traversal
  • Rule evaluation (all rules are checked; any matching allow permits traffic — there are no deny rules) → maps to rule processing
  • ENI-level association (one resource, multiple SGs) → maps to AWS networking model
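
The graph-traversal challenge can be sketched as a breadth-first search over hypothetical SG reference edges. One caveat: SG references chain for auditing purposes, not connectivity — sg-alb reaching sg-web and sg-web reaching sg-app does not let sg-alb reach sg-app directly; each hop is a separately allowed connection.

```python
from collections import deque

def reference_chain(target_sg, edges):
    """BFS over 'allows inbound from' edges: every SG that appears somewhere
    in the chain of references behind target_sg."""
    # edges maps each SG to the set of SGs it allows inbound from
    seen, queue = set(), deque([target_sg])
    while queue:
        sg = queue.popleft()
        for src in edges.get(sg, set()):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return seen

# Hypothetical topology: sg-app allows sg-web; sg-web allows sg-alb
edges = {"sg-app": {"sg-web"}, "sg-web": {"sg-alb"}}
print(reference_chain("sg-app", edges))  # {'sg-web', 'sg-alb'}
```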

Key Concepts:

Difficulty: Intermediate
Time estimate: 1 week
Prerequisites: Python, boto3, understanding of TCP/IP

Real world outcome:

$ ./sg-debug can-connect --from i-abc123 --to i-def456 --port 443

Analyzing connectivity: i-abc123 → i-def456:443

SOURCE INSTANCE: i-abc123
  ENI: eni-111
  Private IP: 10.0.1.50
  Security Groups: [sg-web-tier]

TARGET INSTANCE: i-def456
  ENI: eni-222
  Private IP: 10.0.2.100
  Security Groups: [sg-app-tier]

OUTBOUND CHECK (sg-web-tier):
  ✓ Rule found: "Allow all outbound" (0.0.0.0/0, all ports)

INBOUND CHECK (sg-app-tier):
  ✗ No rule allows TCP:443 from 10.0.1.50 or sg-web-tier

RESULT: CONNECTION BLOCKED ❌

RECOMMENDATION:
  Add inbound rule to sg-app-tier:
    Protocol: TCP
    Port: 443
    Source: sg-web-tier (recommended) or 10.0.1.50/32

$ ./sg-debug can-connect --from i-abc123 --to i-def456 --port 443 --after-fix
RESULT: CONNECTION ALLOWED ✓
  Return traffic: Auto-allowed (Security Groups are stateful)

Implementation Hints:

# Pseudo-code structure
def can_connect(source_instance, target_instance, port, protocol="tcp"):
    # Get ENI and SG info for both instances
    source_enis = get_enis(source_instance)
    target_enis = get_enis(target_instance)

    source_sgs = get_security_groups(source_enis)
    target_sgs = get_security_groups(target_enis)

    # Check outbound from source
    outbound_allowed = check_outbound_rules(
        source_sgs,
        target_ip=get_private_ip(target_enis[0]),
        target_sg_ids=[sg.id for sg in target_sgs],
        port=port,
        protocol=protocol
    )

    if not outbound_allowed:
        return False, "Outbound blocked by source Security Group"

    # Check inbound to target
    inbound_allowed = check_inbound_rules(
        target_sgs,
        source_ip=get_private_ip(source_enis[0]),
        source_sg_ids=[sg.id for sg in source_sgs],
        port=port,
        protocol=protocol
    )

    if not inbound_allowed:
        return False, "Inbound blocked by target Security Group"

    # SGs are stateful - return traffic auto-allowed
    return True, "Connection allowed (return traffic auto-allowed)"

def check_inbound_rules(sgs, source_ip, source_sg_ids, port, protocol):
    for sg in sgs:
        for rule in sg.ip_permissions:
            # Check if rule matches protocol
            if rule.ip_protocol != protocol and rule.ip_protocol != "-1":
                continue
            # Check if port is in range
            if not port_in_range(port, rule.from_port, rule.to_port):
                continue
            # Check if source matches
            if matches_source(rule, source_ip, source_sg_ids):
                return True
    return False
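
The two helpers the pseudo-code leaves undefined might look like this — a sketch, noting that real boto3 responses represent rules as dicts, and an all-traffic rule (protocol "-1") omits the FromPort/ToPort fields entirely:

```python
import ipaddress

def port_in_range(port, from_port, to_port):
    # An "all traffic" rule carries no port range at all
    if from_port is None or to_port is None:
        return True
    return from_port <= port <= to_port

def matches_source(rule, source_ip, source_sg_ids):
    ip = ipaddress.ip_address(source_ip)
    # CIDR-based sources: does the source IP fall inside any allowed range?
    for ip_range in rule.get("IpRanges", []):
        if ip in ipaddress.ip_network(ip_range["CidrIp"]):
            return True
    # SG-reference sources: is any of the source's SGs named in the rule?
    for pair in rule.get("UserIdGroupPairs", []):
        if pair["GroupId"] in source_sg_ids:
            return True
    return False

rule = {"IpRanges": [{"CidrIp": "10.0.0.0/8"}], "UserIdGroupPairs": []}
print(port_in_range(22, 22, 22))                      # True
print(matches_source(rule, "10.0.1.50", []))          # True
print(matches_source(rule, "192.168.1.1", ["sg-x"]))  # False
```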

Learning milestones:

  1. Tool correctly identifies blocked connections → You understand SG rule evaluation
  2. Tool explains WHY traffic is blocked → You can debug SG issues
  3. Understands SG references (sg-xxx as source) → You understand SG chaining
  4. Correctly handles stateful behavior → You understand return traffic

Real World Outcome

When you complete this project, you’ll have a CLI tool that saves hours of debugging time by instantly showing whether traffic can flow between any two AWS resources. Here’s exactly what the tool will do:

# Basic usage - check if web server can talk to app server
$ ./sg-debug can-connect --from i-web123 --to i-app456 --port 8080

╔══════════════════════════════════════════════════════════════════════════════╗
║                    SECURITY GROUP CONNECTIVITY ANALYSIS                       ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  CONNECTION: i-web123 (10.0.1.50) → i-app456 (10.0.10.100):8080/TCP         ║
║                                                                              ║
║  ┌────────────────────────────────────────────────────────────────────────┐ ║
║  │ SOURCE INSTANCE: i-web123                                              │ ║
║  │   Name: web-server-1                                                   │ ║
║  │   ENI: eni-0abc111111111111                                            │ ║
║  │   Private IP: 10.0.1.50                                                │ ║
║  │   Subnet: subnet-pub-a (10.0.1.0/24) - PUBLIC                         │ ║
║  │   Security Groups:                                                     │ ║
║  │     • sg-0web111111111111 (web-tier-sg)                               │ ║
║  └────────────────────────────────────────────────────────────────────────┘ ║
║                                      │                                       ║
║                                      ▼                                       ║
║  ┌────────────────────────────────────────────────────────────────────────┐ ║
║  │ OUTBOUND CHECK (sg-web-tier-sg)                                        │ ║
║  │                                                                        │ ║
║  │   Checking rules for: TCP:8080 to 10.0.10.100                         │ ║
║  │                                                                        │ ║
║  │   Rule 1: Type=All, Protocol=All, Port=All, Dest=0.0.0.0/0           │ ║
║  │           ✓ MATCH - Destination 10.0.10.100 in 0.0.0.0/0             │ ║
║  │                                                                        │ ║
║  │   RESULT: ✓ OUTBOUND ALLOWED                                          │ ║
║  └────────────────────────────────────────────────────────────────────────┘ ║
║                                      │                                       ║
║                                      ▼                                       ║
║  ┌────────────────────────────────────────────────────────────────────────┐ ║
║  │ TARGET INSTANCE: i-app456                                              │ ║
║  │   Name: app-server-1                                                   │ ║
║  │   ENI: eni-0def222222222222                                            │ ║
║  │   Private IP: 10.0.10.100                                              │ ║
║  │   Subnet: subnet-prv-a (10.0.10.0/24) - PRIVATE                       │ ║
║  │   Security Groups:                                                     │ ║
║  │     • sg-0app222222222222 (app-tier-sg)                               │ ║
║  └────────────────────────────────────────────────────────────────────────┘ ║
║                                      │                                       ║
║                                      ▼                                       ║
║  ┌────────────────────────────────────────────────────────────────────────┐ ║
║  │ INBOUND CHECK (sg-app-tier-sg)                                         │ ║
║  │                                                                        │ ║
║  │   Checking rules for: TCP:8080 from 10.0.1.50 or sg-web-tier-sg       │ ║
║  │                                                                        │ ║
║  │   Rule 1: Type=Custom TCP, Protocol=TCP, Port=443, Source=sg-alb-sg  │ ║
║  │           ✗ NO MATCH - Port 443 ≠ 8080                                │ ║
║  │                                                                        │ ║
║  │   Rule 2: Type=Custom TCP, Protocol=TCP, Port=22, Source=10.0.0.0/8  │ ║
║  │           ✗ NO MATCH - Port 22 ≠ 8080                                 │ ║
║  │                                                                        │ ║
║  │   No more rules to check.                                              │ ║
║  │                                                                        │ ║
║  │   RESULT: ✗ INBOUND BLOCKED                                           │ ║
║  └────────────────────────────────────────────────────────────────────────┘ ║
║                                                                              ║
║  ══════════════════════════════════════════════════════════════════════════ ║
║                                                                              ║
║  FINAL RESULT: CONNECTION BLOCKED ❌                                         ║
║                                                                              ║
║  BLOCKED AT: Inbound rules on sg-app-tier-sg                                ║
║                                                                              ║
║  RECOMMENDATIONS:                                                            ║
║  ────────────────────────────────────────────────────────────────────────── ║
║                                                                              ║
║  Option 1 (Recommended - Security Group Reference):                         ║
║    aws ec2 authorize-security-group-ingress \                               ║
║      --group-id sg-0app222222222222 \                                       ║
║      --protocol tcp \                                                        ║
║      --port 8080 \                                                           ║
║      --source-group sg-0web111111111111                                     ║
║                                                                              ║
║  Option 2 (IP-based - less flexible):                                       ║
║    aws ec2 authorize-security-group-ingress \                               ║
║      --group-id sg-0app222222222222 \                                       ║
║      --protocol tcp \                                                        ║
║      --port 8080 \                                                           ║
║      --cidr 10.0.1.50/32                                                    ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝

# After adding the rule, verify the fix:
$ ./sg-debug can-connect --from i-web123 --to i-app456 --port 8080

FINAL RESULT: CONNECTION ALLOWED ✓

Traffic Path:
  i-web123 (10.0.1.50) [OUTBOUND: sg-web-tier-sg allows all]
    → i-app456 (10.0.10.100:8080) [INBOUND: sg-app-tier-sg allows from sg-web-tier-sg]

Return Traffic: Auto-allowed (Security Groups are stateful)

# Advanced usage - trace all allowed connections for an instance
$ ./sg-debug list-allowed --instance i-app456 --direction inbound

╔══════════════════════════════════════════════════════════════════════════════╗
║              ALLOWED INBOUND CONNECTIONS TO i-app456                         ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  Security Group: sg-app-tier-sg                                              ║
║                                                                              ║
║  ┌────────┬──────────┬───────────────────────────┬──────────────────────┐   ║
║  │ Port   │ Protocol │ Source                    │ Description          │   ║
║  ├────────┼──────────┼───────────────────────────┼──────────────────────┤   ║
║  │ 443    │ TCP      │ sg-alb-sg (ALB)          │ HTTPS from ALB       │   ║
║  │ 8080   │ TCP      │ sg-web-tier-sg           │ API from web tier    │   ║
║  │ 22     │ TCP      │ 10.0.0.0/8               │ SSH from VPC         │   ║
║  │ 3306   │ TCP      │ sg-app-tier-sg (self)    │ DB replication       │   ║
║  └────────┴──────────┴───────────────────────────┴──────────────────────┘   ║
║                                                                              ║
║  POTENTIAL ISSUES DETECTED:                                                  ║
║  ⚠️  SSH (22) open to entire VPC - consider restricting to bastion only     ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝

# Find which instances can reach a specific target
$ ./sg-debug who-can-reach --target i-db789 --port 3306

╔══════════════════════════════════════════════════════════════════════════════╗
║              INSTANCES THAT CAN REACH i-db789:3306                          ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  Database instance i-db789 accepts TCP:3306 from:                           ║
║                                                                              ║
║  Via Security Group sg-app-tier-sg:                                         ║
║    • i-app456 (10.0.10.100) - app-server-1                                  ║
║    • i-app789 (10.0.11.100) - app-server-2                                  ║
║                                                                              ║
║  Via CIDR 10.0.10.0/24:                                                     ║
║    • i-app456 (10.0.10.100) - app-server-1                                  ║
║    • i-cache123 (10.0.10.200) - redis-cache                                 ║
║                                                                              ║
║  TOTAL: 3 unique instances can reach the database                           ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝

The Core Question You’re Answering

“When traffic is blocked between two AWS resources, how do I know WHERE it’s blocked and WHY?”

This is the question every AWS engineer faces daily. Security Groups silently drop packets—there’s no “connection refused” or error message. Understanding exactly how Security Group rules are evaluated, how stateful filtering works, and how SG references resolve is crucial for debugging any connectivity issue.


Concepts You Must Understand First

Stop and research these before coding:

  1. Stateful vs Stateless Firewalls
    • What does “stateful” actually mean in firewall terms?
    • How does a stateful firewall track connections? (hint: connection table/conntrack)
    • Why is return traffic automatically allowed in Security Groups?
    • What’s the TCP three-way handshake and why does it matter for stateful filtering?
    • Book Reference: “Computer Networks, Fifth Edition” by Tanenbaum — Ch. 8.9: Firewalls
  2. Security Group Rule Evaluation
    • Are Security Group rules evaluated in order like NACLs?
    • What happens when you have multiple Security Groups on one ENI?
    • Can you have deny rules in Security Groups?
    • How does “All traffic” rule (-1 protocol) work?
    • Book Reference: AWS Security Groups Documentation
  3. Security Group References (SG-to-SG)
    • What does it mean when a rule has “sg-xxx” as source instead of a CIDR?
    • How does AWS resolve SG references when checking rules?
    • Why is SG referencing more flexible than IP-based rules?
    • What happens if you reference a SG from a different VPC?
    • Book Reference: AWS Security Best Practices Whitepaper
  4. ENI and Security Group Association
    • What is an ENI (Elastic Network Interface)?
    • How many Security Groups can be attached to one ENI?
    • Can different ENIs on the same instance have different Security Groups?
    • How do Lambda, RDS, and ELB use ENIs and Security Groups?
    • Book Reference: “AWS Certified Solutions Architect Study Guide” — Networking Chapter
  5. TCP/UDP and Port Ranges
    • What’s the difference between source port and destination port?
    • Why do you only need to specify destination port in SG rules?
    • What are ephemeral ports and why don’t you need to allow them inbound?
    • What does protocol “-1” mean in AWS?
    • Book Reference: “TCP/IP Illustrated, Volume 1” by Stevens — Ch. 13: TCP Connection Management
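
To make "stateful" concrete before you research further, here is a toy connection tracker (a simplified illustration, not AWS's actual implementation): outbound flows are recorded in a table, and only return traffic matching a tracked flow bypasses the inbound rules.

```python
# Toy model of stateful filtering: allowed outbound connections are
# recorded, and return traffic matching a tracked connection is allowed
# without consulting any inbound rule (what Security Groups do for you).
conn_table = set()  # tracked (src_ip, src_port, dst_ip, dst_port) tuples

def outbound(src_ip, src_port, dst_ip, dst_port):
    """Record an allowed outbound connection."""
    conn_table.add((src_ip, src_port, dst_ip, dst_port))

def inbound_allowed(src_ip, src_port, dst_ip, dst_port, inbound_ports):
    """Return traffic for a tracked connection is auto-allowed;
    anything else must match an explicit inbound rule (a port here)."""
    if (dst_ip, dst_port, src_ip, src_port) in conn_table:  # reversed tuple
        return True
    return dst_port in inbound_ports

outbound("10.0.1.50", 49152, "93.184.216.34", 443)
# Reply from the web server: allowed with NO inbound rule
print(inbound_allowed("93.184.216.34", 443, "10.0.1.50", 49152, set()))
# Unsolicited SSH attempt: needs an explicit inbound rule
print(inbound_allowed("203.0.113.9", 55555, "10.0.1.50", 22, set()))
```

The reversed-tuple lookup is the essence of a connection table: the firewall remembers who initiated, so ephemeral return ports never need inbound rules.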

Questions to Guide Your Design

Before implementing, think through these:

  1. Data Model
    • What information do you need to fetch from AWS to analyze connectivity?
    • How will you represent a Security Group rule in your code?
    • How will you handle multiple ENIs per instance?
    • How will you resolve SG references to actual IP addresses?
  2. Algorithm Design
    • Should you check outbound rules first or inbound rules?
    • How do you determine if a rule “matches” a connection attempt?
    • When multiple rules could match, which one wins?
    • How do you explain WHY a connection is blocked?
  3. Edge Cases
    • What if the source instance has multiple SGs and one allows while another doesn’t?
    • What if the rule uses a SG reference that includes the source instance?
    • How do you handle ICMP (ping) which doesn’t have ports?
    • What about connections to/from AWS services (RDS, ElastiCache)?
  4. User Experience
    • How should you display the results—text, JSON, visual diagram?
    • Should you recommend fixes for blocked connections?
    • How do you make the output actionable for someone who’s debugging?
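
One possible answer to the data-model question, as a sketch. The field names below are illustrative; the actual boto3 response uses IpProtocol, FromPort/ToPort, IpRanges, and UserIdGroupPairs.

```python
from dataclasses import dataclass, field

# One possible in-memory shape for a Security Group rule.
@dataclass
class SgRule:
    protocol: str                  # 'tcp', 'udp', 'icmp', or '-1' for all
    from_port: int = 0
    to_port: int = 65535
    cidrs: list = field(default_factory=list)           # e.g. ['10.0.0.0/8']
    source_sg_ids: list = field(default_factory=list)   # SG references
    description: str = ''

    def covers_port(self, port: int) -> bool:
        """Protocol '-1' matches everything; otherwise check the range."""
        return self.protocol == '-1' or self.from_port <= port <= self.to_port

rule = SgRule(protocol='tcp', from_port=8080, to_port=8080,
              source_sg_ids=['sg-0web111111111111'])
print(rule.covers_port(8080))  # True
print(rule.covers_port(443))   # False
```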

Thinking Exercise

Before coding, trace these scenarios on paper:

Scenario 1: Web server trying to reach database

Setup:
- web-server (i-web) has sg-web attached
- db-server (i-db) has sg-db attached
- sg-web outbound: Allow all (0.0.0.0/0)
- sg-db inbound: Allow TCP 3306 from sg-app (NOT sg-web!)

Connection attempt: i-web → i-db:3306

Step 1: Check sg-web outbound rules
  - Rule "Allow all" matches destination 10.0.20.50:3306
  - Outbound: ALLOWED ✓

Step 2: Check sg-db inbound rules
  - Rule "Allow 3306 from sg-app"
  - Is i-web in sg-app? NO!
  - No other rules match
  - Inbound: BLOCKED ✗

Result: CONNECTION BLOCKED
Reason: sg-db only allows 3306 from sg-app, not sg-web
Fix: Either add i-web to sg-app, or add new rule allowing sg-web

Scenario 2: Understanding SG references

Setup:
- Instance A has sg-A attached (IP: 10.0.1.10)
- Instance B has sg-B attached (IP: 10.0.1.20)
- Instance C has sg-A AND sg-B attached (IP: 10.0.1.30)
- sg-target inbound: Allow TCP 443 from sg-A

Question: Which instances can reach sg-target on port 443?

Answer:
- Instance A: YES (has sg-A)
- Instance B: NO (only has sg-B)
- Instance C: YES (has sg-A, even though it also has sg-B)

The SG reference sg-A means "any ENI that has sg-A attached"
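
The answer above reduces to a set-membership check; a toy model:

```python
# A SG reference matches any ENI that has that SG attached, regardless of
# what other SGs are also attached. Instance data below is from Scenario 2.
instances = {
    'A': {'sgs': {'sg-A'}, 'ip': '10.0.1.10'},
    'B': {'sgs': {'sg-B'}, 'ip': '10.0.1.20'},
    'C': {'sgs': {'sg-A', 'sg-B'}, 'ip': '10.0.1.30'},
}

def matches_sg_reference(instance, referenced_sg):
    return referenced_sg in instance['sgs']

reachable = [name for name, inst in instances.items()
             if matches_sg_reference(inst, 'sg-A')]
print(reachable)  # ['A', 'C']
```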

Scenario 3: Multiple Security Groups on one ENI

Setup:
- Instance has sg-1 and sg-2 attached
- sg-1 outbound: Allow TCP 443 to 0.0.0.0/0
- sg-1 outbound: (no rule for port 8080)
- sg-2 outbound: Allow TCP 8080 to 10.0.0.0/8

Connection attempt: Instance → external-api:443
- sg-1 allows it: ALLOWED ✓

Connection attempt: Instance → internal-service:8080
- sg-1 doesn't have a rule... but wait!
- sg-2 allows it: ALLOWED ✓

Key insight: Security Groups are ADDITIVE
If ANY attached SG allows the traffic, it's allowed
There's no "most restrictive wins" - it's "most permissive wins"
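
The additive rule can be sketched in a few lines, with each SG modeled as a set of allowed (protocol, port) pairs (real rules also carry sources and port ranges):

```python
# Additive evaluation: traffic is allowed if ANY attached SG has a
# matching allow rule. There is no deny, so order never matters.
def traffic_allowed(attached_sgs, protocol, port):
    return any((protocol, port) in sg for sg in attached_sgs)

sg_1 = {('tcp', 443)}    # allows HTTPS only
sg_2 = {('tcp', 8080)}   # allows the internal API port

print(traffic_allowed([sg_1, sg_2], 'tcp', 443))   # True  (sg-1 matches)
print(traffic_allowed([sg_1, sg_2], 'tcp', 8080))  # True  (sg-2 matches)
print(traffic_allowed([sg_1, sg_2], 'tcp', 22))    # False (neither matches)
```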

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What’s the difference between Security Groups and NACLs?”
  2. “Security Groups are stateful—what does that mean exactly?”
  3. “Can you block specific traffic with a Security Group? How?”
  4. “What happens when you attach multiple Security Groups to an instance?”
  5. “What’s the advantage of using Security Group references vs CIDR blocks?”
  6. “You have a connection timeout between two instances. How do you debug it?”
  7. “Can Security Groups span VPCs?”
  8. “What’s the maximum number of rules you can have in a Security Group?”
  9. “How do Security Groups work with Lambda functions?”
  10. “You allowed inbound traffic but the connection still fails. What could be wrong?”

Hints in Layers

Hint 1: Start with boto3 to fetch Security Group data

import boto3

def get_instance_security_groups(instance_id: str) -> list:
    """Get all Security Groups attached to an instance's ENIs."""
    ec2 = boto3.client('ec2')

    # Get instance details
    response = ec2.describe_instances(InstanceIds=[instance_id])
    instance = response['Reservations'][0]['Instances'][0]

    # Collect SG IDs from all network interfaces
    sg_ids = set()
    for eni in instance.get('NetworkInterfaces', []):
        for group in eni.get('Groups', []):
            sg_ids.add(group['GroupId'])

    # Get full SG details
    sg_response = ec2.describe_security_groups(GroupIds=list(sg_ids))
    return sg_response['SecurityGroups']

Hint 2: Model the rule checking logic

import ipaddress

def ip_in_cidr(ip: str, cidr: str) -> bool:
    """Return True if the IP address falls inside the CIDR block."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr, strict=False)

def check_rule_matches(rule: dict, port: int, protocol: str,
                       source_ip: str, source_sg_ids: list) -> bool:
    """Check if a single inbound rule allows the connection."""

    # Check protocol (-1 means all protocols)
    rule_protocol = rule.get('IpProtocol', '-1')
    if rule_protocol != '-1' and rule_protocol != protocol:
        return False

    # Check port range (for TCP/UDP)
    if protocol in ['tcp', 'udp', '6', '17']:
        from_port = rule.get('FromPort', 0)
        to_port = rule.get('ToPort', 65535)
        if not (from_port <= port <= to_port):
            return False

    # Check source - could be CIDR or SG reference
    # Check IP ranges
    for ip_range in rule.get('IpRanges', []):
        cidr = ip_range.get('CidrIp')
        if ip_in_cidr(source_ip, cidr):
            return True

    # Check SG references
    for sg_ref in rule.get('UserIdGroupPairs', []):
        if sg_ref.get('GroupId') in source_sg_ids:
            return True

    return False

Hint 3: Implement the full connectivity check

def can_connect(source_instance: str, target_instance: str,
                port: int, protocol: str = 'tcp') -> dict:
    """
    Check if source can connect to target on specified port.
    Returns detailed analysis of the path.
    """
    result = {
        'allowed': False,
        'source': get_instance_info(source_instance),
        'target': get_instance_info(target_instance),
        'outbound_check': None,
        'inbound_check': None,
        'blocked_at': None,
        'recommendation': None
    }

    # Get SGs for both instances
    source_sgs = get_instance_security_groups(source_instance)
    target_sgs = get_instance_security_groups(target_instance)

    source_sg_ids = [sg['GroupId'] for sg in source_sgs]
    target_ip = result['target']['private_ip']
    source_ip = result['source']['private_ip']

    # Check outbound from source (any SG allowing = pass)
    outbound_allowed = False
    for sg in source_sgs:
        for rule in sg.get('IpPermissionsEgress', []):
            if check_outbound_rule_matches(rule, port, protocol, target_ip):
                outbound_allowed = True
                result['outbound_check'] = {
                    'allowed': True,
                    'matched_sg': sg['GroupId'],
                    'matched_rule': rule
                }
                break
        if outbound_allowed:
            break

    if not outbound_allowed:
        result['blocked_at'] = 'outbound'
        result['recommendation'] = generate_outbound_fix(source_sgs[0], port, protocol)
        return result

    # Check inbound to target
    inbound_allowed = False
    for sg in target_sgs:
        for rule in sg.get('IpPermissions', []):
            if check_rule_matches(rule, port, protocol, source_ip, source_sg_ids):
                inbound_allowed = True
                result['inbound_check'] = {
                    'allowed': True,
                    'matched_sg': sg['GroupId'],
                    'matched_rule': rule
                }
                break
        if inbound_allowed:
            break

    if not inbound_allowed:
        result['blocked_at'] = 'inbound'
        result['recommendation'] = generate_inbound_fix(
            target_sgs[0], port, protocol, source_sg_ids[0]
        )
        return result

    result['allowed'] = True
    return result

Hint 4: Generate actionable fix recommendations

def generate_inbound_fix(target_sg: dict, port: int,
                         protocol: str, source_sg_id: str) -> str:
    """Generate AWS CLI command to fix blocked inbound traffic."""
    return f"""
To allow this connection, run:

aws ec2 authorize-security-group-ingress \\
  --group-id {target_sg['GroupId']} \\
  --protocol {protocol} \\
  --port {port} \\
  --source-group {source_sg_id}

Or in Terraform:

resource "aws_security_group_rule" "allow_from_source" {{
  type                     = "ingress"
  from_port                = {port}
  to_port                  = {port}
  protocol                 = "{protocol}"
  source_security_group_id = "{source_sg_id}"
  security_group_id        = "{target_sg['GroupId']}"
}}
"""

Books That Will Help

  • Firewall concepts (stateful/stateless): "Computer Networks, Fifth Edition" by Tanenbaum, Ch. 8.9: Firewalls
  • TCP connection states: "TCP/IP Illustrated, Volume 1" by Stevens, Ch. 13: TCP Connection Management
  • AWS Security Groups deep dive: "AWS Certified Security Specialty Study Guide", Security Groups Chapter
  • Network security monitoring: "The Practice of Network Security Monitoring" by Bejtlich, Ch. 2-4
  • Python AWS SDK (boto3): "Python for DevOps" by Gift, Behrman, AWS Chapter
  • CLI tool design: "The Linux Command Line" by Shotts, Ch. 25-27: Shell Scripting

Common Pitfalls & Debugging

Problem 1: “My tool says traffic is allowed, but connection still times out”

  • Why: Security Groups are only ONE layer. Traffic might be blocked by: (1) Network ACL (stateless), (2) OS-level firewall (iptables, Windows Firewall), (3) Route table missing route, (4) Application not listening on the port
  • Fix: Use a layered debugging approach:
    # 1. Check Security Group (your tool)
    ./sg-debug can-connect --from i-xxx --to i-yyy --port 443
    
    # 2. Check NACL
    aws ec2 describe-network-acls --filters "Name=vpc-id,Values=vpc-xxx"
    
    # 3. Check route table
    aws ec2 describe-route-tables --route-table-ids rtb-xxx
    
    # 4. Test from source instance
    ssh ec2-user@source-instance
    telnet 10.0.2.50 443
    # If "Connection refused" = SG/NACL OK, app not listening
    # If timeout = network blocked
    
  • Quick test: Use VPC Reachability Analyzer as ground truth comparison

Problem 2: “Tool doesn’t handle Security Group references (sg-xxx allowing sg-yyy)”

  • Why: Security Groups can reference other SGs instead of IP ranges. This creates a graph that needs recursive traversal.
  • Fix: Implement recursive SG resolution:
    def resolve_sg_references(sg_id, visited=None):
        """Collect every SG reachable through rule references."""
        if visited is None:
            visited = set()  # avoid the mutable-default-argument pitfall
        if sg_id in visited:
            return visited  # already expanded; prevents infinite loops
        visited.add(sg_id)
    
        rules = get_sg_rules(sg_id)
        for rule in rules:
            if rule.references_sg:
                resolve_sg_references(rule.referenced_sg, visited)
        return visited
    
  • Quick test: Create SG-A allowing SG-B, SG-B allowing SG-C, verify your tool traces the full chain

Problem 3: “Tool shows ‘ALLOW’ but connection to RDS fails”

  • Why: The RDS endpoint is a DNS name, not an instance ID, so your tool may be analyzing the wrong ENI. The port also depends on the engine (3306 for MySQL, 5432 for PostgreSQL).
  • Fix: Find the VPC hosting the RDS instance, then locate its ENI:
    # Get the VPC of the RDS instance
    aws rds describe-db-instances --db-instance-identifier mydb \
      --query 'DBInstances[0].DBSubnetGroup.VpcId'
    
    # Then find the ENI in that VPC
    aws ec2 describe-network-interfaces \
      --filters "Name=description,Values=*RDSNetworkInterface*"
    
  • Quick test: Test with telnet rds-endpoint.region.rds.amazonaws.com 3306

Problem 4: “Boto3 API calls are super slow”

  • Why: You’re calling describe_security_groups for every single rule evaluation. API rate limiting kicks in.
  • Fix: Cache Security Group data in memory:
    sg_cache = {}
    
    def get_security_group(sg_id):
        if sg_id not in sg_cache:
            sg_cache[sg_id] = ec2.describe_security_groups(GroupIds=[sg_id])
        return sg_cache[sg_id]
    
  • Quick test: Add timing logs, should see <100ms per query after caching
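
The same caching pattern can use the stdlib functools.lru_cache. Here fetch_sg_from_api is a hypothetical stand-in for the real describe_security_groups call, with a counter to show the cache working:

```python
from functools import lru_cache

api_calls = {'count': 0}

def fetch_sg_from_api(sg_id):
    """Stand-in for ec2.describe_security_groups(GroupIds=[sg_id])."""
    api_calls['count'] += 1
    return {'GroupId': sg_id, 'IpPermissions': []}

@lru_cache(maxsize=None)
def get_security_group(sg_id: str):
    # Only the first call per sg_id reaches the (stubbed) API.
    return fetch_sg_from_api(sg_id)

for _ in range(100):
    get_security_group('sg-0app222222222222')
print(api_calls['count'])  # 1: repeated lookups hit the cache
```

One caveat of caching the dict directly: callers share the same mutable object, so treat cached responses as read-only.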

Problem 5: “Tool says ‘REJECT’ but I can see the traffic in VPC Flow Logs”

  • Why: VPC Flow Logs record an entry for every attempted flow at the ENI, including rejected ones, so a record in the logs does not mean the traffic got through; check the action field. A REJECT can come from either a Security Group or a NACL decision, so a tool that evaluates only Security Groups can disagree with the logs.
  • Fix: Your tool should check both SG and NACL:
    sg_result = check_security_groups(src, dst, port)
    nacl_result = check_network_acls(src_subnet, dst_subnet, port)
    
    if sg_result == "ALLOW" and nacl_result == "ALLOW":
        return "ALLOW"
    elif sg_result == "DENY":
        return "DENY (Security Group)"
    elif nacl_result == "DENY":
        return "DENY (Network ACL)"
    
  • Quick test: Create an explicit DENY rule in NACL, verify tool catches it

Problem 6: “How do I handle multiple Security Groups on one ENI?”

  • Why: An ENI can have up to 5 Security Groups attached by default (the quota is adjustable, up to 16). Rules from all attached groups are merged, and the most permissive rule wins.
  • Fix: Evaluate ALL Security Groups, if ANY allows, traffic flows:
    def check_multiple_sgs(eni_id, port):
        sgs = get_eni_security_groups(eni_id)
        for sg in sgs:
            if sg_allows(sg, port):
                return "ALLOW"
        return "DENY"
    
  • Quick test: Attach 2 SGs to an instance, one allowing port 80 and one with no matching rule. Port 80 should still work.

Problem 7: “Tool fails with ‘UnauthorizedOperation’ error”

  • Why: Your IAM role/user doesn’t have permissions to describe Security Groups, ENIs, or instances.
  • Fix: Add required IAM permissions:
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeInstances",
        "ec2:DescribeNetworkAcls"
      ],
      "Resource": "*"
    }
    
  • Quick test: aws ec2 describe-security-groups --dry-run (should not error on permissions)

Problem 8: “Tool doesn’t handle IPv6 rules”

  • Why: Security Groups support both IPv4 (0.0.0.0/0) and IPv6 (::/0) rules. You need to handle both.
  • Fix: Check both IpRanges and Ipv6Ranges in the API response:
    for rule in sg_rules:
        ipv4_ranges = rule.get('IpRanges', [])
        ipv6_ranges = rule.get('Ipv6Ranges', [])
        # Process both
    
  • Quick test: Add an IPv6-only rule, verify tool detects it
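
A sketch of a dual-stack source check. IpRanges/Ipv6Ranges and CidrIp/CidrIpv6 are the real boto3 response keys; the helper itself is illustrative:

```python
import ipaddress

def source_matches(rule: dict, source_ip: str) -> bool:
    """Check source_ip against the rule's v4 or v6 ranges, by address family."""
    addr = ipaddress.ip_address(source_ip)
    if addr.version == 4:
        ranges, key = rule.get('IpRanges', []), 'CidrIp'
    else:
        ranges, key = rule.get('Ipv6Ranges', []), 'CidrIpv6'
    return any(addr in ipaddress.ip_network(r[key]) for r in ranges)

rule = {'IpRanges': [{'CidrIp': '10.0.0.0/8'}],
        'Ipv6Ranges': [{'CidrIpv6': '::/0'}]}
print(source_matches(rule, '10.0.1.50'))    # True
print(source_matches(rule, '192.168.1.1')) # False
print(source_matches(rule, '2001:db8::1')) # True
```

Selecting ranges by address family first also avoids the TypeError that ipaddress raises when comparing a v6 address against a v4 network.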

Project 3: VPC Flow Logs Analyzer

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Networking / Security Analysis
  • Software or Tool: AWS VPC Flow Logs, S3, Athena
  • Main Book: “The Practice of Network Security Monitoring” by Richard Bejtlich

What you’ll build: A system that ingests VPC Flow Logs, parses them, and provides real-time visibility into network traffic patterns, anomaly detection, and security insights.

Why it teaches AWS networking: VPC Flow Logs are how you SEE what’s actually happening on the network. Understanding them teaches you about IP flows, connection states, and how to detect both problems and attacks.

Core challenges you’ll face:

  • Parsing flow log format (version 2+ with custom fields) → maps to log parsing
  • Understanding flow states (ACCEPT, REJECT, NODATA) → maps to connection tracking
  • Correlating ENIs to resources (which instance is eni-xxx?) → maps to AWS metadata
  • Detecting anomalies (port scans, unusual traffic) → maps to security monitoring
  • Handling volume (millions of records/hour) → maps to data engineering

Key Concepts:

Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites: Python, SQL, understanding of network protocols

Real world outcome:

$ ./flow-analyzer dashboard

╔════════════════════════════════════════════════════════════════╗
║           VPC FLOW LOGS DASHBOARD (Last 24 hours)              ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║  TRAFFIC SUMMARY                                               ║
║  ─────────────────────────────────────────────────────────────║
║  Total Flows: 2,456,789                                        ║
║  Accepted: 2,401,234 (97.7%)                                   ║
║  Rejected: 55,555 (2.3%)                                       ║
║  Data Transferred: 1.2 TB                                      ║
║                                                                ║
║  TOP TALKERS (by bytes)                                        ║
║  ─────────────────────────────────────────────────────────────║
║  1. i-abc123 (web-server-1) → 10.0.0.0/8: 245 GB               ║
║  2. i-def456 (db-primary)   → 10.0.1.0/24: 189 GB              ║
║  3. i-ghi789 (app-server)   → 0.0.0.0/0: 156 GB                ║
║                                                                ║
║  ⚠️  SECURITY ALERTS                                           ║
║  ─────────────────────────────────────────────────────────────║
║  🔴 CRITICAL: Port scan detected                               ║
║     Source: 203.0.113.50 (external)                            ║
║     Target: 10.0.1.0/24 (private subnet)                       ║
║     Ports scanned: 22, 23, 80, 443, 3306, 5432                 ║
║     Recommendation: Block 203.0.113.50 in NACL                 ║
║                                                                ║
║  🟡 WARNING: Unusual outbound traffic                          ║
║     Source: i-xyz789 (10.0.2.50)                               ║
║     Destination: 185.143.223.x (known C2 server)               ║
║     Bytes: 2.3 GB over 4 hours                                 ║
║     Recommendation: Isolate instance, investigate              ║
║                                                                ║
║  REJECTED CONNECTIONS (Top Sources)                            ║
║  ─────────────────────────────────────────────────────────────║
║  1. 192.168.1.50:* → 10.0.1.100:22 (SSH blocked) - 12,345      ║
║  2. 10.0.1.50:* → 10.0.2.100:3306 (DB not allowed) - 5,432     ║
║                                                                ║
╚════════════════════════════════════════════════════════════════╝

$ ./flow-analyzer query "rejected traffic to port 22 in last hour"
Found 1,234 rejected flows to port 22:
  - 85% from external IPs (likely SSH brute force attempts)
  - 15% from internal IPs (misconfigured Security Groups)

Top external sources:
  185.143.223.x: 456 attempts (known bad IP - block in NACL)
  192.168.1.x: 234 attempts (RFC1918 - spoofed, drop at edge)

Implementation Hints: Flow log record format (v2):

version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
2 123456789012 eni-abc123 10.0.1.50 10.0.2.100 49152 443 6 25 1234 1639489200 1639489260 ACCEPT OK
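
A minimal parser for the 14 default v2 fields shown above:

```python
# Parse one v2 flow log record into a dict keyed by the documented
# field names, in order.
FLOW_FIELDS = ("version account-id interface-id srcaddr dstaddr srcport "
               "dstport protocol packets bytes start end action log-status").split()

def parse_flow_record(line: str) -> dict:
    record = dict(zip(FLOW_FIELDS, line.split()))
    # Values arrive as strings; convert the numeric fields you filter on.
    for key in ('srcport', 'dstport', 'protocol', 'packets', 'bytes'):
        record[key] = int(record[key])
    return record

rec = parse_flow_record(
    "2 123456789012 eni-abc123 10.0.1.50 10.0.2.100 49152 443 6 25 1234 "
    "1639489200 1639489260 ACCEPT OK")
print(rec['dstport'], rec['action'])  # 443 ACCEPT
```

Note that custom formats (version 3+ fields like vpc-id or flow-direction) change the field list, so read the format line from the log configuration rather than hard-coding it in production.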

Processing pipeline:

# Pseudo-code
def process_flow_logs(s3_bucket, prefix):
    # Read from S3 (or Kinesis for real-time)
    for log_file in list_s3_objects(s3_bucket, prefix):
        records = parse_flow_log_file(log_file)

        for record in records:
            # Enrich with metadata
            record.source_instance = lookup_eni(record.interface_id)
            record.geo = geoip_lookup(record.srcaddr) if is_public(record.srcaddr) else None

            # Store in database (TimescaleDB, ClickHouse, etc.)
            insert_record(record)

            # Real-time anomaly detection
            if is_port_scan(record):
                alert("Port scan detected", record)

            if is_known_bad_ip(record.srcaddr):
                alert("Connection from known malicious IP", record)

def is_port_scan(record, window_seconds=60, port_threshold=10):
    # Check if same source hit many ports in short time
    recent_ports = query("""
        SELECT DISTINCT dstport FROM flows
        WHERE srcaddr = ? AND time > now() - interval ?
    """, record.srcaddr, window_seconds)

    return len(recent_ports) > port_threshold
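
The same detection logic as a self-contained, in-memory sliding window (illustrative; production code would query the flow store as in the pseudo-code above):

```python
from collections import defaultdict

class PortScanDetector:
    """Flag a source that hits more than port_threshold distinct
    destination ports within window_seconds."""
    def __init__(self, window_seconds=60, port_threshold=10):
        self.window = window_seconds
        self.threshold = port_threshold
        self.seen = defaultdict(list)  # srcaddr -> [(timestamp, dstport)]

    def observe(self, srcaddr, dstport, ts):
        self.seen[srcaddr].append((ts, dstport))
        # Drop events that have aged out of the sliding window
        self.seen[srcaddr] = [(t, p) for t, p in self.seen[srcaddr]
                              if ts - t <= self.window]
        ports = {p for _, p in self.seen[srcaddr]}
        return len(ports) > self.threshold  # True = scan suspected

det = PortScanDetector(window_seconds=60, port_threshold=5)
hits = [det.observe("203.0.113.50", port, ts)
        for ts, port in enumerate([22, 23, 80, 443, 3306, 5432])]
print(hits[-1])  # True: sixth distinct port within the window
```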

Learning milestones:

  1. Parse and store flow logs efficiently → You understand the format
  2. Identify traffic patterns → You can analyze network behavior
  3. Detect security anomalies → You understand attack patterns
  4. Correlate with AWS resources → You connect network to infrastructure

Real World Outcome

When you complete this project, you’ll have a powerful network visibility tool that transforms raw VPC Flow Logs into actionable security intelligence. Here’s exactly what your tool will do:

# Start the flow analyzer daemon (processes logs from S3 in real-time)
$ ./flow-analyzer start --s3-bucket vpc-flow-logs-prod --region us-east-1

[2024-12-22 14:00:00] Flow Analyzer started
[2024-12-22 14:00:01] Connected to S3: vpc-flow-logs-prod
[2024-12-22 14:00:01] Loaded 156 ENI → Instance mappings
[2024-12-22 14:00:02] Processing backlog: 2,456 log files
[2024-12-22 14:00:15] Backlog processed: 12,456,789 flow records
[2024-12-22 14:00:15] Real-time processing active...

# View the live dashboard
$ ./flow-analyzer dashboard --refresh 5s

╔══════════════════════════════════════════════════════════════════════════════╗
║                    VPC FLOW LOGS DASHBOARD                                    ║
║                    Last updated: 2024-12-22 14:32:15 UTC                      ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  TRAFFIC SUMMARY (Last 24 Hours)                                             ║
║  ────────────────────────────────────────────────────────────────────────── ║
║                                                                              ║
║  ┌─────────────────┬────────────────┬─────────────────────────────────────┐ ║
║  │ Metric          │ Value          │ Graph (24h)                         │ ║
║  ├─────────────────┼────────────────┼─────────────────────────────────────┤ ║
║  │ Total Flows     │ 2,456,789      │ ▂▃▄▅▆▇█▇▆▅▄▅▆▇█▇▆▅▄▃▂▃▄▅          │ ║
║  │ Accepted        │ 2,401,234      │ 97.7% ████████████████████░░        │ ║
║  │ Rejected        │ 55,555         │  2.3% █░░░░░░░░░░░░░░░░░░░░░        │ ║
║  │ Data In         │ 892 GB         │ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆          │ ║
║  │ Data Out        │ 1.2 TB         │ ▃▄▅▆▇█▇▆▅▄▃▂▃▄▅▆▇█▇▆▅▄▃▂          │ ║
║  │ Unique Sources  │ 12,456         │                                     │ ║
║  │ Unique Dests    │ 8,234          │                                     │ ║
║  └─────────────────┴────────────────┴─────────────────────────────────────┘ ║
║                                                                              ║
║  TOP TALKERS (by bytes transferred)                                          ║
║  ────────────────────────────────────────────────────────────────────────── ║
║                                                                              ║
║  ┌────┬────────────────────────────┬──────────────┬────────────────────────┐║
║  │ #  │ Source                     │ Bytes        │ Top Destination        │║
║  ├────┼────────────────────────────┼──────────────┼────────────────────────┤║
║  │ 1  │ i-abc123 (web-server-1)   │ 245.6 GB     │ ALB (internal)         │║
║  │ 2  │ i-def456 (db-primary)     │ 189.2 GB     │ db-replica (10.0.11.x) │║
║  │ 3  │ i-ghi789 (app-server-1)   │ 156.8 GB     │ S3 (via endpoint)      │║
║  │ 4  │ i-jkl012 (batch-worker)   │ 98.4 GB      │ External APIs          │║
║  │ 5  │ i-mno345 (cache-server)   │ 67.2 GB      │ app-servers            │║
║  └────┴────────────────────────────┴──────────────┴────────────────────────┘║
║                                                                              ║
║  🚨 SECURITY ALERTS                                                          ║
║  ────────────────────────────────────────────────────────────────────────── ║
║                                                                              ║
║  🔴 CRITICAL [14:28:32] Port Scan Detected                                   ║
║     ┌──────────────────────────────────────────────────────────────────────┐║
║     │ Source: 203.0.113.50 (external - Tor exit node)                      │║
║     │ Target: 10.0.1.0/24 (public subnet)                                  │║
║     │ Ports scanned: 22, 23, 80, 443, 3306, 5432, 6379, 27017 (15 total)  │║
║     │ Duration: 45 seconds                                                  │║
║     │ Status: All REJECTED by Security Groups                              │║
║     │                                                                       │║
║     │ RECOMMENDED ACTION:                                                   │║
║     │ aws ec2 create-network-acl-entry --network-acl-id acl-xxx \          │║
║     │   --rule-number 50 --protocol -1 --cidr-block 203.0.113.50/32 \      │║
║     │   --egress false --rule-action deny                                   │║
║     └──────────────────────────────────────────────────────────────────────┘║
║                                                                              ║
║  🟡 WARNING [14:15:22] Unusual Outbound Traffic                              ║
║     ┌──────────────────────────────────────────────────────────────────────┐║
║     │ Source: i-xyz789 (10.0.2.50) - Name: batch-processor-3               │║
║     │ Destination: 185.143.223.100 (external)                              │║
║     │ GeoIP: Russia, Moscow                                                 │║
║     │ Threat Intel: Listed in AbuseIPDB (C2 server suspected)              │║
║     │ Bytes transferred: 2.3 GB over 4 hours                               │║
║     │ Pattern: Large outbound, minimal inbound (data exfiltration?)        │║
║     │                                                                       │║
║     │ RECOMMENDED ACTIONS:                                                  │║
║     │ 1. Isolate instance: aws ec2 modify-instance-attribute \             │║
║     │      --instance-id i-xyz789 --groups sg-isolated                     │║
║     │ 2. Create forensic snapshot before termination                       │║
║     │ 3. Review CloudTrail for instance compromise indicators              │║
║     └──────────────────────────────────────────────────────────────────────┘║
║                                                                              ║
║  🟢 INFO [13:45:00] New Communication Path Detected                          ║
║     i-web123 (web-tier) → i-cache456 (redis) on port 6379                   ║
║     First seen: 2024-12-22 13:45:00 (new deployment?)                       ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝

# Query specific traffic patterns
$ ./flow-analyzer query --sql "
  SELECT srcaddr, COUNT(*) as attempts, COUNT(DISTINCT dstport) as ports
  FROM flows
  WHERE action = 'REJECT'
    AND dstport IN (22, 23, 3389)
    AND start_time > now() - interval '1 hour'
  GROUP BY srcaddr
  HAVING COUNT(DISTINCT dstport) > 2
  ORDER BY attempts DESC
  LIMIT 10
"

┌─────────────────┬──────────┬───────┐
│ srcaddr         │ attempts │ ports │
├─────────────────┼──────────┼───────┤
│ 185.143.223.x   │ 1,456    │ 3     │
│ 203.0.113.50    │ 892      │ 3     │
│ 192.168.1.100   │ 234      │ 2     │  ← Internal! Misconfigured?
│ 45.33.32.156    │ 189      │ 3     │
└─────────────────┴──────────┴───────┘

# Generate security report
$ ./flow-analyzer report --format pdf --period "last 7 days" --output weekly-report.pdf

Generated: weekly-report.pdf
Contents:
  - Executive Summary
  - Traffic Volume Trends
  - Top Talkers Analysis
  - Security Incidents (12 alerts)
  - Rejected Traffic Analysis
  - Recommendations
  - Appendix: Raw Data

# Export to SIEM
$ ./flow-analyzer export --format splunk --dest "https://splunk.company.com:8088/services/collector"

Exported 2,456,789 records to Splunk

Architecture You’ll Build:

┌─────────────────────────────────────────────────────────────────────────────┐
│                         VPC FLOW LOGS ANALYZER                               │
│                                                                             │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────────────┐ │
│  │    VPC      │      │   S3        │      │      Flow Analyzer          │ │
│  │             │      │   Bucket    │      │                             │ │
│  │  ┌───────┐  │      │             │      │  ┌──────────────────────┐  │ │
│  │  │ ENI   │──┼──────┼─► Flow Logs ├──────┼──► Log Parser           │  │ │
│  │  └───────┘  │      │             │      │  └──────────┬───────────┘  │ │
│  │  ┌───────┐  │      │             │      │             │              │ │
│  │  │ ENI   │──┼──────┤             │      │  ┌──────────▼───────────┐  │ │
│  │  └───────┘  │      │             │      │  │ Enrichment Engine    │  │ │
│  │  ┌───────┐  │      │             │      │  │  - ENI → Instance    │  │ │
│  │  │ ENI   │──┼──────┤             │      │  │  - GeoIP lookup      │  │ │
│  │  └───────┘  │      │             │      │  │  - Threat Intel      │  │ │
│  │             │      └─────────────┘      │  └──────────┬───────────┘  │ │
│  └─────────────┘                           │             │              │ │
│                                            │  ┌──────────▼───────────┐  │ │
│                                            │  │ Analytics Engine     │  │ │
│                                            │  │  - Anomaly detection │  │ │
│                                            │  │  - Pattern matching  │  │ │
│                                            │  │  - Alerting          │  │ │
│                                            │  └──────────┬───────────┘  │ │
│                                            │             │              │ │
│                                            │  ┌──────────▼───────────┐  │ │
│                                            │  │ Storage (TimescaleDB │  │ │
│                                            │  │ or ClickHouse)       │  │ │
│                                            │  └──────────┬───────────┘  │ │
│                                            │             │              │ │
│                                            │  ┌──────────▼───────────┐  │ │
│                                            │  │ CLI / Dashboard      │  │ │
│                                            │  │  - Real-time view    │  │ │
│                                            │  │  - SQL queries       │  │ │
│                                            │  │  - Reports           │  │ │
│                                            │  └──────────────────────┘  │ │
│                                            └─────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘

The Core Question You’re Answering

“What traffic is actually flowing through my VPC, and how do I detect problems and attacks?”

VPC Flow Logs are your eyes into the network. Without them, you’re blind—you don’t know what’s communicating with what, whether traffic is being blocked, or if there’s a security incident. This project teaches you to transform raw metadata into actionable intelligence.


Concepts You Must Understand First

Stop and research these before coding:

  1. VPC Flow Log Record Format
    • What are the default fields in v2 flow logs?
    • What additional fields can you add in v3+?
    • What does each field actually represent?
    • Why are there “NODATA” and “SKIPDATA” entries?
    • Book Reference: AWS VPC Flow Logs Documentation
  2. Network Protocol Numbers
    • What does protocol “6” mean? (TCP) Protocol “17”? (UDP) Protocol “1”? (ICMP)
    • Where do these numbers come from? (IANA protocol numbers)
    • Why is this important for analyzing traffic?
    • Book Reference: “TCP/IP Illustrated, Volume 1” by Stevens — Ch. 1: Introduction
  3. Flow vs Packet
    • What’s the difference between a flow and a packet?
    • Why does AWS capture flows instead of packets?
    • How long a time window does one flow record cover? (aggregation window)
    • Book Reference: “The Practice of Network Security Monitoring” by Bejtlich — Ch. 5: Flow Data
  4. Security Attack Patterns
    • What does a port scan look like in flow logs?
    • How do you detect brute force attacks?
    • What indicates data exfiltration?
    • What’s a C2 (Command & Control) communication pattern?
    • Book Reference: “The Practice of Network Security Monitoring” by Bejtlich — Ch. 8: Analysis Techniques
  5. ENI (Elastic Network Interface) Architecture
    • What is an ENI and how does it relate to instances?
    • Why do flow logs capture at the ENI level, not instance level?
    • How do you map ENI to instance/resource?
    • Book Reference: AWS Documentation on ENIs
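
For the protocol-number concept, Python's standard library already carries the IANA assignments, so you can sanity-check values without memorizing them (a minimal sketch):

```python
import socket

# IANA-assigned IP protocol numbers, exposed as stdlib constants
PROTOCOLS = {
    socket.IPPROTO_ICMP: "ICMP",  # 1
    socket.IPPROTO_TCP: "TCP",    # 6
    socket.IPPROTO_UDP: "UDP",    # 17
}

def protocol_name(number: int) -> str:
    """Map a flow-log protocol number to a readable name."""
    return PROTOCOLS.get(number, f"protocol-{number}")

print(protocol_name(6), protocol_name(17), protocol_name(1))  # TCP UDP ICMP
```

Anything outside this small map (e.g. 47 for GRE) falls back to the raw number, which is usually enough for flow analysis.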

Questions to Guide Your Design

Before implementing, think through these:

  1. Data Ingestion
    • Where should flow logs be delivered—S3, CloudWatch Logs, or Kinesis?
    • How do you handle the delay between event and log availability?
    • How do you process backlog when starting up?
    • What’s your strategy for handling millions of records per hour?
  2. Data Storage
    • What database is best for time-series flow data? (TimescaleDB, ClickHouse, etc.)
    • How do you partition data for efficient queries?
    • How long should you retain data?
    • How do you handle storage costs at scale?
  3. Enrichment
    • How do you map ENI IDs to instance names?
    • Should you do GeoIP lookups? Performance implications?
    • How do you integrate threat intelligence feeds?
    • How often should you refresh ENI mappings?
  4. Detection Logic
    • What thresholds define a “port scan”?
    • How do you distinguish attack traffic from legitimate scanning?
    • What’s “unusual” outbound traffic?
    • How do you avoid alert fatigue?
  5. Output & Alerting
    • How should alerts be delivered—Slack, PagerDuty, email?
    • What severity levels should you use?
    • How do you make the dashboard useful for both security and operations?

Thinking Exercise

Before coding, analyze these flow log samples:

Sample 1: Normal Web Traffic

2 123456789012 eni-web123 203.0.113.50 10.0.1.100 52341 443 6 25 15000 1639489200 1639489260 ACCEPT OK
2 123456789012 eni-web123 10.0.1.100 203.0.113.50 443 52341 6 30 125000 1639489200 1639489260 ACCEPT OK

Questions:

  • Which direction is the client → server traffic?
  • How many bytes did the server send vs receive?
  • What service is being accessed?

Sample 2: Port Scan

2 123456789012 eni-web123 185.143.223.x 10.0.1.100 45123 22 6 1 40 1639489200 1639489201 REJECT OK
2 123456789012 eni-web123 185.143.223.x 10.0.1.100 45124 23 6 1 40 1639489201 1639489202 REJECT OK
2 123456789012 eni-web123 185.143.223.x 10.0.1.100 45125 80 6 1 40 1639489202 1639489203 REJECT OK
2 123456789012 eni-web123 185.143.223.x 10.0.1.100 45126 443 6 1 40 1639489203 1639489204 REJECT OK
2 123456789012 eni-web123 185.143.223.x 10.0.1.100 45127 3306 6 1 40 1639489204 1639489205 REJECT OK

Questions:

  • How do you know this is a port scan?
  • What’s the scan rate (ports per second)?
  • Why are all actions REJECT?
  • What NACL rule would block this?
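
Working the numbers for Sample 2: the records probe five distinct ports over the four seconds between the first and last start timestamps, i.e. a little over one port per second. A minimal sketch of that arithmetic, with timestamps and ports copied from the sample:

```python
# (start_time, dstport) pairs taken from the Sample 2 flow records
probes = [
    (1639489200, 22),
    (1639489201, 23),
    (1639489202, 80),
    (1639489203, 443),
    (1639489204, 3306),
]

distinct_ports = {port for _, port in probes}
duration = max(ts for ts, _ in probes) - min(ts for ts, _ in probes)
rate = len(distinct_ports) / max(duration, 1)  # ports per second

print(f"{len(distinct_ports)} ports in {duration}s -> {rate:.2f} ports/sec")
# 5 ports in 4s -> 1.25 ports/sec
```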

Sample 3: Possible Data Exfiltration

2 123456789012 eni-app456 10.0.10.50 185.143.223.100 49152 443 6 50000 52428800 1639400000 1639486400 ACCEPT OK
2 123456789012 eni-app456 185.143.223.100 10.0.10.50 443 49152 6 1000 50000 1639400000 1639486400 ACCEPT OK

Questions:

  • How much data was sent outbound? (52428800 bytes = 50 MB)
  • Over what time period? (86400 seconds = 24 hours)
  • Why is the ratio suspicious? (50 MB out, 50 KB in)
  • What’s the destination? (External IP—needs investigation)
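
The suspicious ratio in Sample 3 can be turned into a simple heuristic: flag any long-lived flow whose outbound bytes dwarf the inbound bytes. A minimal sketch (the thresholds are illustrative starting points, not tuned values):

```python
def looks_like_exfiltration(bytes_out: int, bytes_in: int, duration_s: int,
                            ratio_threshold: float = 100.0,
                            min_bytes_out: int = 10 * 1024 * 1024) -> bool:
    """Flag long-lived flows that are overwhelmingly outbound."""
    if bytes_out < min_bytes_out or duration_s < 3600:
        return False  # too small or too short to care about
    ratio = bytes_out / max(bytes_in, 1)  # avoid division by zero
    return ratio >= ratio_threshold

# Sample 3: 52,428,800 bytes out vs 50,000 in, over 86,400 seconds
print(looks_like_exfiltration(52_428_800, 50_000, 86_400))  # True
```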

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What are VPC Flow Logs and what do they capture?”
  2. “What’s the difference between ACCEPT and REJECT in flow logs?”
  3. “How would you detect a port scan using flow logs?”
  4. “What are the limitations of VPC Flow Logs?” (no payload capture; traffic to the Amazon DNS resolver isn’t logged)
  5. “How would you set up flow logs for compliance/audit requirements?”
  6. “What’s the performance impact of enabling flow logs?”
  7. “How do you analyze flow logs at scale?”
  8. “What’s the difference between flow logs sent to S3 vs CloudWatch Logs?”
  9. “How would you use flow logs to troubleshoot a connectivity issue?”
  10. “What security threats can you detect with flow logs?”

Hints in Layers

Hint 1: Parse flow log records efficiently

from dataclasses import dataclass
from typing import Optional
import ipaddress

@dataclass
class FlowRecord:
    version: int
    account_id: str
    interface_id: str
    srcaddr: str
    dstaddr: str
    srcport: int
    dstport: int
    protocol: int
    packets: int
    bytes: int
    start: int
    end: int
    action: str
    log_status: str

    @classmethod
    def from_line(cls, line: str) -> Optional['FlowRecord']:
        """Parse a single flow log line."""
        parts = line.strip().split()
        if len(parts) < 14 or parts[0] != '2':  # Version 2
            return None

        return cls(
            version=int(parts[0]),
            account_id=parts[1],
            interface_id=parts[2],
            srcaddr=parts[3],
            dstaddr=parts[4],
            srcport=int(parts[5]) if parts[5] != '-' else 0,
            dstport=int(parts[6]) if parts[6] != '-' else 0,
            protocol=int(parts[7]) if parts[7] != '-' else 0,
            packets=int(parts[8]) if parts[8] != '-' else 0,
            bytes=int(parts[9]) if parts[9] != '-' else 0,
            start=int(parts[10]) if parts[10] != '-' else 0,
            end=int(parts[11]) if parts[11] != '-' else 0,
            action=parts[12],
            log_status=parts[13]
        )

    def is_external_source(self) -> bool:
        """Check if source IP is external (not RFC1918)."""
        try:
            ip = ipaddress.ip_address(self.srcaddr)
            return not ip.is_private
        except ValueError:
            return False

    def protocol_name(self) -> str:
        """Convert protocol number to name."""
        protocols = {1: 'ICMP', 6: 'TCP', 17: 'UDP'}
        return protocols.get(self.protocol, str(self.protocol))

Hint 2: Build the ENI-to-Instance mapping

import boto3

class ENIMapper:
    def __init__(self):
        self.ec2 = boto3.client('ec2')
        self._cache = {}

    def refresh_cache(self):
        """Refresh ENI to instance mapping."""
        self._cache = {}

        # Get all ENIs
        paginator = self.ec2.get_paginator('describe_network_interfaces')
        for page in paginator.paginate():
            for eni in page['NetworkInterfaces']:
                eni_id = eni['NetworkInterfaceId']
                attachment = eni.get('Attachment', {})

                self._cache[eni_id] = {
                    'instance_id': attachment.get('InstanceId'),
                    'private_ip': eni.get('PrivateIpAddress'),
                    'vpc_id': eni.get('VpcId'),
                    'subnet_id': eni.get('SubnetId'),
                    'description': eni.get('Description', '')
                }

        # Enrich with instance names
        # Dedupe: an instance can own several ENIs
        instance_ids = list({v['instance_id'] for v in self._cache.values()
                             if v['instance_id']})
        if instance_ids:
            instances = self.ec2.describe_instances(InstanceIds=instance_ids)
            for reservation in instances['Reservations']:
                for instance in reservation['Instances']:
                    name = next(
                        (tag['Value'] for tag in instance.get('Tags', [])
                         if tag['Key'] == 'Name'),
                        instance['InstanceId']
                    )
                    # Update all ENIs for this instance
                    for eni_id, data in self._cache.items():
                        if data['instance_id'] == instance['InstanceId']:
                            data['instance_name'] = name

    def lookup(self, eni_id: str) -> dict:
        """Look up ENI details."""
        return self._cache.get(eni_id, {'unknown': True})

Hint 3: Implement port scan detection

from collections import defaultdict
from datetime import datetime, timedelta
from typing import Optional

class PortScanDetector:
    def __init__(self, window_seconds=60, port_threshold=10):
        self.window_seconds = window_seconds
        self.port_threshold = port_threshold
        # {source_ip: [(timestamp, port), ...]}
        self.activity = defaultdict(list)

    def check(self, record: FlowRecord) -> Optional[dict]:
        """Check if this record indicates a port scan."""
        if record.action != 'REJECT':
            return None  # Only care about rejected (probing) traffic

        now = datetime.utcnow()
        source = record.srcaddr

        # Add this activity
        self.activity[source].append((now, record.dstport))

        # Clean old activity
        cutoff = now - timedelta(seconds=self.window_seconds)
        self.activity[source] = [
            (ts, port) for ts, port in self.activity[source]
            if ts > cutoff
        ]

        # Check for port scan pattern
        recent_ports = set(port for _, port in self.activity[source])
        if len(recent_ports) >= self.port_threshold:
            return {
                'type': 'PORT_SCAN',
                'severity': 'HIGH',
                'source_ip': source,
                'ports_scanned': sorted(recent_ports),
                'window_seconds': self.window_seconds,
                'recommendation': f"Block {source} in NACL"
            }

        return None

Hint 4: Create the analytics dashboard

from rich.console import Console
from rich.table import Table

def format_bytes(n: float) -> str:
    """Render a byte count as a human-readable string."""
    for unit in ('B', 'KB', 'MB', 'GB', 'TB'):
        if n < 1024 or unit == 'TB':
            return f"{n:.1f} {unit}"
        n /= 1024

class Dashboard:
    def __init__(self, db, eni_mapper):
        self.db = db
        self.eni_mapper = eni_mapper
        self.console = Console()

    def render(self):
        """Render the dashboard."""
        # Traffic summary
        summary = self.db.query("""
            SELECT
                COUNT(*) as total_flows,
                SUM(CASE WHEN action = 'ACCEPT' THEN 1 ELSE 0 END) as accepted,
                SUM(CASE WHEN action = 'REJECT' THEN 1 ELSE 0 END) as rejected,
                SUM(bytes) as total_bytes
            FROM flows
            WHERE start_time > now() - interval '24 hours'
        """).fetchone()

        # Top talkers
        top_talkers = self.db.query("""
            SELECT
                srcaddr,
                SUM(bytes) as total_bytes,
                COUNT(DISTINCT dstaddr) as unique_destinations
            FROM flows
            WHERE action = 'ACCEPT'
              AND start_time > now() - interval '24 hours'
            GROUP BY srcaddr
            ORDER BY total_bytes DESC
            LIMIT 5
        """).fetchall()

        # Build display
        table = Table(title="Top Talkers (24h)")
        table.add_column("Source IP")
        table.add_column("Bytes", justify="right")
        table.add_column("Destinations", justify="right")

        for row in top_talkers:
            # Enrich with instance name (assumes the mapper also indexes by
            # private IP via a lookup_by_ip helper)
            eni_info = self.eni_mapper.lookup_by_ip(row['srcaddr'])
            name = eni_info.get('instance_name', row['srcaddr'])
            table.add_row(
                name,
                format_bytes(row['total_bytes']),
                str(row['unique_destinations'])
            )

        self.console.print(table)

Books That Will Help

  • Network flow analysis: “The Practice of Network Security Monitoring” by Bejtlich, Ch. 5: Flow Data; Ch. 8: Analysis
  • TCP/IP fundamentals: “TCP/IP Illustrated, Volume 1” by Stevens, Ch. 1-4: Protocol basics
  • Security monitoring: “Applied Network Security Monitoring” by Sanders, Ch. 5-7: Collection and Analysis
  • Time-series databases: “Designing Data-Intensive Applications” by Kleppmann, Ch. 3: Storage and Retrieval
  • Python data processing: “Data Engineering with Python” by Reis, Ch. 4-6: Pipelines
  • AWS networking: “AWS Certified Advanced Networking Study Guide”, VPC Flow Logs section

Common Pitfalls & Debugging

Problem 1: “Flow Logs aren’t appearing in S3/CloudWatch”

  • Why: Most common causes: (1) IAM role doesn’t have permissions, (2) Flow Logs not enabled for the right resource, (3) Filter is set to “REJECT only” and all traffic is allowed
  • Fix: Verify the setup:
    # 1. Check Flow Logs are enabled
    aws ec2 describe-flow-logs --filter "Name=resource-id,Values=vpc-xxx"
    
    # 2. Check IAM role has permissions
    aws iam get-role --role-name flowlogsRole
    # Should have: logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents
    
    # 3. Check filter setting
    # traffic-type should be "ALL" for full visibility
    
  • Quick test: Generate traffic (ping, curl) and wait 10-15 minutes (Flow Logs have delay)

Problem 2: “S3 bucket has millions of Flow Log files, costs are high”

  • Why: Flow Logs create a new file roughly every 5 minutes per ENI. With 100 ENIs, that’s about 864,000 files/month, and S3 LIST operations cost money.
  • Fix: Use Athena partitioning and lifecycle policies:
    -- Create Athena table with partitions by date
    CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
      ...
    ) PARTITIONED BY (year string, month string, day string)
    LOCATION 's3://my-bucket/AWSLogs/account-id/vpcflowlogs/'
    
    # Add S3 lifecycle rule to delete old logs
    aws s3api put-bucket-lifecycle-configuration \
      --bucket flow-logs-bucket \
      --lifecycle-configuration '{
        "Rules": [{
          "Status": "Enabled",
          "Expiration": {"Days": 90}
        }]
      }'
    
  • Quick test: Check S3 costs in Cost Explorer, should drop after lifecycle policy

Problem 3: “How do I know if traffic was blocked by Security Group vs NACL?”

  • Why: Flow Logs record REJECT for traffic denied by either security groups or NACLs; the action field doesn’t say which layer dropped it.
  • Fix: Use the stateful/stateless difference to narrow it down:
    # Security groups are STATEFUL: once a connection is accepted, return
    # traffic is always allowed. NACLs are STATELESS: every packet,
    # including responses, is evaluated against the rules.
    
    # 1. Inbound ACCEPT followed by outbound REJECT for the same flow
    #    => the outbound NACL blocked the response (an SG could not)
    
    # 2. Otherwise, check NACL rules for an explicit DENY
    nacls = get_nacls_for_subnet(subnet_id)
    for rule in nacls:
        if rule.action == "DENY" and matches(packet, rule):
            return "NACL blocked"
    
    # 3. If no NACL DENY matches, it's the Security Group's implicit deny
    return "Security Group blocked"
    
  • Quick test: Block port 22 in the NACL, then in the SG instead; both produce REJECT entries, but only the stateless NACL also rejects return traffic of established connections

Problem 4: “Flow Logs show ACCEPT but connection still fails”

  • Why: Flow Logs show packets reached the ENI, but application might not be listening, or return path is broken.
  • Fix: Check both directions:
    # Inbound ACCEPT but outbound missing = asymmetric routing
    # Filter Flow Logs for both:
    # Flow log fields are space-separated; match the src/dst address pair
    grep "10.0.1.50 10.0.2.100" flow.log   # request direction
    grep "10.0.2.100 10.0.1.50" flow.log   # return direction
    
    # If only inbound shows ACCEPT, return path is broken
    
  • Quick test: Check route tables, ensure return traffic can reach source
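
The both-directions check can be automated by indexing flows on their 5-tuple and flagging entries with no reverse counterpart. A minimal sketch (the tuple layout and sample addresses are illustrative, not the analyzer’s actual schema):

```python
def find_one_way_flows(flows):
    """Return flows that have no matching reverse-direction flow.

    Each flow is a (srcaddr, dstaddr, srcport, dstport, protocol) tuple.
    """
    seen = set(flows)
    return [
        (src, dst, sport, dport, proto)
        for (src, dst, sport, dport, proto) in flows
        # Reverse tuple swaps addresses and ports; protocol stays the same
        if (dst, src, dport, sport, proto) not in seen
    ]

flows = [
    ("10.0.1.50", "10.0.2.100", 52341, 443, 6),  # request
    ("10.0.2.100", "10.0.1.50", 443, 52341, 6),  # response seen: healthy
    ("10.0.1.50", "10.0.3.200", 52342, 443, 6),  # no response: suspect
]
print(find_one_way_flows(flows))
# [('10.0.1.50', '10.0.3.200', 52342, 443, 6)]
```

One-way entries point at broken return paths (missing routes, asymmetric routing) or at a destination that never answered.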

Problem 5: “Athena query on Flow Logs is super slow and expensive”

  • Why: Full table scan across millions of files without partitions or compression.
  • Fix: Optimize the data:
    -- 1. Use partitions (year/month/day)
    -- 2. Convert to Parquet format (10x faster, 10x cheaper)
    CREATE TABLE flow_logs_optimized
    WITH (format='PARQUET', partitioned_by=ARRAY['year','month','day'])
    AS SELECT * FROM vpc_flow_logs;
    
    -- 3. Query with partition filters
    SELECT * FROM flow_logs_optimized
    WHERE year='2025' AND month='01' AND day='15'
      AND srcaddr='10.0.1.50';
    
  • Quick test: Compare query costs in Athena console (should be 90%+ reduction)

Problem 6: “Can’t distinguish between different types of REJECT (SG vs NACL vs routing)”

  • Why: Flow Logs action field only says “REJECT”, not why.
  • Fix: Cross-reference with configuration:
    def analyze_reject(flow_log_entry):
        src_ip = flow_log_entry['srcaddr']
        dst_ip = flow_log_entry['dstaddr']
        port = flow_log_entry['dstport']
    
        # Check route table
        if not has_route(src_subnet, dst_subnet):
            return "REJECT: No route"
    
        # Check NACL
        if nacl_blocks(src_subnet, dst_subnet, port):
            return "REJECT: Network ACL"
    
        # Check Security Group
        if sg_blocks(dst_eni, port):
            return "REJECT: Security Group"
    
        return "REJECT: Unknown (possible OS firewall)"
    
  • Quick test: Create known NACL block, verify your analyzer identifies it correctly

Problem 7: “Too much data to process, queries take forever”

  • Why: Processing 10GB+ of raw Flow Logs daily.
  • Fix: Use the maximum aggregation interval plus pre-aggregation (VPC Flow Logs do not support packet sampling):
    # Use the 10-minute aggregation interval to cut record volume
    --traffic-type ALL --max-aggregation-interval 600
    
    # Pre-aggregate in Lambda/Glue
    hourly_summary = flow_logs.groupby(['srcaddr', 'dstaddr', 'dstport']).agg({
        'bytes': 'sum',
        'packets': 'count',
        'action': lambda x: 'REJECT' if 'REJECT' in x.values else 'ACCEPT'
    })
    
  • Quick test: Compare processing time and costs (should be 10x improvement)

Problem 8: “Flow Logs show huge data transfer to unknown IP”

  • Why: This could be legitimate (CDN, AWS service endpoint) or data exfiltration.
  • Fix: Investigate systematically:
    # 1. Identify the IP owner
    whois <unknown-ip>
    
    # 2. Check if it's an AWS service endpoint
    nslookup <ip-address>
    # If it resolves to *.amazonaws.com, it's AWS
    
    # 3. Check the instance's IAM role and security posture
    aws ec2 describe-instances --instance-ids i-xxx
    
    # 4. If suspicious, isolate instance immediately
    aws ec2 modify-instance-attribute --instance-id i-xxx \
      --groups sg-isolation-quarantine
    
  • Quick test: Simulate by uploading large file to S3, verify Flow Logs show S3 endpoint IPs

Project 4: VPC Peering vs Transit Gateway Lab

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: AWS CDK, CloudFormation
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Cloud Architecture / Networking
  • Software or Tool: AWS VPC Peering, Transit Gateway
  • Main Book: “AWS Certified Advanced Networking Study Guide”

What you’ll build: Two parallel network architectures—one using VPC Peering (full mesh) and one using Transit Gateway (hub-and-spoke)—with performance tests and cost analysis to understand when to use each.

Why it teaches AWS networking: The VPC Peering vs Transit Gateway decision is one of the most important architectural choices. This project lets you experience both, measure the differences, and make informed decisions.

Core challenges you’ll face:

  • Full mesh complexity (n VPCs = n(n-1)/2 peering connections) → maps to scaling limits
  • Non-transitive routing (VPC A↔B↔C doesn’t mean A↔C) → maps to routing behavior
  • TGW route table design (attachment associations, propagations) → maps to hub routing
  • Latency measurement (extra hop in TGW) → maps to performance tradeoffs
  • Cost analysis (TGW hourly + data processing vs peering data transfer) → maps to FinOps

Key Concepts:

Difficulty: Advanced Time estimate: 2 weeks Prerequisites: VPC fundamentals, Terraform

Real world outcome:

$ terraform apply -var="architecture=peering"
Creating VPC Peering architecture (5 VPCs, full mesh)...
VPC-A ↔ VPC-B: peering-001
VPC-A ↔ VPC-C: peering-002
VPC-A ↔ VPC-D: peering-003
VPC-A ↔ VPC-E: peering-004
VPC-B ↔ VPC-C: peering-005
... (10 peering connections total)
Route tables: 10 routes per VPC

$ terraform apply -var="architecture=tgw"
Creating Transit Gateway architecture (5 VPCs, hub-and-spoke)...
Transit Gateway: tgw-0abc123
VPC-A attachment: tgw-attach-001
VPC-B attachment: tgw-attach-002
... (5 attachments total)
Route tables: 1 route per VPC (summary CIDR → TGW)

$ ./network-benchmark

╔══════════════════════════════════════════════════════════════╗
║         VPC PEERING vs TRANSIT GATEWAY COMPARISON            ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  CONFIGURATION COMPLEXITY                                    ║
║  ──────────────────────────────────────────────────────────  ║
║  Peering (5 VPCs):     10 connections, 50 route entries      ║
║  TGW (5 VPCs):         1 TGW, 5 attachments, 5 route entries ║
║                                                              ║
║  LATENCY (VPC-A to VPC-E, avg of 1000 pings)                 ║
║  ──────────────────────────────────────────────────────────  ║
║  VPC Peering:          0.3ms (direct connection)             ║
║  Transit Gateway:      0.5ms (+0.2ms for TGW hop)            ║
║                                                              ║
║  BANDWIDTH (iperf3, same AZ)                                 ║
║  ──────────────────────────────────────────────────────────  ║
║  VPC Peering:          ~unlimited (AWS backbone)             ║
║  Transit Gateway:      up to 100 Gbps per VPC attachment     ║
║                                                              ║
║  MONTHLY COST (5 VPCs, 1 TB cross-VPC traffic)               ║
║  ──────────────────────────────────────────────────────────  ║
║  VPC Peering:          $10.00 (data transfer only)           ║
║  Transit Gateway:      ~$203.00 ($36.50/attach + $0.02/GB)   ║
║                                                              ║
║  SCALING TO 50 VPCs                                          ║
║  ──────────────────────────────────────────────────────────  ║
║  Peering connections:  1,225 (50*49/2) - UNMANAGEABLE!       ║
║  TGW attachments:      50 - easy to manage                   ║
║                                                              ║
╚══════════════════════════════════════════════════════════════╝

RECOMMENDATION:
- < 10 VPCs, low traffic: VPC Peering (cost-effective)
- > 10 VPCs or hybrid connectivity: Transit Gateway (manageable)
- Latency-critical applications: VPC Peering (direct path)

Implementation Hints: VPC Peering full mesh calculation:

For n VPCs:
  Peering connections = n * (n-1) / 2
  Route entries per VPC = n - 1

  5 VPCs:  10 connections, 4 routes each
  10 VPCs: 45 connections, 9 routes each
  50 VPCs: 1,225 connections, 49 routes each (NIGHTMARE!)

Transit Gateway scales linearly:

For n VPCs:
  TGW attachments = n
  Route entries per VPC = 1 (point to TGW)

  5 VPCs:  5 attachments
  50 VPCs: 50 attachments
  500 VPCs: 500 attachments (TGW supports 5,000)

Key insight about transitivity:

VPC Peering: A ↔ B, B ↔ C does NOT mean A ↔ C
Transit Gateway: A → TGW → C works automatically

# This is why peering becomes unmanageable at scale
# Every VPC needs direct peering to every other VPC
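
The scaling formulas above in runnable form:

```python
def peering_connections(n: int) -> int:
    """Full-mesh VPC peering: every VPC pairs with every other."""
    return n * (n - 1) // 2

def tgw_attachments(n: int) -> int:
    """Transit Gateway hub-and-spoke: one attachment per VPC."""
    return n

for n in (5, 10, 50):
    print(f"{n:>3} VPCs: {peering_connections(n):>5} peerings "
          f"vs {tgw_attachments(n)} TGW attachments")
```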

Learning milestones:

  1. Both architectures deploy successfully → You understand the components
  2. Traffic flows in both designs → You understand routing
  3. Measure latency difference → You understand the TGW hop
  4. Calculate cost breakeven point → You can make business decisions

Common Pitfalls & Debugging

Problem 1: “VPC Peering connection stuck in ‘Pending Acceptance’ state”

  • Why: Peering requires explicit acceptance by the peer VPC owner—even within the same account (for cross-region peering, acceptance must happen in the peer’s region).
  • Fix: Accept the peering connection:
    # List pending peering connections
    aws ec2 describe-vpc-peering-connections \
      --filters "Name=status-code,Values=pending-acceptance"
    
    # Accept from the peer VPC's region
    aws ec2 accept-vpc-peering-connection \
      --vpc-peering-connection-id pcx-xxx \
      --region <peer-region>
    
  • Quick test: Check status changes to “active”

Problem 2: “Transit Gateway shows ‘available’ but VPCs can’t communicate”

  • Why: Missing route table entries. TGW attachments don’t automatically add routes to VPC route tables.
  • Fix: Add routes manually to each VPC’s route table:
    # For each VPC, add route to other VPCs via TGW
    aws ec2 create-route \
      --route-table-id rtb-vpc-a \
      --destination-cidr-block 10.1.0.0/16 \
      --transit-gateway-id tgw-xxx
    
    aws ec2 create-route \
      --route-table-id rtb-vpc-b \
      --destination-cidr-block 10.0.0.0/16 \
      --transit-gateway-id tgw-xxx
    
  • Quick test: ping from instance in VPC-A to instance in VPC-B

Problem 3: “VPC Peering fails with ‘Overlapping CIDR blocks’ error”

  • Why: You can’t peer VPCs with overlapping IP ranges (e.g., both using 10.0.0.0/16).
  • Fix: Plan non-overlapping CIDR blocks from the start. If already deployed, you must recreate one VPC with a different range:
    VPC-A: 10.0.0.0/16  ✅
    VPC-B: 10.1.0.0/16  ✅
    VPC-C: 172.16.0.0/16  ✅
    
    VPC-A: 10.0.0.0/16  ❌
    VPC-B: 10.0.0.0/16  ❌ (overlaps with VPC-A)
    
  • Quick test: No fix - this is by design. Prevention is key.

Problem 4: “Transit Gateway attachment costs are $36/month per VPC!”

  • Why: Each TGW attachment costs $0.05/hour ($36.50/month), plus data transfer costs ($0.02/GB).
  • Fix: Calculate the breakeven point:
    Peering: no hourly charge (cross-AZ/cross-region data transfer still billed)
    Transit Gateway: $36.50/month per attachment + $0.02/GB data processing
    
    Breakeven: ~4+ VPCs (where full-mesh peering becomes operationally complex)
    
    Example:
    - 3 VPCs: Peering wins on cost and simplicity (3 connections)
    - 10 VPCs: TGW wins operationally (10 attachments vs 45 peering
      connections), though peering stays cheaper in raw dollars
    
  • Quick test: Use AWS Cost Calculator to compare scenarios
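
A back-of-the-envelope version of that comparison, using the $0.05/attachment-hour and $0.02/GB figures quoted above and assuming roughly $0.01/GB for peering data transfer (verify current regional pricing before relying on these numbers):

```python
HOURS_PER_MONTH = 730  # common AWS billing approximation

def peering_monthly_cost(gb_cross_vpc: float, per_gb: float = 0.01) -> float:
    """Peering has no hourly charge; only data transfer is billed.
    $0.01/GB is an assumption; cross-AZ traffic is charged in each direction."""
    return gb_cross_vpc * per_gb

def tgw_monthly_cost(n_vpcs: int, gb_processed: float,
                     per_attach_hour: float = 0.05,
                     per_gb: float = 0.02) -> float:
    """One attachment per VPC plus per-GB data processing."""
    return n_vpcs * per_attach_hour * HOURS_PER_MONTH + gb_processed * per_gb

# 5 VPCs exchanging 1 TB (1024 GB) per month
print(f"Peering: ${peering_monthly_cost(1024):.2f}")  # ~$10
print(f"TGW:     ${tgw_monthly_cost(5, 1024):.2f}")   # ~$203
```

The gap is almost entirely the fixed attachment-hour charge, which is why peering stays cheaper in raw dollars even as TGW wins on manageability.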

Problem 5: “Peering between VPCs in different regions has high latency”

  • Why: Cross-region peering adds network distance. us-east-1 to eu-west-1 is ~80-100ms RTT.
  • Fix: This is expected behavior. Options:
    1. Accept the latency if it’s within tolerance
    2. Replicate data closer to users (multi-region architecture)
    3. Use CloudFront for static assets
    
    # Measure actual latency
    ping <instance-in-remote-region>
    # Expect 80-150ms for cross-continent
      
  • Quick test: Compare latency with same-region communication (should be <2ms)

Problem 6: “Transit Gateway route table has conflicting routes”

  • Why: TGW uses route tables just like VPCs. Longest prefix match wins, but misconfiguration can blackhole traffic.
  • Fix: Verify TGW route table:
    # List TGW route tables
    aws ec2 describe-transit-gateway-route-tables
    
    # Check routes
    aws ec2 search-transit-gateway-routes \
      --transit-gateway-route-table-id tgw-rtb-xxx \
      --filters "Name=state,Values=active"
    
    # Each VPC CIDR should point to its attachment
    # 10.0.0.0/16 → tgw-attach-vpc-a
    # 10.1.0.0/16 → tgw-attach-vpc-b
    
  • Quick test: Run traceroute from VPC-A to VPC-B; the TGW itself does not appear as a hop, so success is the reply arriving at all once the route tables are correct
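Longest-prefix-match behavior is easy to reason about offline. A small sketch of the lookup a TGW (or VPC) route table performs; the attachment IDs are hypothetical examples:

```python
# Sketch of longest-prefix-match route lookup, as a TGW (or VPC) route
# table applies it. Attachment IDs here are hypothetical examples.
import ipaddress

routes = {
    "10.0.0.0/16": "tgw-attach-vpc-a",
    "10.1.0.0/16": "tgw-attach-vpc-b",
    "10.1.5.0/24": "tgw-attach-inspection",  # more specific: wins for 10.1.5.x
}

def select_route(dst, table):
    ip = ipaddress.ip_address(dst)
    matches = [c for c in table if ip in ipaddress.ip_network(c)]
    if not matches:
        return None  # no route: the packet is dropped (blackholed)
    best = max(matches, key=lambda c: ipaddress.ip_network(c).prefixlen)
    return table[best]

print(select_route("10.1.5.9", routes))     # tgw-attach-inspection
print(select_route("10.1.9.9", routes))     # tgw-attach-vpc-b
print(select_route("192.168.1.1", routes))  # None
```

A destination with no matching route returning `None` is exactly the blackhole case described above.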

Problem 7: “Terraform destroy fails for peering connection”

  • Why: Dependencies not properly defined. Must delete routes before deleting peering connection.
  • Fix: Use depends_on in Terraform:
    resource "aws_vpc_peering_connection" "peer" {
      # ...
    }
    
    resource "aws_route" "to_peer" {
      # ...
      vpc_peering_connection_id = aws_vpc_peering_connection.peer.id
    
      # Explicit dependency
      depends_on = [aws_vpc_peering_connection.peer]
    }
    

    In practice, aws_route already references the peering connection’s ID, so Terraform usually infers the destroy order on its own; the explicit depends_on only matters when the dependency is not visible in the configuration.

  • Quick test: terraform destroy should complete without errors

Problem 8: “VPC peering works but can’t access VPC endpoints (S3, DynamoDB) via peering”

  • Why: VPC endpoints are local to a VPC. You can’t access VPC-A’s S3 endpoint from VPC-B via peering.
  • Fix: Create separate VPC endpoints in each VPC, or use public S3 endpoints:
    # Each VPC needs its own endpoints
    aws ec2 create-vpc-endpoint \
      --vpc-id vpc-a \
      --service-name com.amazonaws.us-east-1.s3 \
      --route-table-ids rtb-vpc-a
    
    aws ec2 create-vpc-endpoint \
      --vpc-id vpc-b \
      --service-name com.amazonaws.us-east-1.s3 \
      --route-table-ids rtb-vpc-b
    
  • Quick test: From VPC-B instance, try aws s3 ls (should work with local endpoint)

Project 5: NAT Gateway Deep Dive & Cost Optimizer

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Bash
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Cloud Networking / FinOps
  • Software or Tool: AWS NAT Gateway, VPC Endpoints
  • Main Book: “Cloud FinOps” by J.R. Storment

What you’ll build: A tool that analyzes NAT Gateway traffic, identifies cost-saving opportunities (like using VPC Endpoints for AWS services), and provides recommendations with projected savings.

Why it teaches AWS networking: NAT Gateway is often the #1 surprise cost in AWS bills. Understanding why traffic goes through NAT (vs. VPC Endpoints) teaches you about routing, endpoints, and cost optimization.

Core challenges you’ll face:

  • Analyzing NAT Gateway bytes (CloudWatch metrics, flow logs) → maps to traffic analysis
  • Identifying AWS service traffic (S3, DynamoDB, ECR, etc.) → maps to endpoint candidates
  • Calculating endpoint savings (NAT costs vs endpoint costs) → maps to FinOps
  • Route table impact (how endpoints change routing) → maps to routing precedence

Key Concepts:

Difficulty: Intermediate Time estimate: 1 week Prerequisites: AWS billing understanding, CloudWatch

Real world outcome:

$ ./nat-optimizer analyze --vpc vpc-0abc123

╔════════════════════════════════════════════════════════════════╗
║           NAT GATEWAY COST ANALYSIS (Last 30 days)             ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║  NAT GATEWAY SUMMARY                                           ║
║  ─────────────────────────────────────────────────────────────║
║  NAT Gateway: nat-0abc123 (us-east-1a)                         ║
║  Data Processed: 15.2 TB                                       ║
║  Current Cost: $683.10 ($0.045/GB processing)                  ║
║  Hourly Cost: $32.40 (720 hours × $0.045)                      ║
║  TOTAL: $715.50/month                                          ║
║                                                                ║
║  TRAFFIC BREAKDOWN (by destination)                            ║
║  ─────────────────────────────────────────────────────────────║
║  1. S3 (via the S3 prefix list):     8.5 TB (55.9%)            ║
║  2. ECR (Elastic Container Registry): 3.2 TB (21.1%)           ║
║  3. DynamoDB:                         1.8 TB (11.8%)           ║
║  4. Internet (non-AWS):               1.7 TB (11.2%)           ║
║                                                                ║
║  💰 SAVINGS OPPORTUNITIES                                      ║
║  ─────────────────────────────────────────────────────────────║
║                                                                ║
║  S3 Gateway Endpoint (FREE!)                                   ║
║    Current: 8.5 TB × $0.045 = $382.50                          ║
║    With Endpoint: $0.00 (gateway endpoints are free)           ║
║    SAVINGS: $382.50/month                                      ║
║                                                                ║
║  DynamoDB Gateway Endpoint (FREE!)                             ║
║    Current: 1.8 TB × $0.045 = $81.00                           ║
║    With Endpoint: $0.00                                        ║
║    SAVINGS: $81.00/month                                       ║
║                                                                ║
║  ECR Interface Endpoint                                        ║
║    Current: 3.2 TB × $0.045 = $144.00                          ║
║    Endpoint hourly: $7.30/month (1 AZ × $0.01/hr × 730)        ║
║    Endpoint data: 3.2 TB × $0.01 = $32.00                      ║
║    SAVINGS: $104.70/month                                      ║
║                                                                ║
║  ═══════════════════════════════════════════════════════════  ║
║  TOTAL POTENTIAL SAVINGS: $568.20/month (79.4% reduction!)     ║
║  REMAINING NAT COST: $147.30 (internet traffic only)           ║
║  ═══════════════════════════════════════════════════════════  ║
║                                                                ║
╚════════════════════════════════════════════════════════════════╝

$ ./nat-optimizer deploy-endpoints --vpc vpc-0abc123 --dry-run
Would create:
  - S3 Gateway Endpoint (vpce-gw-s3)
  - DynamoDB Gateway Endpoint (vpce-gw-ddb)
  - ECR Interface Endpoint (vpce-if-ecr-api, vpce-if-ecr-dkr)

Route table changes:
  - rtb-private-a: Add S3 prefix list → vpce-gw-s3
  - rtb-private-b: Add S3 prefix list → vpce-gw-s3
  - rtb-private-a: Add DynamoDB prefix list → vpce-gw-ddb
  - rtb-private-b: Add DynamoDB prefix list → vpce-gw-ddb

Implementation Hints: The key insight: Gateway Endpoints are FREE (for S3 and DynamoDB), while Interface Endpoints cost ~$7.30/month per AZ but are much cheaper than NAT Gateway for high-traffic services.

# Pseudo-code for traffic analysis
# (get_nat_gateways, query_flow_logs, is_*_ip are placeholder helpers;
#  cloudwatch is assumed to be a boto3 CloudWatch client)
from collections import defaultdict
from datetime import datetime, timedelta, timezone

def analyze_nat_traffic(vpc_id, days=30):
    traffic = defaultdict(int)  # bytes per category
    nat_gateways = get_nat_gateways(vpc_id)

    for nat in nat_gateways:
        # Total bytes processed (cross-check for the flow-log sums)
        metrics = cloudwatch.get_metric_statistics(
            Namespace='AWS/NATGateway',
            MetricName='BytesOutToDestination',
            Dimensions=[{'Name': 'NatGatewayId', 'Value': nat.id}],
            StartTime=datetime.now(timezone.utc) - timedelta(days=days),
            EndTime=datetime.now(timezone.utc),
            Period=86400,
            Statistics=['Sum']
        )

        # Analyze flow logs (e.g. via Athena) to determine destinations
        flows = query_flow_logs(f"""
            SELECT dstaddr, SUM(bytes) as total_bytes
            FROM flow_logs
            WHERE interface_id = '{nat.eni_id}'
            AND action = 'ACCEPT'
            GROUP BY dstaddr
            ORDER BY total_bytes DESC
        """)

        # Categorize by destination IP
        for flow in flows:
            if is_s3_ip(flow.dstaddr):
                traffic['s3'] += flow.total_bytes
            elif is_dynamodb_ip(flow.dstaddr):
                traffic['dynamodb'] += flow.total_bytes
            elif is_ecr_ip(flow.dstaddr):
                traffic['ecr'] += flow.total_bytes
            else:
                traffic['internet'] += flow.total_bytes

    return calculate_savings(traffic)
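The calculate_savings helper above is left undefined. A hedged sketch using the prices quoted in this project (NAT processing $0.045/GB; S3/DynamoDB gateway endpoints free; interface endpoints $0.01/GB data plus $7.30/month per AZ); category names match the pseudo-code:

```python
# Sketch of calculate_savings(): monthly dollars saved per traffic
# category if it moved from NAT Gateway to a VPC endpoint.
# Prices are the assumptions quoted in this section.
GB = 1e9  # flow logs count bytes; decimal GB matches the pricing math

def calculate_savings(traffic, interface_azs=1):
    """traffic: dict of category -> bytes/month through NAT."""
    savings = {}
    for service, nbytes in traffic.items():
        gb = nbytes / GB
        nat_cost = gb * 0.045
        if service in ("s3", "dynamodb"):   # gateway endpoints are free
            savings[service] = nat_cost
        elif service == "internet":         # no endpoint alternative exists
            savings[service] = 0.0
        else:                               # interface endpoint
            endpoint_cost = gb * 0.01 + interface_azs * 7.30
            savings[service] = max(0.0, nat_cost - endpoint_cost)
    return savings

result = calculate_savings({"s3": 8500e9, "ecr": 3200e9, "internet": 1700e9})
for service, dollars in result.items():
    print(f"{service}: ${round(dollars, 2)}/month")
```

With the sample traffic from the analysis output above, this reproduces the $382.50 (S3) and $104.70 (ECR, single AZ) figures.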

Learning milestones:

  1. Identify traffic patterns through NAT → You understand outbound flow
  2. Calculate endpoint savings correctly → You understand pricing
  3. Deploy endpoints without breaking traffic → You understand routing
  4. See cost reduction in AWS bill → You’ve made a real impact

Common Pitfalls & Debugging

Problem 1: “NAT Gateway costs $150/month but I only have 3 instances!”

  • Why: NAT Gateway charges for both time ($0.045/hour = $33/month) AND data transfer ($0.045/GB). If instances download Docker images, OS updates, or process large data, data charges dominate.
  • Fix: Analyze VPC Flow Logs to identify top talkers:
    # Find instances with highest outbound traffic through NAT.
    # Note: describe-flow-logs only lists flow-log configurations;
    # query the actual records with CloudWatch Logs Insights instead:
    aws logs start-query \
      --log-group-name <flow-log-group> \
      --start-time $(date -d '-1 day' +%s) --end-time $(date +%s) \
      --query-string 'stats sum(bytes) as total by srcAddr | sort total desc | limit 10'
    
    # Check which services can use VPC Endpoints instead
    # S3, DynamoDB, ECR, SQS, SNS all have Gateway/Interface endpoints
    
  • Quick test: Deploy S3 Gateway endpoint, watch NAT data transfer drop
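If you have raw flow-log records on hand, top talkers can also be computed directly. A minimal sketch assuming the default flow-log record format (field 4 = srcaddr, field 10 = bytes, field 13 = action); the sample records below are fabricated:

```python
# Minimal sketch: top talkers from raw VPC Flow Log records in the
# default format. In practice you would run this over records exported
# to S3/Athena or CloudWatch Logs. Sample lines are fabricated.
from collections import Counter

def top_talkers(lines, n=3):
    totals = Counter()
    for line in lines:
        f = line.split()
        if len(f) < 13 or f[12] != "ACCEPT":  # skip rejected/short records
            continue
        totals[f[3]] += int(f[9])  # srcaddr -> bytes
    return totals.most_common(n)

sample = [
    "2 123456789012 eni-abc 10.0.1.10 52.216.0.1 44321 443 6 10 900000 0 60 ACCEPT OK",
    "2 123456789012 eni-abc 10.0.1.11 52.216.0.1 44322 443 6 10 100000 0 60 ACCEPT OK",
    "2 123456789012 eni-abc 10.0.1.10 8.8.8.8 44323 53 17 2 5000 0 60 REJECT OK",
]
print(top_talkers(sample))  # [('10.0.1.10', 900000), ('10.0.1.11', 100000)]
```

The REJECT record is filtered out, so only accepted traffic counts toward the per-source totals.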

Problem 2: “VPC Endpoint for S3 created but instances still using NAT Gateway”

  • Why: Route table doesn’t have the endpoint route, or endpoint has wrong policy.
  • Fix: Verify endpoint configuration:
    # Check endpoint is in correct route table
    aws ec2 describe-route-tables --route-table-ids rtb-xxx
    # Should see: pl-xxx (S3 prefix list) -> vpce-xxx
    
    # Check endpoint policy allows traffic
    aws ec2 describe-vpc-endpoints --vpc-endpoint-ids vpce-xxx
    
  • Quick test: From instance: aws s3 ls --debug 2>&1 | grep endpoint (should show VPC endpoint URL)

Problem 3: “S3 Gateway Endpoint is free but costs went up!”

  • Why: Gateway endpoints for S3/DynamoDB are free, but Interface Endpoints for other services cost $0.01/hour ($7.30/month) per AZ plus data charges.
  • Fix: Understand the difference:
    Gateway Endpoints (FREE):
    • S3
    • DynamoDB
    
    Interface Endpoints ($7.30/month per AZ + data):
    • EC2, ECS, ECR, CloudWatch, SQS, SNS, Kinesis, etc.
    
    Cost calculation:
    • 3 AZs × $7.30 = $21.90/month per service
    • If you have 5 services, that’s $109.50/month
    • Still cheaper than NAT Gateway if data transfer is high
  • Quick test: Compare NAT data costs vs interface endpoint costs in Cost Explorer
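The comparison reduces to a breakeven formula: an interface endpoint wins once monthly traffic exceeds (AZs × $7.30) / ($0.045 - $0.01). A sketch under those assumed prices:

```python
# Breakeven sketch: interface endpoint vs NAT Gateway, using the prices
# assumed above (NAT $0.045/GB vs endpoint $0.01/GB + $7.30 per AZ-month).

def breakeven_gb(azs, nat_per_gb=0.045, ep_per_gb=0.01, ep_az_month=7.30):
    """GB/month above which an interface endpoint is cheaper than NAT."""
    return (azs * ep_az_month) / (nat_per_gb - ep_per_gb)

for azs in (1, 2, 3):
    print(f"{azs} AZ(s): endpoint cheaper above {breakeven_gb(azs):.0f} GB/month")
```

At three AZs the breakeven is only about 626 GB/month, which is why interface endpoints pay off quickly for busy services like ECR.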

Problem 4: “NAT Gateway in single AZ - what if it fails?”

  • Why: NAT Gateway is highly available within an AZ, but AZ failure means total outage for instances in other AZs using it.
  • Fix: Deploy one NAT Gateway per AZ for HA:
    resource "aws_nat_gateway" "per_az" {
      for_each = { "a" = "subnet-pub-a", "b" = "subnet-pub-b" }
    
      subnet_id     = each.value
      allocation_id = aws_eip.nat[each.key].id
    }
    
    # Each private subnet routes to NAT in same AZ
    resource "aws_route" "private_nat" {
      for_each = { "a" = "rtb-priv-a", "b" = "rtb-priv-b" }
    
      route_table_id         = each.value
      destination_cidr_block = "0.0.0.0/0"
      nat_gateway_id         = aws_nat_gateway.per_az[each.key].id
    }
    

    Cost: 2 NAT Gateways = $66/month + data (2x cost for HA)

  • Quick test: Simulate AZ failure, verify other AZ still has internet

Problem 5: “Switched to NAT Instance to save money but performance is terrible”

  • Why: A NAT Instance on a small type like t3.micro has limited CPU and burstable networking (“up to 5 Gbps” in bursts, with a much lower sustained baseline). NAT Gateway scales up to 100 Gbps automatically.
  • Fix: Right-size NAT Instance or use NAT Gateway for production:
    NAT Instance pros: Cheap ($3-10/month)
    NAT Instance cons: Single point of failure, limited bandwidth, requires patching
    
    NAT Gateway pros: Highly available, auto-scaling, managed
    NAT Gateway cons: Expensive ($33+/month per AZ)
    
    Decision: Use NAT Instance for dev/test, NAT Gateway for production
    
  • Quick test: Run iperf3 through NAT Instance vs NAT Gateway, measure throughput

Problem 6: “VPC Endpoint policy blocks some S3 buckets”

  • Why: Default endpoint policy might deny access to certain buckets or principals.
  • Fix: Update endpoint policy:
    {
      "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
          "arn:aws:s3:::my-bucket/*",
          "arn:aws:s3:::my-bucket"
        ]
      }]
    }
    

    Or use full access policy (less secure):

    {"Statement": [{"Effect": "Allow", "Principal": "*", "Action": "*", "Resource": "*"}]}
    
  • Quick test: aws s3 ls s3://blocked-bucket (should work after policy update)

Problem 7: “Interface Endpoint DNS not resolving”

  • Why: Private DNS must be enabled on the VPC endpoint for AWS service hostnames to resolve to private IPs.
  • Fix: Enable private DNS:
    aws ec2 modify-vpc-endpoint \
      --vpc-endpoint-id vpce-xxx \
      --private-dns-enabled
    
    # Verify DNS resolution
    nslookup sqs.us-east-1.amazonaws.com
    # Should return 10.0.x.x (private IP), not public AWS IP
    
  • Quick test: dig sqs.us-east-1.amazonaws.com from instance (should show VPC internal IP)
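The quick test can be scripted: resolve the service hostname and check whether the answer lands inside the VPC CIDR. A sketch; the CIDR and sample IPs below are assumptions, not values from a real account:

```python
# Sketch: verify private DNS is in effect by checking whether a resolved
# address falls inside the VPC CIDR. CIDR and sample IPs are assumptions.
import ipaddress
import socket

def resolves_privately(ip, vpc_cidr="10.0.0.0/16"):
    return ipaddress.ip_address(ip) in ipaddress.ip_network(vpc_cidr)

# On an instance inside the VPC you would resolve the real hostname:
#   ip = socket.gethostbyname("sqs.us-east-1.amazonaws.com")

print(resolves_privately("10.0.3.25"))   # True: the endpoint ENI answered
print(resolves_privately("52.94.12.7"))  # False: still the public AWS IP
```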

Problem 8: “Cost optimizer script shows $0 savings but NAT costs are high”

  • Why: Script might not account for services that don’t have VPC Endpoints (e.g., third-party APIs, software updates).
  • Fix: Analyze traffic comprehensively:
    # Check destinations in VPC Flow Logs
    def analyze_destinations(flow_logs):
        external_ips = {}
        for entry in flow_logs:
            if is_external_ip(entry['dstaddr']):
                external_ips[entry['dstaddr']] = external_ips.get(entry['dstaddr'], 0) + entry['bytes']
    
        # Identify top destinations
        top_destinations = sorted(external_ips.items(), key=lambda x: x[1], reverse=True)
    
        for ip, nbytes in top_destinations[:10]:  # avoid shadowing bytes()
            hostname = reverse_dns(ip)
            print(f"{ip} ({hostname}): {nbytes / 1e9:.2f} GB")
    
            # Check if AWS service has endpoint
            if is_aws_service(hostname):
                print("  → Consider VPC Endpoint")
    
  • Quick test: Run analysis, identify non-AWS destinations that justify NAT Gateway cost

Project 6: Multi-Account Network with AWS Organizations

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: AWS CDK, CloudFormation StackSets
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Enterprise Architecture / Networking
  • Software or Tool: AWS Organizations, RAM, Transit Gateway
  • Main Book: “AWS Certified Solutions Architect Professional Study Guide”

What you’ll build: A multi-account network architecture with a shared services VPC (containing NAT Gateways, VPC Endpoints), application VPCs in separate accounts, and Transit Gateway connecting everything—all shared via Resource Access Manager.

Why it teaches AWS networking: Enterprise AWS is multi-account. Understanding how to share networking resources across accounts, maintain isolation while enabling connectivity, and centralize common infrastructure is essential for cloud architects.

Core challenges you’ll face:

  • Resource Access Manager (sharing TGW, subnets across accounts) → maps to multi-account patterns
  • Centralized egress (shared NAT, single point of control) → maps to hub-and-spoke
  • Centralized VPC Endpoints (cost optimization, management) → maps to shared services
  • Route propagation (TGW route tables, blackhole routes) → maps to transit routing
  • Security boundaries (what CAN vs SHOULD communicate) → maps to network segmentation

Key Concepts:

Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Multi-account AWS setup, Terraform, Transit Gateway basics

Real world outcome:

MULTI-ACCOUNT NETWORK ARCHITECTURE

┌──────────────────────────────────────────────────────────────────┐
│                     MANAGEMENT ACCOUNT                            │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  AWS Organizations                                          │  │
│  │  Resource Access Manager (shares TGW, Subnets)              │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│                     NETWORK ACCOUNT                               │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  Shared Services VPC (10.0.0.0/16)                        │    │
│  │  ├── NAT Gateways (centralized egress)                    │    │
│  │  ├── VPC Endpoints (S3, DynamoDB, ECR, etc.)             │    │
│  │  └── Transit Gateway (hub)                                │    │
│  └──────────────────────────────────────────────────────────┘    │
│                              │                                    │
│              ┌───────────────┼───────────────┐                    │
│              ▼               ▼               ▼                    │
└──────────────────────────────────────────────────────────────────┘
               │               │               │
               ▼               ▼               ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│  DEV ACCOUNT    │ │  STAGING ACCOUNT│ │  PROD ACCOUNT   │
│ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │
│ │ App VPC     │ │ │ │ App VPC     │ │ │ │ App VPC     │ │
│ │ 10.1.0.0/16 │ │ │ │ 10.2.0.0/16 │ │ │ │ 10.3.0.0/16 │ │
│ │             │ │ │ │             │ │ │ │             │ │
│ │ TGW Attach  │ │ │ │ TGW Attach  │ │ │ │ TGW Attach  │ │
│ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │
└─────────────────┘ └─────────────────┘ └─────────────────┘

$ terraform apply

Network Account:
  Transit Gateway: tgw-0net123 (shared via RAM)
  Shared Services VPC: vpc-0shared
  NAT Gateway: nat-0central
  VPC Endpoints: 8 endpoints (S3, DynamoDB, ECR, SSM, etc.)

Dev Account:
  TGW Attachment: tgw-attach-dev (accepted from RAM share)
  App VPC: vpc-0dev (10.1.0.0/16)
  Route: 0.0.0.0/0 → TGW → Shared Services → NAT → Internet

Prod Account:
  TGW Attachment: tgw-attach-prod
  App VPC: vpc-0prod (10.3.0.0/16)
  ISOLATED: Cannot reach Dev VPC (TGW route table segmentation)

# Test connectivity
$ aws-vault exec dev -- ssh app-server
[dev-app]$ curl https://s3.amazonaws.com/mybucket/test.txt
OK (via centralized VPC endpoint - no NAT!)

[dev-app]$ ping 10.3.0.50  # Prod server
Request timed out (blocked by TGW route table - no route to prod)

Implementation Hints: Key architecture decisions:

  1. Centralized Egress: All outbound traffic from spoke VPCs routes through the shared services VPC. This gives you:
    • Single NAT Gateway bill (not one per VPC)
    • Central firewall/inspection point
    • Unified logging
  2. Route Table Segmentation: Use separate TGW route tables to control what can talk to what:
     TGW Route Table: “shared-services”
       • All VPC attachments associated
       • Routes to all VPCs (hub access)
     
     TGW Route Table: “dev”
       • Dev VPC attachment associated
       • Route to shared-services VPC: 10.0.0.0/16
       • Route to dev VPC: 10.1.0.0/16
       • NO route to prod VPC (isolation!)
     
     TGW Route Table: “prod”
       • Prod VPC attachment associated
       • Route to shared-services VPC: 10.0.0.0/16
       • Route to prod VPC: 10.3.0.0/16
       • NO route to dev VPC
  3. Centralized VPC Endpoints: Gateway endpoints (S3, DynamoDB) are VPC-local and cannot be reached through TGW, so centralize with Interface endpoints:
     # In network account
     resource "aws_vpc_endpoint" "s3" {
       vpc_id            = aws_vpc.shared_services.id
       service_name      = "com.amazonaws.${var.region}.s3"
       vpc_endpoint_type = "Interface"
       # (endpoint ENI subnets and security groups omitted for brevity)
       # Reachable from spoke VPCs via TGW, unlike a Gateway endpoint
     }
     
     Spoke VPCs route S3 traffic: VPC → TGW → Shared VPC → Endpoint
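The segmentation in hint 2 can be modeled offline to convince yourself that dev really cannot reach prod. A sketch using the route tables and attachment names from this section:

```python
# Model of the TGW route-table segmentation above: each attachment is
# associated with one route table, and a missing route means the packet
# is dropped. Table contents follow the text; this is a sketch.
import ipaddress

route_tables = {
    "dev":  {"10.0.0.0/16": "attach-shared", "10.1.0.0/16": "attach-dev"},
    "prod": {"10.0.0.0/16": "attach-shared", "10.3.0.0/16": "attach-prod"},
}
association = {"attach-dev": "dev", "attach-prod": "prod"}

def forward(src_attachment, dst_ip):
    table = route_tables[association[src_attachment]]
    ip = ipaddress.ip_address(dst_ip)
    matches = [c for c in table if ip in ipaddress.ip_network(c)]
    if not matches:
        return None  # isolated: no route to that CIDR
    best = max(matches, key=lambda c: ipaddress.ip_network(c).prefixlen)
    return table[best]

print(forward("attach-dev", "10.0.0.5"))   # attach-shared (hub reachable)
print(forward("attach-dev", "10.3.0.50"))  # None (dev cannot reach prod)
```

Isolation comes purely from omission: there is no deny rule, just no route from the dev table to 10.3.0.0/16.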


Learning milestones:

  1. Resources share correctly via RAM → You understand cross-account sharing
  2. Spoke VPCs reach internet via centralized NAT → You understand transit routing
  3. Dev cannot reach Prod → You understand TGW route table segmentation
  4. All AWS API calls use centralized endpoints → You understand endpoint routing

Common Pitfalls & Debugging

Problem 1: “RAM resource share invitation not appearing in spoke accounts”

  • Why: Invitations only appear if accounts are NOT in the same AWS Organization with sharing enabled. With Organizations + Resource Access Manager enabled, sharing is automatic.
  • Fix: Enable sharing within AWS Organizations:
    # In Management Account
    aws ram enable-sharing-with-aws-organization
    
    # Verify organization sharing is enabled
    aws organizations describe-organization
    
    No invitation needed - resources appear automatically in spoke accounts.

  • Quick test: Check spoke account for shared TGW without accepting invitation

Problem 2: “Transit Gateway shared but spoke VPC can’t attach to it”

  • Why: IAM permissions missing in spoke account, or TGW not allowing attachments from spoke account.
  • Fix: Grant permissions in spoke account:
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateTransitGatewayVpcAttachment",
        "ec2:DescribeTransitGateways",
        "ec2:DescribeTransitGatewayAttachments"
      ],
      "Resource": "*"
    }
    

    And verify TGW allows cross-account attachments (default: allowed)

  • Quick test: Try creating attachment from spoke account CLI

Problem 3: “Spoke VPCs can communicate with each other but not reach internet via egress VPC”

  • Why: Missing routes in Transit Gateway route table or egress VPC route table.
  • Fix: Set up routing correctly:
    # In TGW route table for spokes:
    # Default route points to egress VPC attachment
    aws ec2 create-transit-gateway-route \
      --transit-gateway-route-table-id tgw-rtb-spokes \
      --destination-cidr-block 0.0.0.0/0 \
      --transit-gateway-attachment-id tgw-attach-egress
    
    # In egress VPC route table:
    # Routes for spoke CIDRs point to TGW
    aws ec2 create-route \
      --route-table-id rtb-egress-private \
      --destination-cidr-block 10.1.0.0/16 \
      --transit-gateway-id tgw-xxx
    
  • Quick test: From spoke instance, curl ifconfig.me (should return egress VPC’s NAT Gateway IP)

Problem 4: “Dev account can still reach Prod account despite route table separation”

  • Why: Route table associations are wrong, or both use the same TGW route table.
  • Fix: Verify isolation:
    # Check which route table each attachment uses
    aws ec2 describe-transit-gateway-attachments
    
    # Dev attachment should use "dev-rtb" (no routes to prod)
    # Prod attachment should use "prod-rtb" (no routes to dev)
    
    # Associate attachments correctly
    aws ec2 associate-transit-gateway-route-table \
      --transit-gateway-route-table-id tgw-rtb-dev \
      --transit-gateway-attachment-id tgw-attach-dev
    
  • Quick test: From dev instance, try to ping prod instance IP (should timeout)

Problem 5: “Centralized VPC Endpoints not accessible from spoke VPCs”

  • Why: PrivateLink endpoints are VPC-local by default. You need to share via Transit Gateway + Route 53 Resolver.
  • Fix: Two options:

    Option A: Share DNS via Route 53 Resolver endpoints (recommended)

    # Create inbound resolver endpoint in egress VPC
    aws route53resolver create-resolver-endpoint \
      --name egress-inbound \
      --security-group-ids sg-xxx \
      --direction INBOUND \
      --ip-addresses SubnetId=subnet-egress-a
    
    # In spoke VPCs, forward AWS service queries to egress resolver
    # Configure Route 53 Resolver rules
    

    Option B: Create endpoints in each VPC (simpler but more expensive)

  • Quick test: From spoke instance, nslookup sqs.us-east-1.amazonaws.com (should resolve to egress VPC endpoint IP)

Problem 6: “Cross-account Transit Gateway attachment stuck in ‘pendingAcceptance’”

  • Why: TGW owner (network account) must accept attachment requests from spoke accounts.
  • Fix: Accept in network account:
    # List pending attachments (in network account)
    aws ec2 describe-transit-gateway-vpc-attachments \
      --filters "Name=state,Values=pendingAcceptance"
    
    # Accept the attachment
    aws ec2 accept-transit-gateway-vpc-attachment \
      --transit-gateway-attachment-id tgw-attach-xxx
    
  • Quick test: Check attachment state changes to “available”

Problem 7: “Terraform can’t create resources in spoke accounts”

  • Why: Missing cross-account IAM assume role configuration.
  • Fix: Set up provider aliasing:
    # In network account Terraform
    provider "aws" {
      alias  = "spoke-dev"
      region = "us-east-1"
    
      assume_role {
        role_arn = "arn:aws:iam::DEV-ACCOUNT-ID:role/OrganizationAccountAccessRole"
      }
    }
    
    # Use provider for spoke resources
    resource "aws_vpc" "spoke_dev" {
      provider = aws.spoke-dev
      cidr_block = "10.1.0.0/16"
    }
    
  • Quick test: terraform plan should show resources in multiple accounts

Problem 8: “CloudWatch Logs centralization not working”

  • Why: Missing permissions for cross-account log delivery, or Kinesis Data Firehose not configured.
  • Fix: Set up log aggregation:
    # In logging account, S3 bucket policy allows PutObject from spoke accounts
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "logs.amazonaws.com"
      },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::central-logs/*",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": ["SPOKE-ACCOUNT-1", "SPOKE-ACCOUNT-2"]
        }
      }
    }
    
  • Quick test: Generate log event in spoke account, verify it appears in central S3 bucket

Project 7: Site-to-Site VPN with Simulated On-Premises

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: CloudFormation, Manual (console + strongSwan)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Hybrid Networking / VPN
  • Software or Tool: AWS Site-to-Site VPN, strongSwan, Libreswan
  • Main Book: “Computer Networks, Fifth Edition” by Tanenbaum (Chapter on VPNs)

What you’ll build: A complete Site-to-Site VPN setup with an EC2 instance running strongSwan to simulate an on-premises data center, demonstrating IPsec tunnel establishment, BGP routing, and failover.

Why it teaches AWS networking: Hybrid connectivity is essential for enterprise AWS. By simulating the “on-premises” side with strongSwan, you’ll understand both ends of the VPN tunnel—something most engineers never see.

Core challenges you’ll face:

  • IPsec configuration (Phase 1, Phase 2, encryption algorithms) → maps to VPN protocols
  • BGP configuration (ASNs, route advertisement) → maps to dynamic routing
  • Tunnel redundancy (two tunnels, failover testing) → maps to high availability
  • NAT-Traversal (when CGW is behind NAT) → maps to real-world complexity
  • Troubleshooting tunnels (why isn’t it coming up?) → maps to practical debugging

Key Concepts:

Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: Basic networking, Linux administration

Real world outcome:

$ terraform apply

AWS Resources Created:
  VPN Gateway: vgw-0abc123 (attached to VPC)
  Customer Gateway: cgw-0def456 (simulated on-prem)
  VPN Connection: vpn-0ghi789 (2 tunnels)
    Tunnel 1: 169.254.10.1/30 (AWS) ↔ 169.254.10.2/30 (On-prem)
    Tunnel 2: 169.254.11.1/30 (AWS) ↔ 169.254.11.2/30 (On-prem)

On-Premises Simulation (EC2 in different VPC):
  strongSwan instance: i-onprem123
  Public IP: 54.x.x.x (Customer Gateway IP)
  Internal network: 192.168.0.0/16 (simulated corporate LAN)

# Check VPN status
$ aws ec2 describe-vpn-connections --vpn-connection-ids vpn-0ghi789 \
    --query 'VpnConnections[0].VgwTelemetry'

[
  {
    "OutsideIpAddress": "52.1.2.3",
    "Status": "UP",
    "StatusMessage": "2 BGP ROUTES",
    "AcceptedRouteCount": 2
  },
  {
    "OutsideIpAddress": "52.4.5.6",
    "Status": "UP",
    "StatusMessage": "2 BGP ROUTES",
    "AcceptedRouteCount": 2
  }
]

# Test connectivity
$ ssh on-prem-server
[on-prem]$ ping 10.0.1.50  # AWS private IP
PING 10.0.1.50: 64 bytes from 10.0.1.50: icmp_seq=1 ttl=64 time=15.2 ms
PING 10.0.1.50: 64 bytes from 10.0.1.50: icmp_seq=2 ttl=64 time=14.8 ms

[on-prem]$ traceroute 10.0.1.50
 1  192.168.1.1 (gateway)        0.5 ms
 2  169.254.10.1 (AWS tunnel)    15.1 ms  # Through IPsec tunnel!
 3  10.0.1.50 (AWS instance)     15.3 ms

# Failover test
$ ssh on-prem-server
[on-prem]$ sudo ipsec down aws-tunnel-1
[on-prem]$ ping 10.0.1.50
# Brief interruption, then traffic flows through tunnel-2
PING 10.0.1.50: 64 bytes from 10.0.1.50: icmp_seq=5 ttl=64 time=16.1 ms

Implementation Hints: The strongSwan configuration on the simulated on-prem server:

# /etc/ipsec.conf (simplified)
conn aws-tunnel-1
    auto=start
    type=tunnel
    authby=secret
    left=%defaultroute
    leftid=54.x.x.x              # On-prem public IP
    leftsubnet=192.168.0.0/16    # On-prem network
    right=52.1.2.3               # AWS VPN endpoint 1
    rightsubnet=10.0.0.0/16      # AWS VPC CIDR
    ike=aes256-sha256-modp2048
    esp=aes256-sha256
    keyexchange=ikev2
    ikelifetime=8h
    lifetime=1h
    dpdaction=restart
    dpddelay=10s

conn aws-tunnel-2
    # Similar config for second tunnel
    right=52.4.5.6               # AWS VPN endpoint 2

BGP configuration for dynamic routing:

# Using quagga/FRR for BGP
router bgp 65001
  neighbor 169.254.10.1 remote-as 64512  # AWS ASN
  neighbor 169.254.11.1 remote-as 64512
  network 192.168.0.0/16                  # Advertise on-prem network

Key troubleshooting steps:

# Check IPsec status
sudo ipsec statusall

# Check BGP sessions
sudo vtysh -c "show ip bgp summary"

# Check routes learned from AWS
ip route show | grep 10.0

# Debug IPsec
sudo ipsec stroke loglevel ike 3
sudo tail -f /var/log/syslog | grep charon

Learning milestones:

  1. Both tunnels establish → You understand IPsec negotiation
  2. BGP routes are exchanged → You understand dynamic routing
  3. Traffic flows through VPN → You can verify connectivity
  4. Failover works when one tunnel dies → You understand redundancy

Common Pitfalls & Debugging

Problem 1: “VPN tunnel status shows ‘DOWN’ in AWS console”

  • Why: Phase 1 or Phase 2 IPsec negotiation failed. Mismatched encryption settings, wrong pre-shared key, or firewall blocking UDP 500/4500.
  • Fix: Debug systematically:
    # On strongSwan side, check logs
    sudo journalctl -u strongswan -n 100
    
    # Common errors:
    # "no matching proposal found" = encryption mismatch
    # "authentication failed" = wrong pre-shared key
    # "INVALID_ID_INFORMATION" = identity mismatch
    
    # Verify AWS VPN configuration matches strongSwan
    aws ec2 describe-vpn-connections --vpn-connection-ids vpn-xxx
    
  • Quick test: sudo ipsec status (should show “ESTABLISHED”)

Problem 2: “VPN tunnel UP but no traffic flows”

  • Why: Security Groups, NACLs, or on-prem firewall blocking traffic. Or routing not configured correctly.
  • Fix: Trace the path:
    # From on-prem instance
    ping 10.0.1.50  # AWS instance private IP
    traceroute 10.0.1.50
    
    # Check Security Group on AWS side allows ICMP from on-prem CIDR
    aws ec2 describe-security-groups --group-ids sg-xxx
    
    # Check route propagation is enabled
    aws ec2 describe-route-tables --route-table-ids rtb-xxx
    # Should see VPN routes automatically propagated
    
  • Quick test: Enable ICMP in SG, try ping again

Problem 3: “BGP not establishing, stuck in ‘Idle’ or ‘Active’ state”

  • Why: TCP port 179 blocked, wrong BGP ASN configuration, or IP addressing mismatch.
  • Fix: Verify BGP configuration:
    # On strongSwan side (if using BIRD or FRR for BGP)
    sudo birdc show protocols
    # Should show "Established" state
    
    # Check BGP configuration matches AWS
    # AWS provides inside tunnel IPs: 169.254.x.x/30
    # Your router must use the correct IP from that range
    
    # Test TCP 179 connectivity
    telnet 169.254.10.1 179  # AWS BGP endpoint
    
  • Quick test: sudo birdc show route (should show AWS routes)
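If you go the BIRD route, the matching peer configuration is short. A minimal BIRD 2 sketch with illustrative values — the 169.254.10.1/.2 pair stands in for the AWS-assigned /30, 65000 for your customer gateway ASN, and 64512 for AWS's default virtual private gateway ASN:

```
# /etc/bird.conf -- values are illustrative; take the real inside-tunnel
# IPs and ASNs from your downloaded VPN configuration
router id 192.168.1.10;

protocol bgp aws_tunnel1 {
  local 169.254.10.2 as 65000;      # your side of tunnel 1 (customer gateway ASN)
  neighbor 169.254.10.1 as 64512;   # AWS side (virtual private gateway ASN)
  hold time 30;
  ipv4 {
    import all;                            # learn VPC routes from AWS
    export where net = 192.168.0.0/16;     # advertise only the on-prem prefix
  };
}
```

Repeat the protocol block for the second tunnel; `sudo birdc show protocols` should then report both sessions as Established.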

Problem 4: “VPN works but latency is terrible (200ms+)”

  • Why: Traffic going through wrong tunnel, or using far-away AWS VPN endpoint.
  • Fix: Optimize routing:
    # AWS VPN endpoints are region-specific
    # Create VPN connection in region closest to your location
    
    # For US East Coast on-prem: use us-east-1
    # For EU on-prem: use eu-west-1
    # For Asia: use ap-southeast-1
    
    # Check latency to both tunnels
    ping 169.254.10.1  # Tunnel 1
    ping 169.254.10.5  # Tunnel 2
    # Use tunnel with lowest latency as primary via BGP weight
    
  • Quick test: Measure latency before and after VPN (should add ~10-30ms for same region)

Problem 5: “VPN tunnel keeps disconnecting every few hours”

  • Why: Dead Peer Detection (DPD) timeout misconfiguration, or NAT gateway in path causing state table expiration.
  • Fix: Adjust DPD and keepalives:
    # In strongSwan ipsec.conf
    dpddelay=10s
    dpdtimeout=30s
    dpdaction=restart
    
    # Ensure keepalives are enabled
    # IPsec uses IKE keepalives to maintain tunnel
    
  • Quick test: Leave tunnel up overnight, check in morning (should still be UP)

Problem 6: “Can’t simulate on-premises environment, don’t have second AWS account”

  • Why: You think you need expensive hardware or second AWS account.
  • Fix: Use a VPC in same account:
    # Create "on-prem" VPC: 192.168.0.0/16
    # Create "AWS" VPC: 10.0.0.0/16
    # Deploy strongSwan EC2 in "on-prem" VPC
    # Create VPN connection between them
    
    # This simulates hybrid connectivity cheaply
    # You'll learn the same concepts
    
  • Quick test: Cost stays modest — a t3.micro (~$8/month) plus the managed VPN connection ($0.05/hour, ~$36/month); running strongSwan on both ends instead of a managed VPN keeps the whole lab under $20/month
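The two-VPC simulation can be stood up with a few Terraform resources. A sketch with illustrative names — the `aws_eip.strongswan` reference assumes an Elastic IP you attach to the strongSwan instance elsewhere in the configuration:

```hcl
# Hypothetical sketch: two VPCs in one account simulate hybrid connectivity
resource "aws_vpc" "onprem" {
  cidr_block = "192.168.0.0/16"   # plays the role of the datacenter
  tags       = { Name = "sim-onprem" }
}

resource "aws_vpc" "cloud" {
  cidr_block = "10.0.0.0/16"      # the "real" AWS side
  tags       = { Name = "sim-aws" }
}

# The strongSwan EC2 instance in the "on-prem" VPC acts as the customer
# gateway device; its Elastic IP becomes the customer gateway address
resource "aws_customer_gateway" "sim" {
  bgp_asn    = 65000
  ip_address = aws_eip.strongswan.public_ip   # assumes an aws_eip defined elsewhere
  type       = "ipsec.1"
}
```

From here, attach a Virtual Private Gateway to the "AWS" VPC and create the Site-to-Site VPN connection against this customer gateway.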

Problem 7: “VPN configuration download doesn’t work with my router”

  • Why: AWS provides configs for Cisco, Juniper, Palo Alto, but not all vendors.
  • Fix: Use generic strongSwan/Libreswan config:
    # Fetch the generic configuration XML (there is no dedicated download
    # command in the CLI; the config is embedded in the connection object)
    aws ec2 describe-vpn-connections \
      --vpn-connection-ids vpn-xxx \
      --query 'VpnConnections[0].CustomerGatewayConfiguration' \
      --output text > vpn-config.xml
    
    # Extract pre-shared keys and tunnel IPs
    grep -A5 "ike" vpn-config.xml
    grep -A5 "ipsec" vpn-config.xml
    
    # Manually configure your device
    
  • Quick test: Compare working config with your setup, find differences

Problem 8: “VPN tunnel fails during high traffic”

  • Why: MTU/MSS issues causing fragmentation. VPN adds IPsec overhead (50-60 bytes).
  • Fix: Adjust MTU and TCP MSS clamping:
    # On strongSwan instance
    sudo ip link set dev eth0 mtu 1400  # Reduce from 1500
    
    # Enable TCP MSS clamping
    sudo iptables -t mangle -A FORWARD \
      -p tcp --tcp-flags SYN,RST SYN \
      -j TCPMSS --set-mss 1360
    
    # Test with large packets (-s sets the ICMP payload; add 28 bytes of IP+ICMP headers)
    ping -M do -s 1372 10.0.1.50  # 1372 + 28 = 1400, should work
    ping -M do -s 1472 10.0.1.50  # 1472 + 28 = 1500, should fail through the tunnel
    
  • Quick test: Transfer large file through VPN, verify no packet loss
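The MTU and MSS values above come from simple header arithmetic. A small script shows where 1400 and 1360 come from — note the IPsec overhead figure is a typical value, not exact (it varies with cipher and encapsulation, roughly 50-80 bytes):

```shell
# Why MTU 1400 / MSS 1360: work backwards from the physical 1500-byte MTU.
PHYS_MTU=1500
IPSEC_OVERHEAD=73     # typical ESP+NAT-T overhead; varies by cipher
IP_HEADER=20
TCP_HEADER=20

TUNNEL_MTU=$((PHYS_MTU - IPSEC_OVERHEAD))   # largest inner packet avoiding fragmentation
SAFE_MTU=1400                               # rounded down for safety margin
MSS=$((SAFE_MTU - IP_HEADER - TCP_HEADER))  # TCP payload per segment after headers

echo "tunnel MTU: $TUNNEL_MTU, interface MTU: $SAFE_MTU, clamped MSS: $MSS"
```

This is why the iptables rule above clamps MSS to 1360 when the interface MTU is 1400.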

Project 8: AWS Network Firewall Deployment

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: AWS CDK, CloudFormation
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Network Security
  • Software or Tool: AWS Network Firewall
  • Main Book: “Network Security Monitoring” by Richard Bejtlich

What you’ll build: A network inspection architecture using AWS Network Firewall with domain filtering, IDS/IPS rules, and logging—positioned to inspect all egress traffic.

Why it teaches AWS networking: AWS Network Firewall provides VPC-perimeter inspection that Security Groups and NACLs can’t do. Understanding where to place it, how to route traffic through it, and how to write rules teaches deep network security.

Core challenges you’ll face:

  • Firewall subnet placement (dedicated subnets, routing) → maps to inspection architecture
  • Routing to the firewall (ingress routing, gateway route tables) → maps to traffic flow design
  • Rule group design (stateful vs stateless, domain filtering) → maps to firewall policy
  • Logging and alerting (what to log, where to send) → maps to security monitoring
  • Understanding Suricata rules (IDS/IPS syntax) → maps to threat detection

Key Concepts:

Difficulty: Advanced Time estimate: 2 weeks Prerequisites: VPC routing, basic firewall concepts

Real world outcome:

$ terraform apply

Resources Created:
  Network Firewall: nfw-0abc123
  Firewall Policy: policy-egress-inspection
  Rule Groups:
    - Stateless: Allow/Drop by IP/Port
    - Stateful: Domain filtering, IDS/IPS

Architecture:
  ┌─────────────────────────────────────────────────────────┐
  │                        VPC                               │
  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
  │  │ Private     │    │ Firewall    │    │ Public      │  │
  │  │ Subnet      │───▶│ Subnet      │───▶│ Subnet      │  │
  │  │ (App)       │    │ (NFW)       │    │ (NAT)       │  │
  │  └─────────────┘    └─────────────┘    └─────────────┘  │
  │       10.0.1.0/24      10.0.100.0/24      10.0.0.0/24   │
  │                             │                     │      │
  │                       ┌─────┴─────┐         ┌────┴────┐ │
  │                       │  Network  │         │   NAT   │ │
  │                       │  Firewall │         │ Gateway │ │
  │                       └───────────┘         └─────────┘ │
  └─────────────────────────────────────────────────────────┘

# Test domain filtering
$ ssh app-server
[app]$ curl https://example.com
HTTP/1.1 200 OK  # Allowed

[app]$ curl https://malware-c2-server.evil
curl: (28) Connection timed out  # dropped (BLOCKED) by Network Firewall!

# Check firewall logs
$ aws logs filter-log-events --log-group-name /aws/network-firewall/alert

{
  "event_type": "alert",
  "alert": {
    "action": "blocked",
    "signature": "ET MALWARE Known C2 Domain",
    "src_ip": "10.0.1.50",
    "dest_ip": "185.143.x.x",
    "dest_port": 443
  }
}

# Firewall metrics
$ ./nfw-monitor

╔════════════════════════════════════════════════════════════╗
║         NETWORK FIREWALL METRICS (Last Hour)               ║
╠════════════════════════════════════════════════════════════╣
║                                                            ║
║  Traffic Processed: 2.3 GB                                 ║
║  Packets Inspected: 4,567,890                              ║
║                                                            ║
║  Actions:                                                  ║
║    ✓ Passed: 4,560,123 (99.8%)                             ║
║    ✗ Dropped: 7,767 (0.2%)                                 ║
║                                                            ║
║  Top Blocked:                                              ║
║    1. Malware C2 domains: 234 connections                  ║
║    2. Crypto mining pools: 156 connections                 ║
║    3. Known bad IPs: 89 connections                        ║
║                                                            ║
║  IDS/IPS Alerts: 47                                        ║
║    - SQL injection attempts: 12                            ║
║    - SSH brute force: 35                                   ║
║                                                            ║
╚════════════════════════════════════════════════════════════╝

Implementation Hints: The critical routing for egress inspection:

# Traffic flow: App → NFW → NAT → Internet
# Return flow: Internet → NAT → NFW → App

# Private subnet route table (where apps live)
resource "aws_route" "private_to_firewall" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  vpc_endpoint_id        = aws_networkfirewall_firewall.main.firewall_status[0].sync_states[0].attachment[0].endpoint_id
}

# Firewall subnet route table
resource "aws_route" "firewall_to_nat" {
  route_table_id         = aws_route_table.firewall.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.main.id
}

# NAT Gateway subnet needs INGRESS routing back through firewall
resource "aws_route" "nat_to_firewall" {
  route_table_id         = aws_route_table.public.id
  destination_cidr_block = "10.0.1.0/24"  # Private subnet
  vpc_endpoint_id        = aws_networkfirewall_firewall.main.firewall_status[0].sync_states[0].attachment[0].endpoint_id
}

Example firewall rules:

# Domain filtering (block malware C2)
resource "aws_networkfirewall_rule_group" "domain_block" {
  capacity = 100
  name     = "block-malicious-domains"
  type     = "STATEFUL"

  rule_group {
    rule_variables {
      ip_sets {
        key = "HOME_NET"
        ip_set { definition = ["10.0.0.0/16"] }
      }
    }
    rules_source {
      rules_source_list {
        generated_rules_type = "DENYLIST"
        target_types         = ["TLS_SNI", "HTTP_HOST"]
        targets              = [".evil.com", ".malware.net", ".c2server.xyz"]
      }
    }
  }
}

# IDS/IPS with Suricata rules
resource "aws_networkfirewall_rule_group" "ids" {
  capacity = 500
  name     = "ids-rules"
  type     = "STATEFUL"

  rule_group {
    rules_source {
      rules_string = <<EOF
alert tcp any any -> any 22 (msg:"SSH brute force attempt"; flow:to_server; threshold:type both,track by_src,count 5,seconds 60; sid:1000001; rev:1;)
drop tcp any any -> any any (msg:"SQL injection"; content:"UNION SELECT"; nocase; sid:1000002; rev:1;)
EOF
    }
  }
}
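Before embedding Suricata rules in a rule group, it helps to validate the syntax. A sketch assuming a local Suricata install — AWS's stateful engine is Suricata-compatible, so rules that fail to parse locally will also be rejected at rule-group creation:

```shell
# Write candidate rules to a file and test-parse them with a local Suricata
cat > /tmp/candidate.rules <<'EOF'
alert tcp any any -> any 22 (msg:"SSH brute force attempt"; flow:to_server; threshold:type both,track by_src,count 5,seconds 60; sid:1000001; rev:1;)
EOF

# -T = test/validate only (no capture), -S = load exactly this rule file
if command -v suricata >/dev/null 2>&1; then
  suricata -T -S /tmp/candidate.rules && echo "rules parsed cleanly"
else
  echo "suricata not installed; skipping parse check"
fi
```

A failed parse prints the offending rule and line, which is far faster feedback than a Terraform apply error.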

Learning milestones:

  1. Firewall inspects all egress traffic → You understand routing through NFW
  2. Domain blocking works → You understand stateful rules
  3. IDS alerts fire correctly → You understand Suricata
  4. Return traffic routes correctly → You understand symmetric routing

Common Pitfalls & Debugging

Problem 1: “Network Firewall deployed but traffic still reaches blocked domains”

  • Why: Route tables not updated to send traffic through firewall endpoints, or firewall rule order is wrong.
  • Fix: Verify traffic flow:
    # Check route table sends traffic to firewall endpoint
    aws ec2 describe-route-tables --route-table-ids rtb-xxx
    # Should see: 0.0.0.0/0 → vpce-xxx (firewall endpoint)
    
    # Check firewall stateful rules (order matters!)
    aws network-firewall describe-rule-group --rule-group-arn arn:xxx
    # Stateful rules are evaluated in order
    # Make sure BLOCK rules come before broader ALLOW rules
    
  • Quick test: Try accessing blocked domain from instance (should be dropped)

Problem 2: “Network Firewall costs $300/month! Is this normal?”

  • Why: Network Firewall charges $0.395/hour (~$290/month) per firewall endpoint, plus $0.065/GB processed. Multi-AZ deployments need an endpoint per AZ, so the base cost multiplies, and high-traffic workloads add up fast.
  • Fix: Understand the cost model:
    Network Firewall costs:
    • $0.395/hour ≈ $290/month (base, per firewall endpoint)
    • $0.065/GB processed

    Example monthly cost for 100 GB/day:
    • Base: $290
    • Data: 3,000 GB × $0.065 = $195
    • Total: $485/month

    Alternatives for lower budgets:
    • Security Groups + NACLs (free, limited features)
    • Third-party firewall on EC2 (Palo Alto, Fortinet)
    • NAT Gateway + DNS filtering (Cloudflare Gateway)
  • Quick test: Use the AWS Pricing Calculator to estimate your usage
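To make the cost model concrete, a quick script with the rates quoted above hard-coded (verify current regional pricing before budgeting — rates differ slightly by region):

```shell
# Rough monthly cost model for one Network Firewall endpoint.
GB_PER_DAY=100
HOURS_PER_MONTH=730

BASE=$(awk "BEGIN { printf \"%.2f\", 0.395 * $HOURS_PER_MONTH }")   # endpoint-hours
DATA=$(awk "BEGIN { printf \"%.2f\", $GB_PER_DAY * 30 * 0.065 }")   # GB processed
TOTAL=$(awk "BEGIN { printf \"%.2f\", $BASE + $DATA }")

echo "base: \$$BASE  data: \$$DATA  total: \$$TOTAL per month"
```

For 100 GB/day this prints a total around $483/month — the same ballpark as the figures above. Remember to multiply the base by the number of AZ endpoints.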

Problem 3: “Firewall rules not matching traffic as expected”

  • Why: Stateful vs stateless rule confusion, or Suricata syntax errors.
  • Fix: Understand rule evaluation:
    Stateless rules (evaluated FIRST):
    • Actions: Pass, Drop, or Forward to the stateful engine
    • Match on the 5-tuple (src/dst IP, src/dst port, protocol)
    • Fast but limited

    Stateful rules (evaluated SECOND):
    • Domain filtering, intrusion detection
    • Use Suricata syntax
    • More powerful but slower

    Common Suricata rule:
    drop http any any -> any any (msg:"Block malware domain"; http.host; content:"malicious.com"; sid:1;)

  • Quick test: Enable flow logs, verify packets hit expected rule

Problem 4: “Asymmetric routing breaking Network Firewall”

  • Why: Network Firewall requires symmetric routing (same firewall sees both request and response).
  • Fix: Design network correctly:
    # Inspection VPC with firewall in each AZ
    # Each AZ's traffic stays in same AZ
    
    AZ-A:
      Public Subnet → Firewall Endpoint (AZ-A) → Protected Subnet (AZ-A)
    
    AZ-B:
      Public Subnet → Firewall Endpoint (AZ-B) → Protected Subnet (AZ-B)
    
    # Use Gateway Load Balancer if you need centralized firewall
    
  • Quick test: Check VPC Flow Logs for firewall endpoint IDs in both directions

Problem 5: “Domain-based filtering not working for HTTPS traffic”

  • Why: TLS encryption hides the HTTP Host header from the firewall; once the handshake completes, Network Firewall sees only encrypted payload.
  • Fix: Use SNI inspection (Server Name Indication):
    # Suricata rule for HTTPS SNI filtering
    drop tls any any -> any any (tls.sni; content:"blocked-domain.com"; sid:1;)
    
    # SNI is unencrypted during TLS handshake
    # Firewall can see it even though payload is encrypted
    
    # However, TLS 1.3 with Encrypted Client Hello (ECH, the successor to ESNI) defeats this
    # In that case, block by destination IP ranges instead
    
  • Quick test: curl https://example.com through firewall, check logs for SNI match

Problem 6: “IDS alerts flooding CloudWatch Logs with false positives”

  • Why: Default IDS rules are too sensitive for your environment.
  • Fix: Tune IDS rules:
    # Disable noisy rules
    # Use managed rule groups but override specific rules
    
    # In Terraform
    resource "aws_networkfirewall_rule_group" "example" {
      capacity = 50
      name     = "tuned-alerts"
      type     = "STATEFUL"

      rule_group {
        rules_source {
          stateful_rule {
            action = "ALERT"  # Don't drop, just alert
            header {
              # All header fields are required by the provider schema
              destination      = "ANY"
              destination_port = "ANY"
              direction        = "ANY"
              protocol         = "TCP"
              source           = "ANY"
              source_port      = "ANY"
            }
            rule_option {
              keyword  = "sid"
              settings = ["1"]
            }
          }
        }
      }
    }
    
  • Quick test: Monitor CloudWatch Logs, adjust rules until alerts are actionable

Problem 7: “Firewall deployment breaks existing VPC connectivity”

  • Why: Route tables changed without planning, traffic blackholed.
  • Fix: Implement firewall gradually:
    # Phase 1: Deploy firewall but don't route traffic through it yet
    # Test firewall rules with synthetic traffic
    
    # Phase 2: Route one subnet's traffic through firewall
    # Monitor for issues
    
    # Phase 3: Gradually migrate all subnets
    # Have rollback plan ready (restore old route table)
    
    # Rollback command:
    aws ec2 replace-route \
      --route-table-id rtb-xxx \
      --destination-cidr-block 0.0.0.0/0 \
      --gateway-id igw-xxx  # Back to IGW
    
  • Quick test: Keep old route table, swap atomically to minimize downtime

Problem 8: “TLS inspection causing certificate errors”

  • Why: Network Firewall does not decrypt traffic unless you attach a TLS inspection configuration — an optional feature that uses a certificate from AWS Certificate Manager to decrypt and re-encrypt flows. Without it, the firewall sees only the unencrypted parts of the handshake.
  • Fix: Understand the limitations:
    Without TLS inspection, Network Firewall CAN:
    • Inspect the SNI (server name) in the TLS handshake
    • Block based on destination IP/port
    • Apply IDS rules to unencrypted traffic

    Without TLS inspection, Network Firewall CANNOT:
    • Decrypt the HTTPS payload
    • Inspect encrypted application data
    • Act as a TLS termination proxy

    For full TLS inspection, consider:
    • A Network Firewall TLS inspection configuration (adds cost and latency)
    • Application Load Balancer (terminates TLS for HTTP/S)
    • Third-party firewall (Palo Alto, Fortinet)
  • Quick test: Don't expect DPI (Deep Packet Inspection) on encrypted traffic unless decryption is explicitly configured

Project 9: Global Accelerator & CloudFront Edge Networking

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: AWS CDK, CloudFormation
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: CDN / Edge Networking
  • Software or Tool: AWS Global Accelerator, CloudFront
  • Main Book: “High Performance Browser Networking” by Ilya Grigorik

What you’ll build: A global application with CloudFront for static content, Global Accelerator for dynamic content, and latency-based routing—with performance benchmarks showing the improvement.

Why it teaches AWS networking: Understanding how traffic enters AWS at the edge, traverses the AWS backbone, and reaches your origin teaches you about Anycast, PoPs, and why edge networking matters for performance.

Core challenges you’ll face:

  • CloudFront distribution setup (origins, behaviors, caching) → maps to CDN architecture
  • Global Accelerator configuration (endpoints, health checks) → maps to Anycast networking
  • Understanding when to use which (static vs dynamic, TCP vs HTTP) → maps to architectural decisions
  • Measuring performance improvement (latency from different regions) → maps to performance testing

Key Concepts:

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Basic DNS, HTTP understanding

Real world outcome:

$ terraform apply

CloudFront Distribution: d123abc.cloudfront.net
  Origin: alb-us-east-1.example.com
  Behaviors:
    /static/* → S3 origin (cached 1 year)
    /api/*    → ALB origin (no cache)
    /*        → ALB origin (cache 5 min)

Global Accelerator: a1b2c3d4.awsglobalaccelerator.com
  Listener: TCP 443
  Endpoint Groups:
    - us-east-1: alb-east (weight 100)
    - eu-west-1: alb-west (weight 100)
  Health Checks: /health every 10s

# Performance comparison from different locations
$ ./edge-benchmark

╔════════════════════════════════════════════════════════════════╗
║              EDGE NETWORKING PERFORMANCE COMPARISON            ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║  TEST: HTTPS request to API endpoint                           ║
║  Origin: us-east-1 (N. Virginia)                               ║
║                                                                ║
║  FROM NEW YORK (same region as origin)                         ║
║  ─────────────────────────────────────────────────────────────║
║  Direct to ALB:       45ms                                     ║
║  Via CloudFront:      42ms (-7%)                               ║
║  Via Global Accel:    38ms (-16%)                              ║
║                                                                ║
║  FROM LONDON (eu-west-1)                                       ║
║  ─────────────────────────────────────────────────────────────║
║  Direct to ALB:       125ms (transatlantic internet)           ║
║  Via CloudFront:      48ms (-62%) ← enters AWS at London PoP  ║
║  Via Global Accel:    44ms (-65%) ← uses AWS backbone         ║
║                                                                ║
║  FROM TOKYO (ap-northeast-1)                                   ║
║  ─────────────────────────────────────────────────────────────║
║  Direct to ALB:       210ms (Pacific + US internet)            ║
║  Via CloudFront:      65ms (-69%) ← Tokyo PoP + AWS backbone  ║
║  Via Global Accel:    58ms (-72%) ← Anycast to nearest PoP    ║
║                                                                ║
║  STATIC CONTENT (S3 via CloudFront)                            ║
║  ─────────────────────────────────────────────────────────────║
║  First request:       85ms (cache miss, fetch from S3)         ║
║  Cached requests:     12ms (served from edge)                  ║
║                                                                ║
╚════════════════════════════════════════════════════════════════╝

WHY THE IMPROVEMENT?
1. Traffic enters AWS at nearest edge location (400+ PoPs)
2. Traverses AWS private backbone (not public internet)
3. CloudFront caches responses at edge
4. Global Accelerator uses Anycast for optimal routing
5. TCP optimization (connection reuse, fast retransmits)

Implementation Hints: Key architectural understanding:

WITHOUT EDGE SERVICES:
  User (Tokyo) → Internet → Origin (us-east-1)
  - Multiple ISP hops, congested peering points
  - TCP cold start on every connection
  - ~200ms latency

WITH CLOUDFRONT:
  User (Tokyo) → Tokyo PoP (5ms) → AWS Backbone → Origin
  - Enters AWS at nearest Point of Presence
  - Uses optimized AWS backbone
  - Connection pooling to origin
  - Caching at edge
  - ~65ms latency

WITH GLOBAL ACCELERATOR:
  User (Tokyo) → Anycast → Nearest PoP → AWS Backbone → Origin
  - Anycast IP routes to geographically closest PoP
  - Static IPs (good for whitelisting)
  - TCP/UDP optimization
  - Health-checked failover
  - ~58ms latency

When to use which:

CLOUDFRONT:
  - HTTP/HTTPS traffic
  - Cacheable content
  - Complex routing rules (path-based, header-based)
  - Lambda@Edge for edge compute

GLOBAL ACCELERATOR:
  - TCP/UDP traffic (not just HTTP)
  - Non-cacheable dynamic content
  - Need static Anycast IPs
  - Multi-region active-active with health checks
  - Gaming, VoIP, financial applications

Learning milestones:

  1. CloudFront serves static content from edge → You understand CDN caching
  2. API latency improves for distant users → You understand backbone routing
  3. Global Accelerator failover works → You understand health-checked routing
  4. You can explain when to use each → You can make architectural decisions

Common Pitfalls & Debugging

Problem 1: “CloudFront distribution created but still serves stale content”

  • Why: CloudFront caches content at edge locations. Default TTL is 24 hours.
  • Fix: Invalidate cache or adjust TTL:
    # Create invalidation for specific paths
    aws cloudfront create-invalidation \
      --distribution-id E123456 \
      --paths "/index.html" "/static/*"
    
    # Or adjust Cache-Control headers in origin
    Cache-Control: max-age=3600  # 1 hour TTL
    
    # For dynamic content, disable caching
    Cache-Control: no-cache, no-store, must-revalidate
    
  • Quick test: Check response headers for X-Cache: Hit from cloudfront vs Miss

Problem 2: “Global Accelerator shows no performance improvement”

  • Why: Your users are already close to your AWS region, or you’re testing from same location as origin.
  • Fix: Test from distant locations:
    # Use VPN or cloud VM in different continent
    # e.g., if origin is us-east-1, test from Singapore
    
    # Without Global Accelerator:
    curl -w "@curl-format.txt" https://api.example.com
    # Time: 250ms (public internet routing)
    
    # With Global Accelerator:
    curl -w "@curl-format.txt" https://accelerator-dns.awsglobalaccelerator.com
    # Time: 120ms (AWS backbone from nearest edge)
    
  • Quick test: Use online tools like pingdom.com to test from multiple regions
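The `curl-format.txt` used above is a user-supplied write-out template, not something AWS provides. A minimal version that separates DNS, TCP, TLS, and first-byte time — the phases where the AWS backbone and edge termination actually help — might look like:

```
     time_namelookup:  %{time_namelookup}s\n
        time_connect:  %{time_connect}s\n
     time_appconnect:  %{time_appconnect}s\n
  time_starttransfer:  %{time_starttransfer}s\n
          time_total:  %{time_total}s\n
```

With Global Accelerator, expect `time_connect` and `time_appconnect` to drop the most, since the TCP and TLS handshakes terminate at the nearby edge.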

Problem 3: “CloudFront returns 502 Bad Gateway errors”

  • Why: Origin is unreachable, Security Group blocks CloudFront, or origin returned invalid response.
  • Fix: Debug systematically:
    # Check origin health
    curl -I https://origin.example.com
    
    # Allow CloudFront IPs in Security Group
    # Use AWS managed prefix list for CloudFront
    aws ec2 authorize-security-group-ingress \
      --group-id sg-xxx \
      --ip-permissions 'IpProtocol=tcp,FromPort=443,ToPort=443,PrefixListIds=[{PrefixListId=pl-xxx}]'
    
    # Check CloudFront logs for error details
    aws s3 sync s3://cloudfront-logs-bucket/ ./logs/
    zgrep "502" ./logs/*.gz
    
  • Quick test: Access origin directly, verify it works

Problem 4: “Global Accelerator costs $0.025/hour per accelerator but I have 20 services!”

  • Why: Each accelerator costs $18/month. With 20, that’s $360/month base cost plus data transfer.
  • Fix: Consolidate or use an ALB:
    Option 1: One accelerator fronting multiple services
    • Put services behind a shared ALB and use it as the endpoint
    • Use separate listener ports for non-HTTP services
    • Cost: $18/month + data transfer

    Option 2: Use an Application Load Balancer instead
    • ALB routes to multiple target groups by host/path
    • No Global Accelerator needed for many use cases
    • Cost: ~$16/month + LCU charges (usually cheaper)

    When to use Global Accelerator:
    • Need static Anycast IPs
    • Non-HTTP protocols (TCP/UDP)
    • Instant regional failover
    • Traffic benefits from the AWS backbone
  • Quick test: Compare costs in the AWS Pricing Calculator

Problem 5: “CloudFront caching API responses incorrectly”

  • Why: Default behavior caches based on URL only. Query strings and headers not considered.
  • Fix: Configure cache behavior:
    # Forward query strings
    "ForwardedValues": {
      "QueryString": true,
      "Headers": ["Authorization", "Accept-Encoding"]
    }
    
    # Or use cache policies (recommended)
    "CachePolicyId": "4135ea2d-6df8-44a3-9df3-4b5a84be39ad",  // CachingDisabled
    "OriginRequestPolicyId": "216adef6-5c7f-47e4-b989-5492eafa07d3"  // AllViewer
    
  • Quick test: Make API call with different query params, verify cache key

Problem 6: “Global Accelerator health checks failing despite origin being healthy”

  • Why: Health check configuration mismatch, or Security Group blocks health check source IPs.
  • Fix: Verify health check config:
    # Check health check settings
    aws globalaccelerator describe-endpoint-group \
      --endpoint-group-arn arn:xxx
    
    # Health check comes from AWS edge locations
    # Allow from all AWS IP ranges or use broad CIDR
    # Better: Use ALB as endpoint (ALB has own health checks)
    
    # Verify endpoint responds to health check path
    curl https://origin.example.com/health
    
  • Quick test: Check endpoint health status in Global Accelerator console

Problem 7: “CloudFront not compressing content”

  • Why: Compression not enabled in CloudFront behavior, or origin sends pre-compressed content.
  • Fix: Enable compression:
    # In CloudFront behavior
    "Compress": true,
    
    # Ensure origin doesn't set Content-Encoding
    # CloudFront will compress text/html, text/css, application/json, etc.
    
    # Check if compression works
    curl -H "Accept-Encoding: gzip" https://distribution.cloudfront.net/file.js -I
    # Should see: Content-Encoding: gzip
    
  • Quick test: Compare Content-Length with and without compression (should be ~70% smaller)

Problem 8: “Users in China can’t access Global Accelerator endpoints”

  • Why: China requires ICP license for internet content. AWS Global Accelerator doesn’t work in China without special setup.
  • Fix: Use alternatives:
    Options for China:
    1. Deploy in the AWS China regions (requires an ICP license)
    2. Use CloudFront's mainland China edge locations (via the AWS China partner; ICP required)
    3. Use a local CDN provider (Alibaba Cloud, Tencent Cloud)
    4. Route China traffic differently (geo-based DNS routing)

    Note: Global Accelerator has no points of presence in mainland China, so traffic from China still crosses congested international links.
  • Quick test: Use VPN to China location, test accessibility

Project 10: VPC Lattice for Service-to-Service Networking

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: AWS CDK
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Service Mesh / Application Networking
  • Software or Tool: AWS VPC Lattice
  • Main Book: “Building Microservices, 2nd Edition” by Sam Newman

What you’ll build: A service mesh using VPC Lattice that enables service-to-service communication across VPCs and accounts with built-in authentication, authorization, and observability—without managing sidecars or proxies.

Why it teaches AWS networking: VPC Lattice represents AWS’s vision for application-layer networking. It abstracts away traditional network complexity (VPC peering, Transit Gateway for service communication) and provides L7 features like path-based routing and IAM authentication.

Core challenges you’ll face:

  • Service network creation (the namespace for services) → maps to service mesh concepts
  • Service association (connecting Lambda, ECS, EC2 to the mesh) → maps to compute abstraction
  • Auth policies (IAM-based service-to-service auth) → maps to zero-trust
  • Cross-account service sharing → maps to multi-account patterns
  • Traffic management (weighted routing, path-based) → maps to L7 load balancing

Key Concepts:

  • VPC Lattice: AWS VPC Lattice Documentation
  • Service Mesh Concepts: “Building Microservices, 2nd Edition” Chapter 4 - Sam Newman
  • Zero-Trust Networking: NIST Zero Trust Architecture (SP 800-207)

Difficulty: Advanced Time estimate: 2 weeks Prerequisites: VPC fundamentals, IAM, microservices basics

Real world outcome:

$ terraform apply

VPC Lattice Service Network: my-service-network
  Associated VPCs: [vpc-frontend, vpc-backend, vpc-data]

Services:
  - frontend-service (Lambda function)
    URL: https://frontend-svc-abc123.vpc-lattice-svcs.us-east-1.on.aws
  - orders-service (ECS Fargate)
    URL: https://orders-svc-def456.vpc-lattice-svcs.us-east-1.on.aws
  - inventory-service (EC2 instances)
    URL: https://inventory-svc-ghi789.vpc-lattice-svcs.us-east-1.on.aws

Auth Policies:
  - frontend-service → can call → orders-service
  - orders-service → can call → inventory-service
  - frontend-service → CANNOT call → inventory-service (least privilege!)

# Test service-to-service call
$ ssh frontend-host
[frontend]$ curl -H "Authorization: $IAM_SIG" \
    https://orders-svc-def456.vpc-lattice-svcs.us-east-1.on.aws/orders

{"orders": [...]}  # Allowed!

[frontend]$ curl -H "Authorization: $IAM_SIG" \
    https://inventory-svc-ghi789.vpc-lattice-svcs.us-east-1.on.aws/inventory

403 Forbidden: frontend-service is not authorized to access inventory-service

# Traffic split for canary deployment
$ terraform apply -var="orders_v2_weight=10"

Traffic routing for orders-service:
  v1 (current): 90%
  v2 (canary):  10%

# Observe metrics
$ aws cloudwatch get-metric-statistics \
    --namespace AWS/VPCLattice \
    --metric-name RequestCount

orders-service metrics:
  Total requests: 10,000
  Target v1: 9,012 (90.1%)
  Target v2: 988 (9.9%)
  4xx errors v1: 45 (0.5%)
  4xx errors v2: 12 (1.2%)

Implementation Hints: VPC Lattice architecture:

                    ┌─────────────────────────────┐
                    │     Service Network         │
                    │   "my-service-network"      │
                    └─────────────────────────────┘
                              │
          ┌───────────────────┼───────────────────┐
          │                   │                   │
    ┌─────▼─────┐      ┌─────▼─────┐      ┌─────▼─────┐
    │ frontend  │      │ orders    │      │ inventory │
    │ service   │─────▶│ service   │─────▶│ service   │
    │ (Lambda)  │      │ (ECS)     │      │ (EC2)     │
    └───────────┘      └───────────┘      └───────────┘
         │                  │                   │
    ┌────▼────┐       ┌────▼────┐        ┌────▼────┐
    │  VPC A  │       │  VPC B  │        │  VPC C  │
    └─────────┘       └─────────┘        └─────────┘

Key insight: NO VPC peering or Transit Gateway needed!
VPC Lattice provides overlay networking between services.

IAM-based service auth:

resource "aws_vpclattice_auth_policy" "orders_policy" {
  resource_identifier = aws_vpclattice_service.orders.arn

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = "*"
        Action = "vpc-lattice-svcs:Invoke"
        Resource = "*"
        Condition = {
          StringEquals = {
            "vpc-lattice-svcs:SourceVpcOwnerAccount" = var.frontend_account_id
          }
          "ForAnyValue:StringLike" = {
            "aws:PrincipalServiceName" = "lambda.amazonaws.com"
          }
        }
      }
    ]
  })
}

Why VPC Lattice matters:

BEFORE (with Transit Gateway + ALBs):
- Create TGW, attach VPCs
- Create ALB in each VPC
- Manage Security Groups between VPCs
- Build custom auth (mTLS, JWT, etc.)
- Set up monitoring per service
- Total: 50+ resources, complex routing

AFTER (with VPC Lattice):
- Create service network
- Register services
- Define auth policies (IAM-native!)
- Built-in metrics and access logs
- Total: ~10 resources, declarative

Learning milestones:

  1. Services communicate across VPCs → You understand Lattice overlay
  2. IAM policies restrict access → You understand zero-trust auth
  3. Traffic splitting works → You understand L7 capabilities
  4. Can monitor service-to-service calls → You understand observability

Common Pitfalls & Debugging

Problem 1: “VPC Lattice service network created but services can’t connect”

  • Why: Service association missing, or IAM auth policy blocking access.
  • Fix: Verify associations and policies:
    # Check service is associated with the service network
    aws vpc-lattice list-service-network-service-associations \
      --service-network-identifier sn-xxx
    
    # Check auth policy allows traffic
    aws vpc-lattice get-auth-policy --resource-identifier sn-xxx
    
    # Default policy might deny all
    # Update to allow specific principals
    aws vpc-lattice put-auth-policy \
      --resource-identifier sn-xxx \
      --policy '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":"*","Action":"*","Resource":"*"}]}'
    
  • Quick test: Use test client to call service, check for 403 Forbidden vs connection timeout

Problem 2: “VPC Lattice costs are confusing - what am I actually paying for?”

  • Why: VPC Lattice has multiple cost components.
  • Fix: Understand the pricing model:
    VPC Lattice costs:
    • Service network: $0.025/hour (~$18/month) per service network
    • Data processed: $0.025/GB
    • No charge for: VPC associations, services, target groups

    Example: 1 service network, 3 services, 100GB/month data processed

    Cost:
    • Service network: ~$18/month
    • Data: 100GB × $0.025 = $2.50
    • Total: ~$20.50/month

    Compare to an ALB: ~$16/month + $0.008/LCU-hour (often the cheaper option at high traffic volumes)

  • Quick test: Model the costs in the AWS Pricing Calculator
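
The arithmetic above can be wrapped in a small estimator. A minimal sketch in Python, with the rates hard-coded from the figures above (verify them against current AWS pricing before relying on the output):

```python
# Rough VPC Lattice cost estimator. Rates are taken from the figures
# above and may be stale -- always check current AWS pricing.
HOURS_PER_MONTH = 730

def lattice_monthly_cost(service_networks, data_gb):
    """Estimate monthly VPC Lattice cost in USD."""
    network_cost = service_networks * 0.025 * HOURS_PER_MONTH  # $0.025/hour
    data_cost = data_gb * 0.025                                # $0.025/GB processed
    return round(network_cost + data_cost, 2)

# 1 service network, 100 GB/month processed
print(lattice_monthly_cost(1, 100))  # → 20.75
```

Note this lands slightly above the $20.50 figure in the text only because the text rounds the service-network charge down to $18/month.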

Problem 3: “IAM-based auth not working, getting 403 errors”

  • Why: SigV4 signing missing or IAM policy misconfigured.
  • Fix: Implement SigV4 signing:
    # Client must sign requests with AWS credentials (SigV4)
    import boto3
    import requests
    from botocore.auth import SigV4Auth
    from botocore.awsrequest import AWSRequest
    
    credentials = boto3.Session().get_credentials()
    request = AWSRequest(method='GET', url='https://service-xxxxx.vpc-lattice.amazonaws.com/api')
    SigV4Auth(credentials, 'vpc-lattice-svcs', 'us-east-1').add_auth(request)
    
    # Send the signed request (headers now carry Authorization and X-Amz-Date)
    response = requests.get(request.url, headers=dict(request.headers))
    
    # Verify auth policy allows principal
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::ACCOUNT:role/MyRole"},
      "Action": "vpc-lattice-svcs:Invoke",
      "Resource": "*"
    }
    
  • Quick test: Try unauthenticated request (should get 403), authenticated (should work)

Problem 4: “Target group health checks failing”

  • Why: Health check configuration doesn’t match application, or Security Group blocks health checks.
  • Fix: Debug health checks:
    # Check health check configuration
    aws vpc-lattice get-target-group --target-group-identifier tg-xxx
    
    # Health checks originate from the link-local range 169.254.171.0/24
    # Allow in Security Group
    aws ec2 authorize-security-group-ingress \
      --group-id sg-xxx \
      --ip-permissions IpProtocol=tcp,FromPort=80,ToPort=80,CidrIp=169.254.171.0/24
    
    # Test health check endpoint manually
    curl http://target-ip/health
    
  • Quick test: Check target health status in VPC Lattice console

Problem 5: “Cross-VPC communication not working”

  • Why: VPC association missing or route tables not configured.
  • Fix: Associate VPCs with service network:
    # Associate consumer VPC with service network
    aws vpc-lattice create-service-network-vpc-association \
      --service-network-identifier sn-xxx \
      --vpc-identifier vpc-consumer
    
    # VPC Lattice auto-creates routes to 169.254.171.0/24
    # Verify route table has managed route
    aws ec2 describe-route-tables --vpc-id vpc-consumer
    # Should see: 169.254.171.0/24 -> VPC Lattice
    
  • Quick test: From an EC2 instance in the consumer VPC, curl the service DNS name (Lattice serves HTTP on link-local addresses, so ICMP ping won't answer)

Problem 6: “DNS resolution failing for VPC Lattice services”

  • Why: Service custom domain not configured, or Route 53 Resolver not set up.
  • Fix: Set up DNS correctly:
    # VPC Lattice provides DNS: service-xxxxx.vpc-lattice.amazonaws.com
    # For custom domain:
    
    # 1. Create CNAME in Route 53
    api.example.com CNAME service-xxxxx.vpc-lattice.amazonaws.com
    
    # 2. Or use alias record
    api.example.com ALIAS service-xxxxx.vpc-lattice.amazonaws.com
    
    # 3. Verify DNS resolution from VPC
    nslookup api.example.com
    
  • Quick test: dig service-xxx.vpc-lattice.amazonaws.com from client VPC

Problem 7: “Traffic splitting not distributing evenly”

  • Why: Weighted target groups configured incorrectly, or sticky sessions enabled.
  • Fix: Verify target group weights:
    # Check listener rules and target group weights
    aws vpc-lattice get-listener \
      --service-identifier svc-xxx \
      --listener-identifier listener-xxx
    
    # For 80/20 split:
    # Target Group A: weight 80
    # Target Group B: weight 20
    
    # Sticky sessions affect distribution
    # Disable if you want pure weighted round-robin
    
  • Quick test: Make 100 requests, count responses from each target (should match weights)
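
The quick test above can be simulated offline before touching AWS: weighted routing is proportional selection, and you can verify that observed counts converge to the configured weights. A minimal sketch (target group names are illustrative):

```python
import random
from collections import Counter

def pick_target(weights, rng):
    """Select a target group name proportionally to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(42)  # fixed seed for reproducibility
weights = {"tg-a": 80, "tg-b": 20}  # the 80/20 split from the text

counts = Counter(pick_target(weights, rng) for _ in range(10_000))
share_a = counts["tg-a"] / 10_000
print(counts, f"tg-a share: {share_a:.1%}")

# With 10k samples the observed share should sit well within 2% of 80%
assert abs(share_a - 0.80) < 0.02, "distribution deviates from configured weights"
```

With only 100 real requests (as in the quick test), expect noticeably more variance than this 10k-sample simulation shows.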

Problem 8: “VPC Lattice vs PrivateLink - when to use which?”

  • Why: Both solve cross-VPC communication but differently.
  • Fix: Understand the differences:
    VPC Lattice:
    • L7 routing (HTTP/HTTPS/gRPC)
    • IAM-based auth built-in
    • Traffic splitting, weighted routing
    • Service discovery and mesh-like features
    • Use for: Microservices, service mesh

    PrivateLink:
    • L4 routing (TCP/UDP)
    • Works with NLB
    • Cross-account SaaS delivery
    • Consumer creates Interface Endpoint
    • Use for: Exposing services to customers, third-party integrations

    Decision tree:
    • Internal microservices? → VPC Lattice
    • Exposing SaaS to external accounts? → PrivateLink
    • Need L7 features (path routing, headers)? → VPC Lattice
    • Non-HTTP protocols? → PrivateLink
  • Quick test: Match your use case to decision tree

Project 11: PrivateLink Service Provider

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: AWS CDK, CloudFormation
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Private Networking / SaaS Architecture
  • Software or Tool: AWS PrivateLink, Network Load Balancer
  • Main Book: “AWS Certified Advanced Networking Study Guide”

What you’ll build: A PrivateLink service that exposes your application to other AWS accounts privately—the same way AWS exposes services like S3, DynamoDB. Customers connect via Interface Endpoints without any public internet exposure.

Why it teaches AWS networking: PrivateLink is how you build “AWS-native” SaaS. Understanding how to become a service provider (not just consumer) teaches you about NLBs, endpoint services, cross-account networking, and how AWS builds its own services.

Core challenges you’ll face:

  • NLB as PrivateLink backend (why NLB, not ALB?) → maps to PrivateLink architecture
  • Cross-account access (allowlisting consumer accounts) → maps to multi-tenant SaaS
  • DNS for endpoints (private hosted zones, endpoint DNS) → maps to private DNS
  • Connection acceptance (auto vs manual) → maps to security controls
  • High availability (multi-AZ endpoints) → maps to reliability

Key Concepts:

Difficulty: Expert Time estimate: 2 weeks Prerequisites: NLB, VPC Endpoints, cross-account IAM

Real world outcome:

$ terraform apply -target=module.provider

PROVIDER ACCOUNT (your SaaS):
  VPC: vpc-provider
  NLB: nlb-myservice (internal)
    Listeners: 443 → target-group (your app)
  Endpoint Service: vpce-svc-abc123
    Service Name: com.amazonaws.vpce.us-east-1.vpce-svc-abc123
    Allowed Principals: [arn:aws:iam::CUSTOMER_ACCOUNT:root]

$ terraform apply -target=module.consumer

CONSUMER ACCOUNT (customer):
  VPC: vpc-consumer
  Interface Endpoint: vpce-def456
    Service: com.amazonaws.vpce.us-east-1.vpce-svc-abc123
    Subnets: [subnet-a, subnet-b]
    DNS: myservice.vpce.local → 10.1.1.100, 10.1.2.100

# From customer's VPC
$ ssh customer-server
[customer]$ nslookup myservice.vpce.local
  10.1.1.100  (ENI in subnet-a)
  10.1.2.100  (ENI in subnet-b)

[customer]$ curl https://myservice.vpce.local/api/health
{"status": "healthy", "provider": "MyService SaaS"}

# Traffic never touches internet!
[customer]$ traceroute myservice.vpce.local
  1  10.1.1.100 (vpce ENI)  0.5ms  # Direct to PrivateLink ENI
  # No internet hops - traffic stays in AWS backbone

# Provider sees customer connection
$ aws ec2 describe-vpc-endpoint-connections

{
  "VpcEndpointConnections": [{
    "VpcEndpointId": "vpce-def456",
    "VpcEndpointOwner": "CUSTOMER_ACCOUNT",
    "VpcEndpointState": "available",
    "CreationTimestamp": "2024-01-15T10:30:00Z"
  }]
}

Implementation Hints: PrivateLink architecture (provider side):

PROVIDER VPC
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────┐ │
│  │ Your App    │◀───│ Target      │◀───│ Network Load    │ │
│  │ (ECS/EC2)   │    │ Group       │    │ Balancer        │ │
│  └─────────────┘    └─────────────┘    └────────┬────────┘ │
│        10.0.1.x                                  │          │
│                                                  │          │
│                              ┌───────────────────▼────────┐ │
│                              │ VPC Endpoint Service       │ │
│                              │ vpce-svc-abc123            │ │
│                              │                            │ │
│                              │ Allowed: CUSTOMER_ACCOUNT  │ │
│                              └────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
                                        │
                                        │ AWS PrivateLink
                                        │ (private connection)
                                        │
CONSUMER VPC                            ▼
┌─────────────────────────────────────────────────────────────┐
│  ┌─────────────────┐    ┌────────────────────────────────┐  │
│  │ Interface       │───▶│ Customer App                   │  │
│  │ Endpoint        │    │ curl myservice.vpce.local      │  │
│  │ vpce-def456     │    └────────────────────────────────┘  │
│  │ 10.1.1.100      │                                        │
│  └─────────────────┘                                        │
└─────────────────────────────────────────────────────────────┘

Key Terraform resources:

# Provider side
resource "aws_vpc_endpoint_service" "myservice" {
  acceptance_required        = false  # or true for manual approval
  network_load_balancer_arns = [aws_lb.internal.arn]

  allowed_principals = [
    "arn:aws:iam::CUSTOMER_ACCOUNT:root"
  ]
}

# Consumer side
resource "aws_vpc_endpoint" "myservice" {
  vpc_id              = aws_vpc.consumer.id
  service_name        = "com.amazonaws.vpce.us-east-1.vpce-svc-abc123"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.consumer_a.id, aws_subnet.consumer_b.id]
  security_group_ids  = [aws_security_group.endpoint.id]

  private_dns_enabled = false  # Use custom DNS instead
}

# Custom DNS in consumer VPC
resource "aws_route53_zone" "private" {
  name = "vpce.local"
  vpc {
    vpc_id = aws_vpc.consumer.id
  }
}

resource "aws_route53_record" "myservice" {
  zone_id = aws_route53_zone.private.zone_id
  name    = "myservice.vpce.local"
  type    = "A"

  alias {
    name                   = aws_vpc_endpoint.myservice.dns_entry[0].dns_name
    zone_id                = aws_vpc_endpoint.myservice.dns_entry[0].hosted_zone_id
    evaluate_target_health = true
  }
}

Why NLB (not ALB)?

PrivateLink requires NLB because:
1. Layer 4 (TCP/UDP) - preserves client IP
2. Static IPs per AZ (stable endpoints)
3. Extremely high performance
4. Works with any TCP protocol (not just HTTP)

If you need HTTP features (path routing, headers):
  NLB → ALB → App (NLB for PrivateLink, ALB for L7)

Learning milestones:

  1. Endpoint service is created → You understand the provider side
  2. Consumer connects via Interface Endpoint → You understand cross-account
  3. Traffic is private (no internet route) → You verify PrivateLink privacy
  4. Multiple consumers can connect → You understand multi-tenant SaaS

Common Pitfalls & Debugging

Problem 1: “Created Endpoint Service but consumer can’t find it”

  • Why: Service name is long and auto-generated. Consumer needs exact service name, and you may need to share it manually.
  • Fix: Get and share service name:
    # Get service name
    aws ec2 describe-vpc-endpoint-service-configurations \
      --service-ids vpce-svc-xxx
    
    # Service name format:
    # com.amazonaws.vpce.REGION.vpce-svc-XXXXX
    
    # Share this exact name with consumer
    # They'll use it to create Interface Endpoint
    aws ec2 create-vpc-endpoint \
      --vpc-id vpc-consumer \
      --service-name com.amazonaws.vpce.us-east-1.vpce-svc-xxxxx \
      --vpc-endpoint-type Interface
    
  • Quick test: Consumer should see “pendingAcceptance” status

Problem 2: “Endpoint connection stuck in ‘pending’ state”

  • Why: Provider must accept connection requests (unless auto-acceptance enabled).
  • Fix: Accept connection on provider side:
    # List pending connections
    aws ec2 describe-vpc-endpoint-connections \
      --filters "Name=service-id,Values=vpce-svc-xxx"
    
    # Accept specific connection
    aws ec2 accept-vpc-endpoint-connections \
      --service-id vpce-svc-xxx \
      --vpc-endpoint-ids vpce-yyy
    
    # Or enable auto-acceptance
    aws ec2 modify-vpc-endpoint-service-configuration \
      --service-id vpce-svc-xxx \
      --acceptance-required false
    
  • Quick test: Connection status changes to “available”

Problem 3: “Why must I use NLB? I want to use ALB!”

  • Why: PrivateLink only supports NLB as backend. ALB is not compatible with Endpoint Services.
  • Fix: Understand the architecture:
    PrivateLink requires NLB because it:
    1. Preserves source IP (ALB doesn't)
    2. Works at Layer 4 (any TCP protocol)
    3. Supports PrivateLink DNS integration
    4. Handles connection acceptance

    If you need ALB features: Consumer → Interface Endpoint → NLB → ALB → App

    This gives you:
    • PrivateLink (NLB)
    • Path routing, SSL termination, WAF (ALB)
  • Quick test: Deploy NLB → ALB architecture

Problem 4: “Consumer’s Security Group blocks traffic to endpoint”

  • Why: Interface Endpoints create ENIs in consumer’s VPC. Security Group must allow traffic to these ENIs.
  • Fix: Configure Security Groups correctly:
    # Consumer side: the endpoint ENIs' Security Group must allow
    # inbound traffic FROM the app's Security Group
    aws ec2 authorize-security-group-ingress \
      --group-id sg-endpoint \
      --protocol tcp \
      --port 443 \
      --source-group sg-consumer-app
    
    # Provider side: targets behind the NLB see the NLB's private IPs
    # (not the consumer's real source IP), so allow the NLB subnets
    
  • Quick test: Check VPC Flow Logs for REJECT actions

Problem 5: “DNS resolution not working for PrivateLink endpoint”

  • Why: Private DNS names require enableDnsSupport and enableDnsHostnames on consumer VPC, or Private DNS not enabled on endpoint.
  • Fix: Enable DNS settings:
    # On consumer VPC
    aws ec2 modify-vpc-attribute \
      --vpc-id vpc-consumer \
      --enable-dns-support "{\"Value\":true}"
    aws ec2 modify-vpc-attribute \
      --vpc-id vpc-consumer \
      --enable-dns-hostnames "{\"Value\":true}"
    
    # On Interface Endpoint
    aws ec2 modify-vpc-endpoint \
      --vpc-endpoint-id vpce-yyy \
      --private-dns-enabled
    
    # Verify DNS resolution
    nslookup myservice.example.com  # Should return 10.x.x.x (private IP)
    
  • Quick test: dig endpoint DNS name from consumer VPC

Problem 6: “PrivateLink costs $7.30/month per AZ!”

  • Why: Interface Endpoints cost $0.01/hour ($7.30/month) per AZ, plus data transfer.
  • Fix: Understand and optimize costs:
    Cost model:
    • Endpoint: $7.30/month per AZ
    • Data transfer: $0.01/GB
    • NLB: $0.0225/hour (~$16.50/month) + LCU charges

    Example with 3 AZs:
    • Endpoints: 3 × $7.30 = $21.90/month
    • NLB: ~$16.50/month
    • Data: 100GB × $0.01 = $1
    • Total: ~$40/month baseline

    Optimization:
    • Use 2 AZs instead of 3 (if acceptable)
    • Share endpoints across multiple services (one endpoint, multiple target groups)
  • Quick test: Model costs in AWS Calculator
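
Before opening the calculator, the baseline above is simple enough to script. A minimal sketch with the per-hour rates from the text hard-coded as defaults (LCU charges excluded; verify rates against current AWS pricing):

```python
# Baseline PrivateLink cost model: per-AZ Interface Endpoints + NLB + data.
# Rates come from the figures above and exclude NLB LCU charges.
HOURS_PER_MONTH = 730

def privatelink_monthly_cost(azs, data_gb,
                             endpoint_hourly=0.01,   # $/hour per AZ endpoint
                             nlb_hourly=0.0225,      # $/hour for the NLB
                             data_per_gb=0.01):      # $/GB transferred
    """Estimate baseline monthly PrivateLink provider+consumer cost in USD."""
    endpoints = azs * endpoint_hourly * HOURS_PER_MONTH
    nlb = nlb_hourly * HOURS_PER_MONTH
    data = data_gb * data_per_gb
    return round(endpoints + nlb + data, 2)

# 3 AZs, 100 GB/month: prints the ~$40/month baseline from the text
print(privatelink_monthly_cost(3, 100))
```

Dropping to 2 AZs (`privatelink_monthly_cost(2, 100)`) shows the ~$7/month saving the optimization bullet suggests.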

Problem 7: “Can consumers in other regions connect to my PrivateLink service?”

  • Why: PrivateLink is regional. Consumers must be in same region as Endpoint Service.
  • Fix: Deploy multi-region:
    For global reach:
    1. Deploy an Endpoint Service in each region
    2. Use the same service architecture
    3. Consumers connect to the endpoint in their region
    4. Use Route 53 for cross-region DNS if needed

    Alternative:
    • Use VPC Peering + PrivateLink (complex)
    • Use Inter-Region VPC Peering to reach the endpoint
  • Quick test: Try creating endpoint in different region (will fail to find service)

Problem 8: “How do I charge customers for PrivateLink usage?”

  • Why: You’re building SaaS, need to meter usage through PrivateLink.
  • Fix: Implement metering:
    # Use NLB access logs to track data transfer per endpoint
    # S3 bucket: nlb-logs/AWSLogs/account/region/year/month/day/
    
    import boto3
    s3 = boto3.client('s3')
    
    # Parse NLB logs to get:
    # - Client IP (consumer's endpoint ENI)
    # - Bytes sent/received
    # - Request count
    
    # Map endpoint ENI to customer account
    # (stored when they connect via acceptVpcEndpointConnections)
    
    # Bill monthly based on data transfer
    
  • Quick test: Enable NLB access logs, verify data captured
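
The log-parsing step above can be sketched offline. A minimal sketch that aggregates bytes per client IP from NLB access-log lines; the field positions are assumptions based on the documented space-separated NLB log layout, so check the access-log format docs for your log version before billing anyone with this:

```python
from collections import defaultdict

# ASSUMED NLB access-log field positions (verify against AWS docs):
# 5 = client:port, 9 = received_bytes, 10 = sent_bytes
CLIENT, RECEIVED, SENT = 5, 9, 10

def bytes_per_client(log_lines):
    """Aggregate total bytes (in + out) per client IP from NLB access logs."""
    totals = defaultdict(int)
    for line in log_lines:
        fields = line.split(" ")
        if len(fields) <= SENT:
            continue  # skip malformed or truncated lines
        client_ip = fields[CLIENT].rsplit(":", 1)[0]
        totals[client_ip] += int(fields[RECEIVED]) + int(fields[SENT])
    return dict(totals)

# Synthetic log line matching the assumed layout (values are made up)
sample = ("tls 2.0 2024-01-15T10:30:00 net/nlb-myservice/abc123 lst1 "
          "10.1.1.100:51341 10.0.1.5:443 5 2 98 246 - - - - - - - - - - -")
print(bytes_per_client([sample]))  # → {'10.1.1.100': 344}
```

In production you would feed this from the gzipped log objects in the S3 bucket, then map each client IP (an endpoint ENI) to the customer account recorded at connection-acceptance time.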

Project 12: Complete Multi-Region Network Architecture (Capstone)

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: Pulumi, AWS CDK
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Enterprise Architecture / Global Networking
  • Software or Tool: All AWS Networking Services
  • Main Book: “AWS Certified Solutions Architect Professional Study Guide”

What you’ll build: A production-grade, multi-region, multi-account network with Transit Gateway peering, centralized egress, Network Firewall, hybrid connectivity, and complete observability—the kind of network that runs Fortune 500 companies.

Why it teaches AWS networking: This is the synthesis of everything. You’ll face real architectural decisions: where to place firewalls, how to handle cross-region traffic, when to use which connectivity option, and how to make it all observable and maintainable.

Core challenges you’ll face:

  • Multi-region Transit Gateway peering → maps to global networking
  • Centralized egress per region → maps to inspection architecture
  • Cross-region data transfer optimization → maps to cost optimization
  • Failover and disaster recovery → maps to reliability
  • Unified monitoring and logging → maps to observability at scale

Key Concepts:

Difficulty: Master Time estimate: 1-2 months Prerequisites: All previous projects

Real world outcome:

╔══════════════════════════════════════════════════════════════════════════╗
║                    ENTERPRISE MULTI-REGION NETWORK                        ║
╠══════════════════════════════════════════════════════════════════════════╣

                           US-EAST-1                    EU-WEST-1
                    ┌─────────────────────┐      ┌─────────────────────┐
                    │                     │      │                     │
   ON-PREMISES      │  ┌───────────────┐  │      │  ┌───────────────┐  │
   ┌──────────┐     │  │ Transit       │  │      │  │ Transit       │  │
   │ Data     │─VPN─┼──│ Gateway       │◀─┼─PEER─┼─▶│ Gateway       │  │
   │ Center   │     │  │ (us-east-1)   │  │      │  │ (eu-west-1)   │  │
   └──────────┘     │  └───────┬───────┘  │      │  └───────┬───────┘  │
                    │          │          │      │          │          │
                    │  ┌───────┴───────┐  │      │  ┌───────┴───────┐  │
                    │  │               │  │      │  │               │  │
                    │  ▼               ▼  │      │  ▼               ▼  │
              ┌─────────────┐   ┌─────────────┐ ┌─────────────┐  ┌─────────────┐
              │ Shared Svcs │   │ Production  │ │ Shared Svcs │  │ Production  │
              │ VPC         │   │ VPC         │ │ VPC         │  │ VPC         │
              │ ┌─────────┐ │   │             │ │ ┌─────────┐ │  │             │
              │ │ NFW     │ │   │ ┌─────────┐ │ │ │ NFW     │ │  │ ┌─────────┐ │
              │ │ NAT     │ │   │ │ App     │ │ │ │ NAT     │ │  │ │ App     │ │
              │ │ Endpoint│ │   │ │ Cluster │ │ │ │ Endpoint│ │  │ │ Cluster │ │
              │ └─────────┘ │   │ └─────────┘ │ │ └─────────┘ │  │ └─────────┘ │
              └─────────────┘   └─────────────┘ └─────────────┘  └─────────────┘
                    │                     │      │                     │
                    └─────────────────────┘      └─────────────────────┘

ACCOUNTS:
  - Network Account: Transit Gateways, Shared Services VPCs, Firewalls
  - Security Account: GuardDuty, Security Hub, Flow Logs aggregation
  - Production Account: Application workloads
  - Development Account: Dev/test workloads (isolated)

TRAFFIC FLOWS:
  ┌────────────────────────────────────────────────────────────────────────┐
  │ App (Prod US) → TGW → Shared Services VPC → NFW → NAT → Internet      │
  │ App (Prod EU) → App (Prod US): TGW EU → TGW Peering → TGW US → App    │
  │ On-Prem → AWS: VPN → TGW US → Shared Services → Production VPC        │
  └────────────────────────────────────────────────────────────────────────┘

$ ./network-dashboard

╔════════════════════════════════════════════════════════════════════════╗
║                    GLOBAL NETWORK HEALTH DASHBOARD                     ║
╠════════════════════════════════════════════════════════════════════════╣
║                                                                        ║
║  REGIONS                                                               ║
║  ─────────────────────────────────────────────────────────────────────║
║  us-east-1: ✓ Healthy                                                  ║
║    - Transit Gateway: 12 attachments, 4.2 Gbps throughput              ║
║    - Network Firewall: 156K flows/min, 23 blocks                       ║
║    - VPN Tunnels: 2/2 UP (BGP routes: 15)                              ║
║                                                                        ║
║  eu-west-1: ✓ Healthy                                                  ║
║    - Transit Gateway: 8 attachments, 2.1 Gbps throughput               ║
║    - Network Firewall: 89K flows/min, 12 blocks                        ║
║    - TGW Peering: UP (latency: 72ms to us-east-1)                      ║
║                                                                        ║
║  CROSS-REGION TRAFFIC (Last 24h)                                       ║
║  ─────────────────────────────────────────────────────────────────────║
║  us-east-1 ↔ eu-west-1: 1.2 TB                                         ║
║  Cost: $24.00 (at $0.02/GB)                                            ║
║                                                                        ║
║  ALERTS                                                                ║
║  ─────────────────────────────────────────────────────────────────────║
║  ⚠️  VPN Tunnel 1 latency spike: 180ms → 95ms (recovered)              ║
║  ⚠️  Network Firewall blocked crypto miner: 10.1.2.50 → pool.mining.com║
║                                                                        ║
║  COMPLIANCE                                                            ║
║  ─────────────────────────────────────────────────────────────────────║
║  ✓ All flow logs enabled and shipped to Security Account               ║
║  ✓ No direct internet access from production VPCs                      ║
║  ✓ All cross-account traffic via Transit Gateway (auditable)           ║
║  ✓ Network Firewall IDS rules: 2,456 signatures active                 ║
║                                                                        ║
╚════════════════════════════════════════════════════════════════════════╝

Implementation Hints: This project combines all previous projects. Key architectural decisions:

  1. Transit Gateway per region with peering:
    resource "aws_ec2_transit_gateway_peering_attachment" "us_eu" {
      provider                = aws.us_east_1
      peer_region             = "eu-west-1"
      peer_transit_gateway_id = aws_ec2_transit_gateway.eu.id
      transit_gateway_id      = aws_ec2_transit_gateway.us.id
    }
    
  2. Centralized egress with inspection:
    Traffic: App VPC → TGW → Firewall VPC → NFW → NAT → Internet
    Return:  Internet → NAT → NFW → TGW → App VPC
    
    All egress is inspected by Network Firewall; all traffic is logged via Flow Logs.
    
  3. Route table segmentation for security:
    TGW route tables:
    • "production": routes to shared services + prod VPCs, NO dev access
    • "development": routes to shared services + dev VPCs, NO prod access
    • "shared-services": routes to all VPCs (hub)
    
  4. Observability pipeline:
    VPC Flow Logs → CloudWatch Logs → Kinesis Firehose → S3 (Security Account)
    Network Firewall Logs → CloudWatch Logs → Athena/QuickSight
    TGW Flow Logs → CloudWatch Logs → SIEM integration
    

Learning milestones:

  1. Multi-region TGW peering works → You understand global connectivity
  2. All egress flows through central firewall → You understand inspection
  3. Prod/Dev isolation enforced → You understand segmentation
  4. Complete observability in Security Account → You understand enterprise monitoring

Common Pitfalls & Debugging

Problem 1: “Multi-region Transit Gateway peering is taking forever to deploy”

  • Why: Cross-region TGW peering requires resources in multiple regions, each with dependencies. AWS provisioning can take 10-15 minutes per peering attachment.
  • Fix: Be patient and plan deployment order:
    # Deploy in sequence:
    # 1. TGWs in each region (parallel OK)
    # 2. VPC attachments in each region (parallel OK)
    # 3. TGW peering (sequential - one region initiates, other accepts)
    # 4. Route table updates (after peering is active)
    
    # Check peering status
    aws ec2 describe-transit-gateway-peering-attachments \
      --region us-east-1
    
    # Status progression:
    # initiatingRequest → pendingAcceptance → available
    
  • Quick test: Wait for “available” status before adding routes (10-15 min typical)

Problem 2: “Prod VPC can still reach Dev VPC despite TGW route table isolation”

  • Why: Route table associations not configured correctly, or routes leaked through shared services VPC.
  • Fix: Verify complete segmentation:
    # Check which route tables the prod attachment propagates to
    aws ec2 get-transit-gateway-attachment-propagations \
      --transit-gateway-attachment-id tgw-attach-prod
    
    # The prod attachment should be associated with the "production"
    # route table only, NOT the "development" route table
    
    # Verify prod route table has NO routes to dev CIDRs
    aws ec2 search-transit-gateway-routes \
      --transit-gateway-route-table-id tgw-rtb-prod \
      --filters "Name=state,Values=active"
    
    # Should NOT see: 10.1.0.0/16 (dev CIDR)
    
  • Quick test: Try ping from prod to dev (should timeout)
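
The route-table audit above can also be done programmatically: feed the CIDRs returned by `search-transit-gateway-routes` into an overlap check against the CIDRs that must stay unreachable. A minimal sketch (all CIDRs are illustrative):

```python
import ipaddress

def leaked_routes(route_cidrs, forbidden_cidrs):
    """Return routes in a TGW route table that overlap any forbidden CIDR."""
    forbidden = [ipaddress.ip_network(c) for c in forbidden_cidrs]
    return [r for r in route_cidrs
            if any(ipaddress.ip_network(r).overlaps(f) for f in forbidden)]

# CIDRs as pulled from search-transit-gateway-routes on the prod table
prod_routes = ["10.0.0.0/16", "10.2.0.0/16", "10.1.5.0/24"]
dev_cidrs = ["10.1.0.0/16"]  # dev space must never be reachable from prod

print(leaked_routes(prod_routes, dev_cidrs))  # → ['10.1.5.0/24']
```

Using `overlaps()` rather than string equality matters: a leaked `/24` inside the dev `/16` would slip past an exact-match check.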

Problem 3: “Terraform apply takes 45+ minutes and times out”

  • Why: Multi-region, multi-account infrastructure has hundreds of resources with complex dependencies.
  • Fix: Break into modules and use targeted applies:
    # Module structure:
    modules/
      ├── network-foundation/  # VPCs, subnets, IGWs
      ├── transit-gateway/     # TGW, peering, route tables
      ├── security/            # NFW, security groups, NACLs
      ├── observability/       # Flow logs, CloudWatch, S3
      └── workloads/           # Application resources
    
    # Deploy in stages:
    terraform apply -target=module.network-foundation
    terraform apply -target=module.transit-gateway
    terraform apply -target=module.security
    # etc.
    
    # Or use separate state files per region
    
  • Quick test: Monitor AWS CloudFormation events for bottlenecks

Problem 4: “Cross-account Flow Logs not appearing in Security Account S3 bucket”

  • Why: S3 bucket policy doesn’t allow cross-account log delivery, or VPC Flow Log IAM role misconfigured.
  • Fix: Configure cross-account permissions:
    # In Security Account S3 bucket policy
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "vpc-flow-logs.amazonaws.com"
      },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::security-account-flow-logs/*",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": ["PROD-ACCOUNT-ID", "DEV-ACCOUNT-ID"]
        }
      }
    }
    
  • Quick test: Generate traffic, wait 10 minutes, check S3 for logs

Problem 5: “Monthly AWS bill is $2,000+ for this architecture!”

  • Why: Multi-region, multi-account enterprise architecture has significant costs. Common culprits: NAT Gateways, Network Firewall, Transit Gateway attachments, data transfer.
  • Fix: Analyze and optimize costs:
    Typical monthly costs:
    • Transit Gateway: 12 attachments × $36 = $432
    • Network Firewall: 2 regions × $290 = $580
    • NAT Gateways: 4 AZs × $33 = $132
    • Data transfer: Variable ($100-500+)
    • VPC Flow Logs: $20-50
    • Interface Endpoints: $50-100
    • Total baseline: ~$1,300-1,600/month

    Optimization strategies:
    1. Use NAT Instances in dev/test (~$10 vs ~$33/month)
    2. Consolidate Transit Gateway attachments where possible
    3. Use VPC Endpoints to reduce NAT Gateway data transfer
    4. Implement data transfer quotas and budgets
    5. Use S3 lifecycle policies for Flow Logs (delete after 90 days)
  • Quick test: Enable Cost Explorer, analyze by service and region
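
The line items above can be kept as a living cost model next to the Terraform. A minimal sketch with the rough per-unit figures from the text as defaults (all values are approximations, not a pricing source):

```python
# Itemized monthly baseline for the enterprise architecture, using the
# rough per-unit figures quoted above. All rates are approximations.
def enterprise_baseline(tgw_attachments=12, nfw_regions=2, nat_gateways=4,
                        flow_logs=35.0, endpoints=75.0, data_transfer=300.0):
    """Return an itemized dict of estimated monthly costs in USD."""
    items = {
        "transit_gateway":     tgw_attachments * 36.0,  # ~$0.05/hr per attachment
        "network_firewall":    nfw_regions * 290.0,     # ~$0.395/hr per NFW endpoint
        "nat_gateways":        nat_gateways * 33.0,     # ~$0.045/hr each
        "flow_logs":           flow_logs,               # midpoint of $20-50
        "interface_endpoints": endpoints,               # midpoint of $50-100
        "data_transfer":       data_transfer,           # highly variable
    }
    items["total"] = round(sum(items.values()), 2)
    return items

print(enterprise_baseline()["total"])  # → 1554.0, inside the ~$1,300-1,600 band
```

Re-running the model with fewer attachments or NAT Gateways shows immediately which optimization in the list moves the bill the most.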

Problem 6: “Inter-region TGW traffic is slow (200ms latency)”

  • Why: Physical distance between regions. us-east-1 to ap-southeast-1 is ~18,000 km.
  • Fix: Understand that latency is physics:
    Expected round-trip latencies:
    • us-east-1 ↔ us-west-2: ~70ms
    • us-east-1 ↔ eu-west-1: ~80ms
    • us-east-1 ↔ ap-southeast-1: ~200ms

    Optimization:
    1. Use Global Accelerator for user traffic (AWS backbone)
    2. Cache data regionally (ElastiCache, DynamoDB Global Tables)
    3. Use CloudFront for static content
    4. Replicate databases closer to users
    5. Use async processing for non-real-time workloads
  • Quick test: ping across regions to baseline latency

Problem 7: “Terraform state conflicts when multiple people deploy”

  • Why: Single state file for massive infrastructure causes lock contention.
  • Fix: Use remote state locking and split state files:

    ```
    terraform {
      backend "s3" {
        bucket         = "terraform-state-bucket"
        key            = "network/global.tfstate"
        region         = "us-east-1"
        dynamodb_table = "terraform-locks"  # State locking
        encrypt        = true
      }
    }

    # Or split into separate state files per concern:
    # - network-us-east-1.tfstate
    # - network-eu-west-1.tfstate
    # - security-global.tfstate
    ```
  • Quick test: Two people run terraform apply simultaneously; the second should wait on the lock (or fail cleanly), never corrupt state
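At its core, Terraform's DynamoDB locking is a conditional write: create the lock item only if it does not already exist, so exactly one writer wins. A minimal in-memory sketch of that semantics (not the real DynamoDB API; the class and method names are made up for illustration):

```python
# Sketch of Terraform-style state locking: a lock acquisition is a
# conditional "create only if absent" write, so only one writer succeeds.
class LockTable:
    def __init__(self):
        self._locks = {}  # state path -> current holder

    def acquire(self, state_path, holder):
        """Succeeds only if nobody holds the lock (a conditional put)."""
        if state_path in self._locks:
            return False  # Terraform surfaces this as a state-lock error
        self._locks[state_path] = holder
        return True

    def release(self, state_path, holder):
        """Only the current holder may release its own lock."""
        if self._locks.get(state_path) == holder:
            del self._locks[state_path]

table = LockTable()
print(table.acquire("network/global.tfstate", "alice"))  # → True
print(table.acquire("network/global.tfstate", "bob"))    # → False, must wait
table.release("network/global.tfstate", "alice")
print(table.acquire("network/global.tfstate", "bob"))    # → True
```

Splitting state files works for the same reason in reverse: separate keys mean separate locks, so a change in eu-west-1 never contends with one in us-east-1.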

Problem 8: “How do I document this architecture for my team?”

  • Why: Complex multi-region architecture is hard to visualize and onboard new team members.
  • Fix: Create living documentation:

    1. Architecture diagrams (draw.io or Lucidchart):
      • Logical topology (VPCs, TGW, connectivity)
      • Traffic flows (ingress, egress, cross-region)
      • Security boundaries (prod/dev isolation)
    2. Network inventory spreadsheet:

      | VPC | Region | CIDR | Account | Purpose | TGW Route Table |
      |-----|--------|------|---------|---------|-----------------|

    3. Runbooks:
      • Adding a new VPC to the network
      • Troubleshooting connectivity
      • Emergency procedures (isolate VPC, block IP)
    4. Terraform docs (auto-generated): terraform-docs markdown . > TERRAFORM.md
    5. Decision log:
      • Why TGW over VPC Peering?
      • Why Network Firewall over a third-party appliance?
      • Cost/benefit analysis
    6. Diagrams embedded in code: keep ASCII diagrams alongside the Terraform and generate rendered images from them
  • Quick test: A new team member can understand the architecture in 30 minutes
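The inventory spreadsheet in the checklist can be generated rather than hand-maintained, which keeps it from drifting out of date. A minimal sketch that renders the table as Markdown from a static list (the VPC entries are made-up examples; a real version would populate the list from the EC2 and Transit Gateway APIs via boto3):

```python
# Render the network inventory as a Markdown table.
# The inventory rows below are fabricated examples for illustration.
COLUMNS = ["VPC", "Region", "CIDR", "Account", "Purpose", "TGW Route Table"]

def to_markdown(rows):
    """Build a Markdown table with a header, separator, and one line per VPC."""
    lines = ["| " + " | ".join(COLUMNS) + " |",
             "|" + "|".join("---" for _ in COLUMNS) + "|"]
    for row in rows:
        lines.append("| " + " | ".join(row[c] for c in COLUMNS) + " |")
    return "\n".join(lines)

inventory = [
    {"VPC": "prod-app", "Region": "us-east-1", "CIDR": "10.0.0.0/16",
     "Account": "111111111111", "Purpose": "Production workloads",
     "TGW Route Table": "prod-rt"},
    {"VPC": "dev-app", "Region": "eu-west-1", "CIDR": "10.1.0.0/16",
     "Account": "222222222222", "Purpose": "Development",
     "TGW Route Table": "nonprod-rt"},
]
print(to_markdown(inventory))
```

Committing the generator next to the Terraform means the inventory regenerates in CI on every apply, in the same spirit as the terraform-docs step above.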

Project Comparison Table

| # | Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---------|------------|------|------------------------|------------|
| 1 | Production VPC from Scratch | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 2 | Security Group Debugger | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 3 | VPC Flow Logs Analyzer | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 4 | VPC Peering vs TGW Lab | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| 5 | NAT Gateway Cost Optimizer | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 6 | Multi-Account Network | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7 | Site-to-Site VPN Lab | Advanced | 1-2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 8 | Network Firewall Deployment | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9 | Global Accelerator & CloudFront | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10 | VPC Lattice Service Mesh | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 11 | PrivateLink Service Provider | Expert | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 12 | Multi-Region Architecture | Master | 1-2 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |

For Beginners (Start Here)

  1. Project 1: Production VPC - The foundation of everything
  2. Project 2: Security Group Debugger - Understand traffic flow
  3. Project 5: NAT Gateway Optimizer - Learn about VPC Endpoints

For Intermediate Cloud Engineers

  1. Project 3: VPC Flow Logs Analyzer - Visibility into your network
  2. Project 4: VPC Peering vs TGW Lab - Key architectural decision
  3. Project 9: Global Accelerator & CloudFront - Edge networking

For Advanced Engineers

  1. Project 7: Site-to-Site VPN - Hybrid connectivity
  2. Project 8: Network Firewall - Perimeter security
  3. Project 6: Multi-Account Network - Enterprise patterns

For Experts

  1. Project 10: VPC Lattice - Modern service networking
  2. Project 11: PrivateLink Provider - SaaS architecture
  3. Project 12: Multi-Region Architecture - Everything combined

Essential Resources

AWS Documentation

Architecture Guides

Security

Books

  • “Computer Networks, Fifth Edition” by Andrew Tanenbaum and David Wetherall - Networking fundamentals
  • “The Practice of Network Security Monitoring” by Richard Bejtlich - Security analysis
  • “High Performance Browser Networking” by Ilya Grigorik - Edge networking

Summary

| # | Project | Main Language |
|---|---------|---------------|
| 1 | Build a Production-Ready VPC from Scratch | Terraform |
| 2 | Security Group Traffic Flow Debugger | Python |
| 3 | VPC Flow Logs Analyzer | Python |
| 4 | VPC Peering vs Transit Gateway Lab | Terraform |
| 5 | NAT Gateway Deep Dive & Cost Optimizer | Python |
| 6 | Multi-Account Network with AWS Organizations | Terraform |
| 7 | Site-to-Site VPN with Simulated On-Premises | Terraform |
| 8 | AWS Network Firewall Deployment | Terraform |
| 9 | Global Accelerator & CloudFront Edge Networking | Terraform |
| 10 | VPC Lattice for Service-to-Service Networking | Terraform |
| 11 | PrivateLink Service Provider | Terraform |
| 12 | Complete Multi-Region Network Architecture | Terraform |

Why This Path Works

By completing these projects, you won’t just “know” AWS networking—you’ll understand it deeply:

  1. You’ll understand isolation (VPCs, subnets, Security Groups)
  2. You’ll understand connectivity (Peering, Transit Gateway, VPN, Direct Connect)
  3. You’ll understand security (NACLs, Network Firewall, PrivateLink)
  4. You’ll understand scale (multi-account, multi-region)
  5. You’ll understand optimization (VPC Endpoints, cost analysis)
  6. You’ll understand observability (Flow Logs, metrics, debugging)

Most importantly, you’ll be able to design networks for real enterprises—the kind that handle millions of requests, maintain security compliance, and scale globally.

Build these projects, and you’ll go from “AWS user” to “AWS network architect.”