

AWS Networking Deep Dive: From Zero to Cloud Network Architect

Why This Matters

Every application running in AWS sits on top of a network. If you don’t understand AWS networking, you’re essentially building skyscrapers without understanding the foundation. Misconfigurations in AWS networking are among the most common causes of outages, security breaches, and cost overruns.

The Problems AWS Networking Solves:

  • How do you isolate workloads in a shared cloud environment?
  • How do resources in the cloud talk to each other securely?
  • How do you connect your on-premises data center to AWS?
  • How do you control who can access what, and from where?
  • How do you build networks that span regions and accounts?

What You’ll Understand After This Learning Path:

  • Why VPCs exist and how they provide isolation
  • How subnets, route tables, and gateways work together
  • The difference between Security Groups and NACLs (and when to use each)
  • How to choose between VPC Peering and Transit Gateway
  • How hybrid connectivity (VPN, Direct Connect) works
  • How to design highly available, multi-region networks
  • How to secure traffic at the network perimeter
  • How to optimize costs while maintaining security

Goal: AWS Networking Mastery

By the end of this learning path, you will possess a deep, internalized understanding of AWS networking that goes far beyond surface-level knowledge. You won’t just know what a VPC is—you’ll understand why AWS architected it this way, how packets flow through every component, where security boundaries exist, and how to design resilient multi-region networks that can handle real-world complexity.

You’ll be able to:

  • Visualize packet flow through complex AWS network topologies in your mind
  • Debug network issues by understanding exactly where traffic can and cannot go
  • Design secure, cost-effective networks that follow AWS best practices
  • Make architectural decisions about VPC Peering vs Transit Gateway vs PrivateLink
  • Implement hybrid connectivity that bridges on-premises and cloud seamlessly
  • Explain to others why AWS networking works the way it does, not just how to configure it

This isn’t about memorizing console clicks. It’s about building a mental model so complete that AWS networking becomes intuitive.


Concept Internalization Map

This table maps each major concept cluster to what you need to deeply internalize—not just memorize, but truly understand.

VPC Fundamentals
  • How AWS implements logical isolation using encapsulation
  • Why CIDR block selection matters (the primary CIDR can’t be changed later; you can only add secondary blocks)
  • How a VPC spans multiple AZs while subnets are AZ-specific
  • The relationship between VPC, subnets, route tables, and gateways
  • Default VPC vs custom VPC tradeoffs

Subnets & Routing
  • The route table longest-prefix-matching algorithm
  • Why “public” vs “private” is purely a routing distinction, not a subnet setting
  • How a packet flows through route tables, step by step
  • NAT Gateway vs NAT Instance tradeoffs
  • The implicit router at the VPC CIDR base + 1 (e.g., 10.0.0.1 in a 10.0.0.0/16 VPC)
  • Why you lose 5 IPs per subnet (AWS reserved addresses)

Security Layers
  • Stateful vs stateless packet filtering mechanics
  • Connection tracking in Security Groups (how it actually works)
  • Why NACLs need ephemeral port ranges (1024-65535)
  • Defense in depth: when to use SGs, NACLs, WAF, Network Firewall
  • Numbered rule evaluation (NACL) vs all-rules evaluation (SG)
  • How to debug “connection refused” vs “timeout” (SG vs NACL/routing)

VPC Connectivity
  • Peering non-transitivity and why it exists
  • Transit Gateway routing and route table propagation
  • When to use Peering vs TGW vs PrivateLink
  • Cross-region peering limitations
  • How to avoid CIDR overlaps across VPCs
  • Shared VPC vs multi-VPC strategies

Hybrid Networking
  • BGP (Border Gateway Protocol) basics for Direct Connect
  • IPSec tunnel establishment and packet encapsulation
  • How AWS uses VLAN tagging for Direct Connect virtual interfaces
  • VPN vs DX cost structure (port hours vs data transfer)
  • Failover scenarios and routing priority
  • MACsec encryption for Direct Connect

DNS & Service Discovery
  • Route 53 Resolver and VPC DNS (the +2 address in each VPC)
  • Private Hosted Zones for internal DNS
  • How enableDnsHostnames and enableDnsSupport work
  • DNS resolution over VPC peering/TGW
  • Route 53 Resolver endpoints for hybrid DNS

Advanced Patterns
  • VPC Endpoints (Gateway vs Interface)
  • PrivateLink for service exposure without peering
  • IPv6 dual-stack VPCs
  • VPC Flow Logs for network monitoring
  • Network Firewall for perimeter security
  • Transit Gateway Network Manager for multi-region
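
The “lose 5 IPs per subnet” point is easy to verify with Python’s standard ipaddress module. A quick sketch (the five reserved addresses are the network address, the router at +1, the DNS resolver at +2, one address reserved for future use at +3, and the broadcast address):

```python
import ipaddress

# AWS reserves 5 addresses in every subnet: network address, VPC router (+1),
# DNS resolver (+2), reserved for future use (+3), and broadcast (last).
AWS_RESERVED_PER_SUBNET = 5

def usable_hosts(cidr: str) -> int:
    """Return the number of instance-assignable IPs in an AWS subnet."""
    subnet = ipaddress.ip_network(cidr)
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET

print(usable_hosts("10.0.1.0/24"))  # 251 (256 - 5)
print(usable_hosts("10.0.1.0/28"))  # 11  (16 - 5; /28 is the smallest AWS subnet)
```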

Deep Dive Reading by Concept

Map AWS networking concepts to foundational knowledge in your technical books. AWS networking is built on standard networking principles—these books will give you the “why” behind AWS’s design decisions.

1. VPC & Subnets → Network Fundamentals

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 1: Sections on network layering, internetworking concepts
  • Chapter 5: The Network Layer - IP addressing, subnetting, CIDR notation
    • Section 5.6: IP Addresses (crucial for understanding VPC CIDR blocks)
    • Section 5.6.3: CIDR and route aggregation
    • Section 5.6.4: NAT (directly applicable to AWS NAT Gateways)

Why: AWS VPCs implement standard IP networking. Understanding CIDR, subnetting math, and network masks is essential for proper VPC design.

Key Takeaway: When you create a VPC with 10.0.0.0/16, you’re using CIDR notation that comes from the need to efficiently aggregate routes on the Internet. AWS didn’t invent this—they’re using established networking standards.
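
To make the CIDR math concrete, here is a short stdlib-only Python sketch that carves a 10.0.0.0/16 VPC into /24 subnets and aggregates them back, the same subnetting and route-aggregation arithmetic Tanenbaum covers:

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")

# A /16 contains 2^(24-16) = 256 possible /24 subnets.
subnets = list(vpc.subnets(new_prefix=24))
print(len(subnets))      # 256
print(subnets[10])       # 10.0.10.0/24

# Aggregation in reverse: all 256 /24s collapse back into the single /16.
print(list(ipaddress.collapse_addresses(subnets)))  # [IPv4Network('10.0.0.0/16')]
```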


2. Route Tables & Packet Forwarding → Routing Algorithms

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 5: The Network Layer
    • Section 5.2: Routing algorithms (how routers make forwarding decisions)
    • Section 5.3: Hierarchical routing
    • Section 5.4: Broadcast and multicast routing

Why: AWS route tables use longest prefix matching—a fundamental routing concept. Understanding how routers make forwarding decisions helps you design efficient route tables.

Key Takeaway: The implicit router in every VPC (at the +1 address) uses the same forwarding logic as any Internet router. Route table lookups aren’t AWS magic—they’re standard routing algorithms.
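
A minimal Python sketch of that forwarding logic (route_lookup is a toy helper, not an AWS API; the route table mirrors a typical private subnet):

```python
import ipaddress

def route_lookup(dest: str, route_table: dict) -> str:
    """Pick the target for dest via longest-prefix match,
    the same decision the VPC's implicit router makes."""
    dest_ip = ipaddress.ip_address(dest)
    best = None
    for cidr, target in route_table.items():
        net = ipaddress.ip_network(cidr)
        # A route matches if the destination falls inside its CIDR;
        # among matches, the longest (most specific) prefix wins.
        if dest_ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, target)
    if best is None:
        raise ValueError(f"no route to {dest}")
    return best[1]

routes = {
    "10.0.0.0/16": "local",
    "0.0.0.0/0":   "nat-gateway-id",
}
print(route_lookup("10.0.10.50", routes))    # local (the /16 beats the /0)
print(route_lookup("140.82.121.4", routes))  # nat-gateway-id (catch-all)
```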


3. Security Groups & NACLs → Firewalls & Packet Filtering

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 8: Network Security
    • Section 8.9: Firewalls (packet filtering, stateful inspection)

TCP/IP Illustrated, Volume 1 by Stevens

  • Chapter 13: TCP Connection Management
    • The three-way handshake, which explains how a stateful firewall recognizes return traffic
    • Section on connection tracking and state tables

Why: Security Groups are stateful packet filters—they track TCP connections. NACLs are stateless. Understanding the TCP state machine explains why Security Groups can auto-allow return traffic.

Key Takeaway: When you allow inbound port 80 in a Security Group, AWS maintains a connection tracking table (like conntrack in Linux iptables). This is why return traffic on ephemeral ports is automatically allowed.


4. NAT Gateway → Network Address Translation

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 5: The Network Layer
    • Section 5.6.4: Network Address Translation (NAT)
    • How NAT maintains translation tables
    • Port Address Translation (PAT/NAPT)

Why: AWS NAT Gateways use PAT to allow many private IPs to share one public IP. Understanding the NAT translation table helps you debug connectivity issues.

Key Takeaway: NAT is a workaround for IPv4 address exhaustion. AWS NAT Gateways translate thousands of private IPs to a single Elastic IP using port mapping—not AWS-specific, but an industry-standard technique.
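
A toy model of that translation table helps the idea stick. The class name and port pool below are illustrative, not how AWS implements it:

```python
import itertools

class NatGateway:
    """Toy PAT table: many private (ip, port) pairs share one Elastic IP."""

    def __init__(self, elastic_ip: str):
        self.elastic_ip = elastic_ip
        self._ports = itertools.count(32000)  # assumption: arbitrary port pool
        self.table = {}                       # (src_ip, src_port) -> nat_port

    def translate_out(self, src_ip, src_port, dst_ip, dst_port):
        key = (src_ip, src_port)
        if key not in self.table:
            self.table[key] = next(self._ports)
        # The packet leaves with the NAT's public IP and mapped port.
        return (self.elastic_ip, self.table[key], dst_ip, dst_port)

    def translate_in(self, dst_port):
        # Reverse lookup for return traffic.
        for (ip, port), nat_port in self.table.items():
            if nat_port == dst_port:
                return (ip, port)
        return None  # unsolicited inbound: no entry, packet dropped

nat = NatGateway("54.123.45.67")
print(nat.translate_out("10.0.10.50", 49152, "140.82.121.4", 443))
# ('54.123.45.67', 32000, '140.82.121.4', 443)
print(nat.translate_in(32000))  # ('10.0.10.50', 49152)
print(nat.translate_in(40000))  # None
```

The `translate_in` returning None for an unknown port is why a NAT Gateway never accepts unsolicited inbound connections.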


5. VPN & IPSec → Cryptographic Tunnels

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 8: Network Security
    • Section 8.7: IPSec (the protocol AWS VPN uses)
    • Section 8.8: VPNs
    • Tunnel mode vs transport mode

Why: AWS Site-to-Site VPN uses IPSec. Understanding how IPSec encrypts and encapsulates packets helps you troubleshoot VPN connectivity and understand performance characteristics.

Key Takeaway: IPSec adds overhead—encryption/decryption CPU cost and packet size increase. This is why VPN throughput is limited compared to Direct Connect’s raw physical connection.


6. BGP & Direct Connect → Routing Protocols

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 5: The Network Layer
    • Section 5.3.4: Routing in the Internet
    • Border Gateway Protocol (BGP) basics

Why: AWS Direct Connect uses BGP to exchange routes between your network and AWS. Understanding BGP path selection helps you control traffic flow and implement proper failover.

Key Takeaway: BGP is how the entire Internet exchanges routing information. AWS Direct Connect uses the same protocol, so learning BGP fundamentals gives you control over how traffic routes between your data center and AWS.


7. DNS in VPC → Domain Name System

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 7: The Application Layer
    • Section 7.1: DNS
    • Recursive vs iterative queries
    • DNS caching

Why: Every VPC has a DNS resolver at the +2 address (e.g., 10.0.0.2 in a 10.0.0.0/16 VPC). Route 53 Resolver endpoints allow DNS queries across hybrid networks.

Key Takeaway: AWS Route 53 Resolver is just a DNS recursive resolver—same concept as running your own DNS server, but managed by AWS.
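
Computing the resolver address is just CIDR base + 2; a one-function sketch:

```python
import ipaddress

def vpc_dns_resolver(vpc_cidr: str) -> str:
    """The VPC DNS resolver lives at the CIDR base address + 2."""
    net = ipaddress.ip_network(vpc_cidr)
    return str(net.network_address + 2)

print(vpc_dns_resolver("10.0.0.0/16"))    # 10.0.0.2
print(vpc_dns_resolver("172.31.0.0/16"))  # 172.31.0.2 (the default VPC's CIDR)
```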


8. VPC Flow Logs → Network Monitoring

The Practice of Network Security Monitoring by Bejtlich

  • Chapter 2: Network Traffic Collection
  • Chapter 3: Network Traffic Analysis
    • Flow data vs packet capture
    • NetFlow and similar technologies

Why: VPC Flow Logs capture metadata about traffic (source, destination, ports, protocol, action). Understanding flow data helps you monitor and troubleshoot network security.

Key Takeaway: VPC Flow Logs are AWS’s implementation of NetFlow/IPFIX—industry-standard network monitoring. They don’t capture packet payloads, only metadata, which is why they’re efficient.
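
The default (version 2) flow log record is a space-delimited line with a fixed field order, so parsing one takes a few lines of Python. The record below is synthetic:

```python
# Default (version 2) flow log fields, in order:
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

def parse_flow_record(line: str) -> dict:
    """Split one space-delimited flow log record into named fields."""
    return dict(zip(FIELDS, line.split()))

# Synthetic record: TCP (protocol 6) traffic to port 443, accepted.
record = parse_flow_record(
    "2 123456789012 eni-0abc123 10.0.10.50 140.82.121.4 "
    "49152 443 6 10 8400 1670000000 1670000060 ACCEPT OK"
)
print(record["action"])   # ACCEPT
print(record["dstport"])  # 443
```

Note there is no payload anywhere in the record, only the metadata, which is exactly the flow-data vs packet-capture distinction Bejtlich draws.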


9. TCP/IP Deep Dive → Protocol Understanding

TCP/IP Illustrated, Volume 1 by Stevens

  • Chapter 2: The Internet Protocol
  • Chapter 13: TCP Connection Management
  • Chapter 17: TCP Interactive Data Flow
  • Chapter 20: TCP Bulk Data Flow

Why: Debugging AWS network issues requires understanding TCP behavior—SYN floods, connection timeouts, TCP window scaling, MTU issues.

Key Takeaway: AWS networking doesn’t change how TCP works. If you understand TCP’s three-way handshake, you’ll understand why a Security Group blocking port 80 results in a timeout (SYN never gets ACK) vs a connection refused.
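
You can observe the distinction directly with Python’s socket module: connecting to a loopback port with no listener draws an immediate RST (connection refused), while a silently dropped SYN would surface as a timeout. The probe helper is illustrative:

```python
import socket

def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a TCP connect attempt the way the text describes."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        # SYN reached the host, which answered RST: nothing listening.
        return "refused"
    except socket.timeout:
        # SYN (or its SYN-ACK) was silently dropped: SG/NACL/routing.
        return "timeout"
    finally:
        s.close()

# Loopback with nothing listening on port 1 answers RST immediately:
print(probe("127.0.0.1", 1))  # refused
```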


10. Linux Network Stack → Practical Implementation

The Linux Programming Interface by Kerrisk

  • Chapter 59: Sockets: Internet Domains
  • Chapter 61: Sockets: Advanced Topics

Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron

  • Chapter 11: Network Programming
    • Section 11.3: The Sockets Interface
    • Section 11.4: Client-Server model

Why: EC2 instances run Linux (or Windows, but Linux dominates). Understanding the socket API, network namespaces, and iptables helps you debug instance-level networking.

Key Takeaway: EC2 networking is just Linux networking with AWS-managed interfaces (ENIs). If you know how Linux network namespaces work, you’ll understand how containers on ECS/EKS get network isolation.
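
A minimal loopback echo shows the socket API in question; nothing in it is AWS-specific, which is exactly the point:

```python
import socket
import threading

# The same Berkeley sockets API an EC2 instance uses: AWS attaches the ENI
# underneath, but the programming model is plain Linux networking.

def serve(listener: socket.socket) -> None:
    conn, _ = listener.accept()
    with conn:
        conn.sendall(conn.recv(1024))  # echo back whatever arrived

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))        # port 0: kernel picks a free port
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=serve, args=(listener,), daemon=True).start()

client = socket.create_connection(("127.0.0.1", port))
client.sendall(b"hello from the app tier")
echoed = client.recv(1024)
client.close()
listener.close()
print(echoed)                          # b'hello from the app tier'
```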


11. TLS/SSL & Certificates → Secure Communications

Computer Networks, Fifth Edition by Tanenbaum

  • Chapter 8: Network Security
    • Section 8.6: Communication Security (TLS/SSL)

High Performance Browser Networking by Ilya Grigorik

  • Chapter 4: Transport Layer Security (TLS)
    • TLS handshake
    • Certificate validation
    • Performance implications

Why: AWS Certificate Manager (ACM), Application Load Balancer TLS termination, and VPN encryption all use TLS/SSL. Understanding certificate chains and handshakes helps you troubleshoot HTTPS issues.

Key Takeaway: TLS adds latency (handshake) and CPU cost (encryption). Understanding this helps you make decisions about where to terminate TLS (ALB vs EC2) and when to use HTTP/2.


12. Performance & Latency → Network Optimization

High Performance Browser Networking by Ilya Grigorik

  • Chapter 1: Primer on Latency and Bandwidth
  • Chapter 2: Building Blocks of TCP
    • TCP’s impact on application performance
    • Slow start, congestion control

Why: AWS networking has latency characteristics—cross-AZ, cross-region, to Internet. Understanding TCP’s behavior helps you optimize application performance.

Key Takeaway: Cross-AZ traffic has ~1-2ms latency. Cross-region has 10-100+ms. TCP slow start means the first few KB of a connection are slower. This knowledge helps you design distributed systems on AWS.
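
A back-of-the-envelope sketch of slow start’s cost, assuming a typical initial congestion window of 10 segments and a 1460-byte MSS (real stacks add loss recovery and ssthresh behavior this ignores):

```python
def rtts_to_send(total_bytes: int, mss: int = 1460, initcwnd: int = 10) -> int:
    """Round trips needed to push total_bytes during slow start:
    the congestion window doubles each RTT until the data is sent."""
    cwnd, sent, rtts = initcwnd, 0, 0
    while sent < total_bytes:
        sent += cwnd * mss
        cwnd *= 2
        rtts += 1
    return rtts

# The first ~14.6 KB fit in the initial window; larger payloads pay extra RTTs,
# and each extra RTT costs far more cross-region than cross-AZ.
print(rtts_to_send(10_000))     # 1
print(rtts_to_send(100_000))    # 3
print(rtts_to_send(1_000_000))  # 7
```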


How AWS Networking Actually Works: A Mental Model

Before you touch Terraform or the AWS console, you need a mental model of what’s actually happening when you create a VPC. This isn’t about memorizing console clicks—it’s about understanding the underlying mechanisms so deeply that you can debug any networking problem.

The Physical Reality Behind the Abstraction

When you create a VPC, you’re not actually creating a physical network. AWS runs a software-defined network (SDN) on top of its physical infrastructure. Here’s what’s actually happening:

┌─────────────────────────────────────────────────────────────────────────┐
│                        AWS PHYSICAL INFRASTRUCTURE                       │
│                                                                         │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                 │
│  │  Physical   │    │  Physical   │    │  Physical   │                 │
│  │  Server A   │    │  Server B   │    │  Server C   │   ... thousands │
│  │  (Host)     │    │  (Host)     │    │  (Host)     │       more      │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘                 │
│         │                  │                  │                         │
│         └──────────────────┴──────────────────┘                         │
│                            │                                            │
│              ┌─────────────┴─────────────┐                              │
│              │    AWS Physical Network    │                              │
│              │    (High-speed backbone)   │                              │
│              └─────────────┬─────────────┘                              │
└─────────────────────────────┼───────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    AWS SOFTWARE-DEFINED NETWORK LAYER                    │
│                                                                         │
│   Your VPC (10.0.0.0/16) is a LOGICAL construct overlaid on physical    │
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐  │
│   │                         YOUR VPC                                 │  │
│   │                      (10.0.0.0/16)                               │  │
│   │                                                                  │  │
│   │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │  │
│   │  │Public Subnet │  │Private Subnet│  │  DB Subnet   │           │  │
│   │  │ 10.0.1.0/24  │  │ 10.0.10.0/24 │  │ 10.0.20.0/24 │           │  │
│   │  │              │  │              │  │              │           │  │
│   │  │  ┌───────┐   │  │  ┌───────┐   │  │  ┌───────┐   │           │  │
│   │  │  │ EC2-1 │   │  │  │ EC2-2 │   │  │  │  RDS  │   │           │  │
│   │  │  │10.0.1.│   │  │  │10.0.10│   │  │  │10.0.20│   │           │  │
│   │  │  │  50   │   │  │  │  50   │   │  │  │  50   │   │           │  │
│   │  │  └───────┘   │  │  └───────┘   │  │  └───────┘   │           │  │
│   │  └──────────────┘  └──────────────┘  └──────────────┘           │  │
│   └─────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│   Even though EC2-1 and EC2-2 might be on DIFFERENT physical servers,  │
│   they see each other as if on the same logical 10.0.0.0/16 network    │
└─────────────────────────────────────────────────────────────────────────┘

The Key Insight: Your VPC isn’t a physical network—it’s a set of rules that AWS’s SDN enforces. When EC2-1 sends a packet to EC2-2, the packet might traverse multiple physical switches and routers, but the SDN layer makes it appear as if they’re on the same local network.

How a Packet Flows Through Your VPC

Let’s trace a packet from an EC2 instance in a private subnet trying to reach the internet. This is the single most important thing to understand:

Step-by-Step: EC2 in Private Subnet → Internet

┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 1: EC2 Instance Generates Packet                                    │
│                                                                         │
│   EC2 (10.0.10.50) wants to reach api.github.com (140.82.121.4)         │
│                                                                         │
│   Packet created:                                                        │
│   ┌───────────────────────────────────────────────────────────────┐     │
│   │ Src IP: 10.0.10.50 │ Dst IP: 140.82.121.4 │ Dst Port: 443    │     │
│   └───────────────────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 2: Security Group Check (Outbound)                                  │
│                                                                         │
│   EC2's Security Group is checked for OUTBOUND rules                     │
│                                                                         │
│   sg-app-tier rules:                                                     │
│   ┌────────────────────────────────────────────────────────────────┐    │
│   │ Type     │ Protocol │ Port   │ Destination   │ Action          │    │
│   ├──────────┼──────────┼────────┼───────────────┼─────────────────┤    │
│   │ All      │ All      │ All    │ 0.0.0.0/0     │ ALLOW           │    │
│   └────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│   ✓ Outbound to 140.82.121.4:443 is ALLOWED                             │
│   (Security Groups are stateful - return traffic auto-allowed)          │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 3: Route Table Lookup                                               │
│                                                                         │
│   Private subnet's route table is consulted:                             │
│                                                                         │
│   ┌────────────────────────────────────────────────────────────────┐    │
│   │ Destination    │ Target                                        │    │
│   ├────────────────┼──────────────────────────────────────────────┤    │
│   │ 10.0.0.0/16    │ local (within VPC)                           │    │
│   │ 0.0.0.0/0      │ nat-gateway-id (NAT Gateway)                 │    │
│   └────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│   Destination 140.82.121.4 matches 0.0.0.0/0 → Send to NAT Gateway      │
│   (Longest prefix match: /0 is the catch-all for anything not local)    │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 4: NACL Check (Outbound from Private Subnet)                        │
│                                                                         │
│   Network ACL for private subnet is checked:                             │
│                                                                         │
│   ┌────────────────────────────────────────────────────────────────┐    │
│   │ Rule # │ Type  │ Protocol │ Port    │ Dest      │ Allow/Deny  │    │
│   ├────────┼───────┼──────────┼─────────┼───────────┼─────────────┤    │
│   │ 100    │ All   │ All      │ All     │ 0.0.0.0/0 │ ALLOW       │    │
│   │ *      │ All   │ All      │ All     │ 0.0.0.0/0 │ DENY        │    │
│   └────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│   ✓ Rule 100 matches → ALLOW                                            │
│   (NACLs are stateless - must also allow return traffic separately!)    │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 5: NAT Gateway Performs Translation                                 │
│                                                                         │
│   NAT Gateway in PUBLIC subnet receives packet and:                      │
│   1. Replaces source IP with its Elastic IP                              │
│   2. Tracks the connection in its translation table                      │
│                                                                         │
│   Original packet:                                                       │
│   ┌───────────────────────────────────────────────────────────────┐     │
│   │ Src: 10.0.10.50:49152 │ Dst: 140.82.121.4:443                │     │
│   └───────────────────────────────────────────────────────────────┘     │
│                              ↓                                          │
│   Translated packet:                                                     │
│   ┌───────────────────────────────────────────────────────────────┐     │
│   │ Src: 54.123.45.67:32001 │ Dst: 140.82.121.4:443              │     │
│   └───────────────────────────────────────────────────────────────┘     │
│                                                                         │
│   Translation table entry:                                               │
│   ┌───────────────────────────────────────────────────────────────┐     │
│   │ Internal: 10.0.10.50:49152 ↔ External: 54.123.45.67:32001    │     │
│   │ Destination: 140.82.121.4:443                                 │     │
│   └───────────────────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 6: Route to Internet Gateway                                        │
│                                                                         │
│   Public subnet's route table:                                           │
│   ┌────────────────────────────────────────────────────────────────┐    │
│   │ Destination    │ Target                                        │    │
│   ├────────────────┼──────────────────────────────────────────────┤    │
│   │ 10.0.0.0/16    │ local                                        │    │
│   │ 0.0.0.0/0      │ igw-id (Internet Gateway)                    │    │
│   └────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│   Packet sent to Internet Gateway → AWS backbone → Internet             │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 7: Response Returns (Reverse Path)                                  │
│                                                                         │
│   api.github.com responds:                                               │
│   ┌───────────────────────────────────────────────────────────────┐     │
│   │ Src: 140.82.121.4:443 │ Dst: 54.123.45.67:32001              │     │
│   └───────────────────────────────────────────────────────────────┘     │
│                                                                         │
│   1. IGW receives, routes to NAT Gateway (it's the destination)         │
│   2. NAT Gateway looks up translation table, reverse-translates:        │
│      ┌───────────────────────────────────────────────────────────┐      │
│      │ Src: 140.82.121.4:443 │ Dst: 10.0.10.50:49152             │      │
│      └───────────────────────────────────────────────────────────┘      │
│   3. Route table sends to private subnet (10.0.10.0/24 is local)        │
│   4. NACL inbound check (must ALLOW ephemeral ports 1024-65535!)        │
│   5. Security Group: return traffic auto-allowed (stateful!)            │
│   6. Packet delivered to EC2                                            │
└─────────────────────────────────────────────────────────────────────────┘

The Critical Insight: Notice how Security Groups and NACLs behave differently on the return path:

  • Security Group: Automatically allows return traffic (stateful)
  • NACL: Must explicitly allow inbound traffic on ephemeral ports (stateless)

This is why misconfigured NACLs cause “I can send but not receive” problems.


Security Groups vs NACLs: The Deep Dive

This is the most misunderstood part of AWS networking. Let’s break it down completely.

Connection Tracking: How Security Groups Work

Security Groups are stateful firewalls. This means they track connections. Here’s what happens under the hood:

┌─────────────────────────────────────────────────────────────────────────┐
│                    HOW SECURITY GROUP CONNECTION TRACKING WORKS          │
│                                                                         │
│   When you allow inbound port 443, AWS maintains a connection table:    │
│                                                                         │
│   CLIENT (203.0.113.50:52341) ──── SYN ────► EC2 (10.0.1.50:443)        │
│                                                                         │
│   Security Group sees: "New inbound connection to allowed port 443"     │
│                                                                         │
│   Connection Table Entry Created:                                        │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │ Connection ID: 12345                                             │   │
│   │ Protocol: TCP                                                    │   │
│   │ Local: 10.0.1.50:443                                             │   │
│   │ Remote: 203.0.113.50:52341                                       │   │
│   │ State: ESTABLISHED                                               │   │
│   │ Direction: INBOUND (originally)                                  │   │
│   │ Created: 2024-12-22 14:32:01                                     │   │
│   │ Last Activity: 2024-12-22 14:32:05                               │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│   EC2 (10.0.1.50:443) ──── SYN-ACK ────► CLIENT (203.0.113.50:52341)   │
│                                                                         │
│   Security Group sees: "Outbound to 203.0.113.50:52341"                 │
│   Checks connection table: "This is return traffic for connection 12345"│
│   Result: ALLOWED (no outbound rule needed!)                            │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
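
The tracking behavior can be sketched as a toy Python class (a deliberate simplification; the real connection-tracking state machine follows the full TCP state diagram):

```python
class SecurityGroup:
    """Toy stateful filter: inbound allow-rules plus a connection table."""

    def __init__(self, inbound_ports):
        self.inbound_ports = set(inbound_ports)
        self.connections = set()  # (remote_ip, remote_port, local_port)

    def inbound(self, src_ip, src_port, dst_port):
        if dst_port in self.inbound_ports:
            # Allowed inbound traffic creates a connection-table entry.
            self.connections.add((src_ip, src_port, dst_port))
            return "ALLOW"
        return "DROP"

    def outbound(self, dst_ip, dst_port, src_port):
        # Return traffic for a tracked connection needs no outbound rule.
        if (dst_ip, dst_port, src_port) in self.connections:
            return "ALLOW (tracked return traffic)"
        return "check outbound rules"

sg = SecurityGroup(inbound_ports=[443])
print(sg.inbound("203.0.113.50", 52341, 443))   # ALLOW
print(sg.outbound("203.0.113.50", 52341, 443))  # ALLOW (tracked return traffic)
```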

NACLs: Stateless Packet Filtering

NACLs don’t track connections. Each packet is evaluated independently:

┌─────────────────────────────────────────────────────────────────────────┐
│                    HOW NACLs EVALUATE PACKETS (STATELESS)                │
│                                                                         │
│   CLIENT (203.0.113.50:52341) ──── SYN ────► EC2 (10.0.1.50:443)        │
│                                                                         │
│   NACL Inbound Evaluation:                                               │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │ Packet: Src=203.0.113.50:52341, Dst=10.0.1.50:443, TCP          │   │
│   │                                                                  │   │
│   │ Rule 100: Allow TCP 443 from 0.0.0.0/0 → MATCH! → ALLOW         │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│   EC2 (10.0.1.50:443) ──── SYN-ACK ────► CLIENT (203.0.113.50:52341)   │
│                                                                         │
│   NACL Outbound Evaluation:                                              │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │ Packet: Src=10.0.1.50:443, Dst=203.0.113.50:52341, TCP          │   │
│   │                                                                  │   │
│   │ Rule 100: Allow TCP 443 from 0.0.0.0/0                          │   │
│   │          Port 443? NO! Destination port is 52341                │   │
│   │          → NO MATCH                                              │   │
│   │                                                                  │   │
│   │ You need: Allow TCP 1024-65535 to 0.0.0.0/0 (ephemeral ports)   │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│   ⚠️  WITHOUT EPHEMERAL PORT RULE, RETURN TRAFFIC IS BLOCKED!           │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
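
The same scenario as a stateless Python sketch, with the rule format simplified to (number, protocol, port range, action):

```python
def nacl_eval(rules, packet):
    """Stateless evaluation: rules checked in ascending number order,
    first match wins; the implicit '*' rule denies everything else."""
    for num, proto, (lo, hi), action in sorted(rules):
        if proto == packet["proto"] and lo <= packet["dst_port"] <= hi:
            return f"rule {num}: {action}"
    return "rule *: DENY"

inbound_only_443 = [(100, "tcp", (443, 443), "ALLOW")]

syn = {"proto": "tcp", "dst_port": 443}
print(nacl_eval(inbound_only_443, syn))      # rule 100: ALLOW

# Return traffic targets the client's ephemeral port, so no rule matches:
syn_ack = {"proto": "tcp", "dst_port": 52341}
print(nacl_eval(inbound_only_443, syn_ack))  # rule *: DENY

# Adding the ephemeral-port rule fixes it:
with_ephemeral = inbound_only_443 + [(200, "tcp", (1024, 65535), "ALLOW")]
print(nacl_eval(with_ephemeral, syn_ack))    # rule 200: ALLOW
```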

The Complete Security Groups vs NACLs Comparison

Aspect          Security Groups                        NACLs
──────────────  ─────────────────────────────────────  ─────────────────────────────────────
State           Stateful (tracks connections)          Stateless (each packet independent)
Scope           Instance/ENI level                     Subnet level
Rules           Allow only                             Allow AND Deny
Rule order      All rules evaluated                    Evaluated in numbered order
Return traffic  Automatically allowed                  Must be explicitly allowed
Default         Deny all inbound, allow all outbound   Allow all (default NACL)
Use case        Primary security layer                 Block specific IPs, defense-in-depth

When a Connection Fails: Timeout vs Connection Refused

Understanding error messages helps you debug:

┌─────────────────────────────────────────────────────────────────────────┐
│                    DIAGNOSING NETWORK FAILURES                           │
│                                                                         │
│  SCENARIO 1: Connection Timeout                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│  $ curl https://10.0.2.50:443                                           │
│  curl: (28) Connection timed out after 30001 milliseconds               │
│                                                                         │
│  What happened:                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ CLIENT ──── SYN ────► [DROPPED SILENTLY] ──✗                    │    │
│  │                                                                  │    │
│  │ Packet never reached destination or response never came back    │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│  Causes:                                                                 │
│  • Security Group blocking (drops packet silently)                      │
│  • NACL blocking (drops packet silently)                                │
│  • Route table misconfiguration (packet sent to wrong place)            │
│  • NAT Gateway issue (no route to internet)                             │
│                                                                         │
│  ─────────────────────────────────────────────────────────────────────  │
│  SCENARIO 2: Connection Refused                                          │
│  ─────────────────────────────────────────────────────────────────────  │
│  $ curl https://10.0.2.50:443                                           │
│  curl: (7) Failed to connect to 10.0.2.50 port 443: Connection refused │
│                                                                         │
│  What happened:                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ CLIENT ──── SYN ────► EC2 ──── RST ────► CLIENT                 │    │
│  │                                                                  │    │
│  │ Packet reached destination, but nothing listening on that port  │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                         │
│  Causes:                                                                 │
│  • Service not running on target port                                   │
│  • Service bound to localhost only (127.0.0.1)                          │
│  • iptables on EC2 blocking (OS-level firewall)                         │
│                                                                         │
│  KEY INSIGHT: Connection refused means network path is WORKING!         │
│  The problem is at the application level, not network level.            │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
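
You can feel the difference between the two failure modes locally with Python's socket module (a local illustration; the port number is whatever the OS hands out):

```python
import socket

# Grab an ephemeral TCP port on localhost, then close the listener so
# nothing is bound to it when we connect.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
closed_port = probe.getsockname()[1]
probe.close()

try:
    socket.create_connection(("127.0.0.1", closed_port), timeout=2)
    result = "connected"
except ConnectionRefusedError:
    # The host answered with a TCP RST: the network path works,
    # but nothing is listening on that port.
    result = "refused"
except socket.timeout:
    # No answer at all: the packet was dropped (SG/NACL/routing).
    result = "timeout"

print(result)
```

Connecting to a reachable host with no listener fails fast with "refused"; a silently dropped packet (the Security Group/NACL case) would instead burn the full timeout.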

The VPC Address Space: Reserved IPs and CIDR Math

Every subnet loses 5 IP addresses to AWS. Understanding why helps you plan capacity:

┌─────────────────────────────────────────────────────────────────────────┐
│                    SUBNET: 10.0.1.0/24 (256 IPs theoretically)           │
│                                                                         │
│   Reserved by AWS:                                                       │
│   ┌───────────────┬───────────────────────────────────────────────────┐ │
│   │ 10.0.1.0      │ Network address (standard networking)              │ │
│   │ 10.0.1.1      │ VPC Router (implicit router for the subnet)       │ │
│   │ 10.0.1.2      │ DNS Server (Amazon-provided DNS)                  │ │
│   │ 10.0.1.3      │ Reserved for future use                           │ │
│   │ 10.0.1.255    │ Broadcast address (VPC doesn't support broadcast) │ │
│   └───────────────┴───────────────────────────────────────────────────┘ │
│                                                                         │
│   Usable IPs: 10.0.1.4 through 10.0.1.254 = 251 IPs                     │
│                                                                         │
│   ⚠️  CRITICAL: The VPC router (10.0.1.1) is how traffic leaves the     │
│   subnet. All outbound traffic goes here first, then route table        │
│   determines next hop.                                                   │
│                                                                         │
│   ⚠️  CRITICAL: The DNS server (10.0.1.2) is why enableDnsSupport and   │
│   enableDnsHostnames matter. Without these, EC2 instances can't         │
│   resolve DNS names.                                                     │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
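
Python's `ipaddress` module makes the capacity math above easy to verify:

```python
import ipaddress

subnet = ipaddress.ip_network("10.0.1.0/24")
total = subnet.num_addresses        # 256 addresses in a /24
aws_usable = total - 5              # AWS reserves .0, .1, .2, .3 and .255

first_usable = subnet.network_address + 4    # 10.0.1.4
last_usable = subnet.broadcast_address - 1   # 10.0.1.254

print(total, aws_usable, first_usable, last_usable)
# 256 251 10.0.1.4 10.0.1.254
```

The same arithmetic scales: a /28 (16 addresses, the smallest subnet AWS allows) leaves only 11 usable IPs once the 5 reserved addresses are subtracted.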

CIDR Block Planning: Common Mistakes

┌─────────────────────────────────────────────────────────────────────────┐
│                    CIDR PLANNING: AVOID THESE MISTAKES                   │
│                                                                         │
│   MISTAKE 1: VPC CIDR too small                                          │
│   ─────────────────────────────────────────────────────────────────────  │
│   Created: 10.0.0.0/24 (256 IPs)                                         │
│   Problem: Only room for ~4 small subnets                                │
│   Primary CIDR can't be changed later (secondary CIDRs can be added)    │
│                                                                         │
│   Better: 10.0.0.0/16 (65,536 IPs) - room to grow                       │
│                                                                         │
│   ─────────────────────────────────────────────────────────────────────  │
│   MISTAKE 2: Overlapping CIDRs across VPCs                               │
│   ─────────────────────────────────────────────────────────────────────  │
│   VPC-A: 10.0.0.0/16                                                     │
│   VPC-B: 10.0.0.0/16  ← SAME CIDR!                                       │
│                                                                         │
│   Problem: Can NEVER peer these VPCs or connect via Transit Gateway     │
│   Routing would be ambiguous: is 10.0.1.50 in VPC-A or VPC-B?           │
│                                                                         │
│   Better: Plan non-overlapping ranges:                                   │
│   VPC-A: 10.0.0.0/16                                                     │
│   VPC-B: 10.1.0.0/16                                                     │
│   VPC-C: 10.2.0.0/16                                                     │
│                                                                         │
│   ─────────────────────────────────────────────────────────────────────  │
│   MISTAKE 3: Using 172.17.0.0/16 with Docker                             │
│   ─────────────────────────────────────────────────────────────────────  │
│   Docker's default bridge network: 172.17.0.0/16                         │
│                                                                         │
│   If your VPC is 172.17.0.0/16, containers can't reach VPC resources!   │
│   Route goes to Docker bridge instead of VPC router.                     │
│                                                                         │
│   Better: Avoid 172.17.x.x for VPCs if using Docker                     │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
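
A quick way to catch mistakes 2 and 3 before creating anything is to check candidate CIDRs for overlap with Python's `ipaddress` module:

```python
import ipaddress

vpc_a = ipaddress.ip_network("10.0.0.0/16")
vpc_b = ipaddress.ip_network("10.0.0.0/16")     # Mistake 2: identical CIDR
vpc_c = ipaddress.ip_network("10.1.0.0/16")     # non-overlapping alternative
docker = ipaddress.ip_network("172.17.0.0/16")  # Docker's default bridge

print(vpc_a.overlaps(vpc_b))   # True  -> these can never be peered
print(vpc_a.overlaps(vpc_c))   # False -> safe to peer
print(vpc_a.overlaps(docker))  # False -> 10.x VPCs avoid the Docker clash
```

Running a check like this across every planned VPC CIDR, plus 172.17.0.0/16 if you run Docker, is cheap insurance against a plan you can't route later.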

VPC Flow Logs: Seeing What’s Actually Happening

VPC Flow Logs capture metadata about every network flow. They’re your primary debugging and security monitoring tool.

Flow Log Record Format

┌─────────────────────────────────────────────────────────────────────────┐
│                    VPC FLOW LOG RECORD (Version 2)                       │
│                                                                         │
│   Raw log entry:                                                         │
│   2 123456789012 eni-abc123 10.0.1.50 10.0.2.100 49152 443 6 25 1234    │
│   1639489200 1639489260 ACCEPT OK                                        │
│                                                                         │
│   Parsed:                                                                │
│   ┌───────────────────────────────────────────────────────────────────┐ │
│   │ version        │ 2                                                │ │
│   │ account-id     │ 123456789012                                     │ │
│   │ interface-id   │ eni-abc123 (which ENI saw this traffic)         │ │
│   │ srcaddr        │ 10.0.1.50 (source IP)                           │ │
│   │ dstaddr        │ 10.0.2.100 (destination IP)                     │ │
│   │ srcport        │ 49152 (source port - ephemeral)                 │ │
│   │ dstport        │ 443 (destination port - HTTPS)                  │ │
│   │ protocol       │ 6 (TCP - see IANA protocol numbers)            │ │
│   │ packets        │ 25                                               │ │
│   │ bytes          │ 1234                                             │ │
│   │ start          │ 1639489200 (Unix timestamp)                     │ │
│   │ end            │ 1639489260                                       │ │
│   │ action         │ ACCEPT (or REJECT)                              │ │
│   │ log-status     │ OK                                               │ │
│   └───────────────────────────────────────────────────────────────────┘ │
│                                                                         │
│   KEY INSIGHT: action=REJECT means Security Group OR NACL blocked it   │
│   Flow logs don't tell you WHICH one - you must investigate both       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
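
Since a version-2 record is just 14 space-separated fields, a parser is tiny (field names below follow the table above, with dashes turned into underscores):

```python
# Version-2 flow log field names, in on-the-wire order.
FIELDS = ("version", "account_id", "interface_id", "srcaddr", "dstaddr",
          "srcport", "dstport", "protocol", "packets", "bytes",
          "start", "end", "action", "log_status")

def parse_flow_log(line):
    """Split one raw record into a field-name -> value dict."""
    return dict(zip(FIELDS, line.split()))

record = parse_flow_log(
    "2 123456789012 eni-abc123 10.0.1.50 10.0.2.100 "
    "49152 443 6 25 1234 1639489200 1639489260 ACCEPT OK"
)
print(record["srcaddr"], record["dstport"], record["action"])
# 10.0.1.50 443 ACCEPT
```

In practice you'd run queries like this in CloudWatch Logs Insights or Athena, but a throwaway parser is handy for spot-checking exported log files.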

Using Flow Logs to Detect Attacks

┌─────────────────────────────────────────────────────────────────────────┐
│                    SECURITY PATTERNS IN FLOW LOGS                        │
│                                                                         │
│   PATTERN 1: Port Scan Detection                                         │
│   ─────────────────────────────────────────────────────────────────────  │
│   Same source IP hitting many destination ports in short time:          │
│                                                                         │
│   203.0.113.50 → 10.0.1.100:22   REJECT                                 │
│   203.0.113.50 → 10.0.1.100:23   REJECT                                 │
│   203.0.113.50 → 10.0.1.100:80   REJECT                                 │
│   203.0.113.50 → 10.0.1.100:443  REJECT                                 │
│   203.0.113.50 → 10.0.1.100:3306 REJECT                                 │
│   203.0.113.50 → 10.0.1.100:5432 REJECT                                 │
│                                                                         │
│   Query: Group by srcaddr, count distinct dstport, filter > 10 ports    │
│   Action: Add 203.0.113.50 to NACL deny list                            │
│                                                                         │
│   ─────────────────────────────────────────────────────────────────────  │
│   PATTERN 2: Data Exfiltration                                           │
│   ─────────────────────────────────────────────────────────────────────  │
│   Unusual outbound data volume to external IP:                          │
│                                                                         │
│   10.0.2.50 → 185.143.223.100:443 ACCEPT bytes=2,147,483,648            │
│   (2 GB to unknown external IP - possible data theft!)                  │
│                                                                         │
│   Query: Group by srcaddr+dstaddr, sum bytes, filter external + large   │
│   Action: Investigate instance, check against threat intelligence       │
│                                                                         │
│   ─────────────────────────────────────────────────────────────────────  │
│   PATTERN 3: SSH Brute Force                                             │
│   ─────────────────────────────────────────────────────────────────────  │
│   Many rejected connections to port 22 from same source:                │
│                                                                         │
│   185.234.x.x → 10.0.1.100:22 REJECT (1000+ times in 1 hour)            │
│                                                                         │
│   Query: Filter dstport=22 AND action=REJECT, group by srcaddr          │
│   Action: Block at NACL, consider IP reputation service                 │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
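
Pattern 1's query ("group by srcaddr, count distinct dstport, filter > 10 ports") looks like this in Python over synthetic records (the IPs and ports are made up to mirror the example):

```python
from collections import defaultdict

# Synthetic flow records: (srcaddr, dstaddr, dstport, action).
flows = [("203.0.113.50", "10.0.1.100", p, "REJECT")
         for p in (21, 22, 23, 25, 80, 110, 443, 3306, 5432, 6379, 8080)]
flows.append(("10.0.1.50", "10.0.2.100", 443, "ACCEPT"))

# Group rejected flows by source, collecting distinct destination ports.
rejected_ports = defaultdict(set)
for src, dst, dstport, action in flows:
    if action == "REJECT":
        rejected_ports[src].add(dstport)

# Flag sources that probed more than 10 distinct ports.
scanners = sorted(src for src, ports in rejected_ports.items()
                  if len(ports) > 10)
print(scanners)   # ['203.0.113.50']
```

The same group-and-count shape works for patterns 2 and 3; only the grouping key (srcaddr+dstaddr) and the aggregate (sum of bytes, count of rejects) change.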

Concept Summary Table

  • Packet Flow: A packet leaving an EC2 goes ENI → Security Group (outbound) → Route Table → NACL (outbound) → next hop. Return traffic reverses the path, but NACLs need explicit inbound rules for it.
  • Security Groups: Stateful means connection tracking. Allowing inbound auto-allows the return traffic. No deny rules. Attached to the ENI (instance level). All rules are evaluated.
  • NACLs: Stateless means every packet is checked independently. You need ephemeral-port rules (1024-65535) for return traffic. Rules are evaluated in numbered order. Can deny.
  • Route Tables: Longest prefix match wins. The local route always exists. “Public subnet” just means a route to an IGW; “private” means a route to a NAT.
  • NAT Gateway: Performs PAT (Port Address Translation). Must sit in a public subnet. Maintains a translation table. A single NAT Gateway is a per-AZ point of failure.
  • CIDR Planning: The primary VPC CIDR can't be changed after creation (secondary CIDRs can be added, with restrictions). Overlapping CIDRs prevent connectivity. You lose 5 IPs per subnet. Plan for growth.
  • Flow Logs: Capture metadata, not payload. REJECT could be the SG or the NACL. Essential for security monitoring and debugging.
  • Timeout vs Refused: Timeout = packet dropped (SG/NACL/routing issue). Refused = packet arrived but nothing is listening (application issue).

Project 1: Build a Production-Ready VPC from Scratch

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: AWS CDK (TypeScript), CloudFormation (YAML), Pulumi
  • Coolness Level: Level 1: Pure Corporate Snoozefest
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Cloud Infrastructure / Networking
  • Software or Tool: AWS VPC, Terraform
  • Main Book: “AWS Certified Solutions Architect Study Guide” or AWS Documentation

What you’ll build: A fully functional VPC with public and private subnets across multiple Availability Zones, Internet Gateway, NAT Gateway, proper route tables, and Security Groups—deployable with a single command.

Why it teaches AWS networking: This is the foundation of EVERYTHING in AWS. By building it from scratch with infrastructure-as-code, you’ll understand every component and how they fit together. You can’t just click through the console—you must explicitly define every relationship.

Core challenges you’ll face:

  • CIDR block planning (avoiding overlaps, sizing for growth) → maps to IP address management
  • Multi-AZ subnet design (high availability, redundancy) → maps to fault tolerance
  • Route table associations (which subnet goes where) → maps to traffic flow control
  • NAT Gateway placement (public subnet, Elastic IP) → maps to outbound internet access
  • Security Group design (least privilege, references between groups) → maps to micro-segmentation

Difficulty: Intermediate
Time estimate: 1 week
Prerequisites: Basic AWS console familiarity, CLI setup

Real world outcome:

$ terraform init && terraform apply

Plan: 23 resources to add

aws_vpc.main: Creating...
aws_vpc.main: Created (vpc-0abc123def456)
aws_subnet.public_a: Creating...
aws_subnet.public_b: Creating...
aws_subnet.private_a: Creating...
aws_subnet.private_b: Creating...
aws_internet_gateway.main: Creating...
aws_nat_gateway.main: Creating...
aws_route_table.public: Creating...
aws_route_table.private: Creating...
...

Apply complete! Resources: 23 added, 0 changed, 0 destroyed.

Outputs:
vpc_id = "vpc-0abc123def456"
public_subnet_ids = ["subnet-pub-a", "subnet-pub-b"]
private_subnet_ids = ["subnet-priv-a", "subnet-priv-b"]
nat_gateway_ip = "54.123.45.67"

# Verify connectivity
$ aws ec2 run-instances --subnet-id subnet-priv-a --image-id ami-xxx
$ ssh -J bastion@54.x.x.x ec2-user@10.0.2.15
[ec2-user@ip-10-0-2-15 ~]$ curl ifconfig.me
54.123.45.67  # Traffic exits via NAT Gateway!

Implementation Hints: The VPC structure should look like:

VPC: 10.0.0.0/16 (65,536 IPs)
├── Public Subnet A: 10.0.1.0/24 (AZ-a) - 256 IPs
├── Public Subnet B: 10.0.2.0/24 (AZ-b) - 256 IPs
├── Private Subnet A: 10.0.10.0/24 (AZ-a) - 256 IPs
├── Private Subnet B: 10.0.11.0/24 (AZ-b) - 256 IPs
├── Database Subnet A: 10.0.20.0/24 (AZ-a) - 256 IPs
└── Database Subnet B: 10.0.21.0/24 (AZ-b) - 256 IPs
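
If you want to sanity-check this layout before writing Terraform, Python's `ipaddress` module can carve the /16 into /24 slots (a planning sketch; the slot indexes simply mirror the third octet):

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
slots = list(vpc.subnets(new_prefix=24))   # 256 candidate /24 blocks

# Pick slots matching the structure above; gaps leave room to grow.
plan = {
    "public_a":  slots[1],    # 10.0.1.0/24
    "public_b":  slots[2],    # 10.0.2.0/24
    "private_a": slots[10],   # 10.0.10.0/24
    "private_b": slots[11],   # 10.0.11.0/24
    "db_a":      slots[20],   # 10.0.20.0/24
    "db_b":      slots[21],   # 10.0.21.0/24
}
for name, cidr in plan.items():
    print(f"{name}: {cidr}")
```

The gaps between tiers (1-2, 10-11, 20-21) are deliberate: they let you add more subnets to a tier later without renumbering anything.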

Key Terraform resources to create:

# Pseudo-Terraform structure
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

resource "aws_subnet" "public" {
  for_each = { "a" = "10.0.1.0/24", "b" = "10.0.2.0/24" }
  vpc_id                  = aws_vpc.main.id
  cidr_block              = each.value
  availability_zone       = "us-east-1${each.key}"
  map_public_ip_on_launch = true
}

resource "aws_nat_gateway" "main" {
  subnet_id     = aws_subnet.public["a"].id
  allocation_id = aws_eip.nat.id
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }
}

Learning milestones:

  1. VPC deploys with all subnets → You understand VPC structure
  2. EC2 in public subnet is reachable → You understand IGW and public routing
  3. EC2 in private subnet can reach internet → You understand NAT Gateway flow
  4. Resources in private subnet are NOT directly reachable → You understand isolation

Real World Outcome

When you complete this project, you will have a fully deployable VPC infrastructure that you can use for any application. Here’s exactly what you’ll see:

# Step 1: Initialize and apply your Terraform configuration
$ cd vpc-project && terraform init

Initializing the backend...
Initializing provider plugins...
- Finding hashicorp/aws versions matching "~> 5.0"...
- Installing hashicorp/aws v5.31.0...

Terraform has been successfully initialized!

$ terraform plan

Terraform will perform the following actions:

  # aws_eip.nat will be created
  + resource "aws_eip" "nat" {
      + allocation_id        = (known after apply)
      + domain               = "vpc"
      + public_ip            = (known after apply)
    }

  # aws_internet_gateway.main will be created
  + resource "aws_internet_gateway" "main" {
      + id      = (known after apply)
      + vpc_id  = (known after apply)
    }

  # aws_nat_gateway.main will be created
  + resource "aws_nat_gateway" "main" {
      + allocation_id        = (known after apply)
      + connectivity_type    = "public"
      + public_ip            = (known after apply)
      + subnet_id            = (known after apply)
    }

  # aws_vpc.main will be created
  + resource "aws_vpc" "main" {
      + cidr_block           = "10.0.0.0/16"
      + enable_dns_hostnames = true
      + enable_dns_support   = true
      + id                   = (known after apply)
    }

  ... (23 resources total)

Plan: 23 to add, 0 to change, 0 to destroy.

$ terraform apply -auto-approve

aws_vpc.main: Creating...
aws_vpc.main: Creation complete after 2s [id=vpc-0abc123def456789]
aws_internet_gateway.main: Creating...
aws_subnet.public["a"]: Creating...
aws_subnet.public["b"]: Creating...
aws_subnet.private["a"]: Creating...
aws_subnet.private["b"]: Creating...
aws_internet_gateway.main: Creation complete after 1s [id=igw-0def456789abc123]
aws_eip.nat: Creating...
aws_eip.nat: Creation complete after 1s [id=eipalloc-0123456789abcdef]
aws_nat_gateway.main: Creating...
aws_nat_gateway.main: Still creating... [1m0s elapsed]
aws_nat_gateway.main: Creation complete after 1m45s [id=nat-0abcdef123456789]
aws_route_table.private: Creating...
aws_route_table.public: Creating...
...

Apply complete! Resources: 23 added, 0 changed, 0 destroyed.

Outputs:

vpc_id = "vpc-0abc123def456789"
vpc_cidr = "10.0.0.0/16"
public_subnet_ids = [
  "subnet-0pub1111111111111",
  "subnet-0pub2222222222222",
]
private_subnet_ids = [
  "subnet-0prv1111111111111",
  "subnet-0prv2222222222222",
]
database_subnet_ids = [
  "subnet-0db11111111111111",
  "subnet-0db22222222222222",
]
nat_gateway_public_ip = "54.123.45.67"
internet_gateway_id = "igw-0def456789abc123"

# Step 2: Verify the infrastructure in AWS Console or CLI
$ aws ec2 describe-vpcs --vpc-ids vpc-0abc123def456789 --no-cli-pager

{
    "Vpcs": [{
        "CidrBlock": "10.0.0.0/16",
        "VpcId": "vpc-0abc123def456789",
        "State": "available"
    }]
}

# Note: DNS attributes aren't in describe-vpcs output; query them with:
$ aws ec2 describe-vpc-attribute --vpc-id vpc-0abc123def456789 \
    --attribute enableDnsSupport

# Step 3: Test connectivity by launching instances
$ aws ec2 run-instances \
    --image-id ami-0c55b159cbfafe1f0 \
    --instance-type t3.micro \
    --subnet-id subnet-0prv1111111111111 \
    --security-group-ids sg-0app1111111111111 \
    --key-name my-key \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=test-private}]'

# Step 4: SSH via bastion and verify NAT Gateway
$ ssh -J ec2-user@bastion.example.com ec2-user@10.0.10.50

[ec2-user@ip-10-0-10-50 ~]$ curl -s ifconfig.me
54.123.45.67

# Your private instance's traffic exits via the NAT Gateway!
# The public IP shown is the NAT Gateway's Elastic IP, not the instance's

[ec2-user@ip-10-0-10-50 ~]$ curl -s https://api.github.com | head -5
{
  "current_user_url": "https://api.github.com/user",
  "current_user_authorizations_html_url": "https://github.com/...",
  ...
}

# Private subnet instance can reach the internet (outbound only)!

Visual Architecture You’ve Built:

┌─────────────────────────────────────────────────────────────────────────────┐
│                              YOUR VPC (10.0.0.0/16)                          │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                        AVAILABILITY ZONE A                           │   │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐      │   │
│  │  │  Public Subnet  │  │  Private Subnet │  │    DB Subnet    │      │   │
│  │  │  10.0.1.0/24    │  │  10.0.10.0/24   │  │  10.0.20.0/24   │      │   │
│  │  │                 │  │                 │  │                 │      │   │
│  │  │  ┌───────────┐  │  │  ┌───────────┐  │  │  ┌───────────┐  │      │   │
│  │  │  │NAT Gateway│  │  │  │ App Server│  │  │  │    RDS    │  │      │   │
│  │  │  │  + EIP    │  │  │  │           │  │  │  │  Primary  │  │      │   │
│  │  │  └───────────┘  │  │  └───────────┘  │  │  └───────────┘  │      │   │
│  │  │  ┌───────────┐  │  │                 │  │                 │      │   │
│  │  │  │  Bastion  │  │  │                 │  │                 │      │   │
│  │  │  │   Host    │  │  │                 │  │                 │      │   │
│  │  │  └───────────┘  │  │                 │  │                 │      │   │
│  │  └─────────────────┘  └─────────────────┘  └─────────────────┘      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                        AVAILABILITY ZONE B                           │   │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐      │   │
│  │  │  Public Subnet  │  │  Private Subnet │  │    DB Subnet    │      │   │
│  │  │  10.0.2.0/24    │  │  10.0.11.0/24   │  │  10.0.21.0/24   │      │   │
│  │  │                 │  │                 │  │                 │      │   │
│  │  │  ┌───────────┐  │  │  ┌───────────┐  │  │  ┌───────────┐  │      │   │
│  │  │  │    ALB    │  │  │  │ App Server│  │  │  │    RDS    │  │      │   │
│  │  │  │  (spare)  │  │  │  │  (spare)  │  │  │  │  Standby  │  │      │   │
│  │  │  └───────────┘  │  │  └───────────┘  │  │  └───────────┘  │      │   │
│  │  └─────────────────┘  └─────────────────┘  └─────────────────┘      │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌──────────────────────┐                                                  │
│  │   Internet Gateway   │◄───── 0.0.0.0/0 route from public subnets        │
│  └──────────────────────┘                                                  │
└─────────────────────────────────────────────────────────────────────────────┘
                │
                ▼
          ┌──────────┐
          │ Internet │
          └──────────┘

The Core Question You’re Answering

“What actually makes a subnet ‘public’ vs ‘private’, and how does traffic flow between them and to the internet?”

This question gets to the heart of VPC design. There’s no checkbox that says “make this subnet public.” A subnet is public because of its route table configuration—specifically, whether it has a route to an Internet Gateway. Understanding this relationship is the foundation of all AWS networking.


Concepts You Must Understand First

Stop and research these before coding:

  1. CIDR Notation and Subnetting
    • What does 10.0.0.0/16 actually mean in binary?
    • How do you calculate how many IPs are in a /24 vs /20 vs /16?
    • Why can’t VPC CIDRs overlap if you want to peer them?
    • What’s the difference between 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16? (RFC 1918)
    • Book Reference: “Computer Networks, Fifth Edition” by Tanenbaum — Ch. 5.6: IP Addresses
  2. Route Tables and Longest Prefix Match
    • How does a router decide where to send a packet?
    • If you have routes for 10.0.0.0/16 and 10.0.1.0/24, which wins for 10.0.1.50?
    • Why does every VPC route table have a “local” route?
    • What happens if there’s no matching route?
    • Book Reference: “Computer Networks, Fifth Edition” by Tanenbaum — Ch. 5.2: Routing Algorithms
  3. NAT (Network Address Translation)
    • Why can’t private IPs (10.x.x.x) be used on the public internet?
    • How does NAT “hide” hundreds of instances behind one public IP?
    • What’s the difference between SNAT (source NAT) and DNAT (destination NAT)?
    • Why does NAT break some protocols (like FTP active mode)?
    • Book Reference: “Computer Networks, Fifth Edition” by Tanenbaum — Ch. 5.6.4: Network Address Translation
  4. Availability Zones and High Availability
    • What actually IS an Availability Zone physically?
    • Why do you need subnets in multiple AZs?
    • What’s the latency between AZs in the same region?
    • What happens if one AZ goes down?
    • Book Reference: AWS Well-Architected Framework — Reliability Pillar
  5. DNS in VPCs
    • What is the .2 address in every VPC (e.g., 10.0.0.2)?
    • What do enableDnsSupport and enableDnsHostnames actually do?
    • Why do some services require DNS hostnames to work?
    • Book Reference: “Computer Networks, Fifth Edition” by Tanenbaum — Ch. 7.1: DNS
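
Concept 2 (longest prefix match) is easiest to internalize by implementing it. The sketch below models a route table as a list of (prefix, target) pairs; the target names are made up for illustration:

```python
import ipaddress

routes = [
    ("10.0.0.0/16", "local"),         # every VPC route table has this
    ("10.0.1.0/24", "pcx-peer"),      # hypothetical peering route
    ("0.0.0.0/0",   "nat-gateway"),   # default route
]

def next_hop(dst):
    """Return the target of the most specific route matching dst."""
    ip = ipaddress.ip_address(dst)
    matches = [(ipaddress.ip_network(cidr), target)
               for cidr, target in routes
               if ip in ipaddress.ip_network(cidr)]
    # Longest prefix match: the largest prefixlen wins, not list order.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(next_hop("10.0.1.50"))   # pcx-peer (the /24 beats the /16)
print(next_hop("10.0.5.9"))    # local
print(next_hop("8.8.8.8"))     # nat-gateway
```

Note there is no "first match" ordering here: 10.0.1.50 matches three routes, and the /24 wins purely because it is the most specific.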

Questions to Guide Your Design

Before implementing, think through these:

  1. CIDR Planning
    • How many IP addresses do you need today? In 5 years?
    • If you use 10.0.0.0/16 for this VPC, what CIDRs will you use for future VPCs?
    • How will you organize subnets? By tier (web/app/db)? By AZ? Both?
    • What if you later need to peer with a VPC that has 10.0.0.0/16?
  2. Subnet Design
    • Why put public and private subnets in separate CIDR ranges (10.0.1.x vs 10.0.10.x)?
    • How many IPs do you need per subnet? (Remember: AWS reserves 5 per subnet)
    • Should database subnets be different from app private subnets?
    • Do you need isolated subnets (no internet access at all)?
  3. Route Table Design
    • How many route tables do you need? (Hint: minimum 2 - public and private)
    • Should each private subnet have its own NAT Gateway for HA?
    • What happens to private subnet traffic if the NAT Gateway fails?
  4. NAT Gateway vs NAT Instance
    • NAT Gateway: Managed, scales automatically, ~$0.045/hour + data processing
    • NAT Instance: Self-managed EC2, cheaper for low traffic, single point of failure
    • When would you choose one over the other?
  5. Cost Considerations
    • NAT Gateway costs: $0.045/hour × 730 hours/month = $32.85/month just to exist
    • Plus $0.045/GB data processed
    • How can you reduce costs? (VPC Endpoints for AWS services, consider NAT Instance for dev)

Thinking Exercise

Before coding, trace these scenarios on paper:

Scenario 1: EC2 in private subnet wants to call api.github.com

Draw the packet flow:

1. EC2 (10.0.10.50) creates packet: src=10.0.10.50, dst=140.82.121.4 (github)
2. Route table lookup: 140.82.121.4 matches 0.0.0.0/0 → NAT Gateway
3. NAT Gateway receives packet, performs SNAT:
   - New src IP = NAT Gateway's Elastic IP (54.123.45.67)
   - Stores mapping in translation table
4. Route table in public subnet: 0.0.0.0/0 → Internet Gateway
5. Packet goes to Internet Gateway → Internet → GitHub
6. GitHub responds to 54.123.45.67
7. NAT Gateway receives response, looks up translation table
8. Translates dst back to 10.0.10.50, sends to private subnet
9. EC2 receives response

Question: What happens if NAT Gateway is deleted mid-connection?
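
The SNAT steps in the trace (3, 7, and 8) boil down to a two-way translation table. This toy model also suggests the answer to the question above: delete the gateway and its table is gone, so responses can no longer be translated back and the connection hangs until it times out.

```python
import itertools

NAT_PUBLIC_IP = "54.123.45.67"   # the NAT Gateway's Elastic IP

class NatTable:
    """Toy SNAT translation table, keyed like a real NAT device's."""
    def __init__(self):
        self.outbound = {}     # (private_ip, private_port) -> public_port
        self.inbound = {}      # public_port -> (private_ip, private_port)
        self._ports = itertools.count(1024)

    def snat(self, src_ip, src_port):
        # Step 3: rewrite the source to the NAT's IP, remember the mapping.
        key = (src_ip, src_port)
        if key not in self.outbound:
            port = next(self._ports)
            self.outbound[key] = port
            self.inbound[port] = key
        return NAT_PUBLIC_IP, self.outbound[key]

    def reverse(self, public_port):
        # Steps 7-8: look up the mapping and restore the private address.
        return self.inbound.get(public_port)

nat = NatTable()
public = nat.snat("10.0.10.50", 49152)
print(public)                    # ('54.123.45.67', 1024)
print(nat.reverse(public[1]))    # ('10.0.10.50', 49152)
```

A real NAT Gateway keys its table on the full 5-tuple and expires idle entries, but the round trip (rewrite out, look up back) is the same idea.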

Scenario 2: Someone on the internet tries to reach 10.0.10.50 directly

Draw what happens:

1. Attacker sends packet: src=attacker, dst=10.0.10.50
2. Packet arrives at... where exactly?
3. Can 10.0.10.50 even be routed on the public internet?

Answer: The packet never arrives. Private IPs (10.x.x.x) are not routable
on the public internet. Routers drop them. This is why private subnets
are "private" - they're literally unreachable from outside.

Scenario 3: EC2 in public subnet vs EC2 in private subnet

What’s actually different?

Public subnet EC2 (10.0.1.50):
- Has route 0.0.0.0/0 → Internet Gateway
- Can be assigned a public IP (Elastic IP or auto-assign)
- Traffic uses its own public IP for outbound
- Can receive inbound traffic from internet (if SG allows)

Private subnet EC2 (10.0.10.50):
- Has route 0.0.0.0/0 → NAT Gateway
- A public IP is useless here: with no IGW route, inbound traffic has
  no path to it (so auto-assign is normally left off)
- Traffic uses NAT Gateway's IP for outbound
- Cannot receive inbound connections from the internet (no IGW path)

The ONLY difference is the route table. The subnet itself has no
"public" or "private" property - it's ALL about routing.

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What makes a subnet public vs private in AWS?”
  2. “Can an EC2 instance in a private subnet access the internet? How?”
  3. “What’s the difference between an Internet Gateway and a NAT Gateway?”
  4. “You have a VPC with CIDR 10.0.0.0/16. Can you peer it with another VPC that has 10.0.0.0/24? Why or why not?”
  5. “Your private instances can’t reach the internet. How do you troubleshoot?”
  6. “Why do you need subnets in multiple Availability Zones?”
  7. “What happens to your application if one AZ goes down and you only have a NAT Gateway in that AZ?”
  8. “How would you reduce NAT Gateway costs for a development environment?”
  9. “What’s the maximum CIDR block size for a VPC?”
  10. “How many IP addresses are usable in a /24 subnet in AWS?”

Hints in Layers

Hint 1: Start with the VPC and Subnets

Your first Terraform file should just create the VPC and subnets:

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "production-vpc"
  }
}

# Public subnets - one per AZ
resource "aws_subnet" "public" {
  for_each = {
    "a" = { cidr = "10.0.1.0/24", az = "us-east-1a" }
    "b" = { cidr = "10.0.2.0/24", az = "us-east-1b" }
  }

  vpc_id                  = aws_vpc.main.id
  cidr_block              = each.value.cidr
  availability_zone       = each.value.az
  map_public_ip_on_launch = true  # This is what makes instances get public IPs

  tags = {
    Name = "public-${each.key}"
    Tier = "public"
  }
}

Run terraform apply and verify in the console that your VPC and subnets exist.

Hint 2: Add the Internet Gateway and Public Route Table

Without this, even “public” subnets can’t reach the internet:

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "main-igw"
  }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = {
    Name = "public-rt"
  }
}

# Associate public subnets with public route table
resource "aws_route_table_association" "public" {
  for_each       = aws_subnet.public
  subnet_id      = each.value.id
  route_table_id = aws_route_table.public.id
}

Hint 3: Add NAT Gateway for Private Subnets

NAT Gateway needs an Elastic IP and must be in a PUBLIC subnet:

resource "aws_eip" "nat" {
  domain = "vpc"

  tags = {
    Name = "nat-eip"
  }
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public["a"].id  # Must be in PUBLIC subnet!

  tags = {
    Name = "main-nat"
  }

  depends_on = [aws_internet_gateway.main]
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }

  tags = {
    Name = "private-rt"
  }
}

Hint 4: Test Your Setup

After deploying, verify everything works:

# Launch a test instance in private subnet
aws ec2 run-instances \
  --image-id ami-0c55b159cbfafe1f0 \
  --instance-type t3.micro \
  --subnet-id <your-private-subnet-id> \
  --key-name <your-key> \
  --no-associate-public-ip-address

# Use Session Manager (no bastion needed) or SSH via bastion
# Then test outbound connectivity:
curl -s ifconfig.me  # Should show NAT Gateway's EIP

# Try to ping the instance from the internet - it should fail
# (the NAT Gateway won't forward unsolicited inbound traffic)

Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| IP addressing & CIDR | "Computer Networks, Fifth Edition" by Tanenbaum | Ch. 5.6: IP Addresses |
| Routing fundamentals | "Computer Networks, Fifth Edition" by Tanenbaum | Ch. 5.2: Routing Algorithms |
| NAT mechanics | "Computer Networks, Fifth Edition" by Tanenbaum | Ch. 5.6.4: Network Address Translation |
| AWS VPC deep dive | "AWS Certified Solutions Architect Study Guide" | VPC Chapter |
| High availability design | "AWS Well-Architected Framework" | Reliability Pillar |
| Terraform basics | "Terraform: Up & Running" by Yevgeniy Brikman | Ch. 2-4 |
| Infrastructure as Code | "Infrastructure as Code" by Kief Morris | Ch. 1-5 |

Project 2: Security Group Traffic Flow Debugger

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Bash with AWS CLI
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Security / Networking
  • Software or Tool: AWS Security Groups, boto3
  • Main Book: “AWS Security” by Dylan Shields

What you’ll build: A CLI tool that analyzes Security Groups and tells you whether traffic can flow between two resources, tracing the path through all relevant security controls.

Why it teaches AWS networking: Security Groups are the most misunderstood AWS feature. People add rules without understanding stateful behavior, or create circular dependencies they can’t debug. This tool forces you to understand exactly how SG evaluation works.

Core challenges you’ll face:

  • Understanding stateful filtering (return traffic is auto-allowed) → maps to connection tracking
  • Tracing SG references (SG-A allows SG-B, SG-B allows SG-C) → maps to graph traversal
  • Rule evaluation (no ordering - every rule is checked, most permissive wins) → maps to rule processing
  • ENI-level association (one resource, multiple SGs) → maps to AWS networking model
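The SG-reference bullet maps naturally onto a directed graph: an edge toward sg-A means "sg-A admits traffic from members of that source SG". A hedged sketch of the traversal, using illustrative SG IDs — note that a single connection never chains transitively; the BFS only enumerates which tiers traffic could reach hop by hop:

```python
from collections import deque

def reachable_tiers(edges, start):
    """BFS over 'admits traffic from' edges: edges[sg] lists the SGs whose
    members sg accepts connections from. Returns every SG that traffic
    originating in `start` could eventually hop to, one tier at a time."""
    seen, queue = set(), deque([start])
    while queue:
        sg = queue.popleft()
        for target, sources in edges.items():
            if sg in sources and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# sg-app admits sg-web; sg-db admits sg-app (illustrative IDs)
edges = {"sg-app": ["sg-web"], "sg-db": ["sg-app"]}
assert reachable_tiers(edges, "sg-web") == {"sg-app", "sg-db"}
```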

Key Concepts:

Difficulty: Intermediate
Time estimate: 1 week
Prerequisites: Python, boto3, understanding of TCP/IP

Real world outcome:

$ ./sg-debug can-connect --from i-abc123 --to i-def456 --port 443

Analyzing connectivity: i-abc123 → i-def456:443

SOURCE INSTANCE: i-abc123
  ENI: eni-111
  Private IP: 10.0.1.50
  Security Groups: [sg-web-tier]

TARGET INSTANCE: i-def456
  ENI: eni-222
  Private IP: 10.0.2.100
  Security Groups: [sg-app-tier]

OUTBOUND CHECK (sg-web-tier):
  ✓ Rule found: "Allow all outbound" (0.0.0.0/0, all ports)

INBOUND CHECK (sg-app-tier):
  ✗ No rule allows TCP:443 from 10.0.1.50 or sg-web-tier

RESULT: CONNECTION BLOCKED ❌

RECOMMENDATION:
  Add inbound rule to sg-app-tier:
    Protocol: TCP
    Port: 443
    Source: sg-web-tier (recommended) or 10.0.1.50/32

$ ./sg-debug can-connect --from i-abc123 --to i-def456 --port 443 --after-fix
RESULT: CONNECTION ALLOWED ✓
  Return traffic: Auto-allowed (Security Groups are stateful)

Implementation Hints:

# Pseudo-code structure
def can_connect(source_instance, target_instance, port, protocol="tcp"):
    # Get ENI and SG info for both instances
    source_enis = get_enis(source_instance)
    target_enis = get_enis(target_instance)

    source_sgs = get_security_groups(source_enis)
    target_sgs = get_security_groups(target_enis)

    # Check outbound from source
    outbound_allowed = check_outbound_rules(
        source_sgs,
        target_ip=get_private_ip(target_enis[0]),
        target_sg_ids=[sg.id for sg in target_sgs],
        port=port,
        protocol=protocol
    )

    if not outbound_allowed:
        return False, "Outbound blocked by source Security Group"

    # Check inbound to target
    inbound_allowed = check_inbound_rules(
        target_sgs,
        source_ip=get_private_ip(source_enis[0]),
        source_sg_ids=[sg.id for sg in source_sgs],
        port=port,
        protocol=protocol
    )

    if not inbound_allowed:
        return False, "Inbound blocked by target Security Group"

    # SGs are stateful - return traffic auto-allowed
    return True, "Connection allowed (return traffic auto-allowed)"

def check_inbound_rules(sgs, source_ip, source_sg_ids, port, protocol):
    for sg in sgs:
        for rule in sg.ip_permissions:
            # Check if rule matches protocol
            if rule.ip_protocol != protocol and rule.ip_protocol != "-1":
                continue
            # Check if port is in range
            if not port_in_range(port, rule.from_port, rule.to_port):
                continue
            # Check if source matches
            if matches_source(rule, source_ip, source_sg_ids):
                return True
    return False
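The pseudo-code leans on two helpers, port_in_range and matches_source. Minimal runnable versions follow — the "cidrs" and "source_sg_ids" keys on the rule dict are illustrative; real boto3 responses nest these under IpRanges and UserIdGroupPairs:

```python
import ipaddress

def port_in_range(port, from_port, to_port):
    """True if the requested port falls inside the rule's port range."""
    return from_port <= port <= to_port

def matches_source(rule, source_ip, source_sg_ids):
    """A rule matches if the source IP is inside any of its CIDRs, or
    the source ENI carries any SG that the rule references."""
    addr = ipaddress.ip_address(source_ip)
    if any(addr in ipaddress.ip_network(cidr) for cidr in rule.get("cidrs", [])):
        return True
    return any(sg in source_sg_ids for sg in rule.get("source_sg_ids", []))

rule = {"cidrs": ["10.0.0.0/8"], "source_sg_ids": ["sg-web"]}
assert port_in_range(443, 443, 443)
assert matches_source(rule, "10.0.1.50", [])             # matches by CIDR
assert matches_source(rule, "192.168.1.1", ["sg-web"])   # matches by SG reference
```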

Learning milestones:

  1. Tool correctly identifies blocked connections → You understand SG rule evaluation
  2. Tool explains WHY traffic is blocked → You can debug SG issues
  3. Understands SG references (sg-xxx as source) → You understand SG chaining
  4. Correctly handles stateful behavior → You understand return traffic

Real World Outcome

When you complete this project, you’ll have a CLI tool that saves hours of debugging time by instantly showing whether traffic can flow between any two AWS resources. Here’s exactly what the tool will do:

# Basic usage - check if web server can talk to app server
$ ./sg-debug can-connect --from i-web123 --to i-app456 --port 8080

╔══════════════════════════════════════════════════════════════════════════════╗
║                    SECURITY GROUP CONNECTIVITY ANALYSIS                       ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  CONNECTION: i-web123 (10.0.1.50) → i-app456 (10.0.10.100):8080/TCP         ║
║                                                                              ║
║  ┌────────────────────────────────────────────────────────────────────────┐ ║
║  │ SOURCE INSTANCE: i-web123                                              │ ║
║  │   Name: web-server-1                                                   │ ║
║  │   ENI: eni-0abc111111111111                                            │ ║
║  │   Private IP: 10.0.1.50                                                │ ║
║  │   Subnet: subnet-pub-a (10.0.1.0/24) - PUBLIC                         │ ║
║  │   Security Groups:                                                     │ ║
║  │     • sg-0web111111111111 (web-tier-sg)                               │ ║
║  └────────────────────────────────────────────────────────────────────────┘ ║
║                                      │                                       ║
║                                      ▼                                       ║
║  ┌────────────────────────────────────────────────────────────────────────┐ ║
║  │ OUTBOUND CHECK (sg-web-tier-sg)                                        │ ║
║  │                                                                        │ ║
║  │   Checking rules for: TCP:8080 to 10.0.10.100                         │ ║
║  │                                                                        │ ║
║  │   Rule 1: Type=All, Protocol=All, Port=All, Dest=0.0.0.0/0           │ ║
║  │           ✓ MATCH - Destination 10.0.10.100 in 0.0.0.0/0             │ ║
║  │                                                                        │ ║
║  │   RESULT: ✓ OUTBOUND ALLOWED                                          │ ║
║  └────────────────────────────────────────────────────────────────────────┘ ║
║                                      │                                       ║
║                                      ▼                                       ║
║  ┌────────────────────────────────────────────────────────────────────────┐ ║
║  │ TARGET INSTANCE: i-app456                                              │ ║
║  │   Name: app-server-1                                                   │ ║
║  │   ENI: eni-0def222222222222                                            │ ║
║  │   Private IP: 10.0.10.100                                              │ ║
║  │   Subnet: subnet-prv-a (10.0.10.0/24) - PRIVATE                       │ ║
║  │   Security Groups:                                                     │ ║
║  │     • sg-0app222222222222 (app-tier-sg)                               │ ║
║  └────────────────────────────────────────────────────────────────────────┘ ║
║                                      │                                       ║
║                                      ▼                                       ║
║  ┌────────────────────────────────────────────────────────────────────────┐ ║
║  │ INBOUND CHECK (sg-app-tier-sg)                                         │ ║
║  │                                                                        │ ║
║  │   Checking rules for: TCP:8080 from 10.0.1.50 or sg-web-tier-sg       │ ║
║  │                                                                        │ ║
║  │   Rule 1: Type=Custom TCP, Protocol=TCP, Port=443, Source=sg-alb-sg  │ ║
║  │           ✗ NO MATCH - Port 443 ≠ 8080                                │ ║
║  │                                                                        │ ║
║  │   Rule 2: Type=Custom TCP, Protocol=TCP, Port=22, Source=10.0.0.0/8  │ ║
║  │           ✗ NO MATCH - Port 22 ≠ 8080                                 │ ║
║  │                                                                        │ ║
║  │   No more rules to check.                                              │ ║
║  │                                                                        │ ║
║  │   RESULT: ✗ INBOUND BLOCKED                                           │ ║
║  └────────────────────────────────────────────────────────────────────────┘ ║
║                                                                              ║
║  ══════════════════════════════════════════════════════════════════════════ ║
║                                                                              ║
║  FINAL RESULT: CONNECTION BLOCKED ❌                                         ║
║                                                                              ║
║  BLOCKED AT: Inbound rules on sg-app-tier-sg                                ║
║                                                                              ║
║  RECOMMENDATIONS:                                                            ║
║  ────────────────────────────────────────────────────────────────────────── ║
║                                                                              ║
║  Option 1 (Recommended - Security Group Reference):                         ║
║    aws ec2 authorize-security-group-ingress \                               ║
║      --group-id sg-0app222222222222 \                                       ║
║      --protocol tcp \                                                        ║
║      --port 8080 \                                                           ║
║      --source-group sg-0web111111111111                                     ║
║                                                                              ║
║  Option 2 (IP-based - less flexible):                                       ║
║    aws ec2 authorize-security-group-ingress \                               ║
║      --group-id sg-0app222222222222 \                                       ║
║      --protocol tcp \                                                        ║
║      --port 8080 \                                                           ║
║      --cidr 10.0.1.50/32                                                    ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝

# After adding the rule, verify the fix:
$ ./sg-debug can-connect --from i-web123 --to i-app456 --port 8080

FINAL RESULT: CONNECTION ALLOWED ✓

Traffic Path:
  i-web123 (10.0.1.50)[OUTBOUND: sg-web-tier-sg allows all]
    → i-app456 (10.0.10.100:8080)[INBOUND: sg-app-tier-sg allows from sg-web-tier-sg]

Return Traffic: Auto-allowed (Security Groups are stateful)

# Advanced usage - trace all allowed connections for an instance
$ ./sg-debug list-allowed --instance i-app456 --direction inbound

╔══════════════════════════════════════════════════════════════════════════════╗
║              ALLOWED INBOUND CONNECTIONS TO i-app456                         ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  Security Group: sg-app-tier-sg                                              ║
║                                                                              ║
║  ┌────────┬──────────┬───────────────────────────┬──────────────────────┐   ║
║  │ Port   │ Protocol │ Source                    │ Description          │   ║
║  ├────────┼──────────┼───────────────────────────┼──────────────────────┤   ║
║  │ 443    │ TCP      │ sg-alb-sg (ALB)          │ HTTPS from ALB       │   ║
║  │ 8080   │ TCP      │ sg-web-tier-sg           │ API from web tier    │   ║
║  │ 22     │ TCP      │ 10.0.0.0/8               │ SSH from VPC         │   ║
║  │ 3306   │ TCP      │ sg-app-tier-sg (self)    │ DB replication       │   ║
║  └────────┴──────────┴───────────────────────────┴──────────────────────┘   ║
║                                                                              ║
║  POTENTIAL ISSUES DETECTED:                                                  ║
║  ⚠️  SSH (22) open to entire VPC - consider restricting to bastion only     ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝

# Find which instances can reach a specific target
$ ./sg-debug who-can-reach --target i-db789 --port 3306

╔══════════════════════════════════════════════════════════════════════════════╗
║              INSTANCES THAT CAN REACH i-db789:3306                          ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  Database instance i-db789 accepts TCP:3306 from:                           ║
║                                                                              ║
║  Via Security Group sg-app-tier-sg:                                         ║
║    • i-app456 (10.0.10.100) - app-server-1                                  ║
║    • i-app789 (10.0.11.100) - app-server-2                                  ║
║                                                                              ║
║  Via CIDR 10.0.10.0/24:                                                     ║
║    • i-app456 (10.0.10.100) - app-server-1                                  ║
║    • i-cache123 (10.0.10.200) - redis-cache                                 ║
║                                                                              ║
║  TOTAL: 3 unique instances can reach the database                           ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝

The Core Question You’re Answering

“When traffic is blocked between two AWS resources, how do I know WHERE it’s blocked and WHY?”

This is the question every AWS engineer faces daily. Security Groups silently drop packets—there’s no “connection refused” or error message. Understanding exactly how Security Group rules are evaluated, how stateful filtering works, and how SG references resolve is crucial for debugging any connectivity issue.


Concepts You Must Understand First

Stop and research these before coding:

  1. Stateful vs Stateless Firewalls
    • What does “stateful” actually mean in firewall terms?
    • How does a stateful firewall track connections? (hint: connection table/conntrack)
    • Why is return traffic automatically allowed in Security Groups?
    • What’s the TCP three-way handshake and why does it matter for stateful filtering?
    • Book Reference: “Computer Networks, Fifth Edition” by Tanenbaum — Ch. 8.9: Firewalls
  2. Security Group Rule Evaluation
    • Are Security Group rules evaluated in order like NACLs?
    • What happens when you have multiple Security Groups on one ENI?
    • Can you have deny rules in Security Groups?
    • How does “All traffic” rule (-1 protocol) work?
    • Book Reference: AWS Security Groups Documentation
  3. Security Group References (SG-to-SG)
    • What does it mean when a rule has “sg-xxx” as source instead of a CIDR?
    • How does AWS resolve SG references when checking rules?
    • Why is SG referencing more flexible than IP-based rules?
    • What happens if you reference a SG from a different VPC?
    • Book Reference: AWS Security Best Practices Whitepaper
  4. ENI and Security Group Association
    • What is an ENI (Elastic Network Interface)?
    • How many Security Groups can be attached to one ENI?
    • Can different ENIs on the same instance have different Security Groups?
    • How do Lambda, RDS, and ELB use ENIs and Security Groups?
    • Book Reference: “AWS Certified Solutions Architect Study Guide” — Networking Chapter
  5. TCP/UDP and Port Ranges
    • What’s the difference between source port and destination port?
    • Why do you only need to specify destination port in SG rules?
    • What are ephemeral ports and why don’t you need to allow them inbound?
    • What does protocol “-1” mean in AWS?
    • Book Reference: “TCP/IP Illustrated, Volume 1” by Stevens — Ch. 13: TCP Connection Management

Questions to Guide Your Design

Before implementing, think through these:

  1. Data Model
    • What information do you need to fetch from AWS to analyze connectivity?
    • How will you represent a Security Group rule in your code?
    • How will you handle multiple ENIs per instance?
    • How will you resolve SG references to actual IP addresses?
  2. Algorithm Design
    • Should you check outbound rules first or inbound rules?
    • How do you determine if a rule “matches” a connection attempt?
    • When multiple rules could match, which one wins?
    • How do you explain WHY a connection is blocked?
  3. Edge Cases
    • What if the source instance has multiple SGs and one allows while another doesn’t?
    • What if the rule uses a SG reference that includes the source instance?
    • How do you handle ICMP (ping) which doesn’t have ports?
    • What about connections to/from AWS services (RDS, ElastiCache)?
  4. User Experience
    • How should you display the results—text, JSON, visual diagram?
    • Should you recommend fixes for blocked connections?
    • How do you make the output actionable for someone who’s debugging?

Thinking Exercise

Before coding, trace these scenarios on paper:

Scenario 1: Web server trying to reach database

Setup:
- web-server (i-web) has sg-web attached
- db-server (i-db) has sg-db attached
- sg-web outbound: Allow all (0.0.0.0/0)
- sg-db inbound: Allow TCP 3306 from sg-app (NOT sg-web!)

Connection attempt: i-web → i-db:3306

Step 1: Check sg-web outbound rules
  - Rule "Allow all" matches destination 10.0.20.50:3306
  - Outbound: ALLOWED ✓

Step 2: Check sg-db inbound rules
  - Rule "Allow 3306 from sg-app"
  - Is i-web in sg-app? NO!
  - No other rules match
  - Inbound: BLOCKED ✗

Result: CONNECTION BLOCKED
Reason: sg-db only allows 3306 from sg-app, not sg-web
Fix: Either add i-web to sg-app, or add new rule allowing sg-web

Scenario 2: Understanding SG references

Setup:
- Instance A has sg-A attached (IP: 10.0.1.10)
- Instance B has sg-B attached (IP: 10.0.1.20)
- Instance C has sg-A AND sg-B attached (IP: 10.0.1.30)
- sg-target inbound: Allow TCP 443 from sg-A

Question: Which instances can reach sg-target on port 443?

Answer:
- Instance A: YES (has sg-A)
- Instance B: NO (only has sg-B)
- Instance C: YES (has sg-A, even though it also has sg-B)

The SG reference sg-A means "any ENI that has sg-A attached"

Scenario 3: Multiple Security Groups on one ENI

Setup:
- Instance has sg-1 and sg-2 attached
- sg-1 outbound: Allow TCP 443 to 0.0.0.0/0
- sg-1 outbound: (no rule for port 8080)
- sg-2 outbound: Allow TCP 8080 to 10.0.0.0/8

Connection attempt: Instance → external-api:443
- sg-1 allows it: ALLOWED ✓

Connection attempt: Instance → internal-service:8080
- sg-1 doesn't have a rule... but wait!
- sg-2 allows it: ALLOWED ✓

Key insight: Security Groups are ADDITIVE
If ANY attached SG allows the traffic, it's allowed
There's no "most restrictive wins" - it's "most permissive wins"
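The additive rule can be stated in one line of code: evaluation is an any() over every attached SG. A toy model, assuming each SG is just a list of (protocol, low-port, high-port, dest-CIDR) allow tuples:

```python
import ipaddress

def sg_allows(sg_rules, protocol, port, dest_ip):
    """One SG allows traffic if any of its rules matches."""
    return any(p == protocol and lo <= port <= hi and
               ipaddress.ip_address(dest_ip) in ipaddress.ip_network(cidr)
               for p, lo, hi, cidr in sg_rules)

def eni_allows(attached_sgs, protocol, port, dest_ip):
    """ANY attached SG allowing the traffic is enough - most permissive wins."""
    return any(sg_allows(sg, protocol, port, dest_ip) for sg in attached_sgs)

sg1 = [("tcp", 443, 443, "0.0.0.0/0")]     # HTTPS to anywhere
sg2 = [("tcp", 8080, 8080, "10.0.0.0/8")]  # 8080 inside the private range
assert eni_allows([sg1, sg2], "tcp", 443, "93.184.216.34")       # sg1 matches
assert eni_allows([sg1, sg2], "tcp", 8080, "10.0.5.9")           # sg2 matches
assert not eni_allows([sg1, sg2], "tcp", 8080, "93.184.216.34")  # neither matches
```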

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What’s the difference between Security Groups and NACLs?”
  2. “Security Groups are stateful—what does that mean exactly?”
  3. “Can you block specific traffic with a Security Group? How?”
  4. “What happens when you attach multiple Security Groups to an instance?”
  5. “What’s the advantage of using Security Group references vs CIDR blocks?”
  6. “You have a connection timeout between two instances. How do you debug it?”
  7. “Can Security Groups span VPCs?”
  8. “What’s the maximum number of rules you can have in a Security Group?”
  9. “How do Security Groups work with Lambda functions?”
  10. “You allowed inbound traffic but the connection still fails. What could be wrong?”

Hints in Layers

Hint 1: Start with boto3 to fetch Security Group data

import boto3

def get_instance_security_groups(instance_id: str) -> list:
    """Get all Security Groups attached to an instance's ENIs."""
    ec2 = boto3.client('ec2')

    # Get instance details
    response = ec2.describe_instances(InstanceIds=[instance_id])
    instance = response['Reservations'][0]['Instances'][0]

    # Collect SG IDs from all network interfaces
    sg_ids = set()
    for eni in instance.get('NetworkInterfaces', []):
        for group in eni.get('Groups', []):
            sg_ids.add(group['GroupId'])

    # Get full SG details
    sg_response = ec2.describe_security_groups(GroupIds=list(sg_ids))
    return sg_response['SecurityGroups']

Hint 2: Model the rule checking logic

def check_rule_matches(rule: dict, port: int, protocol: str,
                       source_ip: str, source_sg_ids: list) -> bool:
    """Check if a single inbound rule allows the connection."""

    # Check protocol (-1 means all protocols)
    rule_protocol = rule.get('IpProtocol', '-1')
    if rule_protocol != '-1' and rule_protocol != protocol:
        return False

    # Check port range (for TCP/UDP)
    if protocol in ['tcp', 'udp', '6', '17']:
        from_port = rule.get('FromPort', 0)
        to_port = rule.get('ToPort', 65535)
        if not (from_port <= port <= to_port):
            return False

    # Check source - could be CIDR or SG reference
    # Check IP ranges
    for ip_range in rule.get('IpRanges', []):
        cidr = ip_range.get('CidrIp')
        if ip_in_cidr(source_ip, cidr):
            return True

    # Check SG references
    for sg_ref in rule.get('UserIdGroupPairs', []):
        if sg_ref.get('GroupId') in source_sg_ids:
            return True

    return False
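ip_in_cidr isn't shown above; the stdlib ipaddress module makes it a one-liner:

```python
import ipaddress

def ip_in_cidr(ip, cidr):
    """True if `ip` falls inside the `cidr` block. strict=False tolerates
    CIDRs whose host bits are set (e.g. 10.0.1.50/24)."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr, strict=False)

assert ip_in_cidr("10.0.1.50", "10.0.0.0/8")
assert ip_in_cidr("10.0.1.50", "10.0.1.50/32")
assert not ip_in_cidr("192.168.1.1", "10.0.0.0/8")
```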

Hint 3: Implement the full connectivity check

def can_connect(source_instance: str, target_instance: str,
                port: int, protocol: str = 'tcp') -> dict:
    """
    Check if source can connect to target on specified port.
    Returns detailed analysis of the path.
    """
    result = {
        'allowed': False,
        'source': get_instance_info(source_instance),
        'target': get_instance_info(target_instance),
        'outbound_check': None,
        'inbound_check': None,
        'blocked_at': None,
        'recommendation': None
    }

    # Get SGs for both instances
    source_sgs = get_instance_security_groups(source_instance)
    target_sgs = get_instance_security_groups(target_instance)

    source_sg_ids = [sg['GroupId'] for sg in source_sgs]
    target_ip = result['target']['private_ip']
    source_ip = result['source']['private_ip']

    # Check outbound from source (any SG allowing = pass)
    outbound_allowed = False
    for sg in source_sgs:
        for rule in sg.get('IpPermissionsEgress', []):
            if check_outbound_rule_matches(rule, port, protocol, target_ip):
                outbound_allowed = True
                result['outbound_check'] = {
                    'allowed': True,
                    'matched_sg': sg['GroupId'],
                    'matched_rule': rule
                }
                break
        if outbound_allowed:
            break

    if not outbound_allowed:
        result['blocked_at'] = 'outbound'
        result['recommendation'] = generate_outbound_fix(source_sgs[0], port, protocol)
        return result

    # Check inbound to target
    inbound_allowed = False
    for sg in target_sgs:
        for rule in sg.get('IpPermissions', []):
            if check_rule_matches(rule, port, protocol, source_ip, source_sg_ids):
                inbound_allowed = True
                result['inbound_check'] = {
                    'allowed': True,
                    'matched_sg': sg['GroupId'],
                    'matched_rule': rule
                }
                break
        if inbound_allowed:
            break

    if not inbound_allowed:
        result['blocked_at'] = 'inbound'
        result['recommendation'] = generate_inbound_fix(
            target_sgs[0], port, protocol, source_sg_ids[0]
        )
        return result

    result['allowed'] = True
    return result

Hint 4: Generate actionable fix recommendations

def generate_inbound_fix(target_sg: dict, port: int,
                         protocol: str, source_sg_id: str) -> str:
    """Generate AWS CLI command to fix blocked inbound traffic."""
    return f"""
To allow this connection, run:

aws ec2 authorize-security-group-ingress \\
  --group-id {target_sg['GroupId']} \\
  --protocol {protocol} \\
  --port {port} \\
  --source-group {source_sg_id}

Or in Terraform:

resource "aws_security_group_rule" "allow_from_source" {{
  type                     = "ingress"
  from_port                = {port}
  to_port                  = {port}
  protocol                 = "{protocol}"
  source_security_group_id = "{source_sg_id}"
  security_group_id        = "{target_sg['GroupId']}"
}}
"""

Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Firewall concepts (stateful/stateless) | "Computer Networks, Fifth Edition" by Tanenbaum | Ch. 8.9: Firewalls |
| TCP connection states | "TCP/IP Illustrated, Volume 1" by Stevens | Ch. 13: TCP Connection Management |
| AWS Security Groups deep dive | "AWS Certified Security Specialty Study Guide" | Security Groups Chapter |
| Network security monitoring | "The Practice of Network Security Monitoring" by Bejtlich | Ch. 2-4 |
| Python AWS SDK (boto3) | "Python for DevOps" by Gift, Behrman | AWS Chapter |
| CLI tool design | "The Linux Command Line" by Shotts | Ch. 25-27: Shell Scripting |

Project 3: VPC Flow Logs Analyzer

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Networking / Security Analysis
  • Software or Tool: AWS VPC Flow Logs, S3, Athena
  • Main Book: “The Practice of Network Security Monitoring” by Richard Bejtlich

What you’ll build: A system that ingests VPC Flow Logs, parses them, and provides real-time visibility into network traffic patterns, anomaly detection, and security insights.

Why it teaches AWS networking: VPC Flow Logs are how you SEE what’s actually happening on the network. Understanding them teaches you about IP flows, connection states, and how to detect both problems and attacks.

Core challenges you’ll face:

  • Parsing flow log format (version 2+ with custom fields) → maps to log parsing
  • Understanding actions and log status (ACCEPT/REJECT actions; OK, NODATA, SKIPDATA statuses) → maps to connection tracking
  • Correlating ENIs to resources (which instance is eni-xxx?) → maps to AWS metadata
  • Detecting anomalies (port scans, unusual traffic) → maps to security monitoring
  • Handling volume (millions of records/hour) → maps to data engineering
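The anomaly-detection bullet reduces, in its simplest form, to counting distinct destination ports per source among REJECTed flows. A hedged first cut — the threshold and the flow-tuple shape are illustrative choices, not the project's final design:

```python
from collections import defaultdict

def detect_port_scans(flows, threshold=5):
    """flows: iterable of (srcaddr, dstaddr, dstport, action) tuples.
    Flags sources whose REJECTed flows touch many distinct ports."""
    ports_by_src = defaultdict(set)
    for src, dst, dport, action in flows:
        if action == "REJECT":
            ports_by_src[src].add(dport)
    return {src for src, ports in ports_by_src.items() if len(ports) >= threshold}

flows = [("203.0.113.50", "10.0.1.10", p, "REJECT")
         for p in (22, 23, 80, 443, 3306, 5432)]
flows.append(("10.0.1.50", "10.0.2.100", 3306, "REJECT"))  # one-off, not a scan
assert detect_port_scans(flows) == {"203.0.113.50"}
```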

Key Concepts:

Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites: Python, SQL, understanding of network protocols

Real world outcome:

$ ./flow-analyzer dashboard

╔════════════════════════════════════════════════════════════════╗
║           VPC FLOW LOGS DASHBOARD (Last 24 hours)              ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║  TRAFFIC SUMMARY                                               ║
║  ─────────────────────────────────────────────────────────────║
║  Total Flows: 2,456,789                                        ║
║  Accepted: 2,401,234 (97.7%)                                   ║
║  Rejected: 55,555 (2.3%)                                       ║
║  Data Transferred: 1.2 TB                                      ║
║                                                                ║
║  TOP TALKERS (by bytes)                                        ║
║  ─────────────────────────────────────────────────────────────║
║  1. i-abc123 (web-server-1) → 10.0.0.0/8: 245 GB               ║
║  2. i-def456 (db-primary)   → 10.0.1.0/24: 189 GB              ║
║  3. i-ghi789 (app-server)   → 0.0.0.0/0: 156 GB                ║
║                                                                ║
║  ⚠️  SECURITY ALERTS                                           ║
║  ─────────────────────────────────────────────────────────────║
║  🔴 CRITICAL: Port scan detected                               ║
║     Source: 203.0.113.50 (external)                            ║
║     Target: 10.0.1.0/24 (private subnet)                       ║
║     Ports scanned: 22, 23, 80, 443, 3306, 5432                 ║
║     Recommendation: Block 203.0.113.50 in NACL                 ║
║                                                                ║
║  🟡 WARNING: Unusual outbound traffic                          ║
║     Source: i-xyz789 (10.0.2.50)                               ║
║     Destination: 185.143.223.x (known C2 server)               ║
║     Bytes: 2.3 GB over 4 hours                                 ║
║     Recommendation: Isolate instance, investigate              ║
║                                                                ║
║  REJECTED CONNECTIONS (Top Sources)                            ║
║  ─────────────────────────────────────────────────────────────║
║  1. 192.168.1.50:* → 10.0.1.100:22 (SSH blocked) - 12,345      ║
║  2. 10.0.1.50:* → 10.0.2.100:3306 (DB not allowed) - 5,432     ║
║                                                                ║
╚════════════════════════════════════════════════════════════════╝

$ ./flow-analyzer query "rejected traffic to port 22 in last hour"
Found 1,234 rejected flows to port 22:
  - 85% from external IPs (likely SSH brute force attempts)
  - 15% from internal IPs (misconfigured Security Groups)

Top external sources:
  185.143.223.x: 456 attempts (known bad IP - block in NACL)
  192.168.1.x: 234 attempts (RFC1918 - spoofed, drop at edge)

Implementation Hints: Flow log record format (v2):

version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
2 123456789012 eni-abc123 10.0.1.50 10.0.2.100 49152 443 6 25 1234 1639489200 1639489260 ACCEPT OK
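A minimal way to turn one of these v2 lines into named fields, assuming the default field order shown above (all values come back as strings; type conversion is a separate step):

```python
# Default v2 field order, matching the header line above.
V2_FIELDS = [
    "version", "account-id", "interface-id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log-status",
]

def parse_v2(line: str) -> dict:
    """Split a space-delimited v2 flow log record into named fields."""
    return dict(zip(V2_FIELDS, line.split()))

record = parse_v2(
    "2 123456789012 eni-abc123 10.0.1.50 10.0.2.100 "
    "49152 443 6 25 1234 1639489200 1639489260 ACCEPT OK"
)
print(record["dstport"], record["action"])  # 443 ACCEPT
```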

Processing pipeline:

# Pseudo-code
def process_flow_logs(s3_bucket, prefix):
    # Read from S3 (or Kinesis for real-time)
    for log_file in list_s3_objects(s3_bucket, prefix):
        records = parse_flow_log_file(log_file)

        for record in records:
            # Enrich with metadata
            record.source_instance = lookup_eni(record.interface_id)
            record.geo = geoip_lookup(record.srcaddr) if is_public(record.srcaddr) else None

            # Store in database (TimescaleDB, ClickHouse, etc.)
            insert_record(record)

            # Real-time anomaly detection
            if is_port_scan(record):
                alert("Port scan detected", record)

            if is_known_bad_ip(record.srcaddr):
                alert("Connection from known malicious IP", record)

def is_port_scan(record, window_seconds=60, port_threshold=10):
    # Check if same source hit many ports in short time
    recent_ports = query("""
        SELECT DISTINCT dstport FROM flows
        WHERE srcaddr = ? AND time > now() - interval ?
    """, record.srcaddr, window_seconds)

    return len(recent_ports) > port_threshold

Learning milestones:

  1. Parse and store flow logs efficiently → You understand the format
  2. Identify traffic patterns → You can analyze network behavior
  3. Detect security anomalies → You understand attack patterns
  4. Correlate with AWS resources → You connect network to infrastructure

Real World Outcome

When you complete this project, you’ll have a powerful network visibility tool that transforms raw VPC Flow Logs into actionable security intelligence. Here’s exactly what your tool will do:

# Start the flow analyzer daemon (processes logs from S3 in real-time)
$ ./flow-analyzer start --s3-bucket vpc-flow-logs-prod --region us-east-1

[2024-12-22 14:00:00] Flow Analyzer started
[2024-12-22 14:00:01] Connected to S3: vpc-flow-logs-prod
[2024-12-22 14:00:01] Loaded 156 ENI → Instance mappings
[2024-12-22 14:00:02] Processing backlog: 2,456 log files
[2024-12-22 14:00:15] Backlog processed: 12,456,789 flow records
[2024-12-22 14:00:15] Real-time processing active...

# View the live dashboard
$ ./flow-analyzer dashboard --refresh 5s

╔══════════════════════════════════════════════════════════════════════════════╗
║                    VPC FLOW LOGS DASHBOARD                                    ║
║                    Last updated: 2024-12-22 14:32:15 UTC                      ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  TRAFFIC SUMMARY (Last 24 Hours)                                             ║
║  ────────────────────────────────────────────────────────────────────────── ║
║                                                                              ║
║  ┌─────────────────┬────────────────┬─────────────────────────────────────┐ ║
║  │ Metric          │ Value          │ Graph (24h)                         │ ║
║  ├─────────────────┼────────────────┼─────────────────────────────────────┤ ║
║  │ Total Flows     │ 2,456,789      │ ▂▃▄▅▆▇█▇▆▅▄▅▆▇█▇▆▅▄▃▂▃▄▅          │ ║
║  │ Accepted        │ 2,401,234      │ 97.7% ████████████████████░░        │ ║
║  │ Rejected        │ 55,555         │  2.3% █░░░░░░░░░░░░░░░░░░░░░        │ ║
║  │ Data In         │ 892 GB         │ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆          │ ║
║  │ Data Out        │ 1.2 TB         │ ▃▄▅▆▇█▇▆▅▄▃▂▃▄▅▆▇█▇▆▅▄▃▂          │ ║
║  │ Unique Sources  │ 12,456         │                                     │ ║
║  │ Unique Dests    │ 8,234          │                                     │ ║
║  └─────────────────┴────────────────┴─────────────────────────────────────┘ ║
║                                                                              ║
║  TOP TALKERS (by bytes transferred)                                          ║
║  ────────────────────────────────────────────────────────────────────────── ║
║                                                                              ║
║  ┌────┬────────────────────────────┬──────────────┬────────────────────────┐║
║  │ #  │ Source                     │ Bytes        │ Top Destination        │║
║  ├────┼────────────────────────────┼──────────────┼────────────────────────┤║
║  │ 1  │ i-abc123 (web-server-1)   │ 245.6 GB     │ ALB (internal)         │║
║  │ 2  │ i-def456 (db-primary)     │ 189.2 GB     │ db-replica (10.0.11.x) │║
║  │ 3  │ i-ghi789 (app-server-1)   │ 156.8 GB     │ S3 (via endpoint)      │║
║  │ 4  │ i-jkl012 (batch-worker)   │ 98.4 GB      │ External APIs          │║
║  │ 5  │ i-mno345 (cache-server)   │ 67.2 GB      │ app-servers            │║
║  └────┴────────────────────────────┴──────────────┴────────────────────────┘║
║                                                                              ║
║  🚨 SECURITY ALERTS                                                          ║
║  ────────────────────────────────────────────────────────────────────────── ║
║                                                                              ║
║  🔴 CRITICAL [14:28:32] Port Scan Detected                                   ║
║     ┌──────────────────────────────────────────────────────────────────────┐║
║     │ Source: 203.0.113.50 (external - Tor exit node)                      │║
║     │ Target: 10.0.1.0/24 (public subnet)                                  │║
║     │ Ports scanned: 22, 23, 80, 443, 3306, 5432, 6379, 27017 (8 of 15)   │║
║     │ Duration: 45 seconds                                                  │║
║     │ Status: All REJECTED by Security Groups                              │║
║     │                                                                       │║
║     │ RECOMMENDED ACTION:                                                   │║
║     │ aws ec2 create-network-acl-entry --network-acl-id acl-xxx \          │║
║     │   --rule-number 50 --protocol -1 --cidr-block 203.0.113.50/32 \      │║
║     │   --egress false --rule-action deny                                   │║
║     └──────────────────────────────────────────────────────────────────────┘║
║                                                                              ║
║  🟡 WARNING [14:15:22] Unusual Outbound Traffic                              ║
║     ┌──────────────────────────────────────────────────────────────────────┐║
║     │ Source: i-xyz789 (10.0.2.50) - Name: batch-processor-3               │║
║     │ Destination: 185.143.223.100 (external)                              │║
║     │ GeoIP: Russia, Moscow                                                 │║
║     │ Threat Intel: Listed in AbuseIPDB (C2 server suspected)              │║
║     │ Bytes transferred: 2.3 GB over 4 hours                               │║
║     │ Pattern: Large outbound, minimal inbound (data exfiltration?)        │║
║     │                                                                       │║
║     │ RECOMMENDED ACTIONS:                                                  │║
║     │ 1. Isolate instance: aws ec2 modify-instance-attribute \             │║
║     │      --instance-id i-xyz789 --groups sg-isolated                     │║
║     │ 2. Create forensic snapshot before termination                       │║
║     │ 3. Review CloudTrail for instance compromise indicators              │║
║     └──────────────────────────────────────────────────────────────────────┘║
║                                                                              ║
║  🟢 INFO [13:45:00] New Communication Path Detected                          ║
║     i-web123 (web-tier) → i-cache456 (redis) on port 6379                   ║
║     First seen: 2024-12-22 13:45:00 (new deployment?)                       ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝

# Query specific traffic patterns
$ ./flow-analyzer query --sql "
  SELECT srcaddr, COUNT(*) as attempts, COUNT(DISTINCT dstport) as ports
  FROM flows
  WHERE action = 'REJECT'
    AND dstport IN (22, 23, 3389)
    AND start_time > now() - interval '1 hour'
  GROUP BY srcaddr
  HAVING COUNT(DISTINCT dstport) > 2
  ORDER BY attempts DESC
  LIMIT 10
"

┌─────────────────┬──────────┬───────┐
│ srcaddr         │ attempts │ ports │
├─────────────────┼──────────┼───────┤
│ 185.143.223.x   │ 1,456    │ 3     │
│ 203.0.113.50    │ 892      │ 3     │
│ 192.168.1.100   │ 234      │ 2     │  ← Internal! Misconfigured?
│ 45.33.32.156    │ 189      │ 3     │
└─────────────────┴──────────┴───────┘

# Generate security report
$ ./flow-analyzer report --format pdf --period "last 7 days" --output weekly-report.pdf

Generated: weekly-report.pdf
Contents:
  - Executive Summary
  - Traffic Volume Trends
  - Top Talkers Analysis
  - Security Incidents (12 alerts)
  - Rejected Traffic Analysis
  - Recommendations
  - Appendix: Raw Data

# Export to SIEM
$ ./flow-analyzer export --format splunk --dest "https://splunk.company.com:8088/services/collector"

Exported 2,456,789 records to Splunk

Architecture You’ll Build:

┌─────────────────────────────────────────────────────────────────────────────┐
│                         VPC FLOW LOGS ANALYZER                               │
│                                                                             │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────────────────────┐ │
│  │    VPC      │      │   S3        │      │      Flow Analyzer          │ │
│  │             │      │   Bucket    │      │                             │ │
│  │  ┌───────┐  │      │             │      │  ┌──────────────────────┐  │ │
│  │  │ ENI   │──┼──────┼─► Flow Logs ├──────┼──► Log Parser           │  │ │
│  │  └───────┘  │      │             │      │  └──────────┬───────────┘  │ │
│  │  ┌───────┐  │      │             │      │             │              │ │
│  │  │ ENI   │──┼──────┤             │      │  ┌──────────▼───────────┐  │ │
│  │  └───────┘  │      │             │      │  │ Enrichment Engine    │  │ │
│  │  ┌───────┐  │      │             │      │  │  - ENI → Instance    │  │ │
│  │  │ ENI   │──┼──────┤             │      │  │  - GeoIP lookup      │  │ │
│  │  └───────┘  │      │             │      │  │  - Threat Intel      │  │ │
│  │             │      └─────────────┘      │  └──────────┬───────────┘  │ │
│  └─────────────┘                           │             │              │ │
│                                            │  ┌──────────▼───────────┐  │ │
│                                            │  │ Analytics Engine     │  │ │
│                                            │  │  - Anomaly detection │  │ │
│                                            │  │  - Pattern matching  │  │ │
│                                            │  │  - Alerting          │  │ │
│                                            │  └──────────┬───────────┘  │ │
│                                            │             │              │ │
│                                            │  ┌──────────▼───────────┐  │ │
│                                            │  │ Storage (TimescaleDB │  │ │
│                                            │  │ or ClickHouse)       │  │ │
│                                            │  └──────────┬───────────┘  │ │
│                                            │             │              │ │
│                                            │  ┌──────────▼───────────┐  │ │
│                                            │  │ CLI / Dashboard      │  │ │
│                                            │  │  - Real-time view    │  │ │
│                                            │  │  - SQL queries       │  │ │
│                                            │  │  - Reports           │  │ │
│                                            │  └──────────────────────┘  │ │
│                                            └─────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘

The Core Question You’re Answering

“What traffic is actually flowing through my VPC, and how do I detect problems and attacks?”

VPC Flow Logs are your eyes into the network. Without them, you’re blind—you don’t know what’s communicating with what, whether traffic is being blocked, or if there’s a security incident. This project teaches you to transform raw metadata into actionable intelligence.


Concepts You Must Understand First

Stop and research these before coding:

  1. VPC Flow Log Record Format
    • What are the default fields in v2 flow logs?
    • What additional fields can you add in v3+?
    • What does each field actually represent?
    • Why are there “NODATA” and “SKIPDATA” entries?
    • Book Reference: AWS VPC Flow Logs Documentation
  2. Network Protocol Numbers
    • What does protocol “6” mean? (TCP) Protocol “17”? (UDP) Protocol “1”? (ICMP)
    • Where do these numbers come from? (IANA protocol numbers)
    • Why is this important for analyzing traffic?
    • Book Reference: “TCP/IP Illustrated, Volume 1” by Stevens — Ch. 1: Introduction
  3. Flow vs Packet
    • What’s the difference between a flow and a packet?
    • Why does AWS capture flows instead of packets?
    • What time window does a single flow record represent? (aggregation interval)
    • Book Reference: “The Practice of Network Security Monitoring” by Bejtlich — Ch. 5: Flow Data
  4. Security Attack Patterns
    • What does a port scan look like in flow logs?
    • How do you detect brute force attacks?
    • What indicates data exfiltration?
    • What’s a C2 (Command & Control) communication pattern?
    • Book Reference: “The Practice of Network Security Monitoring” by Bejtlich — Ch. 8: Analysis Techniques
  5. ENI (Elastic Network Interface) Architecture
    • What is an ENI and how does it relate to instances?
    • Why do flow logs capture at the ENI level, not instance level?
    • How do you map ENI to instance/resource?
    • Book Reference: AWS Documentation on ENIs
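The protocol numbers in item 2 come straight from the IANA registry, and Python's `socket` module exposes the common ones as constants, so a lookup table can be cross-checked against the standard library:

```python
import socket

# Common IANA protocol numbers seen in the flow log "protocol" field,
# cross-checked against the socket module's IPPROTO_* constants.
PROTOCOLS = {1: "ICMP", 6: "TCP", 17: "UDP"}

assert socket.IPPROTO_ICMP == 1
assert socket.IPPROTO_TCP == 6
assert socket.IPPROTO_UDP == 17

def protocol_name(number: int) -> str:
    """Map an IANA protocol number to a name, falling back to the number."""
    return PROTOCOLS.get(number, str(number))

print(protocol_name(6), protocol_name(17), protocol_name(47))  # TCP UDP 47
```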

Questions to Guide Your Design

Before implementing, think through these:

  1. Data Ingestion
    • Where should flow logs be delivered—S3, CloudWatch Logs, or Kinesis?
    • How do you handle the delay between event and log availability?
    • How do you process backlog when starting up?
    • What’s your strategy for handling millions of records per hour?
  2. Data Storage
    • What database is best for time-series flow data? (TimescaleDB, ClickHouse, etc.)
    • How do you partition data for efficient queries?
    • How long should you retain data?
    • How do you handle storage costs at scale?
  3. Enrichment
    • How do you map ENI IDs to instance names?
    • Should you do GeoIP lookups? Performance implications?
    • How do you integrate threat intelligence feeds?
    • How often should you refresh ENI mappings?
  4. Detection Logic
    • What thresholds define a “port scan”?
    • How do you distinguish attack traffic from legitimate scanning?
    • What’s “unusual” outbound traffic?
    • How do you avoid alert fatigue?
  5. Output & Alerting
    • How should alerts be delivered—Slack, PagerDuty, email?
    • What severity levels should you use?
    • How do you make the dashboard useful for both security and operations?

Thinking Exercise

Before coding, analyze these flow log samples:

Sample 1: Normal Web Traffic

2 123456789012 eni-web123 203.0.113.50 10.0.1.100 52341 443 6 25 15000 1639489200 1639489260 ACCEPT OK
2 123456789012 eni-web123 10.0.1.100 203.0.113.50 443 52341 6 30 125000 1639489200 1639489260 ACCEPT OK

Questions:

  • Which direction is the client → server traffic?
  • How many bytes did the server send vs receive?
  • What service is being accessed?
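One way to check your answers: sum the bytes per direction for the two records above, treating port 443 as the server side.

```python
records = [
    # (srcaddr, dstaddr, srcport, dstport, bytes) from Sample 1
    ("203.0.113.50", "10.0.1.100", 52341, 443, 15000),
    ("10.0.1.100", "203.0.113.50", 443, 52341, 125000),
]

server_ip = "10.0.1.100"
to_server = sum(b for s, d, sp, dp, b in records if d == server_ip)
from_server = sum(b for s, d, sp, dp, b in records if s == server_ip)

print(f"client -> server: {to_server} bytes")    # the request side
print(f"server -> client: {from_server} bytes")  # the response side (HTTPS)
```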

Sample 2: Port Scan

2 123456789012 eni-web123 185.143.223.x 10.0.1.100 45123 22 6 1 40 1639489200 1639489201 REJECT OK
2 123456789012 eni-web123 185.143.223.x 10.0.1.100 45124 23 6 1 40 1639489201 1639489202 REJECT OK
2 123456789012 eni-web123 185.143.223.x 10.0.1.100 45125 80 6 1 40 1639489202 1639489203 REJECT OK
2 123456789012 eni-web123 185.143.223.x 10.0.1.100 45126 443 6 1 40 1639489203 1639489204 REJECT OK
2 123456789012 eni-web123 185.143.223.x 10.0.1.100 45127 3306 6 1 40 1639489204 1639489205 REJECT OK

Questions:

  • How do you know this is a port scan?
  • What’s the scan rate (ports per second)?
  • Why are all actions REJECT?
  • What NACL rule would block this?
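A quick way to quantify the pattern in Sample 2 is to pull the start timestamps and destination ports out of the five REJECT records and compute the scan rate:

```python
# (start_epoch, dstport) from the five REJECT records above
probes = [
    (1639489200, 22), (1639489201, 23), (1639489202, 80),
    (1639489203, 443), (1639489204, 3306),
]

ports = {port for _, port in probes}
duration = max(t for t, _ in probes) - min(t for t, _ in probes)  # seconds
rate = len(ports) / duration if duration else float("inf")

print(f"{len(ports)} distinct ports in {duration}s (~{rate:.2f} ports/sec)")
```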

Sample 3: Possible Data Exfiltration

2 123456789012 eni-app456 10.0.10.50 185.143.223.100 49152 443 6 50000 52428800 1639400000 1639486400 ACCEPT OK
2 123456789012 eni-app456 185.143.223.100 10.0.10.50 443 49152 6 1000 50000 1639400000 1639486400 ACCEPT OK

Questions:

  • How much data was sent outbound? (52428800 bytes = 50 MB)
  • Over what time period? (86400 seconds = 24 hours)
  • Why is the ratio suspicious? (50 MB out, 50 KB in)
  • What’s the destination? (External IP—needs investigation)
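The asymmetry in Sample 3 can be turned into a simple heuristic: flag flows whose outbound volume and outbound/inbound byte ratio both exceed thresholds. The cutoffs below are illustrative starting points, not tuned values:

```python
def looks_like_exfiltration(bytes_out: int, bytes_in: int,
                            min_ratio: float = 100.0,
                            min_bytes_out: int = 10 * 1024 * 1024) -> bool:
    """Flag large, heavily asymmetric outbound transfers."""
    if bytes_out < min_bytes_out:
        return False  # too small to care about
    ratio = bytes_out / max(bytes_in, 1)  # avoid division by zero
    return ratio >= min_ratio

# Sample 3: 50 MB out, 50 KB in over 24 hours -> ratio ~1049
print(looks_like_exfiltration(52428800, 50000))   # flagged
print(looks_like_exfiltration(15000, 125000))     # normal web request
```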

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What are VPC Flow Logs and what do they capture?”
  2. “What’s the difference between ACCEPT and REJECT in flow logs?”
  3. “How would you detect a port scan using flow logs?”
  4. “What are the limitations of VPC Flow Logs?” (no payload, no DNS queries to Amazon DNS)
  5. “How would you set up flow logs for compliance/audit requirements?”
  6. “What’s the performance impact of enabling flow logs?”
  7. “How do you analyze flow logs at scale?”
  8. “What’s the difference between flow logs sent to S3 vs CloudWatch Logs?”
  9. “How would you use flow logs to troubleshoot a connectivity issue?”
  10. “What security threats can you detect with flow logs?”

Hints in Layers

Hint 1: Parse flow log records efficiently

from dataclasses import dataclass
from typing import Optional
import ipaddress

@dataclass
class FlowRecord:
    version: int
    account_id: str
    interface_id: str
    srcaddr: str
    dstaddr: str
    srcport: int
    dstport: int
    protocol: int
    packets: int
    bytes: int
    start: int
    end: int
    action: str
    log_status: str

    @classmethod
    def from_line(cls, line: str) -> Optional['FlowRecord']:
        """Parse a single flow log line."""
        parts = line.strip().split()
        if len(parts) < 14 or parts[0] != '2':  # Version 2
            return None

        return cls(
            version=int(parts[0]),
            account_id=parts[1],
            interface_id=parts[2],
            srcaddr=parts[3],
            dstaddr=parts[4],
            srcport=int(parts[5]) if parts[5] != '-' else 0,
            dstport=int(parts[6]) if parts[6] != '-' else 0,
            protocol=int(parts[7]) if parts[7] != '-' else 0,
            packets=int(parts[8]) if parts[8] != '-' else 0,
            bytes=int(parts[9]) if parts[9] != '-' else 0,
            start=int(parts[10]) if parts[10] != '-' else 0,
            end=int(parts[11]) if parts[11] != '-' else 0,
            action=parts[12],
            log_status=parts[13]
        )

    def is_external_source(self) -> bool:
        """Check if source IP is external (not RFC1918)."""
        try:
            ip = ipaddress.ip_address(self.srcaddr)
            return not ip.is_private
        except ValueError:
            return False

    def protocol_name(self) -> str:
        """Convert protocol number to name."""
        protocols = {1: 'ICMP', 6: 'TCP', 17: 'UDP'}
        return protocols.get(self.protocol, str(self.protocol))

Hint 2: Build the ENI-to-Instance mapping

import boto3

class ENIMapper:
    def __init__(self):
        self.ec2 = boto3.client('ec2')
        self._cache = {}

    def refresh_cache(self):
        """Refresh ENI to instance mapping."""
        self._cache = {}

        # Get all ENIs
        paginator = self.ec2.get_paginator('describe_network_interfaces')
        for page in paginator.paginate():
            for eni in page['NetworkInterfaces']:
                eni_id = eni['NetworkInterfaceId']
                attachment = eni.get('Attachment', {})

                self._cache[eni_id] = {
                    'instance_id': attachment.get('InstanceId'),
                    'private_ip': eni.get('PrivateIpAddress'),
                    'vpc_id': eni.get('VpcId'),
                    'subnet_id': eni.get('SubnetId'),
                    'description': eni.get('Description', '')
                }

        # Enrich with instance names
        instance_ids = [v['instance_id'] for v in self._cache.values()
                       if v['instance_id']]
        if instance_ids:
            instances = self.ec2.describe_instances(InstanceIds=instance_ids)
            for reservation in instances['Reservations']:
                for instance in reservation['Instances']:
                    name = next(
                        (tag['Value'] for tag in instance.get('Tags', [])
                         if tag['Key'] == 'Name'),
                        instance['InstanceId']
                    )
                    # Update all ENIs for this instance
                    for eni_id, data in self._cache.items():
                        if data['instance_id'] == instance['InstanceId']:
                            data['instance_name'] = name

    def lookup(self, eni_id: str) -> dict:
        """Look up ENI details."""
        return self._cache.get(eni_id, {'unknown': True})

Hint 3: Implement port scan detection

from collections import defaultdict
from datetime import datetime, timedelta
from typing import Optional

class PortScanDetector:
    def __init__(self, window_seconds=60, port_threshold=10):
        self.window_seconds = window_seconds
        self.port_threshold = port_threshold
        # {source_ip: [(timestamp, port), ...]}
        self.activity = defaultdict(list)

    def check(self, record: FlowRecord) -> Optional[dict]:
        """Check if this record indicates a port scan."""
        if record.action != 'REJECT':
            return None  # Only care about rejected (probing) traffic

        now = datetime.utcnow()
        source = record.srcaddr

        # Add this activity
        self.activity[source].append((now, record.dstport))

        # Clean old activity
        cutoff = now - timedelta(seconds=self.window_seconds)
        self.activity[source] = [
            (ts, port) for ts, port in self.activity[source]
            if ts > cutoff
        ]

        # Check for port scan pattern
        recent_ports = set(port for _, port in self.activity[source])
        if len(recent_ports) >= self.port_threshold:
            return {
                'type': 'PORT_SCAN',
                'severity': 'HIGH',
                'source_ip': source,
                'ports_scanned': sorted(recent_ports),
                'window_seconds': self.window_seconds,
                'recommendation': f"Block {source} in NACL"
            }

        return None

Hint 4: Create the analytics dashboard

from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from rich.live import Live

class Dashboard:
    def __init__(self, db, eni_mapper):
        self.db = db
        self.eni_mapper = eni_mapper  # needed for name enrichment below
        self.console = Console()

    def render(self):
        """Render the dashboard."""
        # Traffic summary
        summary = self.db.query("""
            SELECT
                COUNT(*) as total_flows,
                SUM(CASE WHEN action = 'ACCEPT' THEN 1 ELSE 0 END) as accepted,
                SUM(CASE WHEN action = 'REJECT' THEN 1 ELSE 0 END) as rejected,
                SUM(bytes) as total_bytes
            FROM flows
            WHERE start_time > now() - interval '24 hours'
        """).fetchone()

        # Top talkers
        top_talkers = self.db.query("""
            SELECT
                srcaddr,
                SUM(bytes) as total_bytes,
                COUNT(DISTINCT dstaddr) as unique_destinations
            FROM flows
            WHERE action = 'ACCEPT'
              AND start_time > now() - interval '24 hours'
            GROUP BY srcaddr
            ORDER BY total_bytes DESC
            LIMIT 5
        """).fetchall()

        # Build display
        table = Table(title="Top Talkers (24h)")
        table.add_column("Source IP")
        table.add_column("Bytes", justify="right")
        table.add_column("Destinations", justify="right")

        for row in top_talkers:
            # Enrich with instance name (lookup_by_ip needs an IP → ENI
            # index; extend ENIMapper to build one from private_ip)
            eni_info = self.eni_mapper.lookup_by_ip(row['srcaddr'])
            name = eni_info.get('instance_name', row['srcaddr'])
            table.add_row(
                name,
                format_bytes(row['total_bytes']),
                str(row['unique_destinations'])
            )

        self.console.print(table)
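The dashboard sketch above calls a `format_bytes` helper that isn't shown; a minimal 1024-based version could look like this:

```python
def format_bytes(n: float) -> str:
    """Render a byte count with a unit suffix (1024-based), e.g. '245.6 GB'."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} PB"

print(format_bytes(263_703_871_488))  # roughly "245.6 GB"
```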

Books That Will Help

Topic Book Chapter
Network flow analysis “The Practice of Network Security Monitoring” by Bejtlich Ch. 5: Flow Data, Ch. 8: Analysis
TCP/IP fundamentals “TCP/IP Illustrated, Volume 1” by Stevens Ch. 1-4: Protocol basics
Security monitoring “Applied Network Security Monitoring” by Sanders Ch. 5-7: Collection and Analysis
Time-series databases “Designing Data-Intensive Applications” by Kleppmann Ch. 3: Storage and Retrieval
Python data processing “Data Engineering with Python” by Crickard Ch. 4-6: Pipelines
AWS networking “AWS Certified Advanced Networking Study Guide” VPC Flow Logs section

Project 4: VPC Peering vs Transit Gateway Lab

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: AWS CDK, CloudFormation
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Cloud Architecture / Networking
  • Software or Tool: AWS VPC Peering, Transit Gateway
  • Main Book: “AWS Certified Advanced Networking Study Guide”

What you’ll build: Two parallel network architectures—one using VPC Peering (full mesh) and one using Transit Gateway (hub-and-spoke)—with performance tests and cost analysis to understand when to use each.

Why it teaches AWS networking: The VPC Peering vs Transit Gateway decision is one of the most important architectural choices. This project lets you experience both, measure the differences, and make informed decisions.

Core challenges you’ll face:

  • Full mesh complexity (n VPCs = n(n-1)/2 peering connections) → maps to scaling limits
  • Non-transitive routing (VPC A↔B↔C doesn’t mean A↔C) → maps to routing behavior
  • TGW route table design (attachment associations, propagations) → maps to hub routing
  • Latency measurement (extra hop in TGW) → maps to performance tradeoffs
  • Cost analysis (TGW hourly + data processing vs peering data transfer) → maps to FinOps

Key Concepts:

Difficulty: Advanced Time estimate: 2 weeks Prerequisites: VPC fundamentals, Terraform

Real world outcome:

$ terraform apply -var="architecture=peering"
Creating VPC Peering architecture (5 VPCs, full mesh)...
VPC-A ↔ VPC-B: peering-001
VPC-A ↔ VPC-C: peering-002
VPC-A ↔ VPC-D: peering-003
VPC-A ↔ VPC-E: peering-004
VPC-B ↔ VPC-C: peering-005
... (10 peering connections total)
Route tables: 10 routes per VPC

$ terraform apply -var="architecture=tgw"
Creating Transit Gateway architecture (5 VPCs, hub-and-spoke)...
Transit Gateway: tgw-0abc123
VPC-A attachment: tgw-attach-001
VPC-B attachment: tgw-attach-002
... (5 attachments total)
Route tables: 1 route per VPC (0.0.0.0/0 → TGW)

$ ./network-benchmark

╔══════════════════════════════════════════════════════════════╗
║         VPC PEERING vs TRANSIT GATEWAY COMPARISON            ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  CONFIGURATION COMPLEXITY                                    ║
║  ──────────────────────────────────────────────────────────  ║
║  Peering (5 VPCs):     10 connections, 50 route entries      ║
║  TGW (5 VPCs):         1 TGW, 5 attachments, 5 route entries ║
║                                                              ║
║  LATENCY (VPC-A to VPC-E, avg of 1000 pings)                 ║
║  ──────────────────────────────────────────────────────────  ║
║  VPC Peering:          0.3ms (direct connection)             ║
║  Transit Gateway:      0.5ms (+0.2ms for TGW hop)            ║
║                                                              ║
║  BANDWIDTH (iperf3, same AZ)                                 ║
║  ──────────────────────────────────────────────────────────  ║
║  VPC Peering:          ~unlimited (AWS backbone)             ║
║  Transit Gateway:      50 Gbps per attachment (100 Gbps now) ║
║                                                              ║
║  MONTHLY COST (5 VPCs, 1 TB cross-VPC traffic)               ║
║  ──────────────────────────────────────────────────────────  ║
║  VPC Peering:          $10.00 (data transfer only)           ║
║  Transit Gateway:      $200.00 ($36/attachment + $0.02/GB)   ║
║                                                              ║
║  SCALING TO 50 VPCs                                          ║
║  ──────────────────────────────────────────────────────────  ║
║  Peering connections:  1,225 (50*49/2) - UNMANAGEABLE!       ║
║  TGW attachments:      50 - easy to manage                   ║
║                                                              ║
╚══════════════════════════════════════════════════════════════╝

RECOMMENDATION:
- < 10 VPCs, low traffic: VPC Peering (cost-effective)
- > 10 VPCs or hybrid connectivity: Transit Gateway (manageable)
- Latency-critical applications: VPC Peering (direct path)

Implementation Hints: VPC Peering full mesh calculation:

For n VPCs:
  Peering connections = n * (n-1) / 2
  Route entries per VPC = n - 1

  5 VPCs:  10 connections, 4 routes each
  10 VPCs: 45 connections, 9 routes each
  50 VPCs: 1,225 connections, 49 routes each (NIGHTMARE!)

Transit Gateway scales linearly:

For n VPCs:
  TGW attachments = n
  Route entries per VPC = 1 (point to TGW)

  5 VPCs:  5 attachments
  50 VPCs: 50 attachments
  500 VPCs: 500 attachments (TGW supports 5,000)

Key insight about transitivity:

VPC Peering: A ↔ B, B ↔ C does NOT mean A ↔ C
Transit Gateway: A → TGW → C works automatically

# This is why peering becomes unmanageable at scale
# Every VPC needs direct peering to every other VPC

Learning milestones:

  1. Both architectures deploy successfully → You understand the components
  2. Traffic flows in both designs → You understand routing
  3. Measure latency difference → You understand the TGW hop
  4. Calculate cost breakeven point → You can make business decisions
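
For milestone 4, a cost-comparison sketch under illustrative us-east-1 list prices (verify the constants against the current AWS pricing pages before relying on them):

```python
# Illustrative us-east-1 prices; check the AWS pricing pages for current rates.
TGW_ATTACH_HOURLY = 0.05   # $ per attachment-hour
TGW_DATA = 0.02            # $ per GB processed by the TGW
PEERING_DATA = 0.01        # $ per GB, billed on each side (cross-AZ; same-AZ is free)
HOURS_PER_MONTH = 730

def peering_monthly(gb: float) -> float:
    # Data transfer is the only peering charge (billed in both directions).
    return gb * PEERING_DATA * 2

def tgw_monthly(num_vpcs: int, gb: float) -> float:
    # Per-attachment hourly fee plus per-GB data processing.
    return num_vpcs * TGW_ATTACH_HOURLY * HOURS_PER_MONTH + gb * TGW_DATA

print(f"5 VPCs, 1 TB/month: peering ${peering_monthly(1024):.2f}, TGW ${tgw_monthly(5, 1024):.2f}")
```

On raw dollars peering always wins, so the real "breakeven" is operational: the VPC count (roughly 10) at which managing a full mesh costs more in engineering time than the TGW premium.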

Project 5: NAT Gateway Deep Dive & Cost Optimizer

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Bash
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Cloud Networking / FinOps
  • Software or Tool: AWS NAT Gateway, VPC Endpoints
  • Main Book: “Cloud FinOps” by J.R. Storment

What you’ll build: A tool that analyzes NAT Gateway traffic, identifies cost-saving opportunities (like using VPC Endpoints for AWS services), and provides recommendations with projected savings.

Why it teaches AWS networking: NAT Gateway is often the #1 surprise cost in AWS bills. Understanding why traffic goes through NAT (vs. VPC Endpoints) teaches you about routing, endpoints, and cost optimization.

Core challenges you’ll face:

  • Analyzing NAT Gateway bytes (CloudWatch metrics, flow logs) → maps to traffic analysis
  • Identifying AWS service traffic (S3, DynamoDB, ECR, etc.) → maps to endpoint candidates
  • Calculating endpoint savings (NAT costs vs endpoint costs) → maps to FinOps
  • Route table impact (how endpoints change routing) → maps to routing precedence

Key Concepts:

Difficulty: Intermediate Time estimate: 1 week Prerequisites: AWS billing understanding, CloudWatch

Real world outcome:

$ ./nat-optimizer analyze --vpc vpc-0abc123

╔════════════════════════════════════════════════════════════════╗
║           NAT GATEWAY COST ANALYSIS (Last 30 days)             ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║  NAT GATEWAY SUMMARY                                           ║
║  ─────────────────────────────────────────────────────────────║
║  NAT Gateway: nat-0abc123 (us-east-1a)                         ║
║  Data Processed: 15.2 TB                                       ║
║  Current Cost: $683.10 ($0.045/GB processing)                  ║
║  Hourly Cost: $32.40 (720 hours × $0.045)                      ║
║  TOTAL: $715.50/month                                          ║
║                                                                ║
║  TRAFFIC BREAKDOWN (by destination)                            ║
║  ─────────────────────────────────────────────────────────────║
║  1. S3 (via regional prefix list):   8.5 TB (55.9%)            ║
║  2. ECR (Elastic Container Registry): 3.2 TB (21.1%)           ║
║  3. DynamoDB:                         1.8 TB (11.8%)           ║
║  4. Internet (non-AWS):               1.7 TB (11.2%)           ║
║                                                                ║
║  💰 SAVINGS OPPORTUNITIES                                      ║
║  ─────────────────────────────────────────────────────────────║
║                                                                ║
║  S3 Gateway Endpoint (FREE!)                                   ║
║    Current: 8.5 TB × $0.045 = $382.50                          ║
║    With Endpoint: $0.00 (gateway endpoints are free)           ║
║    SAVINGS: $382.50/month                                      ║
║                                                                ║
║  DynamoDB Gateway Endpoint (FREE!)                             ║
║    Current: 1.8 TB × $0.045 = $81.00                           ║
║    With Endpoint: $0.00                                        ║
║    SAVINGS: $81.00/month                                       ║
║                                                                ║
║  ECR Interface Endpoint                                        ║
║    Current: 3.2 TB × $0.045 = $144.00                          ║
║    Endpoint hourly: $14.60/month (2 AZs × $0.01/hr × 730)      ║
║    Endpoint data: 3.2 TB × $0.01 = $32.00                      ║
║    SAVINGS: $97.40/month                                       ║
║                                                                ║
║  ═══════════════════════════════════════════════════════════  ║
║  TOTAL POTENTIAL SAVINGS: $560.90/month (78.4% reduction!)     ║
║  REMAINING COST: $154.60 (internet traffic + endpoint fees)    ║
║  ═══════════════════════════════════════════════════════════  ║
║                                                                ║
╚════════════════════════════════════════════════════════════════╝

$ ./nat-optimizer deploy-endpoints --vpc vpc-0abc123 --dry-run
Would create:
  - S3 Gateway Endpoint (vpce-gw-s3)
  - DynamoDB Gateway Endpoint (vpce-gw-ddb)
  - ECR Interface Endpoint (vpce-if-ecr-api, vpce-if-ecr-dkr)

Route table changes:
  - rtb-private-a: Add S3 prefix list → vpce-gw-s3
  - rtb-private-b: Add S3 prefix list → vpce-gw-s3
  - rtb-private-a: Add DynamoDB prefix list → vpce-gw-ddb
  - rtb-private-b: Add DynamoDB prefix list → vpce-gw-ddb

Implementation Hints: The key insight: Gateway Endpoints are FREE (for S3 and DynamoDB), while Interface Endpoints cost ~$7.30/month per AZ but are much cheaper than NAT Gateway for high-traffic services.

# Pseudo-code for traffic analysis
def analyze_nat_traffic(vpc_id, days=30):
    # Get NAT Gateway CloudWatch metrics
    nat_gateways = get_nat_gateways(vpc_id)
    traffic = {'s3': 0, 'dynamodb': 0, 'ecr': 0, 'internet': 0}

    for nat in nat_gateways:
        # Get total bytes processed (cross-check against the flow-log sums)
        metrics = cloudwatch.get_metric_statistics(
            Namespace='AWS/NATGateway',
            MetricName='BytesOutToDestination',
            Dimensions=[{'Name': 'NatGatewayId', 'Value': nat.id}],
            Period=86400,
            Statistics=['Sum']
        )

        # Analyze flow logs to determine destinations
        flows = query_flow_logs(f"""
            SELECT dstaddr, SUM(bytes) as total_bytes
            FROM flow_logs
            WHERE interface_id = '{nat.eni_id}'
            AND action = 'ACCEPT'
            GROUP BY dstaddr
            ORDER BY total_bytes DESC
        """)

        # Categorize by destination
        for flow in flows:
            if is_s3_ip(flow.dstaddr):
                traffic['s3'] += flow.total_bytes
            elif is_dynamodb_ip(flow.dstaddr):
                traffic['dynamodb'] += flow.total_bytes
            elif is_ecr_ip(flow.dstaddr):
                traffic['ecr'] += flow.total_bytes
            else:
                traffic['internet'] += flow.total_bytes

    return calculate_savings(traffic)
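
The `calculate_savings` step at the end might look like the sketch below, using illustrative pricing (gateway endpoints for S3/DynamoDB are free; interface endpoints bill per AZ-hour plus per GB — verify rates against current AWS pricing):

```python
# Illustrative us-east-1 pricing; verify before relying on it.
NAT_DATA = 0.045            # $/GB processed by NAT Gateway
IF_ENDPOINT_HOURLY = 0.01   # $/AZ-hour for an interface endpoint
IF_ENDPOINT_DATA = 0.01     # $/GB processed by an interface endpoint
HOURS, AZS = 730, 2

def calculate_savings(traffic_gb: dict) -> dict:
    savings = {}
    # Gateway endpoints (S3, DynamoDB) are free: the NAT charge disappears entirely.
    for svc in ("s3", "dynamodb"):
        savings[svc] = traffic_gb.get(svc, 0.0) * NAT_DATA
    # Interface endpoints (e.g. ECR): NAT charge avoided, minus endpoint costs.
    ecr = traffic_gb.get("ecr", 0.0)
    endpoint_cost = AZS * IF_ENDPOINT_HOURLY * HOURS + ecr * IF_ENDPOINT_DATA
    savings["ecr"] = ecr * NAT_DATA - endpoint_cost
    return savings

print(calculate_savings({"s3": 8500, "dynamodb": 1800, "ecr": 3200}))
```

With the traffic figures from the report above, this reproduces the per-service savings shown in the analysis output.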

Learning milestones:

  1. Identify traffic patterns through NAT → You understand outbound flow
  2. Calculate endpoint savings correctly → You understand pricing
  3. Deploy endpoints without breaking traffic → You understand routing
  4. See cost reduction in AWS bill → You’ve made a real impact

Project 6: Multi-Account Network with AWS Organizations

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: AWS CDK, CloudFormation StackSets
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Enterprise Architecture / Networking
  • Software or Tool: AWS Organizations, RAM, Transit Gateway
  • Main Book: “AWS Certified Solutions Architect Professional Study Guide”

What you’ll build: A multi-account network architecture with a shared services VPC (containing NAT Gateways, VPC Endpoints), application VPCs in separate accounts, and Transit Gateway connecting everything—all shared via Resource Access Manager.

Why it teaches AWS networking: Enterprise AWS is multi-account. Understanding how to share networking resources across accounts, maintain isolation while enabling connectivity, and centralize common infrastructure is essential for cloud architects.

Core challenges you’ll face:

  • Resource Access Manager (sharing TGW, subnets across accounts) → maps to multi-account patterns
  • Centralized egress (shared NAT, single point of control) → maps to hub-and-spoke
  • Centralized VPC Endpoints (cost optimization, management) → maps to shared services
  • Route propagation (TGW route tables, blackhole routes) → maps to transit routing
  • Security boundaries (what CAN vs SHOULD communicate) → maps to network segmentation

Key Concepts:

Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Multi-account AWS setup, Terraform, Transit Gateway basics

Real world outcome:

MULTI-ACCOUNT NETWORK ARCHITECTURE

┌──────────────────────────────────────────────────────────────────┐
│                     MANAGEMENT ACCOUNT                            │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  AWS Organizations                                          │  │
│  │  Resource Access Manager (shares TGW, Subnets)              │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│                     NETWORK ACCOUNT                               │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  Shared Services VPC (10.0.0.0/16)                        │    │
│  │  ├── NAT Gateways (centralized egress)                    │    │
│  │  ├── VPC Endpoints (S3, DynamoDB, ECR, etc.)             │    │
│  │  └── Transit Gateway (hub)                                │    │
│  └──────────────────────────────────────────────────────────┘    │
│                              │                                    │
│              ┌───────────────┼───────────────┐                    │
│              ▼               ▼               ▼                    │
└──────────────────────────────────────────────────────────────────┘
               │               │               │
               ▼               ▼               ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│  DEV ACCOUNT    │ │  STAGING ACCOUNT│ │  PROD ACCOUNT   │
│ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │
│ │ App VPC     │ │ │ │ App VPC     │ │ │ │ App VPC     │ │
│ │ 10.1.0.0/16 │ │ │ │ 10.2.0.0/16 │ │ │ │ 10.3.0.0/16 │ │
│ │             │ │ │ │             │ │ │ │             │ │
│ │ TGW Attach  │ │ │ │ TGW Attach  │ │ │ │ TGW Attach  │ │
│ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │
└─────────────────┘ └─────────────────┘ └─────────────────┘

$ terraform apply

Network Account:
  Transit Gateway: tgw-0net123 (shared via RAM)
  Shared Services VPC: vpc-0shared
  NAT Gateway: nat-0central
  VPC Endpoints: 8 endpoints (S3, DynamoDB, ECR, SSM, etc.)

Dev Account:
  TGW Attachment: tgw-attach-dev (accepted from RAM share)
  App VPC: vpc-0dev (10.1.0.0/16)
  Route: 0.0.0.0/0 → TGW → Shared Services → NAT → Internet

Prod Account:
  TGW Attachment: tgw-attach-prod
  App VPC: vpc-0prod (10.3.0.0/16)
  ISOLATED: Cannot reach Dev VPC (TGW route table segmentation)

# Test connectivity
$ aws-vault exec dev -- ssh app-server
[dev-app]$ curl https://s3.amazonaws.com/mybucket/test.txt
OK (via centralized VPC endpoint - no NAT!)

[dev-app]$ ping 10.3.0.50  # Prod server
Request timed out (blocked by TGW route table - no route to prod)

Implementation Hints: Key architecture decisions:

  1. Centralized Egress: All outbound traffic from spoke VPCs routes through the shared services VPC. This gives you:
    • Single NAT Gateway bill (not one per VPC)
    • Central firewall/inspection point
    • Unified logging
  2. Route Table Segmentation: Use separate TGW route tables to control what can talk to what:

TGW Route Table: “shared-services”
  • All VPC attachments associated
  • Routes to all VPCs (hub access)

TGW Route Table: “dev”
  • Dev VPC attachment associated
  • Route to shared-services VPC: 10.0.0.0/16
  • Route to dev VPC: 10.1.0.0/16
  • NO route to prod VPC (isolation!)

TGW Route Table: “prod”
  • Prod VPC attachment associated
  • Route to shared-services VPC: 10.0.0.0/16
  • Route to prod VPC: 10.3.0.0/16
  • NO route to dev VPC

  3. Centralized VPC Endpoints: In the network account:

resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.shared_services.id
  service_name = "com.amazonaws.${var.region}.s3"
  # This endpoint is reachable from spoke VPCs via TGW
}

Spoke VPCs route S3 traffic: VPC → TGW → Shared VPC → Endpoint
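
The isolation rules can be sanity-checked offline with a toy longest-prefix-match model of the TGW route tables (hypothetical, with the CIDRs from the diagram above):

```python
import ipaddress

# Toy model: each TGW route table is {destination CIDR: target attachment}.
ROUTE_TABLES = {
    "dev":  {"10.0.0.0/16": "shared-services", "10.1.0.0/16": "dev"},
    "prod": {"10.0.0.0/16": "shared-services", "10.3.0.0/16": "prod"},
}

def lookup(table: str, dst_ip: str):
    """Longest-prefix match; None models 'no route' (traffic is dropped)."""
    ip = ipaddress.ip_address(dst_ip)
    matches = [c for c in ROUTE_TABLES[table] if ip in ipaddress.ip_network(c)]
    if not matches:
        return None
    best = max(matches, key=lambda c: ipaddress.ip_network(c).prefixlen)
    return ROUTE_TABLES[table][best]

print(lookup("dev", "10.0.5.9"))    # shared-services: reachable
print(lookup("dev", "10.3.0.50"))   # None: dev cannot reach prod
```

Because the dev route table simply has no entry covering 10.3.0.0/16, traffic to prod is dropped at the TGW — the same behavior the ping test below demonstrates.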


Learning milestones:

  1. Resources share correctly via RAM → You understand cross-account sharing
  2. Spoke VPCs reach internet via centralized NAT → You understand transit routing
  3. Dev cannot reach Prod → You understand TGW route table segmentation
  4. All AWS API calls use centralized endpoints → You understand endpoint routing

Project 7: Site-to-Site VPN with Simulated On-Premises

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: CloudFormation, Manual (console + strongSwan)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Hybrid Networking / VPN
  • Software or Tool: AWS Site-to-Site VPN, strongSwan, Libreswan
  • Main Book: “Computer Networks, Fifth Edition” by Tanenbaum (Chapter on VPNs)

What you’ll build: A complete Site-to-Site VPN setup with an EC2 instance running strongSwan to simulate an on-premises data center, demonstrating IPsec tunnel establishment, BGP routing, and failover.

Why it teaches AWS networking: Hybrid connectivity is essential for enterprise AWS. By simulating the “on-premises” side with strongSwan, you’ll understand both ends of the VPN tunnel—something most engineers never see.

Core challenges you’ll face:

  • IPsec configuration (Phase 1, Phase 2, encryption algorithms) → maps to VPN protocols
  • BGP configuration (ASNs, route advertisement) → maps to dynamic routing
  • Tunnel redundancy (two tunnels, failover testing) → maps to high availability
  • NAT-Traversal (when the CGW is behind NAT) → maps to real-world complexity
  • Troubleshooting tunnels (why isn’t it coming up?) → maps to practical debugging

Key Concepts:

  • Site-to-Site VPN: AWS VPN Documentation (https://docs.aws.amazon.com/vpn/latest/s2svpn/VPC_VPN.html)
  • IPsec Fundamentals: “Computer Networks, Fifth Edition” Chapter 8 - Tanenbaum
  • BGP Basics: AWS BGP Routing (https://docs.aws.amazon.com/vpn/latest/s2svpn/cgw-options.html)
  • strongSwan Configuration: strongSwan AWS Guide (https://docs.strongswan.org/docs/5.9/interop/aws.html)

Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: Basic networking, Linux administration

Real world outcome:

$ terraform apply

AWS Resources Created:
  VPN Gateway: vgw-0abc123 (attached to VPC)
  Customer Gateway: cgw-0def456 (simulated on-prem)
  VPN Connection: vpn-0ghi789 (2 tunnels)
    Tunnel 1: 169.254.10.1/30 (AWS) ↔ 169.254.10.2/30 (On-prem)
    Tunnel 2: 169.254.11.1/30 (AWS) ↔ 169.254.11.2/30 (On-prem)

On-Premises Simulation (EC2 in different VPC):
  strongSwan instance: i-onprem123
  Public IP: 54.x.x.x (Customer Gateway IP)
  Internal network: 192.168.0.0/16 (simulated corporate LAN)

# Check VPN status
$ aws ec2 describe-vpn-connections --vpn-connection-ids vpn-0ghi789 \
    --query 'VpnConnections[0].VgwTelemetry'

[
  {
    "OutsideIpAddress": "52.1.2.3",
    "Status": "UP",
    "StatusMessage": "2 BGP ROUTES",
    "AcceptedRouteCount": 2
  },
  {
    "OutsideIpAddress": "52.4.5.6",
    "Status": "UP",
    "StatusMessage": "2 BGP ROUTES",
    "AcceptedRouteCount": 2
  }
]

# Test connectivity
$ ssh on-prem-server
[on-prem]$ ping 10.0.1.50  # AWS private IP
PING 10.0.1.50: 64 bytes from 10.0.1.50: icmp_seq=1 ttl=64 time=15.2 ms
PING 10.0.1.50: 64 bytes from 10.0.1.50: icmp_seq=2 ttl=64 time=14.8 ms

[on-prem]$ traceroute 10.0.1.50
 1  192.168.1.1 (gateway)        0.5 ms
 2  169.254.10.1 (AWS tunnel)    15.1 ms  # Through IPsec tunnel!
 3  10.0.1.50 (AWS instance)     15.3 ms

# Failover test
$ ssh on-prem-server
[on-prem]$ sudo ipsec down aws-tunnel-1
[on-prem]$ ping 10.0.1.50
# Brief interruption, then traffic flows through tunnel-2
PING 10.0.1.50: 64 bytes from 10.0.1.50: icmp_seq=5 ttl=64 time=16.1 ms

Implementation Hints: The strongSwan configuration on the simulated on-prem server:

# /etc/ipsec.conf (simplified)
conn aws-tunnel-1
    auto=start
    type=tunnel
    authby=secret
    left=%defaultroute
    leftid=54.x.x.x              # On-prem public IP
    leftsubnet=192.168.0.0/16    # On-prem network
    right=52.1.2.3               # AWS VPN endpoint 1
    rightsubnet=10.0.0.0/16      # AWS VPC CIDR
    ike=aes256-sha256-modp2048
    esp=aes256-sha256
    keyexchange=ikev2
    ikelifetime=8h
    lifetime=1h
    dpdaction=restart
    dpddelay=10s

conn aws-tunnel-2
    # Similar config for second tunnel
    right=52.4.5.6               # AWS VPN endpoint 2

BGP configuration for dynamic routing:

# Using quagga/FRR for BGP
router bgp 65001
  neighbor 169.254.10.1 remote-as 64512  # AWS ASN
  neighbor 169.254.11.1 remote-as 64512
  network 192.168.0.0/16                  # Advertise on-prem network
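
Each tunnel’s inside addresses come from a /30 within 169.254.0.0/16: AWS takes the first usable host and the customer gateway the second. Python’s `ipaddress` module can derive both (CIDRs taken from the example above):

```python
import ipaddress

# Inside-tunnel /30s from the example configuration above.
for cidr in ("169.254.10.0/30", "169.254.11.0/30"):
    # A /30 has exactly two usable hosts; unpacking enforces that.
    aws_side, onprem_side = ipaddress.ip_network(cidr).hosts()
    print(f"{cidr}: AWS inside IP {aws_side}, on-prem inside IP {onprem_side}")
```

These are the addresses the BGP neighbors above peer over, which is why `neighbor 169.254.10.1` appears in the on-prem router config.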

Key troubleshooting steps:

# Check IPsec status
sudo ipsec statusall

# Check BGP sessions
sudo vtysh -c "show ip bgp summary"

# Check routes learned from AWS
ip route show | grep 10.0

# Debug IPsec
sudo ipsec stroke loglevel ike 3
sudo tail -f /var/log/syslog | grep charon

Learning milestones:

  1. Both tunnels establish → You understand IPsec negotiation
  2. BGP routes are exchanged → You understand dynamic routing
  3. Traffic flows through VPN → You can verify connectivity
  4. Failover works when one tunnel dies → You understand redundancy

Project 8: AWS Network Firewall Deployment

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: AWS CDK, CloudFormation
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Network Security
  • Software or Tool: AWS Network Firewall
  • Main Book: “Network Security Monitoring” by Richard Bejtlich

What you’ll build: A network inspection architecture using AWS Network Firewall with domain filtering, IDS/IPS rules, and logging—positioned to inspect all egress traffic.

Why it teaches AWS networking: AWS Network Firewall provides VPC-perimeter inspection that Security Groups and NACLs can’t do. Understanding where to place it, how to route traffic through it, and how to write rules teaches deep network security.

Core challenges you’ll face:

  • Firewall subnet placement (dedicated subnets, routing) → maps to inspection architecture
  • Routing to the firewall (ingress routing, gateway route tables) → maps to traffic flow design
  • Rule group design (stateful vs stateless, domain filtering) → maps to firewall policy
  • Logging and alerting (what to log, where to send) → maps to security monitoring
  • Understanding Suricata rules (IDS/IPS syntax) → maps to threat detection

Key Concepts:

Difficulty: Advanced Time estimate: 2 weeks Prerequisites: VPC routing, basic firewall concepts

Real world outcome:

$ terraform apply

Resources Created:
  Network Firewall: nfw-0abc123
  Firewall Policy: policy-egress-inspection
  Rule Groups:
    - Stateless: Allow/Drop by IP/Port
    - Stateful: Domain filtering, IDS/IPS

Architecture:
  ┌─────────────────────────────────────────────────────────┐
  │                        VPC                               │
  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
  │  │ Private     │    │ Firewall    │    │ Public      │  │
  │  │ Subnet      │───▶│ Subnet      │───▶│ Subnet      │  │
  │  │ (App)       │    │ (NFW)       │    │ (NAT)       │  │
  │  └─────────────┘    └─────────────┘    └─────────────┘  │
  │       10.0.1.0/24      10.0.100.0/24      10.0.0.0/24   │
  │                             │                     │      │
  │                       ┌─────┴─────┐         ┌────┴────┐ │
  │                       │  Network  │         │   NAT   │ │
  │                       │  Firewall │         │ Gateway │ │
  │                       └───────────┘         └─────────┘ │
  └─────────────────────────────────────────────────────────┘

# Test domain filtering
$ ssh app-server
[app]$ curl https://example.com
HTTP/1.1 200 OK  # Allowed

[app]$ curl https://malware-c2-server.evil
curl: (Connection refused)  # BLOCKED by Network Firewall!

# Check firewall logs
$ aws logs filter-log-events --log-group-name /aws/network-firewall/alert

{
  "event_type": "alert",
  "alert": {
    "action": "blocked",
    "signature": "ET MALWARE Known C2 Domain",
    "src_ip": "10.0.1.50",
    "dest_ip": "185.143.x.x",
    "dest_port": 443
  }
}

# Firewall metrics
$ ./nfw-monitor

╔════════════════════════════════════════════════════════════╗
║         NETWORK FIREWALL METRICS (Last Hour)               ║
╠════════════════════════════════════════════════════════════╣
║                                                            ║
║  Traffic Processed: 2.3 GB                                 ║
║  Packets Inspected: 4,567,890                              ║
║                                                            ║
║  Actions:                                                  ║
║    ✓ Passed: 4,560,123 (99.8%)                             ║
║    ✗ Dropped: 7,767 (0.2%)                                 ║
║                                                            ║
║  Top Blocked:                                              ║
║    1. Malware C2 domains: 234 connections                  ║
║    2. Crypto mining pools: 156 connections                 ║
║    3. Known bad IPs: 89 connections                        ║
║                                                            ║
║  IDS/IPS Alerts: 47                                        ║
║    - SQL injection attempts: 12                            ║
║    - SSH brute force: 35                                   ║
║                                                            ║
╚════════════════════════════════════════════════════════════╝

Implementation Hints: The critical routing for egress inspection:

# Traffic flow: App → NFW → NAT → Internet
# Return flow: Internet → NAT → NFW → App

# Private subnet route table (where apps live)
resource "aws_route" "private_to_firewall" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  vpc_endpoint_id        = aws_networkfirewall_firewall.main.firewall_status[0].sync_states[0].attachment[0].endpoint_id
}

# Firewall subnet route table
resource "aws_route" "firewall_to_nat" {
  route_table_id         = aws_route_table.firewall.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.main.id
}

# NAT Gateway subnet needs INGRESS routing back through firewall
resource "aws_route" "nat_to_firewall" {
  route_table_id         = aws_route_table.public.id
  destination_cidr_block = "10.0.1.0/24"  # Private subnet
  vpc_endpoint_id        = aws_networkfirewall_firewall.main.firewall_status[0].sync_states[0].attachment[0].endpoint_id
}

Example firewall rules:

# Domain filtering (block malware C2)
resource "aws_networkfirewall_rule_group" "domain_block" {
  capacity = 100
  name     = "block-malicious-domains"
  type     = "STATEFUL"

  rule_group {
    rule_variables {
      ip_sets {
        key = "HOME_NET"
        ip_set { definition = ["10.0.0.0/16"] }
      }
    }
    rules_source {
      rules_source_list {
        generated_rules_type = "DENYLIST"
        target_types         = ["TLS_SNI", "HTTP_HOST"]
        targets              = [".evil.com", ".malware.net", ".c2server.xyz"]
      }
    }
  }
}

# IDS/IPS with Suricata rules
resource "aws_networkfirewall_rule_group" "ids" {
  capacity = 500
  name     = "ids-rules"
  type     = "STATEFUL"

  rule_group {
    rules_source {
      rules_string = <<EOF
alert tcp any any -> any 22 (msg:"SSH brute force attempt"; flow:to_server; threshold:type both,track by_src,count 5,seconds 60; sid:1000001; rev:1;)
drop tcp any any -> any any (msg:"SQL injection"; content:"UNION SELECT"; nocase; sid:1000002; rev:1;)
EOF
    }
  }
}
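
To close the loop on logging, a small parser over alert-log events can surface the top blocked signatures. The sample lines below are synthetic, shaped like the alert JSON shown earlier in this section:

```python
import json
from collections import Counter

# Synthetic alert-log lines, shaped like the Network Firewall alert shown above.
LOG_LINES = [
    '{"event_type": "alert", "alert": {"action": "blocked", "signature": "ET MALWARE Known C2 Domain"}}',
    '{"event_type": "alert", "alert": {"action": "blocked", "signature": "SSH brute force attempt"}}',
    '{"event_type": "alert", "alert": {"action": "allowed", "signature": "SSH brute force attempt"}}',
]

def top_blocked(lines):
    """Count blocked alerts by signature, most common first."""
    alerts = (json.loads(line)["alert"] for line in lines)
    return Counter(a["signature"] for a in alerts if a["action"] == "blocked")

print(top_blocked(LOG_LINES).most_common())
```

In practice you would feed this the events returned by `aws logs filter-log-events` rather than hard-coded samples.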

Learning milestones:

  1. Firewall inspects all egress traffic → You understand routing through NFW
  2. Domain blocking works → You understand stateful rules
  3. IDS alerts fire correctly → You understand Suricata
  4. Return traffic routes correctly → You understand symmetric routing

Project 9: Global Accelerator & CloudFront Edge Networking

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: AWS CDK, CloudFormation
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: CDN / Edge Networking
  • Software or Tool: AWS Global Accelerator, CloudFront
  • Main Book: “High Performance Browser Networking” by Ilya Grigorik

What you’ll build: A global application with CloudFront for static content, Global Accelerator for dynamic content, and latency-based routing—with performance benchmarks showing the improvement.

Why it teaches AWS networking: Understanding how traffic enters AWS at the edge, traverses the AWS backbone, and reaches your origin teaches you about Anycast, PoPs, and why edge networking matters for performance.

Core challenges you’ll face:

  • CloudFront distribution setup (origins, behaviors, caching) → maps to CDN architecture
  • Global Accelerator configuration (endpoints, health checks) → maps to Anycast networking
  • Understanding when to use which (static vs dynamic, TCP vs HTTP) → maps to architectural decisions
  • Measuring performance improvement (latency from different regions) → maps to performance testing

Key Concepts:

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Basic DNS, HTTP understanding

Real world outcome:

$ terraform apply

CloudFront Distribution: d123abc.cloudfront.net
  Origin: alb-us-east-1.example.com
  Behaviors:
    /static/* → S3 origin (cached 1 year)
    /api/*    → ALB origin (no cache)
    /*        → ALB origin (cache 5 min)

Global Accelerator: a1b2c3d4.awsglobalaccelerator.com
  Listener: TCP 443
  Endpoint Groups:
    - us-east-1: alb-east (weight 100)
    - eu-west-1: alb-west (weight 100)
  Health Checks: /health every 10s

# Performance comparison from different locations
$ ./edge-benchmark

╔════════════════════════════════════════════════════════════════╗
║              EDGE NETWORKING PERFORMANCE COMPARISON            ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║  TEST: HTTPS request to API endpoint                           ║
║  Origin: us-east-1 (N. Virginia)                               ║
║                                                                ║
║  FROM NEW YORK (same region as origin)                         ║
║  ─────────────────────────────────────────────────────────────║
║  Direct to ALB:       45ms                                     ║
║  Via CloudFront:      42ms (-7%)                               ║
║  Via Global Accel:    38ms (-16%)                              ║
║                                                                ║
║  FROM LONDON (eu-west-1)                                       ║
║  ─────────────────────────────────────────────────────────────║
║  Direct to ALB:       125ms (transatlantic internet)           ║
║  Via CloudFront:      48ms (-62%) ← enters AWS at London PoP  ║
║  Via Global Accel:    44ms (-65%) ← uses AWS backbone         ║
║                                                                ║
║  FROM TOKYO (ap-northeast-1)                                   ║
║  ─────────────────────────────────────────────────────────────║
║  Direct to ALB:       210ms (Pacific + US internet)            ║
║  Via CloudFront:      65ms (-69%) ← Tokyo PoP + AWS backbone  ║
║  Via Global Accel:    58ms (-72%) ← Anycast to nearest PoP    ║
║                                                                ║
║  STATIC CONTENT (S3 via CloudFront)                            ║
║  ─────────────────────────────────────────────────────────────║
║  First request:       85ms (cache miss, fetch from S3)         ║
║  Cached requests:     12ms (served from edge)                  ║
║                                                                ║
╚════════════════════════════════════════════════════════════════╝

WHY THE IMPROVEMENT?
1. Traffic enters AWS at nearest edge location (400+ PoPs)
2. Traverses AWS private backbone (not public internet)
3. CloudFront caches responses at edge
4. Global Accelerator uses Anycast for optimal routing
5. TCP optimization (connection reuse, fast retransmits)
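
The percentages in the table reduce to a single formula; a quick check using the figures from the benchmark above:

```python
def latency_reduction(direct_ms: float, edge_ms: float) -> float:
    """Percent latency reduction vs going direct over the public internet."""
    return (direct_ms - edge_ms) / direct_ms * 100

# Figures from the benchmark table above
for city, direct, cloudfront, accelerator in [("London", 125, 48, 44), ("Tokyo", 210, 65, 58)]:
    print(f"{city}: CloudFront -{latency_reduction(direct, cloudfront):.0f}%, "
          f"Global Accelerator -{latency_reduction(direct, accelerator):.0f}%")
```

The farther the client is from the origin, the larger the share of the path that moves off the public internet onto the AWS backbone, which is why Tokyo improves more than London.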

Implementation Hints: Key architectural understanding:

WITHOUT EDGE SERVICES:
  User (Tokyo) → Internet → Origin (us-east-1)
  - Multiple ISP hops, congested peering points
  - TCP cold start on every connection
  - ~200ms latency

WITH CLOUDFRONT:
  User (Tokyo) → Tokyo PoP (5ms) → AWS Backbone → Origin
  - Enters AWS at nearest Point of Presence
  - Uses optimized AWS backbone
  - Connection pooling to origin
  - Caching at edge
  - ~65ms latency

WITH GLOBAL ACCELERATOR:
  User (Tokyo) → Anycast → Nearest PoP → AWS Backbone → Origin
  - Anycast IP routes to geographically closest PoP
  - Static IPs (good for whitelisting)
  - TCP/UDP optimization
  - Health-checked failover
  - ~58ms latency

When to use which:

CLOUDFRONT:
  - HTTP/HTTPS traffic
  - Cacheable content
  - Complex routing rules (path-based, header-based)
  - Lambda@Edge for edge compute

GLOBAL ACCELERATOR:
  - TCP/UDP traffic (not just HTTP)
  - Non-cacheable dynamic content
  - Need static Anycast IPs
  - Multi-region active-active with health checks
  - Gaming, VoIP, financial applications

Learning milestones:

  1. CloudFront serves static content from edge → You understand CDN caching
  2. API latency improves for distant users → You understand backbone routing
  3. Global Accelerator failover works → You understand health-checked routing
  4. You can explain when to use each → You can make architectural decisions

Project 10: VPC Lattice for Service-to-Service Networking

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: AWS CDK
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Service Mesh / Application Networking
  • Software or Tool: AWS VPC Lattice
  • Main Book: “Building Microservices, 2nd Edition” by Sam Newman

What you’ll build: A service mesh using VPC Lattice that enables service-to-service communication across VPCs and accounts with built-in authentication, authorization, and observability—without managing sidecars or proxies.

Why it teaches AWS networking: VPC Lattice represents AWS’s vision for application-layer networking. It abstracts away traditional network complexity (VPC peering, Transit Gateway for service communication) and provides L7 features like path-based routing and IAM authentication.

Core challenges you’ll face:

  • Service network creation (the namespace for services) → maps to service mesh concepts
  • Service association (connecting Lambda, ECS, EC2 to the mesh) → maps to compute abstraction
  • Auth policies (IAM-based service-to-service auth) → maps to zero-trust
  • Cross-account service sharing → maps to multi-account patterns
  • Traffic management (weighted routing, path-based) → maps to L7 load balancing

Key Concepts:

  • VPC Lattice: AWS VPC Lattice Documentation
  • Service Mesh Concepts: “Building Microservices, 2nd Edition” Chapter 4 - Sam Newman
  • Zero-Trust Networking: NIST Zero Trust Architecture (SP 800-207)

Difficulty: Advanced · Time estimate: 2 weeks · Prerequisites: VPC fundamentals, IAM, microservices basics

Real world outcome:

$ terraform apply

VPC Lattice Service Network: my-service-network
  Associated VPCs: [vpc-frontend, vpc-backend, vpc-data]

Services:
  - frontend-service (Lambda function)
    URL: https://frontend-svc-abc123.vpc-lattice-svcs.us-east-1.on.aws
  - orders-service (ECS Fargate)
    URL: https://orders-svc-def456.vpc-lattice-svcs.us-east-1.on.aws
  - inventory-service (EC2 instances)
    URL: https://inventory-svc-ghi789.vpc-lattice-svcs.us-east-1.on.aws

Auth Policies:
  - frontend-service → can call → orders-service
  - orders-service → can call → inventory-service
  - frontend-service → CANNOT call → inventory-service (least privilege!)

# Test service-to-service call
$ ssh frontend-host
[frontend]$ curl -H "Authorization: $IAM_SIG" \
    https://orders-svc-def456.vpc-lattice-svcs.us-east-1.on.aws/orders

{"orders": [...]}  # Allowed!

[frontend]$ curl -H "Authorization: $IAM_SIG" \
    https://inventory-svc-ghi789.vpc-lattice-svcs.us-east-1.on.aws/inventory

403 Forbidden: frontend-service is not authorized to access inventory-service

# Traffic split for canary deployment
$ terraform apply -var="orders_v2_weight=10"

Traffic routing for orders-service:
  v1 (current): 90%
  v2 (canary):  10%

# Observe metrics
$ aws cloudwatch get-metric-statistics \
    --namespace AWS/VPCLattice \
    --metric-name RequestCount

orders-service metrics:
  Total requests: 10,000
  Target v1: 9,012 (90.1%)
  Target v2: 988 (9.9%)
  4xx errors v1: 45 (0.5%)
  4xx errors v2: 12 (1.2%)
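
The split and error rates in the metrics above are simple ratios; here is a sketch of a promote/rollback check over them, with an illustrative threshold (not an AWS-recommended value):

```python
# Sketch: deciding whether a canary is healthy, using the numbers from the
# metrics above. The max_error_ratio threshold is illustrative.
def canary_verdict(v1_reqs, v1_errors, v2_reqs, v2_errors,
                   max_error_ratio=3.0):
    v1_rate = v1_errors / v1_reqs
    v2_rate = v2_errors / v2_reqs
    # Roll back if the canary's error rate is far worse than the baseline's.
    return "promote" if v2_rate <= v1_rate * max_error_ratio else "rollback"

# v1: 45 errors / 9,012 requests (~0.5%); v2: 12 / 988 (~1.2%)
print(canary_verdict(9012, 45, 988, 12))   # 1.2% <= 3 * 0.5% -> "promote"
```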

Implementation Hints: VPC Lattice architecture:

                    ┌─────────────────────────────┐
                    │     Service Network         │
                    │   "my-service-network"      │
                    └─────────────────────────────┘
                              │
          ┌───────────────────┼───────────────────┐
          │                   │                   │
    ┌─────▼─────┐      ┌─────▼─────┐      ┌─────▼─────┐
    │ frontend  │      │ orders    │      │ inventory │
    │ service   │─────▶│ service   │─────▶│ service   │
    │ (Lambda)  │      │ (ECS)     │      │ (EC2)     │
    └───────────┘      └───────────┘      └───────────┘
         │                  │                   │
    ┌────▼────┐       ┌────▼────┐        ┌────▼────┐
    │  VPC A  │       │  VPC B  │        │  VPC C  │
    └─────────┘       └─────────┘        └─────────┘

Key insight: NO VPC peering or Transit Gateway needed!
VPC Lattice provides overlay networking between services.

IAM-based service auth:

resource "aws_vpclattice_auth_policy" "orders_policy" {
  resource_identifier = aws_vpclattice_service.orders.arn

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = "*"
        Action = "vpc-lattice-svcs:Invoke"
        Resource = "*"
        Condition = {
          StringEquals = {
            "vpc-lattice-svcs:SourceVpcOwnerAccount" = var.frontend_account_id
          }
          "ForAnyValue:StringLike" = {
            "aws:PrincipalServiceName" = "lambda.amazonaws.com"
          }
        }
      }
    ]
  })
}
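
To build intuition for what this policy evaluates, here is a toy model of the allow/deny decision. It is a drastic simplification of real IAM evaluation (only one condition operator, no explicit Deny); the key and account values mirror the policy above:

```python
# Toy model of the Lattice auth decision: allow the call only when every
# StringEquals condition on the statement matches the request context.
# A simplification of IAM evaluation, for intuition only.
def evaluate_auth(statement: dict, request_context: dict) -> bool:
    conds = statement.get("Condition", {}).get("StringEquals", {})
    for key, expected in conds.items():
        if request_context.get(key) != expected:
            return False
    return statement.get("Effect") == "Allow"

policy_stmt = {
    "Effect": "Allow",
    "Action": "vpc-lattice-svcs:Invoke",
    "Condition": {"StringEquals": {
        "vpc-lattice-svcs:SourceVpcOwnerAccount": "111111111111"}},
}

# A call from the allowed source account succeeds ...
print(evaluate_auth(policy_stmt,
      {"vpc-lattice-svcs:SourceVpcOwnerAccount": "111111111111"}))  # True
# ... a call from any other account is implicitly denied.
print(evaluate_auth(policy_stmt,
      {"vpc-lattice-svcs:SourceVpcOwnerAccount": "222222222222"}))  # False
```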

Why VPC Lattice matters:

BEFORE (with Transit Gateway + ALBs):
- Create TGW, attach VPCs
- Create ALB in each VPC
- Manage Security Groups between VPCs
- Build custom auth (mTLS, JWT, etc.)
- Set up monitoring per service
- Total: 50+ resources, complex routing

AFTER (with VPC Lattice):
- Create service network
- Register services
- Define auth policies (IAM-native!)
- Built-in metrics and access logs
- Total: ~10 resources, declarative

Learning milestones:

  1. Services communicate across VPCs → You understand Lattice overlay
  2. IAM policies restrict access → You understand zero-trust auth
  3. Traffic splitting works → You understand L7 capabilities
  4. Can monitor service-to-service calls → You understand observability

Project 11: PrivateLink Service Provider

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: AWS CDK, CloudFormation
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Private Networking / SaaS Architecture
  • Software or Tool: AWS PrivateLink, Network Load Balancer
  • Main Book: “AWS Certified Advanced Networking Study Guide”

What you’ll build: A PrivateLink service that exposes your application to other AWS accounts privately—the same way AWS exposes services like S3, DynamoDB. Customers connect via Interface Endpoints without any public internet exposure.

Why it teaches AWS networking: PrivateLink is how you build “AWS-native” SaaS. Understanding how to become a service provider (not just consumer) teaches you about NLBs, endpoint services, cross-account networking, and how AWS builds its own services.

Core challenges you’ll face:

  • NLB as PrivateLink backend (why NLB, not ALB?) → maps to PrivateLink architecture
  • Cross-account access (allowlisting consumer accounts) → maps to multi-tenant SaaS
  • DNS for endpoints (private hosted zones, endpoint DNS) → maps to private DNS
  • Connection acceptance (auto vs manual) → maps to security controls
  • High availability (multi-AZ endpoints) → maps to reliability

Key Concepts:

  • AWS PrivateLink: AWS PrivateLink Documentation
Difficulty: Expert · Time estimate: 2 weeks · Prerequisites: NLB, VPC Endpoints, cross-account IAM

Real world outcome:

$ terraform apply -target=module.provider

PROVIDER ACCOUNT (your SaaS):
  VPC: vpc-provider
  NLB: nlb-myservice (internal)
    Listeners: 443 → target-group (your app)
  Endpoint Service: vpce-svc-abc123
    Service Name: com.amazonaws.vpce.us-east-1.vpce-svc-abc123
    Allowed Principals: [arn:aws:iam::CUSTOMER_ACCOUNT:root]

$ terraform apply -target=module.consumer

CONSUMER ACCOUNT (customer):
  VPC: vpc-consumer
  Interface Endpoint: vpce-def456
    Service: com.amazonaws.vpce.us-east-1.vpce-svc-abc123
    Subnets: [subnet-a, subnet-b]
    DNS: myservice.vpce.local → 10.1.1.100, 10.1.2.100

# From customer's VPC
$ ssh customer-server
[customer]$ nslookup myservice.vpce.local
  10.1.1.100  (ENI in subnet-a)
  10.1.2.100  (ENI in subnet-b)

[customer]$ curl https://myservice.vpce.local/api/health
{"status": "healthy", "provider": "MyService SaaS"}

# Traffic never touches internet!
[customer]$ traceroute myservice.vpce.local
  1  10.1.1.100 (vpce ENI)  0.5ms  # Direct to PrivateLink ENI
  # No internet hops - traffic stays in AWS backbone

# Provider sees customer connection
$ aws ec2 describe-vpc-endpoint-connections

{
  "VpcEndpointConnections": [{
    "VpcEndpointId": "vpce-def456",
    "VpcEndpointOwner": "CUSTOMER_ACCOUNT",
    "VpcEndpointState": "available",
    "CreationTimestamp": "2024-01-15T10:30:00Z"
  }]
}

Implementation Hints: PrivateLink architecture (provider side):

PROVIDER VPC
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────┐ │
│  │ Your App    │◀───│ Target      │◀───│ Network Load    │ │
│  │ (ECS/EC2)   │    │ Group       │    │ Balancer        │ │
│  └─────────────┘    └─────────────┘    └────────┬────────┘ │
│        10.0.1.x                                  │          │
│                                                  │          │
│                              ┌───────────────────▼────────┐ │
│                              │ VPC Endpoint Service       │ │
│                              │ vpce-svc-abc123            │ │
│                              │                            │ │
│                              │ Allowed: CUSTOMER_ACCOUNT  │ │
│                              └────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
                                        │
                                        │ AWS PrivateLink
                                        │ (private connection)
                                        │
CONSUMER VPC                            ▼
┌─────────────────────────────────────────────────────────────┐
│  ┌─────────────────┐    ┌────────────────────────────────┐  │
│  │ Interface       │───▶│ Customer App                   │  │
│  │ Endpoint        │    │ curl myservice.vpce.local      │  │
│  │ vpce-def456     │    └────────────────────────────────┘  │
│  │ 10.1.1.100      │                                        │
│  └─────────────────┘                                        │
└─────────────────────────────────────────────────────────────┘

Key Terraform resources:

# Provider side
resource "aws_vpc_endpoint_service" "myservice" {
  acceptance_required        = false  # or true for manual approval
  network_load_balancer_arns = [aws_lb.internal.arn]

  allowed_principals = [
    "arn:aws:iam::CUSTOMER_ACCOUNT:root"
  ]
}

# Consumer side
resource "aws_vpc_endpoint" "myservice" {
  vpc_id              = aws_vpc.consumer.id
  service_name        = "com.amazonaws.vpce.us-east-1.vpce-svc-abc123"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.consumer_a.id, aws_subnet.consumer_b.id]
  security_group_ids  = [aws_security_group.endpoint.id]

  private_dns_enabled = false  # Use custom DNS instead
}

# Custom DNS in consumer VPC
resource "aws_route53_zone" "private" {
  name = "vpce.local"
  vpc {
    vpc_id = aws_vpc.consumer.id
  }
}

resource "aws_route53_record" "myservice" {
  zone_id = aws_route53_zone.private.zone_id
  name    = "myservice.vpce.local"
  type    = "A"

  alias {
    name                   = aws_vpc_endpoint.myservice.dns_entry[0].dns_name
    zone_id                = aws_vpc_endpoint.myservice.dns_entry[0].hosted_zone_id
    evaluate_target_health = true
  }
}
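
The acceptance flow these resources configure (allowlist check, then auto-accept or manual approval) can be sketched as a toy state function. The state names mirror real VPC endpoint states; the rest is illustrative:

```python
# Toy model of PrivateLink connection acceptance: a consumer's endpoint
# request is rejected unless its principal is allowlisted, then either
# auto-accepted or left pending for manual approval. Illustrative only.
def connection_state(consumer_arn: str, allowed_principals: list,
                     acceptance_required: bool) -> str:
    if consumer_arn not in allowed_principals:
        return "rejected"
    return "pendingAcceptance" if acceptance_required else "available"

allowed = ["arn:aws:iam::222222222222:root"]
print(connection_state("arn:aws:iam::222222222222:root", allowed, False))
# auto-accept -> "available"
print(connection_state("arn:aws:iam::333333333333:root", allowed, False))
# not allowlisted -> "rejected"
```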

Why NLB (not ALB)?

PrivateLink endpoint services require NLB because:
1. Layer 4 (TCP/UDP) pass-through, so any TCP protocol works (not just HTTP)
2. Static IPs per AZ (stable endpoints)
3. Extremely high performance at low latency
4. Client IP is recoverable via Proxy Protocol v2 (over PrivateLink, traffic otherwise arrives with the endpoint ENI as its source)

If you need HTTP features (path routing, headers):
  NLB → ALB → App (NLB fronts PrivateLink, ALB provides L7)

Learning milestones:

  1. Endpoint service is created → You understand the provider side
  2. Consumer connects via Interface Endpoint → You understand cross-account
  3. Traffic is private (no internet route) → You verify PrivateLink privacy
  4. Multiple consumers can connect → You understand multi-tenant SaaS

Project 12: Complete Multi-Region Network Architecture (Capstone)

  • File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
  • Main Programming Language: Terraform
  • Alternative Programming Languages: Pulumi, AWS CDK
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Enterprise Architecture / Global Networking
  • Software or Tool: All AWS Networking Services
  • Main Book: “AWS Certified Solutions Architect Professional Study Guide”

What you’ll build: A production-grade, multi-region, multi-account network with Transit Gateway peering, centralized egress, Network Firewall, hybrid connectivity, and complete observability—the kind of network that runs Fortune 500 companies.

Why it teaches AWS networking: This is the synthesis of everything. You’ll face real architectural decisions: where to place firewalls, how to handle cross-region traffic, when to use which connectivity option, and how to make it all observable and maintainable.

Core challenges you’ll face:

  • Multi-region Transit Gateway peering → maps to global networking
  • Centralized egress per region → maps to inspection architecture
  • Cross-region data transfer optimization → maps to cost optimization
  • Failover and disaster recovery → maps to reliability
  • Unified monitoring and logging → maps to observability at scale

Key Concepts:

  • Multi-VPC Design: “Building a Scalable and Secure Multi-VPC AWS Network Infrastructure” (AWS whitepaper)
Difficulty: Master · Time estimate: 1-2 months · Prerequisites: All previous projects

Real world outcome:

╔══════════════════════════════════════════════════════════════════════════╗
║                    ENTERPRISE MULTI-REGION NETWORK                        ║
╠══════════════════════════════════════════════════════════════════════════╣

                           US-EAST-1                    EU-WEST-1
                    ┌─────────────────────┐      ┌─────────────────────┐
                    │                     │      │                     │
   ON-PREMISES      │  ┌───────────────┐  │      │  ┌───────────────┐  │
   ┌──────────┐     │  │ Transit       │  │      │  │ Transit       │  │
   │ Data     │─VPN─┼──│ Gateway       │◀─┼─PEER─┼─▶│ Gateway       │  │
   │ Center   │     │  │ (us-east-1)   │  │      │  │ (eu-west-1)   │  │
   └──────────┘     │  └───────┬───────┘  │      │  └───────┬───────┘  │
                    │          │          │      │          │          │
                    │  ┌───────┴───────┐  │      │  ┌───────┴───────┐  │
                    │  │               │  │      │  │               │  │
                    │  ▼               ▼  │      │  ▼               ▼  │
              ┌─────────────┐   ┌─────────────┐ ┌─────────────┐  ┌─────────────┐
              │ Shared Svcs │   │ Production  │ │ Shared Svcs │  │ Production  │
              │ VPC         │   │ VPC         │ │ VPC         │  │ VPC         │
              │ ┌─────────┐ │   │             │ │ ┌─────────┐ │  │             │
              │ │ NFW     │ │   │ ┌─────────┐ │ │ │ NFW     │ │  │ ┌─────────┐ │
              │ │ NAT     │ │   │ │ App     │ │ │ │ NAT     │ │  │ │ App     │ │
              │ │ Endpoint│ │   │ │ Cluster │ │ │ │ Endpoint│ │  │ │ Cluster │ │
              │ └─────────┘ │   │ └─────────┘ │ │ └─────────┘ │  │ └─────────┘ │
              └─────────────┘   └─────────────┘ └─────────────┘  └─────────────┘
                    │                     │      │                     │
                    └─────────────────────┘      └─────────────────────┘

ACCOUNTS:
  - Network Account: Transit Gateways, Shared Services VPCs, Firewalls
  - Security Account: GuardDuty, Security Hub, Flow Logs aggregation
  - Production Account: Application workloads
  - Development Account: Dev/test workloads (isolated)

TRAFFIC FLOWS:
  ┌────────────────────────────────────────────────────────────────────────┐
  │ App (Prod US) → TGW → Shared Services VPC → NFW → NAT → Internet      │
  │ App (Prod EU) → App (Prod US): TGW EU → TGW Peering → TGW US → App    │
  │ On-Prem → AWS: VPN → TGW US → Shared Services → Production VPC        │
  └────────────────────────────────────────────────────────────────────────┘

$ ./network-dashboard

╔════════════════════════════════════════════════════════════════════════╗
║                    GLOBAL NETWORK HEALTH DASHBOARD                     ║
╠════════════════════════════════════════════════════════════════════════╣
║                                                                        ║
║  REGIONS                                                               ║
║  ─────────────────────────────────────────────────────────────────────║
║  us-east-1: ✓ Healthy                                                  ║
║    - Transit Gateway: 12 attachments, 4.2 Gbps throughput              ║
║    - Network Firewall: 156K flows/min, 23 blocks                       ║
║    - VPN Tunnels: 2/2 UP (BGP routes: 15)                              ║
║                                                                        ║
║  eu-west-1: ✓ Healthy                                                  ║
║    - Transit Gateway: 8 attachments, 2.1 Gbps throughput               ║
║    - Network Firewall: 89K flows/min, 12 blocks                        ║
║    - TGW Peering: UP (latency: 72ms to us-east-1)                      ║
║                                                                        ║
║  CROSS-REGION TRAFFIC (Last 24h)                                       ║
║  ─────────────────────────────────────────────────────────────────────║
║  us-east-1 ↔ eu-west-1: 1.2 TB                                         ║
║  Cost: $24.00 (at $0.02/GB)                                            ║
║                                                                        ║
║  ALERTS                                                                ║
║  ─────────────────────────────────────────────────────────────────────║
║  ⚠️  VPN Tunnel 1 latency spike: 180ms → 95ms (recovered)              ║
║  ⚠️  Network Firewall blocked crypto miner: 10.1.2.50 → pool.mining.com║
║                                                                        ║
║  COMPLIANCE                                                            ║
║  ─────────────────────────────────────────────────────────────────────║
║  ✓ All flow logs enabled and shipped to Security Account               ║
║  ✓ No direct internet access from production VPCs                      ║
║  ✓ All cross-account traffic via Transit Gateway (auditable)           ║
║  ✓ Network Firewall IDS rules: 2,456 signatures active                 ║
║                                                                        ║
╚════════════════════════════════════════════════════════════════════════╝
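
The dashboard's cross-region cost line is plain arithmetic; a minimal sketch (the $0.02/GB rate is the figure used above, not a quoted price — check current AWS pricing):

```python
# Cross-region data transfer cost, matching the dashboard figure above.
# The $0.02/GB rate is this example's assumption, not current AWS pricing.
def cross_region_cost(terabytes: float, rate_per_gb: float = 0.02) -> float:
    gb = terabytes * 1000          # TB -> GB (decimal units, as billed)
    return round(gb * rate_per_gb, 2)

print(cross_region_cost(1.2))      # 1,200 GB * $0.02 = 24.0
```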

Implementation Hints: This project combines all previous projects. Key architectural decisions:

  1. Transit Gateway per region with peering:
    resource "aws_ec2_transit_gateway_peering_attachment" "us_eu" {
      provider                = aws.us_east_1
      peer_region             = "eu-west-1"
      peer_transit_gateway_id = aws_ec2_transit_gateway.eu.id
      transit_gateway_id      = aws_ec2_transit_gateway.us.id
    }
    
  2. Centralized egress with inspection:
    Traffic: App VPC → TGW → Firewall VPC → NFW → NAT → Internet
    Return:  Internet → NAT → NFW → TGW → App VPC

    All egress is inspected by Network Firewall; all traffic is logged via Flow Logs.

  3. Route table segmentation for security:
    TGW Route Tables:
      • “production”: routes to shared services + prod VPCs, NO dev access
      • “development”: routes to shared services + dev VPCs, NO prod access
      • “shared-services”: routes to all VPCs (hub)

  4. Observability pipeline:
    VPC Flow Logs → CloudWatch Logs → Kinesis Firehose → S3 (Security Account)
    Network Firewall Logs → CloudWatch Logs → Athena/QuickSight
    TGW Flow Logs → CloudWatch Logs → SIEM integration
    

Learning milestones:

  1. Multi-region TGW peering works → You understand global connectivity
  2. All egress flows through central firewall → You understand inspection
  3. Prod/Dev isolation enforced → You understand segmentation
  4. Complete observability in Security Account → You understand enterprise monitoring
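
Milestone 3 above (prod/dev isolation) can be sanity-checked with a toy reachability model of TGW route-table segmentation. VPC and route-table names are illustrative:

```python
# Toy model of TGW route-table segmentation: each attachment is associated
# with one route table and can only reach destinations that table routes to.
# Names are illustrative, matching the segmentation scheme described above.
ROUTE_TABLES = {
    "production":      {"shared-services", "prod-vpc"},
    "development":     {"shared-services", "dev-vpc"},
    "shared-services": {"prod-vpc", "dev-vpc", "shared-services"},
}
ASSOCIATION = {"prod-vpc": "production", "dev-vpc": "development",
               "shared-services": "shared-services"}

def can_reach(src_vpc: str, dst_vpc: str) -> bool:
    return dst_vpc in ROUTE_TABLES[ASSOCIATION[src_vpc]]

print(can_reach("prod-vpc", "shared-services"))  # True
print(can_reach("prod-vpc", "dev-vpc"))          # False: prod/dev isolated
```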

Project Comparison Table

| # | Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---------|------------|------|------------------------|------------|
| 1 | Production VPC from Scratch | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 2 | Security Group Debugger | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 3 | VPC Flow Logs Analyzer | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 4 | VPC Peering vs TGW Lab | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| 5 | NAT Gateway Cost Optimizer | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 6 | Multi-Account Network | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7 | Site-to-Site VPN Lab | Advanced | 1-2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 8 | Network Firewall Deployment | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9 | Global Accelerator & CloudFront | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10 | VPC Lattice Service Mesh | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 11 | PrivateLink Service Provider | Expert | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 12 | Multi-Region Architecture | Master | 1-2 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |

For Beginners (Start Here)

  1. Project 1: Production VPC - The foundation of everything
  2. Project 2: Security Group Debugger - Understand traffic flow
  3. Project 5: NAT Gateway Optimizer - Learn about VPC Endpoints

For Intermediate Cloud Engineers

  1. Project 3: VPC Flow Logs Analyzer - Visibility into your network
  2. Project 4: VPC Peering vs TGW Lab - Key architectural decision
  3. Project 9: Global Accelerator & CloudFront - Edge networking

For Advanced Engineers

  1. Project 7: Site-to-Site VPN - Hybrid connectivity
  2. Project 8: Network Firewall - Perimeter security
  3. Project 6: Multi-Account Network - Enterprise patterns

For Experts

  1. Project 10: VPC Lattice - Modern service networking
  2. Project 11: PrivateLink Provider - SaaS architecture
  3. Project 12: Multi-Region Architecture - Everything combined

Essential Resources

Books

  • “Computer Networks, Fifth Edition” by Tanenbaum - Networking fundamentals
  • “The Practice of Network Security Monitoring” by Richard Bejtlich - Security analysis
  • “High Performance Browser Networking” by Ilya Grigorik - Edge networking

Summary

| # | Project | Main Language |
|---|---------|---------------|
| 1 | Build a Production-Ready VPC from Scratch | Terraform |
| 2 | Security Group Traffic Flow Debugger | Python |
| 3 | VPC Flow Logs Analyzer | Python |
| 4 | VPC Peering vs Transit Gateway Lab | Terraform |
| 5 | NAT Gateway Deep Dive & Cost Optimizer | Python |
| 6 | Multi-Account Network with AWS Organizations | Terraform |
| 7 | Site-to-Site VPN with Simulated On-Premises | Terraform |
| 8 | AWS Network Firewall Deployment | Terraform |
| 9 | Global Accelerator & CloudFront Edge Networking | Terraform |
| 10 | VPC Lattice for Service-to-Service Networking | Terraform |
| 11 | PrivateLink Service Provider | Terraform |
| 12 | Complete Multi-Region Network Architecture | Terraform |

Why This Path Works

By completing these projects, you won’t just “know” AWS networking—you’ll understand it deeply:

  1. You’ll understand isolation (VPCs, subnets, Security Groups)
  2. You’ll understand connectivity (Peering, Transit Gateway, VPN, Direct Connect)
  3. You’ll understand security (NACLs, Network Firewall, PrivateLink)
  4. You’ll understand scale (multi-account, multi-region)
  5. You’ll understand optimization (VPC Endpoints, cost analysis)
  6. You’ll understand observability (Flow Logs, metrics, debugging)

Most importantly, you’ll be able to design networks for real enterprises—the kind that handle millions of requests, maintain security compliance, and scale globally.

Build these projects, and you’ll go from “AWS user” to “AWS network architect.”