AWS NETWORKING DEEP DIVE PROJECTS
AWS Networking Deep Dive: From Zero to Cloud Network Architect
Why This Matters
Every application running in AWS sits on top of a network. If you don’t understand AWS networking, you’re essentially building skyscrapers without understanding the foundation. Misconfigurations in AWS networking are among the most common causes of outages, security breaches, and cost overruns.
The Problems AWS Networking Solves:
- How do you isolate workloads in a shared cloud environment?
- How do resources in the cloud talk to each other securely?
- How do you connect your on-premises data center to AWS?
- How do you control who can access what, and from where?
- How do you build networks that span regions and accounts?
What You’ll Understand After This Learning Path:
- Why VPCs exist and how they provide isolation
- How subnets, route tables, and gateways work together
- The difference between Security Groups and NACLs (and when to use each)
- How to choose between VPC Peering and Transit Gateway
- How hybrid connectivity (VPN, Direct Connect) works
- How to design highly available, multi-region networks
- How to secure traffic at the network perimeter
- How to optimize costs while maintaining security
Goal: AWS Networking Mastery
By the end of this learning path, you will possess a deep, internalized understanding of AWS networking that goes far beyond surface-level knowledge. You won’t just know what a VPC is—you’ll understand why AWS architected it this way, how packets flow through every component, where security boundaries exist, and how to design resilient multi-region networks that can handle real-world complexity.
You’ll be able to:
- Visualize packet flow through complex AWS network topologies in your mind
- Debug network issues by understanding exactly where traffic can and cannot go
- Design secure, cost-effective networks that follow AWS best practices
- Make architectural decisions about VPC Peering vs Transit Gateway vs PrivateLink
- Implement hybrid connectivity that bridges on-premises and cloud seamlessly
- Explain to others why AWS networking works the way it does, not just how to configure it
This isn’t about memorizing console clicks. It’s about building a mental model so complete that AWS networking becomes intuitive.
Concept Internalization Map
This table maps each major concept cluster to what you need to deeply internalize—not just memorize, but truly understand.
| Concept Cluster | What Must Be Internalized |
|---|---|
| VPC Fundamentals | • How AWS implements logical isolation using encapsulation • Why CIDR block selection matters (can’t change it later) • How VPC spans multiple AZs but subnets are AZ-specific • The relationship between VPC, subnets, route tables, and gateways • Default VPC vs custom VPC tradeoffs |
| Subnets & Routing | • Route table longest prefix matching algorithm • Why “public” vs “private” is just routing, not configuration • How packet flow works through route tables step-by-step • NAT Gateway vs NAT Instance tradeoffs • Implicit router at each subnet’s base + 1 (e.g., 10.0.1.1 in 10.0.1.0/24) • Why you lose 5 IPs per subnet (AWS reserved addresses) |
| Security Layers | • Stateful vs stateless packet filtering mechanics • Connection tracking in Security Groups (how it actually works) • Why NACLs need ephemeral port ranges (1024-65535) • Defense in depth: when to use SG, NACL, WAF, Network Firewall • Rule evaluation order (NACL) vs all-rules evaluation (SG) • How to debug “connection refused” vs “timeout” (SG vs NACL/routing) |
| VPC Connectivity | • Peering non-transitivity and why it exists • Transit Gateway routing and route table propagation • When to use Peering vs TGW vs PrivateLink • Cross-region peering limitations • How to avoid IP CIDR overlaps across VPCs • Shared VPC vs multi-VPC strategies |
| Hybrid Networking | • BGP (Border Gateway Protocol) basics for Direct Connect • IPSec tunnel establishment and packet encapsulation • How AWS uses VLAN tagging for Direct Connect virtual interfaces • VPN vs DX cost structure (port hours vs data transfer) • Failover scenarios and routing priority • MACsec encryption for Direct Connect |
| DNS & Service Discovery | • Route 53 Resolver and VPC DNS (.2 address in VPC) • Private Hosted Zones for internal DNS • How enableDnsHostnames and enableDnsSupport work • DNS resolution over VPC peering/TGW • Route 53 Resolver endpoints for hybrid DNS |
| Advanced Patterns | • VPC Endpoints (Gateway vs Interface) • PrivateLink for service exposure without peering • IPv6 dual-stack VPCs • VPC Flow Logs for network monitoring • Network Firewall for perimeter security • Transit Gateway Network Manager for multi-region |
Deep Dive Reading by Concept
Map AWS networking concepts to foundational knowledge in your technical books. AWS networking is built on standard networking principles—these books will give you the “why” behind AWS’s design decisions.
1. VPC & Subnets → Network Fundamentals
Computer Networks, Fifth Edition by Tanenbaum
- Chapter 1: Sections on network layering, internetworking concepts
- Chapter 5: The Network Layer - IP addressing, subnetting, CIDR notation
- Section 5.6: IP Addresses (crucial for understanding VPC CIDR blocks)
- Section 5.6.3: CIDR and route aggregation
- Section 5.6.4: NAT (directly applicable to AWS NAT Gateways)
Why: AWS VPCs implement standard IP networking. Understanding CIDR, subnetting math, and network masks is essential for proper VPC design.
Key Takeaway: When you create a VPC with 10.0.0.0/16, you’re using CIDR notation that comes from the need to efficiently aggregate routes on the Internet. AWS didn’t invent this—they’re using established networking standards.
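To make the CIDR math tangible, here is a minimal Python sketch using only the standard-library ipaddress module (the addresses mirror the 10.0.0.0/16 example above; nothing here is AWS-specific):
# CIDR math with the standard-library ipaddress module
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)                  # 65536 addresses in a /16

# Carve the VPC range into /24 subnets (256 addresses each)
subnets = list(vpc.subnets(new_prefix=24))
print(len(subnets), subnets[1])           # 256 10.0.1.0/24

# Membership test: does an IP fall inside a CIDR block?
print(ipaddress.ip_address("10.0.1.50") in subnets[1])   # True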
2. Route Tables & Packet Forwarding → Routing Algorithms
Computer Networks, Fifth Edition by Tanenbaum
- Chapter 5: The Network Layer
- Section 5.2: Routing algorithms (how routers make forwarding decisions)
- Section 5.3: Hierarchical routing
- Section 5.4: Broadcast and multicast routing
Why: AWS route tables use longest prefix matching—a fundamental routing concept. Understanding how routers make forwarding decisions helps you design efficient route tables.
Key Takeaway: The implicit router in every VPC (at the +1 address) uses the same forwarding logic as any Internet router. Route table lookups aren’t AWS magic—they’re standard routing algorithms.
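Longest prefix matching is easy to model in a few lines. This is a hedged sketch, not AWS's implementation: the route targets are illustrative strings, and the table mirrors the default private-subnet routes discussed later.
# Longest prefix match over a VPC-style route table
import ipaddress

ROUTES = {
    "10.0.0.0/16": "local",          # implicit VPC route
    "0.0.0.0/0":   "nat-gateway",    # default route (catch-all)
}

def lookup(dst_ip):
    dst = ipaddress.ip_address(dst_ip)
    best = None
    for cidr, target in ROUTES.items():
        net = ipaddress.ip_network(cidr)
        # Keep the matching route with the most specific (longest) prefix
        if dst in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, target)
    return best[1] if best else "no route: packet dropped"

print(lookup("10.0.1.50"))       # local (the /16 beats the /0)
print(lookup("140.82.121.4"))    # nat-gateway (only the /0 matches)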
3. Security Groups & NACLs → Firewalls & Packet Filtering
Computer Networks, Fifth Edition by Tanenbaum
- Chapter 8: Network Security
- Section 8.9: Firewalls (packet filtering, stateful inspection)
TCP/IP Illustrated, Volume 1 by Stevens
- Chapter 13: TCP Connection Management
- Understanding the three-way handshake helps you understand why stateful firewalls work
- Section on connection tracking and state tables
Why: Security Groups are stateful packet filters—they track TCP connections. NACLs are stateless. Understanding the TCP state machine explains why Security Groups can auto-allow return traffic.
Key Takeaway: When you allow inbound port 80 in a Security Group, AWS maintains a connection tracking table (like conntrack in Linux iptables). This is why return traffic on ephemeral ports is automatically allowed.
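A toy model makes the mechanism obvious. This sketch is a simplification (real connection tracking also follows TCP state and entry timeouts), but it captures why return traffic needs no extra rule:
# Toy stateful filter: remember outbound 5-tuples, auto-allow the reply
conntrack = set()

def record_outbound(src, sport, dst, dport):
    conntrack.add((src, sport, dst, dport))       # connection table entry

def inbound_allowed(src, sport, dst, dport):
    # A reply matches a tracked connection with the endpoints swapped
    return (dst, dport, src, sport) in conntrack

record_outbound("10.0.10.50", 49152, "140.82.121.4", 443)
print(inbound_allowed("140.82.121.4", 443, "10.0.10.50", 49152))  # True: reply
print(inbound_allowed("203.0.113.9", 443, "10.0.10.50", 49152))   # False: unsolicited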
4. NAT Gateway → Network Address Translation
Computer Networks, Fifth Edition by Tanenbaum
- Chapter 5: The Network Layer
- Section 5.6.4: Network Address Translation (NAT)
- How NAT maintains translation tables
- Port Address Translation (PAT/NAPT)
Why: AWS NAT Gateways use PAT to allow many private IPs to share one public IP. Understanding the NAT translation table helps you debug connectivity issues.
Key Takeaway: NAT is a workaround for IPv4 address exhaustion. AWS NAT Gateways translate thousands of private IPs to a single Elastic IP using port mapping—not AWS-specific, but an industry-standard technique.
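The translation table itself is just a bidirectional mapping. A minimal sketch, assuming one Elastic IP and a naive port allocator (real NAT also expires idle entries and handles port exhaustion):
# Toy PAT: many private (ip, port) pairs share one public IP
import itertools

PUBLIC_IP = "54.123.45.67"            # illustrative Elastic IP
next_port = itertools.count(32000)    # simplistic public-port allocator
nat_table = {}                        # (private_ip, private_port) -> public_port

def translate_outbound(private_ip, private_port):
    key = (private_ip, private_port)
    if key not in nat_table:
        nat_table[key] = next(next_port)
    return PUBLIC_IP, nat_table[key]

def translate_inbound(public_port):
    # Reverse lookup on the return path; no entry means the packet is dropped
    for (ip, port), pport in nat_table.items():
        if pport == public_port:
            return ip, port
    return None

print(translate_outbound("10.0.10.50", 49152))   # ('54.123.45.67', 32000)
print(translate_inbound(32000))                  # ('10.0.10.50', 49152)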
5. VPN & IPSec → Cryptographic Tunnels
Computer Networks, Fifth Edition by Tanenbaum
- Chapter 8: Network Security
- Section 8.7: IPSec (the protocol AWS VPN uses)
- Section 8.8: VPNs
- Tunnel mode vs transport mode
Why: AWS Site-to-Site VPN uses IPSec. Understanding how IPSec encrypts and encapsulates packets helps you troubleshoot VPN connectivity and understand performance characteristics.
Key Takeaway: IPSec adds overhead—encryption/decryption CPU cost and packet size increase. This is why VPN throughput is limited compared to Direct Connect’s raw physical connection.
6. BGP & Direct Connect → Routing Protocols
Computer Networks, Fifth Edition by Tanenbaum
- Chapter 5: The Network Layer
- Section 5.3.4: Routing in the Internet
- Border Gateway Protocol (BGP) basics
Why: AWS Direct Connect uses BGP to exchange routes between your network and AWS. Understanding BGP path selection helps you control traffic flow and implement proper failover.
Key Takeaway: BGP is how the entire Internet exchanges routing information. AWS Direct Connect uses the same protocol, so learning BGP fundamentals gives you control over how traffic routes between your data center and AWS.
7. DNS in VPC → Domain Name System
Computer Networks, Fifth Edition by Tanenbaum
- Chapter 7: The Application Layer
- Section 7.1: DNS
- Recursive vs iterative queries
- DNS caching
Why: Every VPC has a DNS resolver at the +2 address (e.g., 10.0.0.2 in a 10.0.0.0/16 VPC). Route 53 Resolver endpoints allow DNS queries across hybrid networks.
Key Takeaway: AWS Route 53 Resolver is just a DNS recursive resolver—same concept as running your own DNS server, but managed by AWS.
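The +2 address is pure arithmetic on the VPC CIDR, as this one-liner shows:
# The VPC resolver lives at the network base + 2
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.network_address + 2)    # 10.0.0.2, the Amazon-provided DNS address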
8. VPC Flow Logs → Network Monitoring
The Practice of Network Security Monitoring by Bejtlich
- Chapter 2: Network Traffic Collection
- Chapter 3: Network Traffic Analysis
- Flow data vs packet capture
- NetFlow and similar technologies
Why: VPC Flow Logs capture metadata about traffic (source, destination, ports, protocol, action). Understanding flow data helps you monitor and troubleshoot network security.
Key Takeaway: VPC Flow Logs are AWS’s implementation of NetFlow/IPFIX—industry-standard network monitoring. They don’t capture packet payloads, only metadata, which is why they’re efficient.
9. TCP/IP Deep Dive → Protocol Understanding
TCP/IP Illustrated, Volume 1 by Stevens
- Chapter 2: The Internet Protocol
- Chapter 13: TCP Connection Establishment and Termination
- Chapter 17: TCP Interactive Data Flow
- Chapter 20: TCP Bulk Data Flow
Why: Debugging AWS network issues requires understanding TCP behavior—SYN floods, connection timeouts, TCP window scaling, MTU issues.
Key Takeaway: AWS networking doesn’t change how TCP works. If you understand TCP’s three-way handshake, you’ll understand why a Security Group blocking port 80 produces a timeout (the SYN is silently dropped, so no SYN-ACK ever returns) rather than a connection refused.
10. Linux Network Stack → Practical Implementation
The Linux Programming Interface by Kerrisk
- Chapter 59: Sockets: Internet Domains
- Chapter 61: Sockets: Advanced Topics
Computer Systems: A Programmer’s Perspective by Bryant & O’Hallaron
- Chapter 11: Network Programming
- Section 11.3: The Sockets Interface
- Section 11.4: Client-Server model
Why: EC2 instances run Linux (or Windows, but Linux dominates). Understanding the socket API, network namespaces, and iptables helps you debug instance-level networking.
Key Takeaway: EC2 networking is just Linux networking with AWS-managed interfaces (ENIs). If you know how Linux network namespaces work, you’ll understand how containers on ECS/EKS get network isolation.
11. TLS/SSL & Certificates → Secure Communications
Computer Networks, Fifth Edition by Tanenbaum
- Chapter 8: Network Security
- Section 8.6: Communication Security (TLS/SSL)
High Performance Browser Networking by Ilya Grigorik
- Chapter 4: Transport Layer Security (TLS)
- TLS handshake
- Certificate validation
- Performance implications
Why: AWS Certificate Manager (ACM), Application Load Balancer TLS termination, and VPN encryption all use TLS/SSL. Understanding certificate chains and handshakes helps you troubleshoot HTTPS issues.
Key Takeaway: TLS adds latency (handshake) and CPU cost (encryption). Understanding this helps you make decisions about where to terminate TLS (ALB vs EC2) and when to use HTTP/2.
12. Performance & Latency → Network Optimization
High Performance Browser Networking by Ilya Grigorik
- Chapter 1: Primer on Latency and Bandwidth
- Chapter 2: Building Blocks of TCP
- TCP’s impact on application performance
- Slow start, congestion control
Why: AWS networking has latency characteristics—cross-AZ, cross-region, to Internet. Understanding TCP’s behavior helps you optimize application performance.
Key Takeaway: Cross-AZ traffic has ~1-2ms latency. Cross-region has 10-100+ms. TCP slow start means the first few KB of a connection are slower. This knowledge helps you design distributed systems on AWS.
How AWS Networking Actually Works: A Mental Model
Before you touch Terraform or the AWS console, you need a mental model of what’s actually happening when you create a VPC. This isn’t about memorizing console clicks—it’s about understanding the underlying mechanisms so deeply that you can debug any networking problem.
The Physical Reality Behind the Abstraction
When you create a VPC, you’re not actually creating a physical network. AWS runs a software-defined network (SDN) on top of its physical infrastructure. Here’s what’s actually happening:
┌─────────────────────────────────────────────────────────────────────────┐
│ AWS PHYSICAL INFRASTRUCTURE │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Physical │ │ Physical │ │ Physical │ │
│ │ Server A │ │ Server B │ │ Server C │ ... thousands │
│ │ (Host) │ │ (Host) │ │ (Host) │ more │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └──────────────────┴──────────────────┘ │
│ │ │
│ ┌─────────────┴─────────────┐ │
│ │ AWS Physical Network │ │
│ │ (High-speed backbone) │ │
│ └─────────────┬─────────────┘ │
└─────────────────────────────┼───────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ AWS SOFTWARE-DEFINED NETWORK LAYER │
│ │
│ Your VPC (10.0.0.0/16) is a LOGICAL construct overlaid on physical │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ YOUR VPC │ │
│ │ (10.0.0.0/16) │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │Public Subnet │ │Private Subnet│ │ DB Subnet │ │ │
│ │ │ 10.0.1.0/24 │ │ 10.0.10.0/24 │ │ 10.0.20.0/24 │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ ┌───────┐ │ │ ┌───────┐ │ │ ┌───────┐ │ │ │
│ │ │ │ EC2-1 │ │ │ │ EC2-2 │ │ │ │ RDS │ │ │ │
│ │ │ │10.0.1.│ │ │ │10.0.10│ │ │ │10.0.20│ │ │ │
│ │ │ │ 50 │ │ │ │ 50 │ │ │ │ 50 │ │ │ │
│ │ │ └───────┘ │ │ └───────┘ │ │ └───────┘ │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Even though EC2-1 and EC2-2 might be on DIFFERENT physical servers, │
│ they see each other as if on the same logical 10.0.0.0/16 network │
└─────────────────────────────────────────────────────────────────────────┘
The Key Insight: Your VPC isn’t a physical network—it’s a set of rules that AWS’s SDN enforces. When EC2-1 sends a packet to EC2-2, the packet might traverse multiple physical switches and routers, but the SDN layer makes it appear as if they’re on the same local network.
How a Packet Flows Through Your VPC
Let’s trace a packet from an EC2 instance in a private subnet trying to reach the internet. This is the single most important thing to understand:
Step-by-Step: EC2 in Private Subnet → Internet
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 1: EC2 Instance Generates Packet │
│ │
│ EC2 (10.0.10.50) wants to reach api.github.com (140.82.121.4) │
│ │
│ Packet created: │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Src IP: 10.0.10.50 │ Dst IP: 140.82.121.4 │ Dst Port: 443 │ │
│ └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 2: Security Group Check (Outbound) │
│ │
│ EC2's Security Group is checked for OUTBOUND rules │
│ │
│ sg-app-tier rules: │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Type │ Protocol │ Port │ Destination │ Action │ │
│ ├──────────┼──────────┼────────┼───────────────┼─────────────────┤ │
│ │ All │ All │ All │ 0.0.0.0/0 │ ALLOW │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ ✓ Outbound to 140.82.121.4:443 is ALLOWED │
│ (Security Groups are stateful - return traffic auto-allowed) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 3: Route Table Lookup │
│ │
│ Private subnet's route table is consulted: │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Destination │ Target │ │
│ ├────────────────┼──────────────────────────────────────────────┤ │
│ │ 10.0.0.0/16 │ local (within VPC) │ │
│ │ 0.0.0.0/0 │ nat-gateway-id (NAT Gateway) │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ Destination 140.82.121.4 matches 0.0.0.0/0 → Send to NAT Gateway │
│ (Longest prefix match: /0 is the catch-all for anything not local) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 4: NACL Check (Outbound from Private Subnet) │
│ │
│ Network ACL for private subnet is checked: │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Rule # │ Type │ Protocol │ Port │ Dest │ Allow/Deny │ │
│ ├────────┼───────┼──────────┼─────────┼───────────┼─────────────┤ │
│ │ 100 │ All │ All │ All │ 0.0.0.0/0 │ ALLOW │ │
│ │ * │ All │ All │ All │ 0.0.0.0/0 │ DENY │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ ✓ Rule 100 matches → ALLOW │
│ (NACLs are stateless - must also allow return traffic separately!) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 5: NAT Gateway Performs Translation │
│ │
│ NAT Gateway in PUBLIC subnet receives packet and: │
│ 1. Replaces source IP with its Elastic IP │
│ 2. Tracks the connection in its translation table │
│ │
│ Original packet: │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Src: 10.0.10.50:49152 │ Dst: 140.82.121.4:443 │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ Translated packet: │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Src: 54.123.45.67:32001 │ Dst: 140.82.121.4:443 │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
│ Translation table entry: │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Internal: 10.0.10.50:49152 ↔ External: 54.123.45.67:32001 │ │
│ │ Destination: 140.82.121.4:443 │ │
│ └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 6: Route to Internet Gateway │
│ │
│ Public subnet's route table: │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Destination │ Target │ │
│ ├────────────────┼──────────────────────────────────────────────┤ │
│ │ 10.0.0.0/16 │ local │ │
│ │ 0.0.0.0/0 │ igw-id (Internet Gateway) │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ Packet sent to Internet Gateway → AWS backbone → Internet │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STEP 7: Response Returns (Reverse Path) │
│ │
│ api.github.com responds: │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Src: 140.82.121.4:443 │ Dst: 54.123.45.67:32001 │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
│ 1. IGW receives, routes to NAT Gateway (it's the destination) │
│ 2. NAT Gateway looks up translation table, reverse-translates: │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Src: 140.82.121.4:443 │ Dst: 10.0.10.50:49152 │ │
│ └───────────────────────────────────────────────────────────┘ │
│ 3. Route table sends to private subnet (10.0.10.0/24 is local) │
│ 4. NACL inbound check (must ALLOW ephemeral ports 1024-65535!) │
│ 5. Security Group: return traffic auto-allowed (stateful!) │
│ 6. Packet delivered to EC2 │
└─────────────────────────────────────────────────────────────────────────┘
The Critical Insight: Notice how Security Groups and NACLs behave differently on the return path:
- Security Group: Automatically allows return traffic (stateful)
- NACL: Must explicitly allow inbound traffic on ephemeral ports (stateless)
This is why misconfigured NACLs cause “I can send but not receive” problems.
Security Groups vs NACLs: The Deep Dive
This is the most misunderstood part of AWS networking. Let’s break it down completely.
Connection Tracking: How Security Groups Work
Security Groups are stateful firewalls. This means they track connections. Here’s what happens under the hood:
┌─────────────────────────────────────────────────────────────────────────┐
│ HOW SECURITY GROUP CONNECTION TRACKING WORKS │
│ │
│ When you allow inbound port 443, AWS maintains a connection table: │
│ │
│ CLIENT (203.0.113.50:52341) ──── SYN ────► EC2 (10.0.1.50:443) │
│ │
│ Security Group sees: "New inbound connection to allowed port 443" │
│ │
│ Connection Table Entry Created: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Connection ID: 12345 │ │
│ │ Protocol: TCP │ │
│ │ Local: 10.0.1.50:443 │ │
│ │ Remote: 203.0.113.50:52341 │ │
│ │ State: ESTABLISHED │ │
│ │ Direction: INBOUND (originally) │ │
│ │ Created: 2024-12-22 14:32:01 │ │
│ │ Last Activity: 2024-12-22 14:32:05 │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ EC2 (10.0.1.50:443) ──── SYN-ACK ────► CLIENT (203.0.113.50:52341) │
│ │
│ Security Group sees: "Outbound to 203.0.113.50:52341" │
│ Checks connection table: "This is return traffic for connection 12345"│
│ Result: ALLOWED (no outbound rule needed!) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
NACLs: Stateless Packet Filtering
NACLs don’t track connections. Each packet is evaluated independently:
┌─────────────────────────────────────────────────────────────────────────┐
│ HOW NACLs EVALUATE PACKETS (STATELESS) │
│ │
│ CLIENT (203.0.113.50:52341) ──── SYN ────► EC2 (10.0.1.50:443) │
│ │
│ NACL Inbound Evaluation: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Packet: Src=203.0.113.50:52341, Dst=10.0.1.50:443, TCP │ │
│ │ │ │
│ │ Rule 100: Allow TCP 443 from 0.0.0.0/0 → MATCH! → ALLOW │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ EC2 (10.0.1.50:443) ──── SYN-ACK ────► CLIENT (203.0.113.50:52341) │
│ │
│ NACL Outbound Evaluation: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Packet: Src=10.0.1.50:443, Dst=203.0.113.50:52341, TCP │ │
│ │ │ │
│ │ Rule 100: Allow TCP 443 from 0.0.0.0/0 │ │
│ │ Port 443? NO! Destination port is 52341 │ │
│ │ → NO MATCH │ │
│ │ │ │
│ │ You need: Allow TCP 1024-65535 to 0.0.0.0/0 (ephemeral ports) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ⚠️ WITHOUT EPHEMERAL PORT RULE, RETURN TRAFFIC IS BLOCKED! │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Complete Security Groups vs NACLs Comparison
| Aspect | Security Groups | NACLs |
|---|---|---|
| State | Stateful (tracks connections) | Stateless (each packet independent) |
| Scope | Instance/ENI level | Subnet level |
| Rules | Allow only | Allow AND Deny |
| Rule Order | All rules evaluated | Evaluated in numbered order |
| Return Traffic | Automatically allowed | Must be explicitly allowed |
| Default | Deny all inbound, allow all outbound | Allow all (default NACL) |
| Use Case | Primary security layer | Block specific IPs, defense-in-depth |
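The rule-order difference in the table above is worth encoding. This is a hedged sketch of NACL semantics with rule shapes simplified to protocol plus a single port: the lowest-numbered matching rule decides, and the implicit * rule denies everything else.
# NACL semantics: evaluate rules in ascending number order, first match wins
NACL_INBOUND = [
    (100, "tcp", 443, "allow"),
    (200, "tcp", 22,  "deny"),
]

def nacl_decision(protocol, port):
    for num, proto, rule_port, action in sorted(NACL_INBOUND):
        if proto == protocol and rule_port == port:
            return f"rule {num}: {action}"
    return "rule *: deny"            # implicit catch-all deny

print(nacl_decision("tcp", 443))     # rule 100: allow
print(nacl_decision("tcp", 22))      # rule 200: deny
print(nacl_decision("tcp", 3306))    # rule *: deny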
When a Connection Fails: Timeout vs Connection Refused
Understanding error messages helps you debug:
┌─────────────────────────────────────────────────────────────────────────┐
│ DIAGNOSING NETWORK FAILURES │
│ │
│ SCENARIO 1: Connection Timeout │
│ ───────────────────────────────────────────────────────────────────── │
│ $ curl https://10.0.2.50:443 │
│ curl: (28) Connection timed out after 30001 milliseconds │
│ │
│ What happened: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ CLIENT ──── SYN ────► [DROPPED SILENTLY] ──✗ │ │
│ │ │ │
│ │ Packet never reached destination or response never came back │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Causes: │
│ • Security Group blocking (drops packet silently) │
│ • NACL blocking (drops packet silently) │
│ • Route table misconfiguration (packet sent to wrong place) │
│ • NAT Gateway issue (no route to internet) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ SCENARIO 2: Connection Refused │
│ ───────────────────────────────────────────────────────────────────── │
│ $ curl https://10.0.2.50:443 │
│ curl: (7) Failed to connect: Connection refused │
│ │
│ What happened: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ CLIENT ──── SYN ────► EC2 ──── RST ────► CLIENT │ │
│ │ │ │
│ │ Packet reached destination, but nothing listening on that port │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Causes: │
│ • Service not running on target port │
│ • Service bound to localhost only (127.0.0.1) │
│ • iptables on EC2 blocking (OS-level firewall) │
│ │
│ KEY INSIGHT: Connection refused means network path is WORKING! │
│ The problem is at the application level, not network level. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
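You can reproduce both failure modes from any client with a short probe. A minimal sketch using Python's socket module (the target IP is the illustrative one from the scenarios above):
# Distinguish "timed out" (packet silently dropped) from "refused" (RST returned)
import socket

def probe(host, port, timeout=3):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "connected: port open and reachable"
    except socket.timeout:
        return "timeout: SYN dropped, suspect SG, NACL, or routing"
    except ConnectionRefusedError:
        return "refused: host reached, nothing listening (application issue)"
    finally:
        s.close()

print(probe("10.0.2.50", 443))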
The VPC Address Space: Reserved IPs and CIDR Math
Every subnet loses 5 IP addresses to AWS. Understanding why helps you plan capacity:
┌─────────────────────────────────────────────────────────────────────────┐
│ SUBNET: 10.0.1.0/24 (256 IPs theoretically) │
│ │
│ Reserved by AWS: │
│ ┌───────────────┬───────────────────────────────────────────────────┐ │
│ │ 10.0.1.0 │ Network address (standard networking) │ │
│ │ 10.0.1.1 │ VPC Router (implicit router for the subnet) │ │
│ │ 10.0.1.2 │ DNS Server (Amazon-provided DNS) │ │
│ │ 10.0.1.3 │ Reserved for future use │ │
│ │ 10.0.1.255 │ Broadcast address (VPC doesn't support broadcast) │ │
│ └───────────────┴───────────────────────────────────────────────────┘ │
│ │
│ Usable IPs: 10.0.1.4 through 10.0.1.254 = 251 IPs │
│ │
│ ⚠️ CRITICAL: The VPC router (10.0.1.1) is how traffic leaves the │
│ subnet. All outbound traffic goes here first, then route table │
│ determines next hop. │
│ │
│ ⚠️ CRITICAL: The DNS server (10.0.1.2) is why enableDnsSupport and │
│ enableDnsHostnames matter. Without these, EC2 instances can't │
│ resolve DNS names. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
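The reserved addresses fall out of the CIDR arithmetic, as this sketch shows for the subnet above:
# The five AWS-reserved addresses in a subnet, derived from its CIDR
import ipaddress

subnet = ipaddress.ip_network("10.0.1.0/24")
reserved = {
    str(subnet.network_address):     "network address",
    str(subnet.network_address + 1): "VPC router",
    str(subnet.network_address + 2): "DNS server",
    str(subnet.network_address + 3): "reserved for future use",
    str(subnet.broadcast_address):   "broadcast (unsupported in VPC)",
}
for ip, role in reserved.items():
    print(ip, "-", role)
print("usable:", subnet.num_addresses - 5)    # 251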
CIDR Block Planning: Common Mistakes
┌─────────────────────────────────────────────────────────────────────────┐
│ CIDR PLANNING: AVOID THESE MISTAKES │
│ │
│ MISTAKE 1: VPC CIDR too small │
│ ───────────────────────────────────────────────────────────────────── │
│ Created: 10.0.0.0/24 (256 IPs) │
│ Problem: Only room for ~4 small subnets │
│ Primary CIDR can't be resized (only secondary CIDRs can be added!)      │
│ │
│ Better: 10.0.0.0/16 (65,536 IPs) - room to grow │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ MISTAKE 2: Overlapping CIDRs across VPCs │
│ ───────────────────────────────────────────────────────────────────── │
│ VPC-A: 10.0.0.0/16 │
│ VPC-B: 10.0.0.0/16 ← SAME CIDR! │
│ │
│ Problem: Can NEVER peer these VPCs or connect via Transit Gateway │
│ Routing would be ambiguous: is 10.0.1.50 in VPC-A or VPC-B? │
│ │
│ Better: Plan non-overlapping ranges: │
│ VPC-A: 10.0.0.0/16 │
│ VPC-B: 10.1.0.0/16 │
│ VPC-C: 10.2.0.0/16 │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ MISTAKE 3: Using 172.17.0.0/16 with Docker │
│ ───────────────────────────────────────────────────────────────────── │
│ Docker's default bridge network: 172.17.0.0/16 │
│ │
│ If your VPC is 172.17.0.0/16, containers can't reach VPC resources! │
│ Route goes to Docker bridge instead of VPC router. │
│ │
│ Better: Avoid 172.17.x.x for VPCs if using Docker │
│ │
└─────────────────────────────────────────────────────────────────────────┘
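Overlaps are cheap to catch before they become un-peerable VPCs. A small sketch (the VPC names and CIDRs are illustrative):
# Detect overlapping CIDRs before trying to peer VPCs
import ipaddress

VPCS = {"vpc-a": "10.0.0.0/16", "vpc-b": "10.0.0.0/16", "vpc-c": "10.1.0.0/16"}

nets = {name: ipaddress.ip_network(cidr) for name, cidr in VPCS.items()}
names = list(nets)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if nets[a].overlaps(nets[b]):
            print(f"{a} ({nets[a]}) overlaps {b} ({nets[b]}): cannot peer")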
VPC Flow Logs: Seeing What’s Actually Happening
VPC Flow Logs capture metadata about every network flow. They’re your primary debugging and security monitoring tool.
Flow Log Record Format
┌─────────────────────────────────────────────────────────────────────────┐
│ VPC FLOW LOG RECORD (Version 2) │
│ │
│ Raw log entry: │
│ 2 123456789012 eni-abc123 10.0.1.50 10.0.2.100 49152 443 6 25 1234 │
│ 1639489200 1639489260 ACCEPT OK │
│ │
│ Parsed: │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ version │ 2 │ │
│ │ account-id │ 123456789012 │ │
│ │ interface-id │ eni-abc123 (which ENI saw this traffic) │ │
│ │ srcaddr │ 10.0.1.50 (source IP) │ │
│ │ dstaddr │ 10.0.2.100 (destination IP) │ │
│ │ srcport │ 49152 (source port - ephemeral) │ │
│ │ dstport │ 443 (destination port - HTTPS) │ │
│ │ protocol │ 6 (TCP - see IANA protocol numbers) │ │
│ │ packets │ 25 │ │
│ │ bytes │ 1234 │ │
│ │ start │ 1639489200 (Unix timestamp) │ │
│ │ end │ 1639489260 │ │
│ │ action │ ACCEPT (or REJECT) │ │
│ │ log-status │ OK │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ KEY INSIGHT: action=REJECT means Security Group OR NACL blocked it │
│ Flow logs don't tell you WHICH one - you must investigate both │
│ │
└─────────────────────────────────────────────────────────────────────────┘
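Since the version-2 format is fixed and space-separated, parsing is one zip away. This sketch reuses the exact record from the box above:
# Parse a version-2 VPC Flow Log record into named fields
FIELDS = ["version", "account_id", "interface_id", "srcaddr", "dstaddr",
          "srcport", "dstport", "protocol", "packets", "bytes",
          "start", "end", "action", "log_status"]

record = ("2 123456789012 eni-abc123 10.0.1.50 10.0.2.100 "
          "49152 443 6 25 1234 1639489200 1639489260 ACCEPT OK")

parsed = dict(zip(FIELDS, record.split()))
print(parsed["srcaddr"], "->", parsed["dstaddr"], parsed["action"])
# 10.0.1.50 -> 10.0.2.100 ACCEPT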
Using Flow Logs to Detect Attacks
┌─────────────────────────────────────────────────────────────────────────┐
│ SECURITY PATTERNS IN FLOW LOGS │
│ │
│ PATTERN 1: Port Scan Detection │
│ ───────────────────────────────────────────────────────────────────── │
│ Same source IP hitting many destination ports in short time: │
│ │
│ 203.0.113.50 → 10.0.1.100:22 REJECT │
│ 203.0.113.50 → 10.0.1.100:23 REJECT │
│ 203.0.113.50 → 10.0.1.100:80 REJECT │
│ 203.0.113.50 → 10.0.1.100:443 REJECT │
│ 203.0.113.50 → 10.0.1.100:3306 REJECT │
│ 203.0.113.50 → 10.0.1.100:5432 REJECT │
│ │
│ Query: Group by srcaddr, count distinct dstport, filter > 10 ports │
│ Action: Add 203.0.113.50 to NACL deny list │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ PATTERN 2: Data Exfiltration │
│ ───────────────────────────────────────────────────────────────────── │
│ Unusual outbound data volume to external IP: │
│ │
│ 10.0.2.50 → 185.143.223.100:443 ACCEPT bytes=2,147,483,648 │
│ (2 GB to unknown external IP - possible data theft!) │
│ │
│ Query: Group by srcaddr+dstaddr, sum bytes, filter external + large │
│ Action: Investigate instance, check against threat intelligence │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ PATTERN 3: SSH Brute Force │
│ ───────────────────────────────────────────────────────────────────── │
│ Many rejected connections to port 22 from same source: │
│ │
│ 185.234.x.x → 10.0.1.100:22 REJECT (1000+ times in 1 hour) │
│ │
│ Query: Filter dstport=22 AND action=REJECT, group by srcaddr │
│ Action: Block at NACL, consider IP reputation service │
│ │
└─────────────────────────────────────────────────────────────────────────┘
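Pattern 1 translates directly into code. A hedged sketch over already-parsed records (the flows list stands in for whatever your log pipeline produces; the threshold is a tunable assumption):
# Flag sources whose REJECTed traffic spans many distinct ports (port scan)
from collections import defaultdict

flows = [  # (srcaddr, dstport, action) tuples from parsed flow logs
    ("203.0.113.50", 22, "REJECT"), ("203.0.113.50", 23, "REJECT"),
    ("203.0.113.50", 80, "REJECT"), ("203.0.113.50", 443, "REJECT"),
    ("203.0.113.50", 3306, "REJECT"), ("203.0.113.50", 5432, "REJECT"),
    ("10.0.1.50", 443, "ACCEPT"),
]

rejected_ports = defaultdict(set)
for src, dport, action in flows:
    if action == "REJECT":
        rejected_ports[src].add(dport)

THRESHOLD = 5   # distinct ports; tune for your environment
for src, ports in rejected_ports.items():
    if len(ports) >= THRESHOLD:
        print(f"possible port scan from {src}: {len(ports)} distinct ports")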
Concept Summary Table
| Concept Cluster | What You Must Internalize |
|---|---|
| Packet Flow | A packet leaving an EC2 goes: ENI → Security Group (outbound) → Route Table → NACL (outbound) → Next hop. Return traffic reverses but NACLs need explicit inbound rules. |
| Security Groups | Stateful means connection tracking. Allow inbound = auto-allow return. No deny rules. Attached to ENI (instance level). All rules evaluated. |
| NACLs | Stateless means every packet checked independently. Need ephemeral port rules (1024-65535) for return traffic. Rules evaluated in order. Can deny. |
| Route Tables | Longest prefix match wins. Local route always exists. “Public subnet” just means route to IGW. “Private” means route to NAT. |
| NAT Gateway | Performs PAT (Port Address Translation). Must be in public subnet. Maintains translation table. Single point of failure per AZ without redundancy. |
| CIDR Planning | The primary VPC CIDR can’t be changed after creation (only secondary CIDRs can be added). Overlapping CIDRs prevent connectivity. Lose 5 IPs per subnet. Plan for growth. |
| Flow Logs | Capture metadata (not payload). REJECT could be SG or NACL. Essential for security monitoring and debugging. |
| Timeout vs Refused | Timeout = packet dropped (SG/NACL/routing issue). Refused = packet reached but nothing listening (application issue). |
Project 1: Build a Production-Ready VPC from Scratch
- File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
- Main Programming Language: Terraform
- Alternative Programming Languages: AWS CDK (TypeScript), CloudFormation (YAML), Pulumi
- Coolness Level: Level 1: Pure Corporate Snoozefest
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Cloud Infrastructure / Networking
- Software or Tool: AWS VPC, Terraform
- Main Book: “AWS Certified Solutions Architect Study Guide” or AWS Documentation
What you’ll build: A fully functional VPC with public and private subnets across multiple Availability Zones, Internet Gateway, NAT Gateway, proper route tables, and Security Groups—deployable with a single command.
Why it teaches AWS networking: This is the foundation of EVERYTHING in AWS. By building it from scratch with infrastructure-as-code, you’ll understand every component and how they fit together. You can’t just click through the console—you must explicitly define every relationship.
Core challenges you’ll face:
- CIDR block planning (avoiding overlaps, sizing for growth) → maps to IP address management
- Multi-AZ subnet design (high availability, redundancy) → maps to fault tolerance
- Route table associations (which subnet goes where) → maps to traffic flow control
- NAT Gateway placement (public subnet, Elastic IP) → maps to outbound internet access
- Security Group design (least privilege, references between groups) → maps to micro-segmentation
Key Concepts:
- VPC Fundamentals: AWS VPC User Guide
- CIDR Notation: “Computer Networks, Fifth Edition” Chapter 5.6 - Tanenbaum
- Terraform Basics: HashiCorp Terraform AWS Provider Documentation
- Multi-AZ Design: AWS VPC Design Best Practices
Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: Basic AWS console familiarity, CLI setup.
Real world outcome:
$ terraform init && terraform apply
Plan: 23 to add, 0 to change, 0 to destroy.
aws_vpc.main: Creating...
aws_vpc.main: Creation complete after 2s [id=vpc-0abc123def456]
aws_subnet.public_a: Creating...
aws_subnet.public_b: Creating...
aws_subnet.private_a: Creating...
aws_subnet.private_b: Creating...
aws_internet_gateway.main: Creating...
aws_nat_gateway.main: Creating...
aws_route_table.public: Creating...
aws_route_table.private: Creating...
...
Apply complete! Resources: 23 added, 0 changed, 0 destroyed.
Outputs:
vpc_id = "vpc-0abc123def456"
public_subnet_ids = ["subnet-pub-a", "subnet-pub-b"]
private_subnet_ids = ["subnet-priv-a", "subnet-priv-b"]
nat_gateway_ip = "54.123.45.67"
# Verify connectivity
$ aws ec2 run-instances --subnet-id subnet-priv-a --image-id ami-xxx
$ ssh -J ec2-user@54.x.x.x ec2-user@10.0.10.50
[ec2-user@ip-10-0-10-50 ~]$ curl ifconfig.me
54.123.45.67 # Traffic exits via NAT Gateway!
Implementation Hints: The VPC structure should look like:
VPC: 10.0.0.0/16 (65,536 IPs)
├── Public Subnet A: 10.0.1.0/24 (AZ-a) - 256 IPs
├── Public Subnet B: 10.0.2.0/24 (AZ-b) - 256 IPs
├── Private Subnet A: 10.0.10.0/24 (AZ-a) - 256 IPs
├── Private Subnet B: 10.0.11.0/24 (AZ-b) - 256 IPs
├── Database Subnet A: 10.0.20.0/24 (AZ-a) - 256 IPs
└── Database Subnet B: 10.0.21.0/24 (AZ-b) - 256 IPs
Key Terraform resources to create:
# Pseudo-Terraform structure
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
}
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
}
resource "aws_subnet" "public" {
for_each = { "a" = "10.0.1.0/24", "b" = "10.0.2.0/24" }
vpc_id = aws_vpc.main.id
cidr_block = each.value
availability_zone = "us-east-1${each.key}"
map_public_ip_on_launch = true
}
resource "aws_nat_gateway" "main" {
subnet_id = aws_subnet.public["a"].id
allocation_id = aws_eip.nat.id
}
resource "aws_route_table" "private" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.main.id
}
}
Learning milestones:
- VPC deploys with all subnets → You understand VPC structure
- EC2 in public subnet is reachable → You understand IGW and public routing
- EC2 in private subnet can reach internet → You understand NAT Gateway flow
- Resources in private subnet are NOT directly reachable → You understand isolation
Real World Outcome
When you complete this project, you will have a fully deployable VPC infrastructure that you can use for any application. Here’s exactly what you’ll see:
# Step 1: Initialize and apply your Terraform configuration
$ cd vpc-project && terraform init
Initializing the backend...
Initializing provider plugins...
- Finding hashicorp/aws versions matching "~> 5.0"...
- Installing hashicorp/aws v5.31.0...
Terraform has been successfully initialized!
$ terraform plan
Terraform will perform the following actions:
# aws_eip.nat will be created
+ resource "aws_eip" "nat" {
+ allocation_id = (known after apply)
+ domain = "vpc"
+ public_ip = (known after apply)
}
# aws_internet_gateway.main will be created
+ resource "aws_internet_gateway" "main" {
+ id = (known after apply)
+ vpc_id = (known after apply)
}
# aws_nat_gateway.main will be created
+ resource "aws_nat_gateway" "main" {
+ allocation_id = (known after apply)
+ connectivity_type = "public"
+ public_ip = (known after apply)
+ subnet_id = (known after apply)
}
# aws_vpc.main will be created
+ resource "aws_vpc" "main" {
+ cidr_block = "10.0.0.0/16"
+ enable_dns_hostnames = true
+ enable_dns_support = true
+ id = (known after apply)
}
... (23 resources total)
Plan: 23 to add, 0 to change, 0 to destroy.
$ terraform apply -auto-approve
aws_vpc.main: Creating...
aws_vpc.main: Creation complete after 2s [id=vpc-0abc123def456789]
aws_internet_gateway.main: Creating...
aws_subnet.public["a"]: Creating...
aws_subnet.public["b"]: Creating...
aws_subnet.private["a"]: Creating...
aws_subnet.private["b"]: Creating...
aws_internet_gateway.main: Creation complete after 1s [id=igw-0def456789abc123]
aws_eip.nat: Creating...
aws_eip.nat: Creation complete after 1s [id=eipalloc-0123456789abcdef]
aws_nat_gateway.main: Creating...
aws_nat_gateway.main: Still creating... [1m0s elapsed]
aws_nat_gateway.main: Creation complete after 1m45s [id=nat-0abcdef123456789]
aws_route_table.private: Creating...
aws_route_table.public: Creating...
...
Apply complete! Resources: 23 added, 0 changed, 0 destroyed.
Outputs:
vpc_id = "vpc-0abc123def456789"
vpc_cidr = "10.0.0.0/16"
public_subnet_ids = [
"subnet-0pub1111111111111",
"subnet-0pub2222222222222",
]
private_subnet_ids = [
"subnet-0prv1111111111111",
"subnet-0prv2222222222222",
]
database_subnet_ids = [
"subnet-0db11111111111111",
"subnet-0db22222222222222",
]
nat_gateway_public_ip = "54.123.45.67"
internet_gateway_id = "igw-0def456789abc123"
# Step 2: Verify the infrastructure in AWS Console or CLI
$ aws ec2 describe-vpcs --vpc-ids vpc-0abc123def456789 --no-cli-pager
{
    "Vpcs": [{
        "CidrBlock": "10.0.0.0/16",
        "VpcId": "vpc-0abc123def456789",
        "State": "available",
        "InstanceTenancy": "default",
        "IsDefault": false
    }]
}
# Note: DNS attributes aren't in describe-vpcs output; query them separately:
$ aws ec2 describe-vpc-attribute --vpc-id vpc-0abc123def456789 --attribute enableDnsSupport
# Step 3: Test connectivity by launching instances
$ aws ec2 run-instances \
--image-id ami-0c55b159cbfafe1f0 \
--instance-type t3.micro \
--subnet-id subnet-0prv1111111111111 \
--security-group-ids sg-0app1111111111111 \
--key-name my-key \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=test-private}]'
# Step 4: SSH via bastion and verify NAT Gateway
$ ssh -J ec2-user@bastion.example.com ec2-user@10.0.10.50
[ec2-user@ip-10-0-10-50 ~]$ curl -s ifconfig.me
54.123.45.67
# Your private instance's traffic exits via the NAT Gateway!
# The public IP shown is the NAT Gateway's Elastic IP, not the instance's
[ec2-user@ip-10-0-10-50 ~]$ curl -s https://api.github.com | head -5
{
"current_user_url": "https://api.github.com/user",
"current_user_authorizations_html_url": "https://github.com/...",
...
}
# Private subnet instance can reach the internet (outbound only)!
Visual Architecture You’ve Built:
┌─────────────────────────────────────────────────────────────────────────────┐
│ YOUR VPC (10.0.0.0/16) │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ AVAILABILITY ZONE A │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ Public Subnet │ │ Private Subnet │ │ DB Subnet │ │ │
│ │ │ 10.0.1.0/24 │ │ 10.0.10.0/24 │ │ 10.0.20.0/24 │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │ │ │
│ │ │ │NAT Gateway│ │ │ │ App Server│ │ │ │ RDS │ │ │ │
│ │ │ │ + EIP │ │ │ │ │ │ │ │ Primary │ │ │ │
│ │ │ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │ │ │
│ │ │ ┌───────────┐ │ │ │ │ │ │ │
│ │ │ │ Bastion │ │ │ │ │ │ │ │
│ │ │ │ Host │ │ │ │ │ │ │ │
│ │ │ └───────────┘ │ │ │ │ │ │ │
│ │ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ AVAILABILITY ZONE B │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ Public Subnet │ │ Private Subnet │ │ DB Subnet │ │ │
│ │ │ 10.0.2.0/24 │ │ 10.0.11.0/24 │ │ 10.0.21.0/24 │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │ │ │
│ │ │ │ ALB │ │ │ │ App Server│ │ │ │ RDS │ │ │ │
│ │ │ │ (spare) │ │ │ │ (spare) │ │ │ │ Standby │ │ │ │
│ │ │ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │ │ │
│ │ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────┐ │
│ │ Internet Gateway │◄───── 0.0.0.0/0 route from public subnets │
│ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────┐
│ Internet │
└──────────┘
The Core Question You’re Answering
“What actually makes a subnet ‘public’ vs ‘private’, and how does traffic flow between them and to the internet?”
This question gets to the heart of VPC design. There’s no checkbox that says “make this subnet public.” A subnet is public because of its route table configuration—specifically, whether it has a route to an Internet Gateway. Understanding this relationship is the foundation of all AWS networking.
Concepts You Must Understand First
Stop and research these before coding:
- CIDR Notation and Subnetting
- What does 10.0.0.0/16 actually mean in binary?
- How do you calculate how many IPs are in a /24 vs /20 vs /16?
- Why can’t VPC CIDRs overlap if you want to peer them?
- What’s the difference between 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16? (RFC 1918)
- Book Reference: “Computer Networks, Fifth Edition” by Tanenbaum — Ch. 5.6: IP Addresses
- Route Tables and Longest Prefix Match
- How does a router decide where to send a packet?
- If you have routes for 10.0.0.0/16 and 10.0.1.0/24, which wins for 10.0.1.50?
- Why does every VPC route table have a “local” route?
- What happens if there’s no matching route?
- Book Reference: “Computer Networks, Fifth Edition” by Tanenbaum — Ch. 5.2: Routing Algorithms
- NAT (Network Address Translation)
- Why can’t private IPs (10.x.x.x) be used on the public internet?
- How does NAT “hide” hundreds of instances behind one public IP?
- What’s the difference between SNAT (source NAT) and DNAT (destination NAT)?
- Why does NAT break some protocols (like FTP active mode)?
- Book Reference: “Computer Networks, Fifth Edition” by Tanenbaum — Ch. 5.6.4: Network Address Translation
- Availability Zones and High Availability
- What actually IS an Availability Zone physically?
- Why do you need subnets in multiple AZs?
- What’s the latency between AZs in the same region?
- What happens if one AZ goes down?
- Book Reference: AWS Well-Architected Framework — Reliability Pillar
- DNS in VPCs
- What is the .2 address in every VPC (e.g., 10.0.0.2)?
- What do enableDnsSupport and enableDnsHostnames actually do?
- Why do some services require DNS hostnames to work?
- Book Reference: “Computer Networks, Fifth Edition” by Tanenbaum — Ch. 7.1: DNS
Questions to Guide Your Design
Before implementing, think through these:
- CIDR Planning
- How many IP addresses do you need today? In 5 years?
- If you use 10.0.0.0/16 for this VPC, what CIDRs will you use for future VPCs?
- How will you organize subnets? By tier (web/app/db)? By AZ? Both?
- What if you later need to peer with a VPC that has 10.0.0.0/16?
- Subnet Design
- Why put public and private subnets in separate CIDR ranges (10.0.1.x vs 10.0.10.x)?
- How many IPs do you need per subnet? (Remember: AWS reserves 5 per subnet)
- Should database subnets be different from app private subnets?
- Do you need isolated subnets (no internet access at all)?
- Route Table Design
- How many route tables do you need? (Hint: minimum 2 - public and private)
- Should each private subnet have its own NAT Gateway for HA?
- What happens to private subnet traffic if the NAT Gateway fails?
- NAT Gateway vs NAT Instance
- NAT Gateway: Managed, scales automatically, ~$0.045/hour + data processing
- NAT Instance: Self-managed EC2, cheaper for low traffic, single point of failure
- When would you choose one over the other?
- Cost Considerations
- NAT Gateway costs: $0.045/hour × 730 hours/month = $32.85/month just to exist
- Plus $0.045/GB data processed
- How can you reduce costs? (VPC Endpoints for AWS services, consider NAT Instance for dev)
Thinking Exercise
Before coding, trace these scenarios on paper:
Scenario 1: EC2 in private subnet wants to call api.github.com
Draw the packet flow:
1. EC2 (10.0.10.50) creates packet: src=10.0.10.50, dst=140.82.121.4 (github)
2. Route table lookup: 140.82.121.4 matches 0.0.0.0/0 → NAT Gateway
3. NAT Gateway receives packet, performs SNAT:
- New src IP = NAT Gateway's Elastic IP (54.123.45.67)
- Stores mapping in translation table
4. Route table in public subnet: 0.0.0.0/0 → Internet Gateway
5. Packet goes to Internet Gateway → Internet → GitHub
6. GitHub responds to 54.123.45.67
7. NAT Gateway receives response, looks up translation table
8. Translates dst back to 10.0.10.50, sends to private subnet
9. EC2 receives response
Question: What happens if NAT Gateway is deleted mid-connection?
Scenario 2: Someone on the internet tries to reach 10.0.10.50 directly
Draw what happens:
1. Attacker sends packet: src=attacker, dst=10.0.10.50
2. Packet arrives at... where exactly?
3. Can 10.0.10.50 even be routed on the public internet?
Answer: The packet never arrives. Private IPs (10.x.x.x) are not routable
on the public internet. Routers drop them. This is why private subnets
are "private" - they're literally unreachable from outside.
Scenario 3: EC2 in public subnet vs EC2 in private subnet
What’s actually different?
Public subnet EC2 (10.0.1.50):
- Has route 0.0.0.0/0 → Internet Gateway
- Can be assigned a public IP (Elastic IP or auto-assign)
- Traffic uses its own public IP for outbound
- Can receive inbound traffic from internet (if SG allows)
Private subnet EC2 (10.0.10.50):
- Has route 0.0.0.0/0 → NAT Gateway
- A public IP would be useless here (no IGW route to carry its traffic)
- Traffic uses NAT Gateway's IP for outbound
- Cannot receive inbound traffic from internet (no route back)
The ONLY difference is the route table. The subnet itself has no
"public" or "private" property - it's ALL about routing.
The Interview Questions They’ll Ask
Prepare to answer these:
- “What makes a subnet public vs private in AWS?”
- “Can an EC2 instance in a private subnet access the internet? How?”
- “What’s the difference between an Internet Gateway and a NAT Gateway?”
- “You have a VPC with CIDR 10.0.0.0/16. Can you peer it with another VPC that has 10.0.0.0/24? Why or why not?”
- “Your private instances can’t reach the internet. How do you troubleshoot?”
- “Why do you need subnets in multiple Availability Zones?”
- “What happens to your application if one AZ goes down and you only have a NAT Gateway in that AZ?”
- “How would you reduce NAT Gateway costs for a development environment?”
- “What’s the maximum CIDR block size for a VPC?”
- “How many IP addresses are usable in a /24 subnet in AWS?”
Hints in Layers
Hint 1: Start with the VPC and Subnets
Your first Terraform file should just create the VPC and subnets:
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "production-vpc"
}
}
# Public subnets - one per AZ
resource "aws_subnet" "public" {
for_each = {
"a" = { cidr = "10.0.1.0/24", az = "us-east-1a" }
"b" = { cidr = "10.0.2.0/24", az = "us-east-1b" }
}
vpc_id = aws_vpc.main.id
cidr_block = each.value.cidr
availability_zone = each.value.az
map_public_ip_on_launch = true # This is what makes instances get public IPs
tags = {
Name = "public-${each.key}"
Tier = "public"
}
}
Run terraform apply and verify in the console that your VPC and subnets exist.
Hint 2: Add the Internet Gateway and Public Route Table
Without this, even “public” subnets can’t reach the internet:
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = {
Name = "main-igw"
}
}
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
tags = {
Name = "public-rt"
}
}
# Associate public subnets with public route table
resource "aws_route_table_association" "public" {
for_each = aws_subnet.public
subnet_id = each.value.id
route_table_id = aws_route_table.public.id
}
Hint 3: Add NAT Gateway for Private Subnets
NAT Gateway needs an Elastic IP and must be in a PUBLIC subnet:
resource "aws_eip" "nat" {
domain = "vpc"
tags = {
Name = "nat-eip"
}
}
resource "aws_nat_gateway" "main" {
allocation_id = aws_eip.nat.id
subnet_id = aws_subnet.public["a"].id # Must be in PUBLIC subnet!
tags = {
Name = "main-nat"
}
depends_on = [aws_internet_gateway.main]
}
resource "aws_route_table" "private" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.main.id
}
tags = {
Name = "private-rt"
}
}
Hint 4: Test Your Setup
After deploying, verify everything works:
# Launch a test instance in private subnet
aws ec2 run-instances \
--image-id ami-0c55b159cbfafe1f0 \
--instance-type t3.micro \
--subnet-id <your-private-subnet-id> \
--key-name <your-key> \
--no-associate-public-ip-address
# Use Session Manager (no bastion needed) or SSH via bastion
# Then test outbound connectivity:
curl -s ifconfig.me # Should show NAT Gateway's EIP
# Try to ping the instance from the internet - it should fail
# (because there's no inbound route)
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| IP addressing & CIDR | “Computer Networks, Fifth Edition” by Tanenbaum | Ch. 5.6: IP Addresses |
| Routing fundamentals | “Computer Networks, Fifth Edition” by Tanenbaum | Ch. 5.2: Routing Algorithms |
| NAT mechanics | “Computer Networks, Fifth Edition” by Tanenbaum | Ch. 5.6.4: Network Address Translation |
| AWS VPC deep dive | “AWS Certified Solutions Architect Study Guide” | VPC Chapter |
| High availability design | “AWS Well-Architected Framework” | Reliability Pillar |
| Terraform basics | “Terraform: Up & Running” by Yevgeniy Brikman | Ch. 2-4 |
| Infrastructure as Code | “Infrastructure as Code” by Kief Morris | Ch. 1-5 |
Project 2: Security Group Traffic Flow Debugger
- File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Bash with AWS CLI
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Security / Networking
- Software or Tool: AWS Security Groups, boto3
- Main Book: “AWS Security” by Dylan Shields
What you’ll build: A CLI tool that analyzes Security Groups and tells you whether traffic can flow between two resources, tracing the path through all relevant security controls.
Why it teaches AWS networking: Security Groups are the most misunderstood AWS feature. People add rules without understanding stateful behavior, or create circular dependencies they can’t debug. This tool forces you to understand exactly how SG evaluation works.
Core challenges you’ll face:
- Understanding stateful filtering (return traffic is auto-allowed) → maps to connection tracking
- Tracing SG references (SG-A allows SG-B, SG-B allows SG-C) → maps to graph traversal
- Rule evaluation order (most permissive wins) → maps to rule processing
- ENI-level association (one resource, multiple SGs) → maps to AWS networking model
Key Concepts:
- Security Groups: AWS Security Group Documentation
- Stateful vs Stateless: Security Group vs NACL
- VPC Reachability Analyzer: AWS Network Access Analyzer
Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: Python, boto3, understanding of TCP/IP.
Real world outcome:
$ ./sg-debug can-connect --from i-abc123 --to i-def456 --port 443
Analyzing connectivity: i-abc123 → i-def456:443
SOURCE INSTANCE: i-abc123
ENI: eni-111
Private IP: 10.0.1.50
Security Groups: [sg-web-tier]
TARGET INSTANCE: i-def456
ENI: eni-222
Private IP: 10.0.2.100
Security Groups: [sg-app-tier]
OUTBOUND CHECK (sg-web-tier):
✓ Rule found: "Allow all outbound" (0.0.0.0/0, all ports)
INBOUND CHECK (sg-app-tier):
✗ No rule allows TCP:443 from 10.0.1.50 or sg-web-tier
RESULT: CONNECTION BLOCKED ❌
RECOMMENDATION:
Add inbound rule to sg-app-tier:
Protocol: TCP
Port: 443
Source: sg-web-tier (recommended) or 10.0.1.50/32
$ ./sg-debug can-connect --from i-abc123 --to i-def456 --port 443 --after-fix
RESULT: CONNECTION ALLOWED ✓
Return traffic: Auto-allowed (Security Groups are stateful)
Implementation Hints:
# Pseudo-code structure: helpers like get_enis() wrap boto3 calls (see the sketch after this block)
def can_connect(source_instance, target_instance, port, protocol="tcp"):
# Get ENI and SG info for both instances
source_enis = get_enis(source_instance)
target_enis = get_enis(target_instance)
source_sgs = get_security_groups(source_enis)
target_sgs = get_security_groups(target_enis)
# Check outbound from source
outbound_allowed = check_outbound_rules(
source_sgs,
target_ip=get_private_ip(target_enis[0]),
target_sg_ids=[sg.id for sg in target_sgs],
port=port,
protocol=protocol
)
if not outbound_allowed:
return False, "Outbound blocked by source Security Group"
# Check inbound to target
inbound_allowed = check_inbound_rules(
target_sgs,
source_ip=get_private_ip(source_enis[0]),
source_sg_ids=[sg.id for sg in source_sgs],
port=port,
protocol=protocol
)
if not inbound_allowed:
return False, "Inbound blocked by target Security Group"
# SGs are stateful - return traffic auto-allowed
return True, "Connection allowed (return traffic auto-allowed)"
def check_inbound_rules(sgs, source_ip, source_sg_ids, port, protocol):
for sg in sgs:
for rule in sg.ip_permissions:
# Check if rule matches protocol
if rule.ip_protocol != protocol and rule.ip_protocol != "-1":
continue
# Check if port is in range
if not port_in_range(port, rule.from_port, rule.to_port):
continue
# Check if source matches
if matches_source(rule, source_ip, source_sg_ids):
return True
return False
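To turn the pseudo-code into something runnable, the helpers map onto standard boto3 calls. A possible sketch (the API calls are real; credentials, pagination, and error handling are omitted):
# One way to implement two helpers from the pseudo-code with boto3
import boto3

ec2 = boto3.client("ec2")

def get_enis(instance_id):
    # Network interfaces come back inside the instance description
    resp = ec2.describe_instances(InstanceIds=[instance_id])
    instance = resp["Reservations"][0]["Instances"][0]
    return instance["NetworkInterfaces"]

def get_security_groups(enis):
    # Each ENI lists its groups; fetch the full rule sets in one call
    sg_ids = list({g["GroupId"] for eni in enis for g in eni["Groups"]})
    return ec2.describe_security_groups(GroupIds=sg_ids)["SecurityGroups"]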
Learning milestones:
- Tool correctly identifies blocked connections → You understand SG rule evaluation
- Tool explains WHY traffic is blocked → You can debug SG issues
- Understands SG references (sg-xxx as source) → You understand SG chaining
- Correctly handles stateful behavior → You understand return traffic
Real World Outcome
When you complete this project, you’ll have a CLI tool that saves hours of debugging time by instantly showing whether traffic can flow between any two AWS resources. Here’s exactly what the tool will do:
# Basic usage - check if web server can talk to app server
$ ./sg-debug can-connect --from i-web123 --to i-app456 --port 8080
╔══════════════════════════════════════════════════════════════════════════════╗
║ SECURITY GROUP CONNECTIVITY ANALYSIS ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ ║
║ CONNECTION: i-web123 (10.0.1.50) → i-app456 (10.0.10.100):8080/TCP ║
║ ║
║ ┌────────────────────────────────────────────────────────────────────────┐ ║
║ │ SOURCE INSTANCE: i-web123 │ ║
║ │ Name: web-server-1 │ ║
║ │ ENI: eni-0abc111111111111 │ ║
║ │ Private IP: 10.0.1.50 │ ║
║ │ Subnet: subnet-pub-a (10.0.1.0/24) - PUBLIC │ ║
║ │ Security Groups: │ ║
║ │ • sg-0web111111111111 (web-tier-sg) │ ║
║ └────────────────────────────────────────────────────────────────────────┘ ║
║ │ ║
║ ▼ ║
║ ┌────────────────────────────────────────────────────────────────────────┐ ║
║ │ OUTBOUND CHECK (sg-web-tier-sg) │ ║
║ │ │ ║
║ │ Checking rules for: TCP:8080 to 10.0.10.100 │ ║
║ │ │ ║
║ │ Rule 1: Type=All, Protocol=All, Port=All, Dest=0.0.0.0/0 │ ║
║ │ ✓ MATCH - Destination 10.0.10.100 in 0.0.0.0/0 │ ║
║ │ │ ║
║ │ RESULT: ✓ OUTBOUND ALLOWED │ ║
║ └────────────────────────────────────────────────────────────────────────┘ ║
║ │ ║
║ ▼ ║
║ ┌────────────────────────────────────────────────────────────────────────┐ ║
║ │ TARGET INSTANCE: i-app456 │ ║
║ │ Name: app-server-1 │ ║
║ │ ENI: eni-0def222222222222 │ ║
║ │ Private IP: 10.0.10.100 │ ║
║ │ Subnet: subnet-prv-a (10.0.10.0/24) - PRIVATE │ ║
║ │ Security Groups: │ ║
║ │ • sg-0app222222222222 (app-tier-sg) │ ║
║ └────────────────────────────────────────────────────────────────────────┘ ║
║ │ ║
║ ▼ ║
║ ┌────────────────────────────────────────────────────────────────────────┐ ║
║ │ INBOUND CHECK (sg-app-tier-sg) │ ║
║ │ │ ║
║ │ Checking rules for: TCP:8080 from 10.0.1.50 or sg-web-tier-sg │ ║
║ │ │ ║
║ │ Rule 1: Type=Custom TCP, Protocol=TCP, Port=443, Source=sg-alb-sg │ ║
║ │ ✗ NO MATCH - Port 443 ≠ 8080 │ ║
║ │ │ ║
║ │ Rule 2: Type=Custom TCP, Protocol=TCP, Port=22, Source=10.0.0.0/8 │ ║
║ │ ✗ NO MATCH - Port 22 ≠ 8080 │ ║
║ │ │ ║
║ │ No more rules to check. │ ║
║ │ │ ║
║ │ RESULT: ✗ INBOUND BLOCKED │ ║
║ └────────────────────────────────────────────────────────────────────────┘ ║
║ ║
║ ══════════════════════════════════════════════════════════════════════════ ║
║ ║
║ FINAL RESULT: CONNECTION BLOCKED ❌ ║
║ ║
║ BLOCKED AT: Inbound rules on sg-app-tier-sg ║
║ ║
║ RECOMMENDATIONS: ║
║ ────────────────────────────────────────────────────────────────────────── ║
║ ║
║ Option 1 (Recommended - Security Group Reference): ║
║ aws ec2 authorize-security-group-ingress \ ║
║ --group-id sg-0app222222222222 \ ║
║ --protocol tcp \ ║
║ --port 8080 \ ║
║ --source-group sg-0web111111111111 ║
║ ║
║ Option 2 (IP-based - less flexible): ║
║ aws ec2 authorize-security-group-ingress \ ║
║ --group-id sg-0app222222222222 \ ║
║ --protocol tcp \ ║
║ --port 8080 \ ║
║ --cidr 10.0.1.50/32 ║
║ ║
╚══════════════════════════════════════════════════════════════════════════════╝
# After adding the rule, verify the fix:
$ ./sg-debug can-connect --from i-web123 --to i-app456 --port 8080
FINAL RESULT: CONNECTION ALLOWED ✓
Traffic Path:
i-web123 (10.0.1.50)
→ [OUTBOUND: sg-web-tier-sg allows all]
→ i-app456 (10.0.10.100:8080)
→ [INBOUND: sg-app-tier-sg allows from sg-web-tier-sg]
Return Traffic: Auto-allowed (Security Groups are stateful)
# Advanced usage - trace all allowed connections for an instance
$ ./sg-debug list-allowed --instance i-app456 --direction inbound
╔══════════════════════════════════════════════════════════════════════════════╗
║ ALLOWED INBOUND CONNECTIONS TO i-app456 ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ ║
║ Security Group: sg-app-tier-sg ║
║ ║
║ ┌────────┬──────────┬───────────────────────────┬──────────────────────┐ ║
║ │ Port │ Protocol │ Source │ Description │ ║
║ ├────────┼──────────┼───────────────────────────┼──────────────────────┤ ║
║ │ 443 │ TCP │ sg-alb-sg (ALB) │ HTTPS from ALB │ ║
║ │ 8080 │ TCP │ sg-web-tier-sg │ API from web tier │ ║
║ │ 22 │ TCP │ 10.0.0.0/8 │ SSH from VPC │ ║
║ │ 3306 │ TCP │ sg-app-tier-sg (self) │ DB replication │ ║
║ └────────┴──────────┴───────────────────────────┴──────────────────────┘ ║
║ ║
║ POTENTIAL ISSUES DETECTED: ║
║ ⚠️ SSH (22) open to entire VPC - consider restricting to bastion only ║
║ ║
╚══════════════════════════════════════════════════════════════════════════════╝
# Find which instances can reach a specific target
$ ./sg-debug who-can-reach --target i-db789 --port 3306
╔══════════════════════════════════════════════════════════════════════════════╗
║ INSTANCES THAT CAN REACH i-db789:3306 ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ ║
║ Database instance i-db789 accepts TCP:3306 from: ║
║ ║
║ Via Security Group sg-app-tier-sg: ║
║ • i-app456 (10.0.10.100) - app-server-1 ║
║ • i-app789 (10.0.11.100) - app-server-2 ║
║ ║
║ Via CIDR 10.0.10.0/24: ║
║ • i-app456 (10.0.10.100) - app-server-1 ║
║ • i-cache123 (10.0.10.200) - redis-cache ║
║ ║
║ TOTAL: 3 unique instances can reach the database ║
║ ║
╚══════════════════════════════════════════════════════════════════════════════╝
The Core Question You’re Answering
“When traffic is blocked between two AWS resources, how do I know WHERE it’s blocked and WHY?”
This is the question every AWS engineer faces daily. Security Groups silently drop packets—there’s no “connection refused” or error message. Understanding exactly how Security Group rules are evaluated, how stateful filtering works, and how SG references resolve is crucial for debugging any connectivity issue.
Concepts You Must Understand First
Stop and research these before coding:
- Stateful vs Stateless Firewalls
- What does “stateful” actually mean in firewall terms?
- How does a stateful firewall track connections? (hint: connection table/conntrack)
- Why is return traffic automatically allowed in Security Groups?
- What’s the TCP three-way handshake and why does it matter for stateful filtering?
- Book Reference: “Computer Networks, Fifth Edition” by Tanenbaum — Ch. 8.9: Firewalls
- Security Group Rule Evaluation
- Are Security Group rules evaluated in order like NACLs?
- What happens when you have multiple Security Groups on one ENI?
- Can you have deny rules in Security Groups?
- How does “All traffic” rule (-1 protocol) work?
- Book Reference: AWS Security Groups Documentation
- Security Group References (SG-to-SG)
- What does it mean when a rule has “sg-xxx” as source instead of a CIDR?
- How does AWS resolve SG references when checking rules?
- Why is SG referencing more flexible than IP-based rules?
- What happens if you reference a SG from a different VPC?
- Book Reference: AWS Security Best Practices Whitepaper
- ENI and Security Group Association
- What is an ENI (Elastic Network Interface)?
- How many Security Groups can be attached to one ENI?
- Can different ENIs on the same instance have different Security Groups?
- How do Lambda, RDS, and ELB use ENIs and Security Groups?
- Book Reference: “AWS Certified Solutions Architect Study Guide” — Networking Chapter
- TCP/UDP and Port Ranges
- What’s the difference between source port and destination port?
- Why do you only need to specify destination port in SG rules?
- What are ephemeral ports and why don’t you need to allow them inbound?
- What does protocol “-1” mean in AWS?
- Book Reference: “TCP/IP Illustrated, Volume 1” by Stevens — Ch. 13: TCP Connection Management
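One of these concepts is easy to verify yourself before moving on. A minimal, runnable Python sketch (loopback only, no AWS required) showing why SG rules only ever name the destination port:
import socket

# A listener on an OS-assigned port stands in for "the service".
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0: let the OS choose
listener.listen(1)
dest_port = listener.getsockname()[1]

# The client never picks its own source port; the OS assigns an
# ephemeral one (on Linux, typically from 32768-60999).
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", dest_port))
src_port = client.getsockname()[1]
print(f"destination port: {dest_port}, ephemeral source port: {src_port}")

client.close()
listener.close()
Run it twice: the source port changes while the destination port stays fixed, which is why stateful Security Groups never need inbound rules for ephemeral ports.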
Questions to Guide Your Design
Before implementing, think through these:
- Data Model
- What information do you need to fetch from AWS to analyze connectivity?
- How will you represent a Security Group rule in your code?
- How will you handle multiple ENIs per instance?
- How will you resolve SG references to actual IP addresses?
- Algorithm Design
- Should you check outbound rules first or inbound rules?
- How do you determine if a rule “matches” a connection attempt?
- When multiple rules could match, which one wins?
- How do you explain WHY a connection is blocked?
- Edge Cases
- What if the source instance has multiple SGs and one allows while another doesn’t?
- What if the rule uses a SG reference that includes the source instance?
- How do you handle ICMP (ping) which doesn’t have ports?
- What about connections to/from AWS services (RDS, ElastiCache)?
- User Experience
- How should you display the results—text, JSON, visual diagram?
- Should you recommend fixes for blocked connections?
- How do you make the output actionable for someone who’s debugging?
Thinking Exercise
Before coding, trace these scenarios on paper:
Scenario 1: Web server trying to reach database
Setup:
- web-server (i-web) has sg-web attached
- db-server (i-db) has sg-db attached
- sg-web outbound: Allow all (0.0.0.0/0)
- sg-db inbound: Allow TCP 3306 from sg-app (NOT sg-web!)
Connection attempt: i-web → i-db:3306
Step 1: Check sg-web outbound rules
- Rule "Allow all" matches destination 10.0.20.50:3306
- Outbound: ALLOWED ✓
Step 2: Check sg-db inbound rules
- Rule "Allow 3306 from sg-app"
- Is i-web in sg-app? NO!
- No other rules match
- Inbound: BLOCKED ✗
Result: CONNECTION BLOCKED
Reason: sg-db only allows 3306 from sg-app, not sg-web
Fix: Either add i-web to sg-app, or add new rule allowing sg-web
Scenario 2: Understanding SG references
Setup:
- Instance A has sg-A attached (IP: 10.0.1.10)
- Instance B has sg-B attached (IP: 10.0.1.20)
- Instance C has sg-A AND sg-B attached (IP: 10.0.1.30)
- sg-target inbound: Allow TCP 443 from sg-A
Question: Which instances can reach sg-target on port 443?
Answer:
- Instance A: YES (has sg-A)
- Instance B: NO (only has sg-B)
- Instance C: YES (has sg-A, even though it also has sg-B)
The SG reference sg-A means "any ENI that has sg-A attached"
Scenario 3: Multiple Security Groups on one ENI
Setup:
- Instance has sg-1 and sg-2 attached
- sg-1 outbound: Allow TCP 443 to 0.0.0.0/0
- sg-1 outbound: (no rule for port 8080)
- sg-2 outbound: Allow TCP 8080 to 10.0.0.0/8
Connection attempt: Instance → external-api:443
- sg-1 allows it: ALLOWED ✓
Connection attempt: Instance → internal-service:8080
- sg-1 doesn't have a rule... but wait!
- sg-2 allows it: ALLOWED ✓
Key insight: Security Groups are ADDITIVE
If ANY attached SG allows the traffic, it's allowed
There's no "most restrictive wins" - it's "most permissive wins"
The Interview Questions They’ll Ask
Prepare to answer these:
- “What’s the difference between Security Groups and NACLs?”
- “Security Groups are stateful—what does that mean exactly?”
- “Can you block specific traffic with a Security Group? How?”
- “What happens when you attach multiple Security Groups to an instance?”
- “What’s the advantage of using Security Group references vs CIDR blocks?”
- “You have a connection timeout between two instances. How do you debug it?”
- “Can Security Groups span VPCs?”
- “What’s the maximum number of rules you can have in a Security Group?”
- “How do Security Groups work with Lambda functions?”
- “You allowed inbound traffic but the connection still fails. What could be wrong?”
Hints in Layers
Hint 1: Start with boto3 to fetch Security Group data
import boto3
def get_instance_security_groups(instance_id: str) -> list:
"""Get all Security Groups attached to an instance's ENIs."""
ec2 = boto3.client('ec2')
# Get instance details
response = ec2.describe_instances(InstanceIds=[instance_id])
instance = response['Reservations'][0]['Instances'][0]
# Collect SG IDs from all network interfaces
sg_ids = set()
for eni in instance.get('NetworkInterfaces', []):
for group in eni.get('Groups', []):
sg_ids.add(group['GroupId'])
# Get full SG details
sg_response = ec2.describe_security_groups(GroupIds=list(sg_ids))
return sg_response['SecurityGroups']
Hint 2: Model the rule checking logic
def check_rule_matches(rule: dict, port: int, protocol: str,
source_ip: str, source_sg_ids: list) -> bool:
"""Check if a single inbound rule allows the connection."""
# Check protocol (-1 means all protocols)
rule_protocol = rule.get('IpProtocol', '-1')
if rule_protocol != '-1' and rule_protocol != protocol:
return False
# Check port range (for TCP/UDP)
if protocol in ['tcp', 'udp', '6', '17']:
from_port = rule.get('FromPort', 0)
to_port = rule.get('ToPort', 65535)
if not (from_port <= port <= to_port):
return False
# Check source - could be CIDR or SG reference
# Check IP ranges
for ip_range in rule.get('IpRanges', []):
cidr = ip_range.get('CidrIp')
if ip_in_cidr(source_ip, cidr):
return True
# Check SG references
for sg_ref in rule.get('UserIdGroupPairs', []):
if sg_ref.get('GroupId') in source_sg_ids:
return True
return False
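check_rule_matches relies on an ip_in_cidr helper that isn't shown above. The standard-library ipaddress module covers it in a few lines:
import ipaddress

def ip_in_cidr(ip: str, cidr: str) -> bool:
    """True if ip falls inside cidr, e.g. 10.0.1.50 in 10.0.0.0/16."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr, strict=False)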
Hint 3: Implement the full connectivity check
def can_connect(source_instance: str, target_instance: str,
port: int, protocol: str = 'tcp') -> dict:
"""
Check if source can connect to target on specified port.
Returns detailed analysis of the path.
"""
result = {
'allowed': False,
'source': get_instance_info(source_instance),
'target': get_instance_info(target_instance),
'outbound_check': None,
'inbound_check': None,
'blocked_at': None,
'recommendation': None
}
# Get SGs for both instances
source_sgs = get_instance_security_groups(source_instance)
target_sgs = get_instance_security_groups(target_instance)
source_sg_ids = [sg['GroupId'] for sg in source_sgs]
target_ip = result['target']['private_ip']
source_ip = result['source']['private_ip']
# Check outbound from source (any SG allowing = pass)
outbound_allowed = False
for sg in source_sgs:
for rule in sg.get('IpPermissionsEgress', []):
if check_outbound_rule_matches(rule, port, protocol, target_ip):
outbound_allowed = True
result['outbound_check'] = {
'allowed': True,
'matched_sg': sg['GroupId'],
'matched_rule': rule
}
break
if outbound_allowed:
break
if not outbound_allowed:
result['blocked_at'] = 'outbound'
result['recommendation'] = generate_outbound_fix(source_sgs[0], port, protocol)
return result
# Check inbound to target
inbound_allowed = False
for sg in target_sgs:
for rule in sg.get('IpPermissions', []):
if check_rule_matches(rule, port, protocol, source_ip, source_sg_ids):
inbound_allowed = True
result['inbound_check'] = {
'allowed': True,
'matched_sg': sg['GroupId'],
'matched_rule': rule
}
break
if inbound_allowed:
break
if not inbound_allowed:
result['blocked_at'] = 'inbound'
result['recommendation'] = generate_inbound_fix(
target_sgs[0], port, protocol, source_sg_ids[0]
)
return result
result['allowed'] = True
return result
Hint 4: Generate actionable fix recommendations
def generate_inbound_fix(target_sg: dict, port: int,
protocol: str, source_sg_id: str) -> str:
"""Generate AWS CLI command to fix blocked inbound traffic."""
return f"""
To allow this connection, run:
aws ec2 authorize-security-group-ingress \\
--group-id {target_sg['GroupId']} \\
--protocol {protocol} \\
--port {port} \\
--source-group {source_sg_id}
Or in Terraform:
resource "aws_security_group_rule" "allow_from_source" {{
type = "ingress"
from_port = {port}
to_port = {port}
protocol = "{protocol}"
source_security_group_id = "{source_sg_id}"
security_group_id = "{target_sg['GroupId']}"
}}
"""
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Firewall concepts (stateful/stateless) | “Computer Networks, Fifth Edition” by Tanenbaum | Ch. 8.9: Firewalls |
| TCP connection states | “TCP/IP Illustrated, Volume 1” by Stevens | Ch. 13: TCP Connection Management |
| AWS Security Groups deep dive | “AWS Certified Security Specialty Study Guide” | Security Groups Chapter |
| Network security monitoring | “The Practice of Network Security Monitoring” by Bejtlich | Ch. 2-4 |
| Python AWS SDK (boto3) | “Python for DevOps” by Gift, Behrman | AWS Chapter |
| CLI tool design | “The Linux Command Line” by Shotts | Ch. 25-27: Shell Scripting |
Project 3: VPC Flow Logs Analyzer
- File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Networking / Security Analysis
- Software or Tool: AWS VPC Flow Logs, S3, Athena
- Main Book: “The Practice of Network Security Monitoring” by Richard Bejtlich
What you’ll build: A system that ingests VPC Flow Logs, parses them, and provides real-time visibility into network traffic patterns, anomaly detection, and security insights.
Why it teaches AWS networking: VPC Flow Logs are how you SEE what’s actually happening on the network. Understanding them teaches you about IP flows, connection states, and how to detect both problems and attacks.
Core challenges you’ll face:
- Parsing flow log format (version 2+ with custom fields) → maps to log parsing
- Understanding flow states (ACCEPT, REJECT, NODATA) → maps to connection tracking
- Correlating ENIs to resources (which instance is eni-xxx?) → maps to AWS metadata
- Detecting anomalies (port scans, unusual traffic) → maps to security monitoring
- Handling volume (millions of records/hour) → maps to data engineering
Key Concepts:
- VPC Flow Logs Format: AWS VPC Flow Logs Documentation
- Network Traffic Analysis: “The Practice of Network Security Monitoring” Chapter 5 - Richard Bejtlich
- Athena Queries: Querying VPC Flow Logs with Athena
Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Python, SQL, understanding of network protocols
Real world outcome:
$ ./flow-analyzer dashboard
╔════════════════════════════════════════════════════════════════╗
║ VPC FLOW LOGS DASHBOARD (Last 24 hours) ║
╠════════════════════════════════════════════════════════════════╣
║ ║
║ TRAFFIC SUMMARY ║
║ ─────────────────────────────────────────────────────────────║
║ Total Flows: 2,456,789 ║
║ Accepted: 2,401,234 (97.7%) ║
║ Rejected: 55,555 (2.3%) ║
║ Data Transferred: 1.2 TB ║
║ ║
║ TOP TALKERS (by bytes) ║
║ ─────────────────────────────────────────────────────────────║
║ 1. i-abc123 (web-server-1) → 10.0.0.0/8: 245 GB ║
║ 2. i-def456 (db-primary) → 10.0.1.0/24: 189 GB ║
║ 3. i-ghi789 (app-server) → 0.0.0.0/0: 156 GB ║
║ ║
║ ⚠️ SECURITY ALERTS ║
║ ─────────────────────────────────────────────────────────────║
║ 🔴 CRITICAL: Port scan detected ║
║ Source: 203.0.113.50 (external) ║
║ Target: 10.0.1.0/24 (private subnet) ║
║ Ports scanned: 22, 23, 80, 443, 3306, 5432 ║
║ Recommendation: Block 203.0.113.50 in NACL ║
║ ║
║ 🟡 WARNING: Unusual outbound traffic ║
║ Source: i-xyz789 (10.0.2.50) ║
║ Destination: 185.143.223.x (known C2 server) ║
║ Bytes: 2.3 GB over 4 hours ║
║ Recommendation: Isolate instance, investigate ║
║ ║
║ REJECTED CONNECTIONS (Top Sources) ║
║ ─────────────────────────────────────────────────────────────║
║ 1. 192.168.1.50:* → 10.0.1.100:22 (SSH blocked) - 12,345 ║
║ 2. 10.0.1.50:* → 10.0.2.100:3306 (DB not allowed) - 5,432 ║
║ ║
╚════════════════════════════════════════════════════════════════╝
$ ./flow-analyzer query "rejected traffic to port 22 in last hour"
Found 1,234 rejected flows to port 22:
- 85% from external IPs (likely SSH brute force attempts)
- 15% from internal IPs (misconfigured Security Groups)
Top external sources:
185.143.223.x: 456 attempts (known bad IP - block in NACL)
192.168.1.x: 234 attempts (RFC1918 - spoofed, drop at edge)
Implementation Hints: Flow log record format (v2):
version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
2 123456789012 eni-abc123 10.0.1.50 10.0.2.100 49152 443 6 25 1234 1639489200 1639489260 ACCEPT OK
Processing pipeline:
# Pseudo-code
def process_flow_logs(s3_bucket, prefix):
# Read from S3 (or Kinesis for real-time)
for log_file in list_s3_objects(s3_bucket, prefix):
records = parse_flow_log_file(log_file)
for record in records:
# Enrich with metadata
record.source_instance = lookup_eni(record.interface_id)
record.geo = geoip_lookup(record.srcaddr) if is_public(record.srcaddr) else None
# Store in database (TimescaleDB, ClickHouse, etc.)
insert_record(record)
# Real-time anomaly detection
if is_port_scan(record):
alert("Port scan detected", record)
if is_known_bad_ip(record.srcaddr):
alert("Connection from known malicious IP", record)
def is_port_scan(record, window_seconds=60, port_threshold=10):
# Check if same source hit many ports in short time
recent_ports = query("""
SELECT DISTINCT dstport FROM flows
WHERE srcaddr = ? AND time > now() - interval ?
""", record.srcaddr, window_seconds)
return len(recent_ports) > port_threshold
Learning milestones:
- Parse and store flow logs efficiently → You understand the format
- Identify traffic patterns → You can analyze network behavior
- Detect security anomalies → You understand attack patterns
- Correlate with AWS resources → You connect network to infrastructure
Real World Outcome
When you complete this project, you’ll have a powerful network visibility tool that transforms raw VPC Flow Logs into actionable security intelligence. Here’s exactly what your tool will do:
# Start the flow analyzer daemon (processes logs from S3 in real-time)
$ ./flow-analyzer start --s3-bucket vpc-flow-logs-prod --region us-east-1
[2024-12-22 14:00:00] Flow Analyzer started
[2024-12-22 14:00:01] Connected to S3: vpc-flow-logs-prod
[2024-12-22 14:00:01] Loaded 156 ENI → Instance mappings
[2024-12-22 14:00:02] Processing backlog: 2,456 log files
[2024-12-22 14:00:15] Backlog processed: 12,456,789 flow records
[2024-12-22 14:00:15] Real-time processing active...
# View the live dashboard
$ ./flow-analyzer dashboard --refresh 5s
╔══════════════════════════════════════════════════════════════════════════════╗
║ VPC FLOW LOGS DASHBOARD ║
║ Last updated: 2024-12-22 14:32:15 UTC ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ ║
║ TRAFFIC SUMMARY (Last 24 Hours) ║
║ ────────────────────────────────────────────────────────────────────────── ║
║ ║
║ ┌─────────────────┬────────────────┬─────────────────────────────────────┐ ║
║ │ Metric │ Value │ Graph (24h) │ ║
║ ├─────────────────┼────────────────┼─────────────────────────────────────┤ ║
║ │ Total Flows │ 2,456,789 │ ▂▃▄▅▆▇█▇▆▅▄▅▆▇█▇▆▅▄▃▂▃▄▅ │ ║
║ │ Accepted │ 2,401,234 │ 97.7% ████████████████████░░ │ ║
║ │ Rejected │ 55,555 │ 2.3% █░░░░░░░░░░░░░░░░░░░░░ │ ║
║ │ Data In │ 892 GB │ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆ │ ║
║ │ Data Out │ 1.2 TB │ ▃▄▅▆▇█▇▆▅▄▃▂▃▄▅▆▇█▇▆▅▄▃▂ │ ║
║ │ Unique Sources │ 12,456 │ │ ║
║ │ Unique Dests │ 8,234 │ │ ║
║ └─────────────────┴────────────────┴─────────────────────────────────────┘ ║
║ ║
║ TOP TALKERS (by bytes transferred) ║
║ ────────────────────────────────────────────────────────────────────────── ║
║ ║
║ ┌────┬────────────────────────────┬──────────────┬────────────────────────┐║
║ │ # │ Source │ Bytes │ Top Destination │║
║ ├────┼────────────────────────────┼──────────────┼────────────────────────┤║
║ │ 1 │ i-abc123 (web-server-1) │ 245.6 GB │ ALB (internal) │║
║ │ 2 │ i-def456 (db-primary) │ 189.2 GB │ db-replica (10.0.11.x) │║
║ │ 3 │ i-ghi789 (app-server-1) │ 156.8 GB │ S3 (via endpoint) │║
║ │ 4 │ i-jkl012 (batch-worker) │ 98.4 GB │ External APIs │║
║ │ 5 │ i-mno345 (cache-server) │ 67.2 GB │ app-servers │║
║ └────┴────────────────────────────┴──────────────┴────────────────────────┘║
║ ║
║ 🚨 SECURITY ALERTS ║
║ ────────────────────────────────────────────────────────────────────────── ║
║ ║
║ 🔴 CRITICAL [14:28:32] Port Scan Detected ║
║ ┌──────────────────────────────────────────────────────────────────────┐║
║ │ Source: 203.0.113.50 (external - Tor exit node) │║
║ │ Target: 10.0.1.0/24 (public subnet) │║
║ │ Ports scanned: 22, 23, 80, 443, 3306, 5432, 6379, 27017 (15 total) │║
║ │ Duration: 45 seconds │║
║ │ Status: All REJECTED by Security Groups │║
║ │ │║
║ │ RECOMMENDED ACTION: │║
║ │ aws ec2 create-network-acl-entry --network-acl-id acl-xxx \ │║
║ │ --rule-number 50 --protocol -1 --cidr-block 203.0.113.50/32 \ │║
║ │ --egress false --rule-action deny │║
║ └──────────────────────────────────────────────────────────────────────┘║
║ ║
║ 🟡 WARNING [14:15:22] Unusual Outbound Traffic ║
║ ┌──────────────────────────────────────────────────────────────────────┐║
║ │ Source: i-xyz789 (10.0.2.50) - Name: batch-processor-3 │║
║ │ Destination: 185.143.223.100 (external) │║
║ │ GeoIP: Russia, Moscow │║
║ │ Threat Intel: Listed in AbuseIPDB (C2 server suspected) │║
║ │ Bytes transferred: 2.3 GB over 4 hours │║
║ │ Pattern: Large outbound, minimal inbound (data exfiltration?) │║
║ │ │║
║ │ RECOMMENDED ACTIONS: │║
║ │ 1. Isolate instance: aws ec2 modify-instance-attribute \ │║
║ │ --instance-id i-xyz789 --groups sg-isolated │║
║ │ 2. Create forensic snapshot before termination │║
║ │ 3. Review CloudTrail for instance compromise indicators │║
║ └──────────────────────────────────────────────────────────────────────┘║
║ ║
║ 🟢 INFO [13:45:00] New Communication Path Detected ║
║ i-web123 (web-tier) → i-cache456 (redis) on port 6379 ║
║ First seen: 2024-12-22 13:45:00 (new deployment?) ║
║ ║
╚══════════════════════════════════════════════════════════════════════════════╝
# Query specific traffic patterns
$ ./flow-analyzer query --sql "
SELECT srcaddr, COUNT(*) as attempts, COUNT(DISTINCT dstport) as ports
FROM flows
WHERE action = 'REJECT'
AND dstport IN (22, 23, 3389)
AND start_time > now() - interval '1 hour'
GROUP BY srcaddr
HAVING COUNT(DISTINCT dstport) > 2
ORDER BY attempts DESC
LIMIT 10
"
┌─────────────────┬──────────┬───────┐
│ srcaddr │ attempts │ ports │
├─────────────────┼──────────┼───────┤
│ 185.143.223.x │ 1,456 │ 3 │
│ 203.0.113.50 │ 892 │ 3 │
│ 192.168.1.100 │ 234 │ 2 │ ← Internal! Misconfigured?
│ 45.33.32.156 │ 189 │ 3 │
└─────────────────┴──────────┴───────┘
# Generate security report
$ ./flow-analyzer report --format pdf --period "last 7 days" --output weekly-report.pdf
Generated: weekly-report.pdf
Contents:
- Executive Summary
- Traffic Volume Trends
- Top Talkers Analysis
- Security Incidents (12 alerts)
- Rejected Traffic Analysis
- Recommendations
- Appendix: Raw Data
# Export to SIEM
$ ./flow-analyzer export --format splunk --dest "https://splunk.company.com:8088/services/collector"
Exported 2,456,789 records to Splunk
Architecture You’ll Build:
┌─────────────────────────────────────────────────────────────────────────────┐
│ VPC FLOW LOGS ANALYZER │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────────┐ │
│ │ VPC │ │ S3 │ │ Flow Analyzer │ │
│ │ │ │ Bucket │ │ │ │
│ │ ┌───────┐ │ │ │ │ ┌──────────────────────┐ │ │
│ │ │ ENI │──┼──────┼─► Flow Logs ├──────┼──► Log Parser │ │ │
│ │ └───────┘ │ │ │ │ └──────────┬───────────┘ │ │
│ │ ┌───────┐ │ │ │ │ │ │ │
│ │ │ ENI │──┼──────┤ │ │ ┌──────────▼───────────┐ │ │
│ │ └───────┘ │ │ │ │ │ Enrichment Engine │ │ │
│ │ ┌───────┐ │ │ │ │ │ - ENI → Instance │ │ │
│ │ │ ENI │──┼──────┤ │ │ │ - GeoIP lookup │ │ │
│ │ └───────┘ │ │ │ │ │ - Threat Intel │ │ │
│ │ │ └─────────────┘ │ └──────────┬───────────┘ │ │
│ └─────────────┘ │ │ │ │
│ │ ┌──────────▼───────────┐ │ │
│ │ │ Analytics Engine │ │ │
│ │ │ - Anomaly detection │ │ │
│ │ │ - Pattern matching │ │ │
│ │ │ - Alerting │ │ │
│ │ └──────────┬───────────┘ │ │
│ │ │ │ │
│ │ ┌──────────▼───────────┐ │ │
│ │ │ Storage (TimescaleDB │ │ │
│ │ │ or ClickHouse) │ │ │
│ │ └──────────┬───────────┘ │ │
│ │ │ │ │
│ │ ┌──────────▼───────────┐ │ │
│ │ │ CLI / Dashboard │ │ │
│ │ │ - Real-time view │ │ │
│ │ │ - SQL queries │ │ │
│ │ │ - Reports │ │ │
│ │ └──────────────────────┘ │ │
│ └─────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
The Core Question You’re Answering
“What traffic is actually flowing through my VPC, and how do I detect problems and attacks?”
VPC Flow Logs are your eyes into the network. Without them, you’re blind—you don’t know what’s communicating with what, whether traffic is being blocked, or if there’s a security incident. This project teaches you to transform raw metadata into actionable intelligence.
Concepts You Must Understand First
Stop and research these before coding:
- VPC Flow Log Record Format
- What are the default fields in v2 flow logs?
- What additional fields can you add in v3+?
- What does each field actually represent?
- Why are there “NODATA” and “SKIPDATA” entries?
- Book Reference: AWS VPC Flow Logs Documentation
- Network Protocol Numbers
- What does protocol “6” mean? (TCP) Protocol “17”? (UDP) Protocol “1”? (ICMP)
- Where do these numbers come from? (IANA protocol numbers)
- Why is this important for analyzing traffic?
- Book Reference: “TCP/IP Illustrated, Volume 1” by Stevens — Ch. 1: Introduction
- Flow vs Packet
- What’s the difference between a flow and a packet?
- Why does AWS capture flows instead of packets?
- How long does a flow represent? (aggregation window)
- Book Reference: “The Practice of Network Security Monitoring” by Bejtlich — Ch. 5: Flow Data
- Security Attack Patterns
- What does a port scan look like in flow logs?
- How do you detect brute force attacks?
- What indicates data exfiltration?
- What’s a C2 (Command & Control) communication pattern?
- Book Reference: “The Practice of Network Security Monitoring” by Bejtlich — Ch. 8: Analysis Techniques
- ENI (Elastic Network Interface) Architecture
- What is an ENI and how does it relate to instances?
- Why do flow logs capture at the ENI level, not instance level?
- How do you map ENI to instance/resource?
- Book Reference: AWS Documentation on ENIs
Questions to Guide Your Design
Before implementing, think through these:
- Data Ingestion
- Where should flow logs be delivered—S3, CloudWatch Logs, or Kinesis?
- How do you handle the delay between event and log availability?
- How do you process backlog when starting up?
- What’s your strategy for handling millions of records per hour?
- Data Storage
- What database is best for time-series flow data? (TimescaleDB, ClickHouse, etc.)
- How do you partition data for efficient queries?
- How long should you retain data?
- How do you handle storage costs at scale?
- Enrichment
- How do you map ENI IDs to instance names?
- Should you do GeoIP lookups? Performance implications?
- How do you integrate threat intelligence feeds?
- How often should you refresh ENI mappings?
- Detection Logic
- What thresholds define a “port scan”?
- How do you distinguish attack traffic from legitimate scanning?
- What’s “unusual” outbound traffic?
- How do you avoid alert fatigue?
- Output & Alerting
- How should alerts be delivered—Slack, PagerDuty, email?
- What severity levels should you use?
- How do you make the dashboard useful for both security and operations?
Thinking Exercise
Before coding, analyze these flow log samples:
Sample 1: Normal Web Traffic
2 123456789012 eni-web123 203.0.113.50 10.0.1.100 52341 443 6 25 15000 1639489200 1639489260 ACCEPT OK
2 123456789012 eni-web123 10.0.1.100 203.0.113.50 443 52341 6 30 125000 1639489200 1639489260 ACCEPT OK
Questions:
- Which direction is the client → server traffic?
- How many bytes did the server send vs receive?
- What service is being accessed?
Sample 2: Port Scan
2 123456789012 eni-web123 185.143.223.x 10.0.1.100 45123 22 6 1 40 1639489200 1639489201 REJECT OK
2 123456789012 eni-web123 185.143.223.x 10.0.1.100 45124 23 6 1 40 1639489201 1639489202 REJECT OK
2 123456789012 eni-web123 185.143.223.x 10.0.1.100 45125 80 6 1 40 1639489202 1639489203 REJECT OK
2 123456789012 eni-web123 185.143.223.x 10.0.1.100 45126 443 6 1 40 1639489203 1639489204 REJECT OK
2 123456789012 eni-web123 185.143.223.x 10.0.1.100 45127 3306 6 1 40 1639489204 1639489205 REJECT OK
Questions:
- How do you know this is a port scan?
- What’s the scan rate (ports per second)?
- Why are all actions REJECT?
- What NACL rule would block this?
Sample 3: Possible Data Exfiltration
2 123456789012 eni-app456 10.0.10.50 185.143.223.100 49152 443 6 50000 52428800 1639400000 1639486400 ACCEPT OK
2 123456789012 eni-app456 185.143.223.100 10.0.10.50 443 49152 6 1000 50000 1639400000 1639486400 ACCEPT OK
Questions:
- How much data was sent outbound? (52428800 bytes = 50 MB)
- Over what time period? (86400 seconds = 24 hours)
- Why is the ratio suspicious? (50 MB out, 50 KB in)
- What’s the destination? (External IP—needs investigation)
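The suspicious ratio in Sample 3 is easy to score programmatically. A hedged sketch with illustrative thresholds (tune them to your own traffic baseline):
def looks_like_exfiltration(bytes_out: int, bytes_in: int,
                            min_bytes_out: int = 10_000_000) -> bool:
    """Flag flows that send far more data out than they receive."""
    if bytes_out < min_bytes_out:
        return False                    # too small to matter
    ratio = bytes_out / max(bytes_in, 1)
    return ratio > 100                  # heavily outbound-skewed

# Sample 3 numbers: 52,428,800 bytes out vs 50,000 bytes in
print(looks_like_exfiltration(52_428_800, 50_000))  # True -> investigate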
The Interview Questions They’ll Ask
Prepare to answer these:
- “What are VPC Flow Logs and what do they capture?”
- “What’s the difference between ACCEPT and REJECT in flow logs?”
- “How would you detect a port scan using flow logs?”
- “What are the limitations of VPC Flow Logs?” (no payload captured; traffic to the Amazon DNS resolver isn’t logged)
- “How would you set up flow logs for compliance/audit requirements?”
- “What’s the performance impact of enabling flow logs?”
- “How do you analyze flow logs at scale?”
- “What’s the difference between flow logs sent to S3 vs CloudWatch Logs?”
- “How would you use flow logs to troubleshoot a connectivity issue?”
- “What security threats can you detect with flow logs?”
Hints in Layers
Hint 1: Parse flow log records efficiently
from dataclasses import dataclass
from typing import Optional
import ipaddress
@dataclass
class FlowRecord:
version: int
account_id: str
interface_id: str
srcaddr: str
dstaddr: str
srcport: int
dstport: int
protocol: int
packets: int
bytes: int
start: int
end: int
action: str
log_status: str
@classmethod
def from_line(cls, line: str) -> Optional['FlowRecord']:
"""Parse a single flow log line."""
parts = line.strip().split()
if len(parts) < 14 or parts[0] != '2': # Version 2
return None
return cls(
version=int(parts[0]),
account_id=parts[1],
interface_id=parts[2],
srcaddr=parts[3],
dstaddr=parts[4],
srcport=int(parts[5]) if parts[5] != '-' else 0,
dstport=int(parts[6]) if parts[6] != '-' else 0,
protocol=int(parts[7]) if parts[7] != '-' else 0,
packets=int(parts[8]) if parts[8] != '-' else 0,
bytes=int(parts[9]) if parts[9] != '-' else 0,
start=int(parts[10]) if parts[10] != '-' else 0,
end=int(parts[11]) if parts[11] != '-' else 0,
action=parts[12],
log_status=parts[13]
)
def is_external_source(self) -> bool:
"""Check if source IP is external (not RFC1918)."""
try:
ip = ipaddress.ip_address(self.srcaddr)
return not ip.is_private
except ValueError:
return False
def protocol_name(self) -> str:
"""Convert protocol number to name."""
protocols = {1: 'ICMP', 6: 'TCP', 17: 'UDP'}
return protocols.get(self.protocol, str(self.protocol))
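To sanity-check the parser, feed it the sample v2 record from earlier in this project:
sample = ("2 123456789012 eni-abc123 10.0.1.50 10.0.2.100 "
          "49152 443 6 25 1234 1639489200 1639489260 ACCEPT OK")
rec = FlowRecord.from_line(sample)
print(rec.protocol_name())        # TCP
print(rec.bytes)                  # 1234
print(rec.is_external_source())   # False (10.0.1.50 is RFC1918)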
Hint 2: Build the ENI-to-Instance mapping
import boto3
class ENIMapper:
def __init__(self):
self.ec2 = boto3.client('ec2')
self._cache = {}
def refresh_cache(self):
"""Refresh ENI to instance mapping."""
self._cache = {}
# Get all ENIs
paginator = self.ec2.get_paginator('describe_network_interfaces')
for page in paginator.paginate():
for eni in page['NetworkInterfaces']:
eni_id = eni['NetworkInterfaceId']
attachment = eni.get('Attachment', {})
self._cache[eni_id] = {
'instance_id': attachment.get('InstanceId'),
'private_ip': eni.get('PrivateIpAddress'),
'vpc_id': eni.get('VpcId'),
'subnet_id': eni.get('SubnetId'),
'description': eni.get('Description', '')
}
# Enrich with instance names
instance_ids = [v['instance_id'] for v in self._cache.values()
if v['instance_id']]
if instance_ids:
instances = self.ec2.describe_instances(InstanceIds=instance_ids)
for reservation in instances['Reservations']:
for instance in reservation['Instances']:
name = next(
(tag['Value'] for tag in instance.get('Tags', [])
if tag['Key'] == 'Name'),
instance['InstanceId']
)
# Update all ENIs for this instance
for eni_id, data in self._cache.items():
if data['instance_id'] == instance['InstanceId']:
data['instance_name'] = name
def lookup(self, eni_id: str) -> dict:
"""Look up ENI details."""
return self._cache.get(eni_id, {'unknown': True})
Hint 3: Implement port scan detection
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Optional
class PortScanDetector:
def __init__(self, window_seconds=60, port_threshold=10):
self.window_seconds = window_seconds
self.port_threshold = port_threshold
# {source_ip: [(timestamp, port), ...]}
self.activity = defaultdict(list)
def check(self, record: FlowRecord) -> Optional[dict]:
"""Check if this record indicates a port scan."""
if record.action != 'REJECT':
return None # Only care about rejected (probing) traffic
now = datetime.utcnow()
source = record.srcaddr
# Add this activity
self.activity[source].append((now, record.dstport))
# Clean old activity
cutoff = now - timedelta(seconds=self.window_seconds)
self.activity[source] = [
(ts, port) for ts, port in self.activity[source]
if ts > cutoff
]
# Check for port scan pattern
recent_ports = set(port for _, port in self.activity[source])
if len(recent_ports) >= self.port_threshold:
return {
'type': 'PORT_SCAN',
'severity': 'HIGH',
'source_ip': source,
'ports_scanned': sorted(recent_ports),
'window_seconds': self.window_seconds,
'recommendation': f"Block {source} in NACL"
}
return None
Hint 4: Create the analytics dashboard
from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from rich.live import Live
class Dashboard:
    def __init__(self, db, eni_mapper):
        self.db = db
        self.eni_mapper = eni_mapper  # ENIMapper from Hint 2
        self.console = Console()
def render(self):
"""Render the dashboard."""
# Traffic summary
summary = self.db.query("""
SELECT
COUNT(*) as total_flows,
SUM(CASE WHEN action = 'ACCEPT' THEN 1 ELSE 0 END) as accepted,
SUM(CASE WHEN action = 'REJECT' THEN 1 ELSE 0 END) as rejected,
SUM(bytes) as total_bytes
FROM flows
WHERE start_time > now() - interval '24 hours'
""").fetchone()
# Top talkers
top_talkers = self.db.query("""
SELECT
srcaddr,
SUM(bytes) as total_bytes,
COUNT(DISTINCT dstaddr) as unique_destinations
FROM flows
WHERE action = 'ACCEPT'
AND start_time > now() - interval '24 hours'
GROUP BY srcaddr
ORDER BY total_bytes DESC
LIMIT 5
""").fetchall()
# Build display
table = Table(title="Top Talkers (24h)")
table.add_column("Source IP")
table.add_column("Bytes", justify="right")
table.add_column("Destinations", justify="right")
for row in top_talkers:
# Enrich with instance name
eni_info = self.eni_mapper.lookup_by_ip(row['srcaddr'])
name = eni_info.get('instance_name', row['srcaddr'])
table.add_row(
name,
format_bytes(row['total_bytes']),
str(row['unique_destinations'])
)
self.console.print(table)
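The dashboard sketch above calls two helpers that aren't defined in these hints. Minimal illustrative versions (lookup_by_ip is a reverse-lookup method you'd add to the ENIMapper from Hint 2):
def format_bytes(n: float) -> str:
    """Human-readable byte count, e.g. 263000000000 -> '244.9 GB'."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024 or unit == "TB":
            return f"{n:.1f} {unit}"
        n /= 1024

# Method to add to ENIMapper (Hint 2):
    def lookup_by_ip(self, ip: str) -> dict:
        """Reverse lookup: private IP -> ENI details."""
        for data in self._cache.values():
            if data.get("private_ip") == ip:
                return data
        return {"unknown": True}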
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Network flow analysis | “The Practice of Network Security Monitoring” by Bejtlich | Ch. 5: Flow Data, Ch. 8: Analysis |
| TCP/IP fundamentals | “TCP/IP Illustrated, Volume 1” by Stevens | Ch. 1-4: Protocol basics |
| Security monitoring | “Applied Network Security Monitoring” by Sanders | Ch. 5-7: Collection and Analysis |
| Time-series databases | “Designing Data-Intensive Applications” by Kleppmann | Ch. 3: Storage and Retrieval |
| Python data processing | “Data Engineering with Python” by Reis | Ch. 4-6: Pipelines |
| AWS networking | “AWS Certified Advanced Networking Study Guide” | VPC Flow Logs section |
Project 4: VPC Peering vs Transit Gateway Lab
- File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
- Main Programming Language: Terraform
- Alternative Programming Languages: AWS CDK, CloudFormation
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Cloud Architecture / Networking
- Software or Tool: AWS VPC Peering, Transit Gateway
- Main Book: “AWS Certified Advanced Networking Study Guide”
What you’ll build: Two parallel network architectures—one using VPC Peering (full mesh) and one using Transit Gateway (hub-and-spoke)—with performance tests and cost analysis to understand when to use each.
Why it teaches AWS networking: The VPC Peering vs Transit Gateway decision is one of the most important architectural choices. This project lets you experience both, measure the differences, and make informed decisions.
Core challenges you’ll face:
- Full mesh complexity (n VPCs = n(n-1)/2 peering connections) → maps to scaling limits
- Non-transitive routing (VPC A↔B↔C doesn’t mean A↔C) → maps to routing behavior
- TGW route table design (attachment associations, propagations) → maps to hub routing
- Latency measurement (extra hop in TGW) → maps to performance tradeoffs
- Cost analysis (TGW hourly + data processing vs peering data transfer) → maps to FinOps
Key Concepts:
- VPC Peering: AWS VPC Peering Guide
- Transit Gateway: AWS Transit Gateway Documentation
- Architecture Decision: VPC Peering vs Transit Gateway Comparison
Difficulty: Advanced Time estimate: 2 weeks Prerequisites: VPC fundamentals, Terraform
Real world outcome:
$ terraform apply -var="architecture=peering"
Creating VPC Peering architecture (5 VPCs, full mesh)...
VPC-A ↔ VPC-B: peering-001
VPC-A ↔ VPC-C: peering-002
VPC-A ↔ VPC-D: peering-003
VPC-A ↔ VPC-E: peering-004
VPC-B ↔ VPC-C: peering-005
... (10 peering connections total)
Route tables: 4 peering routes per VPC
$ terraform apply -var="architecture=tgw"
Creating Transit Gateway architecture (5 VPCs, hub-and-spoke)...
Transit Gateway: tgw-0abc123
VPC-A attachment: tgw-attach-001
VPC-B attachment: tgw-attach-002
... (5 attachments total)
Route tables: 1 route per VPC (0.0.0.0/0 → TGW)
$ ./network-benchmark
╔══════════════════════════════════════════════════════════════╗
║ VPC PEERING vs TRANSIT GATEWAY COMPARISON ║
╠══════════════════════════════════════════════════════════════╣
║ ║
║ CONFIGURATION COMPLEXITY ║
║ ────────────────────────────────────────────────────────── ║
║ Peering (5 VPCs): 10 connections, 20 route entries ║
║ TGW (5 VPCs): 1 TGW, 5 attachments, 5 route entries ║
║ ║
║ LATENCY (VPC-A to VPC-E, avg of 1000 pings) ║
║ ────────────────────────────────────────────────────────── ║
║ VPC Peering: 0.3ms (direct connection) ║
║ Transit Gateway: 0.5ms (+0.2ms for TGW hop) ║
║ ║
║ BANDWIDTH (iperf3, same AZ) ║
║ ────────────────────────────────────────────────────────── ║
║ VPC Peering: ~unlimited (AWS backbone) ║
║ Transit Gateway: up to 100 Gbps per attachment ║
║ ║
║ MONTHLY COST (5 VPCs, 1 TB cross-VPC traffic) ║
║ ────────────────────────────────────────────────────────── ║
║ VPC Peering: $10.00 (data transfer only) ║
║ Transit Gateway: $202.50 (5 × $36.50/attachment + $0.02/GB) ║
║ ║
║ SCALING TO 50 VPCs ║
║ ────────────────────────────────────────────────────────── ║
║ Peering connections: 1,225 (50*49/2) - UNMANAGEABLE! ║
║ TGW attachments: 50 - easy to manage ║
║ ║
╚══════════════════════════════════════════════════════════════╝
RECOMMENDATION:
- < 10 VPCs, low traffic: VPC Peering (cost-effective)
- > 10 VPCs or hybrid connectivity: Transit Gateway (manageable)
- Latency-critical applications: VPC Peering (direct path)
Implementation Hints: VPC Peering full mesh calculation:
For n VPCs:
Peering connections = n * (n-1) / 2
Route entries per VPC = n - 1
5 VPCs: 10 connections, 4 routes each
10 VPCs: 45 connections, 9 routes each
50 VPCs: 1,225 connections, 49 routes each (NIGHTMARE!)
Transit Gateway scales linearly:
For n VPCs:
TGW attachments = n
Route entries per VPC = 1 (point to TGW)
5 VPCs: 5 attachments
50 VPCs: 50 attachments
500 VPCs: 500 attachments (TGW supports 5,000)
Key insight about transitivity:
VPC Peering: A ↔ B, B ↔ C does NOT mean A ↔ C
Transit Gateway: A → TGW → C works automatically
# This is why peering becomes unmanageable at scale
# Every VPC needs direct peering to every other VPC
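The scaling math behind these numbers takes only a few lines to verify:
# Full-mesh peering grows quadratically; TGW attachments grow linearly.
def peering_connections(n: int) -> int:
    return n * (n - 1) // 2

for n in (5, 10, 50):
    print(f"{n} VPCs: {peering_connections(n)} peering connections "
          f"vs {n} TGW attachments")
# 5 VPCs: 10 | 10 VPCs: 45 | 50 VPCs: 1225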
Learning milestones:
- Both architectures deploy successfully → You understand the components
- Traffic flows in both designs → You understand routing
- Measure latency difference → You understand the TGW hop
- Calculate cost breakeven point → You can make business decisions
Project 5: NAT Gateway Deep Dive & Cost Optimizer
- File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
- Main Programming Language: Python
- Alternative Programming Languages: Go, Bash
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Cloud Networking / FinOps
- Software or Tool: AWS NAT Gateway, VPC Endpoints
- Main Book: “Cloud FinOps” by J.R. Storment
What you’ll build: A tool that analyzes NAT Gateway traffic, identifies cost-saving opportunities (like using VPC Endpoints for AWS services), and provides recommendations with projected savings.
Why it teaches AWS networking: NAT Gateway is often the #1 surprise cost in AWS bills. Understanding why traffic goes through NAT (vs. VPC Endpoints) teaches you about routing, endpoints, and cost optimization.
Core challenges you’ll face:
- Analyzing NAT Gateway bytes (CloudWatch metrics, flow logs) → maps to traffic analysis
- Identifying AWS service traffic (S3, DynamoDB, ECR, etc.) → maps to endpoint candidates
- Calculating endpoint savings (NAT costs vs endpoint costs) → maps to FinOps
- Route table impact (how endpoints change routing) → maps to routing precedence
Key Concepts:
- NAT Gateway Pricing: AWS NAT Gateway Pricing
- VPC Endpoints: Gateway Endpoints vs Interface Endpoints
- Centralized Endpoints: Centralized VPC Endpoints Architecture
Difficulty: Intermediate Time estimate: 1 week Prerequisites: AWS billing understanding, CloudWatch
Real world outcome:
$ ./nat-optimizer analyze --vpc vpc-0abc123
╔════════════════════════════════════════════════════════════════╗
║ NAT GATEWAY COST ANALYSIS (Last 30 days) ║
╠════════════════════════════════════════════════════════════════╣
║ ║
║ NAT GATEWAY SUMMARY ║
║ ─────────────────────────────────────────────────────────────║
║ NAT Gateway: nat-0abc123 (us-east-1a) ║
║ Data Processed: 15.2 TB ║
║ Current Cost: $683.10 ($0.045/GB processing) ║
║ Hourly Cost: $32.40 (720 hours × $0.045) ║
║ TOTAL: $715.50/month ║
║ ║
║ TRAFFIC BREAKDOWN (by destination) ║
║ ─────────────────────────────────────────────────────────────║
║ 1. S3 (52-region prefix lists): 8.5 TB (55.9%) ║
║ 2. ECR (Elastic Container Registry): 3.2 TB (21.1%) ║
║ 3. DynamoDB: 1.8 TB (11.8%) ║
║ 4. Internet (non-AWS): 1.7 TB (11.2%) ║
║ ║
║ 💰 SAVINGS OPPORTUNITIES ║
║ ─────────────────────────────────────────────────────────────║
║ ║
║ S3 Gateway Endpoint (FREE!) ║
║ Current: 8.5 TB × $0.045 = $382.50 ║
║ With Endpoint: $0.00 (gateway endpoints are free) ║
║ SAVINGS: $382.50/month ║
║ ║
║ DynamoDB Gateway Endpoint (FREE!) ║
║ Current: 1.8 TB × $0.045 = $81.00 ║
║ With Endpoint: $0.00 ║
║ SAVINGS: $81.00/month ║
║ ║
║ ECR Interface Endpoint ║
║ Current: 3.2 TB × $0.045 = $144.00 ║
║     Endpoint hourly: $14.60/month (2 AZs × $0.01/hr × 730)    ║
║     Endpoint data: 3.2 TB × $0.01 = $32.00                    ║
║     SAVINGS: $97.40/month                                     ║
║                                                               ║
║  ═══════════════════════════════════════════════════════════ ║
║  TOTAL POTENTIAL SAVINGS: $560.90/month (78.4% reduction!)    ║
║  REMAINING COST: $154.60 (NAT hourly + internet + ECR fees)   ║
║ ═══════════════════════════════════════════════════════════ ║
║ ║
╚════════════════════════════════════════════════════════════════╝
$ ./nat-optimizer deploy-endpoints --vpc vpc-0abc123 --dry-run
Would create:
- S3 Gateway Endpoint (vpce-gw-s3)
- DynamoDB Gateway Endpoint (vpce-gw-ddb)
- ECR Interface Endpoint (vpce-if-ecr-api, vpce-if-ecr-dkr)
Route table changes:
- rtb-private-a: Add S3 prefix list → vpce-gw-s3
- rtb-private-b: Add S3 prefix list → vpce-gw-s3
- rtb-private-a: Add DynamoDB prefix list → vpce-gw-ddb
- rtb-private-b: Add DynamoDB prefix list → vpce-gw-ddb
Implementation Hints: The key insight: Gateway Endpoints are FREE (for S3 and DynamoDB), while Interface Endpoints cost ~$7.30/month per AZ but are much cheaper than NAT Gateway for high-traffic services.
# Pseudo-code for traffic analysis
def analyze_nat_traffic(vpc_id, days=30):
# Get NAT Gateway CloudWatch metrics
nat_gateways = get_nat_gateways(vpc_id)
traffic = {'s3': 0, 'dynamodb': 0, 'ecr': 0, 'internet': 0}  # bytes per category
for nat in nat_gateways:
# Get bytes processed
metrics = cloudwatch.get_metric_statistics(
Namespace='AWS/NATGateway',
MetricName='BytesOutToDestination',
Dimensions=[{'Name': 'NatGatewayId', 'Value': nat.id}],
Period=86400,
Statistics=['Sum']
)
# Analyze flow logs to determine destinations
flows = query_flow_logs(f"""
SELECT dstaddr, SUM(bytes) as total_bytes
FROM flow_logs
WHERE interface_id = '{nat.eni_id}'
AND action = 'ACCEPT'
GROUP BY dstaddr
ORDER BY total_bytes DESC
""")
# Categorize by destination
for flow in flows:
if is_s3_ip(flow.dstaddr):
traffic['s3'] += flow.total_bytes
elif is_dynamodb_ip(flow.dstaddr):
traffic['dynamodb'] += flow.total_bytes
elif is_ecr_ip(flow.dstaddr):
traffic['ecr'] += flow.total_bytes
else:
traffic['internet'] += flow.total_bytes
return calculate_savings(traffic)
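calculate_savings is left undefined above. A sketch using this project's pricing assumptions ($0.045/GB NAT processing; gateway endpoints free; interface endpoints ~$7.30/month per AZ plus $0.01/GB processed):
NAT_PER_GB = 0.045
IFACE_PER_GB = 0.01
IFACE_MONTHLY_PER_AZ = 7.30   # $0.01/hr x 730 hrs

def calculate_savings(traffic_gb: dict, azs: int = 2) -> dict:
    """traffic_gb: GB/month per category, e.g. {'s3': 8500, ...}"""
    savings = {}
    # Gateway endpoints (S3, DynamoDB) are free: the full NAT cost is saved
    for svc in ("s3", "dynamodb"):
        savings[svc] = traffic_gb.get(svc, 0) * NAT_PER_GB
    # Interface endpoint (ECR): NAT cost minus endpoint hourly and data fees
    ecr_gb = traffic_gb.get("ecr", 0)
    savings["ecr"] = (ecr_gb * NAT_PER_GB
                      - azs * IFACE_MONTHLY_PER_AZ
                      - ecr_gb * IFACE_PER_GB)
    savings["total"] = sum(savings.values())
    return savings

print(calculate_savings({"s3": 8500, "dynamodb": 1800, "ecr": 3200}))
# ≈ {'s3': 382.5, 'dynamodb': 81.0, 'ecr': 97.4, 'total': 560.9}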
Learning milestones:
- Identify traffic patterns through NAT → You understand outbound flow
- Calculate endpoint savings correctly → You understand pricing
- Deploy endpoints without breaking traffic → You understand routing
- See cost reduction in AWS bill → You’ve made a real impact
Project 6: Multi-Account Network with AWS Organizations
- File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
- Main Programming Language: Terraform
- Alternative Programming Languages: AWS CDK, CloudFormation StackSets
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Enterprise Architecture / Networking
- Software or Tool: AWS Organizations, RAM, Transit Gateway
- Main Book: “AWS Certified Solutions Architect Professional Study Guide”
What you’ll build: A multi-account network architecture with a shared services VPC (containing NAT Gateways, VPC Endpoints), application VPCs in separate accounts, and Transit Gateway connecting everything—all shared via Resource Access Manager.
Why it teaches AWS networking: Enterprise AWS is multi-account. Understanding how to share networking resources across accounts, maintain isolation while enabling connectivity, and centralize common infrastructure is essential for cloud architects.
Core challenges you’ll face:
- Resource Access Manager (sharing TGW, subnets across accounts) → maps to multi-account patterns
- Centralized egress (shared NAT, single point of control) → maps to hub-and-spoke
- Centralized VPC Endpoints (cost optimization, management) → maps to shared services
- Route propagation (TGW route tables, blackhole routes) → maps to transit routing
- Security boundaries (what CAN vs SHOULD communicate) → maps to network segmentation
Key Concepts:
- Multi-VPC Architecture: AWS Multi-VPC Whitepaper
- Resource Access Manager: AWS RAM Documentation
- Centralized Egress: One to Many: Evolving VPC Design
Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Multi-account AWS setup, Terraform, Transit Gateway basics
Real world outcome:
MULTI-ACCOUNT NETWORK ARCHITECTURE
┌──────────────────────────────────────────────────────────────────┐
│ MANAGEMENT ACCOUNT │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ AWS Organizations │ │
│ │ Resource Access Manager (shares TGW, Subnets) │ │
│ └────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ NETWORK ACCOUNT │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Shared Services VPC (10.0.0.0/16) │ │
│ │ ├── NAT Gateways (centralized egress) │ │
│ │ ├── VPC Endpoints (S3, DynamoDB, ECR, etc.) │ │
│ │ └── Transit Gateway (hub) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
└──────────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ DEV ACCOUNT │ │ STAGING ACCOUNT│ │ PROD ACCOUNT │
│ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │
│ │ App VPC │ │ │ │ App VPC │ │ │ │ App VPC │ │
│ │ 10.1.0.0/16 │ │ │ │ 10.2.0.0/16 │ │ │ │ 10.3.0.0/16 │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ TGW Attach │ │ │ │ TGW Attach │ │ │ │ TGW Attach │ │
│ └─────────────┘ │ └─────────────┘ │ │ └─────────────┘ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
$ terraform apply
Network Account:
Transit Gateway: tgw-0net123 (shared via RAM)
Shared Services VPC: vpc-0shared
NAT Gateway: nat-0central
VPC Endpoints: 8 endpoints (S3, DynamoDB, ECR, SSM, etc.)
Dev Account:
TGW Attachment: tgw-attach-dev (accepted from RAM share)
App VPC: vpc-0dev (10.1.0.0/16)
Route: 0.0.0.0/0 → TGW → Shared Services → NAT → Internet
Prod Account:
TGW Attachment: tgw-attach-prod
App VPC: vpc-0prod (10.3.0.0/16)
ISOLATED: Cannot reach Dev VPC (TGW route table segmentation)
# Test connectivity
$ aws-vault exec dev -- ssh app-server
[dev-app]$ curl https://s3.amazonaws.com/mybucket/test.txt
OK (via centralized VPC endpoint - no NAT!)
[dev-app]$ ping 10.3.0.50 # Prod server
Request timed out (blocked by TGW route table - no route to prod)
Implementation Hints: Key architecture decisions:
- Centralized Egress: All outbound traffic from spoke VPCs routes through the shared services VPC. This gives you:
- Single NAT Gateway bill (not one per VPC)
- Central firewall/inspection point
- Unified logging
- Route Table Segmentation: Use separate TGW route tables to control what can talk to what (see the boto3 sketch after this list):
TGW Route Table: “shared-services”
- All VPC attachments associated
- Routes to all VPCs (hub access)
TGW Route Table: “dev”
- Dev VPC attachment associated
- Route to shared-services VPC: 10.0.0.0/16
- Route to dev VPC: 10.1.0.0/16
- NO route to prod VPC (isolation!)
TGW Route Table: “prod”
- Prod VPC attachment associated
- Route to shared-services VPC: 10.0.0.0/16
- Route to prod VPC: 10.3.0.0/16
- NO route to dev VPC
- Centralized VPC Endpoints: create each endpoint once in the shared services VPC. Note that S3/DynamoDB gateway endpoints only serve their own VPC, so centralized access from spokes over the TGW requires interface endpoints:
# In the network account (subnet names are illustrative)
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.shared_services.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Interface"
  subnet_ids        = aws_subnet.endpoint_subnets[*].id
  # Reachable from spoke VPCs via TGW
}
Spoke VPCs route S3 traffic: VPC → TGW → Shared VPC → Endpoint
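For the route table segmentation above, the equivalent API calls are short. A hedged boto3 sketch (the IDs are placeholders, not values from this project):
import boto3

ec2 = boto3.client("ec2")

# Placeholder IDs for illustration
DEV_RTB = "tgw-rtb-0dev"
DEV_ATTACHMENT = "tgw-attach-0dev"
SHARED_ATTACHMENT = "tgw-attach-0shared"

# Associate the dev VPC attachment with the "dev" TGW route table
ec2.associate_transit_gateway_route_table(
    TransitGatewayRouteTableId=DEV_RTB,
    TransitGatewayAttachmentId=DEV_ATTACHMENT,
)

# Dev can reach shared services...
ec2.create_transit_gateway_route(
    DestinationCidrBlock="10.0.0.0/16",
    TransitGatewayRouteTableId=DEV_RTB,
    TransitGatewayAttachmentId=SHARED_ATTACHMENT,
)

# ...while prod is explicitly blackholed: packets to 10.3.0.0/16 are dropped
ec2.create_transit_gateway_route(
    DestinationCidrBlock="10.3.0.0/16",
    TransitGatewayRouteTableId=DEV_RTB,
    Blackhole=True,
)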
Learning milestones:
- Resources share correctly via RAM → You understand cross-account sharing
- Spoke VPCs reach internet via centralized NAT → You understand transit routing
- Dev cannot reach Prod → You understand TGW route table segmentation
- All AWS API calls use centralized endpoints → You understand endpoint routing
Project 7: Site-to-Site VPN with Simulated On-Premises
- File: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
- Main Programming Language: Terraform
- Alternative Programming Languages: CloudFormation, Manual (console + strongSwan)
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Hybrid Networking / VPN
- Software or Tool: AWS Site-to-Site VPN, strongSwan, Libreswan
- Main Book: “Computer Networks, Fifth Edition” by Tanenbaum (Chapter on VPNs)
What you’ll build: A complete Site-to-Site VPN setup with an EC2 instance running strongSwan to simulate an on-premises data center, demonstrating IPsec tunnel establishment, BGP routing, and failover.
Why it teaches AWS networking: Hybrid connectivity is essential for enterprise AWS. By simulating the “on-premises” side with strongSwan, you’ll understand both ends of the VPN tunnel—something most engineers never see.
Core challenges you’ll face:
- IPsec configuration (Phase 1, Phase 2, encryption algorithms) → maps to VPN protocols
- BGP configuration (ASNs, route advertisement) → maps to dynamic routing
- Tunnel redundancy (two tunnels, failover testing) → maps to high availability
- NAT traversal (when the Customer Gateway is behind NAT) → maps to real-world complexity
- Troubleshooting tunnels (why isn’t it coming up?) → maps to practical debugging
Key Concepts:
- Site-to-Site VPN: AWS VPN Documentation
- IPsec Fundamentals: “Computer Networks, Fifth Edition” Chapter 8 - Tanenbaum
- BGP Basics: AWS BGP Routing
- strongSwan Configuration: strongSwan AWS Guide
Difficulty: Advanced Time estimate: 1-2 weeks Prerequisites: Basic networking, Linux administration
Real world outcome:
$ terraform apply
AWS Resources Created:
VPN Gateway: vgw-0abc123 (attached to VPC)
Customer Gateway: cgw-0def456 (simulated on-prem)
VPN Connection: vpn-0ghi789 (2 tunnels)
Tunnel 1: 169.254.10.1/30 (AWS) ↔ 169.254.10.2/30 (On-prem)
Tunnel 2: 169.254.11.1/30 (AWS) ↔ 169.254.11.2/30 (On-prem)
On-Premises Simulation (EC2 in different VPC):
strongSwan instance: i-onprem123
Public IP: 54.x.x.x (Customer Gateway IP)
Internal network: 192.168.0.0/16 (simulated corporate LAN)
# Check VPN status
$ aws ec2 describe-vpn-connections --vpn-connection-ids vpn-0ghi789 \
--query 'VpnConnections[0].VgwTelemetry'
[
{
"OutsideIpAddress": "52.1.2.3",
"Status": "UP",
"StatusMessage": "2 BGP ROUTES",
"AcceptedRouteCount": 2
},
{
"OutsideIpAddress": "52.4.5.6",
"Status": "UP",
"StatusMessage": "2 BGP ROUTES",
"AcceptedRouteCount": 2
}
]
# Test connectivity
$ ssh on-prem-server
[on-prem]$ ping 10.0.1.50 # AWS private IP
PING 10.0.1.50: 64 bytes from 10.0.1.50: icmp_seq=1 ttl=64 time=15.2 ms
PING 10.0.1.50: 64 bytes from 10.0.1.50: icmp_seq=2 ttl=64 time=14.8 ms
[on-prem]$ traceroute 10.0.1.50
1 192.168.1.1 (gateway) 0.5 ms
2 169.254.10.1 (AWS tunnel) 15.1 ms # Through IPsec tunnel!
3 10.0.1.50 (AWS instance) 15.3 ms
# Failover test
$ ssh on-prem-server
[on-prem]$ sudo ipsec down aws-tunnel-1
[on-prem]$ ping 10.0.1.50
# Brief interruption, then traffic flows through tunnel-2
PING 10.0.1.50: 64 bytes from 10.0.1.50: icmp_seq=5 ttl=64 time=16.1 ms
```

**Implementation Hints**: The strongSwan configuration on the simulated on-prem server:
```
# /etc/ipsec.conf (simplified)
conn aws-tunnel-1
    auto=start
    type=tunnel
    authby=secret
    left=%defaultroute
    leftid=54.x.x.x           # On-prem public IP
    leftsubnet=192.168.0.0/16 # On-prem network
    right=52.1.2.3            # AWS VPN endpoint 1
    rightsubnet=10.0.0.0/16   # AWS VPC CIDR
    ike=aes256-sha256-modp2048
    esp=aes256-sha256
    keyexchange=ikev2
    ikelifetime=8h
    lifetime=1h
    dpdaction=restart
    dpddelay=10s

conn aws-tunnel-2
    also=aws-tunnel-1   # inherit all settings from tunnel 1
    right=52.4.5.6      # AWS VPN endpoint 2
```
BGP configuration for dynamic routing:

```
# Using quagga/FRR for BGP
router bgp 65001
  neighbor 169.254.10.1 remote-as 64512  # AWS ASN
  neighbor 169.254.11.1 remote-as 64512
  network 192.168.0.0/16                 # Advertise on-prem network
```
Key troubleshooting steps:

```bash
# Check IPsec status
sudo ipsec statusall

# Check BGP sessions
sudo vtysh -c "show ip bgp summary"

# Check routes learned from AWS
ip route show | grep 10.0

# Debug IPsec
sudo ipsec stroke loglevel ike 3
sudo tail -f /var/log/syslog | grep charon
```
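The AWS side of the tunnel is only a handful of Terraform resources. A minimal sketch, assuming a hypothetical `aws_vpc.main` and private route table; the placeholder `54.x.x.x` must be replaced with the strongSwan instance's real public IP:

```hcl
# AWS side of the Site-to-Site VPN (the on-prem side is the strongSwan instance above)
resource "aws_vpn_gateway" "main" {
  vpc_id = aws_vpc.main.id # hypothetical VPC
}

resource "aws_customer_gateway" "onprem" {
  bgp_asn    = 65001      # must match the on-prem BGP ASN above
  ip_address = "54.x.x.x" # public IP of the strongSwan instance (placeholder)
  type       = "ipsec.1"
}

resource "aws_vpn_connection" "main" {
  vpn_gateway_id      = aws_vpn_gateway.main.id
  customer_gateway_id = aws_customer_gateway.onprem.id
  type                = "ipsec.1"
  static_routes_only  = false # use BGP, not static routes
}

# Propagate BGP-learned on-prem routes into the VPC's private route table
resource "aws_vpn_gateway_route_propagation" "private" {
  vpn_gateway_id = aws_vpn_gateway.main.id
  route_table_id = aws_route_table.private.id # hypothetical
}
```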
**Learning milestones**:
1. **Both tunnels establish** → You understand IPsec negotiation
2. **BGP routes are exchanged** → You understand dynamic routing
3. **Traffic flows through the VPN** → You can verify connectivity
4. **Failover works when one tunnel dies** → You understand redundancy
---
## Project 8: AWS Network Firewall Deployment
- **File**: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
- **Main Programming Language**: Terraform
- **Alternative Programming Languages**: AWS CDK, CloudFormation
- **Coolness Level**: Level 3: Genuinely Clever
- **Business Potential**: 3. The "Service & Support" Model
- **Difficulty**: Level 3: Advanced
- **Knowledge Area**: Network Security
- **Software or Tool**: AWS Network Firewall
- **Main Book**: "The Practice of Network Security Monitoring" by Richard Bejtlich
**What you’ll build**: A network inspection architecture using AWS Network Firewall with domain filtering, IDS/IPS rules, and logging—positioned to inspect all egress traffic.

**Why it teaches AWS networking**: AWS Network Firewall provides VPC-perimeter inspection that Security Groups and NACLs can’t do. Understanding where to place it, how to route traffic through it, and how to write rules teaches deep network security.

**Core challenges you’ll face**:
- **Firewall subnet placement** (dedicated subnets, routing) → maps to *inspection architecture*
- **Routing to the firewall** (ingress routing, gateway route tables) → maps to *traffic flow design*
- **Rule group design** (stateful vs stateless, domain filtering) → maps to *firewall policy*
- **Logging and alerting** (what to log, where to send) → maps to *security monitoring*
- **Understanding Suricata rules** (IDS/IPS syntax) → maps to *threat detection*

**Key Concepts**:
- **AWS Network Firewall**: AWS Network Firewall Documentation
- **Suricata Rules**: Suricata Rule Writing Guide
- **Inspection Architecture**: Network Firewall Deployment Models

**Difficulty**: Advanced
**Time estimate**: 2 weeks
**Prerequisites**: VPC routing, basic firewall concepts
**Real world outcome**:
```bash
$ terraform apply
Resources Created:
Network Firewall: nfw-0abc123
Firewall Policy: policy-egress-inspection
Rule Groups:
- Stateless: Allow/Drop by IP/Port
- Stateful: Domain filtering, IDS/IPS
Architecture:
┌─────────────────────────────────────────────────────────┐
│ VPC │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Private │ │ Firewall │ │ Public │ │
│ │ Subnet │───▶│ Subnet │───▶│ Subnet │ │
│ │ (App) │ │ (NFW) │ │ (NAT) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ 10.0.1.0/24 10.0.100.0/24 10.0.0.0/24 │
│ │ │ │
│ ┌─────┴─────┐ ┌────┴────┐ │
│ │ Network │ │ NAT │ │
│ │ Firewall │ │ Gateway │ │
│ └───────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────┘
# Test domain filtering
$ ssh app-server
[app]$ curl https://example.com
HTTP/1.1 200 OK # Allowed
[app]$ curl https://malware-c2-server.evil
curl: (Connection refused) # BLOCKED by Network Firewall!
# Check firewall logs
$ aws logs filter-log-events --log-group-name /aws/network-firewall/alert
{
"event_type": "alert",
"alert": {
"action": "blocked",
"signature": "ET MALWARE Known C2 Domain",
"src_ip": "10.0.1.50",
"dest_ip": "185.143.x.x",
"dest_port": 443
}
}
# Firewall metrics
$ ./nfw-monitor
╔════════════════════════════════════════════════════════════╗
║ NETWORK FIREWALL METRICS (Last Hour) ║
╠════════════════════════════════════════════════════════════╣
║ ║
║ Traffic Processed: 2.3 GB ║
║ Packets Inspected: 4,567,890 ║
║ ║
║ Actions: ║
║ ✓ Passed: 4,560,123 (99.8%) ║
║ ✗ Dropped: 7,767 (0.2%) ║
║ ║
║ Top Blocked: ║
║ 1. Malware C2 domains: 234 connections ║
║ 2. Crypto mining pools: 156 connections ║
║ 3. Known bad IPs: 89 connections ║
║ ║
║ IDS/IPS Alerts: 47 ║
║ - SQL injection attempts: 12 ║
║ - SSH brute force: 35 ║
║ ║
╚════════════════════════════════════════════════════════════╝
```

**Implementation Hints**: The critical routing for egress inspection. (Note: in the AWS provider, `firewall_status[0].sync_states` is a set, so production code usually selects the per-AZ endpoint with a `for` expression rather than direct indexing.)

```hcl
# Traffic flow: App → NFW → NAT → Internet
# Return flow: Internet → NAT → NFW → App
# Private subnet route table (where apps live)
resource "aws_route" "private_to_firewall" {
route_table_id = aws_route_table.private.id
destination_cidr_block = "0.0.0.0/0"
vpc_endpoint_id = aws_networkfirewall_firewall.main.firewall_status[0].sync_states[0].attachment[0].endpoint_id
}
# Firewall subnet route table
resource "aws_route" "firewall_to_nat" {
route_table_id = aws_route_table.firewall.id
destination_cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.main.id
}
# NAT Gateway subnet needs INGRESS routing back through firewall
resource "aws_route" "nat_to_firewall" {
route_table_id = aws_route_table.public.id
destination_cidr_block = "10.0.1.0/24" # Private subnet
vpc_endpoint_id = aws_networkfirewall_firewall.main.firewall_status[0].sync_states[0].attachment[0].endpoint_id
}
```

Example firewall rules:

```hcl
# Domain filtering (block malware C2)
resource "aws_networkfirewall_rule_group" "domain_block" {
capacity = 100
name = "block-malicious-domains"
type = "STATEFUL"
rule_group {
rule_variables {
ip_sets {
key = "HOME_NET"
ip_set { definition = ["10.0.0.0/16"] }
}
}
rules_source {
rules_source_list {
generated_rules_type = "DENYLIST"
target_types = ["TLS_SNI", "HTTP_HOST"]
targets = [".evil.com", ".malware.net", ".c2server.xyz"]
}
}
}
}
# IDS/IPS with Suricata rules
resource "aws_networkfirewall_rule_group" "ids" {
capacity = 500
name = "ids-rules"
type = "STATEFUL"
rule_group {
rules_source {
rules_string = <<EOF
alert tcp any any -> any 22 (msg:"SSH brute force attempt"; flow:to_server; threshold:type both,track by_src,count 5,seconds 60; sid:1000001; rev:1;)
drop tcp any any -> any any (msg:"SQL injection"; content:"UNION SELECT"; nocase; sid:1000002; rev:1;)
EOF
}
}
}
```
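The alert log queried in the outcome above is not produced by default; logging has to be configured on the firewall explicitly. A minimal sketch, assuming the firewall resource from the routing hints and a hypothetical log group name:

```hcl
# Ship stateful-engine alerts to the CloudWatch log group queried above
resource "aws_cloudwatch_log_group" "nfw_alert" {
  name              = "/aws/network-firewall/alert" # hypothetical name
  retention_in_days = 30
}

resource "aws_networkfirewall_logging_configuration" "main" {
  firewall_arn = aws_networkfirewall_firewall.main.arn

  logging_configuration {
    log_destination_config {
      log_destination = {
        logGroup = aws_cloudwatch_log_group.nfw_alert.name
      }
      log_destination_type = "CloudWatchLogs"
      log_type             = "ALERT" # stateful rule alerts; use "FLOW" for flow-style logs
    }
  }
}
```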
**Learning milestones**:
1. **Firewall inspects all egress traffic** → You understand routing through NFW
2. **Domain blocking works** → You understand stateful rules
3. **IDS alerts fire correctly** → You understand Suricata
4. **Return traffic routes correctly** → You understand symmetric routing
---
## Project 9: Global Accelerator & CloudFront Edge Networking
- **File**: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
- **Main Programming Language**: Terraform
- **Alternative Programming Languages**: AWS CDK, CloudFormation
- **Coolness Level**: Level 3: Genuinely Clever
- **Business Potential**: 3. The "Service & Support" Model
- **Difficulty**: Level 2: Intermediate
- **Knowledge Area**: CDN / Edge Networking
- **Software or Tool**: AWS Global Accelerator, CloudFront
- **Main Book**: "High Performance Browser Networking" by Ilya Grigorik
**What you’ll build**: A global application with CloudFront for static content, Global Accelerator for dynamic content, and latency-based routing—with performance benchmarks showing the improvement.

**Why it teaches AWS networking**: Understanding how traffic enters AWS at the edge, traverses the AWS backbone, and reaches your origin teaches you about Anycast, PoPs, and why edge networking matters for performance.

**Core challenges you’ll face**:
- **CloudFront distribution setup** (origins, behaviors, caching) → maps to *CDN architecture*
- **Global Accelerator configuration** (endpoints, health checks) → maps to *Anycast networking*
- **Understanding when to use which** (static vs dynamic, TCP vs HTTP) → maps to *architectural decisions*
- **Measuring performance improvement** (latency from different regions) → maps to *performance testing*
**Key Concepts**:
- **CloudFront**: AWS CloudFront Developer Guide
- **Global Accelerator**: AWS Global Accelerator Documentation
- **Edge Networking**: *"High Performance Browser Networking"* Chapter 1 - Ilya Grigorik

**Difficulty**: Intermediate
**Time estimate**: 1 week
**Prerequisites**: Basic DNS, HTTP understanding

**Real world outcome**:
```bash
$ terraform apply
CloudFront Distribution: d123abc.cloudfront.net
Origin: alb-us-east-1.example.com
Behaviors:
/static/* → S3 origin (cached 1 year)
/api/* → ALB origin (no cache)
/* → ALB origin (cache 5 min)
Global Accelerator: a1b2c3d4.awsglobalaccelerator.com
Listener: TCP 443
Endpoint Groups:
- us-east-1: alb-east (weight 100)
- eu-west-1: alb-west (weight 100)
Health Checks: /health every 10s
# Performance comparison from different locations
$ ./edge-benchmark
╔════════════════════════════════════════════════════════════════╗
║ EDGE NETWORKING PERFORMANCE COMPARISON ║
╠════════════════════════════════════════════════════════════════╣
║ ║
║ TEST: HTTPS request to API endpoint ║
║ Origin: us-east-1 (N. Virginia) ║
║ ║
║ FROM NEW YORK (same region as origin) ║
║ ─────────────────────────────────────────────────────────────║
║ Direct to ALB: 45ms ║
║ Via CloudFront: 42ms (-7%) ║
║ Via Global Accel: 38ms (-16%) ║
║ ║
║ FROM LONDON (eu-west-1) ║
║ ─────────────────────────────────────────────────────────────║
║ Direct to ALB: 125ms (transatlantic internet) ║
║ Via CloudFront: 48ms (-62%) ← enters AWS at London PoP ║
║ Via Global Accel: 44ms (-65%) ← uses AWS backbone ║
║ ║
║ FROM TOKYO (ap-northeast-1) ║
║ ─────────────────────────────────────────────────────────────║
║ Direct to ALB: 210ms (Pacific + US internet) ║
║ Via CloudFront: 65ms (-69%) ← Tokyo PoP + AWS backbone ║
║ Via Global Accel: 58ms (-72%) ← Anycast to nearest PoP ║
║ ║
║ STATIC CONTENT (S3 via CloudFront) ║
║ ─────────────────────────────────────────────────────────────║
║ First request: 85ms (cache miss, fetch from S3) ║
║ Cached requests: 12ms (served from edge) ║
║ ║
╚════════════════════════════════════════════════════════════════╝
WHY THE IMPROVEMENT?
1. Traffic enters AWS at nearest edge location (400+ PoPs)
2. Traverses AWS private backbone (not public internet)
3. CloudFront caches responses at edge
4. Global Accelerator uses Anycast for optimal routing
5. TCP optimization (connection reuse, fast retransmits)
```

**Implementation Hints**: Key architectural understanding:

```
WITHOUT EDGE SERVICES:
User (Tokyo) → Internet → Origin (us-east-1)
- Multiple ISP hops, congested peering points
- TCP cold start on every connection
- ~200ms latency
WITH CLOUDFRONT:
User (Tokyo) → Tokyo PoP (5ms) → AWS Backbone → Origin
- Enters AWS at nearest Point of Presence
- Uses optimized AWS backbone
- Connection pooling to origin
- Caching at edge
- ~65ms latency
WITH GLOBAL ACCELERATOR:
User (Tokyo) → Anycast → Nearest PoP → AWS Backbone → Origin
- Anycast IP routes to geographically closest PoP
- Static IPs (good for whitelisting)
- TCP/UDP optimization
- Health-checked failover
- ~58ms latency
```

When to use which:

```
CLOUDFRONT:
- HTTP/HTTPS traffic
- Cacheable content
- Complex routing rules (path-based, header-based)
- Lambda@Edge for edge compute

GLOBAL ACCELERATOR:
- TCP/UDP traffic (not just HTTP)
- Non-cacheable dynamic content
- Need static Anycast IPs
- Multi-region active-active with health checks
- Gaming, VoIP, financial applications
```
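The Global Accelerator side of the decision list above is only a few Terraform resources. A minimal sketch, assuming an existing ALB (`aws_lb.east` is hypothetical); a second `aws_globalaccelerator_endpoint_group` per region gives the multi-region failover shown in the outcome:

```hcl
resource "aws_globalaccelerator_accelerator" "main" {
  name            = "global-app"
  ip_address_type = "IPV4"
  enabled         = true
}

resource "aws_globalaccelerator_listener" "tls" {
  accelerator_arn = aws_globalaccelerator_accelerator.main.id
  protocol        = "TCP"

  port_range {
    from_port = 443
    to_port   = 443
  }
}

# One endpoint group per region; Anycast plus health checks handle failover
resource "aws_globalaccelerator_endpoint_group" "us_east_1" {
  listener_arn                  = aws_globalaccelerator_listener.tls.id
  endpoint_group_region         = "us-east-1"
  health_check_path             = "/health"
  health_check_interval_seconds = 10

  endpoint_configuration {
    endpoint_id = aws_lb.east.arn # hypothetical ALB
    weight      = 100
  }
}
```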
**Learning milestones**:
1. **CloudFront serves static content from edge** → You understand CDN caching
2. **API latency improves for distant users** → You understand backbone routing
3. **Global Accelerator failover works** → You understand health-checked routing
4. **You can explain when to use each** → You can make architectural decisions
---
## Project 10: VPC Lattice for Service-to-Service Networking
- **File**: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
- **Main Programming Language**: Terraform
- **Alternative Programming Languages**: AWS CDK
- **Coolness Level**: Level 4: Hardcore Tech Flex
- **Business Potential**: 4. The "Open Core" Infrastructure
- **Difficulty**: Level 3: Advanced
- **Knowledge Area**: Service Mesh / Application Networking
- **Software or Tool**: AWS VPC Lattice
- **Main Book**: "Building Microservices, 2nd Edition" by Sam Newman
**What you’ll build**: A service mesh using VPC Lattice that enables service-to-service communication across VPCs and accounts with built-in authentication, authorization, and observability—without managing sidecars or proxies.

**Why it teaches AWS networking**: VPC Lattice represents AWS’s vision for application-layer networking. It abstracts away traditional network complexity (VPC peering, Transit Gateway for service communication) and provides L7 features like path-based routing and IAM authentication.

**Core challenges you’ll face**:
- **Service network creation** (the namespace for services) → maps to *service mesh concepts*
- **Service association** (connecting Lambda, ECS, EC2 to the mesh) → maps to *compute abstraction*
- **Auth policies** (IAM-based service-to-service auth) → maps to *zero-trust*
- **Cross-account service sharing** → maps to *multi-account patterns*
- **Traffic management** (weighted routing, path-based) → maps to *L7 load balancing*
**Key Concepts**:
- **VPC Lattice**: AWS VPC Lattice Documentation
- **Service Mesh Concepts**: *"Building Microservices, 2nd Edition"* Chapter 4 - Sam Newman
- **Zero-Trust Networking**: NIST Zero Trust Architecture (SP 800-207)

**Difficulty**: Advanced
**Time estimate**: 2 weeks
**Prerequisites**: VPC fundamentals, IAM, microservices basics

**Real world outcome**:
```bash
$ terraform apply
VPC Lattice Service Network: my-service-network
Associated VPCs: [vpc-frontend, vpc-backend, vpc-data]
Services:
- frontend-service (Lambda function)
URL: https://frontend-svc-abc123.vpc-lattice-svcs.us-east-1.on.aws
- orders-service (ECS Fargate)
URL: https://orders-svc-def456.vpc-lattice-svcs.us-east-1.on.aws
- inventory-service (EC2 instances)
URL: https://inventory-svc-ghi789.vpc-lattice-svcs.us-east-1.on.aws
Auth Policies:
- frontend-service → can call → orders-service
- orders-service → can call → inventory-service
- frontend-service → CANNOT call → inventory-service (least privilege!)
# Test service-to-service call
$ ssh frontend-host
[frontend]$ curl -H "Authorization: $IAM_SIG" \
https://orders-svc-def456.vpc-lattice-svcs.us-east-1.on.aws/orders
{"orders": [...]} # Allowed!
[frontend]$ curl -H "Authorization: $IAM_SIG" \
https://inventory-svc-ghi789.vpc-lattice-svcs.us-east-1.on.aws/inventory
403 Forbidden: frontend-service is not authorized to access inventory-service
# Traffic split for canary deployment
$ terraform apply -var="orders_v2_weight=10"
Traffic routing for orders-service:
v1 (current): 90%
v2 (canary): 10%
# Observe metrics
$ aws cloudwatch get-metric-statistics \
--namespace AWS/VPCLattice \
--metric-name RequestCount
orders-service metrics:
Total requests: 10,000
Target v1: 9,012 (90.1%)
Target v2: 988 (9.9%)
4xx errors v1: 45 (0.5%)
4xx errors v2: 12 (1.2%)
```

**Implementation Hints**: VPC Lattice architecture:

```
┌─────────────────────────────┐
│ Service Network │
│ "my-service-network" │
└─────────────────────────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ frontend │ │ orders │ │ inventory │
│ service │─────▶│ service │─────▶│ service │
│ (Lambda) │ │ (ECS) │ │ (EC2) │
└───────────┘ └───────────┘ └───────────┘
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│ VPC A │ │ VPC B │ │ VPC C │
└─────────┘ └─────────┘ └─────────┘
```

**Key insight**: NO VPC peering or Transit Gateway needed! VPC Lattice provides overlay networking between services.
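The diagram above maps to a small set of Lattice resources. A minimal sketch of the service-network wiring, with hypothetical VPC and service names:

```hcl
resource "aws_vpclattice_service_network" "main" {
  name      = "my-service-network"
  auth_type = "AWS_IAM" # every request is IAM-authenticated
}

# Each VPC whose workloads should reach services gets associated once
resource "aws_vpclattice_service_network_vpc_association" "frontend" {
  vpc_identifier             = aws_vpc.frontend.id # hypothetical
  service_network_identifier = aws_vpclattice_service_network.main.id
}

resource "aws_vpclattice_service" "orders" {
  name      = "orders-service"
  auth_type = "AWS_IAM"
}

# Publishing a service into the network makes it reachable from associated VPCs
resource "aws_vpclattice_service_network_service_association" "orders" {
  service_identifier         = aws_vpclattice_service.orders.id
  service_network_identifier = aws_vpclattice_service_network.main.id
}
```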
IAM-based service auth:

```hcl
resource "aws_vpclattice_auth_policy" "orders_policy" {
resource_identifier = aws_vpclattice_service.orders.arn
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Principal = "*"
Action = "vpc-lattice-svcs:Invoke"
Resource = "*"
Condition = {
StringEquals = {
"vpc-lattice-svcs:SourceVpcOwnerAccount" = var.frontend_account_id
}
"ForAnyValue:StringLike" = {
"aws:PrincipalServiceName" = "lambda.amazonaws.com"
}
}
}
]
})
}
```
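The 90/10 canary split shown in the outcome is expressed as a weighted listener action. A sketch, assuming `orders_v1` and `orders_v2` target groups already exist (both hypothetical):

```hcl
resource "aws_vpclattice_listener" "orders" {
  name               = "https"
  protocol           = "HTTPS"
  port               = 443
  service_identifier = aws_vpclattice_service.orders.id

  default_action {
    forward {
      target_groups {
        target_group_identifier = aws_vpclattice_target_group.orders_v1.id # hypothetical
        weight                  = 90
      }
      target_groups {
        target_group_identifier = aws_vpclattice_target_group.orders_v2.id # hypothetical
        weight                  = 10
      }
    }
  }
}
```

Adjusting the weights (as the `terraform apply -var` example above does) shifts traffic without touching DNS or clients.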
Why VPC Lattice matters:

```
BEFORE (with Transit Gateway + ALBs):
- Create TGW, attach VPCs
- Create ALB in each VPC
- Manage Security Groups between VPCs
- Build custom auth (mTLS, JWT, etc.)
- Set up monitoring per service
- Total: 50+ resources, complex routing
AFTER (with VPC Lattice):
- Create service network
- Register services
- Define auth policies (IAM-native!)
- Built-in metrics and access logs
- Total: ~10 resources, declarative
```
**Learning milestones**:
1. **Services communicate across VPCs** → You understand the Lattice overlay
2. **IAM policies restrict access** → You understand zero-trust auth
3. **Traffic splitting works** → You understand L7 capabilities
4. **You can monitor service-to-service calls** → You understand observability
---
## Project 11: PrivateLink Service Provider
- **File**: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
- **Main Programming Language**: Terraform
- **Alternative Programming Languages**: AWS CDK, CloudFormation
- **Coolness Level**: Level 4: Hardcore Tech Flex
- **Business Potential**: 4. The "Open Core" Infrastructure
- **Difficulty**: Level 4: Expert
- **Knowledge Area**: Private Networking / SaaS Architecture
- **Software or Tool**: AWS PrivateLink, Network Load Balancer
- **Main Book**: "AWS Certified Advanced Networking Study Guide"
**What you’ll build**: A PrivateLink service that exposes your application to other AWS accounts privately—the same way AWS exposes services like S3 and DynamoDB. Customers connect via Interface Endpoints without any public internet exposure.

**Why it teaches AWS networking**: PrivateLink is how you build "AWS-native" SaaS. Understanding how to become a service provider (not just a consumer) teaches you about NLBs, endpoint services, cross-account networking, and how AWS builds its own services.

**Core challenges you’ll face**:
- **NLB as PrivateLink backend** (why NLB, not ALB?) → maps to *PrivateLink architecture*
- **Cross-account access** (allowlisting consumer accounts) → maps to *multi-tenant SaaS*
- **DNS for endpoints** (private hosted zones, endpoint DNS) → maps to *private DNS*
- **Connection acceptance** (auto vs manual) → maps to *security controls*
- **High availability** (multi-AZ endpoints) → maps to *reliability*
**Key Concepts**:
- **PrivateLink Services**: AWS PrivateLink Documentation
- **Endpoint Services**: Creating Endpoint Services
- **NLB Requirements**: PrivateLink NLB Requirements

**Difficulty**: Expert
**Time estimate**: 2 weeks
**Prerequisites**: NLB, VPC Endpoints, cross-account IAM

**Real world outcome**:
```bash
$ terraform apply -target=module.provider
PROVIDER ACCOUNT (your SaaS):
VPC: vpc-provider
NLB: nlb-myservice (internal)
Listeners: 443 → target-group (your app)
Endpoint Service: vpce-svc-abc123
Service Name: com.amazonaws.vpce.us-east-1.vpce-svc-abc123
Allowed Principals: [arn:aws:iam::CUSTOMER_ACCOUNT:root]
$ terraform apply -target=module.consumer
CONSUMER ACCOUNT (customer):
VPC: vpc-consumer
Interface Endpoint: vpce-def456
Service: com.amazonaws.vpce.us-east-1.vpce-svc-abc123
Subnets: [subnet-a, subnet-b]
DNS: myservice.vpce.local → 10.1.1.100, 10.1.2.100
# From customer's VPC
$ ssh customer-server
[customer]$ nslookup myservice.vpce.local
10.1.1.100 (ENI in subnet-a)
10.1.2.100 (ENI in subnet-b)
[customer]$ curl https://myservice.vpce.local/api/health
{"status": "healthy", "provider": "MyService SaaS"}
# Traffic never touches internet!
[customer]$ traceroute myservice.vpce.local
1 10.1.1.100 (vpce ENI) 0.5ms # Direct to PrivateLink ENI
# No internet hops - traffic stays in AWS backbone
# Provider sees customer connection
$ aws ec2 describe-vpc-endpoint-connections
{
"VpcEndpointConnections": [{
"VpcEndpointId": "vpce-def456",
"VpcEndpointOwner": "CUSTOMER_ACCOUNT",
"VpcEndpointState": "available",
"CreationTimestamp": "2024-01-15T10:30:00Z"
}]
}
```

**Implementation Hints**: PrivateLink architecture (provider side):

```
PROVIDER VPC
┌─────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │
│ │ Your App │◀───│ Target │◀───│ Network Load │ │
│ │ (ECS/EC2) │ │ Group │ │ Balancer │ │
│ └─────────────┘ └─────────────┘ └────────┬────────┘ │
│ 10.0.1.x │ │
│ │ │
│ ┌───────────────────▼────────┐ │
│ │ VPC Endpoint Service │ │
│ │ vpce-svc-abc123 │ │
│ │ │ │
│ │ Allowed: CUSTOMER_ACCOUNT │ │
│ └────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
│ AWS PrivateLink
│ (private connection)
│
CONSUMER VPC ▼
┌─────────────────────────────────────────────────────────────┐
│ ┌─────────────────┐ ┌────────────────────────────────┐ │
│ │ Interface │───▶│ Customer App │ │
│ │ Endpoint │ │ curl myservice.vpce.local │ │
│ │ vpce-def456 │ └────────────────────────────────┘ │
│ │ 10.1.1.100 │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```

Key Terraform resources:

```hcl
# Provider side
resource "aws_vpc_endpoint_service" "myservice" {
acceptance_required = false # or true for manual approval
network_load_balancer_arns = [aws_lb.internal.arn]
allowed_principals = [
"arn:aws:iam::CUSTOMER_ACCOUNT:root"
]
}
# Consumer side
resource "aws_vpc_endpoint" "myservice" {
vpc_id = aws_vpc.consumer.id
service_name = "com.amazonaws.vpce.us-east-1.vpce-svc-abc123"
vpc_endpoint_type = "Interface"
subnet_ids = [aws_subnet.consumer_a.id, aws_subnet.consumer_b.id]
security_group_ids = [aws_security_group.endpoint.id]
private_dns_enabled = false # Use custom DNS instead
}
# Custom DNS in consumer VPC
resource "aws_route53_zone" "private" {
name = "vpce.local"
vpc {
vpc_id = aws_vpc.consumer.id
}
}
resource "aws_route53_record" "myservice" {
zone_id = aws_route53_zone.private.zone_id
name = "myservice.vpce.local"
type = "A"
alias {
name = aws_vpc_endpoint.myservice.dns_entry[0].dns_name
zone_id = aws_vpc_endpoint.myservice.dns_entry[0].hosted_zone_id
evaluate_target_health = true
}
}
```
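The endpoint service above references `aws_lb.internal`, which must be a network load balancer. A minimal sketch, with hypothetical subnets and target group:

```hcl
resource "aws_lb" "internal" {
  name               = "nlb-myservice"
  internal           = true
  load_balancer_type = "network"
  subnets            = [aws_subnet.provider_a.id, aws_subnet.provider_b.id] # hypothetical
}

resource "aws_lb_listener" "tls" {
  load_balancer_arn = aws_lb.internal.arn
  port              = 443
  protocol          = "TCP" # terminate TLS in the app, or use "TLS" with a certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn # hypothetical
  }
}
```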
**Why NLB (not ALB)?** PrivateLink endpoint services are built on NLB because:
1. It operates at Layer 4, so any TCP protocol works (not just HTTP)
2. It provides static IPs per AZ (stable endpoints)
3. It delivers extremely high performance

Note that the original client IP is *not* preserved through PrivateLink by default; enable Proxy Protocol v2 on the NLB target group if your application needs it. If you need HTTP features (path routing, headers), chain load balancers: NLB → ALB → App (NLB for PrivateLink, ALB for L7).
**Learning milestones**:
1. **Endpoint service is created** → You understand the provider side
2. **Consumer connects via Interface Endpoint** → You understand cross-account access
3. **Traffic is private (no internet route)** → You verify PrivateLink privacy
4. **Multiple consumers can connect** → You understand multi-tenant SaaS
---
## Project 12: Complete Multi-Region Network Architecture (Capstone)
- **File**: AWS_NETWORKING_DEEP_DIVE_PROJECTS.md
- **Main Programming Language**: Terraform
- **Alternative Programming Languages**: Pulumi, AWS CDK
- **Coolness Level**: Level 5: Pure Magic
- **Business Potential**: 4. The "Open Core" Infrastructure
- **Difficulty**: Level 5: Master
- **Knowledge Area**: Enterprise Architecture / Global Networking
- **Software or Tool**: All AWS Networking Services
- **Main Book**: "AWS Certified Solutions Architect Professional Study Guide"
**What you’ll build**: A production-grade, multi-region, multi-account network with Transit Gateway peering, centralized egress, Network Firewall, hybrid connectivity, and complete observability—the kind of network that runs Fortune 500 companies.

**Why it teaches AWS networking**: This is the synthesis of everything. You’ll face real architectural decisions: where to place firewalls, how to handle cross-region traffic, when to use which connectivity option, and how to make it all observable and maintainable.

**Core challenges you’ll face**:
- **Multi-region Transit Gateway peering** → maps to *global networking*
- **Centralized egress per region** → maps to *inspection architecture*
- **Cross-region data transfer optimization** → maps to *cost optimization*
- **Failover and disaster recovery** → maps to *reliability*
- **Unified monitoring and logging** → maps to *observability at scale*
**Key Concepts**:
- **AWS Multi-Region Architecture**: AWS Global Infrastructure
- **Transit Gateway Inter-Region Peering**: TGW Peering Documentation
- **Well-Architected Framework - Reliability Pillar**: AWS Well-Architected

**Difficulty**: Master
**Time estimate**: 1-2 months
**Prerequisites**: All previous projects

**Real world outcome**:
```bash
╔══════════════════════════════════════════════════════════════════════════╗
║ ENTERPRISE MULTI-REGION NETWORK ║
╠══════════════════════════════════════════════════════════════════════════╣
US-EAST-1 EU-WEST-1
┌─────────────────────┐ ┌─────────────────────┐
│ │ │ │
ON-PREMISES │ ┌───────────────┐ │ │ ┌───────────────┐ │
┌──────────┐ │ │ Transit │ │ │ │ Transit │ │
│ Data │─VPN─┼──│ Gateway │◀─┼─PEER─┼─▶│ Gateway │ │
│ Center │ │ │ (us-east-1) │ │ │ │ (eu-west-1) │ │
└──────────┘ │ └───────┬───────┘ │ │ └───────┬───────┘ │
│ │ │ │ │ │
│ ┌───────┴───────┐ │ │ ┌───────┴───────┐ │
│ │ │ │ │ │ │ │
│ ▼ ▼ │ │ ▼ ▼ │
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Shared Svcs │ │ Production │ │ Shared Svcs │ │ Production │
│ VPC │ │ VPC │ │ VPC │ │ VPC │
│ ┌─────────┐ │ │ │ │ ┌─────────┐ │ │ │
│ │ NFW │ │ │ ┌─────────┐ │ │ │ NFW │ │ │ ┌─────────┐ │
│ │ NAT │ │ │ │ App │ │ │ │ NAT │ │ │ │ App │ │
│ │ Endpoint│ │ │ │ Cluster │ │ │ │ Endpoint│ │ │ │ Cluster │ │
│ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │ │
└─────────────────────┘ └─────────────────────┘
ACCOUNTS:
- Network Account: Transit Gateways, Shared Services VPCs, Firewalls
- Security Account: GuardDuty, Security Hub, Flow Logs aggregation
- Production Account: Application workloads
- Development Account: Dev/test workloads (isolated)
TRAFFIC FLOWS:
┌────────────────────────────────────────────────────────────────────────┐
│ App (Prod US) → TGW → Shared Services VPC → NFW → NAT → Internet │
│ App (Prod EU) → App (Prod US): TGW EU → TGW Peering → TGW US → App │
│ On-Prem → AWS: VPN → TGW US → Shared Services → Production VPC │
└────────────────────────────────────────────────────────────────────────┘
$ ./network-dashboard
╔════════════════════════════════════════════════════════════════════════╗
║ GLOBAL NETWORK HEALTH DASHBOARD ║
╠════════════════════════════════════════════════════════════════════════╣
║ ║
║ REGIONS ║
║ ─────────────────────────────────────────────────────────────────────║
║ us-east-1: ✓ Healthy ║
║ - Transit Gateway: 12 attachments, 4.2 Gbps throughput ║
║ - Network Firewall: 156K flows/min, 23 blocks ║
║ - VPN Tunnels: 2/2 UP (BGP routes: 15) ║
║ ║
║ eu-west-1: ✓ Healthy ║
║ - Transit Gateway: 8 attachments, 2.1 Gbps throughput ║
║ - Network Firewall: 89K flows/min, 12 blocks ║
║ - TGW Peering: UP (latency: 72ms to us-east-1) ║
║ ║
║ CROSS-REGION TRAFFIC (Last 24h) ║
║ ─────────────────────────────────────────────────────────────────────║
║ us-east-1 ↔ eu-west-1: 1.2 TB ║
║ Cost: $24.00 (at $0.02/GB) ║
║ ║
║ ALERTS ║
║ ─────────────────────────────────────────────────────────────────────║
║ ⚠️ VPN Tunnel 1 latency spike: 180ms → 95ms (recovered) ║
║ ⚠️ Network Firewall blocked crypto miner: 10.1.2.50 → pool.mining.com║
║ ║
║ COMPLIANCE ║
║ ─────────────────────────────────────────────────────────────────────║
║ ✓ All flow logs enabled and shipped to Security Account ║
║ ✓ No direct internet access from production VPCs ║
║ ✓ All cross-account traffic via Transit Gateway (auditable) ║
║ ✓ Network Firewall IDS rules: 2,456 signatures active ║
║ ║
╚════════════════════════════════════════════════════════════════════════╝
```

**Implementation Hints**: This project combines all previous projects. Key architectural decisions:

1. **Transit Gateway per region with peering**:

```hcl
resource "aws_ec2_transit_gateway_peering_attachment" "us_eu" {
  provider                = aws.us_east_1
  peer_region             = "eu-west-1"
  peer_transit_gateway_id = aws_ec2_transit_gateway.eu.id
  transit_gateway_id      = aws_ec2_transit_gateway.us.id
}
```

2. **Centralized egress with inspection**:

```
Traffic: App VPC → TGW → Firewall VPC → NFW → NAT → Internet
Return:  Internet → NAT → NFW → TGW → App VPC

All egress inspected by Network Firewall
All traffic logged via Flow Logs
```

3. **Route table segmentation for security**:

```
TGW Route Tables:
- "production":      routes to shared services + prod VPCs, NO dev access
- "development":     routes to shared services + dev VPCs, NO prod access
- "shared-services": routes to all VPCs (hub)
```

4. **Observability pipeline**:

```
VPC Flow Logs         → CloudWatch Logs → Kinesis Firehose → S3 (Security Account)
Network Firewall Logs → CloudWatch Logs → Athena/QuickSight
TGW Flow Logs         → CloudWatch Logs → SIEM Integration
```
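Decision 3 (route table segmentation) is enforced with explicit TGW route tables, associations, and propagations. A minimal sketch for the production table, with hypothetical attachment names; the development table is the mirror image, propagating shared services but never prod:

```hcl
resource "aws_ec2_transit_gateway_route_table" "production" {
  transit_gateway_id = aws_ec2_transit_gateway.us.id
}

# Prod attachments consult the prod route table...
resource "aws_ec2_transit_gateway_route_table_association" "prod" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.prod.id # hypothetical
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.production.id
}

# ...and only learn routes propagated from shared services, never from dev
resource "aws_ec2_transit_gateway_route_table_propagation" "shared_to_prod" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.shared.id # hypothetical
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.production.id
}
```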
**Learning milestones**:
1. **Multi-region TGW peering works** → You understand global connectivity
2. **All egress flows through the central firewall** → You understand inspection
3. **Prod/Dev isolation is enforced** → You understand segmentation
4. **Complete observability in the Security Account** → You understand enterprise monitoring
---
## Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Production VPC from Scratch | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 2. Security Group Debugger | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 3. VPC Flow Logs Analyzer | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 4. VPC Peering vs TGW Lab | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| 5. NAT Gateway Cost Optimizer | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 6. Multi-Account Network | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7. Site-to-Site VPN Lab | Advanced | 1-2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 8. Network Firewall Deployment | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9. Global Accelerator & CloudFront | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10. VPC Lattice Service Mesh | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 11. PrivateLink Service Provider | Expert | 2 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 12. Multi-Region Architecture | Master | 1-2 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
---
## Recommended Learning Path

### For Beginners (Start Here)
- Project 1: Production VPC - The foundation of everything
- Project 2: Security Group Debugger - Understand traffic flow
- Project 5: NAT Gateway Optimizer - Learn about VPC Endpoints
### For Intermediate Cloud Engineers
- Project 3: VPC Flow Logs Analyzer - Visibility into your network
- Project 4: VPC Peering vs TGW Lab - Key architectural decision
- Project 9: Global Accelerator & CloudFront - Edge networking
### For Advanced Engineers
- Project 7: Site-to-Site VPN - Hybrid connectivity
- Project 8: Network Firewall - Perimeter security
- Project 6: Multi-Account Network - Enterprise patterns
### For Experts
- Project 10: VPC Lattice - Modern service networking
- Project 11: PrivateLink Provider - SaaS architecture
- Project 12: Multi-Region Architecture - Everything combined
---
## Essential Resources

### AWS Documentation
- AWS VPC User Guide
- Building Scalable and Secure Multi-VPC Infrastructure
- AWS Hybrid Connectivity Whitepaper
- VPC Security Best Practices
### Books
- “Computer Networks, Fifth Edition” by Tanenbaum - Networking fundamentals
- “The Practice of Network Security Monitoring” by Richard Bejtlich - Security analysis
- “High Performance Browser Networking” by Ilya Grigorik - Edge networking
---
## Summary
| # | Project | Main Language |
|---|---|---|
| 1 | Build a Production-Ready VPC from Scratch | Terraform |
| 2 | Security Group Traffic Flow Debugger | Python |
| 3 | VPC Flow Logs Analyzer | Python |
| 4 | VPC Peering vs Transit Gateway Lab | Terraform |
| 5 | NAT Gateway Deep Dive & Cost Optimizer | Python |
| 6 | Multi-Account Network with AWS Organizations | Terraform |
| 7 | Site-to-Site VPN with Simulated On-Premises | Terraform |
| 8 | AWS Network Firewall Deployment | Terraform |
| 9 | Global Accelerator & CloudFront Edge Networking | Terraform |
| 10 | VPC Lattice for Service-to-Service Networking | Terraform |
| 11 | PrivateLink Service Provider | Terraform |
| 12 | Complete Multi-Region Network Architecture | Terraform |
---
## Why This Path Works
By completing these projects, you won’t just “know” AWS networking—you’ll understand it deeply:
- You’ll understand isolation (VPCs, subnets, Security Groups)
- You’ll understand connectivity (Peering, Transit Gateway, VPN, Direct Connect)
- You’ll understand security (NACLs, Network Firewall, PrivateLink)
- You’ll understand scale (multi-account, multi-region)
- You’ll understand optimization (VPC Endpoints, cost analysis)
- You’ll understand observability (Flow Logs, metrics, debugging)
Most importantly, you’ll be able to design networks for real enterprises—the kind that handle millions of requests, maintain security compliance, and scale globally.
Build these projects, and you’ll go from “AWS user” to “AWS network architect.”