Project 2: Build a Load Balancer
Build a TCP/HTTP load balancer that routes traffic, detects failures, and exposes operational metrics.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2-3 weeks |
| Main Programming Language | Go |
| Alternative Programming Languages | Rust, C, Python |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | Service and Support Model |
| Prerequisites | TCP sockets, HTTP basics, concurrency |
| Key Topics | Load balancing algorithms, health checks, failure detection, backpressure |
1. Learning Objectives
By completing this project, you will:
- Implement a TCP and HTTP reverse proxy with routing logic.
- Compare round-robin, least-connections, and weighted algorithms.
- Design active and passive health checks with timeouts.
- Implement safe shared state for concurrent routing decisions.
- Expose operational metrics and observe system behavior under load.
- Explain L4 vs L7 load balancing trade-offs in interviews.
2. All Theory Needed (Per-Concept Breakdown)
2.1 L4 vs L7 Load Balancing
Description / Expanded Explanation
Layer 4 load balancing routes by TCP/UDP metadata (IP, port), while Layer 7 routes by application data (HTTP headers, paths). L4 is faster and simpler; L7 is more flexible and enables routing by URL or headers.
Definitions & Key Terms
- L4 -> transport layer routing using IP/port
- L7 -> application layer routing using request content
- reverse proxy -> a server that forwards client requests to backends
- connection vs request -> L4 routes connections, L7 routes requests
Mental Model Diagram (ASCII)
Client -> LB (L4) -> Backend
LB (L7) -> Backend based on HTTP path
How It Works (Step-by-Step)
- Client connects to load balancer.
- L4 picks backend immediately and forwards bytes.
- L7 reads HTTP request to select backend.
- Responses are proxied back to the client.
Minimal Concrete Example
GET /api/users HTTP/1.1
Host: example.com
Route by path prefix for L7, or by connection for L4.
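The L7 decision above can be sketched as a small Go routing function. This is illustrative only: the function name and backend pool URLs are assumptions, not part of the spec.

```go
package main

import (
	"fmt"
	"strings"
)

// pickByPath illustrates L7 routing: the balancer inspects the
// request path (application data) and chooses a backend by prefix.
// An L4 balancer could not do this, since it never reads the request.
func pickByPath(path string) string {
	switch {
	case strings.HasPrefix(path, "/api/"):
		return "http://127.0.0.1:9001" // API pool
	case strings.HasPrefix(path, "/static/"):
		return "http://127.0.0.1:9002" // static asset pool
	default:
		return "http://127.0.0.1:9003" // catch-all pool
	}
}

func main() {
	fmt.Println(pickByPath("/api/users"))
	fmt.Println(pickByPath("/static/app.js"))
}
```

Note this only selects the target; a real L7 proxy must also parse the full request and forward it, which is the CPU cost mentioned above.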
Common Misconceptions
- “L7 is always better” -> L7 costs CPU and adds latency.
- “L4 cannot be smart” -> L4 can still use metrics like connection count.
Check-Your-Understanding Questions
- Why is L7 more expensive than L4?
- What data does L4 use to make a routing decision?
- When would L4 be preferred over L7?
Where You’ll Apply It
- See 3.2 and 3.4 for routing features.
- See 4.1 for architecture split between L4 and L7.
- Also used in: P08 TCP Socket Server
2.2 Load Balancing Algorithms
Description / Expanded Explanation
The algorithm decides which backend handles the next request. Simple algorithms are easy to implement but can mis-handle uneven load. Advanced algorithms need more state and metrics.
Definitions & Key Terms
- round-robin -> rotate through backends in order
- least-connections -> choose backend with fewest active connections
- weighted round-robin -> distribute traffic proportional to weights
- sticky sessions -> route same client to same backend
Mental Model Diagram (ASCII)
Backends: A B C
Round-robin: A B C A B C
Weighted: A A B C A A B C
How It Works (Step-by-Step)
- Maintain backend list with health and metrics.
- Select based on algorithm.
- Update connection counters and response metrics.
- If request fails, retry or mark backend unhealthy.
Minimal Concrete Example
idx := atomic.AddUint32(&rr, 1)
backend := backends[idx%uint32(len(backends))]
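The weighted pattern in the diagram above (A A B C repeating for weights 2,1,1) can be produced by expanding weights into a repeating schedule. A sketch with illustrative names; simple and correct, at the cost of a schedule whose length is the sum of the weights (smooth weighted round-robin, as in nginx, interleaves better but needs more state):

```go
package main

import "fmt"

// buildSchedule expands each backend's weight into repeated slots,
// so indexing the schedule round-robin yields the weighted order.
func buildSchedule(backends []string, weights []int) []string {
	var sched []string
	for i, b := range backends {
		for j := 0; j < weights[i]; j++ {
			sched = append(sched, b)
		}
	}
	return sched
}

func main() {
	sched := buildSchedule([]string{"A", "B", "C"}, []int{2, 1, 1})
	for i := 0; i < 8; i++ {
		fmt.Print(sched[i%len(sched)], " ") // A A B C A A B C
	}
	fmt.Println()
}
```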
Common Misconceptions
- “Round-robin is fair” -> it ignores backend speed and connection length.
- “Sticky sessions are free” -> they reduce balancing effectiveness.
Check-Your-Understanding Questions
- Why can least-connections be better for long requests?
- How do weights affect distribution?
- What happens to sticky sessions when a backend dies?
Where You’ll Apply It
- See 3.2 for algorithm requirements.
- See 4.4 for routing logic.
- Also used in: P04 Raft Consensus for leader selection logic
2.3 Health Checks and Failure Detection
Description / Expanded Explanation
A load balancer must detect failed backends quickly without overreacting to temporary slowness. Health checks can be active (polling endpoints) or passive (observing failures).
Definitions & Key Terms
- active health check -> periodic probe request
- passive health check -> mark unhealthy after failed requests
- failure threshold -> number of consecutive failures before down
- recovery window -> time before retrying a failed backend
Mental Model Diagram (ASCII)
healthy -> fail -> fail -> fail -> DOWN
DOWN -> probe -> success -> HEALTHY
How It Works (Step-by-Step)
- Run checks every N seconds.
- On failure, increment counter.
- If failures exceed threshold, mark DOWN.
- Periodically probe DOWN backends for recovery.
Minimal Concrete Example
if failures >= 3 {
backend.healthy = false
}
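The threshold-and-recovery steps above can be sketched as a small state tracker. Names here are illustrative, not a required API:

```go
package main

import "fmt"

// healthTracker applies the failure-threshold logic: consecutive
// failures past the threshold mark the backend DOWN, and a single
// successful probe resets the counter and restores it.
type healthTracker struct {
	failures  int
	threshold int
	healthy   bool
}

func (h *healthTracker) observe(ok bool) {
	if ok {
		h.failures = 0 // any success resets the streak
		h.healthy = true
		return
	}
	h.failures++
	if h.failures >= h.threshold {
		h.healthy = false // threshold reached: mark DOWN
	}
}

func main() {
	h := &healthTracker{threshold: 3, healthy: true}
	for _, ok := range []bool{false, false, false} {
		h.observe(ok)
	}
	fmt.Println(h.healthy) // false: three consecutive failures
	h.observe(true)
	fmt.Println(h.healthy) // true: recovery probe succeeded
}
```

Counting only consecutive failures is what prevents a single transient error from taking a backend out of rotation.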
Common Misconceptions
- “One failure means down” -> transient errors are normal.
- “Health checks should be heavy” -> use lightweight endpoints.
Check-Your-Understanding Questions
- Why use a failure threshold instead of one failed probe?
- What happens if health checks are too frequent?
- How do you avoid flapping backends?
Where You’ll Apply It
- See 3.2 and 3.6 for requirements and edge cases.
- See 5.10 Phase 2 for health check implementation.
- Also used in: P04 Raft Consensus
2.4 Connection Lifecycle and Backpressure
Description / Expanded Explanation
Load balancers must handle many concurrent connections without exhausting resources. Backpressure prevents the system from accepting more work than it can handle, and timeouts ensure slow backends do not block the entire proxy.
Definitions & Key Terms
- keep-alive -> reuse TCP connections for multiple requests
- backpressure -> signaling upstream to slow down
- timeout -> bound on waiting for backend response
- connection pool -> reuse outbound connections to backends
Mental Model Diagram (ASCII)
client -> [LB] -> backend
           ^ queue limits and timeouts
How It Works (Step-by-Step)
- Accept client connection.
- If queue is full, reject with 503.
- Proxy to backend with timeout.
- If backend is slow, cancel and retry or fail.
Minimal Concrete Example
ctx, cancel := context.WithTimeout(req.Context(), 2*time.Second)
Common Misconceptions
- “More concurrency always helps” -> too many open connections exhaust file descriptors.
- “Timeouts are optional” -> slow backends can stall the whole balancer.
Check-Your-Understanding Questions
- What happens when file descriptor limits are exceeded?
- Why is backpressure important in distributed systems?
- What is the trade-off between retries and latency?
Where You’ll Apply It
- See 3.3 and 3.6 for performance and edge cases.
- See 7.3 for performance traps.
- Also used in: P08 TCP Socket Server
2.5 Concurrency and Shared State
Description / Expanded Explanation
Routing decisions require shared state such as backend health and connection counts. In concurrent systems this state must be consistent and safe without becoming a bottleneck.
Definitions & Key Terms
- atomic -> CPU-level operation that is indivisible
- mutex -> lock to guard shared state
- race condition -> incorrect behavior caused by unsynchronized access
- read-write lock -> allows multiple readers, single writer
Mental Model Diagram (ASCII)
requests -> [router]
| state: backends, counters |
How It Works (Step-by-Step)
- Each request chooses a backend using shared state.
- Update connection counts atomically or via lock.
- Health checker updates backend status periodically.
- Protect state with locks or immutable snapshots.
Minimal Concrete Example
mu.Lock()
backend.active++
mu.Unlock()
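For state that changes rarely but is read on every request, such as the backend list during a config reload, an immutable snapshot swapped atomically avoids locks on the hot path. A sketch using sync/atomic; the function names are illustrative:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// backends holds an immutable []string snapshot. Readers Load it
// without locks; reload Stores a whole new slice, so no reader ever
// sees a half-updated list.
var backends atomic.Value

func reload(newList []string) {
	cp := make([]string, len(newList))
	copy(cp, newList) // private copy: callers cannot mutate the snapshot later
	backends.Store(cp)
}

func snapshot() []string {
	return backends.Load().([]string)
}

func main() {
	reload([]string{"http://127.0.0.1:9001", "http://127.0.0.1:9002"})
	fmt.Println(len(snapshot())) // 2
	reload([]string{"http://127.0.0.1:9003"})
	fmt.Println(snapshot()[0])
}
```

Requests that loaded the old snapshot finish against the old list; new requests see the new one, which is exactly the hot-reload behavior the spec asks for.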
Common Misconceptions
- “Go avoids concurrency bugs” -> data races still exist.
- “Lock everything” -> heavy locks can destroy throughput.
Check-Your-Understanding Questions
- Why is an atomic counter sufficient for round-robin index?
- When do you need a full mutex instead of atomics?
- How do you prevent races when reloading config?
Where You’ll Apply It
- See 4.2 and 4.4 for component responsibilities.
- See 5.10 Phase 2 for concurrency design.
- Also used in: P07 Lock-Free Queue
3. Project Specification
3.1 What You Will Build
A load balancer that accepts incoming TCP/HTTP connections and distributes them across multiple backend servers. It supports multiple algorithms, health checks, and exposes metrics. It is a standalone CLI service.
3.2 Functional Requirements
- Routing algorithms: round-robin, least-connections, weighted.
- Health checks: active and passive with configurable intervals.
- Sticky sessions: IP hash based routing.
- Metrics endpoint: JSON stats at /metrics.
- Config reload: hot reload without dropping existing connections.
- Graceful shutdown: stop accepting new connections, drain old ones.
3.3 Non-Functional Requirements
- Performance: add less than 5 ms overhead per request.
- Reliability: failover within 3 consecutive failed checks.
- Usability: clear logs and deterministic tests.
3.4 Example Usage / Output
$ ./loadbalancer --config lb.yaml --port 8080
[LB] listening on :8080
[LB] algorithm=round-robin backends=3
3.5 Data Formats / Schemas / Protocols
Config file lb.yaml:
listen: 8080
algorithm: round-robin
health_check:
  interval_ms: 5000
  timeout_ms: 800
  failure_threshold: 3
backends:
  - url: http://127.0.0.1:9001
    weight: 2
  - url: http://127.0.0.1:9002
    weight: 1
Metrics JSON:
{
  "total_requests": 1204,
  "backends": [
    {"url":"http://127.0.0.1:9001","healthy":true,"active":3},
    {"url":"http://127.0.0.1:9002","healthy":true,"active":1}
  ]
}
Error JSON (unified shape):
{"error":"backend_unavailable","message":"no healthy backends"}
3.6 Edge Cases
- All backends unhealthy -> return 503 with error JSON.
- Backend slow -> timeout and retry once.
- Config reload while requests in flight.
- Client closes connection early.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
make
./backend --port 9001 --name A &
./backend --port 9002 --name B &
./loadbalancer --config lb.yaml --port 8080
3.7.2 Golden Path Demo (Deterministic)
The demo uses a fixed request order and deterministic backend responses.
3.7.3 CLI Transcript (Success)
$ ./loadbalancer --config lb.yaml --port 8080
[LB] algorithm=round-robin backends=2
$ curl http://localhost:8080/hello
Hello from A
$ curl http://localhost:8080/hello
Hello from B
$ curl http://localhost:8080/metrics
{"total_requests":2,"backends":[{"url":"http://127.0.0.1:9001","healthy":true,"active":0},{"url":"http://127.0.0.1:9002","healthy":true,"active":0}]}
$ echo $?
0
3.7.3 CLI Transcript (Failure)
$ pkill backend
$ curl http://localhost:8080/hello
{"error":"backend_unavailable","message":"no healthy backends"}
$ curl -sf http://localhost:8080/hello
$ echo $?
22
3.7.4 API Endpoints
Endpoints:
GET /metrics -> 200 JSON stats
POST /admin/reload -> reload config
Example success:
POST /admin/reload HTTP/1.1
Host: localhost:8080
HTTP/1.1 200 OK
{"status":"reloaded","backends":2}
Example error:
POST /admin/reload HTTP/1.1
Host: localhost:8080
HTTP/1.1 400 Bad Request
{"error":"config_invalid","message":"missing backends"}
3.7.5 Exit Codes
0 -> success
1 -> invalid config
2 -> port bind failure
4. Solution Architecture
4.1 High-Level Design
Client -> Listener -> Router -> Backend Pool -> Backend
                        |             |
                 Health Checker    Metrics
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Listener | Accept client connections | TCP or HTTP listener |
| Router | Select backend | algorithm strategy interface |
| Backend Pool | Track health and metrics | shared state with locks |
| Health Checker | Active probes | configurable interval |
| Metrics | JSON endpoint | expose internal counters |
4.3 Data Structures (No Full Code)
type Backend struct {
    URL     string
    Weight  int
    Healthy bool
    Active  int64
}
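Using this struct, least-connections selection is a linear scan over healthy backends, which is the O(n) case noted in the complexity analysis below. A sketch; the helper name is illustrative:

```go
package main

import "fmt"

// Backend mirrors the data structure sketched in 4.3.
type Backend struct {
	URL     string
	Weight  int
	Healthy bool
	Active  int64
}

// leastConnections scans healthy backends and returns the one with
// the fewest active connections, or nil if none are healthy (the
// caller should then answer 503).
func leastConnections(pool []*Backend) *Backend {
	var best *Backend
	for _, b := range pool {
		if !b.Healthy {
			continue
		}
		if best == nil || b.Active < best.Active {
			best = b
		}
	}
	return best
}

func main() {
	pool := []*Backend{
		{URL: "A", Healthy: true, Active: 3},
		{URL: "B", Healthy: true, Active: 1},
		{URL: "C", Healthy: false, Active: 0},
	}
	fmt.Println(leastConnections(pool).URL) // B: fewest active among healthy
}
```

Note that the unhealthy backend C is skipped even though it has the lowest count, which is why health filtering must happen before the algorithm runs.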
4.4 Algorithm Overview
Routing
- Snapshot healthy backends.
- Apply algorithm to choose backend.
- Proxy request and update metrics.
Complexity Analysis
- Time: O(n) for least-connections, O(1) for round-robin
- Space: O(n) backend state
5. Implementation Guide
5.1 Development Environment Setup
make
5.2 Project Structure
project-root/
├── cmd/loadbalancer/
├── internal/router/
├── internal/health/
├── internal/metrics/
└── tests/
5.3 The Core Question You’re Answering
“Given multiple servers, how do I route traffic fairly and safely when failures happen?”
5.4 Concepts You Must Understand First
- TCP connection lifecycle
- HTTP parsing and proxying
- Concurrency and shared state
- Health check thresholds
5.5 Questions to Guide Your Design
- How will you protect the backend list during reloads?
- What is the retry policy on failure?
- How do you prevent a slow backend from poisoning overall latency?
5.6 Thinking Exercise
Simulate 3 backends with weights 3,2,1. Write the first 12 backend selections.
5.7 The Interview Questions They’ll Ask
- Explain the difference between L4 and L7.
- How do you implement sticky sessions and what are trade-offs?
- How do you prevent cascading failures?
5.8 Hints in Layers
Hint 1: Start with a single-backend TCP proxy.
Hint 2: Add round-robin with an atomic counter.
Hint 3: Add health checks before implementing weighted algorithms.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| System design | Designing Data-Intensive Applications | Ch. 1 |
| Networking | The Linux Programming Interface | Ch. 59-61 |
| Reliability | Building Microservices | Ch. 11 |
5.10 Implementation Phases
Phase 1: Foundation (3-4 days)
Goals: TCP proxy and round-robin routing. Checkpoint: distribute requests across two backends.
Phase 2: Core Functionality (7-10 days)
Goals: health checks and metrics. Checkpoint: backend failure is detected and removed.
Phase 3: Polish and Edge Cases (4-6 days)
Goals: hot reload, sticky sessions, timeouts. Checkpoint: reload config without dropping connections.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Proxy style | full L7 vs L4 only | L7 | more interview relevant |
| Health checks | active only, active+passive | active+passive | faster failure detection |
| State protection | locks vs channels | locks | simplicity and clarity |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | algorithm correctness | round-robin distribution |
| Integration Tests | backend failover | kill backend, verify reroute |
| Load Tests | throughput | 1000 rps with hey/ab |
6.2 Critical Test Cases
- Round-robin distributes evenly across healthy backends.
- Least-connections routes to backend with smallest active count.
- Unhealthy backend is excluded within 3 checks.
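The first test case above can be made deterministic by driving the selector with an explicit index instead of shared mutable state. A sketch; the function name is an illustrative assumption:

```go
package main

import "fmt"

// roundRobin selects a backend purely from the call index, so a test
// can assert exact distribution without timing or goroutines.
func roundRobin(backends []string, i int) string {
	return backends[i%len(backends)]
}

func main() {
	backends := []string{"A", "B", "C"}
	counts := map[string]int{}
	for i := 0; i < 300; i++ {
		counts[roundRobin(backends, i)]++
	}
	// 300 selections over 3 backends must land exactly 100 each.
	fmt.Println(counts["A"], counts["B"], counts["C"]) // 100 100 100
}
```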
6.3 Test Data
requests: 1000
concurrency: 50
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Race on index | uneven distribution | atomic counter |
| Too aggressive health checks | false downs | increase timeout |
| No timeouts | hanging requests | set per-request timeout |
7.2 Debugging Strategies
- Enable verbose logs for routing decisions.
- Use tcpdump to verify L7 parsing.
- Run with race detector for concurrency bugs.
7.3 Performance Traps
- Excessive locking in hot path.
- Re-parsing HTTP headers multiple times.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add least-connections algorithm.
- Add basic metrics endpoint.
8.2 Intermediate Extensions
- Implement circuit breaker per backend.
- Support TLS termination.
8.3 Advanced Extensions
- Adaptive routing based on latency.
- Consistent hashing for cache affinity.
9. Real-World Connections
9.1 Industry Applications
- Edge proxies in CDNs.
- Service meshes and API gateways.
9.2 Related Open Source Projects
- HAProxy
- Nginx
- Envoy
9.3 Interview Relevance
- Core system design question across companies.
- Shows understanding of failure modes and trade-offs.
10. Resources
10.1 Essential Reading
- Designing Data-Intensive Applications (Chapter 1)
- Building Microservices (Chapter 11)
10.2 Video Resources
- Talks on load balancing and reliability
10.3 Tools & Documentation
hey, ab, tcpdump, pprof
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain L4 vs L7 load balancing.
- I can describe health check failure thresholds.
- I can justify a routing algorithm choice.
11.2 Implementation
- All routing algorithms work as specified.
- Health checks remove unhealthy backends.
- Metrics endpoint reports correct counts.
11.3 Growth
- I can diagram the architecture in an interview.
- I can explain how I would scale this further.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Accept connections and route to backends.
- Round-robin routing works.
- Health checks mark dead backends as down.
Full Completion:
- Supports weighted and least-connections.
- Exposes metrics endpoint.
- Supports hot reload.
Excellence (Going Above & Beyond):
- Adds circuit breaker and adaptive routing.
- Supports TLS termination and observability hooks.