Project 2: HTTP Connection Pool with Failure Injection

Build a reusable HTTP connection pool that survives server restarts, half-open sockets, and injected failures without leaking file descriptors.

Quick Reference

Attribute Value
Difficulty Intermediate-Advanced
Time Estimate 2-3 weeks
Main Programming Language C
Alternative Programming Languages Rust, Go
Coolness Level Level 4 - Production-grade networking
Business Potential Level 4 - Infrastructure library
Prerequisites TCP basics, sockets, non-blocking I/O, errno
Key Topics TCP states, timeouts, pooling, failure injection, backoff

1. Learning Objectives

By completing this project, you will:

  1. Build a connection pool that manages socket lifecycle safely.
  2. Implement non-blocking connect validation using SO_ERROR.
  3. Detect and evict half-open connections with health checks and timeouts.
  4. Apply backoff and circuit breaker patterns to prevent retry storms.
  5. Instrument and observe latency and failure rates.

2. All Theory Needed (Per-Concept Breakdown)

2.1 TCP Connection State Machine and Failure Modes

Fundamentals

TCP connections transition through well-defined states: LISTEN, SYN-SENT, ESTABLISHED, FIN-WAIT, TIME-WAIT, and more. These states exist to ensure reliable delivery and to handle ordering and retransmission. As a client, you typically see SYN-SENT to ESTABLISHED on connect, then FIN-WAIT on close. Failures such as connection resets, timeouts, and half-open connections appear when one side of the connection dies without cleanly closing. A half-open connection occurs when one side believes the connection is still alive but the other side has crashed or is unreachable. This is critical for connection pools because a pool can silently hold dead sockets unless it validates them.

Deep Dive into the concept

TCP is a stateful protocol implemented in the kernel. When a client initiates a connection, it sends a SYN and enters SYN-SENT. If the server replies with SYN-ACK, the client sends an ACK and enters ESTABLISHED. If no host answers, the SYN is retransmitted for a while and the attempt eventually fails with ETIMEDOUT; if the host is up but nothing is listening on the port, the client receives a RST and fails quickly with ECONNREFUSED. When a connection is closed gracefully, each side transitions through the FIN-WAIT or CLOSE-WAIT states. However, when the remote host crashes or a network path drops (a mere process crash still lets the remote kernel send a FIN or RST), the client's kernel does not immediately know; the connection can remain in ESTABLISHED for minutes while data cannot be delivered. The client only learns the connection is dead when a write fails (EPIPE, ECONNRESET) or a read times out. This is the half-open case.

In a connection pool, half-open sockets are dangerous because the pool assumes idle connections are healthy. If the server restarts, all idle connections become invalid. The next request might be sent on a dead connection and fail. A robust pool must detect this and evict dead sockets quickly. Strategies include:

  • Active health checks (periodic lightweight requests).
  • TCP keepalive (SO_KEEPALIVE), though default intervals are often too large.
  • Read and write timeouts to detect hung connections.
  • Validation on checkout: test the socket before reuse.

TIME-WAIT is another important state. When the client closes a connection, it often enters TIME-WAIT, holding the local port for 2*MSL. If your pool creates and destroys many connections rapidly, you can exhaust ephemeral ports. This is why pooling is important: it reduces connection churn. But pooling also increases the chance of half-open sockets if you do not validate. The correct design balances reuse with validation and eviction.

Understanding how TCP state interacts with OS limits also matters. A connection attempt can fail before TCP is even involved: socket() returns EMFILE when the process has exhausted its file descriptors. A connect() might still be in progress when you mark the socket as pooled; if you do not validate with SO_ERROR, you can mark a socket as usable when it is not. Another subtlety is that non-blocking sockets can report writable even if the connection failed. Only SO_ERROR tells the truth.
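One way to implement validation-on-checkout from the strategies above is a non-blocking peek on the idle socket. This is a sketch, not the project's required API; the function name is illustrative:

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Returns 1 if an idle socket still looks usable, 0 if it should be evicted.
 * On a healthy idle connection, a non-blocking peek finds no data and fails
 * with EAGAIN/EWOULDBLOCK. A return of 0 means the peer sent FIN; unexpected
 * bytes on an idle HTTP connection mean protocol desync, so evict those too. */
static int conn_looks_alive(int fd)
{
    char buf;
    ssize_t n = recv(fd, &buf, 1, MSG_PEEK | MSG_DONTWAIT);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 1;   /* nothing pending: connection looks healthy */
    /* n == 0: peer closed; n > 0: stray bytes; other errors: dead */
    return 0;
}
```

Note that this only detects a peer that closed cleanly or reset; a truly half-open connection (remote host crashed, no FIN ever sent) still passes this check and must be caught by read timeouts or active health checks.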

How this fits in projects

This concept defines how you detect dead sockets and why you need timeouts, health checks, and eviction policies. It is also relevant to Project 6 where deployment operations rely on a resilient connection layer.

Definitions & key terms

  • ESTABLISHED: TCP state indicating an active connection.
  • TIME-WAIT: State after close to ensure late packets are handled safely.
  • Half-open: One side thinks the connection is alive, the other is dead.
  • ECONNRESET: Connection reset by peer.

Mental model diagram (ASCII)

SYN-SENT -> ESTABLISHED -> FIN-WAIT -> TIME-WAIT -> CLOSED
                |
                +--> crash -> half-open (no FIN)

How it works (step-by-step, with invariants and failure modes)

  1. Client sends SYN, waits for SYN-ACK.
  2. Connection becomes ESTABLISHED.
  3. Server host crashes or the network partitions: no FIN is sent.
  4. Client writes: kernel returns EPIPE or timeout.
  5. Pool must mark connection dead and evict.

Minimal concrete example

int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 && errno != EINPROGRESS) {
    /* immediate failure: close fd and report */
}
// Later: poll for writability, then check SO_ERROR

Common misconceptions

  • “If connect returned 0, the connection is good forever.” False; it can die later.
  • “TCP keepalive always solves half-open.” Defaults are too slow for most pools.
  • “TIME-WAIT is a server-only problem.” Clients also enter TIME-WAIT.

Check-your-understanding questions

  1. Why can a connection be half-open even if both processes think it is alive?
  2. What error signals a reset, and when do you see it?
  3. Why does pooling reduce TIME-WAIT exhaustion?

Check-your-understanding answers

  1. Because failures can occur without a clean FIN, so one side is unaware.
  2. ECONNRESET or EPIPE when writing to a dead socket.
  3. Pooling reuses sockets, reducing frequent close/open cycles.

Real-world applications

  • HTTP clients and API gateways.
  • Database connection pools.
  • RPC frameworks that reuse TCP connections.

Where you will apply it

  • Project 2: See §3.2 Functional Requirements and §5.10 Phase 2.
  • Project 6: Reuse connection pool for deployment operations.
  • Also used in: P06 Deployment Pipeline Tool.

References

  • “TCP/IP Illustrated Vol. 1” (Stevens/Fall), TCP state chapters.
  • man 7 tcp for Linux-specific behavior.

Key insights

Connection pools must model TCP failure states, not just happy paths.

Summary

TCP state drives connection validity. Without explicit validation, a pool will reuse dead sockets and fail unpredictably.

Homework/exercises to practice the concept

  1. Start a server, open a connection, then kill the server and observe behavior on write.
  2. Observe TIME-WAIT using ss -tan after many short-lived connections.
  3. Enable SO_KEEPALIVE and compare default intervals.

Solutions to the homework/exercises

  1. You will see EPIPE or ECONNRESET on write, often after a timeout.
  2. Many short-lived connections leave sockets in TIME-WAIT, showing why pooling helps.
  3. Default keepalive intervals are typically hours, too slow for pools.

2.2 Non-Blocking Connect and SO_ERROR Validation

Fundamentals

Non-blocking sockets allow you to initiate a connection without blocking the thread. connect() returns immediately, often with EINPROGRESS. You must then wait for the socket to become writable using poll() or select(). However, writability does not mean the connection succeeded. The only reliable way to know if the connection succeeded is to call getsockopt() with SO_ERROR. If the error is 0, the connection is established. If it is non-zero, the connection failed. This pattern is essential for connection pools to avoid marking failed sockets as healthy.

Deep Dive into the concept

When you call connect() on a non-blocking socket, the kernel begins the TCP handshake asynchronously. The socket becomes writable when the handshake either completes or fails. This is why writable readiness alone is ambiguous. getsockopt(SO_ERROR) returns the pending error on the socket and then clears it. A value of 0 indicates success; any other value is the errno for the failure (ECONNREFUSED, ETIMEDOUT, ENETUNREACH).

Connection pools must integrate this pattern into their lifecycle. A pool should consider a connection “pending” until SO_ERROR is 0. Only after that can it be placed in the idle pool. If you skip this, you can create a pool full of sockets that never finished connecting. This is a common failure pattern that only appears under load, when connection attempts are frequent and timeouts overlap.

Timeout handling is also important. poll() with a timeout gives you a maximum connect time. If it times out, you must close the socket and record the failure. You should also implement a backoff strategy, because repeated connect timeouts can cause a thundering herd. The pool should expose metrics for connect failures and durations so you can see when a target is unhealthy.

Another subtle point is the use of connect() on already connected sockets. If a socket is reused incorrectly or not fully reset, a second connect call can return EISCONN or other errors. A pool must ensure that a socket transitions to a valid idle state and is cleaned before reuse. This includes clearing pending errors, draining unread data, and resetting per-connection timeouts.

How this fits in projects

This concept is core to Project 2’s connection creation and validation. It also informs the deployment pipeline in Project 6 when establishing remote connections.

Definitions & key terms

  • EINPROGRESS: Connection in progress for non-blocking sockets.
  • SO_ERROR: Socket option that stores the pending error state.
  • Poll for writability: Detect when connect finished.

Mental model diagram (ASCII)

connect() -> EINPROGRESS
     |
     v
poll(writable) -> getsockopt(SO_ERROR)
     |
  0 success / errno failure

How it works (step-by-step, with invariants and failure modes)

  1. Create non-blocking socket.
  2. Call connect; expect EINPROGRESS.
  3. Poll for writability with a deadline.
  4. Call getsockopt(SO_ERROR) and check result.
  5. If success, mark connection healthy; else close and retry.
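Taken together, the steps above might be sketched as follows. Address setup and logging are omitted, and `connect_with_timeout` is an illustrative name, not part of the required API:

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect with a deadline; returns the fd on success, -1 on failure. */
static int connect_with_timeout(const struct sockaddr *addr, socklen_t alen,
                                int timeout_ms)
{
    int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    if (fd < 0)
        return -1;

    if (connect(fd, addr, alen) < 0 && errno != EINPROGRESS) {
        close(fd);                       /* immediate failure (e.g., ENETUNREACH) */
        return -1;
    }

    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int n = poll(&pfd, 1, timeout_ms);
    if (n <= 0) {                        /* 0 = deadline expired, <0 = poll error */
        close(fd);
        return -1;
    }

    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0) {
        close(fd);                       /* handshake failed: err holds the errno */
        return -1;
    }
    return fd;                           /* connected; still non-blocking */
}
```

The caller can then set the connection to "pending" until this returns a valid fd, matching the lifecycle rule described earlier.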

Minimal concrete example

int err = 0; socklen_t len = sizeof(err);
getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
if (err != 0) { /* connect failed */ }

Common misconceptions

  • “Writable means connected”: false without SO_ERROR check.
  • “EINPROGRESS means failure”: false; it is normal for non-blocking connect.
  • “Timeout means server down”: could be network path issues or firewalls.

Check-your-understanding questions

  1. Why does writable readiness not guarantee success?
  2. How long should you wait for connect before failing?
  3. What do you do with a socket that timed out?

Check-your-understanding answers

  1. Writability indicates the handshake completed or failed; SO_ERROR distinguishes.
  2. Use a configurable deadline based on expected network latency.
  3. Close and record failure; do not return it to the pool.

Real-world applications

  • Non-blocking HTTP clients.
  • Event-driven proxies.
  • Load balancer health checks.

Where you will apply it

References

  • man 2 connect, man 2 getsockopt.
  • TCP/IP Sockets in C, Chapters 2-4.

Key insights

Non-blocking connect is only complete after SO_ERROR returns 0.

Summary

Connection pools must validate non-blocking connects with SO_ERROR to avoid false-positive connections.

Homework/exercises to practice the concept

  1. Write a small program that connects to a closed port and prints SO_ERROR.
  2. Compare blocking vs non-blocking connect timings under packet loss.
  3. Inject artificial timeouts with tc and observe behavior.

Solutions to the homework/exercises

  1. SO_ERROR should return ECONNREFUSED.
  2. Blocking connect will wait; non-blocking uses poll and a custom deadline.
  3. Timeouts should produce ETIMEDOUT and trigger retries.

2.3 Timeouts, Backoff, and Circuit Breakers

Fundamentals

Timeouts prevent a single request or connection from blocking forever. Backoff prevents rapid retries from overwhelming a failing system. A circuit breaker stops attempts temporarily when failure rates are high. In a connection pool, you need timeouts at multiple layers: connect timeout, read timeout, and idle timeout. You also need a backoff strategy for reconnect attempts and a circuit breaker that opens when failures exceed a threshold. These patterns are critical in production environments where network failures are common.

Deep Dive into the concept

A robust pool has layered timeouts. The connect timeout bounds how long you wait for the TCP handshake. The read timeout bounds how long a request waits for a response. The idle timeout removes connections that have been unused for too long. Each timeout needs a specific failure response: connect timeout triggers a new connect attempt after backoff; read timeout triggers request retry (if safe) and socket eviction; idle timeout triggers a health check or closure.

Backoff should be exponential with jitter to prevent synchronized retry storms. A typical formula is: base * 2^attempt + random(0, jitter), capped at a maximum. Without jitter, many clients will retry at the same intervals, worsening an outage. The pool should track failures per host and per connection to avoid a single failing endpoint poisoning the entire pool.
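The formula above can be written as a small helper. This sketch clamps the shift to avoid integer overflow and uses `rand()` for jitter; a real pool might prefer a per-thread PRNG:

```c
#include <stdlib.h>

/* Exponential backoff with jitter: base * 2^attempt + random(0, jitter),
 * capped at max_ms. The attempt count is clamped so the shift cannot
 * overflow before the cap is applied. */
static int backoff_delay_ms(int base_ms, int attempt, int max_ms, int jitter_ms)
{
    if (attempt > 20)
        attempt = 20;                    /* 2^20 * base exceeds any sane cap */
    long d = (long)base_ms << attempt;
    if (d > max_ms)
        d = max_ms;
    return (int)d + (jitter_ms > 0 ? rand() % jitter_ms : 0);
}
```

With base 100 ms and a 5 s cap, the sequence is 100, 200, 400, 800, ... until it flattens at 5000, each value nudged by up to `jitter_ms` of randomness.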

Circuit breakers track error rates within a sliding window. When the error rate exceeds a threshold, the breaker opens and all new requests fail fast for a cooldown period. After the cooldown, the breaker allows a limited number of trial requests (half-open state). If these succeed, the breaker closes; if they fail, it reopens. In a pool, the breaker can be global (per target) or per connection. A global breaker is easier to reason about and helps protect the system when the target is down.
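A minimal per-target breaker might look like the sketch below. It counts consecutive failures rather than tracking a true sliding-window error rate, which is a simplification of the design described above:

```c
#include <stdbool.h>
#include <time.h>

enum breaker_state { BR_CLOSED, BR_OPEN, BR_HALF_OPEN };

struct breaker {
    enum breaker_state state;
    int failures;            /* consecutive failures while CLOSED */
    int threshold;           /* failures before opening */
    time_t opened_at;        /* when the breaker tripped */
    int cooldown_s;          /* how long to stay OPEN */
};

/* Ask before each request: may we try? */
static bool breaker_allow(struct breaker *b, time_t now)
{
    if (b->state == BR_OPEN && now - b->opened_at >= b->cooldown_s)
        b->state = BR_HALF_OPEN;        /* cooldown over: allow a trial */
    return b->state != BR_OPEN;
}

/* Report the outcome after each request. */
static void breaker_record(struct breaker *b, bool ok, time_t now)
{
    if (ok) {
        b->state = BR_CLOSED;           /* trial succeeded, or normal success */
        b->failures = 0;
    } else if (b->state == BR_HALF_OPEN || ++b->failures >= b->threshold) {
        b->state = BR_OPEN;             /* trip, or re-trip after a failed trial */
        b->opened_at = now;
        b->failures = 0;
    }
}
```

The pool would call `breaker_allow` before each connect or request and `breaker_record` afterward; a production version would add the sliding window and a cap on trial requests in the half-open state.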

Failure injection is a testing strategy: deliberately introduce delays, resets, or partial failures to verify your recovery logic. You can implement this by randomly delaying requests, forcing a connect to a blackholed port, or injecting EPIPE errors by closing sockets under load. This makes the pool resilient because it exposes failures before production does.

How this fits in projects

This concept defines how your pool recovers and how you test it. It also appears in Project 6 where deployments must fail fast and retry safely.

Definitions & key terms

  • Timeout: Maximum duration for an operation.
  • Backoff: Increasing delay between retries.
  • Circuit breaker: State machine that stops requests during high failure rates.
  • Half-open: Breaker state that allows trial requests.

Mental model diagram (ASCII)

Closed -> failures exceed threshold -> Open
Open -> cooldown -> Half-open -> success -> Closed
                         | failure -> Open

How it works (step-by-step, with invariants and failure modes)

  1. Each connect attempt has a deadline.
  2. On failure, record error and schedule retry with backoff.
  3. Track error rate in a window.
  4. If threshold exceeded, open circuit and fail fast.
  5. After cooldown, allow limited trial requests.
  6. Failure mode: no backoff -> retry storm.

Minimal concrete example

int backoff_ms = base_ms << attempts;           /* clamp attempts in real code to avoid shift overflow */
if (backoff_ms > max_ms) backoff_ms = max_ms;   /* cap the delay */
backoff_ms += rand() % jitter_ms;               /* add jitter */

Common misconceptions

  • “Timeouts slow things down”: they prevent indefinite stalls.
  • “Retries always help”: retries can worsen outages without backoff.
  • “Circuit breakers are optional”: they protect systems under failure.

Check-your-understanding questions

  1. Why add jitter to backoff?
  2. What should happen when the circuit opens?
  3. How do you decide if a request is safe to retry?

Check-your-understanding answers

  1. To avoid synchronized retries.
  2. Fail fast and wait for cooldown before trial requests.
  3. Only retry idempotent requests (e.g., GET) or explicit retryable operations.

Real-world applications

  • API clients in microservices.
  • Database connection pools.
  • Load balancers and gateways.

Where you will apply it

  • Project 2: See §3.2 Functional Requirements and §5.10 Phase 3.
  • Project 6: Deployment retries and health checks.
  • Also used in: P06 Deployment Pipeline Tool.

References

  • “Release It!” (Nygard), circuit breaker patterns.
  • “Site Reliability Engineering” (Google), retry and backoff guidance.

Key insights

Retries without backoff are a failure amplifier.

Summary

Timeouts and backoff prevent a pool from hanging or causing a retry storm. Circuit breakers provide a safety valve when a target is unhealthy.

Homework/exercises to practice the concept

  1. Implement exponential backoff with jitter and plot the sequence.
  2. Simulate a 50% failure rate and observe when the breaker opens.
  3. Inject timeouts and verify that idle sockets are evicted.

Solutions to the homework/exercises

  1. The delay should double each attempt but include random jitter.
  2. The breaker should open once the threshold is exceeded and close after successful trials.
  3. Idle sockets should be closed after the idle timeout or validated.

2.4 Connection Pool Design and Health Checks

Fundamentals

A connection pool maintains a set of reusable connections to reduce latency and resource usage. The pool manages the lifecycle of connections: creation, validation, checkout, checkin, and eviction. Health checks ensure that idle connections remain valid. Without health checks, a pool will return dead connections after server restarts. The pool must also enforce a maximum size and handle contention when all connections are busy.

Deep Dive into the concept

A pool typically has two main data structures: an idle list and an in-use list. When a client requests a connection, the pool checks the idle list first. If none are available and the pool is under its maximum size, it creates a new connection. If the pool is at capacity, the client can either wait or fail fast. This decision is critical for latency and throughput. For a systems integration project, you should implement a configurable strategy: block with timeout or return an error immediately.

Health checks can be passive or active. Passive checks occur when a request fails; the connection is evicted. Active checks periodically test idle connections with a lightweight request (e.g., HTTP HEAD) or with a TCP keepalive probe. Active checks consume resources but prevent stale connections from lingering.

Pools must also enforce idle timeouts. Long-lived idle connections can be closed by intermediaries (load balancers) without the client knowing. Closing idle connections proactively reduces half-open issues. Additionally, you should track per-connection statistics such as last-used time, number of failures, and latency. These metrics guide eviction decisions and allow you to detect unhealthy connections.

Failure injection is used to validate pool behavior. You can implement a mode where random requests are delayed or connections are forcibly closed. This ensures your pool handles unexpected failures gracefully. The pool should also expose metrics, such as total connections, active connections, failures by errno, and latency percentiles. These metrics are essential in production to diagnose issues.

How this fits in projects

This concept defines the core pool behavior in Project 2 and will be reused in Project 6 to manage remote operations.

Definitions & key terms

  • Checkout: Borrowing a connection from the pool.
  • Checkin: Returning a connection to the pool.
  • Idle timeout: Maximum time a connection can remain unused.
  • Health check: Validation to ensure a connection is alive.

Mental model diagram (ASCII)

[Pool]
 idle list <-> in-use list
    |              |
  validate       return

How it works (step-by-step, with invariants and failure modes)

  1. Client requests connection.
  2. If idle exists, validate and checkout.
  3. If none, create new if under max size.
  4. If at max size, wait or fail based on policy.
  5. On return, place in idle list and update metrics.
  6. Failure mode: pool returns dead connection without validation.
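The checkout path above, in the fail-fast variant, might be sketched like this. `conn_validate` and `conn_create` are hypothetical hooks, and the idle list is simplified to a stack:

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define POOL_MAX 16

struct conn { int fd; int64_t last_used_ms; int failures; };

struct pool {
    struct conn *idle[POOL_MAX];
    int n_idle;
    int n_total;      /* idle + in use */
    int max_size;
};

/* Hypothetical hooks the real pool would provide. */
int conn_validate(struct conn *c);               /* 1 = healthy */
struct conn *conn_create(void);                  /* NULL on failure */

/* Checkout: reuse a validated idle connection, else create, else fail fast. */
struct conn *pool_checkout(struct pool *p)
{
    while (p->n_idle > 0) {
        struct conn *c = p->idle[--p->n_idle];
        if (conn_validate(c))
            return c;                            /* healthy: hand it out */
        close(c->fd);                            /* dead: evict, keep looking */
        free(c);
        p->n_total--;
    }
    if (p->n_total < p->max_size) {
        struct conn *c = conn_create();
        if (c)
            p->n_total++;
        return c;
    }
    return NULL;                                 /* at capacity: fail fast */
}
```

The blocking-with-timeout policy from the text would replace the final `return NULL` with a bounded wait for a checkin; the validation and eviction logic stays the same either way.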

Minimal concrete example

struct conn { int fd; int64_t last_used_ms; int failures; };

Common misconceptions

  • “Pooling hides network failures”: it can expose them more quickly if you validate.
  • “Idle connections are safe”: they can be silently closed by peers.
  • “Maximum size is enough”: you also need idle limits and timeouts.

Check-your-understanding questions

  1. Why validate connections on checkout?
  2. What happens if all connections are busy?
  3. Why track last-used time?

Check-your-understanding answers

  1. To avoid returning dead sockets.
  2. You either block with a timeout or fail fast; choose explicitly.
  3. To enforce idle timeouts and decide eviction.

Real-world applications

  • HTTP clients and service meshes.
  • Database pools (Postgres, MySQL).
  • Message brokers and RPC frameworks.

Where you will apply it

References

  • Connection pool design patterns in high-performance systems.

Key insights

A pool is a lifecycle manager, not just a connection cache.

Summary

Pooling reduces latency and resource usage, but only if connections are validated and evicted correctly.

Homework/exercises to practice the concept

  1. Implement a pool with max size and idle timeout.
  2. Simulate server restarts and verify validation logic.
  3. Add metrics for active/idle connections and failures.

Solutions to the homework/exercises

  1. The pool should cap total connections and evict idle ones.
  2. Validation should detect dead sockets and replace them.
  3. Metrics should show connection usage and failure patterns.

3. Project Specification

3.1 What You Will Build

A reusable HTTP connection pool library with a CLI test harness. The pool should support non-blocking connects, health checks, backoff, and failure injection. It should survive server restarts without manual intervention and provide metrics for observability.

3.2 Functional Requirements

  1. Pool lifecycle: Create, checkout, and return connections with max size.
  2. Validation: Validate connections on checkout with a health check.
  3. Timeouts: Implement connect, read, and idle timeouts.
  4. Failure injection: Add a mode that simulates timeouts and resets.
  5. Metrics: Expose latency and failure counters.

3.3 Non-Functional Requirements

  • Performance: Handle 1,000 requests/min with stable latency.
  • Reliability: No FD leaks; detect dead connections quickly.
  • Usability: Clear CLI output for pool status.

3.4 Example Usage / Output

$ ./pooler --host 127.0.0.1 --port 8080 --size 10 --inject-failures
[pool] created 10 connections
[pool] request latency p50=3ms p95=12ms
[pool] failures: ECONNRESET=3 ETIMEDOUT=2
[pool] circuit breaker OPEN for 5s
[pool] recovered, pool size=10

3.5 Data Formats / Schemas / Protocols

Config (JSON):

{
  "host": "127.0.0.1",
  "port": 8080,
  "pool_size": 10,
  "connect_timeout_ms": 500,
  "read_timeout_ms": 2000,
  "idle_timeout_ms": 10000,
  "inject_failures": true
}

Metrics output (text):

connections.active=5
connections.idle=5
latency.p50_ms=3
errors.ECONNRESET=3

3.6 Edge Cases

  • Server restart while idle connections exist.
  • Connect timeout under packet loss.
  • Pool exhaustion when all connections are in use.
  • Circuit breaker open state with queued requests.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

make
./pooler --host 127.0.0.1 --port 8080 --size 5 --inject-failures

3.7.2 Golden Path Demo (Deterministic)

  • Use a local HTTP server and fixed failure injection seed.
  • The pool reports identical latency and failure counts on each run.

3.7.3 If CLI: exact terminal transcript

$ ./pooler --host 127.0.0.1 --port 8080 --size 2 --seed 42
[pool] created 2 connections
[pool] request latency p50=2ms p95=4ms
[pool] failures: ECONNRESET=0 ETIMEDOUT=0

Failure demo (server down):

$ ./pooler --host 127.0.0.1 --port 9999
[pool] connect timeout after 500ms
[pool] circuit breaker OPEN for 3s

Exit codes:

  • 0 on success.
  • 2 on invalid args/config.
  • 4 on unrecoverable network failures.

4. Solution Architecture

4.1 High-Level Design

+--------------------+
| Pool Manager       |
+---+-----------+----+
    |           |
    v           v
Idle List    In-Use List
    |           |
    +-- Health --+
         Checks

4.2 Key Components

Component Responsibility Key Decisions
Connection Socket state, last-used, failure count Non-blocking by default
Pool Manager Checkout/checkin, sizing, idle eviction Max size + wait timeout
Health Checker Validate connections HTTP HEAD or TCP ping
Metrics Latency and error tracking Text output for simplicity
Failure Injector Inject delays and resets Seeded random for determinism

4.3 Data Structures (No Full Code)

struct conn {
    int fd;
    int healthy;
    int64_t last_used_ms;
    int failures;
};

struct pool {
    struct conn *idle[MAX];
    struct conn *in_use[MAX];
    int max_size;
};

4.4 Algorithm Overview

Key Algorithm: Checkout/validation

  1. If idle exists, validate with health check.
  2. If healthy, return.
  3. If not healthy, close and replace.
  4. If none and under max, create new connection.
  5. If at max, wait or fail based on policy.

Complexity Analysis:

  • Time: O(1) for list operations; O(1) per validation.
  • Space: O(n) for pool size.

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install -y gcc make

5.2 Project Structure

http-pool/
├── src/
│   ├── main.c
│   ├── pool.c
│   ├── conn.c
│   ├── metrics.c
│   └── failinject.c
├── include/
│   ├── pool.h
│   └── conn.h
├── tests/
│   ├── test_connect.sh
│   └── test_restart.sh
└── Makefile

5.3 The Core Question You’re Answering

“How do I keep connections reliable when the network is not?”

5.4 Concepts You Must Understand First

  1. TCP state machine and half-open behavior.
  2. Non-blocking connect and SO_ERROR validation.
  3. Timeouts and circuit breakers.
  4. Pool lifecycle and health checks.

5.5 Questions to Guide Your Design

  1. When should a connection be considered unhealthy?
  2. How long should you wait for connect and read?
  3. What is your retry and backoff policy?
  4. How will you expose metrics?

5.6 Thinking Exercise

Half-Open Simulation

Client sends request, server crashes after ACK.
What should the pool do after 5s? 30s?

5.7 The Interview Questions They’ll Ask

  1. Why does writable readiness not guarantee connect success?
  2. What is TIME-WAIT and how does pooling affect it?
  3. How do you detect half-open connections?

5.8 Hints in Layers

Hint 1: Non-blocking connect

int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);

Hint 2: SO_ERROR

getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);

Hint 3: Health check

send(fd, "HEAD / HTTP/1.0\r\n\r\n", 19, 0);  /* 19 bytes, including the final CRLF CRLF */

Hint 4: Backoff

100ms -> 200ms -> 400ms -> 800ms

5.9 Books That Will Help

Topic Book Chapter
TCP state machine TCP/IP Illustrated Vol. 1 Ch. 13
Sockets in C TCP/IP Sockets in C Ch. 2-4
Failure patterns Release It! Ch. 5

5.10 Implementation Phases

Phase 1: Connection Creation (3-4 days)

Goals:

  • Non-blocking connect with SO_ERROR validation.

Tasks:

  1. Implement connection creation with timeouts.
  2. Validate connections with SO_ERROR.

Checkpoint: A single connection can be created and validated.

Phase 2: Pool Lifecycle (4-6 days)

Goals:

  • Checkout/checkin and idle eviction.

Tasks:

  1. Build pool data structures.
  2. Implement health checks on checkout.
  3. Add idle timeout eviction.

Checkpoint: Pool sustains 100 requests without leaks.

Phase 3: Resilience + Metrics (4-6 days)

Goals:

  • Backoff, circuit breaker, failure injection.

Tasks:

  1. Implement backoff with jitter.
  2. Add circuit breaker state machine.
  3. Add failure injection mode and metrics output.

Checkpoint: Pool recovers after server restart.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Health check method HEAD request, TCP ping HEAD request Validates HTTP path too
Pool exhaustion block, fail fast block with timeout Predictable latency
Retry policy linear, exponential exponential + jitter Avoid retry storms

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests SO_ERROR handling connect failure test
Integration Tests Server restart recovery restart script
Edge Case Tests Half-open detection kill server mid-request

6.2 Critical Test Cases

  1. Non-blocking connect: Verify SO_ERROR success and failure.
  2. Server restart: Connections are evicted and re-established.
  3. Circuit breaker: Opens after threshold and closes after cooldown.

6.3 Test Data

Request: GET /health HTTP/1.1\r\nHost: 127.0.0.1\r\n\r\n
Expected: HTTP/1.1 200 OK

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Skipping SO_ERROR Pool returns dead sockets Validate after poll
No timeouts Requests hang Add read/connect timeout
No backoff Retry storm Exponential backoff

7.2 Debugging Strategies

  • Use ss -tan to inspect TCP states.
  • Add debug logs for connect timing and SO_ERROR values.
  • Simulate failures using tc netem and server restarts.

7.3 Performance Traps

  • Excessive health checks on every checkout.
  • Pool size too large for RLIMIT_NOFILE.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a simple synchronous HTTP client wrapper.
  • Output metrics in JSON format.

8.2 Intermediate Extensions

  • Add TLS support using OpenSSL.
  • Implement per-host pools.

8.3 Advanced Extensions

  • Add adaptive backoff based on failure rate.
  • Implement a connection warm-up strategy.

9. Real-World Connections

9.1 Industry Applications

  • Service meshes: Pools manage outbound connections.
  • Database clients: Pools limit connection churn.
  • curl: Connection reuse and pooling logic.
  • Envoy: Advanced connection pool implementation.

9.2 Interview Relevance

  • Expect questions about TCP states, timeouts, and retry strategies.

10. Resources

10.1 Essential Reading

  • TCP/IP Illustrated Vol. 1 - TCP state machine.
  • TCP/IP Sockets in C - Non-blocking connect patterns.

10.2 Video Resources

  • Talks on resilient networking and backoff.

10.3 Tools & Documentation

  • man 7 tcp, man 2 connect, man 2 poll.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain TCP half-open connections.
  • I can validate non-blocking connects with SO_ERROR.
  • I can describe circuit breaker behavior.

11.2 Implementation

  • All functional requirements are met.
  • Pool survives server restarts.
  • Failure injection demonstrates recovery.

11.3 Growth

  • I can justify my pool sizing decisions.
  • I can explain backoff to a teammate.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Pool creates and validates connections.
  • Timeouts and health checks work.
  • Failure injection mode runs.

Full Completion:

  • Circuit breaker and metrics implemented.
  • Integration tests for server restart and timeouts.

Excellence (Going Above & Beyond):

  • TLS support with session reuse.
  • Adaptive backoff and dynamic pool sizing.