Project 2: HTTP/1.1 Protocol Parser

Build a streaming HTTP/1.1 request parser that never loses synchronization and always reports errors deterministically.

Quick Reference

Attribute Value
Difficulty Advanced
Time Estimate 1-2 weeks
Main Programming Language C
Alternative Programming Languages Rust, Go, Zig
Coolness Level Level 3: Genuinely Clever
Business Potential Level 2: Micro-SaaS / Pro Tool
Prerequisites Socket I/O basics, strings, enums, buffers
Key Topics Streaming parsing, state machines, framing, error recovery

1. Learning Objectives

By completing this project, you will:

  1. Implement a streaming parser as an explicit state machine.
  2. Parse HTTP/1.1 request lines, headers, and bodies incrementally.
  3. Manage partial reads without losing synchronization.
  4. Detect protocol errors and report them consistently.
  5. Enforce size limits and backpressure for safety.
  6. Produce deterministic success and failure transcripts.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Streaming Parsing as a State Machine

Fundamentals

Streaming protocols deliver data in chunks, not complete messages. A parser must accept partial input, remember what it has already seen, and continue when more bytes arrive. This naturally maps to a state machine: states represent which part of the message you are parsing (request line, headers, body), and transitions are triggered by delimiters such as CRLF. Without an explicit state machine, you will either block waiting for more input or accidentally treat incomplete data as invalid. A streaming parser must be deterministic, resilient to partial reads, and capable of resynchronizing after errors. The state machine should be the single source of truth for what you expect next.

Deep Dive into the concept

An HTTP parser is a small protocol engine. The core problem is that TCP is a byte stream with no message boundaries. You might receive one byte at a time or multiple requests at once. The parser must therefore be incremental: it should accept a buffer of bytes, consume as much as it can, and return “need more” if the message is incomplete. This is identical to the approach used in binary protocol parsers in databases and message brokers.

A robust streaming parser has three parts: a state enum, a buffer cursor, and a set of transition rules. The state enum might include REQUEST_LINE, HEADERS, BODY_FIXED, BODY_CHUNKED, and DONE. Each state defines the delimiters it is waiting for. REQUEST_LINE waits for CRLF, then parses method, path, and version. HEADERS waits for CRLF CRLF, parsing each header line. BODY_FIXED consumes exactly Content-Length bytes. BODY_CHUNKED parses a hexadecimal length line, then consumes that many bytes, repeated until a zero-length chunk. This decomposition makes the parser resilient: if a read ends halfway through a CRLF, the parser simply retains its state and continues when the next chunk arrives.

Error handling is part of the state machine, not a separate feature. If a line exceeds a maximum length, transition to ERROR and return a deterministic error code. If a header line lacks a colon, do the same. If Content-Length and Transfer-Encoding disagree, error. The point is to avoid ambiguous recovery. A parser that tries to guess or silently fix invalid input becomes a security liability. Instead, make errors explicit and stop parsing that request. If you want to support pipelining, you can then discard input until the next valid request line, but you must do so deliberately with a resynchronization state.

A streaming parser must also manage backpressure. If the body is large, you may not want to buffer it entirely. The parser should be able to emit “body data” events to a consumer so that the application can stream it to disk or process incrementally. This implies that the parser is not just a function returning a full request; it is an event-driven machine that emits tokens as they arrive. This design mirrors how production HTTP servers work. It also reinforces the state machine: each emitted event corresponds to a state transition.

The most subtle bugs come from off-by-one errors at boundaries. If you incorrectly consume the CRLF after a chunk, you will desynchronize and misinterpret the next bytes. A disciplined parser tracks the exact number of bytes consumed and never discards bytes without updating state. For example, in chunked encoding, after reading the chunk size line, you must read exactly size bytes of data and then read exactly CRLF. A missing CRLF must be treated as “need more” rather than error if the buffer ends. This precision is the difference between a parser that works on sample inputs and one that works on real network traffic.

Finally, streaming parsers are testable. If you model the parser as state + input, you can feed it partial chunks and verify that it returns “need more” until the input is complete. You can also fuzz it with random splits to ensure that no split causes a crash or misparse. The state machine makes those tests deterministic and meaningful.

How this fits into the project

The HTTP parser is a direct application of explicit state machines. Every byte is processed according to state, and all errors are handled as legal transitions to ERROR. This design also reuses the input parsing mindset from Project 1.

Definitions & key terms

  • Streaming parser: A parser that consumes input incrementally across multiple reads.
  • Framing: Determining where one message ends and the next begins.
  • Resynchronization: Discarding input until the next valid message boundary.
  • Content-Length: Header defining fixed body length.
  • Chunked encoding: Body delivered as length-prefixed chunks.

Mental model diagram (ASCII)

[BYTE STREAM]
     |
     v
REQUEST_LINE -> HEADERS -> (BODY_FIXED | BODY_CHUNKED) -> DONE
        |           |             |
        v           v             v
      ERROR       ERROR         ERROR

How it works (step-by-step)

  1. Start in REQUEST_LINE state.
  2. Consume bytes until CRLF; parse method/path/version.
  3. Transition to HEADERS; parse header lines until CRLF CRLF.
  4. Decide body mode: none, fixed length, or chunked.
  5. Consume body accordingly, emitting data events.
  6. Transition to DONE and return the parsed request.
  7. If any rule is violated, transition to ERROR with a code.

Failure modes: missing CRLF, header line too long, invalid chunk size, inconsistent headers.

Minimal concrete example

#include <stddef.h>

typedef enum { ST_REQ, ST_HEADERS, ST_BODY_FIXED, ST_DONE, ST_ERROR } State;

typedef struct {
    State st;       // current section being parsed
    size_t needed;  // bytes still needed for the body
} Parser;

// Consume based on p->st; return 0 when done, 1 if need more, -1 on error
int parse(Parser *p, const char *buf, size_t len) {
    (void)p; (void)buf; (void)len;  // skeleton: real logic dispatches on p->st
    return 1;
}

Common misconceptions

  • “I can parse the whole request in one read.” TCP does not guarantee that.
  • “If I see invalid input, I can keep going.” This creates ambiguous state and security bugs.
  • “Chunked encoding is just another header.” It changes the body framing rules entirely.

Check-your-understanding questions

  1. Why does a streaming parser need to store state between reads?
  2. What delimiter ends the headers section in HTTP/1.1?
  3. How do you know when a chunked body is complete?

Check-your-understanding answers

  1. Because the input may end in the middle of a token, and the parser must resume correctly.
  2. A blank line, represented by CRLF CRLF.
  3. A chunk size of 0 followed by a final CRLF terminates the body.

Real-world applications

  • HTTP servers and reverse proxies
  • SMTP and IMAP parsers
  • Custom binary protocol parsers

Where you’ll apply it

  • In this project: see §4.3 Data Structures and §5.10 Phase 1.

References

  • RFC 7230 (HTTP/1.1 Message Syntax and Routing)
  • “HTTP: The Definitive Guide” by Gourley and Totty (protocol fundamentals)

Key insights

A streaming parser is a deterministic state machine that treats TCP as a raw byte stream.

Summary

Explicit states and precise byte accounting are the only reliable way to parse HTTP over TCP.

Homework/Exercises to practice the concept

  1. Write a parser that accepts only a request line and returns NEED_MORE if incomplete.
  2. Simulate a request arriving in 1-byte chunks and verify the parser never errors.

Solutions to the homework/exercises

  1. Store partial line in a buffer and search for CRLF; only parse when found.
  2. Feed the parser one byte at a time and ensure it returns NEED_MORE until the final CRLF.

2.2 Buffering, Framing, and Backpressure

Fundamentals

A streaming parser is only as good as its buffering strategy. You need a buffer that can accept partial reads, retain unconsumed bytes, and expose complete tokens to the parser. You also need framing rules to determine where one message ends. Without framing, you cannot pipeline requests or safely discard invalid bytes. Backpressure is the strategy that prevents memory blowups: if the body is too large or the consumer is slow, you must stop reading or reject the request. These concerns are just as important as the parsing logic itself.

A reliable buffer strategy also decides ownership and lifetime of parsed data. If you store pointers into a mutable buffer, you must guarantee that those pointers remain valid until the consumer is done. If you copy data, you trade memory for safety. Either way, the buffering decision dictates correctness boundaries: you cannot parse faster than you can store, and you cannot discard bytes until you know they belong to a completed message.

Deep Dive into the concept

Buffer management is a systems problem disguised as string handling. If you read from a socket into a fixed buffer, you must decide what to do when the buffer fills before a full message arrives. The correct approach is to maintain a sliding window or ring buffer. A ring buffer lets you append new bytes while retaining unread bytes and avoids costly memmoves. The parser consumes bytes by advancing a read index. When the buffer runs out of space, you either grow it or apply backpressure by stopping reads until the consumer catches up.

Framing in HTTP is defined by the protocol: headers are terminated by CRLF CRLF, body length is determined by Content-Length or chunked encoding, and connections may contain multiple pipelined requests. The buffer therefore must retain any extra bytes after a message completes, because those bytes belong to the next request. This is a common bug: a naive parser discards extra bytes, losing synchronization. The correct behavior is to return DONE while leaving unread bytes in the buffer, so the next parse call can process them.

Backpressure requires explicit limits. You should define maximum header size, maximum request line length, and maximum body size. If these are exceeded, return a 413 (payload too large) or a parser error. This is both a safety feature and a denial-of-service mitigation. The limits are part of your state machine: when the header buffer grows beyond the limit, transition to ERROR and stop consuming. For body streaming, you can design the parser to emit body chunks to a callback; if the callback signals “slow” you stop reading from the socket until it drains. This design mirrors how real HTTP servers integrate parsing with I/O loops.

Another subtle issue is partial delimiters. Suppose the buffer ends with “\r” but not “\n” yet. You must not treat it as an error; you must wait for more input. This is why the parser must be incremental and why the buffer must preserve raw bytes. You also need to distinguish between “need more bytes” and “invalid sequence” errors. This affects your error reporting and makes the parser predictable.

Finally, consider memory ownership. If you parse headers by pointing into the buffer, those pointers become invalid when the buffer is reused. A safe design copies header names and values into a separate arena or uses offsets into a stable buffer. For this project, you can copy into fixed-size arrays or a simple heap arena. The key is to avoid use-after-free and dangling pointers when the buffer grows or wraps.

Zero-copy parsing is tempting but dangerous unless you control buffer lifetimes. A balanced approach is to keep offsets into a ring buffer for short-lived tokens (like the request line) and copy only when you need long-lived storage (like header fields used by downstream logic). This explicit lifetime rule keeps the parser fast while still safe under partial reads and buffer reuse.

These buffer and framing decisions influence all downstream behavior. Without them, even a correct state machine will fail on real network conditions. They also reinforce the control flow discipline from Project 1: every path that consumes bytes must update indices consistently, or the parser will desynchronize.

How this fits into the project

You will implement a ring or sliding buffer, enforce size limits, and keep leftover bytes for pipelined requests. This is the foundation for correctness under real network conditions.

Definitions & key terms

  • Ring buffer: Circular buffer with head/tail indices for streaming data.
  • Backpressure: Mechanism to slow or stop input when the consumer is overloaded.
  • Framing: Protocol rules that define message boundaries.
  • Pipelining: Multiple requests sent without waiting for responses.

Mental model diagram (ASCII)

[Socket] -> [Ring Buffer] -> [Parser]
              ^     |
              |     v
          Backpressure  (stop reads when full)

How it works (step-by-step)

  1. Read from socket into ring buffer.
  2. Parser consumes bytes and advances read index.
  3. If a request completes, leave remaining bytes for next parse.
  4. If buffer full, stop reading (backpressure).
  5. If limits exceeded, return error and close connection.

Failure modes: buffer overwrite, lost bytes after parse, unbounded memory growth, invalid pointer into reused buffer.

Minimal concrete example

// Ring buffer: r = read index, w = write index (monotonic, reduced mod size)
typedef struct { char buf[8192]; size_t r, w; } Ring;

size_t ring_write(Ring *rb, const char *src, size_t n); // append up to n bytes
size_t ring_read(Ring *rb, char *dst, size_t n);        // consume up to n bytes

Common misconceptions

  • “Headers fit in any small buffer.” Large headers are common and must be bounded.
  • “Discarding extra bytes is fine.” It breaks pipelining and desynchronizes streams.

Check-your-understanding questions

  1. Why should leftover bytes remain in the buffer after parsing a request?
  2. What is the difference between size limits and backpressure?
  3. Why can pointer-based header storage be unsafe?

Check-your-understanding answers

  1. Because they may belong to the next pipelined request.
  2. Limits enforce maximum size, backpressure controls rate when near capacity.
  3. The underlying buffer may move or wrap, invalidating pointers.

Real-world applications

  • HTTP servers, proxies, and load balancers
  • Streaming parsers for log ingestion

Where you’ll apply it

  • In this project: see §3.5 Data Formats, §4.2 Key Components, and §5.10 Phase 2.
  • Also used in: P03-connection-pool.md (backpressure and limits).

References

  • RFC 7230 (framing rules)
  • “UNIX Network Programming” by Stevens (socket buffering)

Key insights

Buffering and framing turn a byte stream into safe, deterministic message boundaries.

Summary

A correct parser depends on disciplined buffering, explicit limits, and careful handling of leftover bytes.

Homework/Exercises to practice the concept

  1. Implement a ring buffer that supports partial reads and writes.
  2. Add a maximum header size limit and test rejection behavior.

Solutions to the homework/exercises

  1. Track read/write indices modulo buffer size and ensure no overwrite.
  2. Count header bytes and return error once limit exceeded.

3. Project Specification

3.1 What You Will Build

A streaming HTTP/1.1 request parser library with a small CLI tool (httpdump). The parser consumes byte chunks, emits parsed request structures, and handles partial reads. The CLI can read from stdin or a file and prints a parsed summary.

Included:

  • Request line + header parsing
  • Fixed-length and chunked body handling
  • Size limits and error reporting

Excluded:

  • Full HTTP server response generation
  • TLS

3.2 Functional Requirements

  1. Streaming API: parse_chunk(buf,len) returns NEED_MORE, DONE, or ERROR.
  2. Request line parsing: method, path, version validated.
  3. Header parsing: case-insensitive names, value trimming.
  4. Body parsing: Content-Length or chunked encoding.
  5. Limits: max request line, max headers, max body.
  6. Error codes: deterministic error codes with messages.
  7. CLI tool: reads from stdin or file and prints parsed structure.

3.3 Non-Functional Requirements

  • Performance: O(n) parsing per request.
  • Reliability: never segfault on malformed input.
  • Usability: clear error messages and exit codes.

3.4 Example Usage / Output

$ cat req.txt | ./httpdump
METHOD: GET
PATH: /index.html
VERSION: HTTP/1.1
Headers:
  Host: example.com
  Accept: */*
Body: (0 bytes)

3.5 Data Formats / Schemas / Protocols

  • Protocol: HTTP/1.1 as defined in RFC 7230.
  • CLI output (text): human readable summary.
  • CLI output (json): --json prints machine-readable structure.

Example JSON error shape:

{ "error": { "code": "INVALID_HEADER", "message": "Missing ':' in header" } }

3.6 Edge Cases

  • CRLF split across reads
  • Multiple requests pipelined in one buffer
  • Chunked body with extensions
  • Headers larger than limit

3.7 Real World Outcome

Deterministic inputs are provided; no timestamps.

3.7.1 How to Run (Copy/Paste)

cc -std=c11 -Wall -Wextra -O2 -o httpdump src/httpdump.c
./httpdump < samples/get.req

3.7.2 Golden Path Demo (Deterministic)

printf "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n" | ./httpdump

Expected output shows method GET, path /, HTTP/1.1, and Host header.

3.7.3 CLI Transcript (Success + Failure)

$ printf "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n" | ./httpdump
METHOD: GET
PATH: /
VERSION: HTTP/1.1
Headers:
  Host: example.com
Body: (0 bytes)
$ echo $?
0

$ printf "GET / HTTP/1.1\r\nBadHeader\r\n\r\n" | ./httpdump
ERROR: INVALID_HEADER (Missing ':' in header)
$ echo $?
1

Exit codes:

  • 0 success
  • 1 protocol error
  • 2 system error (I/O or allocation failure)

4. Solution Architecture

4.1 High-Level Design

[Socket/STDIN] -> [Ring Buffer] -> [Parser State Machine] -> [Request Struct]
                                             |
                                             v
                                        [Error Codes]

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Ring Buffer | Store partial bytes | Avoid memmove, keep leftovers |
| Parser | State machine for HTTP sections | Explicit states and limits |
| Request Model | Store parsed fields | Copy header values to stable storage |
| CLI | Read input, call parser, print results | Text + JSON output modes |

4.3 Data Structures (No Full Code)

typedef enum { ST_REQ, ST_HEADERS, ST_BODY_LEN, ST_BODY_CHUNK, ST_DONE, ST_ERR } State;

typedef struct {
    State st;
    size_t content_len;
    size_t chunk_len;
} HttpParser;

4.4 Algorithm Overview

Key Algorithm: Incremental Parse Loop

  1. Read bytes into ring buffer.
  2. While parser can advance, consume tokens.
  3. If DONE, emit request and keep leftover bytes.
  4. If ERROR, stop and report.

Complexity Analysis:

  • Time: O(n) per request
  • Space: O(n) for buffers and parsed headers

5. Implementation Guide

5.1 Development Environment Setup

cc --version

5.2 Project Structure

project-root/
├── src/
│   ├── parser.c
│   ├── parser.h
│   └── httpdump.c
├── tests/
│   └── test_parser.c
└── Makefile

5.3 The Core Question You’re Answering

“How can I parse a protocol correctly when I never know how many bytes I’ll receive?”

5.4 Concepts You Must Understand First

  1. HTTP/1.1 request grammar (request line, headers, body)
  2. Streaming parsing (state machine, partial reads)
  3. Buffering strategies (ring buffer)

5.5 Questions to Guide Your Design

  1. How will you avoid losing leftover bytes?
  2. How will you handle headers split across reads?
  3. What are your size limits and why?

5.6 Thinking Exercise

Simulate a request that arrives in 3 chunks: "GET /", " HTTP/1.1\r\nHost: a", and "b.com\r\n\r\n". Which states does the parser visit?

5.7 The Interview Questions They’ll Ask

  1. “Why is HTTP parsing a state machine problem?”
  2. “How do you handle chunked encoding?”
  3. “How do you avoid desynchronization after a malformed request?”

5.8 Hints in Layers

Hint 1: Start with parsing only the request line.

Hint 2: Add header parsing until CRLF CRLF.

Hint 3: Implement fixed-length body first, chunked later.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| HTTP protocol | “HTTP: The Definitive Guide” | Ch. 3-5 |
| Network I/O | “UNIX Network Programming” | Ch. 5 |
| Error handling | “Effective C” | Ch. 8 |

5.10 Implementation Phases

Phase 1: Request Line (2-3 days)

  • Parse method/path/version with CRLF.
  • Return NEED_MORE if incomplete.

Phase 2: Headers + Fixed Body (4-5 days)

  • Parse headers, enforce limits.
  • Implement Content-Length bodies.

Phase 3: Chunked Body + CLI (3-4 days)

  • Parse chunked bodies and build CLI output.
  • Add JSON output and error codes.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Buffer type | Sliding window, ring | Ring | Efficient and simple |
| Header storage | Pointers, copies | Copies | Safer lifetime management |
| Error handling | errno, enum codes | Enum codes | Deterministic and testable |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | Parsing tokens | request line, header line |
| Integration Tests | Full request parse | simple GET, POST |
| Edge Case Tests | Partial reads | split CRLF |

6.2 Critical Test Cases

  1. CRLF split across buffers
  2. Chunked body with multiple chunks
  3. Oversized headers rejected

6.3 Test Data

GET / HTTP/1.1\r\nHost: x\r\n\r\n
POST / HTTP/1.1\r\nContent-Length: 4\r\n\r\ndata

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Dropping leftover bytes | Broken pipelining | Keep unread bytes in buffer |
| Treating partial CRLF as error | Random failures | Return NEED_MORE |
| Not enforcing limits | Memory blowups | Add size caps |

7.2 Debugging Strategies

  • Log state transitions and bytes consumed.
  • Write tests that split input at every byte boundary.

7.3 Performance Traps

Excessive copying of buffers can be O(n^2). Use a ring buffer.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add support for absolute-form URLs.
  • Support HTTP/1.0 requests.

8.2 Intermediate Extensions

  • Emit parsed tokens as events instead of building full struct.
  • Implement header normalization and folding.

8.3 Advanced Extensions

  • Implement an HTTP response parser.
  • Integrate with a tiny HTTP server loop.

9. Real-World Connections

9.1 Industry Applications

  • HTTP servers, reverse proxies, and API gateways rely on these exact parsing rules.
  • http-parser: A classic streaming HTTP parser.
  • nginx: Production-grade HTTP parser and event loop.

9.2 Interview Relevance

  • State machines and streaming parsing are common systems interview topics.

10. Resources

10.1 Essential Reading

  • RFC 7230 (HTTP/1.1)
  • “HTTP: The Definitive Guide”

10.2 Video Resources

  • Protocol parsing lectures (search for HTTP parser walkthroughs)

10.3 Tools & Documentation

  • curl and netcat for generating requests

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why HTTP parsing is incremental.
  • I can explain how chunked encoding works.
  • I can explain how leftover bytes are handled.

11.2 Implementation

  • All functional requirements are met.
  • All test cases pass.
  • Error codes are deterministic.

11.3 Growth

  • I can describe parser states without notes.
  • I can explain how I handled partial reads.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Parses request line and headers correctly.
  • Handles partial reads with NEED_MORE.
  • Reports errors for malformed input.

Full Completion:

  • Supports Content-Length and chunked bodies.
  • Enforces size limits and returns proper exit codes.

Excellence (Going Above & Beyond):

  • Event-driven parsing with streaming callbacks.
  • Integration with a minimal HTTP server loop.

13. Additional Content Rules (Compliance)

  • Deterministic demos provided in §3.7.
  • Failure demo with exit codes included.
  • Cross-links provided in §2.1 and §2.2.