Project 2: HTTP/1.1 Protocol Parser

Build a streaming HTTP/1.1 request parser that never loses synchronization and always reports errors deterministically.

Quick Reference

Attribute Value
Difficulty Advanced
Time Estimate 1-2 weeks
Main Programming Language C
Alternative Programming Languages Rust, Go, Zig
Coolness Level Level 3: Genuinely Clever
Business Potential Level 2: Micro-SaaS / Pro Tool
Prerequisites Socket I/O basics, strings, enums, buffers
Key Topics Streaming parsing, state machines, framing, error recovery

1. Learning Objectives

By completing this project, you will:

  1. Implement a streaming parser as an explicit state machine.
  2. Parse HTTP/1.1 request lines, headers, and bodies incrementally.
  3. Manage partial reads without losing synchronization.
  4. Detect protocol errors and report them consistently.
  5. Enforce size limits and backpressure for safety.
  6. Produce deterministic success and failure transcripts.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Streaming Parsing as a State Machine

Fundamentals

Streaming protocols deliver data in chunks, not complete messages. A parser must accept partial input, remember what it has already seen, and continue when more bytes arrive. This naturally maps to a state machine: states represent which part of the message you are parsing (request line, headers, body), and transitions are triggered by delimiters such as CRLF. Without an explicit state machine, you will either block waiting for more input or accidentally treat incomplete data as invalid. A streaming parser must be deterministic, resilient to partial reads, and capable of resynchronizing after errors. The state machine should be the single source of truth for what you expect next.

Deep Dive into the concept

An HTTP parser is a small protocol engine. The core problem is that TCP is a byte stream with no message boundaries. You might receive one byte at a time or multiple requests at once. The parser must therefore be incremental: it should accept a buffer of bytes, consume as much as it can, and return “need more” if the message is incomplete. This is identical to the approach used in binary protocol parsers in databases and message brokers.

A robust streaming parser has three parts: a state enum, a buffer cursor, and a set of transition rules. The state enum might include REQUEST_LINE, HEADERS, BODY_FIXED, BODY_CHUNKED, and DONE. Each state defines the delimiters it is waiting for. REQUEST_LINE waits for CRLF, then parses method, path, and version. HEADERS waits for CRLF CRLF, parsing each header line. BODY_FIXED consumes exactly Content-Length bytes. BODY_CHUNKED parses a hexadecimal length line, then consumes that many bytes, repeated until a zero-length chunk. This decomposition makes the parser resilient: if a read ends halfway through a CRLF, the parser simply retains its state and continues when the next chunk arrives.

Error handling is part of the state machine, not a separate feature. If a line exceeds a maximum length, transition to ERROR and return a deterministic error code. If a header line lacks a colon, do the same. If Content-Length and Transfer-Encoding disagree, error. The point is to avoid ambiguous recovery. A parser that tries to guess or silently fix invalid input becomes a security liability. Instead, make errors explicit and stop parsing that request. If you want to support pipelining, you can then discard input until the next valid request line, but you must do so deliberately with a resynchronization state.

A streaming parser must also manage backpressure. If the body is large, you may not want to buffer it entirely. The parser should be able to emit “body data” events to a consumer so that the application can stream it to disk or process incrementally. This implies that the parser is not just a function returning a full request; it is an event-driven machine that emits tokens as they arrive. This design mirrors how production HTTP servers work. It also reinforces the state machine: each emitted event corresponds to a state transition.

The most subtle bugs come from off-by-one errors at boundaries. If you incorrectly consume the CRLF after a chunk, you will desynchronize and misinterpret the next bytes. A disciplined parser tracks the exact number of bytes consumed and never discards bytes without updating state. For example, in chunked encoding, after reading the chunk size line, you must read exactly size bytes of data and then read exactly CRLF. A missing CRLF must be treated as “need more” rather than error if the buffer ends. This precision is the difference between a parser that works on sample inputs and one that works on real network traffic.

Finally, streaming parsers are testable. If you model the parser as state + input, you can feed it partial chunks and verify that it returns “need more” until the input is complete. You can also fuzz it with random splits to ensure that no split causes a crash or misparse. The state machine makes those tests deterministic and meaningful.

How this fits into the project

The HTTP parser is a direct application of explicit state machines. Every byte is processed according to state, and all errors are handled as legal transitions to ERROR. This design also reuses the input parsing mindset from Project 1.

Definitions & key terms

  • Streaming parser: A parser that consumes input incrementally across multiple reads.
  • Framing: Determining where one message ends and the next begins.
  • Resynchronization: Discarding input until the next valid message boundary.
  • Content-Length: Header defining fixed body length.
  • Chunked encoding: Body delivered as length-prefixed chunks.

Mental model diagram (ASCII)

[BYTE STREAM]
     |
     v
REQUEST_LINE -> HEADERS -> (BODY_FIXED | BODY_CHUNKED) -> DONE
        |           |             |
        v           v             v
      ERROR       ERROR         ERROR

How it works (step-by-step)

  1. Start in REQUEST_LINE state.
  2. Consume bytes until CRLF; parse method/path/version.
  3. Transition to HEADERS; parse header lines until CRLF CRLF.
  4. Decide body mode: none, fixed length, or chunked.
  5. Consume body accordingly, emitting data events.
  6. Transition to DONE and return the parsed request.
  7. If any rule is violated, transition to ERROR with a code.

Failure modes: missing CRLF, header line too long, invalid chunk size, inconsistent headers.

Minimal concrete example

#include <stddef.h>

typedef enum { ST_REQ, ST_HEADERS, ST_BODY_FIXED, ST_DONE, ST_ERROR } State;

typedef struct {
    State st;       // current section being parsed
    size_t needed;  // bytes still needed for the body
} Parser;

// Consume based on p->st; return 0 when done, 1 if need more, -1 on error
int parse(Parser *p, const char *buf, size_t len) {
    (void)p; (void)buf; (void)len;  // skeleton: real logic dispatches on p->st
    return 1;
}

Common misconceptions

  • “I can parse the whole request in one read.” TCP does not guarantee that.
  • “If I see invalid input, I can keep going.” This creates ambiguous state and security bugs.
  • “Chunked encoding is just another header.” It changes the body framing rules entirely.

Check-your-understanding questions

  1. Why does a streaming parser need to store state between reads?
  2. What delimiter ends the headers section in HTTP/1.1?
  3. How do you know when a chunked body is complete?

Check-your-understanding answers

  1. Because the input may end in the middle of a token, and the parser must resume correctly.
  2. A blank line, represented by CRLF CRLF.
  3. A chunk size of 0 followed by a final CRLF terminates the body.

Real-world applications

  • HTTP servers and reverse proxies
  • SMTP and IMAP parsers
  • Custom binary protocol parsers

Where you’ll apply it

  • In this project: see §4.3 Data Structures and §5.10 Phase 1.

References

  • RFC 7230 (HTTP/1.1 Message Syntax and Routing)
  • “HTTP: The Definitive Guide” by Gourley and Totty (protocol fundamentals)

Key insights

A streaming parser is a deterministic state machine that treats TCP as a raw byte stream.

Summary

Explicit states and precise byte accounting are the only reliable way to parse HTTP over TCP.

Homework/Exercises to practice the concept

  1. Write a parser that accepts only a request line and returns NEED_MORE if incomplete.
  2. Simulate a request arriving in 1-byte chunks and verify the parser never errors.

Solutions to the homework/exercises

  1. Store partial line in a buffer and search for CRLF; only parse when found.
  2. Feed the parser one byte at a time and ensure it returns NEED_MORE until the final CRLF.

2.2 Buffering, Framing, and Backpressure

Fundamentals

A streaming parser is only as good as its buffering strategy. You need a buffer that can accept partial reads, retain unconsumed bytes, and expose complete tokens to the parser. You also need framing rules to determine where one message ends. Without framing, you cannot pipeline requests or safely discard invalid bytes. Backpressure is the strategy that prevents memory blowups: if the body is too large or the consumer is slow, you must stop reading or reject the request. These concerns are just as important as the parsing logic itself.

A reliable buffer strategy also decides ownership and lifetime of parsed data. If you store pointers into a mutable buffer, you must guarantee that those pointers remain valid until the consumer is done. If you copy data, you trade memory for safety. Either way, the buffering decision dictates correctness boundaries: you cannot parse faster than you can store, and you cannot discard bytes until you know they belong to a completed message.

Deep Dive into the concept

Buffer management is a systems problem disguised as string handling. If you read from a socket into a fixed buffer, you must decide what to do when the buffer fills before a full message arrives. The correct approach is to maintain a sliding window or ring buffer. A ring buffer lets you append new bytes while retaining unread bytes and avoids costly memmoves. The parser consumes bytes by advancing a read index. When the buffer runs out of space, you either grow it or apply backpressure by stopping reads until the consumer catches up.

Framing in HTTP is defined by the protocol: headers are terminated by CRLF CRLF, body length is determined by Content-Length or chunked encoding, and connections may contain multiple pipelined requests. The buffer therefore must retain any extra bytes after a message completes, because those bytes belong to the next request. This is a common bug: a naive parser discards extra bytes, losing synchronization. The correct behavior is to return DONE while leaving unread bytes in the buffer, so the next parse call can process them.

Backpressure requires explicit limits. You should define maximum header size, maximum request line length, and maximum body size. If these are exceeded, return a 413 (payload too large) or a parser error. This is both a safety feature and a denial-of-service mitigation. The limits are part of your state machine: when the header buffer grows beyond the limit, transition to ERROR and stop consuming. For body streaming, you can design the parser to emit body chunks to a callback; if the callback signals “slow” you stop reading from the socket until it drains. This design mirrors how real HTTP servers integrate parsing with I/O loops.

Another subtle issue is partial delimiters. Suppose the buffer ends with “\r” but not “\n” yet. You must not treat it as an error; you must wait for more input. This is why the parser must be incremental and why the buffer must preserve raw bytes. You also need to distinguish between “need more bytes” and “invalid sequence” errors. This affects your error reporting and makes the parser predictable.

Finally, consider memory ownership. If you parse headers by pointing into the buffer, those pointers become invalid when the buffer is reused. A safe design copies header names and values into a separate arena or uses offsets into a stable buffer. For this project, you can copy into fixed-size arrays or a simple heap arena. The key is to avoid use-after-free and dangling pointers when the buffer grows or wraps.

Zero-copy parsing is tempting but dangerous unless you control buffer lifetimes. A balanced approach is to keep offsets into a ring buffer for short-lived tokens (like the request line) and copy only when you need long-lived storage (like header fields used by downstream logic). This explicit lifetime rule keeps the parser fast while still safe under partial reads and buffer reuse.

These buffer and framing decisions influence all downstream behavior. Without them, even a correct state machine will fail on real network conditions. They also reinforce the control flow discipline from Project 1: every path that consumes bytes must update indices consistently, or the parser will desynchronize.

How this fits into the project

You will implement a ring or sliding buffer, enforce size limits, and keep leftover bytes for pipelined requests. This is the foundation for correctness under real network conditions.

Definitions & key terms

  • Ring buffer: Circular buffer with head/tail indices for streaming data.
  • Backpressure: Mechanism to slow or stop input when the consumer is overloaded.
  • Framing: Protocol rules that define message boundaries.
  • Pipelining: Multiple requests sent without waiting for responses.

Mental model diagram (ASCII)

[Socket] -> [Ring Buffer] -> [Parser]
              ^     |
              |     v
          Backpressure  (stop reads when full)

How it works (step-by-step)

  1. Read from socket into ring buffer.
  2. Parser consumes bytes and advances read index.
  3. If a request completes, leave remaining bytes for next parse.
  4. If buffer full, stop reading (backpressure).
  5. If limits exceeded, return error and close connection.

Failure modes: buffer overwrite, lost bytes after parse, unbounded memory growth, invalid pointer into reused buffer.

Minimal concrete example

// Ring buffer: r = read index, w = write index (monotonic, reduced mod size)
typedef struct { char buf[8192]; size_t r, w; } Ring;

size_t ring_write(Ring *rb, const char *src, size_t n); // append up to n bytes
size_t ring_read(Ring *rb, char *dst, size_t n);        // consume up to n bytes

Common misconceptions

  • “Headers fit in any small buffer.” Large headers are common and must be bounded.
  • “Discarding extra bytes is fine.” It breaks pipelining and desynchronizes streams.

Check-your-understanding questions

  1. Why should leftover bytes remain in the buffer after parsing a request?
  2. What is the difference between size limits and backpressure?
  3. Why can pointer-based header storage be unsafe?

Check-your-understanding answers

  1. Because they may belong to the next pipelined request.
  2. Limits enforce maximum size, backpressure controls rate when near capacity.
  3. The underlying buffer may move or wrap, invalidating pointers.

Real-world applications

  • HTTP servers, proxies, and load balancers
  • Streaming parsers for log ingestion

Where you’ll apply it

  • In this project: see §3.5 Data Formats, §4.2 Key Components, and §5.10 Phase 2.
  • Also used in: P03-connection-pool.md (backpressure and limits).

References

  • RFC 7230 (framing rules)
  • “UNIX Network Programming” by Stevens (socket buffering)

Key insights

Buffering and framing turn a byte stream into safe, deterministic message boundaries.

Summary

A correct parser depends on disciplined buffering, explicit limits, and careful handling of leftover bytes.

Homework/Exercises to practice the concept

  1. Implement a ring buffer that supports partial reads and writes.
  2. Add a maximum header size limit and test rejection behavior.

Solutions to the homework/exercises

  1. Track read/write indices modulo buffer size and ensure no overwrite.
  2. Count header bytes and return error once limit exceeded.

3. Project Specification

3.1 What You Will Build

A streaming HTTP/1.1 request parser library with a small CLI tool (httpdump). The parser consumes byte chunks, emits parsed request structures, and handles partial reads. The CLI can read from stdin or a file and prints a parsed summary.

Included:

  • Request line + header parsing
  • Fixed-length and chunked body handling
  • Size limits and error reporting

Excluded:

  • Full HTTP server response generation
  • TLS

3.2 Functional Requirements

  1. Streaming API: parse_chunk(buf,len) returns NEED_MORE, DONE, or ERROR.
  2. Request line parsing: method, path, version validated.
  3. Header parsing: case-insensitive names, value trimming.
  4. Body parsing: Content-Length or chunked encoding.
  5. Limits: max request line, max headers, max body.
  6. Error codes: deterministic error codes with messages.
  7. CLI tool: reads from stdin or file and prints parsed structure.

3.3 Non-Functional Requirements

  • Performance: O(n) parsing per request.
  • Reliability: never segfault on malformed input.
  • Usability: clear error messages and exit codes.

3.4 Example Usage / Output

$ cat req.txt | ./httpdump
METHOD: GET
PATH: /index.html
VERSION: HTTP/1.1
Headers:
  Host: example.com
  Accept: */*
Body: (0 bytes)

3.5 Data Formats / Schemas / Protocols

  • Protocol: HTTP/1.1 as defined in RFC 7230.
  • CLI output (text): human readable summary.
  • CLI output (json): --json prints machine-readable structure.

Example JSON error shape:

{ "error": { "code": "INVALID_HEADER", "message": "Missing ':' in header" } }

3.6 Edge Cases

  • CRLF split across reads
  • Multiple requests pipelined in one buffer
  • Chunked body with extensions
  • Headers larger than limit

3.7 Real World Outcome

Deterministic inputs are provided; no timestamps.

3.7.1 How to Run (Copy/Paste)

cc -std=c11 -Wall -Wextra -O2 -o httpdump src/httpdump.c
./httpdump < samples/get.req

3.7.2 Golden Path Demo (Deterministic)

printf "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n" | ./httpdump

Expected output shows method GET, path /, HTTP/1.1, and Host header.

3.7.3 CLI Transcript (Success + Failure)

$ printf "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n" | ./httpdump
METHOD: GET
PATH: /
VERSION: HTTP/1.1
Headers:
  Host: example.com
Body: (0 bytes)
$ echo $?
0

$ printf "GET / HTTP/1.1\r\nBadHeader\r\n\r\n" | ./httpdump
ERROR: INVALID_HEADER (Missing ':' in header)
$ echo $?
1

Exit codes:

  • 0 success
  • 1 protocol error
  • 2 system error (I/O or allocation failure)

4. Solution Architecture

4.1 High-Level Design

[Socket/STDIN] -> [Ring Buffer] -> [Parser State Machine] -> [Request Struct]
                                             |
                                             v
                                        [Error Codes]

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Ring Buffer | Store partial bytes | Avoid memmove, keep leftovers |
| Parser | State machine for HTTP sections | Explicit states and limits |
| Request Model | Store parsed fields | Copy header values to stable storage |
| CLI | Read input, call parser, print results | Text + JSON output modes |

4.3 Data Structures (No Full Code)

typedef enum { ST_REQ, ST_HEADERS, ST_BODY_LEN, ST_BODY_CHUNK, ST_DONE, ST_ERR } State;

typedef struct {
    State st;
    size_t content_len;
    size_t chunk_len;
} HttpParser;

4.4 Algorithm Overview

Key Algorithm: Incremental Parse Loop

  1. Read bytes into ring buffer.
  2. While parser can advance, consume tokens.
  3. If DONE, emit request and keep leftover bytes.
  4. If ERROR, stop and report.

Complexity Analysis:

  • Time: O(n) per request
  • Space: O(n) for buffers and parsed headers

5. Implementation Guide

5.1 Development Environment Setup

cc --version

5.2 Project Structure

project-root/
├── src/
│   ├── parser.c
│   ├── parser.h
│   └── httpdump.c
├── tests/
│   └── test_parser.c
└── Makefile

5.3 The Core Question You’re Answering

“How can I parse a protocol correctly when I never know how many bytes I’ll receive?”

5.4 Concepts You Must Understand First

  1. HTTP/1.1 request grammar (request line, headers, body)
  2. Streaming parsing (state machine, partial reads)
  3. Buffering strategies (ring buffer)

5.5 Questions to Guide Your Design

  1. How will you avoid losing leftover bytes?
  2. How will you handle headers split across reads?
  3. What are your size limits and why?

5.6 Thinking Exercise

Simulate a request that arrives in 3 chunks: "GET /", " HTTP/1.1\r\nHost: a", and "b.com\r\n\r\n". Which states does the parser visit?

5.7 The Interview Questions They’ll Ask

  1. “Why is HTTP parsing a state machine problem?”
  2. “How do you handle chunked encoding?”
  3. “How do you avoid desynchronization after a malformed request?”

5.8 Hints in Layers

Hint 1: Start with parsing only the request line.

Hint 2: Add header parsing until CRLF CRLF.

Hint 3: Implement fixed-length body first, chunked later.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| HTTP protocol | “HTTP: The Definitive Guide” | Ch. 3-5 |
| Network I/O | “UNIX Network Programming” | Ch. 5 |
| Error handling | “Effective C” | Ch. 8 |

5.10 Implementation Phases

Phase 1: Request Line (2-3 days)

  • Parse method/path/version with CRLF.
  • Return NEED_MORE if incomplete.

Phase 2: Headers + Fixed Body (4-5 days)

  • Parse headers, enforce limits.
  • Implement Content-Length bodies.

Phase 3: Chunked Body + CLI (3-4 days)

  • Parse chunked bodies and build CLI output.
  • Add JSON output and error codes.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Buffer type | Sliding window, ring | Ring | Efficient and simple |
| Header storage | Pointers, copies | Copies | Safer lifetime management |
| Error handling | errno, enum codes | Enum codes | Deterministic and testable |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | Parsing tokens | request line, header line |
| Integration Tests | Full request parse | simple GET, POST |
| Edge Case Tests | Partial reads | split CRLF |

6.2 Critical Test Cases

  1. CRLF split across buffers
  2. Chunked body with multiple chunks
  3. Oversized headers rejected

6.3 Test Data

GET / HTTP/1.1\r\nHost: x\r\n\r\n
POST / HTTP/1.1\r\nContent-Length: 4\r\n\r\ndata

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| Dropping leftover bytes | Broken pipelining | Keep unread bytes in buffer |
| Treating partial CRLF as error | Random failures | Return NEED_MORE |
| Not enforcing limits | Memory blowups | Add size caps |

7.2 Debugging Strategies

  • Log state transitions and bytes consumed.
  • Write tests that split input at every byte boundary.

7.3 Performance Traps

Excessive copying of buffers can be O(n^2). Use a ring buffer.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add support for absolute-form URLs.
  • Support HTTP/1.0 requests.

8.2 Intermediate Extensions

  • Emit parsed tokens as events instead of building full struct.
  • Implement header normalization and folding.

8.3 Advanced Extensions

  • Implement an HTTP response parser.
  • Integrate with a tiny HTTP server loop.

9. Real-World Connections

9.1 Industry Applications

  • HTTP servers, reverse proxies, and API gateways rely on these exact parsing rules.
  • http-parser: A classic streaming HTTP parser.
  • nginx: Production-grade HTTP parser and event loop.

9.2 Interview Relevance

  • State machines and streaming parsing are common systems interview topics.

10. Resources

10.1 Essential Reading

  • RFC 7230 (HTTP/1.1)
  • “HTTP: The Definitive Guide”

10.2 Video Resources

  • Protocol parsing lectures (search for HTTP parser walkthroughs)

10.3 Tools & Documentation

  • curl and netcat for generating requests

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why HTTP parsing is incremental.
  • I can explain how chunked encoding works.
  • I can explain how leftover bytes are handled.

11.2 Implementation

  • All functional requirements are met.
  • All test cases pass.
  • Error codes are deterministic.

11.3 Growth

  • I can describe parser states without notes.
  • I can explain how I handled partial reads.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Parses request line and headers correctly.
  • Handles partial reads with NEED_MORE.
  • Reports errors for malformed input.

Full Completion:

  • Supports Content-Length and chunked bodies.
  • Enforces size limits and returns proper exit codes.

Excellence (Going Above & Beyond):

  • Event-driven parsing with streaming callbacks.
  • Integration with a minimal HTTP server loop.

13. Additional Content Rules (Compliance)

  • Deterministic demos provided in §3.7.
  • Failure demo with exit codes included.
  • Cross-links provided in §2.1 and §2.2.