Project 2: HTTP/1.1 Protocol Parser

Build a streaming HTTP/1.1 parser driven by a finite state machine.

Quick Reference

Attribute Value
Difficulty Intermediate
Time Estimate 1-2 weeks
Language C
Prerequisites String parsing, sockets basics
Key Topics State machines, incremental parsing, error recovery

1. Learning Objectives

By completing this project, you will:

  1. Model HTTP parsing as a deterministic state machine.
  2. Parse requests incrementally across partial reads.
  3. Enforce protocol correctness with clear error states.
  4. Separate parsing logic from I/O and storage.

2. Theoretical Foundation

2.1 Core Concepts

  • Streaming parsing: Input may arrive in chunks; parser must preserve state.
  • Deterministic FSM: Each byte drives a transition; invalid bytes hit error states.
  • Protocol invariants: Method, version, and headers must obey format rules.

2.2 Why This Matters

Network protocols are all state machines. A robust parser prevents security bugs and undefined behavior.

2.3 Historical Context / Background

HTTP/1.1 is line-based and forgiving in the wild, but correct servers still enforce strict rules to avoid request smuggling.

2.4 Common Misconceptions

  • “We can parse with strtok”: Streaming input breaks this approach.
  • “Headers always arrive in one read”: They do not.

3. Project Specification

3.1 What You Will Build

A parser that accepts byte chunks and outputs a structured request (method, path, headers, body) or a clear error state.

3.2 Functional Requirements

  1. Parse request line: METHOD SP PATH SP HTTP/1.1.
  2. Parse headers until CRLF CRLF.
  3. Support Content-Length body parsing.
  4. Handle partial input by returning “need more data.”

3.3 Non-Functional Requirements

  • Reliability: Invalid input must not crash the parser.
  • Performance: Avoid copying data unnecessarily.
  • Security: Enforce header limits.

3.4 Example Usage / Output

feed("GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n")
-> request(method=GET, path=/index.html, headers=1)

3.5 Real World Outcome

A streaming parser that consumes data from recv() in chunks and yields a request object when complete.

[parser] state=HEADERS
[parser] need more data
[parser] state=BODY
[parser] request complete

4. Solution Architecture

4.1 High-Level Design

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ socket recv  │────▶│ FSM parser   │────▶│ request obj  │
└──────────────┘     └──────────────┘     └──────────────┘

4.2 Key Components

Component Responsibility Key Decisions
FSM core Parse input incrementally Explicit state enum
Buffer Store partial tokens Ring buffer or pointer offsets
Validator Enforce rules Header limits

4.3 Data Structures

typedef enum {
    ST_METHOD,
    ST_PATH,
    ST_VERSION,
    ST_HEADER_KEY,
    ST_HEADER_VALUE,
    ST_BODY,
    ST_DONE,
    ST_ERROR
} ParserState;

4.4 Algorithm Overview

Key Algorithm: Byte-driven FSM

  1. Consume bytes and transition states.
  2. Accumulate tokens for method/path/headers.
  3. On CRLF CRLF, determine body handling.

Complexity Analysis:

  • Time: O(N) for N bytes.
  • Space: O(H) for headers/body.

5. Implementation Guide

5.1 Development Environment Setup

gcc --version

5.2 Project Structure

project-root/
├── src/
│   ├── parser.c
│   ├── parser.h
│   └── main.c
└── Makefile

5.3 The Core Question You’re Answering

“What state am I in, and what bytes are legal next?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. HTTP/1.1 grammar
    • Request line and header formats
  2. FSM design
    • Explicit transitions and error states
  3. Incremental parsing
    • Partial tokens across buffers

5.5 Questions to Guide Your Design

  1. How will you store partial tokens across chunks?
  2. What is the maximum header size you allow?
  3. How do you signal “need more data” to the caller?

5.6 Thinking Exercise

How would you extend this to parse Transfer-Encoding: chunked?

5.7 The Interview Questions They’ll Ask

  1. Why is incremental parsing required for network protocols?
  2. What is the state machine for parsing headers?
  3. How do you prevent request smuggling?

5.8 Hints in Layers

Hint 1: Start with request line only

  • Parse METHOD SP PATH SP VERSION.

Hint 2: Add headers

  • Parse until CRLF CRLF.

Hint 3: Add body

  • Respect Content-Length.

5.9 Books That Will Help

Topic Book Chapter
Protocol parsing “HTTP/1.1 RFC 7230” ABNF sections
State machines “Design Patterns” State pattern

5.10 Implementation Phases

Phase 1: Foundation (3-4 days)

Goals:

  • Parse request line.

Tasks:

  1. Build FSM states for method/path/version.
  2. Validate request line format.

Checkpoint: Valid request line parsed correctly.

Phase 2: Core Functionality (4-5 days)

Goals:

  • Parse headers and end-of-headers.

Tasks:

  1. Add header key/value parsing.
  2. Detect CRLF CRLF.

Checkpoint: Headers stored and parsed.

Phase 3: Polish & Edge Cases (2-3 days)

Goals:

  • Body parsing and error handling.

Tasks:

  1. Read body based on Content-Length.
  2. Implement error states for invalid input.

Checkpoint: Parser returns DONE or ERROR deterministically.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Buffering Copy vs zero-copy Offsets in input Efficiency
Limits None vs strict Strict limits Security

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Valid requests Correct parsing GET with headers
Partial input Streaming correctness Split across chunks
Invalid input Error state Bad method token

6.2 Critical Test Cases

  1. Request line split across two reads.
  2. Header without colon triggers error.
  3. Body parsed exactly by Content-Length.

6.3 Test Data

GET / HTTP/1.1\r\nHost: a\r\n\r\n

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Assuming full request Missing data Return NEED_MORE
Unbounded headers Memory blowup Enforce limits
CRLF handling Parsing stuck Explicit CRLF states

7.2 Debugging Strategies

  • Log state transitions with byte offsets.
  • Add tests for partial inputs.

7.3 Performance Traps

Copying every token into new strings can be expensive; use slices where possible.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Parse query strings.

8.2 Intermediate Extensions

  • Support chunked encoding.

8.3 Advanced Extensions

  • Parse multiple pipelined requests.

9. Real-World Connections

9.1 Industry Applications

  • Web servers and proxies rely on robust parsers.
  • Security: Request smuggling prevention.
  • http-parser: Classic C HTTP parser.
  • nghttp2: HTTP/2 parsing library.

9.3 Interview Relevance

  • Demonstrates FSM design and streaming parsing.

10. Resources

10.1 Essential Reading

  • RFC 7230 for HTTP/1.1 grammar.

10.2 Video Resources

  • Search: “HTTP parser FSM”.

10.3 Tools & Documentation

  • curl: generate test requests.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why streaming parsing is required.
  • I can draw the parser state diagram.

11.2 Implementation

  • Parser handles partial input correctly.
  • Invalid input goes to ERROR state.

11.3 Growth

  • I can apply FSM parsing to other protocols.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Parse request line and headers.

Full Completion:

  • Parse body by Content-Length with proper error handling.

Excellence (Going Above & Beyond):

  • Support chunked encoding or pipelining.

This guide was generated from SPRINT_3_CONTROL_FLOW_STATE_PROJECTS.md. For the complete learning path, see the parent directory README.