Project 2: HTTP/1.1 Protocol Parser
Build a streaming HTTP/1.1 parser driven by a finite state machine.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1-2 weeks |
| Language | C |
| Prerequisites | String parsing, sockets basics |
| Key Topics | State machines, incremental parsing, error recovery |
1. Learning Objectives
By completing this project, you will:
- Model HTTP parsing as a deterministic state machine.
- Parse requests incrementally across partial reads.
- Enforce protocol correctness with clear error states.
- Separate parsing logic from I/O and storage.
2. Theoretical Foundation
2.1 Core Concepts
- Streaming parsing: Input may arrive in chunks; parser must preserve state.
- Deterministic FSM: Each byte drives a transition; invalid bytes hit error states.
- Protocol invariants: Method, version, and headers must obey format rules.
2.2 Why This Matters
Network protocols are all state machines. A robust parser prevents security bugs and undefined behavior.
2.3 Historical Context / Background
HTTP/1.1 is line-based and forgiving in the wild, but correct servers still enforce strict rules to avoid request smuggling.
2.4 Common Misconceptions
- “We can parse with strtok”: Streaming input breaks this approach.
- “Headers always arrive in one read”: They do not.
3. Project Specification
3.1 What You Will Build
A parser that accepts byte chunks and outputs a structured request (method, path, headers, body) or a clear error state.
3.2 Functional Requirements
- Parse request line:
METHOD SP PATH SP HTTP/1.1. - Parse headers until CRLF CRLF.
- Support
Content-Lengthbody parsing. - Handle partial input by returning “need more data.”
3.3 Non-Functional Requirements
- Reliability: Invalid input must not crash the parser.
- Performance: Avoid copying data unnecessarily.
- Security: Enforce header limits.
3.4 Example Usage / Output
feed("GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n")
-> request(method=GET, path=/index.html, headers=1)
3.5 Real World Outcome
A streaming parser that consumes data from recv() in chunks and yields a request object when complete.
[parser] state=HEADERS
[parser] need more data
[parser] state=BODY
[parser] request complete
4. Solution Architecture
4.1 High-Level Design
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ socket recv │────▶│ FSM parser │────▶│ request obj │
└──────────────┘ └──────────────┘ └──────────────┘
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| FSM core | Parse input incrementally | Explicit state enum |
| Buffer | Store partial tokens | Ring buffer or pointer offsets |
| Validator | Enforce rules | Header limits |
4.3 Data Structures
typedef enum {
ST_METHOD,
ST_PATH,
ST_VERSION,
ST_HEADER_KEY,
ST_HEADER_VALUE,
ST_BODY,
ST_DONE,
ST_ERROR
} ParserState;
4.4 Algorithm Overview
Key Algorithm: Byte-driven FSM
- Consume bytes and transition states.
- Accumulate tokens for method/path/headers.
- On CRLF CRLF, determine body handling.
Complexity Analysis:
- Time: O(N) for N bytes.
- Space: O(H) for headers/body.
5. Implementation Guide
5.1 Development Environment Setup
gcc --version
5.2 Project Structure
project-root/
├── src/
│ ├── parser.c
│ ├── parser.h
│ └── main.c
└── Makefile
5.3 The Core Question You’re Answering
“What state am I in, and what bytes are legal next?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- HTTP/1.1 grammar
- Request line and header formats
- FSM design
- Explicit transitions and error states
- Incremental parsing
- Partial tokens across buffers
5.5 Questions to Guide Your Design
- How will you store partial tokens across chunks?
- What is the maximum header size you allow?
- How do you signal “need more data” to the caller?
5.6 Thinking Exercise
How would you extend this to parse Transfer-Encoding: chunked?
5.7 The Interview Questions They’ll Ask
- Why is incremental parsing required for network protocols?
- What is the state machine for parsing headers?
- How do you prevent request smuggling?
5.8 Hints in Layers
Hint 1: Start with request line only
- Parse
METHOD SP PATH SP VERSION.
Hint 2: Add headers
- Parse until CRLF CRLF.
Hint 3: Add body
- Respect
Content-Length.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Protocol parsing | “HTTP/1.1 RFC 7230” | ABNF sections |
| State machines | “Design Patterns” | State pattern |
5.10 Implementation Phases
Phase 1: Foundation (3-4 days)
Goals:
- Parse request line.
Tasks:
- Build FSM states for method/path/version.
- Validate request line format.
Checkpoint: Valid request line parsed correctly.
Phase 2: Core Functionality (4-5 days)
Goals:
- Parse headers and end-of-headers.
Tasks:
- Add header key/value parsing.
- Detect CRLF CRLF.
Checkpoint: Headers stored and parsed.
Phase 3: Polish & Edge Cases (2-3 days)
Goals:
- Body parsing and error handling.
Tasks:
- Read body based on
Content-Length. - Implement error states for invalid input.
Checkpoint: Parser returns DONE or ERROR deterministically.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Buffering | Copy vs zero-copy | Offsets in input | Efficiency |
| Limits | None vs strict | Strict limits | Security |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Valid requests | Correct parsing | GET with headers |
| Partial input | Streaming correctness | Split across chunks |
| Invalid input | Error state | Bad method token |
6.2 Critical Test Cases
- Request line split across two reads.
- Header without colon triggers error.
- Body parsed exactly by Content-Length.
6.3 Test Data
GET / HTTP/1.1\r\nHost: a\r\n\r\n
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Assuming full request | Missing data | Return NEED_MORE |
| Unbounded headers | Memory blowup | Enforce limits |
| CRLF handling | Parsing stuck | Explicit CRLF states |
7.2 Debugging Strategies
- Log state transitions with byte offsets.
- Add tests for partial inputs.
7.3 Performance Traps
Copying every token into new strings can be expensive; use slices where possible.
8. Extensions & Challenges
8.1 Beginner Extensions
- Parse query strings.
8.2 Intermediate Extensions
- Support chunked encoding.
8.3 Advanced Extensions
- Parse multiple pipelined requests.
9. Real-World Connections
9.1 Industry Applications
- Web servers and proxies rely on robust parsers.
- Security: Request smuggling prevention.
9.2 Related Open Source Projects
- http-parser: Classic C HTTP parser.
- nghttp2: HTTP/2 parsing library.
9.3 Interview Relevance
- Demonstrates FSM design and streaming parsing.
10. Resources
10.1 Essential Reading
- RFC 7230 for HTTP/1.1 grammar.
10.2 Video Resources
- Search: “HTTP parser FSM”.
10.3 Tools & Documentation
- curl: generate test requests.
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why streaming parsing is required.
- I can draw the parser state diagram.
11.2 Implementation
- Parser handles partial input correctly.
- Invalid input goes to ERROR state.
11.3 Growth
- I can apply FSM parsing to other protocols.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Parse request line and headers.
Full Completion:
- Parse body by Content-Length with proper error handling.
Excellence (Going Above & Beyond):
- Support chunked encoding or pipelining.
This guide was generated from SPRINT_3_CONTROL_FLOW_STATE_PROJECTS.md. For the complete learning path, see the parent directory README.