Project 3: JSON Parser Library
Build a safe, RFC-8259-compatible JSON parser in C with explicit ownership and precise error reporting.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | ~2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Zig |
| Coolness Level | Level 7 - parsing boundary mastery |
| Business Potential | Level 5 - embedded JSON parser |
| Prerequisites | Pointers, structs, recursion, error handling |
| Key Topics | Lexing/parsing, ownership, error models |
1. Learning Objectives
By completing this project, you will:
- Implement a JSON lexer and recursive-descent parser in C.
- Design a JSON value tree with safe ownership and cleanup.
- Build a precise error model with line/column reporting.
- Enforce safety limits to prevent stack overflow and OOM.
- Provide a stable, misuse-resistant parsing API.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Parsing & Grammar (Lexer -> Parser)
Fundamentals
Parsing is the process of turning raw text into structured data. JSON parsing typically uses a two-stage pipeline: a lexer (tokenizer) that converts characters into tokens like {, STRING, and NUMBER, and a parser that consumes tokens according to a grammar. A recursive-descent parser is straightforward for JSON because the grammar is simple and mostly unambiguous. The key boundary is that the parser must accept all valid JSON and reject invalid input with useful errors. In C, this requires careful character handling, explicit state management, and defensive checks for end-of-input.
Deep Dive into the Concept
JSON is defined by RFC 8259. The grammar is small but precise: an object is { members }, members are key/value pairs separated by commas, arrays are [ values ], strings allow escapes, numbers have strict format rules, and the only literals are true, false, and null. A correct parser must implement these rules, including edge cases like leading zeros, exponent notation, and escape sequences (\uXXXX). Many naive parsers accept invalid JSON such as trailing commas or unescaped control characters; your parser must decide whether to be strict (recommended) or permissive (risky). For a boundary-focused project, strict is safer: it forces callers to send valid data and surfaces errors early.
A lexer reads characters and emits tokens with associated data (e.g., string value, number text). It should track line and column positions for error reporting. Each time it consumes a character, it updates position: \n increments line and resets column; others increment column. The lexer must also detect invalid escapes and control characters inside strings. A common pitfall is forgetting that JSON strings may contain escaped Unicode sequences. You can either decode \uXXXX into UTF-8 or store the raw escape sequence and decode later; both are valid as long as the contract is clear. For simplicity, you can decode into UTF-8 in the lexer and store real bytes in the value tree.
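The position-tracking rule above can be sketched as a small consume helper. This is a sketch, not a fixed API; the `lexer` layout and the `lx_next` name are illustrative.

```c
#include <stddef.h>

typedef struct {
    const char *src;   /* input text, NUL-terminated */
    size_t pos;        /* index of the next byte to read */
    int line;          /* 1-based line of the next byte */
    int col;           /* 1-based column of the next byte */
} lexer;

/* Consume one character, updating line/col; returns '\0' at end of input. */
static char lx_next(lexer *lx) {
    char c = lx->src[lx->pos];
    if (c == '\0') return '\0';        /* do not advance past the end */
    lx->pos++;
    if (c == '\n') { lx->line++; lx->col = 1; }  /* newline: next line, column 1 */
    else           { lx->col++; }                /* any other byte: next column */
    return c;
}
```

Every token then records the line/col at which it started, so errors can point at the exact location.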
A recursive-descent parser processes tokens with functions like parse_value, parse_object, and parse_array. Each function checks current token, consumes it, and returns a json_value*. Depth control is critical: JSON allows nesting, and a malicious input like [[[[... can cause stack overflow. Track recursion depth and enforce a maximum (e.g., 256). This is part of the security boundary.
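The depth guard works by threading a counter through every recursive call. This toy version parses only bracket nesting so the guard stays visible; a real `parse_array`/`parse_object` would apply the same `depth + 1` pattern.

```c
#include <stdbool.h>

#define MAX_DEPTH 256

/* Toy recursive descent over '[' ... ']' nesting only: each nested value
 * re-enters with depth + 1, and the parse fails closed once MAX_DEPTH
 * is exceeded, instead of overflowing the call stack. */
static bool parse_nested(const char **p, int depth) {
    if (depth > MAX_DEPTH) return false;     /* refuse hostile deep input */
    if (**p == '[') {
        (*p)++;                              /* consume '[' */
        if (!parse_nested(p, depth + 1)) return false;
        if (**p != ']') return false;        /* unmatched bracket */
        (*p)++;                              /* consume ']' */
        return true;
    }
    return **p == '\0' || **p == ']';        /* empty value position is fine here */
}
```

An input of 300 nested brackets is rejected at depth 257 without ever touching deeper stack frames.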
Numbers require careful parsing. JSON does not allow NaN or Infinity. It allows integers and floating-point with optional exponent. You can parse numbers as double, but this can lose precision for large integers. An alternative is to store the number as a string and let the caller decide. A pragmatic API can store both: raw string plus a parsed double, with a flag indicating precision risk. The design choice must be documented because it affects how callers interpret the data.
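A hedged sketch of the "raw string plus parsed double plus precision flag" design described above; the type, field, and function names are assumptions, and the lossy check is a conservative heuristic (any magnitude at or beyond 2^53 is flagged as at risk).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* One way to store both representations (names are illustrative). */
typedef struct {
    char raw[64];      /* the number's original JSON text (truncated if huge) */
    double approx;     /* best-effort double value */
    bool lossy;        /* true if the double may not round-trip exactly */
} json_number;

static bool number_from_text(const char *text, json_number *out) {
    errno = 0;
    char *end;
    double d = strtod(text, &end);
    if (end == text || errno == ERANGE) return false;  /* malformed or overflow */
    snprintf(out->raw, sizeof out->raw, "%s", text);
    out->approx = d;
    double mag = d < 0 ? -d : d;
    /* Conservative: integers at or beyond 2^53 cannot all round-trip. */
    out->lossy = mag >= 9007199254740992.0;  /* 2^53 */
    return true;
}
```

Callers that care about exactness can fall back to `raw`; everyone else reads `approx`.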
Parsing must also handle input termination and trailing data. After parsing the root value, the parser should skip whitespace and then ensure there are no extra characters. If extra data exists, return an error like “trailing characters after root value.” This prevents accepting concatenated JSON messages inadvertently, which can be a security issue in some contexts.
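The trailing-data rule reduces to a tiny check run after the root value has been parsed (the helper name is illustrative). Note that JSON whitespace per RFC 8259 is only space, tab, CR, and LF, not everything `isspace()` accepts.

```c
#include <stdbool.h>

/* After the root value, only JSON whitespace may remain; anything else is
 * "trailing characters after root value". */
static bool only_trailing_ws(const char *p) {
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n') p++;
    return *p == '\0';
}
```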
How This Fits in This Project
The lexer/parser pipeline is the core of this project and drives your data structures (Sec. 3.5), error model (Sec. 3.7), and test strategy (Sec. 6.2). It also teaches boundary validation needed for Project 5’s HTTP parsing. Also used in: Project 5.
Definitions & Key Terms
- Lexer/Tokenizer -> Converts characters to tokens.
- Recursive descent -> Parser using recursive functions per grammar rule.
- Token -> Classified unit like STRING or NUMBER.
- Grammar -> Rules defining valid JSON structures.
Mental Model Diagram (ASCII)
text --> [lexer] --> tokens --> [parser] --> json_value tree
         (line/col)                          (owned)
How It Works (Step-by-Step)
- Lexer reads characters and emits tokens.
- Parser calls `parse_value` based on the current token.
- Objects/arrays recurse into nested values.
- Parser checks for EOF and trailing data.
- On error, parser returns NULL and sets `json_error`.
Minimal Concrete Example
json_value* parse_value(lexer *lx, json_error *err) {
    switch (lx->tok.type) {
    case TOK_LBRACE:   return parse_object(lx, err);
    case TOK_LBRACKET: return parse_array(lx, err);
    case TOK_STRING:   return make_string(lx->tok.str);
    case TOK_NUMBER:   return make_number(lx->tok.num);
    case TOK_TRUE:     return make_bool(1);
    case TOK_FALSE:    return make_bool(0);
    case TOK_NULL:     return make_null();
    default:           return set_error(err, "expected value"); /* returns NULL */
    }
}
Common Misconceptions
- “Whitespace doesn’t matter, so ignore it completely.” -> You must track newlines for line/col.
- “JSON numbers are just `double`.” -> Large integers can lose precision.
- “Trailing commas are okay.” -> Not valid in strict JSON.
Check-Your-Understanding Questions
- Why do you need a lexer for JSON parsing?
- What is the risk of unbounded recursion?
- Why should you reject trailing characters?
Check-Your-Understanding Answers
- It simplifies parsing and improves error reporting.
- Deep nesting can cause stack overflow.
- It can hide concatenated or malicious input.
Real-World Applications
- Configuration file parsers.
- API response decoding.
- Embedded device configuration tools.
Where You’ll Apply It
- In this project: Sec. 5.10 Phase 2 (parser), Sec. 6.2 (invalid input tests).
- Also used in: Project 5.
References
- RFC 8259 (JSON specification)
- “C Interfaces and Implementations” - Ch. 3
Key Insight
Parsing is boundary enforcement: every accepted character is a security decision.
Summary
A correct JSON parser requires a lexer, a recursive-descent parser, and strict adherence to the grammar. The boundary is defined by what you accept and what you reject.
Homework/Exercises to Practice the Concept
- Write a lexer that tokenizes `{`, `}`, `[`, `]`, strings, numbers, and literals.
- Add line/column tracking and verify errors point to the correct location.
- Enforce a maximum nesting depth.
Solutions to the Homework/Exercises
- Use a `switch` on the current character and emit tokens.
- Update line/col counters on every consumed char.
- Pass a depth counter into parse functions and reject when exceeded.
2.2 Memory Representation & Ownership of JSON Trees
Fundamentals
A JSON parser returns a tree of values: objects, arrays, strings, numbers, booleans, and null. In C, you must explicitly allocate and free this tree. Ownership rules determine who frees the tree and how strings are stored. A good API returns a single root pointer that the caller owns and must free with a dedicated json_free() function. Strings can be stored as owned copies or as slices into the input buffer. Owned copies are safer because they decouple the tree’s lifetime from the input string.
Deep Dive into the Concept
A JSON tree is a heterogeneous structure. A common representation is a tagged union:
typedef enum { JSON_NULL, JSON_BOOL, JSON_NUMBER, JSON_STRING, JSON_ARRAY, JSON_OBJECT } json_type;
typedef struct json_value {
json_type type;
union {
double num;
char *str;
struct { struct json_value **items; size_t len; } array;
struct { char **keys; struct json_value **values; size_t len; } object;
int boolean;
} u;
} json_value;
This structure requires careful ownership management. If json_parse returns a json_value*, the caller must be able to free the entire tree without knowing its internals. That means you must implement json_free recursively: free children, free arrays of pointers, free strings, then free the node. The contract should be explicit: the caller owns the tree and must call json_free exactly once. The parser owns nothing after returning.
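A recursive `json_free` for that tagged union might look like this; the type definitions are repeated so the sketch is self-contained.

```c
#include <stdlib.h>

/* Tagged union repeated from above so this sketch compiles standalone. */
typedef enum { JSON_NULL, JSON_BOOL, JSON_NUMBER, JSON_STRING, JSON_ARRAY, JSON_OBJECT } json_type;
typedef struct json_value {
    json_type type;
    union {
        double num;
        char *str;
        struct { struct json_value **items; size_t len; } array;
        struct { char **keys; struct json_value **values; size_t len; } object;
        int boolean;
    } u;
} json_value;

/* Free children first, then the node's own buffers, then the node. */
void json_free(json_value *v) {
    if (!v) return;                          /* NULL is a no-op, like free() */
    switch (v->type) {
    case JSON_STRING:
        free(v->u.str);
        break;
    case JSON_ARRAY:
        for (size_t i = 0; i < v->u.array.len; i++)
            json_free(v->u.array.items[i]);
        free(v->u.array.items);
        break;
    case JSON_OBJECT:
        for (size_t i = 0; i < v->u.object.len; i++) {
            free(v->u.object.keys[i]);
            json_free(v->u.object.values[i]);
        }
        free(v->u.object.keys);
        free(v->u.object.values);
        break;
    default:                                 /* null, bool, number own nothing */
        break;
    }
    free(v);
}
```

The order matters: children before containers, containers before the node itself, and each allocation freed exactly once.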
String ownership is a key choice. If you store slices into the input buffer, you avoid allocations and speed parsing, but you force the caller to keep the original input alive. This is easy to misuse. For a boundary-safety project, storing owned copies is the better default. It also makes the tree portable and allows the caller to free the input immediately. The downside is more allocations; you can mitigate this by allocating from a region/arena and freeing the entire arena at once when json_free is called.
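A minimal arena along those lines: every parser allocation is recorded, and the whole tree is released in one pass. The names (`arena_alloc`, `arena_strdup`, `arena_free_all`) are illustrative, not a fixed API.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    void **ptrs;        /* every allocation handed out so far */
    size_t len, cap;
} arena;

/* Allocate n bytes and remember the pointer for bulk cleanup. */
static void *arena_alloc(arena *a, size_t n) {
    if (a->len == a->cap) {
        size_t cap = a->cap ? a->cap * 2 : 16;
        void **p = realloc(a->ptrs, cap * sizeof *p);
        if (!p) return NULL;
        a->ptrs = p;
        a->cap = cap;
    }
    void *mem = malloc(n);
    if (mem) a->ptrs[a->len++] = mem;
    return mem;
}

/* Owned string copy, tracked by the arena. */
static char *arena_strdup(arena *a, const char *s) {
    char *copy = arena_alloc(a, strlen(s) + 1);
    if (copy) strcpy(copy, s);
    return copy;
}

/* Release everything the parser allocated, in one call. */
static void arena_free_all(arena *a) {
    for (size_t i = 0; i < a->len; i++) free(a->ptrs[i]);
    free(a->ptrs);
    a->ptrs = NULL;
    a->len = a->cap = 0;
}
```

With this design, `json_free` can simply call `arena_free_all`, and the error path (free everything built so far) becomes the same one-liner.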
Arrays and objects require dynamic resizing. You can push items into a growable vector. This means you must handle allocation failures gracefully: if you cannot allocate the next element, you must free everything built so far and return a parse error. This is where ownership and error models intersect. A clean approach is to centralize allocations and to have a single cleanup function called on error.
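The growable vector with fail-closed allocation can be sketched as follows (names are illustrative). On failure the vector is left intact so the caller can walk it, free everything built so far, and return a parse error.

```c
#include <stdlib.h>

/* Growable pointer vector for array/object members. */
typedef struct {
    void **items;
    size_t len, cap;
} vec;

/* Push one item; returns 0 on success, -1 on allocation failure. */
static int vec_push(vec *v, void *item) {
    if (v->len == v->cap) {
        size_t cap = v->cap ? v->cap * 2 : 8;
        void **p = realloc(v->items, cap * sizeof *p);
        if (!p) return -1;   /* old buffer still valid: caller cleans up */
        v->items = p;
        v->cap = cap;
    }
    v->items[v->len++] = item;
    return 0;
}
```

Doubling the capacity keeps pushes amortized O(1), so the overall parse stays linear in input size.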
If you want to provide accessors like json_get_string(root, "/user/name"), you must define ownership clearly. That function should return a borrowed pointer to the internal string; the caller must not free it and must not use it after json_free. Document this explicitly and include it in the header. This avoids extra allocations while still keeping the main parse API safe.
Finally, consider ABI stability. Exposing struct json_value in the header means any change to the struct layout breaks ABI. For a stable library, you should make json_value opaque and provide accessor functions. This design is more work but makes the API boundary strong. For this project, you can implement json_value as opaque in the public header and expose a small set of getters: json_type, json_get_string, json_get_number, json_array_len, json_array_get, json_object_get.
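A possible shape for the opaque public header, using the getters named above. The exact signatures are assumptions for this sketch, not a fixed API.

```c
/* jsonlite.h -- sketch of an opaque public API surface. */
#ifndef JSONLITE_H
#define JSONLITE_H

#include <stddef.h>

typedef struct json_value json_value;   /* opaque: layout lives in the .c file */

typedef struct {
    int line, col, code;
    char message[128];
} json_error;

json_value *json_parse(const char *text, json_error *err);
void        json_free(json_value *v);

int         json_type(const json_value *v);
const char *json_get_string(const json_value *v);  /* borrowed: do not free */
double      json_get_number(const json_value *v);
size_t      json_array_len(const json_value *v);
json_value *json_array_get(const json_value *v, size_t i);
json_value *json_object_get(const json_value *v, const char *key);

#endif /* JSONLITE_H */
```

Because callers never see `struct json_value`'s layout, you can reorganize the internals (tagged union, arena-backed, whatever) without breaking ABI.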
How This Fits in This Project
Ownership and tree representation are central to Sec. 3.5 (data structures), Sec. 3.7 (outcome), and Sec. 6.2 (memory tests). The same ownership discipline is used in Project 2’s client API. Also used in: Project 2.
Definitions & Key Terms
- JSON tree -> In-memory representation of parsed JSON.
- Tagged union -> Struct with a type tag and union for variant data.
- Borrowed pointer -> Pointer to internal data; caller must not free.
- Opaque type -> Struct hidden from header to stabilize ABI.
Mental Model Diagram (ASCII)
root (object)
|-- "user" -> object
| |-- "name" -> string "Alice"
| `-- "age" -> number 42
`-- "flags" -> array [true, false]
How It Works (Step-by-Step)
- Parser allocates `json_value` nodes for each value.
- Strings are copied into owned buffers.
- Arrays/objects grow dynamically as items are parsed.
- `json_parse` returns the root node; caller owns it.
- `json_free` walks the tree and frees all allocations.
Minimal Concrete Example
json_error err = {0};
json_value *root = json_parse(text, &err);
if (!root) {
    fprintf(stderr, "%s\n", err.message);
    return 1;
}
// use root ...
json_free(root); // frees entire tree, exactly once
Common Misconceptions
- “Returning a pointer means caller owns everything.” -> Not if you return borrowed internals.
- “Omitting `json_free` is okay for short programs.” -> It creates bad habits and leaks.
- “Exposing struct fields is harmless.” -> It breaks ABI on change.
Check-Your-Understanding Questions
- Why is an opaque `json_value` safer than exposing the struct?
- What happens if you store string slices into the input buffer?
- How do you handle allocation failure mid-parse?
Check-Your-Understanding Answers
- It prevents ABI breakage and hides internals.
- The tree becomes invalid when the input buffer is freed.
- Free all previously allocated nodes and return an error.
Real-World Applications
- Parsers in embedded systems that need strict ownership rules.
- JSON libraries that provide accessors rather than exposed structs.
Where You’ll Apply It
- In this project: Sec. 3.5 (data structures), Sec. 5.10 Phase 2 (parser output).
- Also used in: Project 2.
References
- “Effective C” - Ch. 6
- “C Interfaces and Implementations” - Ch. 3
Key Insight
A JSON parser is only as safe as its ownership model.
Summary
A robust JSON library returns an owned tree, provides explicit json_free, and uses accessors for safety. This keeps the API stable and prevents leaks.
Homework/Exercises to Practice the Concept
- Implement `json_free` for a tagged union tree.
- Add an arena allocator and free it in one call.
- Design accessor functions and make the struct opaque.
Solutions to the Homework/Exercises
- Recursively free children, then free node.
- Track allocations in a pool and free at the end.
- Expose only functions; hide the struct in the `.c` file.
2.3 Error Reporting, Limits, and Defensive Parsing
Fundamentals
A parser must report errors precisely and fail safely. This includes line/column reporting, specific error codes, and strict limits on input size and nesting depth. Without limits, a small input can cause a stack overflow or memory exhaustion. A defensive parser validates all inputs and refuses to parse beyond its configured limits. This is part of the boundary contract: you guarantee safe behavior even with hostile input.
Deep Dive into the Concept
Error reporting in parsers is a UX feature. When parsing fails, you should provide the line and column where the error occurred, a short message, and optionally an error code. This makes debugging input issues easy. In C, you can define a json_error struct containing line, col, code, and message. The lexer updates line and column as it reads, and the parser copies those into the error struct on failure.
Limits are a security boundary. If you parse arbitrary input without depth limits, a malicious user can send deeply nested arrays to crash your program via stack overflow. Similarly, if you do not cap string length or total node count, a large JSON file can exhaust memory. You should define reasonable defaults (e.g., max depth 256, max nodes 1,000,000, max string length 1MB) and allow the caller to override them. These limits should be enforced in the parser: before recursing, check depth; before allocating, check node count; before copying a string, check length.
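The three checks can hang off a small limits struct holding the defaults suggested above; the field names here are assumptions, and the parser would consult these helpers before recursing, before allocating a node, and before copying a string.

```c
#include <stdbool.h>
#include <stddef.h>

/* Caller-overridable limits (names are illustrative). */
typedef struct {
    int    max_depth;    /* nesting depth before recursion is refused */
    size_t max_nodes;    /* total value count across the tree */
    size_t max_string;   /* bytes per decoded string */
} json_limits;

static const json_limits JSON_DEFAULT_LIMITS = { 256, 1000000, 1 << 20 };

/* Enforcement points: before recursing / allocating / copying. */
static bool depth_ok(const json_limits *l, int depth)     { return depth <= l->max_depth; }
static bool nodes_ok(const json_limits *l, size_t nodes)  { return nodes < l->max_nodes; }
static bool string_ok(const json_limits *l, size_t bytes) { return bytes <= l->max_string; }
```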
Another defensive concern is numeric parsing. If you parse numbers into double, you must detect overflow and special cases. Use strtod and check errno for ERANGE. If a number is too large, return a parse error with a clear message (“number out of range”). Similarly, you should reject invalid escapes in strings (e.g., incomplete \uXXXX). These validation steps are part of the contract: callers can rely on you to reject malformed input.
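Escape validation for `\uXXXX` reduces to requiring exactly four hex digits (helper name is illustrative). Checking digits one at a time also means an incomplete escape at end of string fails safely without reading past the NUL terminator.

```c
#include <ctype.h>
#include <stdbool.h>

/* p points just past the "\u"; exactly four hex digits must follow.
 * A '\0' (or any non-hex byte) stops the scan, so "\u12" is rejected
 * without overreading the buffer. */
static bool valid_u_escape(const char *p) {
    for (int i = 0; i < 4; i++)
        if (!isxdigit((unsigned char)p[i])) return false;
    return true;
}
```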
Error recovery is optional. For a library, it’s usually better to fail fast rather than attempt to recover. Provide a single error per parse call and require the caller to fix the input. This simplifies the API and makes error semantics deterministic. Your parser should also guarantee that on error, no partially built tree is returned. The safest pattern is: if error occurs, free all allocated nodes and return NULL.
Finally, consider determinism. For testing, you should ensure error messages are deterministic and do not depend on undefined behavior. Store error strings in static buffers or allocate them within json_error. Do not include raw pointers or addresses. Deterministic errors make it easier to test and compare outputs.
How This Fits in This Project
This concept defines your json_error struct and parse limits in Sec. 3.5 and Sec. 3.7, and drives your test cases in Sec. 6.2. These defensive practices are also necessary for Project 5’s HTTP parser. Also used in: Project 5.
Definitions & Key Terms
- Parse error -> Failure to match JSON grammar.
- Line/column -> Exact location of error in input.
- Depth limit -> Maximum allowed nesting.
- Fail-closed -> On error, return NULL and free partial state.
Mental Model Diagram (ASCII)
input -> lexer (line/col) -> parser (depth,count) -> ok tree
                                |
                                v
                           error struct
How It Works (Step-by-Step)
- Initialize `json_error` with zeros.
- Lexer tracks line/col; parser tracks depth and node count.
- On error, set `err->line`, `err->col`, and `err->message`.
- Free partial tree and return NULL.
- Caller prints `err` and exits.
Minimal Concrete Example
typedef struct {
int line;
int col;
int code;
char message[128];
} json_error;
Common Misconceptions
- “Depth limits are unnecessary for small inputs.” -> Attackers can craft tiny but deep inputs.
- “Partial trees are useful on error.” -> They complicate ownership and recovery.
- “Line/col tracking is optional.” -> It’s essential for usability.
Check-Your-Understanding Questions
- Why is a depth limit important?
- What should happen to partially parsed data on error?
- How should you report numeric overflow?
Check-Your-Understanding Answers
- It prevents stack overflow and denial-of-service.
- It should be freed; return NULL.
- Set an error on `ERANGE` and report “number out of range.”
Real-World Applications
- JSON parsers in browsers and APIs.
- Configuration file parsers with strict validation.
Where You’ll Apply It
- In this project: Sec. 3.7 (error model), Sec. 6.2 (invalid tests).
- Also used in: Project 5.
References
- RFC 8259
- “Secure Coding in C and C++” - input validation chapters
Key Insight
A parser boundary is a security boundary; strict validation is non-negotiable.
Summary
Precise errors and strict limits make your parser safe, predictable, and testable. They also protect callers from hostile input.
Homework/Exercises to Practice the Concept
- Add a depth limit and test with deeply nested input.
- Add string length limit and test with oversized strings.
- Implement a deterministic error message format.
Solutions to the Homework/Exercises
- Track depth and return error when exceeded.
- Reject strings longer than configured max.
- Use fixed templates like “line X col Y: message”.
3. Project Specification
3.1 What You Will Build
A JSON parser library libjsonlite that parses text into an opaque JSON value tree and provides accessors and error reporting.
3.2 Functional Requirements
- Parse: `json_parse(const char*, json_error*)` returns root or NULL.
- Free: `json_free(json_value*)` releases all memory.
- Accessors: Functions to access object keys and array items.
- Errors: Line/column + message on failure.
- Limits: Configurable depth and size limits.
3.3 Non-Functional Requirements
- Performance: Parse 1 MB of JSON in under 100 ms on a typical machine.
- Reliability: No leaks or crashes on invalid input.
- Usability: Clear, deterministic error messages.
3.4 Example Usage / Output
OK: parsed 12 nodes
root.type = object
root["user"].name = "Alice"
3.5 Data Formats / Schemas / Protocols
JSON value types:
NULL | BOOL | NUMBER | STRING | ARRAY | OBJECT
3.6 Edge Cases
- Deeply nested arrays.
- Invalid escape sequences.
- Numbers out of range.
- Trailing characters after root.
3.7 Real World Outcome
You can parse JSON files, inspect fields, and get clear error messages with exact locations.
3.7.1 How to Run (Copy/Paste)
make
./json-parse ./sample.json
3.7.2 Golden Path Demo (Deterministic)
Use a fixed sample.json file:
{"user":{"name":"Alice","age":42},"flags":[true,false]}
3.7.3 If CLI: Exact Terminal Transcript
$ ./json-parse ./sample.json
OK: parsed 7 nodes
root.type = object
root["user"].name = "Alice"
$ echo $?
0
Failure demo (invalid JSON):
$ ./json-parse ./broken.json
ERROR line 1 col 28: expected '}' but found ']'
$ echo $?
2
Exit Codes:
- `0` success
- `2` parse error
- `3` file read error
4. Solution Architecture
4.1 High-Level Design
input text -> lexer -> tokens -> parser -> json_value tree
                                   |
                                   v
                               json_error
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Lexer | Tokenize input | Track line/col and escapes |
| Parser | Build tree | Recursive descent with depth limit |
| JSON tree | Store values | Owned strings, opaque type |
4.3 Data Structures (No Full Code)
typedef struct json_value json_value; // opaque
4.4 Algorithm Overview
Key Algorithm: parse_value
- Switch on current token type.
- For objects/arrays, recurse with depth+1.
- Construct node, push into parent.
- On error, free partial tree and return NULL.
Complexity Analysis:
- Time: O(n) where n is input length.
- Space: O(m) where m is number of nodes.
5. Implementation Guide
5.1 Development Environment Setup
cc --version
make --version
5.2 Project Structure
libjsonlite/
|-- include/
| `-- jsonlite.h
|-- src/
| |-- lexer.c
| |-- parser.c
| |-- value.c
| `-- error.c
|-- tools/
| `-- json-parse.c
`-- Makefile
5.3 The Core Question You’re Answering
“How do you design a parser interface that is safe for hostile input?”
5.4 Concepts You Must Understand First
- JSON grammar and recursive parsing.
- Ownership and memory management in trees.
- Error models with line/col and strict limits.
5.5 Questions to Guide Your Design
- Will you store numbers as double or as strings?
- Will you copy strings or store slices?
- What default limits will you enforce?
5.6 Thinking Exercise
Write three invalid JSON strings that should trigger different errors. Predict line/col for each.
5.7 The Interview Questions They’ll Ask
- How do you prevent stack overflow in recursive parsers?
- Why use opaque types for `json_value`?
- How do you report parse errors precisely?
5.8 Hints in Layers
Hint 1: Start with a lexer
typedef enum { TOK_LBRACE, TOK_RBRACE, TOK_LBRACKET, TOK_RBRACKET, TOK_COLON, TOK_COMMA, TOK_STRING, TOK_NUMBER, TOK_TRUE, TOK_FALSE, TOK_NULL, TOK_EOF } tok_t;
Hint 2: Track depth
if (depth > MAX_DEPTH) return set_error(err, "depth limit");
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Parsing | “C Interfaces and Implementations” | Ch. 3 |
| Memory | “Effective C” | Ch. 6 |
| Defensive coding | “Code Complete” | Ch. 8 |
5.10 Implementation Phases
Phase 1: Lexer (3-4 days)
Goals: tokens with line/col. Tasks: implement string/number parsing and errors. Checkpoint: lexer passes token tests.
Phase 2: Parser (4-5 days)
Goals: build tree from tokens. Tasks: parse objects/arrays, enforce depth limit. Checkpoint: parse valid JSON into tree.
Phase 3: Hardening (2-3 days)
Goals: errors and limits. Tasks: add node count and size limits, tests. Checkpoint: invalid inputs yield precise errors.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|———-|———|—————-|———–|
| Number representation | double vs string | string + optional double | Preserve precision |
| String storage | borrowed vs owned | owned | Safe lifetime |
| API exposure | struct vs opaque | opaque | ABI stability |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Lexer tests | Token correctness | escapes, numbers |
| Parser tests | Valid JSON | arrays, objects |
| Error tests | Invalid inputs | missing brace |
6.2 Critical Test Cases
- Deep nesting: exceeds max depth -> error.
- Invalid escape: `"\u12"` -> error.
- Trailing data: `{} extra` -> error.
6.3 Test Data
Valid: {"a":1,"b":[true,false]}
Invalid (trailing comma, missing '}'): {"a":[1,2,]
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Not tracking line/col | Unhelpful errors | Update counters in lexer |
| Returning partial trees | Leaks or invalid state | Free on error |
| Accepting invalid JSON | Hidden bugs | Follow RFC strictly |
7.2 Debugging Strategies
- Print tokens as they are generated.
- Add assertions for depth and node counts.
7.3 Performance Traps
- Excessive `malloc` calls per token; consider arena allocation.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add pretty-printer to output formatted JSON.
- Add a `json_get_path` helper for `/a/b/c` paths.
8.2 Intermediate Extensions
- Add streaming parser for large files.
- Add incremental parsing API.
8.3 Advanced Extensions
- Add JSON schema validation.
- Add SIMD-accelerated lexer.
9. Real-World Connections
9.1 Industry Applications
- Config parsers in embedded systems.
- JSON parsing in API clients.
9.2 Related Open Source Projects
- cJSON - small JSON parser library.
- jsmn - minimalist JSON parser.
9.3 Interview Relevance
- Parsing algorithms and error handling.
- Memory management for tree structures.
10. Resources
10.1 Essential Reading
- RFC 8259
- “C Interfaces and Implementations” - Ch. 3
10.2 Video Resources
- “Parsing Basics” - university lecture (searchable title)
10.3 Tools & Documentation
man strtod
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can describe JSON grammar rules.
- I can explain ownership of the JSON tree.
- I can explain how line/col errors are tracked.
11.2 Implementation
- Parser handles valid JSON.
- Invalid JSON produces deterministic errors.
- No memory leaks in `json_free`.
11.3 Growth
- I can explain parser design in an interview.
- I documented limits and error codes.
12. Submission / Completion Criteria
Minimum Viable Completion:
- `json_parse` and `json_free` implemented.
- Error struct with line/col.
- Handles invalid input safely.
Full Completion:
- Accessor API for arrays/objects.
- Limits for depth and size.
- CLI tool `json-parse`.
Excellence (Going Above & Beyond):
- Streaming parser.
- Schema validation.