Project 3: JSON Parser Library
Build a safe, RFC-8259-compatible JSON parser in C with explicit ownership and precise error reporting.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | ~2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Zig |
| Coolness Level | Level 7 - parsing boundary mastery |
| Business Potential | Level 5 - embedded JSON parser |
| Prerequisites | Pointers, structs, recursion, error handling |
| Key Topics | Lexing/parsing, ownership, error models |
1. Learning Objectives
By completing this project, you will:
- Implement a JSON lexer and recursive-descent parser in C.
- Design a JSON value tree with safe ownership and cleanup.
- Build a precise error model with line/column reporting.
- Enforce safety limits to prevent stack overflow and OOM.
- Provide a stable, misuse-resistant parsing API.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Parsing & Grammar (Lexer -> Parser)
Fundamentals
Parsing is the process of turning raw text into structured data. JSON parsing typically uses a two-stage pipeline: a lexer (tokenizer) that converts characters into tokens like {, STRING, and NUMBER, and a parser that consumes tokens according to a grammar. A recursive-descent parser is straightforward for JSON because the grammar is simple and mostly unambiguous. The key boundary is that the parser must accept all valid JSON and reject invalid input with useful errors. In C, this requires careful character handling, explicit state management, and defensive checks for end-of-input.
Deep Dive into the Concept
JSON is defined by RFC 8259. The grammar is small but precise: an object is { members }, members are key/value pairs separated by commas, arrays are [ values ], strings allow escapes, numbers have strict format rules, and the only literals are true, false, and null. A correct parser must implement these rules, including edge cases like leading zeros, exponent notation, and escape sequences (\uXXXX). Many naive parsers accept invalid JSON such as trailing commas or unescaped control characters; your parser must decide whether to be strict (recommended) or permissive (risky). For a boundary-focused project, strict is safer: it forces callers to send valid data and surfaces errors early.
A lexer reads characters and emits tokens with associated data (e.g., string value, number text). It should track line and column positions for error reporting. Each time it consumes a character, it updates position: \n increments line and resets column; others increment column. The lexer must also detect invalid escapes and control characters inside strings. A common pitfall is forgetting that JSON strings may contain escaped Unicode sequences. You can either decode \uXXXX into UTF-8 or store the raw escape sequence and decode later; both are valid as long as the contract is clear. For simplicity, you can decode into UTF-8 in the lexer and store real bytes in the value tree.
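The position-tracking rule above can be sketched as a small consume helper. This is a sketch, not a fixed API; the `lexer` layout and the `lx_next` name are illustrative.

```c
#include <stddef.h>

typedef struct {
    const char *src;   /* input text, NUL-terminated */
    size_t pos;        /* index of the next byte to read */
    int line;          /* 1-based line of the next byte */
    int col;           /* 1-based column of the next byte */
} lexer;

/* Consume one character, updating line/col; returns '\0' at end of input. */
static char lx_next(lexer *lx) {
    char c = lx->src[lx->pos];
    if (c == '\0') return '\0';        /* do not advance past the end */
    lx->pos++;
    if (c == '\n') { lx->line++; lx->col = 1; }  /* newline: next line, column 1 */
    else           { lx->col++; }                /* any other byte: next column */
    return c;
}
```

Every token then records the line/col at which it started, so errors can point at the exact location.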
A recursive-descent parser processes tokens with functions like parse_value, parse_object, and parse_array. Each function checks current token, consumes it, and returns a json_value*. Depth control is critical: JSON allows nesting, and a malicious input like [[[[... can cause stack overflow. Track recursion depth and enforce a maximum (e.g., 256). This is part of the security boundary.
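The depth guard works by threading a counter through every recursive call. This toy version parses only bracket nesting so the guard stays visible; a real `parse_array`/`parse_object` would apply the same `depth + 1` pattern.

```c
#include <stdbool.h>

#define MAX_DEPTH 256

/* Toy recursive descent over '[' ... ']' nesting only: each nested value
 * re-enters with depth + 1, and the parse fails closed once MAX_DEPTH
 * is exceeded, instead of overflowing the call stack. */
static bool parse_nested(const char **p, int depth) {
    if (depth > MAX_DEPTH) return false;     /* refuse hostile deep input */
    if (**p == '[') {
        (*p)++;                              /* consume '[' */
        if (!parse_nested(p, depth + 1)) return false;
        if (**p != ']') return false;        /* unmatched bracket */
        (*p)++;                              /* consume ']' */
        return true;
    }
    return **p == '\0' || **p == ']';        /* empty value position is fine here */
}
```

An input of 300 nested brackets is rejected at depth 257 without ever touching deeper stack frames.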
Numbers require careful parsing. JSON does not allow NaN or Infinity. It allows integers and floating-point with optional exponent. You can parse numbers as double, but this can lose precision for large integers. An alternative is to store the number as a string and let the caller decide. A pragmatic API can store both: raw string plus a parsed double, with a flag indicating precision risk. The design choice must be documented because it affects how callers interpret the data.
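A hedged sketch of the "raw string plus parsed double plus precision flag" design described above; the type, field, and function names are assumptions, and the lossy check is a conservative heuristic (any magnitude at or beyond 2^53 is flagged as at risk).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* One way to store both representations (names are illustrative). */
typedef struct {
    char raw[64];      /* the number's original JSON text (truncated if huge) */
    double approx;     /* best-effort double value */
    bool lossy;        /* true if the double may not round-trip exactly */
} json_number;

static bool number_from_text(const char *text, json_number *out) {
    errno = 0;
    char *end;
    double d = strtod(text, &end);
    if (end == text || errno == ERANGE) return false;  /* malformed or overflow */
    snprintf(out->raw, sizeof out->raw, "%s", text);
    out->approx = d;
    double mag = d < 0 ? -d : d;
    /* Conservative: integers at or beyond 2^53 cannot all round-trip. */
    out->lossy = mag >= 9007199254740992.0;  /* 2^53 */
    return true;
}
```

Callers that care about exactness can fall back to `raw`; everyone else reads `approx`.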
Parsing must also handle input termination and trailing data. After parsing the root value, the parser should skip whitespace and then ensure there are no extra characters. If extra data exists, return an error like “trailing characters after root value.” This prevents accepting concatenated JSON messages inadvertently, which can be a security issue in some contexts.
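The trailing-data rule reduces to a tiny check run after the root value has been parsed (the helper name is illustrative). Note that JSON whitespace per RFC 8259 is only space, tab, CR, and LF, not everything `isspace()` accepts.

```c
#include <stdbool.h>

/* After the root value, only JSON whitespace may remain; anything else is
 * "trailing characters after root value". */
static bool only_trailing_ws(const char *p) {
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n') p++;
    return *p == '\0';
}
```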
How This Fits in This Project
The lexer/parser pipeline is the core of this project and drives your data structures (Sec. 3.5), error model (Sec. 3.7), and test strategy (Sec. 6.2). It also teaches boundary validation needed for Project 5’s HTTP parsing. Also used in: Project 5.
Definitions & Key Terms
- Lexer/Tokenizer -> Converts characters to tokens.
- Recursive descent -> Parser using recursive functions per grammar rule.
- Token -> Classified unit like STRING or NUMBER.
- Grammar -> Rules defining valid JSON structures.
Mental Model Diagram (ASCII)
text --> [lexer] --> tokens --> [parser] --> json_value tree
         (line/col)                          (owned)
How It Works (Step-by-Step)
- Lexer reads characters and emits tokens.
- Parser calls `parse_value` based on the current token.
- Objects/arrays recurse into nested values.
- Parser checks for EOF and trailing data.
- On error, parser returns NULL and sets `json_error`.
Minimal Concrete Example
json_value* parse_value(lexer *lx, json_error *err) {
    switch (lx->tok.type) {
    case TOK_LBRACE:   return parse_object(lx, err);
    case TOK_LBRACKET: return parse_array(lx, err);
    case TOK_STRING:   return make_string(lx->tok.str);
    case TOK_NUMBER:   return make_number(lx->tok.num);
    case TOK_TRUE:     return make_bool(1);
    case TOK_FALSE:    return make_bool(0);
    case TOK_NULL:     return make_null();
    default:           return set_error(err, "expected value"); /* returns NULL */
    }
}
Common Misconceptions
- “Whitespace doesn’t matter, so ignore it completely.” -> You must track newlines for line/col.
- “JSON numbers are just `double`.” -> Large integers can lose precision.
- “Trailing commas are okay.” -> Not valid in strict JSON.
Check-Your-Understanding Questions
- Why do you need a lexer for JSON parsing?
- What is the risk of unbounded recursion?
- Why should you reject trailing characters?
Check-Your-Understanding Answers
- It simplifies parsing and improves error reporting.
- Deep nesting can cause stack overflow.
- It can hide concatenated or malicious input.
Real-World Applications
- Configuration file parsers.
- API response decoding.
- Embedded device configuration tools.
Where You’ll Apply It
- In this project: Sec. 5.10 Phase 2 (parser), Sec. 6.2 (invalid input tests).
- Also used in: Project 5.
References
- RFC 8259 (JSON specification)
- “C Interfaces and Implementations” - Ch. 3
Key Insight
Parsing is boundary enforcement: every accepted character is a security decision.
Summary
A correct JSON parser requires a lexer, a recursive-descent parser, and strict adherence to the grammar. The boundary is defined by what you accept and what you reject.
Homework/Exercises to Practice the Concept
- Write a lexer that tokenizes `{`, `}`, `[`, `]`, strings, numbers, and literals.
- Add line/column tracking and verify errors point to the correct location.
- Enforce a maximum nesting depth.
Solutions to the Homework/Exercises
- Use a `switch` on the current character and emit tokens.
- Update line/col counters on every consumed char.
- Pass a depth counter into parse functions and reject when exceeded.
2.2 Memory Representation & Ownership of JSON Trees
Fundamentals
A JSON parser returns a tree of values: objects, arrays, strings, numbers, booleans, and null. In C, you must explicitly allocate and free this tree. Ownership rules determine who frees the tree and how strings are stored. A good API returns a single root pointer that the caller owns and must free with a dedicated json_free() function. Strings can be stored as owned copies or as slices into the input buffer. Owned copies are safer because they decouple the tree’s lifetime from the input string.
Deep Dive into the Concept
A JSON tree is a heterogeneous structure. A common representation is a tagged union:
typedef enum { JSON_NULL, JSON_BOOL, JSON_NUMBER, JSON_STRING, JSON_ARRAY, JSON_OBJECT } json_type;
typedef struct json_value {
json_type type;
union {
double num;
char *str;
struct { struct json_value **items; size_t len; } array;
struct { char **keys; struct json_value **values; size_t len; } object;
int boolean;
} u;
} json_value;
This structure requires careful ownership management. If json_parse returns a json_value*, the caller must be able to free the entire tree without knowing its internals. That means you must implement json_free recursively: free children, free arrays of pointers, free strings, then free the node. The contract should be explicit: the caller owns the tree and must call json_free exactly once. The parser owns nothing after returning.
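A recursive `json_free` for that tagged union might look like this; the type definitions are repeated so the sketch is self-contained.

```c
#include <stdlib.h>

/* Tagged union repeated from above so this sketch compiles standalone. */
typedef enum { JSON_NULL, JSON_BOOL, JSON_NUMBER, JSON_STRING, JSON_ARRAY, JSON_OBJECT } json_type;
typedef struct json_value {
    json_type type;
    union {
        double num;
        char *str;
        struct { struct json_value **items; size_t len; } array;
        struct { char **keys; struct json_value **values; size_t len; } object;
        int boolean;
    } u;
} json_value;

/* Free children first, then the node's own buffers, then the node. */
void json_free(json_value *v) {
    if (!v) return;                          /* NULL is a no-op, like free() */
    switch (v->type) {
    case JSON_STRING:
        free(v->u.str);
        break;
    case JSON_ARRAY:
        for (size_t i = 0; i < v->u.array.len; i++)
            json_free(v->u.array.items[i]);
        free(v->u.array.items);
        break;
    case JSON_OBJECT:
        for (size_t i = 0; i < v->u.object.len; i++) {
            free(v->u.object.keys[i]);
            json_free(v->u.object.values[i]);
        }
        free(v->u.object.keys);
        free(v->u.object.values);
        break;
    default:                                 /* null, bool, number own nothing */
        break;
    }
    free(v);
}
```

The order matters: children before containers, containers before the node itself, and each allocation freed exactly once.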
String ownership is a key choice. If you store slices into the input buffer, you avoid allocations and speed parsing, but you force the caller to keep the original input alive. This is easy to misuse. For a boundary-safety project, storing owned copies is the better default. It also makes the tree portable and allows the caller to free the input immediately. The downside is more allocations; you can mitigate this by allocating from a region/arena and freeing the entire arena at once when json_free is called.
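A minimal arena along those lines: every parser allocation is recorded, and the whole tree is released in one pass. The names (`arena_alloc`, `arena_strdup`, `arena_free_all`) are illustrative, not a fixed API.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    void **ptrs;        /* every allocation handed out so far */
    size_t len, cap;
} arena;

/* Allocate n bytes and remember the pointer for bulk cleanup. */
static void *arena_alloc(arena *a, size_t n) {
    if (a->len == a->cap) {
        size_t cap = a->cap ? a->cap * 2 : 16;
        void **p = realloc(a->ptrs, cap * sizeof *p);
        if (!p) return NULL;
        a->ptrs = p;
        a->cap = cap;
    }
    void *mem = malloc(n);
    if (mem) a->ptrs[a->len++] = mem;
    return mem;
}

/* Owned string copy, tracked by the arena. */
static char *arena_strdup(arena *a, const char *s) {
    char *copy = arena_alloc(a, strlen(s) + 1);
    if (copy) strcpy(copy, s);
    return copy;
}

/* Release everything the parser allocated, in one call. */
static void arena_free_all(arena *a) {
    for (size_t i = 0; i < a->len; i++) free(a->ptrs[i]);
    free(a->ptrs);
    a->ptrs = NULL;
    a->len = a->cap = 0;
}
```

With this design, `json_free` can simply call `arena_free_all`, and the error path (free everything built so far) becomes the same one-liner.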
Arrays and objects require dynamic resizing. You can push items into a growable vector. This means you must handle allocation failures gracefully: if you cannot allocate the next element, you must free everything built so far and return a parse error. This is where ownership and error models intersect. A clean approach is to centralize allocations and to have a single cleanup function called on error.
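The growable vector with fail-closed allocation can be sketched as follows (names are illustrative). On failure the vector is left intact so the caller can walk it, free everything built so far, and return a parse error.

```c
#include <stdlib.h>

/* Growable pointer vector for array/object members. */
typedef struct {
    void **items;
    size_t len, cap;
} vec;

/* Push one item; returns 0 on success, -1 on allocation failure. */
static int vec_push(vec *v, void *item) {
    if (v->len == v->cap) {
        size_t cap = v->cap ? v->cap * 2 : 8;
        void **p = realloc(v->items, cap * sizeof *p);
        if (!p) return -1;   /* old buffer still valid: caller cleans up */
        v->items = p;
        v->cap = cap;
    }
    v->items[v->len++] = item;
    return 0;
}
```

Doubling the capacity keeps pushes amortized O(1), so the overall parse stays linear in input size.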
If you want to provide accessors like json_get_string(root, "/user/name"), you must define ownership clearly. That function should return a borrowed pointer to the internal string; the caller must not free it and must not use it after json_free. Document this explicitly and include it in the header. This avoids extra allocations while still keeping the main parse API safe.
Finally, consider ABI stability. Exposing struct json_value in the header means any change to the struct layout breaks ABI. For a stable library, you should make json_value opaque and provide accessor functions. This design is more work but makes the API boundary strong. For this project, you can implement json_value as opaque in the public header and expose a small set of getters: json_type, json_get_string, json_get_number, json_array_len, json_array_get, json_object_get.
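A possible shape for the opaque public header, using the getters named above. The exact signatures are assumptions for this sketch, not a fixed API.

```c
/* jsonlite.h -- sketch of an opaque public API surface. */
#ifndef JSONLITE_H
#define JSONLITE_H

#include <stddef.h>

typedef struct json_value json_value;   /* opaque: layout lives in the .c file */

typedef struct {
    int line, col, code;
    char message[128];
} json_error;

json_value *json_parse(const char *text, json_error *err);
void        json_free(json_value *v);

int         json_type(const json_value *v);
const char *json_get_string(const json_value *v);  /* borrowed: do not free */
double      json_get_number(const json_value *v);
size_t      json_array_len(const json_value *v);
json_value *json_array_get(const json_value *v, size_t i);
json_value *json_object_get(const json_value *v, const char *key);

#endif /* JSONLITE_H */
```

Because callers never see `struct json_value`'s layout, you can reorganize the internals (tagged union, arena-backed, whatever) without breaking ABI.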
How This Fits in This Project
Ownership and tree representation are central to Sec. 3.5 (data structures), Sec. 3.7 (outcome), and Sec. 6.2 (memory tests). The same ownership discipline is used in Project 2’s client API. Also used in: Project 2.
Definitions & Key Terms
- JSON tree -> In-memory representation of parsed JSON.
- Tagged union -> Struct with a type tag and union for variant data.
- Borrowed pointer -> Pointer to internal data; caller must not free.
- Opaque type -> Struct hidden from header to stabilize ABI.
Mental Model Diagram (ASCII)
root (object)
|-- "user" -> object
| |-- "name" -> string "Alice"
| `-- "age" -> number 42
`-- "flags" -> array [true, false]
How It Works (Step-by-Step)
- Parser allocates `json_value` nodes for each value.
- Strings are copied into owned buffers.
- Arrays/objects grow dynamically as items are parsed.
- `json_parse` returns the root node; caller owns it.
- `json_free` walks the tree and frees all allocations.
Minimal Concrete Example
json_error err = {0};
json_value *root = json_parse(text, &err);
if (!root) {
    fprintf(stderr, "%s\n", err.message);
    return 1;
}
// use root ...
json_free(root); // frees entire tree, exactly once
Common Misconceptions
- “Returning a pointer means caller owns everything.” -> Not if you return borrowed internals.
- “Omitting `json_free` is okay for short programs.” -> It creates bad habits and leaks.
- “Exposing struct fields is harmless.” -> It breaks ABI on change.
Check-Your-Understanding Questions
- Why is an opaque `json_value` safer than exposing the struct?
- What happens if you store string slices into the input buffer?
- How do you handle allocation failure mid-parse?
Check-Your-Understanding Answers
- It prevents ABI breakage and hides internals.
- The tree becomes invalid when the input buffer is freed.
- Free all previously allocated nodes and return an error.
Real-World Applications
- Parsers in embedded systems that need strict ownership rules.
- JSON libraries that provide accessors rather than exposed structs.
Where You’ll Apply It
- In this project: Sec. 3.5 (data structures), Sec. 5.10 Phase 2 (parser output).
- Also used in: Project 2.
References
- “Effective C” - Ch. 6
- “C Interfaces and Implementations” - Ch. 3
Key Insight
A JSON parser is only as safe as its ownership model.
Summary
A robust JSON library returns an owned tree, provides explicit json_free, and uses accessors for safety. This keeps the API stable and prevents leaks.
Homework/Exercises to Practice the Concept
- Implement `json_free` for a tagged union tree.
- Add an arena allocator and free it in one call.
- Design accessor functions and make the struct opaque.
Solutions to the Homework/Exercises
- Recursively free children, then free node.
- Track allocations in a pool and free at the end.
- Expose only functions; hide the struct in the `.c` file.
2.3 Error Reporting, Limits, and Defensive Parsing
Fundamentals
A parser must report errors precisely and fail safely. This includes line/column reporting, specific error codes, and strict limits on input size and nesting depth. Without limits, a small input can cause a stack overflow or memory exhaustion. A defensive parser validates all inputs and refuses to parse beyond its configured limits. This is part of the boundary contract: you guarantee safe behavior even with hostile input.
Deep Dive into the Concept
Error reporting in parsers is a UX feature. When parsing fails, you should provide the line and column where the error occurred, a short message, and optionally an error code. This makes debugging input issues easy. In C, you can define a json_error struct containing line, col, code, and message. The lexer updates line and column as it reads, and the parser copies those into the error struct on failure.
Limits are a security boundary. If you parse arbitrary input without depth limits, a malicious user can send deeply nested arrays to crash your program via stack overflow. Similarly, if you do not cap string length or total node count, a large JSON file can exhaust memory. You should define reasonable defaults (e.g., max depth 256, max nodes 1,000,000, max string length 1MB) and allow the caller to override them. These limits should be enforced in the parser: before recursing, check depth; before allocating, check node count; before copying a string, check length.
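The three checks can hang off a small limits struct holding the defaults suggested above; the field names here are assumptions, and the parser would consult these helpers before recursing, before allocating a node, and before copying a string.

```c
#include <stdbool.h>
#include <stddef.h>

/* Caller-overridable limits (names are illustrative). */
typedef struct {
    int    max_depth;    /* nesting depth before recursion is refused */
    size_t max_nodes;    /* total value count across the tree */
    size_t max_string;   /* bytes per decoded string */
} json_limits;

static const json_limits JSON_DEFAULT_LIMITS = { 256, 1000000, 1 << 20 };

/* Enforcement points: before recursing / allocating / copying. */
static bool depth_ok(const json_limits *l, int depth)     { return depth <= l->max_depth; }
static bool nodes_ok(const json_limits *l, size_t nodes)  { return nodes < l->max_nodes; }
static bool string_ok(const json_limits *l, size_t bytes) { return bytes <= l->max_string; }
```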
Another defensive concern is numeric parsing. If you parse numbers into double, you must detect overflow and special cases. Use strtod and check errno for ERANGE. If a number is too large, return a parse error with a clear message (“number out of range”). Similarly, you should reject invalid escapes in strings (e.g., incomplete \uXXXX). These validation steps are part of the contract: callers can rely on you to reject malformed input.
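Escape validation for `\uXXXX` reduces to requiring exactly four hex digits (helper name is illustrative). Checking digits one at a time also means an incomplete escape at end of string fails safely without reading past the NUL terminator.

```c
#include <ctype.h>
#include <stdbool.h>

/* p points just past the "\u"; exactly four hex digits must follow.
 * A '\0' (or any non-hex byte) stops the scan, so "\u12" is rejected
 * without overreading the buffer. */
static bool valid_u_escape(const char *p) {
    for (int i = 0; i < 4; i++)
        if (!isxdigit((unsigned char)p[i])) return false;
    return true;
}
```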
Error recovery is optional. For a library, it’s usually better to fail fast rather than attempt to recover. Provide a single error per parse call and require the caller to fix the input. This simplifies the API and makes error semantics deterministic. Your parser should also guarantee that on error, no partially built tree is returned. The safest pattern is: if error occurs, free all allocated nodes and return NULL.
Finally, consider determinism. For testing, you should ensure error messages are deterministic and do not depend on undefined behavior. Store error strings in static buffers or allocate them within json_error. Do not include raw pointers or addresses. Deterministic errors make it easier to test and compare outputs.
How This Fits in This Project
This concept defines your json_error struct and parse limits in Sec. 3.5 and Sec. 3.7, and drives your test cases in Sec. 6.2. These defensive practices are also necessary for Project 5’s HTTP parser. Also used in: Project 5.
Definitions & Key Terms
- Parse error -> Failure to match JSON grammar.
- Line/column -> Exact location of error in input.
- Depth limit -> Maximum allowed nesting.
- Fail-closed -> On error, return NULL and free partial state.
Mental Model Diagram (ASCII)
input -> lexer (line/col) -> parser (depth,count) -> ok tree
                                |
                                v
                           error struct
How It Works (Step-by-Step)
- Initialize `json_error` with zeros.
- Lexer tracks line/col; parser tracks depth and node count.
- On error, set `err->line`, `err->col`, and `err->message`.
- Free partial tree and return NULL.
- Caller prints `err` and exits.
Minimal Concrete Example
typedef struct {
int line;
int col;
int code;
char message[128];
} json_error;
Common Misconceptions
- “Depth limits are unnecessary for small inputs.” -> Attackers can craft tiny but deep inputs.
- “Partial trees are useful on error.” -> They complicate ownership and recovery.
- “Line/col tracking is optional.” -> It’s essential for usability.
Check-Your-Understanding Questions
- Why is a depth limit important?
- What should happen to partially parsed data on error?
- How should you report numeric overflow?
Check-Your-Understanding Answers
- It prevents stack overflow and denial-of-service.
- It should be freed; return NULL.
- Set an error on `ERANGE` and report “number out of range.”
Real-World Applications
- JSON parsers in browsers and APIs.
- Configuration file parsers with strict validation.
Where You’ll Apply It
- In this project: Sec. 3.7 (error model), Sec. 6.2 (invalid tests).
- Also used in: Project 5.
References
- RFC 8259
- “Secure Coding in C and C++” - input validation chapters
Key Insight
A parser boundary is a security boundary; strict validation is non-negotiable.
Summary
Precise errors and strict limits make your parser safe, predictable, and testable. They also protect callers from hostile input.
Homework/Exercises to Practice the Concept
- Add a depth limit and test with deeply nested input.
- Add string length limit and test with oversized strings.
- Implement a deterministic error message format.
Solutions to the Homework/Exercises
- Track depth and return error when exceeded.
- Reject strings longer than configured max.
- Use fixed templates like “line X col Y: message”.
3. Project Specification
3.1 What You Will Build
A JSON parser library libjsonlite that parses text into an opaque JSON value tree and provides accessors and error reporting.
3.2 Functional Requirements
- Parse: `json_parse(const char*, json_error*)` returns root or NULL.
- Free: `json_free(json_value*)` releases all memory.
- Accessors: Functions to access object keys and array items.
- Errors: Line/column + message on failure.
- Limits: Configurable depth and size limits.
3.3 Non-Functional Requirements
- Performance: Parse 1 MB of JSON in under 100 ms on a typical machine.
- Reliability: No leaks or crashes on invalid input.
- Usability: Clear, deterministic error messages.
3.4 Example Usage / Output
OK: parsed 12 nodes
root.type = object
root["user"].name = "Alice"
3.5 Data Formats / Schemas / Protocols
JSON value types:
NULL | BOOL | NUMBER | STRING | ARRAY | OBJECT
3.6 Edge Cases
- Deeply nested arrays.
- Invalid escape sequences.
- Numbers out of range.
- Trailing characters after root.
3.7 Real World Outcome
You can parse JSON files, inspect fields, and get clear error messages with exact locations.
3.7.1 How to Run (Copy/Paste)
make
./json-parse ./sample.json
3.7.2 Golden Path Demo (Deterministic)
Use a fixed sample.json file:
{"user":{"name":"Alice","age":42},"flags":[true,false]}
3.7.3 If CLI: Exact Terminal Transcript
$ ./json-parse ./sample.json
OK: parsed 7 nodes
root.type = object
root["user"].name = "Alice"
$ echo $?
0
Failure demo (invalid JSON):
$ ./json-parse ./broken.json
ERROR line 1 col 28: expected '}' but found ']'
$ echo $?
2
Exit Codes:
- `0` success
- `2` parse error
- `3` file read error
4. Solution Architecture
4.1 High-Level Design
input text -> lexer -> tokens -> parser -> json_value tree
                                   |
                                   v
                               json_error
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Lexer | Tokenize input | Track line/col and escapes |
| Parser | Build tree | Recursive descent with depth limit |
| JSON tree | Store values | Owned strings, opaque type |
4.3 Data Structures (No Full Code)
typedef struct json_value json_value; // opaque
4.4 Algorithm Overview
Key Algorithm: parse_value
- Switch on current token type.
- For objects/arrays, recurse with depth+1.
- Construct node, push into parent.
- On error, free partial tree and return NULL.
Complexity Analysis:
- Time: O(n) where n is input length.
- Space: O(m) where m is number of nodes.
5. Implementation Guide
5.1 Development Environment Setup
cc --version
make --version
5.2 Project Structure
libjsonlite/
|-- include/
| `-- jsonlite.h
|-- src/
| |-- lexer.c
| |-- parser.c
| |-- value.c
| `-- error.c
|-- tools/
| `-- json-parse.c
`-- Makefile
5.3 The Core Question You’re Answering
“How do you design a parser interface that is safe for hostile input?”
5.4 Concepts You Must Understand First
- JSON grammar and recursive parsing.
- Ownership and memory management in trees.
- Error models with line/col and strict limits.
5.5 Questions to Guide Your Design
- Will you store numbers as double or as strings?
- Will you copy strings or store slices?
- What default limits will you enforce?
5.6 Thinking Exercise
Write three invalid JSON strings that should trigger different errors. Predict line/col for each.
5.7 The Interview Questions They’ll Ask
- How do you prevent stack overflow in recursive parsers?
- Why use opaque types for `json_value`?
- How do you report parse errors precisely?
5.8 Hints in Layers
Hint 1: Start with a lexer
typedef enum { TOK_LBRACE, TOK_RBRACE, TOK_LBRACKET, TOK_RBRACKET, TOK_COLON, TOK_COMMA, TOK_STRING, TOK_NUMBER, TOK_TRUE, TOK_FALSE, TOK_NULL, TOK_EOF } tok_t;
Hint 2: Track depth
if (depth > MAX_DEPTH) return set_error(err, "depth limit");
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Parsing | “C Interfaces and Implementations” | Ch. 3 |
| Memory | “Effective C” | Ch. 6 |
| Defensive coding | “Code Complete” | Ch. 8 |
5.10 Implementation Phases
Phase 1: Lexer (3-4 days)
Goals: tokens with line/col. Tasks: implement string/number parsing and errors. Checkpoint: lexer passes token tests.
Phase 2: Parser (4-5 days)
Goals: build tree from tokens. Tasks: parse objects/arrays, enforce depth limit. Checkpoint: parse valid JSON into tree.
Phase 3: Hardening (2-3 days)
Goals: errors and limits. Tasks: add node count and size limits, tests. Checkpoint: invalid inputs yield precise errors.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|———-|———|—————-|———–|
| Number representation | double vs string | string + optional double | Preserve precision |
| String storage | borrowed vs owned | owned | Safe lifetime |
| API exposure | struct vs opaque | opaque | ABI stability |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Lexer tests | Token correctness | escapes, numbers |
| Parser tests | Valid JSON | arrays, objects |
| Error tests | Invalid inputs | missing brace |
6.2 Critical Test Cases
- Deep nesting: exceeds max depth -> error.
- Invalid escape: `"\u12"` -> error.
- Trailing data: `{} extra` -> error.
6.3 Test Data
Valid: {"a":1,"b":[true,false]}
Invalid (trailing comma, missing '}'): {"a":[1,2,]
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Not tracking line/col | Unhelpful errors | Update counters in lexer |
| Returning partial trees | Leaks or invalid state | Free on error |
| Accepting invalid JSON | Hidden bugs | Follow RFC strictly |
7.2 Debugging Strategies
- Print tokens as they are generated.
- Add assertions for depth and node counts.
7.3 Performance Traps
- Excessive `malloc` calls per token; consider arena allocation.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add pretty-printer to output formatted JSON.
- Add a `json_get_path` helper for `/a/b/c` paths.
8.2 Intermediate Extensions
- Add streaming parser for large files.
- Add incremental parsing API.
8.3 Advanced Extensions
- Add JSON schema validation.
- Add SIMD-accelerated lexer.
9. Real-World Connections
9.1 Industry Applications
- Config parsers in embedded systems.
- JSON parsing in API clients.
9.2 Related Open Source Projects
- cJSON - small JSON parser library.
- jsmn - minimalist JSON parser.
9.3 Interview Relevance
- Parsing algorithms and error handling.
- Memory management for tree structures.
10. Resources
10.1 Essential Reading
- RFC 8259
- “C Interfaces and Implementations” - Ch. 3
10.2 Video Resources
- “Parsing Basics” - university lecture (searchable title)
10.3 Tools & Documentation
man strtod
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can describe JSON grammar rules.
- I can explain ownership of the JSON tree.
- I can explain how line/col errors are tracked.
11.2 Implementation
- Parser handles valid JSON.
- Invalid JSON produces deterministic errors.
- No memory leaks in `json_free`.
11.3 Growth
- I can explain parser design in an interview.
- I documented limits and error codes.
12. Submission / Completion Criteria
Minimum Viable Completion:
- `json_parse` and `json_free` implemented.
- Error struct with line/col.
- Handles invalid input safely.
Full Completion:
- Accessor API for arrays/objects.
- Limits for depth and size.
- CLI tool `json-parse`.
Excellence (Going Above & Beyond):
- Streaming parser.
- Schema validation.