Project 6: JSON Parser Library

Build a complete JSON parser with DOM representation, std::variant for type safety, and pretty-printing to master parsing and data structures.

Quick Reference

Attribute Value
Difficulty Intermediate
Time Estimate 2-3 weeks
Language C++
Prerequisites std::variant, recursive data structures, string manipulation
Key Topics Lexical analysis, recursive descent parsing, std::variant, Unicode handling

1. Learning Objectives

By completing this project, you will:

  • Master tokenization: Convert raw text into a stream of typed tokens
  • Implement recursive descent parsing: Build hierarchical structures from flat token streams
  • Use std::variant effectively: Model algebraic data types for JSON values
  • Handle Unicode escapes: Parse \uXXXX sequences correctly
  • Design clean APIs: Create an intuitive interface for JSON manipulation
  • Implement serialization: Convert DOM back to formatted JSON text
  • Write robust error handling: Provide clear error messages with line/column info

2. Theoretical Foundation

2.1 Core Concepts

JSON Grammar (simplified):

value   := object | array | string | number | "true" | "false" | "null"
object  := '{' (pair (',' pair)*)? '}'
pair    := string ':' value
array   := '[' (value (',' value)*)? ']'
string  := '"' characters '"'
number  := integer fraction? exponent?

The Parser Pipeline:

                    ┌─────────────────────────────────────────────────────┐
                    │                  Parser Pipeline                     │
                    │                                                      │
  Raw JSON String   │   Tokenizer        Parser           DOM             │
  ──────────────────┼──────────────────────────────────────────────────────│
                    │                                                      │
  {"name":"Alice"}  │  ┌──────────┐    ┌──────────┐    ┌──────────────┐   │
  ──────────────────┼─►│ Tokenize │───►│  Parse   │───►│  JSON Tree   │   │
                    │  └──────────┘    └──────────┘    └──────────────┘   │
                    │       │               │                │            │
                    │       ▼               ▼                ▼            │
                    │  [{, "name",     Recursive        Object {         │
                    │   :, "Alice",    descent           "name":         │
                    │   }]             builds tree        String("Alice") │
                    │                                    }                │
                    └─────────────────────────────────────────────────────┘

Representing JSON Values with std::variant:

// JSON can be one of these types:
using JsonValue = std::variant<
    std::nullptr_t,                    // null
    bool,                              // true, false
    double,                            // numbers (JSON has no int type)
    std::string,                       // strings
    std::vector<Json>,                 // arrays
    std::map<std::string, Json>        // objects
>;

// The recursive structure:
┌─────────────────────────────────────────────────────┐
                    Json class                        
                                                      
  JsonValue value_ ◄─┬─ nullptr_t (null)             
                     ├─ bool (true/false)            
                     ├─ double (number)              
                     ├─ string (string)              
                     ├─ vector<Json> (array)         
                     └─ map<string, Json> (object)   
                                                     
                            └──► Contains more Json   
                                 objects (recursive)  
└─────────────────────────────────────────────────────┘

Tokenizer State Machine:

                         ┌─────────────────┐
                         │     START       │
                         └────────┬────────┘
                                  │
           ┌──────────────────────┼──────────────────────┐
           │          │           │          │           │
           ▼          ▼           ▼          ▼           ▼
      ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
      │ STRING │ │ NUMBER │ │KEYWORD │ │ PUNCT  │ │  EOF   │
      │   "    │ │ 0-9,-  │ │t,f,n   │ │{,},[,],│ │        │
      └────────┘ └────────┘ └────────┘ └────────┘ └────────┘
           │          │           │          │
           ▼          ▼           ▼          ▼
    Read until  Parse with  Match full  Single char
    closing "   sign, frac, keyword     token
    Handle \    exponent

2.2 Why This Matters

JSON is the lingua franca of data exchange:

  • APIs: REST, GraphQL responses
  • Configuration: package.json, tsconfig.json
  • Data storage: MongoDB, CouchDB documents
  • Logging: Structured log formats

Building a JSON parser teaches:

  • How parsers in general work (useful for DSLs, config files)
  • How to handle recursive/nested data structures
  • How type systems model heterogeneous data
  • How to design library APIs

2.3 Historical Context

JSON (JavaScript Object Notation) was formalized by Douglas Crockford in 2001, derived from JavaScript object literal syntax. Its simplicity (only 6 data types, minimal syntax) made it quickly eclipse XML for web APIs.

Key milestones:

  • 2001: Douglas Crockford formalizes JSON
  • 2006: RFC 4627 published
  • 2013: ECMA-404 standard
  • 2017: RFC 8259 (current standard)

Popular C++ JSON libraries:

  • nlohmann/json: Header-only, intuitive API, widely used
  • RapidJSON: Performance-focused, SAX and DOM
  • simdjson: SIMD-accelerated, fastest parser available

2.4 Common Misconceptions

Misconception 1: “JSON numbers are integers or floats” Reality: JSON has only one number type. The spec says “number” with no distinction. Libraries typically use double which handles both.

Misconception 2: “JSON strings are just ASCII” Reality: JSON strings are Unicode (typically UTF-8). You must handle \uXXXX escape sequences and potentially surrogate pairs for characters outside BMP.

Misconception 3: “Object key order doesn’t matter” Reality: The JSON spec says order is undefined, but many applications depend on it. Consider using std::map (ordered) or std::vector<pair> if order matters.

Misconception 4: “Trailing commas are allowed” Reality: [1, 2, 3,] is invalid JSON. Your parser must reject it.


3. Project Specification

3.1 What You Will Build

A JSON library that:

  • Parses JSON text into a DOM (Document Object Model)
  • Supports all JSON types (null, bool, number, string, array, object)
  • Allows querying and modifying the DOM
  • Serializes back to JSON with optional pretty-printing
  • Provides clear error messages for invalid JSON

3.2 Functional Requirements

Requirement Description
FR-1 Json::parse(string) parses JSON text
FR-2 operator[] accesses object keys and array indices
FR-3 as_string(), as_int(), as_bool() extract typed values
FR-4 dump() serializes to JSON string
FR-5 dump(indent) serializes with pretty-printing
FR-6 Handle all JSON escape sequences including \uXXXX
FR-7 Throw JsonParseError with line/column on invalid input

3.3 Non-Functional Requirements

Requirement Description
NFR-1 Parse 1MB JSON file in < 100ms
NFR-2 Memory usage proportional to DOM size
NFR-3 No undefined behavior on invalid input
NFR-4 Compiles with -Wall -Wextra -Werror

3.4 Example Usage / Output

#include "json.hpp"
#include <iostream>

int main() {
    // Parse JSON string
    std::string json_text = R"({
        "name": "Alice",
        "age": 30,
        "active": true,
        "balance": null,
        "hobbies": ["reading", "coding", "hiking"],
        "address": {
            "city": "Seattle",
            "zip": "98101"
        }
    })";

    Json json = Json::parse(json_text);

    // Access values
    std::cout << json["name"].as_string() << std::endl;     // "Alice"
    std::cout << json["age"].as_int() << std::endl;         // 30
    std::cout << json["active"].as_bool() << std::endl;     // true
    std::cout << json["hobbies"][0].as_string() << std::endl;  // "reading"
    std::cout << json["address"]["city"].as_string() << std::endl;  // "Seattle"

    // Check types
    std::cout << json["balance"].is_null() << std::endl;    // true
    std::cout << json["hobbies"].is_array() << std::endl;   // true

    // Modify
    json["age"] = 31;
    json["hobbies"].push_back("swimming");
    json["address"]["state"] = "WA";

    // Serialize with pretty-printing (2-space indent)
    std::cout << json.dump(2) << std::endl;

    // Error handling
    try {
        Json bad = Json::parse("{ invalid json }");
    } catch (const JsonParseError& e) {
        std::cerr << e.what() << std::endl;
        // "Parse error at line 1, column 3: Expected string key, got 'invalid'"
    }

    return 0;
}

Expected Output:

Alice
30
1
reading
Seattle
1
1
{
  "name": "Alice",
  "age": 31,
  "active": true,
  "balance": null,
  "hobbies": [
    "reading",
    "coding",
    "hiking",
    "swimming"
  ],
  "address": {
    "city": "Seattle",
    "zip": "98101",
    "state": "WA"
  }
}
Parse error at line 1, column 3: Expected string key, got 'invalid'

3.5 Real World Outcome

$ ./json_parser test_files/large.json
Parsed 1.2 MB JSON in 45ms
  Objects: 12,456
  Arrays: 3,891
  Strings: 98,234
  Numbers: 45,678
  Booleans: 2,345
  Nulls: 123

Validation: All values accessible
Round-trip: dump(parse(input)) == input (whitespace-normalized)
Memory usage: 4.8 MB (4x file size - typical for DOM)

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────────┐
│                          JSON Library                                │
│                                                                      │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────────────┐ │
│  │    Token       │  │   Tokenizer    │  │      Json Class        │ │
│  │                │  │                │  │                        │ │
│  │  Type (enum)   │  │  input string  │  │  variant<...> value_   │ │
│  │  value/text    │  │  position      │  │                        │ │
│  │  line, column  │  │  line, column  │  │  parse()               │ │
│  │                │  │                │  │  dump()                │ │
│  │                │  │  nextToken()   │  │  operator[]            │ │
│  │                │  │  peek()        │  │  as_string/int/bool    │ │
│  └────────────────┘  └────────────────┘  └────────────────────────┘ │
│                                                                      │
│  ┌────────────────┐  ┌────────────────────────────────────────────┐ │
│  │ JsonParseError │  │               Parser                       │ │
│  │                │  │                                            │ │
│  │  message       │  │  parseValue()  parseObject()  parseArray() │ │
│  │  line, column  │  │  parseString() parseNumber()               │ │
│  └────────────────┘  └────────────────────────────────────────────┘ │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

4.2 Key Components

Component Responsibility
Token Represents a single lexical unit (string, number, punctuation)
Tokenizer Converts input string to token stream
Parser Builds Json tree from token stream
Json DOM node with type-safe access methods
JsonParseError Exception with position information

4.3 Data Structures

Token Types:

enum class TokenType {
    LEFT_BRACE,     // {
    RIGHT_BRACE,    // }
    LEFT_BRACKET,   // [
    RIGHT_BRACKET,  // ]
    COLON,          // :
    COMMA,          // ,
    STRING,         // "..."
    NUMBER,         // 123, 1.5e10
    TRUE,           // true
    FALSE,          // false
    NULL_LITERAL,   // null
    END_OF_FILE
};

Json Value Type:

class Json;  // Forward declaration

using JsonObject = std::map<std::string, Json>;
using JsonArray = std::vector<Json>;
using JsonValue = std::variant<
    std::nullptr_t,
    bool,
    double,
    std::string,
    JsonArray,
    JsonObject
>;

4.4 Algorithm Overview

Tokenizer Algorithm:

while not at end:
    skip whitespace
    c = current char

    if c == '{': emit LEFT_BRACE
    if c == '}': emit RIGHT_BRACE
    if c == '[': emit LEFT_BRACKET
    if c == ']': emit RIGHT_BRACKET
    if c == ':': emit COLON
    if c == ',': emit COMMA
    if c == '"': parse_string()
    if c is digit or '-': parse_number()
    if c == 't': expect "true", emit TRUE
    if c == 'f': expect "false", emit FALSE
    if c == 'n': expect "null", emit NULL_LITERAL
    else: error

emit END_OF_FILE

Parser Algorithm (Recursive Descent):

parseValue():
    token = peek()
    switch token.type:
        case LEFT_BRACE:  return parseObject()
        case LEFT_BRACKET: return parseArray()
        case STRING:      advance(); return Json(token.string_value)
        case NUMBER:      advance(); return Json(token.number_value)
        case TRUE:        advance(); return Json(true)
        case FALSE:       advance(); return Json(false)
        case NULL:        advance(); return Json(nullptr)
        default:          error("unexpected token")

parseObject():
    expect(LEFT_BRACE)
    object = {}
    if peek() != RIGHT_BRACE:
        key = expect(STRING)
        expect(COLON)
        value = parseValue()
        object[key] = value

        while peek() == COMMA:
            advance()  // consume comma
            key = expect(STRING)
            expect(COLON)
            value = parseValue()
            object[key] = value

    expect(RIGHT_BRACE)
    return Json(object)

parseArray():
    expect(LEFT_BRACKET)
    array = []
    if peek() != RIGHT_BRACKET:
        array.push_back(parseValue())

        while peek() == COMMA:
            advance()  // consume comma
            array.push_back(parseValue())

    expect(RIGHT_BRACKET)
    return Json(array)

5. Implementation Guide

5.1 Development Environment Setup

Required:

  • C++17 or later (for std::variant, std::string_view)
  • Compiler: GCC 7+, Clang 5+, MSVC 2017+

Recommended:

# Compile with all warnings
g++ -std=c++17 -Wall -Wextra -Werror -O2 json.cpp -o json

# For debugging
g++ -std=c++17 -g -fsanitize=address,undefined json.cpp -o json

5.2 Project Structure

json_parser/
├── include/
│   ├── json.hpp           # Main header (can be header-only)
│   ├── tokenizer.hpp      # Tokenizer class
│   └── parser.hpp         # Parser class
├── src/
│   ├── json.cpp           # Implementation (if not header-only)
│   └── main.cpp           # CLI tool
├── tests/
│   ├── test_tokenizer.cpp
│   ├── test_parser.cpp
│   ├── test_roundtrip.cpp
│   └── test_files/        # JSON test files
│       ├── valid/
│       └── invalid/
├── CMakeLists.txt
└── README.md

5.3 The Core Question You’re Answering

“How do I transform unstructured text into a structured, queryable, modifiable data structure?”

This is the fundamental problem of parsing. JSON is an ideal learning vehicle because:

  • Grammar is simple and well-defined
  • Real-world relevance is high
  • Errors are easy to test (many invalid JSON examples available)

5.4 Concepts You Must Understand First

  1. What is std::variant and how do you access values in it?
    • std::get<T>(), std::get_if<T>(), std::visit()
    • Reference: cppreference std::variant
  2. What is recursive descent parsing?
    • One function per grammar rule, functions call each other
    • Reference: “Language Implementation Patterns” Ch. 2
  3. How do you handle Unicode in C++?
    • UTF-8 encoding, \uXXXX escape sequences
    • Surrogate pairs for characters outside BMP
  4. What’s the difference between lexical analysis and parsing?
    • Tokenizer: characters -> tokens
    • Parser: tokens -> tree
  5. How do you report errors with position information?
    • Track line and column during tokenization
    • Include in exception message

5.5 Questions to Guide Your Design

API Design:

  • Should operator[] on a non-object throw or return null?
  • Should the parser accept trailing commas (lenient mode)?
  • How to handle duplicate keys in objects?

Implementation:

  • Should tokenizer be lazy (generate on demand) or eager (all tokens at once)?
  • How to handle very large numbers (beyond double precision)?
  • Should string parsing allocate incrementally or compute size first?

Performance:

  • Is std::variant’s overhead acceptable?
  • Should small strings use SSO (small string optimization)?
  • Is recursive descent okay for deeply nested JSON?

5.6 Thinking Exercise

Parse this JSON by hand, noting each token and the parser state:

{"items": [1, {"nested": true}], "count": 2}

Create a trace table: | Step | Token | Parser State | Action | |——|——-|————–|——–| | 1 | { | START | Begin object | | 2 | "items" | OBJECT_KEY | Store key | | 3 | : | OBJECT_COLON | Expect value | | 4 | [ | ARRAY_START | Begin array (nested) | | … | … | … | … |

5.7 Hints in Layers

Hint 1 - Starting Point (Conceptual): Build the tokenizer first, completely separate from parsing. Test it independently. Then build the parser assuming you have a working token stream.

Hint 2 - Next Level (More Specific): For the tokenizer, the tricky parts are:

  • String escape sequences (especially \uXXXX)
  • Number parsing (sign, fraction, exponent)
  • Tracking line/column through whitespace and newlines

For the parser, the tricky part is:

  • Making Json class both hold values and be recursive (contain vector)

Hint 3 - Technical Details (Approach):

String parsing with escape handling:

std::string parseString() {
    std::string result;
    advance();  // consume opening "

    while (current() != '"') {
        if (current() == '\\') {
            advance();
            switch (current()) {
                case '"':  result += '"'; break;
                case '\\': result += '\\'; break;
                case '/':  result += '/'; break;
                case 'b':  result += '\b'; break;
                case 'f':  result += '\f'; break;
                case 'n':  result += '\n'; break;
                case 'r':  result += '\r'; break;
                case 't':  result += '\t'; break;
                case 'u':  result += parseUnicodeEscape(); break;
                default:   error("invalid escape");
            }
        } else {
            result += current();
        }
        advance();
    }
    advance();  // consume closing "
    return result;
}

Hint 4 - Tools and Debugging:

  • Use the JSON Test Suite: https://github.com/nst/JSONTestSuite
  • Test with y_*.json (valid), n_*.json (invalid), i_*.json (implementation-defined)
  • Use AddressSanitizer to catch buffer overflows in string parsing

5.8 The Interview Questions They’ll Ask

  1. “Why use std::variant instead of inheritance for JSON values?”
    • Variant is value semantics (no heap allocation for small types), no vtable overhead, exhaustive matching with std::visit
  2. “How do you handle JSON numbers that don’t fit in a double?”
    • Options: Use string storage for large numbers, throw error, or use arbitrary precision library
  3. “What’s the time complexity of your parser?”
    • O(n) for well-formed input where n is input size
  4. “How would you make the parser faster?”
    • SIMD for whitespace skipping, avoid std::string copies with string_view, custom allocator
  5. “How do you handle invalid UTF-8 in strings?”
    • Options: reject, replace with replacement character, pass through
  6. “What’s the difference between SAX and DOM parsing?”
    • DOM builds full tree in memory. SAX streams events (start object, key, value, end object). SAX uses less memory for large files.

5.9 Books That Will Help

Topic Book & Chapter
Parsing fundamentals “Language Implementation Patterns” Ch. 2-3 - Terence Parr
std::variant “C++17 - The Complete Guide” Ch. 16 - Nicolai Josuttis
String handling “The C++ Programming Language” Ch. 36 - Bjarne Stroustrup
Error handling “C++ Primer Plus” Ch. 15 - Stephen Prata
Unicode “Programming with Unicode” - Victor Stinner (free online)

5.10 Implementation Phases

Phase 1: Tokenizer (Day 1-4)

  • Implement Token class with type enum
  • Implement Tokenizer that handles all token types
  • Handle string escape sequences (except \u)
  • Track line and column numbers
  • Test with simple JSON strings

Phase 2: Basic Parser (Day 5-8)

  • Implement Json class with std::variant
  • Implement parseValue, parseObject, parseArray
  • Handle all primitive types
  • Test with nested structures

Phase 3: Full Parsing (Day 9-11)

  • Add \uXXXX Unicode escape handling
  • Add comprehensive error messages
  • Handle edge cases (empty objects, deeply nested)
  • Validate against JSON Test Suite

Phase 4: Serialization (Day 12-14)

  • Implement dump() for compact output
  • Implement dump(indent) for pretty-printing
  • Ensure round-trip: parse(dump(json)) == json
  • Handle special characters in strings

5.11 Key Implementation Decisions

Decision Recommended Choice Reasoning
Number type double JSON spec uses floating point, double handles integers up to 2^53
Object type std::map<string, Json> Ordered iteration, O(log n) lookup acceptable
Array type std::vector<Json> Standard choice, efficient iteration
String type std::string UTF-8 encoded, standard choice

6. Testing Strategy

Test Categories:

// Valid JSON parsing
TEST(JsonParser, ParsesEmptyObject) {
    auto json = Json::parse("{}");
    EXPECT_TRUE(json.is_object());
    EXPECT_EQ(json.size(), 0);
}

TEST(JsonParser, ParsesNestedStructure) {
    auto json = Json::parse(R"({"a": {"b": [1, 2, 3]}})");
    EXPECT_EQ(json["a"]["b"][1].as_int(), 2);
}

// Invalid JSON rejection
TEST(JsonParser, RejectsTrailingComma) {
    EXPECT_THROW(Json::parse("[1, 2,]"), JsonParseError);
}

TEST(JsonParser, RejectsSingleQuotes) {
    EXPECT_THROW(Json::parse("{'key': 'value'}"), JsonParseError);
}

// Round-trip
TEST(JsonParser, RoundTrip) {
    std::string input = R"({"a":1,"b":"hello","c":[1,2,3]})";
    auto json = Json::parse(input);
    auto output = json.dump();
    auto reparsed = Json::parse(output);
    EXPECT_EQ(json, reparsed);
}

7. Common Pitfalls & Debugging

Problem Symptom Root Cause Fix
Off-by-one in strings Missing last char or crash Not handling closing quote correctly Check boundary conditions
Unicode decode error Garbage characters Not converting \uXXXX to UTF-8 Implement proper UTF-8 encoding
Stack overflow Crash on deep nesting Recursive descent with no limit Add depth limit or use iterative parser
Number precision loss 9007199254740993 becomes 9007199254740992 double can’t represent all integers Document limitation or use string for big ints
Memory leak Growing memory Recursive variant without proper destructor Ensure proper move/copy semantics

8. Extensions & Challenges

  1. SAX Parser: Implement event-based parsing for large files
  2. JSON Pointer: Implement RFC 6901 for path-based access (/foo/bar/0)
  3. JSON Patch: Implement RFC 6902 for diff/patch operations
  4. Schema Validation: Validate against JSON Schema
  5. Comments Support: Allow // and /* */ comments (non-standard but useful)
  6. SIMD Optimization: Use SSE/AVX for faster whitespace skipping

9. Real-World Connections

  • nlohmann/json: The most popular C++ JSON library, similar design
  • RapidJSON: Performance-focused, uses different architecture
  • simdjson: Uses SIMD for 2-4x faster parsing
  • jq: Command-line JSON processor, good for testing

10. Resources


11. Self-Assessment Checklist

Before considering this project complete, verify:

  • Parses all valid JSON from test suite
  • Rejects all invalid JSON from test suite
  • Handles all escape sequences including \uXXXX
  • Error messages include line and column
  • Round-trip: parse(dump(x)) == x for all valid JSON
  • Pretty-printing produces correctly indented output
  • No memory leaks (Valgrind clean)
  • No undefined behavior (ASan/UBSan clean)
  • Performance: 1MB file in < 100ms

12. Submission / Completion Criteria

Your implementation is complete when:

  1. Passes JSON Test Suite: All y_.json parse, all n_.json reject
  2. API is intuitive: Similar to nlohmann/json interface
  3. Error messages are helpful: Include position and context
  4. Documentation complete: Header comments, examples
  5. Can explain: You can describe tokenizer and parser in an interview