Project 6: JSON Parser Library
Build a complete JSON parser with DOM representation, std::variant for type safety, and pretty-printing to master parsing and data structures.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 2-3 weeks |
| Language | C++ |
| Prerequisites | std::variant, recursive data structures, string manipulation |
| Key Topics | Lexical analysis, recursive descent parsing, std::variant, Unicode handling |
1. Learning Objectives
By completing this project, you will:
- Master tokenization: Convert raw text into a stream of typed tokens
- Implement recursive descent parsing: Build hierarchical structures from flat token streams
- Use std::variant effectively: Model algebraic data types for JSON values
- Handle Unicode escapes: Parse \uXXXX sequences correctly
- Design clean APIs: Create an intuitive interface for JSON manipulation
- Implement serialization: Convert DOM back to formatted JSON text
- Write robust error handling: Provide clear error messages with line/column info
2. Theoretical Foundation
2.1 Core Concepts
JSON Grammar (simplified):
value := object | array | string | number | "true" | "false" | "null"
object := '{' (pair (',' pair)*)? '}'
pair := string ':' value
array := '[' (value (',' value)*)? ']'
string := '"' characters '"'
number := integer fraction? exponent?
The Parser Pipeline:
┌─────────────────────────────────────────────────────┐
│ Parser Pipeline │
│ │
Raw JSON String │ Tokenizer Parser DOM │
──────────────────┼──────────────────────────────────────────────────────│
│ │
{"name":"Alice"} │ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
──────────────────┼─►│ Tokenize │───►│ Parse │───►│ JSON Tree │ │
│ └──────────┘ └──────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ [{, "name", Recursive Object { │
│ :, "Alice", descent "name": │
│ }] builds tree String("Alice") │
│ } │
└─────────────────────────────────────────────────────┘
Representing JSON Values with std::variant:
// JSON can be one of these types:
using JsonValue = std::variant<
std::nullptr_t, // null
bool, // true, false
double, // numbers (JSON has no int type)
std::string, // strings
std::vector<Json>, // arrays
std::map<std::string, Json> // objects
>;
// The recursive structure:
┌─────────────────────────────────────────────────────┐
│ Json class │
│ │
│ JsonValue value_ ◄─┬─ nullptr_t (null) │
│ ├─ bool (true/false) │
│ ├─ double (number) │
│ ├─ string (string) │
│ ├─ vector<Json> (array) │
│ └─ map<string, Json> (object) │
│ │ │
│ └──► Contains more Json │
│ objects (recursive) │
└─────────────────────────────────────────────────────┘
Tokenizer State Machine:
┌─────────────────┐
│ START │
└────────┬────────┘
│
┌──────────────────────┼──────────────────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ STRING │ │ NUMBER │ │KEYWORD │ │ PUNCT │ │ EOF │
│ " │ │ 0-9,- │ │t,f,n │ │{,},[,],│ │ │
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
│ │ │ │
▼ ▼ ▼ ▼
Read until Parse with Match full Single char
closing " sign, frac, keyword token
Handle \ exponent
2.2 Why This Matters
JSON is the lingua franca of data exchange:
- APIs: REST, GraphQL responses
- Configuration: package.json, tsconfig.json
- Data storage: MongoDB, CouchDB documents
- Logging: Structured log formats
Building a JSON parser teaches:
- How parsers in general work (useful for DSLs, config files)
- How to handle recursive/nested data structures
- How type systems model heterogeneous data
- How to design library APIs
2.3 Historical Context
JSON (JavaScript Object Notation) was formalized by Douglas Crockford in 2001, derived from JavaScript object literal syntax. Its simplicity (only 6 data types, minimal syntax) made it quickly eclipse XML for web APIs.
Key milestones:
- 2001: Douglas Crockford formalizes JSON
- 2006: RFC 4627 published
- 2013: ECMA-404 standard
- 2017: RFC 8259 (current standard)
Popular C++ JSON libraries:
- nlohmann/json: Header-only, intuitive API, widely used
- RapidJSON: Performance-focused, SAX and DOM
- simdjson: SIMD-accelerated, fastest parser available
2.4 Common Misconceptions
Misconception 1: “JSON numbers are integers or floats”
Reality: JSON has only one number type. The spec says “number” with no distinction. Libraries typically use double which handles both.
Misconception 2: “JSON strings are just ASCII” Reality: JSON strings are Unicode (typically UTF-8). You must handle \uXXXX escape sequences and potentially surrogate pairs for characters outside BMP.
Misconception 3: “Object key order doesn’t matter”
Reality: The JSON spec says order is undefined, but many applications depend on it. Consider using std::map (ordered) or std::vector<pair> if order matters.
Misconception 4: “Trailing commas are allowed”
Reality: [1, 2, 3,] is invalid JSON. Your parser must reject it.
3. Project Specification
3.1 What You Will Build
A JSON library that:
- Parses JSON text into a DOM (Document Object Model)
- Supports all JSON types (null, bool, number, string, array, object)
- Allows querying and modifying the DOM
- Serializes back to JSON with optional pretty-printing
- Provides clear error messages for invalid JSON
3.2 Functional Requirements
| Requirement | Description |
|---|---|
| FR-1 | Json::parse(string) parses JSON text |
| FR-2 | operator[] accesses object keys and array indices |
| FR-3 | as_string(), as_int(), as_bool() extract typed values |
| FR-4 | dump() serializes to JSON string |
| FR-5 | dump(indent) serializes with pretty-printing |
| FR-6 | Handle all JSON escape sequences including \uXXXX |
| FR-7 | Throw JsonParseError with line/column on invalid input |
3.3 Non-Functional Requirements
| Requirement | Description |
|---|---|
| NFR-1 | Parse 1MB JSON file in < 100ms |
| NFR-2 | Memory usage proportional to DOM size |
| NFR-3 | No undefined behavior on invalid input |
| NFR-4 | Compiles with -Wall -Wextra -Werror |
3.4 Example Usage / Output
#include "json.hpp"
#include <iostream>
int main() {
// Parse JSON string
std::string json_text = R"({
"name": "Alice",
"age": 30,
"active": true,
"balance": null,
"hobbies": ["reading", "coding", "hiking"],
"address": {
"city": "Seattle",
"zip": "98101"
}
})";
Json json = Json::parse(json_text);
// Access values
std::cout << json["name"].as_string() << std::endl; // "Alice"
std::cout << json["age"].as_int() << std::endl; // 30
std::cout << json["active"].as_bool() << std::endl; // true
std::cout << json["hobbies"][0].as_string() << std::endl; // "reading"
std::cout << json["address"]["city"].as_string() << std::endl; // "Seattle"
// Check types
std::cout << json["balance"].is_null() << std::endl; // true
std::cout << json["hobbies"].is_array() << std::endl; // true
// Modify
json["age"] = 31;
json["hobbies"].push_back("swimming");
json["address"]["state"] = "WA";
// Serialize with pretty-printing (2-space indent)
std::cout << json.dump(2) << std::endl;
// Error handling
try {
Json bad = Json::parse("{ invalid json }");
} catch (const JsonParseError& e) {
std::cerr << e.what() << std::endl;
// "Parse error at line 1, column 3: Expected string key, got 'invalid'"
}
return 0;
}
Expected Output:
Alice
30
1
reading
Seattle
1
1
{
"name": "Alice",
"age": 31,
"active": true,
"balance": null,
"hobbies": [
"reading",
"coding",
"hiking",
"swimming"
],
"address": {
"city": "Seattle",
"zip": "98101",
"state": "WA"
}
}
Parse error at line 1, column 3: Expected string key, got 'invalid'
3.5 Real World Outcome
$ ./json_parser test_files/large.json
Parsed 1.2 MB JSON in 45ms
Objects: 12,456
Arrays: 3,891
Strings: 98,234
Numbers: 45,678
Booleans: 2,345
Nulls: 123
Validation: All values accessible
Round-trip: dump(parse(input)) == input (whitespace-normalized)
Memory usage: 4.8 MB (4x file size - typical for DOM)
4. Solution Architecture
4.1 High-Level Design
┌─────────────────────────────────────────────────────────────────────┐
│ JSON Library │
│ │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────────────┐ │
│ │ Token │ │ Tokenizer │ │ Json Class │ │
│ │ │ │ │ │ │ │
│ │ Type (enum) │ │ input string │ │ variant<...> value_ │ │
│ │ value/text │ │ position │ │ │ │
│ │ line, column │ │ line, column │ │ parse() │ │
│ │ │ │ │ │ dump() │ │
│ │ │ │ nextToken() │ │ operator[] │ │
│ │ │ │ peek() │ │ as_string/int/bool │ │
│ └────────────────┘ └────────────────┘ └────────────────────────┘ │
│ │
│ ┌────────────────┐ ┌────────────────────────────────────────────┐ │
│ │ JsonParseError │ │ Parser │ │
│ │ │ │ │ │
│ │ message │ │ parseValue() parseObject() parseArray() │ │
│ │ line, column │ │ parseString() parseNumber() │ │
│ └────────────────┘ └────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
4.2 Key Components
| Component | Responsibility |
|---|---|
| Token | Represents a single lexical unit (string, number, punctuation) |
| Tokenizer | Converts input string to token stream |
| Parser | Builds Json tree from token stream |
| Json | DOM node with type-safe access methods |
| JsonParseError | Exception with position information |
4.3 Data Structures
Token Types:
enum class TokenType {
LEFT_BRACE, // {
RIGHT_BRACE, // }
LEFT_BRACKET, // [
RIGHT_BRACKET, // ]
COLON, // :
COMMA, // ,
STRING, // "..."
NUMBER, // 123, 1.5e10
TRUE, // true
FALSE, // false
NULL_LITERAL, // null
END_OF_FILE
};
Json Value Type:
class Json; // Forward declaration
using JsonObject = std::map<std::string, Json>;
using JsonArray = std::vector<Json>;
using JsonValue = std::variant<
std::nullptr_t,
bool,
double,
std::string,
JsonArray,
JsonObject
>;
4.4 Algorithm Overview
Tokenizer Algorithm:
while not at end:
skip whitespace
c = current char
if c == '{': emit LEFT_BRACE
if c == '}': emit RIGHT_BRACE
if c == '[': emit LEFT_BRACKET
if c == ']': emit RIGHT_BRACKET
if c == ':': emit COLON
if c == ',': emit COMMA
if c == '"': parse_string()
if c is digit or '-': parse_number()
if c == 't': expect "true", emit TRUE
if c == 'f': expect "false", emit FALSE
if c == 'n': expect "null", emit NULL_LITERAL
else: error
emit END_OF_FILE
Parser Algorithm (Recursive Descent):
parseValue():
token = peek()
switch token.type:
case LEFT_BRACE: return parseObject()
case LEFT_BRACKET: return parseArray()
case STRING: advance(); return Json(token.string_value)
case NUMBER: advance(); return Json(token.number_value)
case TRUE: advance(); return Json(true)
case FALSE: advance(); return Json(false)
case NULL: advance(); return Json(nullptr)
default: error("unexpected token")
parseObject():
expect(LEFT_BRACE)
object = {}
if peek() != RIGHT_BRACE:
key = expect(STRING)
expect(COLON)
value = parseValue()
object[key] = value
while peek() == COMMA:
advance() // consume comma
key = expect(STRING)
expect(COLON)
value = parseValue()
object[key] = value
expect(RIGHT_BRACE)
return Json(object)
parseArray():
expect(LEFT_BRACKET)
array = []
if peek() != RIGHT_BRACKET:
array.push_back(parseValue())
while peek() == COMMA:
advance() // consume comma
array.push_back(parseValue())
expect(RIGHT_BRACKET)
return Json(array)
5. Implementation Guide
5.1 Development Environment Setup
Required:
- C++17 or later (for std::variant, std::string_view)
- Compiler: GCC 7+, Clang 5+, MSVC 2017+
Recommended:
# Compile with all warnings
g++ -std=c++17 -Wall -Wextra -Werror -O2 json.cpp -o json
# For debugging
g++ -std=c++17 -g -fsanitize=address,undefined json.cpp -o json
5.2 Project Structure
json_parser/
├── include/
│ ├── json.hpp # Main header (can be header-only)
│ ├── tokenizer.hpp # Tokenizer class
│ └── parser.hpp # Parser class
├── src/
│ ├── json.cpp # Implementation (if not header-only)
│ └── main.cpp # CLI tool
├── tests/
│ ├── test_tokenizer.cpp
│ ├── test_parser.cpp
│ ├── test_roundtrip.cpp
│ └── test_files/ # JSON test files
│ ├── valid/
│ └── invalid/
├── CMakeLists.txt
└── README.md
5.3 The Core Question You’re Answering
“How do I transform unstructured text into a structured, queryable, modifiable data structure?”
This is the fundamental problem of parsing. JSON is an ideal learning vehicle because:
- Grammar is simple and well-defined
- Real-world relevance is high
- Errors are easy to test (many invalid JSON examples available)
5.4 Concepts You Must Understand First
- What is std::variant and how do you access values in it?
std::get<T>(),std::get_if<T>(),std::visit()- Reference: cppreference std::variant
- What is recursive descent parsing?
- One function per grammar rule, functions call each other
- Reference: “Language Implementation Patterns” Ch. 2
- How do you handle Unicode in C++?
- UTF-8 encoding, \uXXXX escape sequences
- Surrogate pairs for characters outside BMP
- What’s the difference between lexical analysis and parsing?
- Tokenizer: characters -> tokens
- Parser: tokens -> tree
- How do you report errors with position information?
- Track line and column during tokenization
- Include in exception message
5.5 Questions to Guide Your Design
API Design:
- Should
operator[]on a non-object throw or return null? - Should the parser accept trailing commas (lenient mode)?
- How to handle duplicate keys in objects?
Implementation:
- Should tokenizer be lazy (generate on demand) or eager (all tokens at once)?
- How to handle very large numbers (beyond double precision)?
- Should string parsing allocate incrementally or compute size first?
Performance:
- Is std::variant’s overhead acceptable?
- Should small strings use SSO (small string optimization)?
- Is recursive descent okay for deeply nested JSON?
5.6 Thinking Exercise
Parse this JSON by hand, noting each token and the parser state:
{"items": [1, {"nested": true}], "count": 2}
Create a trace table:
| Step | Token | Parser State | Action |
|——|——-|————–|——–|
| 1 | { | START | Begin object |
| 2 | "items" | OBJECT_KEY | Store key |
| 3 | : | OBJECT_COLON | Expect value |
| 4 | [ | ARRAY_START | Begin array (nested) |
| … | … | … | … |
5.7 Hints in Layers
Hint 1 - Starting Point (Conceptual): Build the tokenizer first, completely separate from parsing. Test it independently. Then build the parser assuming you have a working token stream.
Hint 2 - Next Level (More Specific): For the tokenizer, the tricky parts are:
- String escape sequences (especially \uXXXX)
- Number parsing (sign, fraction, exponent)
- Tracking line/column through whitespace and newlines
For the parser, the tricky part is:
- Making Json class both hold values and be recursive (contain vector
)
Hint 3 - Technical Details (Approach):
String parsing with escape handling:
std::string parseString() {
std::string result;
advance(); // consume opening "
while (current() != '"') {
if (current() == '\\') {
advance();
switch (current()) {
case '"': result += '"'; break;
case '\\': result += '\\'; break;
case '/': result += '/'; break;
case 'b': result += '\b'; break;
case 'f': result += '\f'; break;
case 'n': result += '\n'; break;
case 'r': result += '\r'; break;
case 't': result += '\t'; break;
case 'u': result += parseUnicodeEscape(); break;
default: error("invalid escape");
}
} else {
result += current();
}
advance();
}
advance(); // consume closing "
return result;
}
Hint 4 - Tools and Debugging:
- Use the JSON Test Suite: https://github.com/nst/JSONTestSuite
- Test with
y_*.json(valid),n_*.json(invalid),i_*.json(implementation-defined) - Use AddressSanitizer to catch buffer overflows in string parsing
5.8 The Interview Questions They’ll Ask
- “Why use std::variant instead of inheritance for JSON values?”
- Variant is value semantics (no heap allocation for small types), no vtable overhead, exhaustive matching with std::visit
- “How do you handle JSON numbers that don’t fit in a double?”
- Options: Use string storage for large numbers, throw error, or use arbitrary precision library
- “What’s the time complexity of your parser?”
- O(n) for well-formed input where n is input size
- “How would you make the parser faster?”
- SIMD for whitespace skipping, avoid std::string copies with string_view, custom allocator
- “How do you handle invalid UTF-8 in strings?”
- Options: reject, replace with replacement character, pass through
- “What’s the difference between SAX and DOM parsing?”
- DOM builds full tree in memory. SAX streams events (start object, key, value, end object). SAX uses less memory for large files.
5.9 Books That Will Help
| Topic | Book & Chapter |
|---|---|
| Parsing fundamentals | “Language Implementation Patterns” Ch. 2-3 - Terence Parr |
| std::variant | “C++17 - The Complete Guide” Ch. 16 - Nicolai Josuttis |
| String handling | “The C++ Programming Language” Ch. 36 - Bjarne Stroustrup |
| Error handling | “C++ Primer Plus” Ch. 15 - Stephen Prata |
| Unicode | “Programming with Unicode” - Victor Stinner (free online) |
5.10 Implementation Phases
Phase 1: Tokenizer (Day 1-4)
- Implement Token class with type enum
- Implement Tokenizer that handles all token types
- Handle string escape sequences (except \u)
- Track line and column numbers
- Test with simple JSON strings
Phase 2: Basic Parser (Day 5-8)
- Implement Json class with std::variant
- Implement parseValue, parseObject, parseArray
- Handle all primitive types
- Test with nested structures
Phase 3: Full Parsing (Day 9-11)
- Add \uXXXX Unicode escape handling
- Add comprehensive error messages
- Handle edge cases (empty objects, deeply nested)
- Validate against JSON Test Suite
Phase 4: Serialization (Day 12-14)
- Implement dump() for compact output
- Implement dump(indent) for pretty-printing
- Ensure round-trip: parse(dump(json)) == json
- Handle special characters in strings
5.11 Key Implementation Decisions
| Decision | Recommended Choice | Reasoning |
|---|---|---|
| Number type | double |
JSON spec uses floating point, double handles integers up to 2^53 |
| Object type | std::map<string, Json> |
Ordered iteration, O(log n) lookup acceptable |
| Array type | std::vector<Json> |
Standard choice, efficient iteration |
| String type | std::string |
UTF-8 encoded, standard choice |
6. Testing Strategy
Test Categories:
// Valid JSON parsing
TEST(JsonParser, ParsesEmptyObject) {
auto json = Json::parse("{}");
EXPECT_TRUE(json.is_object());
EXPECT_EQ(json.size(), 0);
}
TEST(JsonParser, ParsesNestedStructure) {
auto json = Json::parse(R"({"a": {"b": [1, 2, 3]}})");
EXPECT_EQ(json["a"]["b"][1].as_int(), 2);
}
// Invalid JSON rejection
TEST(JsonParser, RejectsTrailingComma) {
EXPECT_THROW(Json::parse("[1, 2,]"), JsonParseError);
}
TEST(JsonParser, RejectsSingleQuotes) {
EXPECT_THROW(Json::parse("{'key': 'value'}"), JsonParseError);
}
// Round-trip
TEST(JsonParser, RoundTrip) {
std::string input = R"({"a":1,"b":"hello","c":[1,2,3]})";
auto json = Json::parse(input);
auto output = json.dump();
auto reparsed = Json::parse(output);
EXPECT_EQ(json, reparsed);
}
7. Common Pitfalls & Debugging
| Problem | Symptom | Root Cause | Fix |
|---|---|---|---|
| Off-by-one in strings | Missing last char or crash | Not handling closing quote correctly | Check boundary conditions |
| Unicode decode error | Garbage characters | Not converting \uXXXX to UTF-8 | Implement proper UTF-8 encoding |
| Stack overflow | Crash on deep nesting | Recursive descent with no limit | Add depth limit or use iterative parser |
| Number precision loss | 9007199254740993 becomes 9007199254740992 | double can’t represent all integers | Document limitation or use string for big ints |
| Memory leak | Growing memory | Recursive variant without proper destructor | Ensure proper move/copy semantics |
8. Extensions & Challenges
- SAX Parser: Implement event-based parsing for large files
- JSON Pointer: Implement RFC 6901 for path-based access (
/foo/bar/0) - JSON Patch: Implement RFC 6902 for diff/patch operations
- Schema Validation: Validate against JSON Schema
- Comments Support: Allow // and /* */ comments (non-standard but useful)
- SIMD Optimization: Use SSE/AVX for faster whitespace skipping
9. Real-World Connections
- nlohmann/json: The most popular C++ JSON library, similar design
- RapidJSON: Performance-focused, uses different architecture
- simdjson: Uses SIMD for 2-4x faster parsing
- jq: Command-line JSON processor, good for testing
10. Resources
- JSON Specification
- RFC 8259 - The JSON Data Interchange Format
- JSON Test Suite
- cppreference: std::variant
11. Self-Assessment Checklist
Before considering this project complete, verify:
- Parses all valid JSON from test suite
- Rejects all invalid JSON from test suite
- Handles all escape sequences including \uXXXX
- Error messages include line and column
- Round-trip: parse(dump(x)) == x for all valid JSON
- Pretty-printing produces correctly indented output
- No memory leaks (Valgrind clean)
- No undefined behavior (ASan/UBSan clean)
- Performance: 1MB file in < 100ms
12. Submission / Completion Criteria
Your implementation is complete when:
- Passes JSON Test Suite: All y_.json parse, all n_.json reject
- API is intuitive: Similar to nlohmann/json interface
- Error messages are helpful: Include position and context
- Documentation complete: Header comments, examples
- Can explain: You can describe tokenizer and parser in an interview