Project 4: JSON Parser with Explicit Ownership

Project 4: JSON Parser with Explicit Ownership

Sprint: 2 - Data & Invariants Difficulty: Advanced Time Estimate: 2 weeks Prerequisites: Projects 1 & 3 (dynamic array and arena), basic parsing concepts


Overview

What youโ€™ll build: A JSON parser that produces a tree structure with clear, documented ownership semanticsโ€”who owns each string, each array, each object.

Why it teaches Data & Invariants: JSON parsing is a gauntlet of ownership decisions:

  • Does the parsed tree own the strings, or point into the source?
  • Who frees nested objects?
  • What happens to the tree if parsing fails halfway?
  • How do you represent null vs โ€œkey not presentโ€?

Every node in your tree has an invariant. Every edge represents an ownership relationship.

The Core Question Youโ€™re Answering:

โ€œHow do you build a recursive tree structure in C where every node clearly owns its children, strings have explicit lifetime semantics, and cleanup is guaranteed even on partial parse failures?โ€


Learning Objectives

By the end of this project, you will be able to:

  1. Implement recursive descent parsing in C
  2. Design ownership semantics for tree structures
  3. Use tagged unions to represent multiple types
  4. Handle parse errors with proper cleanup
  5. Distinguish null from absent values
  6. Integrate arena allocation for efficient memory management

Theoretical Foundation

JSON Value Types

JSON Types:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  null     - The value null                                 โ”‚
โ”‚  boolean  - true or false                                  โ”‚
โ”‚  number   - integers and floats (64-bit double)           โ”‚
โ”‚  string   - UTF-8 text in quotes                          โ”‚
โ”‚  array    - ordered list [v1, v2, ...]                    โ”‚
โ”‚  object   - key-value pairs {"k1": v1, "k2": v2, ...}     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Ownership Patterns

OWNERSHIP IN JSON TREES:

Pattern 1: Deep Copy (Safe, Slower)
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Input string: {"name": "Alice"}                          โ”‚
โ”‚                                                           โ”‚
โ”‚  Parsed tree:                                             โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                                             โ”‚
โ”‚  โ”‚ Object   โ”‚                                             โ”‚
โ”‚  โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                        โ”‚
โ”‚  โ”‚ โ”‚pairsโ”€โ”ผโ”€โ”ผโ”€โ”€โ”€โ”€>โ”‚ "name" copy โ”‚  โ† OWNS this string    โ”‚
โ”‚  โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚     โ”‚ "Alice" copyโ”‚  โ† OWNS this string    โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                        โ”‚
โ”‚                                                           โ”‚
โ”‚  Tree OWNS all string memory. Can outlive input.          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Pattern 2: Shallow Copy (Fast, Dangerous)
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Input string: {"name": "Alice"}                          โ”‚
โ”‚                       โ†‘      โ†‘                            โ”‚
โ”‚  Parsed tree:         โ”‚      โ”‚                            โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”         โ”‚      โ”‚                            โ”‚
โ”‚  โ”‚ Object   โ”‚         โ”‚      โ”‚                            โ”‚
โ”‚  โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”                        โ”‚
โ”‚  โ”‚ โ”‚pairsโ”€โ”ผโ”€โ”ผโ”€โ”€>โ”‚ ptr โ”‚  ptr โ”‚  โ”‚  โ† Points INTO input   โ”‚
โ”‚  โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”˜                        โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                                             โ”‚
โ”‚                                                           โ”‚
โ”‚  Tree BORROWS strings. Dies if input freed.               โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Pattern 3: Arena + Copy (Best of Both)
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Arena owns all memory. Tree uses arena allocations.       โ”‚
โ”‚  To free: arena_destroy() frees EVERYTHING.               โ”‚
โ”‚  No individual free calls. No ownership tracking.          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Representing Absence vs Null

Key distinction:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  {"value": null}  - key exists, value is null              โ”‚
โ”‚  {}               - key does not exist                     โ”‚
โ”‚                                                            โ”‚
โ”‚  These are DIFFERENT!                                      โ”‚
โ”‚                                                            โ”‚
โ”‚  json_get(obj, "value") should distinguish:               โ”‚
โ”‚    โ€ข Found key, value is null โ†’ return JSON_NULL type     โ”‚
โ”‚    โ€ข Key not found โ†’ return NULL pointer (or error code)  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Project Specification

Core API

// Parsing
JsonValue* json_parse(const char* input);
JsonValue* json_parse_file(const char* filename);
void json_free(JsonValue* value);

// Type checking
JsonType json_type(JsonValue* value);
bool json_is_null(JsonValue* value);
bool json_is_bool(JsonValue* value);
bool json_is_number(JsonValue* value);
bool json_is_string(JsonValue* value);
bool json_is_array(JsonValue* value);
bool json_is_object(JsonValue* value);

// Value extraction
bool json_get_bool(JsonValue* value);
double json_get_number(JsonValue* value);
const char* json_get_string(JsonValue* value);

// Array access
size_t json_array_length(JsonValue* array);
JsonValue* json_array_get(JsonValue* array, size_t index);

// Object access
size_t json_object_length(JsonValue* object);
JsonValue* json_object_get(JsonValue* object, const char* key);
const char* json_object_key_at(JsonValue* object, size_t index);
JsonValue* json_object_value_at(JsonValue* object, size_t index);

// Utility
void json_print(JsonValue* value);
void json_pretty_print(JsonValue* value, int indent);

Expected Output

$ ./jsonquery data.json
Successfully parsed JSON in 0.34ms
Tree structure:
  - Root: Object (4 keys)
  - Total nodes: 247
  - Strings: 89 (12.4 KB)
  - Arrays: 12
  - Objects: 18
  - Primitives: 128

$ ./jsonquery data.json '.users[0].name'
"Alice Johnson"

$ ./jsonquery data.json '.users[].age' --format=csv
28
34
22

$ ./jsonquery input.json --pretty-print
{
  "users": [
    {
      "name": "Alice",
      "age": 28,
      "active": true
    }
  ]
}

Solution Architecture

Data Structures

typedef enum {
    JSON_NULL,
    JSON_BOOL,
    JSON_NUMBER,
    JSON_STRING,
    JSON_ARRAY,
    JSON_OBJECT
} JsonType;

typedef struct JsonValue {
    JsonType type;
    union {
        bool bool_val;
        double number_val;
        struct {
            char* str;
            size_t len;
        } string_val;
        struct {
            struct JsonValue** elements;
            size_t count;
            size_t capacity;
        } array_val;
        struct {
            char** keys;
            struct JsonValue** values;
            size_t count;
            size_t capacity;
        } object_val;
    } value;
} JsonValue;

Parsing Strategy

JsonValue* parse_value(const char** cursor) {
    skip_whitespace(cursor);
    char c = **cursor;

    if (c == '{') return parse_object(cursor);
    if (c == '[') return parse_array(cursor);
    if (c == '"') return parse_string(cursor);
    if (c == 't' || c == 'f') return parse_bool(cursor);
    if (c == 'n') return parse_null(cursor);
    if (c == '-' || isdigit(c)) return parse_number(cursor);

    return NULL;  // Parse error
}

Implementation Guide

Phase 1: Lexer/Tokenizer (2-3 hours)

static void skip_whitespace(const char** cursor) {
    while (**cursor == ' ' || **cursor == '\t' ||
           **cursor == '\n' || **cursor == '\r') {
        (*cursor)++;
    }
}

static bool expect(const char** cursor, char c) {
    skip_whitespace(cursor);
    if (**cursor == c) {
        (*cursor)++;
        return true;
    }
    return false;
}

Phase 2: Primitive Parsing (2-3 hours)

static JsonValue* parse_null(const char** cursor) {
    if (strncmp(*cursor, "null", 4) == 0) {
        *cursor += 4;
        JsonValue* v = malloc(sizeof(JsonValue));
        v->type = JSON_NULL;
        return v;
    }
    return NULL;
}

static JsonValue* parse_bool(const char** cursor) {
    JsonValue* v = malloc(sizeof(JsonValue));
    v->type = JSON_BOOL;

    if (strncmp(*cursor, "true", 4) == 0) {
        *cursor += 4;
        v->value.bool_val = true;
        return v;
    }
    if (strncmp(*cursor, "false", 5) == 0) {
        *cursor += 5;
        v->value.bool_val = false;
        return v;
    }

    free(v);
    return NULL;
}

static JsonValue* parse_number(const char** cursor) {
    char* end;
    double num = strtod(*cursor, &end);

    if (end == *cursor) return NULL;

    JsonValue* v = malloc(sizeof(JsonValue));
    v->type = JSON_NUMBER;
    v->value.number_val = num;
    *cursor = end;
    return v;
}

Phase 3: String Parsing (2-3 hours)

static JsonValue* parse_string(const char** cursor) {
    if (**cursor != '"') return NULL;
    (*cursor)++;

    const char* start = *cursor;
    size_t len = 0;

    // Find end of string (handle escapes)
    while (**cursor != '"') {
        if (**cursor == '\0') return NULL;  // Unterminated
        if (**cursor == '\\') {
            (*cursor)++;
            if (**cursor == '\0') return NULL;
        }
        (*cursor)++;
        len++;
    }

    // Copy string
    char* str = malloc(len + 1);
    // ... copy with escape handling ...

    (*cursor)++;  // Skip closing quote

    JsonValue* v = malloc(sizeof(JsonValue));
    v->type = JSON_STRING;
    v->value.string_val.str = str;
    v->value.string_val.len = len;
    return v;
}

Phase 4: Array and Object Parsing (3-4 hours)

static JsonValue* parse_array(const char** cursor) {
    if (!expect(cursor, '[')) return NULL;

    JsonValue* v = malloc(sizeof(JsonValue));
    v->type = JSON_ARRAY;
    v->value.array_val.elements = NULL;
    v->value.array_val.count = 0;
    v->value.array_val.capacity = 0;

    skip_whitespace(cursor);
    if (**cursor == ']') {
        (*cursor)++;
        return v;  // Empty array
    }

    while (1) {
        JsonValue* elem = parse_value(cursor);
        if (!elem) {
            json_free(v);
            return NULL;
        }

        // Add to array (grow if needed)
        array_push(v, elem);

        skip_whitespace(cursor);
        if (**cursor == ']') {
            (*cursor)++;
            return v;
        }
        if (!expect(cursor, ',')) {
            json_free(v);
            return NULL;
        }
    }
}

static JsonValue* parse_object(const char** cursor) {
    if (!expect(cursor, '{')) return NULL;

    JsonValue* v = malloc(sizeof(JsonValue));
    v->type = JSON_OBJECT;
    v->value.object_val.keys = NULL;
    v->value.object_val.values = NULL;
    v->value.object_val.count = 0;
    v->value.object_val.capacity = 0;

    skip_whitespace(cursor);
    if (**cursor == '}') {
        (*cursor)++;
        return v;  // Empty object
    }

    while (1) {
        // Parse key (must be string)
        JsonValue* key = parse_string(cursor);
        if (!key) {
            json_free(v);
            return NULL;
        }

        if (!expect(cursor, ':')) {
            json_free(key);
            json_free(v);
            return NULL;
        }

        JsonValue* value = parse_value(cursor);
        if (!value) {
            json_free(key);
            json_free(v);
            return NULL;
        }

        // Add key-value pair
        object_insert(v, key->value.string_val.str, value);
        free(key);  // We took ownership of the string

        skip_whitespace(cursor);
        if (**cursor == '}') {
            (*cursor)++;
            return v;
        }
        if (!expect(cursor, ',')) {
            json_free(v);
            return NULL;
        }
    }
}

Phase 5: Free Function (1-2 hours)

void json_free(JsonValue* value) {
    if (!value) return;

    switch (value->type) {
        case JSON_STRING:
            free(value->value.string_val.str);
            break;

        case JSON_ARRAY:
            for (size_t i = 0; i < value->value.array_val.count; i++) {
                json_free(value->value.array_val.elements[i]);
            }
            free(value->value.array_val.elements);
            break;

        case JSON_OBJECT:
            for (size_t i = 0; i < value->value.object_val.count; i++) {
                free(value->value.object_val.keys[i]);
                json_free(value->value.object_val.values[i]);
            }
            free(value->value.object_val.keys);
            free(value->value.object_val.values);
            break;

        default:
            break;
    }

    free(value);
}

Common Pitfalls

Pitfall 1: Partial Parse Cleanup

// WRONG: Memory leak on parse error
JsonValue* elem = parse_value(cursor);
if (!elem) return NULL;  // Leaks already-parsed array elements!

// CORRECT: Free on error
JsonValue* elem = parse_value(cursor);
if (!elem) {
    json_free(v);  // Free what we've built so far
    return NULL;
}

Pitfall 2: String Ownership Confusion

// DANGEROUS: Who owns key->value.string_val.str?
char* key_str = key->value.string_val.str;
object_insert(v, key_str, value);
json_free(key);  // This frees key_str too!
// Now v has a dangling pointer

// CORRECT: Transfer ownership explicitly
char* key_str = key->value.string_val.str;
key->value.string_val.str = NULL;  // Prevent double-free
object_insert(v, key_str, value);  // v now owns key_str
free(key);  // Safe - str is NULL

Pitfall 3: Confusing Null and Missing

// BAD API:
JsonValue* v = json_object_get(obj, "missing_key");
if (v == NULL) {
    // Is key missing, or is value null?
}

// GOOD API:
bool found;
JsonValue* v = json_object_get(obj, "key", &found);
if (!found) {
    // Key doesn't exist
} else if (v->type == JSON_NULL) {
    // Key exists, value is null
}

Interview Preparation

Common Questions

  1. โ€œDeep copy vs shallow copy for strings?โ€
    • Deep: safe, owns data, can outlive input
    • Shallow: fast, borrows data, dies with input
    • Arena: fast, safe, bulk deallocation
  2. โ€œHow do you handle parse errors with partial trees?โ€
    • Track all allocations
    • Free everything built so far
    • Or use arena: just destroy arena
  3. โ€œNull vs key not found?โ€
    • Return error code + output parameter
    • Or use sentinel value (not NULL)
    • Document behavior clearly

Self-Assessment Checklist

  • Parse all JSON value types
  • Handle nested structures correctly
  • Clean up on parse errors
  • No memory leaks (Valgrind clean)
  • Distinguish null from missing keys

Next Project: P05: Linked List Database - Apply invariants to a classic data structure with complex operations.