Learn Programming Language Construction: From Tokens to Machine Code
Goal: Deeply understand how programming languages work—from lexing and parsing through interpretation and compilation, to virtual machines, garbage collection, and type systems.
Why Building a Language Matters
Every time you write code, you’re using a tool that someone designed and built. That “someone” made choices about syntax, semantics, memory management, and execution strategy. By building your own language, you’ll understand:
- Why languages have the syntax they do
- How your code becomes machine instructions
- Why some operations are fast and others slow
- What “garbage collection” actually does
- How type systems catch bugs before runtime
- Why language design involves deep trade-offs
After completing these projects, you will be able to:
- Design and implement a complete programming language
- Build interpreters (tree-walking and bytecode-based)
- Build compilers that generate native code via LLVM
- Implement garbage collectors and memory managers
- Design type systems with inference
- Understand any language’s implementation at a deep level
Core Concept Analysis
The Language Implementation Pipeline
Source Code
│
▼
┌─────────────────┐
│ Lexer │ "Tokenization"
│ (Scanner) │ Source → Tokens
└────────┬────────┘
│
▼
┌─────────────────┐
│ Parser │ "Syntax Analysis"
│ │ Tokens → AST
└────────┬────────┘
│
▼
┌─────────────────┐
│ Semantic │ "Type Checking, Name Resolution"
│ Analyzer │ AST → Annotated AST
└────────┬────────┘
│
┌────┴────┐
▼ ▼
┌───────┐ ┌───────────┐
│ Inter │ │ Compiler │
│ preter│ │ │
└───────┘ └─────┬─────┘
│
┌─────┴─────┐
▼ ▼
┌────────┐ ┌─────────┐
│Bytecode│ │ Native │
│ VM │ │ Code │
└────────┘ └─────────┘
Fundamental Concepts
1. Lexical Analysis (Tokenization)
The lexer breaks source code into tokens—the “words” of your language:
let x = 42 + y;
Tokens:
LET "let"
IDENTIFIER "x"
EQUALS "="
NUMBER "42"
PLUS "+"
IDENTIFIER "y"
SEMICOLON ";"
Key concepts:
- Finite automata: States and transitions for recognizing patterns
- Regular expressions: Formal way to specify token patterns
- Lookahead: Sometimes you need to peek at next characters
- Position tracking: For error messages (line, column)
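A hand-written scanner is just a loop over characters that groups them into tokens. Below is a minimal sketch in C; the Lexer/Token layout and TOKEN_* names are illustrative assumptions, not from any particular implementation:
typedef enum { TOKEN_NUMBER, TOKEN_IDENT, TOKEN_PLUS, TOKEN_EOF } TokenType;
typedef struct { TokenType type; const char *start; int length; int line; } Token;
typedef struct { const char *src; int pos; int line; } Lexer;
static char peek(Lexer *lx)    { return lx->src[lx->pos]; }
static char advance(Lexer *lx) { return lx->src[lx->pos++]; }
Token scan_token(Lexer *lx) {
    while (peek(lx) == ' ' || peek(lx) == '\n')      // skip whitespace,
        if (advance(lx) == '\n') lx->line++;         // tracking line numbers
    Token t = { TOKEN_EOF, lx->src + lx->pos, 0, lx->line };
    char c = peek(lx);
    if (c == '\0') return t;                         // end of input
    if (c >= '0' && c <= '9') {                      // NUMBER: [0-9]+
        while (peek(lx) >= '0' && peek(lx) <= '9') advance(lx);
        t.type = TOKEN_NUMBER;
    } else if (c == '+') {
        advance(lx); t.type = TOKEN_PLUS;
    } else {                                         // identifiers and the other
        advance(lx); t.type = TOKEN_IDENT;           // operators elided here
    }
    t.length = (int)((lx->src + lx->pos) - t.start);
    return t;
}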
2. Parsing (Syntax Analysis)
The parser builds a tree structure (AST) from tokens:
let x = 42 + y;
AST:
LetStatement
├── name: "x"
└── value: BinaryExpr
├── op: "+"
├── left: NumberLiteral(42)
└── right: Identifier("y")
Key concepts:
- Grammar: Formal rules for valid syntax (BNF, EBNF)
- Recursive descent: Hand-written parser following grammar
- Operator precedence: 2 + 3 * 4 = 2 + (3 * 4), not (2 + 3) * 4
- Parser generators: Tools that generate parsers from grammars
3. Semantic Analysis
After parsing, we need to check meaning:
- Name resolution: What does x refer to?
- Type checking: Can you add a string and a number?
- Scope analysis: Is this variable accessible here?
fn foo() {
let x = 1;
{
let y = x + 2; // x is visible here
}
print(y); // ERROR: y not in scope
}
4. Interpretation vs Compilation
Tree-Walking Interpreter: Directly executes the AST
eval(BinaryExpr(+, 1, 2))
→ eval(1) + eval(2)
→ 1 + 2
→ 3
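In C, that dispatch might look like the sketch below (the Node layout is an assumption for illustration):
typedef enum { NODE_NUM, NODE_BINARY } NodeKind;
typedef struct Node {
    NodeKind kind;
    double value;                  // NODE_NUM
    char op;                       // NODE_BINARY: '+', '-', '*', '/'
    struct Node *left, *right;
} Node;
double eval(Node *n) {
    if (n->kind == NODE_NUM) return n->value;        // leaf: literal value
    double l = eval(n->left), r = eval(n->right);    // evaluate children first
    switch (n->op) {
        case '+': return l + r;
        case '-': return l - r;
        case '*': return l * r;
        default:  return l / r;
    }
}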
Bytecode Compiler + VM: Compiles to intermediate representation
Source: 1 + 2
Bytecode: PUSH 1
PUSH 2
ADD
VM Stack: [] → [1] → [1,2] → [3]
Native Compiler: Generates machine code
Source: 1 + 2
Assembly: mov eax, 1
add eax, 2
5. Virtual Machines
Stack-based (JVM, Python, CLR):
PUSH 5 ; stack: [5]
PUSH 3 ; stack: [5, 3]
ADD ; stack: [8]
Register-based (Lua, Dalvik):
LOAD R1, 5 ; R1 = 5
LOAD R2, 3 ; R2 = 3
ADD R0, R1, R2 ; R0 = R1 + R2 = 8
6. Memory Management
Manual (C, C++): Programmer allocates/frees
Reference counting: Track references, free when zero
Garbage collection: Automatically find and free unreachable objects
- Mark-sweep: Mark reachable, sweep unmarked
- Copying: Copy live objects, abandon old space
- Generational: Focus on young objects (they die fast)
7. Type Systems
No types (Assembly): Anything goes
Dynamic typing (Python, JS): Types checked at runtime
Static typing (C, Java): Types checked at compile time
Type inference (Haskell, Rust): Compiler deduces types
-- Hindley-Milner infers: map :: (a -> b) -> [a] -> [b]
map f [] = []
map f (x:xs) = f x : map f xs
Project List
Projects are ordered from fundamental to advanced, building on each other.
Project 1: Calculator Language (Lexer + Parser Basics)
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Python
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Lexing / Parsing / Expression Evaluation
- Software or Tool: Text editor, C compiler
- Main Book: “Crafting Interpreters” by Robert Nystrom (Chapters 4-6)
What you’ll build: A calculator that parses and evaluates mathematical expressions with proper operator precedence, parentheses, and variables—like a tiny spreadsheet formula engine.
Why it teaches language basics: This strips away all complexity to focus on the core: taking text, breaking it into tokens, building a tree, and evaluating it. Every language starts here.
Core challenges you’ll face:
- Tokenizing numbers, operators, identifiers → maps to finite state machines, character classification
- Parsing with correct precedence → maps to Pratt parsing or recursive descent with precedence climbing
- Handling parentheses → maps to recursive parsing
- Evaluating the AST → maps to tree-walking interpretation
Key Concepts:
- Lexical Analysis: “Crafting Interpreters” Chapter 4 - Robert Nystrom
- Operator Precedence: Pratt Parsing - Bob Nystrom
- Expression Evaluation: “Crafting Interpreters” Chapter 7 - Robert Nystrom
- Recursive Descent: “Engineering a Compiler” Chapter 3 - Cooper & Torczon
Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Basic C programming, understanding of recursion.
Real world outcome:
calc> 2 + 3 * 4
14
calc> (2 + 3) * 4
20
calc> x = 10
10
calc> x * 2 + 5
25
calc> sin(3.14159 / 2)
1.0
calc> sqrt(x)
3.162...
Implementation Hints:
Start with the lexer (tokenizer):
- Token Types:
- NUMBER (integer or float)
- IDENTIFIER (variable names, function names)
- PLUS, MINUS, STAR, SLASH, CARET (operators)
- LPAREN, RPAREN
- EQUALS (assignment)
- EOF (end of input)
- Lexer Structure:
- Keep track of current position in source
  - next_char(): advance and return character
  - peek(): look at next character without consuming
  - scan_token(): produce the next token
  - Skip whitespace between tokens
- Parser with Precedence (Pratt parsing approach):
  - Each operator has a “binding power” (precedence)
  - + and -: binding power 10
  - * and /: binding power 20
  - ^ (power): binding power 30, right-associative
  - Unary -: higher than binary operators
  - Parentheses: parse inner expression at binding power 0
- Recursive Descent Alternative:
  expression → term (('+' | '-') term)*
  term       → factor (('*' | '/') factor)*
  factor     → primary ('^' factor)?
  primary    → NUMBER | IDENTIFIER | '(' expression ')' | '-' primary
- Evaluation:
- For each AST node type, define how to evaluate
- Binary: evaluate left, evaluate right, apply operator
- Variable: look up in environment (hash table)
- Assignment: evaluate right side, store in environment
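A sketch of the precedence-climbing (Pratt) core in C, parsing and evaluating in one pass. The token names match the list above; peek_token, next_token, parse_primary, and apply are assumed helpers:
typedef enum { PLUS, MINUS, STAR, SLASH, CARET, END } TokenType;
// Assumed helpers (not shown): token access and primary parsing.
TokenType peek_token(void);
void next_token(void);
double parse_primary(void);   // NUMBER, IDENTIFIER, '(' expr ')', unary '-'
double apply(TokenType op, double left, double right);
int binding_power(TokenType op) {
    switch (op) {
        case PLUS: case MINUS: return 10;
        case STAR: case SLASH: return 20;
        case CARET:            return 30;
        default:               return 0;   // not a binary operator
    }
}
// Parse an expression whose operators bind tighter than min_bp.
double parse_expr(int min_bp) {
    double left = parse_primary();
    for (;;) {
        TokenType op = peek_token();
        int bp = binding_power(op);
        if (bp <= min_bp) break;          // caller owns this operator
        next_token();                     // consume the operator
        // Right-associative ^ recurses at bp - 1 so a^b^c = a^(b^c).
        double right = parse_expr(op == CARET ? bp - 1 : bp);
        left = apply(op, left, right);
    }
    return left;
}
Calling parse_expr(0) parses a full expression; the same loop handles every precedence level.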
Ask yourself: Why does * have higher precedence than +? How would you add the modulo (%) operator?
Learning milestones:
- Lexer correctly tokenizes expressions → You understand scanning
- Parser handles precedence correctly → You understand parsing
- Variables work → You understand symbol tables
- Functions like sin/cos work → You understand function calls
- Error messages show line/column → You understand error handling
Project 2: JSON Parser (Recursive Descent Mastery)
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Zig
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 1: Beginner
- Knowledge Area: Parsing / Data Structures
- Software or Tool: Text editor, C compiler
- Main Book: “Crafting Interpreters” by Robert Nystrom (Chapter 6)
What you’ll build: A complete JSON parser that handles all JSON types (objects, arrays, strings, numbers, booleans, null), including proper Unicode escape handling and error reporting.
Why it teaches recursive descent: JSON has a clean, recursive grammar that’s perfect for learning recursive descent parsing. Objects contain values, values can be objects—this recursion is the heart of parsing.
Core challenges you’ll face:
- Handling nested structures → maps to recursive parsing
- String escape sequences → maps to lexer complexity, Unicode
- Number formats → maps to handling edge cases (-0, exponents)
- Memory management → maps to building data structures while parsing
Key Concepts:
- JSON Grammar: RFC 8259
- Recursive Descent: “Crafting Interpreters” Chapter 6 - Robert Nystrom
- Unicode Handling: “The Absolute Minimum Every Developer Must Know About Unicode” - Joel Spolsky
Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Project 1 completed, understanding of recursion and data structures.
Real world outcome:
// Parse JSON and access values
JsonValue* root = json_parse(source);
if (root->type == JSON_OBJECT) {
JsonValue* name = json_object_get(root, "name");
printf("Name: %s\n", name->string_value);
JsonValue* scores = json_object_get(root, "scores");
for (int i = 0; i < scores->array_length; i++) {
printf("Score %d: %f\n", i, scores->array[i]->number_value);
}
}
json_free(root);
Implementation Hints:
JSON grammar is simple and clean:
value → object | array | string | number | "true" | "false" | "null"
object → '{' (pair (',' pair)*)? '}'
pair → string ':' value
array → '[' (value (',' value)*)? ']'
- Data Structure:
typedef enum {
    JSON_NULL, JSON_BOOL, JSON_NUMBER,
    JSON_STRING, JSON_ARRAY, JSON_OBJECT
} JsonType;

typedef struct JsonValue {
    JsonType type;
    union {
        bool bool_value;
        double number_value;
        char* string_value;
        struct { struct JsonValue** items; size_t count; } array;
        struct { char** keys; struct JsonValue** values; size_t count; } object;
    };
} JsonValue;
- Parsing Functions (one per grammar rule):
JsonValue* parse_value(Parser* p);    // Entry point
JsonValue* parse_object(Parser* p);   // Parse {...}
JsonValue* parse_array(Parser* p);    // Parse [...]
JsonValue* parse_string(Parser* p);   // Parse "..."
JsonValue* parse_number(Parser* p);   // Parse 123.45e-6
- String Escapes:
  - \" → "
  - \\ → \
  - \/ → /
  - \n, \r, \t, \b, \f → control characters
  - \uXXXX → Unicode code point (tricky!)
- Error Handling:
- Track line and column in lexer
- On error, report position and expected token
- Consider recovery (skip to next } or ])
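A sketch of the array rule in C; expect, match, peek, json_new, and json_array_append are assumed helpers on top of the structs above:
// Assumed helpers over the Parser/JsonValue definitions above:
int  peek(Parser *p);                  // next token character without consuming
int  match(Parser *p, char c);         // consume if next token is c
void expect(Parser *p, char c);        // consume c or report an error
JsonValue *json_new(JsonType type);
void json_array_append(JsonValue *arr, JsonValue *item);

// array → '[' (value (',' value)*)? ']'
JsonValue *parse_array(Parser *p) {
    expect(p, '[');                             // consume opening bracket
    JsonValue *arr = json_new(JSON_ARRAY);
    if (peek(p) != ']') {
        do {
            json_array_append(arr, parse_value(p));  // values may nest: recursion
        } while (match(p, ','));                // a comma promises another value
    }
    expect(p, ']');                             // closing bracket or error
    return arr;
}
parse_object has the same shape, except each element is a string key, a ':', then a value.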
Think about: How would you handle very deeply nested JSON (stack overflow)? How would you stream-parse a huge JSON file?
Learning milestones:
- Simple values parse (null, true, numbers) → You understand token matching
- Arrays work → You understand recursive descent
- Objects work → You understand key-value parsing
- Escape sequences work → You understand string handling
- Good error messages → You understand error recovery
Project 3: Regular Expression Engine
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Finite Automata / Pattern Matching
- Software or Tool: Text editor, C compiler
- Main Book: “Engineering a Compiler” by Cooper & Torczon (Chapter 2)
What you’ll build: A regular expression engine that compiles regex patterns to NFAs (or directly matches via backtracking), supporting common operators: concatenation, alternation (|), repetition (*, +, ?), and character classes.
Why it teaches lexer foundations: Lexers are built on regular expressions. Understanding how regex engines work—either through finite automata or backtracking—gives you insight into what makes tokenization efficient or pathological.
Core challenges you’ll face:
- Parsing regex syntax → maps to recursive descent on regex grammar
- Building NFA from regex → maps to Thompson’s construction
- NFA simulation → maps to subset construction, epsilon closure
- Handling pathological cases → maps to understanding catastrophic backtracking
Key Concepts:
- Thompson’s Construction: “Engineering a Compiler” Chapter 2 - Cooper & Torczon
- NFA to DFA: “Compilers: Principles, Techniques, and Tools” Chapter 3 - Aho et al.
- Regular Expression Matching Can Be Simple And Fast: Russ Cox’s Article
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Understanding of graphs, finite automata basics.
Real world outcome:
Regex* re = regex_compile("a(b|c)*d");
printf("%d\n", regex_match(re, "ad")); // 1 (true)
printf("%d\n", regex_match(re, "abcd")); // 1 (true)
printf("%d\n", regex_match(re, "abbbcd")); // 1 (true)
printf("%d\n", regex_match(re, "aed")); // 0 (false)
// With capturing groups
Match m = regex_search(regex_compile("(\\d+)-(\\d+)"), "range: 10-20");
printf("Start: %s, End: %s\n", m.groups[1], m.groups[2]);
// Output: Start: 10, End: 20
Implementation Hints:
Two main approaches:
Approach 1: NFA Simulation (Thompson’s Algorithm)
- Parse regex into AST:
typedef enum { CHAR, CONCAT, ALT, STAR, PLUS, QUEST } RegexType;

typedef struct RegexNode {
    RegexType type;
    char c;                           // for CHAR
    struct RegexNode *left, *right;   // for binary ops
    struct RegexNode *child;          // for unary ops
} RegexNode;
- Convert AST to NFA (Thompson’s construction):
- Each regex constructs an NFA fragment with one start and one end state
- CHAR: Single transition on that character
- CONCAT: Connect end of first to start of second
- ALT: New start with ε-transitions to both; both ends ε-transition to new end
- STAR: ε from start to end (skip); ε from end back to start (loop)
- Simulate NFA:
- Maintain set of current states (can be in multiple states due to ε)
- For each input character, compute next state set
- Match succeeds if final state set contains accept state
Approach 2: Backtracking
- Directly interpret regex AST:
- At alternation: try first option, backtrack if fails
- At repetition: try matching, save position for backtracking
- Simpler to implement but exponential worst-case
Character Classes:
- [abc] → matches a, b, or c
- [a-z] → matches lowercase letters
- [^abc] → matches anything except a, b, or c
- \d, \w, \s → digit, word char, whitespace
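A sketch of the state-set simulation in C. The State/Nfa layout is deliberately simplified, and add_state is assumed to also follow ε-transitions (the ε-closure):
#include <stdbool.h>

typedef struct State State;
struct State { int c; State *out; };   // simplified: one labeled edge per state
typedef struct { State *states[256]; int count; } StateSet;
typedef struct { State *start; State *accept; } Nfa;

void add_state(StateSet *set, State *s);   // assumed: adds s plus its ε-closure

bool nfa_match(Nfa *nfa, const char *input) {
    StateSet cur = {0}, next = {0};
    add_state(&cur, nfa->start);                // start set = ε-closure(start)
    for (const char *ch = input; *ch; ch++) {
        next.count = 0;
        for (int i = 0; i < cur.count; i++)     // advance every live state
            if (cur.states[i]->c == *ch)
                add_state(&next, cur.states[i]->out);
        StateSet tmp = cur; cur = next; next = tmp;   // swap the two sets
    }
    for (int i = 0; i < cur.count; i++)
        if (cur.states[i] == nfa->accept) return true;
    return false;
}
Because the set can never hold more than the total number of NFA states, each input character costs bounded work; there is no backtracking and no blowup.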
Think about: Why does (a+)+ take exponential time on “aaaaaaaaaaaaaaab” with backtracking but linear time with NFA simulation?
Learning milestones:
- Simple patterns match (abc, a|b) → You understand basic NFA
- Repetition works (a*, a+) → You understand loops in NFA
- Character classes work → You understand efficient matching
- No exponential blowup → You understand NFA simulation
- Capture groups work → You understand tracking positions
Project 4: Tree-Walking Interpreter (Full Language)
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Interpreters / Language Design
- Software or Tool: Text editor, C compiler
- Main Book: “Crafting Interpreters” by Robert Nystrom (Part II)
What you’ll build: A complete tree-walking interpreter for a dynamically-typed language with variables, functions, closures, control flow, and basic data structures—similar to Lox from Crafting Interpreters.
Why it teaches interpretation: Tree-walking is the most straightforward way to execute code. You’ll implement every language feature directly, seeing exactly how variables, scope, functions, and closures work.
Core challenges you’ll face:
- Variable scoping → maps to environments, scope chains
- Function calls → maps to activation records, call stacks
- Closures → maps to capturing environment references
- Control flow → maps to early returns, break/continue
Key Concepts:
- Environments and Scope: “Crafting Interpreters” Chapter 8 - Robert Nystrom
- Functions and Closures: “Crafting Interpreters” Chapters 10-11 - Robert Nystrom
- Control Flow: “Crafting Interpreters” Chapter 9 - Robert Nystrom
Difficulty: Intermediate. Time estimate: 2-3 weeks. Prerequisites: Projects 1-2 completed, understanding of scope and functions.
Real world outcome:
// Your language can run this:
fn fib(n) {
if (n <= 1) return n;
return fib(n - 1) + fib(n - 2);
}
print(fib(20)); // 6765
// Closures work:
fn makeCounter() {
var count = 0;
fn increment() {
count = count + 1;
return count;
}
return increment;
}
var counter = makeCounter();
print(counter()); // 1
print(counter()); // 2
print(counter()); // 3
Implementation Hints:
Design your language’s features:
- Data Types:
- Nil, Boolean, Number (double), String
- Later: Arrays, Objects/Hashes
- Expressions:
- Literals, variables, binary ops, unary ops
- Grouping (parentheses)
- Function calls
- Property access (for objects)
- Statements:
  - Expression statements (foo();)
  - Print statement (print x;)
  - Variable declaration (var x = 1;)
  - Block ({ ... })
  - If/else, while, for
  - Function declaration
  - Return
- Environment (Symbol Table):
typedef struct Environment {
    struct Environment* enclosing;   // Parent scope
    HashTable values;                // Variable bindings
} Environment;

Value env_get(Environment* env, char* name) {
    if (hash_table_has(&env->values, name)) {
        return hash_table_get(&env->values, name);
    }
    if (env->enclosing != NULL) {
        return env_get(env->enclosing, name);
    }
    error("Undefined variable: %s", name);
}
- Function Values:
typedef struct {
    ASTNode* declaration;    // The function's AST
    Environment* closure;    // Environment where function was defined
} Function;
- Executing a Call:
- Create new environment with function’s closure as parent
- Bind parameters to argument values
- Execute function body in new environment
- Handle return (might need exception-like mechanism)
- Closures:
- When creating a function, capture current environment
- When calling, use captured environment as parent
- This makes inner functions “remember” outer variables
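Putting those pieces together, the call path might look like this C sketch; env_new, env_define, and eval_block are assumed helper names around the structs above, and the params field is an assumed layout:
// Assumed helpers (hypothetical names):
Environment *env_new(Environment *parent);
void env_define(Environment *env, const char *name, Value value);
Value eval_block(ASTNode *body, Environment *env);   // must propagate 'return'

Value call_function(Function *fn, Value *args, int arg_count) {
    // New scope whose parent is the environment captured at definition time,
    // not the caller's environment. This is what makes closures work.
    Environment *local = env_new(fn->closure);
    for (int i = 0; i < arg_count; i++)   // bind parameters to argument values
        env_define(local, fn->declaration->params[i], args[i]);
    return eval_block(fn->declaration->body, local);
}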
Learning milestones:
- Variables and arithmetic work → You understand environments
- Control flow works → You understand interpretation
- Functions work → You understand call semantics
- Closures work → You understand lexical scoping
- You can write recursive programs → Full interpreter working
Project 5: Bytecode Compiler + Stack VM
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Zig
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Compilation / Virtual Machines
- Software or Tool: Text editor, C compiler
- Main Book: “Crafting Interpreters” by Robert Nystrom (Part III)
What you’ll build: A bytecode compiler that translates your language to a custom instruction set, and a stack-based virtual machine to execute it—10-100x faster than tree-walking.
Why it teaches compilation: This is where the magic happens. You’ll transform high-level constructs into low-level operations, manage a virtual stack, and understand exactly how “real” VMs like the JVM work.
Core challenges you’ll face:
- Designing an instruction set → maps to what operations does the VM support
- Compiling expressions to stack ops → maps to postfix/RPN translation
- Compiling control flow → maps to jumps, patching addresses
- Compiling functions → maps to call frames, return addresses
- VM execution loop → maps to fetch-decode-execute cycle
Key Concepts:
- Bytecode Design: “Crafting Interpreters” Chapter 14 - Robert Nystrom
- Stack-Based VM: “Crafting Interpreters” Chapter 15 - Robert Nystrom
- Compiling Expressions: “Crafting Interpreters” Chapter 17 - Robert Nystrom
- Jumps and Control Flow: “Crafting Interpreters” Chapter 23 - Robert Nystrom
Difficulty: Advanced. Time estimate: 3-4 weeks. Prerequisites: Project 4 completed.
Real world outcome:
Source: 1 + 2 * 3
Bytecode:
0000 OP_CONSTANT 1
0002 OP_CONSTANT 2
0004 OP_CONSTANT 3
0006 OP_MULTIPLY
0007 OP_ADD
0008 OP_RETURN
Execution:
[]
[1] after OP_CONSTANT 1
[1, 2] after OP_CONSTANT 2
[1, 2, 3] after OP_CONSTANT 3
[1, 6] after OP_MULTIPLY
[7] after OP_ADD
Implementation Hints:
- Instruction Set Design:
typedef enum {
    OP_CONSTANT,        // Push constant onto stack
    OP_NIL, OP_TRUE, OP_FALSE,
    OP_ADD, OP_SUBTRACT, OP_MULTIPLY, OP_DIVIDE,
    OP_NEGATE, OP_NOT,
    OP_EQUAL, OP_GREATER, OP_LESS,
    OP_POP,             // Discard top of stack
    OP_GET_LOCAL,       // Push local variable
    OP_SET_LOCAL,       // Store into local variable
    OP_GET_GLOBAL,      // Push global variable
    OP_SET_GLOBAL,      // Store into global
    OP_JUMP,            // Unconditional jump
    OP_JUMP_IF_FALSE,   // Conditional jump
    OP_LOOP,            // Jump backward
    OP_CALL,            // Call function
    OP_RETURN,          // Return from function
} OpCode;
- Chunk (Bytecode Container):
typedef struct {
    uint8_t* code;          // Bytecode array
    int* lines;             // Source line for each instruction
    ValueArray constants;   // Constant pool
    int count, capacity;
} Chunk;
- Compiling Expressions:
- Numbers: emit OP_CONSTANT with value in constant pool
- Binary ops: compile left, compile right, emit operator
- Unary: compile operand, emit operator
- Variables: emit OP_GET_LOCAL/GLOBAL with index
- Compiling Control Flow:
  - if: compile condition, emit OP_JUMP_IF_FALSE with placeholder, compile then-branch, patch jump, optionally compile else
  - while: mark loop start, compile condition, emit OP_JUMP_IF_FALSE, compile body, emit OP_LOOP back to start, patch exit jump
  - Need to “patch” jump offsets after knowing where to jump (see the patching sketch just after these hints)
- VM Structure:
typedef struct {
    Chunk* chunk;
    uint8_t* ip;                    // Instruction pointer
    Value stack[STACK_MAX];
    Value* stackTop;
    CallFrame frames[FRAMES_MAX];
    int frameCount;
} VM;
- Main Execution Loop:
for (;;) {
    uint8_t instruction = *vm.ip++;
    switch (instruction) {
        case OP_CONSTANT: push(READ_CONSTANT()); break;
        case OP_ADD: {
            Value b = pop();
            Value a = pop();
            push(a + b);
        } break;
        case OP_RETURN: return;
        // ... etc
    }
}
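Jump patching from the control-flow hint, as a C sketch over the Chunk above (emit_byte is an assumed helper that appends one byte):
void emit_byte(Chunk *chunk, uint8_t byte);   // assumed helper

// Emit a jump with a placeholder offset; return where the placeholder lives.
static int emit_jump(Chunk *chunk, uint8_t op) {
    emit_byte(chunk, op);
    emit_byte(chunk, 0xff);                 // two placeholder bytes for a
    emit_byte(chunk, 0xff);                 // 16-bit forward offset
    return chunk->count - 2;
}

// Once the target is known, back-fill how many bytes to skip.
static void patch_jump(Chunk *chunk, int offset) {
    int jump = chunk->count - offset - 2;   // distance past the operand
    chunk->code[offset]     = (jump >> 8) & 0xff;
    chunk->code[offset + 1] = jump & 0xff;
}
An if then compiles as: condition, emit_jump(OP_JUMP_IF_FALSE), then-branch, patch_jump.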
Learning milestones:
- Expressions compile and run → You understand stack VMs
- Variables work → You understand locals/globals
- Control flow works → You understand jumps
- Functions work → You understand call frames
- Performance is much faster → You understand why bytecode matters
Project 6: Mark-Sweep Garbage Collector
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Zig
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Memory Management / Garbage Collection
- Software or Tool: Text editor, C compiler, Valgrind
- Main Book: “Crafting Interpreters” by Robert Nystrom (Chapter 26)
What you’ll build: An automatic memory manager using mark-sweep garbage collection—objects are automatically freed when no longer reachable from the program.
Why it teaches memory management: Every high-level language needs memory management. Understanding mark-sweep—the simplest tracing GC—teaches you how the runtime finds and frees garbage, and why GC pauses happen.
Core challenges you’ll face:
- Tracking all allocations → maps to object lists, allocation wrappers
- Finding roots → maps to understanding what keeps objects alive
- Marking reachable objects → maps to graph traversal
- Sweeping dead objects → maps to freeing memory, list management
- Triggering GC at right times → maps to GC scheduling
Key Concepts:
- Mark-Sweep Basics: “Crafting Interpreters” Chapter 26 - Robert Nystrom
- Baby’s First Garbage Collector: journal.stuffwithstuff.com - Bob Nystrom
- Garbage Collection Handbook: “The Garbage Collection Handbook” by Jones, Hosking, Moss
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Project 5 completed.
Real world outcome:
// In your language runtime:
for (int i = 0; i < 1000000; i++) {
// These temporary strings are automatically collected
var s = "hello" + i;
}
// Memory usage stays bounded, not 1 million strings!
// Circular references handled correctly:
var a = {};
var b = {};
a.ref = b;
b.ref = a;
// When a and b go out of scope, both are freed
// (reference counting would leak!)
Implementation Hints:
- Object Header:
typedef struct Object {
    struct Object* next;   // Linked list of all objects
    bool marked;           // GC mark bit
    ObjType type;          // What kind of object
} Object;

typedef struct {
    Object obj;
    double value;
} ObjNumber;

typedef struct {
    Object obj;
    int length;
    char chars[];          // Flexible array member
} ObjString;
- Tracking All Objects:
// In the VM struct: Object* objects; initialized to NULL (head of the list)

Object* allocate_object(size_t size, ObjType type) {
    Object* obj = malloc(size);
    obj->type = type;
    obj->marked = false;
    obj->next = vm.objects;    // Add to list
    vm.objects = obj;

    vm.bytesAllocated += size; // Maybe trigger GC
    if (vm.bytesAllocated > vm.nextGC) {
        collectGarbage();
    }
    return obj;
}
- Finding Roots (what’s definitely alive):
- Stack: all values on VM stack
- Globals: all global variables
- Call frames: function objects, closures
- Compiler state (during compilation)
- Mark Phase:
void markObject(Object* object) {
    if (object == NULL || object->marked) return;
    object->marked = true;
    // Add to worklist for tracing
    grayStack.push(object);
}

void traceReferences() {
    while (grayStack.count > 0) {
        Object* object = grayStack.pop();
        blackenObject(object);  // Mark objects this object references
    }
}
- Sweep Phase:
void sweep() {
    Object** object = &vm.objects;
    while (*object != NULL) {
        if (!(*object)->marked) {
            // Unreachable - free it
            Object* unreached = *object;
            *object = unreached->next;
            freeObject(unreached);
        } else {
            // Reachable - unmark for next GC cycle
            (*object)->marked = false;
            object = &(*object)->next;
        }
    }
}
- GC Scheduling:
- After collection, set next threshold based on live memory
vm.nextGC = vm.bytesAllocated * GC_HEAP_GROW_FACTOR;
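Root marking ties the root list above together. A C sketch, assuming the VM also keeps a globals table and that markValue/markTable exist alongside markObject:
void markValue(Value value);     // assumed: marks if the value holds an object
void markTable(Table *table);    // assumed: marks every key and value

static void markRoots(VM *vm) {
    // 1. Everything on the value stack is reachable.
    for (Value *slot = vm->stack; slot < vm->stackTop; slot++)
        markValue(*slot);
    // 2. Every global binding is reachable.
    markTable(&vm->globals);
    // 3. Every active call frame keeps its closure (and upvalues) alive.
    for (int i = 0; i < vm->frameCount; i++)
        markObject((Object *)vm->frames[i].closure);
}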
Learning milestones:
- All allocations tracked → You understand object lifecycle
- Mark phase traverses correctly → You understand reachability
- Sweep frees correct objects → You understand collection
- Circular references collected → Mark-sweep handles them
- Memory stays bounded → GC is working correctly
Project 7: Closures and Upvalues
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Closures / Variable Capture
- Software or Tool: Text editor, C compiler
- Main Book: “Crafting Interpreters” by Robert Nystrom (Chapter 25)
What you’ll build: A proper closure implementation where functions can capture and modify variables from enclosing scopes, even after those scopes have exited.
Why it teaches closures deeply: Closures are fundamental to functional programming but tricky to implement correctly. Variables must “escape” the stack when the enclosing function returns—this requires moving them to the heap.
Core challenges you’ll face:
- Identifying captured variables → maps to static analysis during compilation
- Creating upvalue objects → maps to heap-allocated variable storage
- Closing over stack variables → maps to detecting when to move to heap
- Sharing upvalues between closures → maps to upvalue deduplication
Key Concepts:
- Closures and Upvalues: “Crafting Interpreters” Chapter 25 - Robert Nystrom
- Closure Implementation: “Programming Language Pragmatics” Chapter 3 - Michael Scott
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Project 5 completed.
Real world outcome:
fn makeCounter() {
var count = 0;
fn increment() {
count = count + 1; // Captures 'count' from outer scope
return count;
}
return increment;
}
var c1 = makeCounter();
var c2 = makeCounter();
print(c1()); // 1
print(c1()); // 2
print(c2()); // 1 (separate counter!)
print(c1()); // 3
Implementation Hints:
The challenge: when makeCounter returns, count is on the stack. But increment needs it later!
- Upvalue Object:
typedef struct ObjUpvalue {
    Object obj;
    Value* location;            // Points to stack slot or 'closed'
    Value closed;               // Value when closed over
    struct ObjUpvalue* next;
} ObjUpvalue;
- Closure Object:
typedef struct {
    Object obj;
    ObjFunction* function;
    ObjUpvalue** upvalues;      // Array of captured variables
    int upvalueCount;
} ObjClosure;
- Compile-Time Analysis:
- Track which variables are captured by nested functions
- Mark those variables as needing upvalues
- Record in function’s bytecode which upvalues to capture
- Creating Upvalues at Runtime:
- When a closure captures a local, create an ObjUpvalue pointing to stack slot
- Keep list of open upvalues (still pointing to stack)
- Share upvalues if multiple closures capture same variable
- Closing Upvalues:
- When a local goes out of scope, check if any upvalue points to it
- If so, copy value into upvalue’s closed field
- Update location pointer to point to closed
- Accessing Upvalues:
- OP_GET_UPVALUE: get value through upvalue indirection
- OP_SET_UPVALUE: set value through upvalue indirection
- Works whether upvalue is open (stack) or closed (heap)
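Upvalue sharing is the subtle part. A C sketch in the style of Crafting Interpreters, assuming vm->openUpvalues is a list of open upvalues sorted from top of stack down and newUpvalue is an assumed constructor:
ObjUpvalue *newUpvalue(Value *slot);   // assumed constructor

// Reuse an existing open upvalue for this stack slot, or create one.
static ObjUpvalue *captureUpvalue(VM *vm, Value *local) {
    ObjUpvalue *prev = NULL;
    ObjUpvalue *uv = vm->openUpvalues;
    while (uv != NULL && uv->location > local) {   // walk past higher slots
        prev = uv;
        uv = uv->next;
    }
    if (uv != NULL && uv->location == local)
        return uv;                   // two closures now share one upvalue
    ObjUpvalue *created = newUpvalue(local);
    created->next = uv;              // splice into the sorted list
    if (prev == NULL) vm->openUpvalues = created;
    else              prev->next = created;
    return created;
}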
Learning milestones:
- Simple closures work → You understand basic capture
- Mutations visible to closure → Upvalue indirection works
- Closures outlive enclosing function → Closing works
- Multiple closures share variables → Upvalue sharing works
- Nested closures work → Full upvalue chain working
Project 8: Classes and Inheritance
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Object-Oriented Programming / Method Dispatch
- Software or Tool: Text editor, C compiler
- Main Book: “Crafting Interpreters” by Robert Nystrom (Chapters 27-29)
What you’ll build: An object-oriented extension to your language with classes, instances, methods, constructors, inheritance, and super calls.
Why it teaches OOP implementation: OOP is everywhere, but its implementation is rarely explained. You’ll understand how methods are bound to instances, how this works, and how inheritance actually dispatches to superclass methods.
Core challenges you’ll face:
- Instance memory layout → maps to objects as hash tables or fixed layouts
- Method binding → maps to methods capturing ‘this’
- Constructor chaining → maps to initializers, superclass init
- Method inheritance → maps to method tables, lookup chains
- Super calls → maps to starting lookup from superclass
Key Concepts:
- Classes and Instances: “Crafting Interpreters” Chapter 27 - Robert Nystrom
- Methods and Initializers: “Crafting Interpreters” Chapter 28 - Robert Nystrom
- Superclasses: “Crafting Interpreters” Chapter 29 - Robert Nystrom
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Project 7 completed.
Real world outcome:
class Animal {
init(name) {
this.name = name;
}
speak() {
print(this.name + " makes a sound");
}
}
class Dog < Animal {
init(name, breed) {
super.init(name);
this.breed = breed;
}
speak() {
print(this.name + " barks!");
}
describe() {
print(this.name + " is a " + this.breed);
}
}
var dog = Dog("Rex", "German Shepherd");
dog.speak(); // Rex barks!
dog.describe(); // Rex is a German Shepherd
Implementation Hints:
- Class Object:
typedef struct {
    Object obj;
    ObjString* name;
    Table methods;       // Hash table of method names → closures
} ObjClass;
- Instance Object:
typedef struct {
    Object obj;
    ObjClass* klass;
    Table fields;        // Hash table of field names → values
} ObjInstance;
- Bound Method (method + receiver):
typedef struct {
    Object obj;
    Value receiver;      // The 'this' object
    ObjClosure* method;  // The method closure
} ObjBoundMethod;
- Method Access (instance.method):
  - If instance has field with that name, return field
- Otherwise, look up in class’s method table
- Wrap in BoundMethod with instance as receiver
- Method Invocation:
- At call time, bound method unwraps
- Push receiver as first local (slot 0 = this)
- Inheritance:
- Store superclass reference in ObjClass
- Method lookup: check class, then superclass chain
- Or: copy superclass methods into subclass table at creation
- Super Calls:
  - super.method() starts lookup at the current class’s superclass
  - Need to track which class’s method is currently executing
- Compile super as upvalue + method name
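Property lookup with method binding, as a C sketch over the structs above; tableGet, newBoundMethod, and the OBJ_VAL/AS_CLOSURE value macros are assumed names in the style of Crafting Interpreters:
// instance.name: fields shadow methods; methods get bound to the receiver.
static bool getProperty(ObjInstance *instance, ObjString *name, Value *out) {
    if (tableGet(&instance->fields, name, out))
        return true;                               // a field wins
    Value method;
    if (tableGet(&instance->klass->methods, name, &method)) {
        ObjBoundMethod *bound =
            newBoundMethod(OBJ_VAL(instance), AS_CLOSURE(method));
        *out = OBJ_VAL(bound);                     // 'this' travels with it
        return true;
    }
    return false;                                  // caller reports the error
}
For inheritance, the middle lookup walks the superclass chain (or hits a merged method table) instead of a single class.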
Learning milestones:
- Classes and instances work → You understand object layout
- Methods can access ‘this’ → You understand method binding
- Constructors work → You understand initialization
- Inheritance works → You understand method lookup
- Super works → Full OOP implementation complete
Project 9: Static Type Checker
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: OCaml or Rust
- Alternative Programming Languages: Haskell, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Type Systems / Static Analysis
- Software or Tool: OCaml/Rust compiler
- Main Book: “Types and Programming Languages” by Benjamin C. Pierce
What you’ll build: A static type checker for a simple typed language, including type annotations, basic type inference, generic types (parametric polymorphism), and helpful error messages.
Why it teaches type systems: Types catch bugs before runtime. Understanding how type checkers work—unification, substitution, generalization—prepares you for advanced languages like Rust, Haskell, or TypeScript.
Core challenges you’ll face:
- Type representation → maps to algebraic data types for types
- Bidirectional type checking → maps to synthesizing vs checking modes
- Type inference → maps to unification algorithm
- Generics → maps to instantiation, generalization
- Error messages → maps to tracking source locations, explaining failures
Key Concepts:
- Type Checking Basics: “Types and Programming Languages” Chapters 8-11 - Benjamin Pierce
- Hindley-Milner: Write You a Haskell - Stephen Diehl
- Bidirectional Type Checking: “Bidirectional Typing” - Dunfield & Krishnaswami
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Project 4 completed, understanding of type theory basics.
Real world outcome:
// Your type checker catches these errors:
let x: int = "hello"; // Error: expected int, got string
fn add(a: int, b: int): int {
return a + b;
}
add(1, "two"); // Error: argument 2: expected int, got string
// Type inference works:
let y = 42; // inferred: int
let z = y + 1; // inferred: int
let s = "hi"; // inferred: string
// Generics work:
fn identity<T>(x: T): T { return x; }
let a = identity(42); // a: int
let b = identity("hi"); // b: string
Implementation Hints:
- Type Representation:
type typ =
  | TInt
  | TBool
  | TString
  | TFun of typ * typ      (* function type: arg -> return *)
  | TVar of int            (* type variable (for inference) *)
  | TForall of int * typ   (* polymorphic type: ∀α. T *)
  | TArray of typ
- Type Environment:
type env = (string * typ) list   (* variable name → type *)

let lookup name env = List.assoc_opt name env
- Bidirectional Type Checking:
- Synthesize mode: Compute the type of an expression
- Check mode: Verify expression has a given type
- Some constructs synthesize (variables, function calls)
- Some check (function bodies against declared return type)
- Unification (for type inference):
(* Make two types equal by finding a substitution *)
let rec unify t1 t2 subst =
  let t1 = apply_subst subst t1 in
  let t2 = apply_subst subst t2 in
  match t1, t2 with
  | TInt, TInt -> subst
  | TVar n, t | t, TVar n ->
      if occurs n t then error "infinite type"
      else extend_subst n t subst
  | TFun (a1, r1), TFun (a2, r2) ->
      let subst = unify a1 a2 subst in
      unify r1 r2 subst
  | _ -> error ("cannot unify " ^ show t1 ^ " with " ^ show t2)
- Generalization and Instantiation:
- Generalize: Turn monotype into polytype (close over free type vars)
- Instantiate: Turn polytype into monotype (create fresh vars)
- Error Messages:
- Track source locations through type checking
- Report expected vs actual type
- Show context (which expression, which argument)
Learning milestones:
- Basic types check correctly → You understand type checking
- Functions type check → You understand function types
- Type errors are caught → Your checker is useful
- Inference works → You understand unification
- Generics work → You understand polymorphism
Project 10: Register-Based Virtual Machine
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Zig
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Virtual Machines / Code Generation
- Software or Tool: Text editor, C compiler
- Main Book: “The Implementation of Lua 5.0” - Roberto Ierusalimschy
What you’ll build: A register-based VM (like Lua or Dalvik) with register allocation, more compact bytecode, and potentially better performance than a stack VM.
Why it teaches VM design trade-offs: Stack VMs are simple but verbose. Register VMs have fewer instructions and can be faster, but require register allocation. Comparing both teaches you VM design trade-offs.
Core challenges you’ll face:
- Register allocation → maps to which virtual registers hold which values
- Instruction encoding → maps to fixed-width vs variable-width
- Function call convention → maps to register windows, parameter passing
- Comparison with stack VM → maps to trade-off analysis
Key Concepts:
- Lua VM Design: “The Implementation of Lua 5.0” - Ierusalimschy et al.
- Register Allocation: “Engineering a Compiler” Chapter 13 - Cooper & Torczon
- VM Design Comparison: Writing Interpreters in Rust
Difficulty: Expert. Time estimate: 2-3 weeks. Prerequisites: Project 5 completed.
Real world outcome:
Source: a = b + c * d
Stack VM (6 instructions):
LOAD b
LOAD c
LOAD d
MUL
ADD
STORE a
Register VM (3 instructions), assuming b, c, d live in R1, R2, R3:
MUL R4, R2, R3   ; R4 = c * d
ADD R4, R1, R4   ; R4 = b + c * d
MOV R0, R4       ; a (in R0) = R4
Implementation Hints:
- Instruction Format (Lua-style, 32 bits):
┌────────┬────────┬────────┬────────┐
│ opcode │   A    │   B    │   C    │
│ 6 bits │ 8 bits │ 9 bits │ 9 bits │
└────────┴────────┴────────┴────────┘
or
┌────────┬────────┬───────────────────┐
│ opcode │   A    │        Bx         │
│ 6 bits │ 8 bits │      18 bits      │
└────────┴────────┴───────────────────┘
- Instruction Types:
OP_MOVE  A B     // R[A] = R[B]
OP_LOADK A Bx    // R[A] = K[Bx] (constant)
OP_ADD   A B C   // R[A] = R[B] + R[C]
OP_JMP   sBx     // pc += sBx (signed jump)
OP_CALL  A B C   // call R[A] with B-1 args, C-1 results
- Register Allocation (simple approach):
- During compilation, track “current register”
- Each expression result goes in next register
- After statement, reset register counter
- Functions have fixed register windows
- Function Calls:
- Callee’s registers start after caller’s
- Arguments: R[A+1], R[A+2], …
- Return values placed starting at R[A]
- Stack frames still needed for return address
- Comparison with Stack VM:
- Register VM: fewer instructions, more explicit operands
- Stack VM: simpler compiler, implicit operand location
- Performance: register VM often faster (fewer dispatches)
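A sketch of decoding and dispatching the 32-bit format above in C; treating Value as double and passing bare register/constant arrays are simplifying assumptions:
#include <stdint.h>
typedef double Value;
enum { OP_MOVE, OP_LOADK, OP_ADD, OP_RETURN };

#define OPCODE(i) ((i) & 0x3f)            // low 6 bits
#define A(i)      (((i) >> 6)  & 0xff)    // next 8 bits
#define B(i)      (((i) >> 14) & 0x1ff)   // next 9 bits
#define C(i)      (((i) >> 23) & 0x1ff)   // top 9 bits
#define Bx(i)     ((i) >> 14)             // B and C fused: 18 bits

void run(const uint32_t *code, Value *R, const Value *K) {
    for (const uint32_t *pc = code;;) {
        uint32_t i = *pc++;                        // fetch
        switch (OPCODE(i)) {                       // decode + execute
            case OP_MOVE:   R[A(i)] = R[B(i)];           break;
            case OP_LOADK:  R[A(i)] = K[Bx(i)];          break;
            case OP_ADD:    R[A(i)] = R[B(i)] + R[C(i)]; break;
            case OP_RETURN: return;
        }
    }
}
One OP_ADD here does the work of a stack VM's two pushes plus an add: fewer dispatches per useful operation.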
Learning milestones:
- Basic arithmetic works → You understand register operations
- Variables work with register allocation → You understand allocation
- Functions work → You understand calling conventions
- Benchmarks show difference → You understand performance
- You can compare stack vs register → Deep VM understanding
Project 11: LLVM Backend (Native Compilation)
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C++
- Alternative Programming Languages: Rust (inkwell crate)
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Compilers / Code Generation
- Software or Tool: LLVM, Clang, CMake
- Main Book: LLVM Kaleidoscope Tutorial
What you’ll build: A compiler backend that generates LLVM IR, allowing your language to compile to native machine code with optimizations—like a “real” compiler.
Why it teaches native compilation: LLVM is the infrastructure behind Rust, Swift, Julia, and many others. Using it teaches you IR design, SSA form, and how high-level constructs become machine instructions.
Core challenges you’ll face:
- LLVM IR generation → maps to understanding SSA form, basic blocks
- Type mapping → maps to translating your types to LLVM types
- Function compilation → maps to parameters, locals, return values
- Control flow → maps to basic blocks, phi nodes, terminators
- Calling C functions → maps to FFI, extern declarations
Key Concepts:
- LLVM IR: LLVM Language Reference
- Kaleidoscope Tutorial: LLVM Tutorial
- SSA Form: “Engineering a Compiler” Chapter 9 - Cooper & Torczon
- LLVM C++ API: Mukul Rathi’s Tutorial
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Projects 1-4 completed, C++ knowledge.
Real world outcome:
Source (your language):
fn fib(n: int): int {
if n <= 1 { return n; }
return fib(n-1) + fib(n-2);
}
print(fib(35));
Generated LLVM IR:
define i64 @fib(i64 %n) {
entry:
%cmp = icmp sle i64 %n, 1
br i1 %cmp, label %then, label %else
then:
ret i64 %n
else:
%n1 = sub i64 %n, 1
%fib1 = call i64 @fib(i64 %n1)
%n2 = sub i64 %n, 2
%fib2 = call i64 @fib(i64 %n2)
%sum = add i64 %fib1, %fib2
ret i64 %sum
}
Compile and run:
$ ./mylang fib.mylang | llc -o fib.s
$ clang fib.s -o fib
$ ./fib
9227465
Implementation Hints:
- Setup LLVM:
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/IRBuilder.h"

LLVMContext context;
Module module("my_module", context);
IRBuilder<> builder(context);
- Create Types:
Type* intType = Type::getInt64Ty(context);
Type* doubleType = Type::getDoubleTy(context);
Type* boolType = Type::getInt1Ty(context);
Type* voidType = Type::getVoidTy(context);
- Create a Function:
FunctionType* funcType = FunctionType::get(intType, {intType}, false);
Function* func = Function::Create(funcType, Function::ExternalLinkage,
                                  "fib", module);
BasicBlock* entry = BasicBlock::Create(context, "entry", func);
builder.SetInsertPoint(entry);
- Generate IR for Expressions:
Value* codegen(BinaryExpr* expr) {
    Value* L = codegen(expr->left);
    Value* R = codegen(expr->right);
    switch (expr->op) {
        case '+': return builder.CreateAdd(L, R, "addtmp");
        case '-': return builder.CreateSub(L, R, "subtmp");
        case '*': return builder.CreateMul(L, R, "multmp");
        case '/': return builder.CreateSDiv(L, R, "divtmp");
    }
}
- Control Flow with Basic Blocks:
// if (cond) { then } else { else }
BasicBlock* thenBB = BasicBlock::Create(context, "then", func);
BasicBlock* elseBB = BasicBlock::Create(context, "else", func);
BasicBlock* mergeBB = BasicBlock::Create(context, "merge", func);

builder.CreateCondBr(condValue, thenBB, elseBB);

builder.SetInsertPoint(thenBB);
Value* thenVal = codegen(thenExpr);
builder.CreateBr(mergeBB);

builder.SetInsertPoint(elseBB);
Value* elseVal = codegen(elseExpr);
builder.CreateBr(mergeBB);

builder.SetInsertPoint(mergeBB);
PHINode* phi = builder.CreatePHI(intType, 2);
phi->addIncoming(thenVal, thenBB);
phi->addIncoming(elseVal, elseBB);
- JIT Compilation (optional):
- Use LLVM’s ORC JIT
- Compile and run immediately without separate linking
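If the C++ API feels heavy, LLVM also ships a stable C API. A minimal sketch that builds and prints an add function (compile with the flags from `llvm-config --cflags --libs core`):
#include <llvm-c/Core.h>
#include <stdio.h>

int main(void) {
    LLVMModuleRef mod = LLVMModuleCreateWithName("demo");
    LLVMTypeRef i64 = LLVMInt64Type();
    LLVMTypeRef params[] = { i64, i64 };
    LLVMTypeRef fnType = LLVMFunctionType(i64, params, 2, 0);
    LLVMValueRef fn = LLVMAddFunction(mod, "add", fnType);

    LLVMBasicBlockRef entry = LLVMAppendBasicBlock(fn, "entry");
    LLVMBuilderRef b = LLVMCreateBuilder();
    LLVMPositionBuilderAtEnd(b, entry);
    LLVMValueRef sum = LLVMBuildAdd(b, LLVMGetParam(fn, 0),
                                       LLVMGetParam(fn, 1), "sum");
    LLVMBuildRet(b, sum);

    char *ir = LLVMPrintModuleToString(mod);   // textual IR, like the fib example
    printf("%s", ir);
    LLVMDisposeMessage(ir);
    LLVMDisposeBuilder(b);
    LLVMDisposeModule(mod);
    return 0;
}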
Learning milestones:
- Simple functions compile to IR → You understand LLVM basics
- Expressions work → You understand IR generation
- Control flow works → You understand basic blocks
- Native binary runs → Full native compilation!
- Optimizations improve code → You understand optimization passes
Project 12: Hindley-Milner Type Inference
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: OCaml or Haskell
- Alternative Programming Languages: Rust, F#
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 5: Master
- Knowledge Area: Type Theory / Type Inference
- Software or Tool: OCaml or GHC
- Main Book: “Types and Programming Languages” by Benjamin C. Pierce (Chapters 22-23)
What you’ll build: A complete Hindley-Milner type inference engine that can deduce types for a program without any type annotations—like ML, Haskell, or Rust.
Why it teaches type inference: HM is the gold standard for type inference. Understanding Algorithm W, unification, and let-polymorphism gives you deep insight into how modern type systems work.
Core challenges you’ll face:
- Fresh type variables → maps to generating unique identifiers
- Unification → maps to making type equations consistent
- Substitution → maps to applying solved types throughout
- Generalization and instantiation → maps to polymorphism
- Let-polymorphism → maps to where generalization happens
Key Concepts:
- Algorithm W: “Types and Programming Languages” Chapter 22 - Benjamin Pierce
- Unification: Write You a Haskell - Stephen Diehl
- Let-Polymorphism: Hindley-Milner Tutorial
Difficulty: Master. Time estimate: 2-3 weeks. Prerequisites: Project 9 completed, understanding of lambda calculus.
Real world outcome:
(* No type annotations needed! *)
let id = fun x -> x
(* Inferred: id : ∀α. α → α *)
let compose = fun f -> fun g -> fun x -> f (g x)
(* Inferred: compose : ∀α β γ. (β → γ) → (α → β) → (α → γ) *)
let rec map = fun f -> fun lst ->
match lst with
| [] -> []
| x :: xs -> f x :: map f xs
(* Inferred: map : ∀α β. (α → β) → [α] → [β] *)
let result = map (fun x -> x + 1) [1; 2; 3]
(* result : [int] *)
Implementation Hints:
- Type Data Structure:
type typ =
  | TVar of tvar ref        (* Type variable *)
  | TInt | TBool            (* Primitive types *)
  | TArrow of typ * typ     (* Function type *)
  | TList of typ            (* List type *)

and tvar =
  | Unbound of int * level  (* Unsolved, with unique id and level *)
  | Link of typ             (* Solved: points to another type *)
- Algorithm W Outline:
let rec infer env expr =
  match expr with
  | Var x -> instantiate (lookup x env)
  | Lam (x, body) ->
      let tv = fresh_tvar () in
      let bodyType = infer (extend env x tv) body in
      TArrow (tv, bodyType)
  | App (func, arg) ->
      let funcType = infer env func in
      let argType = infer env arg in
      let resultType = fresh_tvar () in
      unify funcType (TArrow (argType, resultType));
      resultType
  | Let (x, value, body) ->
      let valueType = infer env value in
      let genType = generalize env valueType in
      infer (extend env x genType) body
- Unification:
let rec unify t1 t2 =
  let t1 = find t1 in  (* Follow links *)
  let t2 = find t2 in
  match t1, t2 with
  | TVar { contents = Unbound (id1, _) },
    TVar { contents = Unbound (id2, _) } when id1 = id2 -> ()
  | TVar ({ contents = Unbound _ } as tv), t
  | t, TVar ({ contents = Unbound _ } as tv) ->
      occurs_check tv t;
      tv := Link t
  | TArrow (a1, r1), TArrow (a2, r2) -> unify a1 a2; unify r1 r2
  | TInt, TInt | TBool, TBool -> ()
  | _ -> error "type mismatch"
- Generalization (at let bindings):
let generalize env typ =
  let freeVars = free_tvars typ in
  let envVars = free_tvars_in_env env in
  let toGeneralize = freeVars - envVars in
  (* Wrap typ in ∀ for each variable in toGeneralize *)
- Instantiation (when using a polymorphic value):
let instantiate scheme =
  (* Replace each bound type variable with a fresh one *)
- Levels for Efficient Generalization:
- Each scope has a “level”
- Type variables remember their level
- Only generalize variables at deeper levels
Learning milestones:
- Simple expressions infer → You understand basic inference
- Functions infer → You understand arrow types
- Polymorphic let works → You understand generalization
- Type errors are caught → Inference is complete
- You can explain Algorithm W → Deep type theory understanding
Project 13: Self-Hosting Compiler
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: Your language!
- Alternative Programming Languages: (must be your language)
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Bootstrapping / Compiler Design
- Software or Tool: Your language implementation
- Main Book: “Compilers: Principles, Techniques, and Tools” by Aho et al.
What you’ll build: A compiler for your language, written in your language—the ultimate test that your language is “real” and powerful enough to build complex software.
Why it teaches language completeness: Self-hosting proves your language is capable of real work. It forces you to use every feature, find missing ones, and truly eat your own dog food.
Core challenges you’ll face:
- Bootstrapping → maps to chicken-and-egg: first compile with external impl
- Language completeness → maps to what features are you missing?
- Standard library → maps to file I/O, string manipulation
- Error handling → maps to robust parsing and analysis
- Performance → maps to is it fast enough to compile itself?
Key Concepts:
- Bootstrapping: “Compilers: Principles, Techniques, and Tools” Chapter 1 - Aho et al.
- Self-Hosting History: Ken Thompson’s “Reflections on Trusting Trust”
- Practical Bootstrapping: Rust’s bootstrapping process documentation
Difficulty: Master. Time estimate: 2-3 months. Prerequisites: All previous projects; your language must be substantial.
Real world outcome:
# Stage 0: Compile your compiler with the C implementation
$ ./mylang-c mylang.mylang -o mylang-stage1
# Stage 1: Compile your compiler with itself
$ ./mylang-stage1 mylang.mylang -o mylang-stage2
# Stage 2: Verify - should produce identical output
$ ./mylang-stage2 mylang.mylang -o mylang-stage3
$ diff mylang-stage2 mylang-stage3
# No differences - your compiler is self-hosting!
Implementation Hints:
- Required Language Features:
- Strings with manipulation (slicing, concatenation)
- Arrays or lists
- Hash tables / dictionaries
- File I/O (read source, write output)
- Sufficient control flow
- Functions (of course)
- Error reporting
- Bootstrapping Process:
- Stage 0: Use existing implementation (C, Python, etc.) to compile
- Stage 1: Compile with Stage 0 output
- Stage 2: Compile with Stage 1 output
- Verify: Stage 2 output should equal Stage 1 output
- (If not equal, you have a bug!)
- Strategy:
- Start simple: get lexer working in your language
- Then parser, then AST printer
- Then code generator
- Test each phase independently
- Common Issues:
- String handling is always more complex than expected
- Error messages are critical (debugging is hard)
- You’ll find language design flaws
- Performance may require optimization
- Trust and Verification:
- The famous “trusting trust” attack
- Cross-compilation can help verify correctness
- Diverse double-compiling
Learning milestones:
- Lexer runs in your language → Language is somewhat useful
- Parser runs → Data structures and recursion work
- Full compiler runs → Language is complete
- Self-compilation works → You’ve bootstrapped!
- Stage 1 = Stage 2 → Compiler is correct
Project 14: Incremental Parser (IDE Support)
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: Rust
- Alternative Programming Languages: C++, TypeScript
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Parsing / IDE Tooling
- Software or Tool: Your language, tree-sitter or custom
- Main Book: “Modern Compiler Implementation” by Andrew Appel
What you’ll build: A parser that can handle incomplete, erroneous code and update the AST incrementally as the user types—the foundation for IDE features.
Why it teaches practical tooling: Real-world language tooling needs to work with broken code and be fast enough for every keystroke. This is a very different challenge than batch compilation.
Core challenges you’ll face:
- Error recovery → maps to continuing after syntax errors
- Incremental parsing → maps to reusing work from previous parse
- Partial ASTs → maps to representing incomplete constructs
- Fast enough for keystrokes → maps to performance optimization
Key Concepts:
- Error Recovery: “Modern Compiler Implementation” Chapter 3 - Andrew Appel
- Incremental Parsing: tree-sitter documentation
- Red-Green Trees: Roslyn compiler architecture
- LSP: Language Server Protocol specification
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Projects 1-4 completed, understanding of parsing.
Real world outcome:
User is typing:
fn foo(x: int) {
let y = x +|
^cursor here
Parser output:
- Function "foo" (complete)
- Parameter "x" of type "int"
- Body (incomplete)
- Let statement (incomplete)
- Name: "y"
- Value: BinaryExpr (incomplete)
- Left: "x"
- Op: "+"
- Right: <missing expression>
IDE can still:
- Offer completions for right-hand side
- Show error for incomplete expression
- Navigate to definition of "x"
Implementation Hints:
- Error Recovery Strategies (see the sketch after these hints):
  - Panic mode: Skip tokens until sync point (;, }, etc.)
  - Phrase level: Insert/delete tokens to make progress
  - Error productions: Grammar rules for common mistakes
- Error Nodes in AST:
enum Expr {
    Binary { left: Box<Expr>, op: Token, right: Box<Expr> },  // boxed: Expr is recursive
    Number(f64),
    Error { message: String, range: Range },  // Placeholder node
}
- Track which parts of source changed
- Invalidate only affected nodes
- Reparse only invalidated regions
- Keep unchanged subtrees
- Tree-Sitter Style:
- Concrete syntax tree (CST) stores all tokens
- Error nodes mark problems but tree is always complete
- Queries to extract semantic info
- Built-in incremental reparsing
- Performance:
- Avoid full reparse on every keystroke
- Debounce typing events
- Parse in background thread
- Cancel stale parses
- Language Server Protocol:
- Standard protocol for IDE features
- Completion, diagnostics, goto definition
- Your incremental parser is the foundation
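A sketch of panic-mode synchronization in C; the token helpers and TOKEN_* names are assumed, and the sync points follow the recovery hint above:
TokenType peek(Parser *p);       // assumed: next token without consuming
TokenType previous(Parser *p);   // assumed: most recently consumed token
void advance(Parser *p);         // assumed: consume one token

// After a syntax error, skip tokens until a likely statement boundary so
// the parser keeps producing nodes (and diagnostics) for the rest of the file.
static void synchronize(Parser *p) {
    p->panicMode = false;
    while (peek(p) != TOKEN_EOF) {
        if (previous(p) == TOKEN_SEMICOLON) return;  // just passed a ';'
        switch (peek(p)) {
            case TOKEN_FN: case TOKEN_VAR: case TOKEN_IF:
            case TOKEN_WHILE: case TOKEN_RETURN:
                return;                              // next statement starts here
            default:
                advance(p);                          // discard the broken token
        }
    }
}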
Learning milestones:
- Parser continues after errors → You understand recovery
- AST represents incomplete code → You understand error nodes
- Fast enough for typing → You understand performance needs
- Incremental updates work → You understand incremental parsing
- LSP integration works → Full IDE support
Project Comparison Table
| Project | Difficulty | Time | Depth | Fun Factor |
|---|---|---|---|---|
| 1. Calculator | Beginner | Weekend | ⭐⭐ | ⭐⭐ |
| 2. JSON Parser | Beginner | Weekend | ⭐⭐ | ⭐⭐ |
| 3. Regex Engine | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 4. Tree-Walking Interpreter | Intermediate | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 5. Bytecode VM | Advanced | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 6. Garbage Collector | Advanced | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 7. Closures/Upvalues | Advanced | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 8. Classes/Inheritance | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 9. Static Type Checker | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| 10. Register VM | Expert | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 11. LLVM Backend | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 12. Type Inference | Master | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| 13. Self-Hosting | Master | 2-3 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 14. Incremental Parser | Expert | 3-4 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
Recommended Learning Path
Phase 1: Foundations (Weeks 1-4)
- Project 1: Calculator Language - Learn lexing and parsing
- Project 2: JSON Parser - Master recursive descent
- Project 4: Tree-Walking Interpreter - Build a complete language
Phase 2: Performance (Weeks 5-10)
- Project 5: Bytecode VM - Understand compilation
- Project 6: Garbage Collector - Master memory management
- Project 7: Closures - Understand advanced features
Phase 3: Advanced Features (Weeks 11-16)
- Project 8: Classes/Inheritance - Add OOP
- Project 9: Static Type Checker - Understand types
- Project 3: Regex Engine - Deep automata theory
Phase 4: Production Quality (Weeks 17-24)
- Project 11: LLVM Backend - Native compilation
- Project 12: Type Inference - Advanced type systems
- Project 13: Self-Hosting - Ultimate validation
Final Capstone Project: Production-Quality Language
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C/C++ with LLVM, or Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Complete Language Implementation
What you’ll build: A complete, production-quality programming language combining all previous projects: bytecode VM or native compilation, garbage collection, static types with inference, classes, closures, and IDE tooling.
Features:
- Static typing with type inference
- First-class functions and closures
- Classes with single inheritance
- Garbage collection
- LLVM native code generation
- Standard library (I/O, strings, collections)
- Package manager integration
- Language server for IDE support
- Documentation generator
- REPL for interactive use
This capstone integrates everything you’ve learned into a language people could actually use.
Summary
| # | Project | Main Language |
|---|---|---|
| 1 | Calculator Language | C |
| 2 | JSON Parser | C |
| 3 | Regular Expression Engine | C |
| 4 | Tree-Walking Interpreter | C |
| 5 | Bytecode Compiler + Stack VM | C |
| 6 | Mark-Sweep Garbage Collector | C |
| 7 | Closures and Upvalues | C |
| 8 | Classes and Inheritance | C |
| 9 | Static Type Checker | OCaml/Rust |
| 10 | Register-Based Virtual Machine | C |
| 11 | LLVM Backend | C++ |
| 12 | Hindley-Milner Type Inference | OCaml/Haskell |
| 13 | Self-Hosting Compiler | Your Language |
| 14 | Incremental Parser | Rust |
| Final | Production-Quality Language | C++/Rust |
Essential Resources
Books
- “Crafting Interpreters” by Robert Nystrom - FREE online
- “Engineering a Compiler” by Cooper & Torczon - Modern compiler textbook
- “Types and Programming Languages” by Benjamin Pierce - Type theory bible
- “Compilers: Principles, Techniques, and Tools” (Dragon Book) - Classic reference
- “Writing a C Compiler” by Nora Sandler - Practical C compiler
Online Resources
- LLVM Tutorial - Build with LLVM
- Write You a Haskell - Type system implementations
- Create Your Own Programming Language with Rust
- Writing Interpreters in Rust
- Pratt Parsing
- Baby’s First Garbage Collector
- Regular Expression Matching Can Be Simple And Fast - Russ Cox
Tools
- LLVM/Clang - Compiler infrastructure
- tree-sitter - Incremental parsing
- Valgrind - Memory debugging
- GDB/LLDB - Debugging
Master these projects and you’ll understand how every programming language works—from source code to machine execution.