Learn Programming Language Construction: From Tokens to Machine Code
Goal: Deeply understand how programming languages work—from lexing and parsing through interpretation and compilation, to virtual machines, garbage collection, and type systems.
Why Building a Language Matters
Every time you write code, you’re using a tool that someone designed and built. That “someone” made choices about syntax, semantics, memory management, and execution strategy. By building your own language, you’ll understand:
- Why languages have the syntax they do
- How your code becomes machine instructions
- Why some operations are fast and others slow
- What “garbage collection” actually does
- How type systems catch bugs before runtime
- Why language design involves deep trade-offs
After completing these projects, you will be able to:
- Design and implement a complete programming language
- Build interpreters (tree-walking and bytecode-based)
- Build compilers that generate native code via LLVM
- Implement garbage collectors and memory managers
- Design type systems with inference
- Understand any language’s implementation at a deep level
Core Concept Analysis
The Language Implementation Pipeline
Source Code
│
▼
┌─────────────────┐
│ Lexer │ "Tokenization"
│ (Scanner) │ Source → Tokens
└────────┬────────┘
│
▼
┌─────────────────┐
│ Parser │ "Syntax Analysis"
│ │ Tokens → AST
└────────┬────────┘
│
▼
┌─────────────────┐
│ Semantic │ "Type Checking, Name Resolution"
│ Analyzer │ AST → Annotated AST
└────────┬────────┘
│
┌────┴────┐
▼ ▼
┌───────┐ ┌───────────┐
│ Inter │ │ Compiler │
│ preter│ │ │
└───────┘ └─────┬─────┘
│
┌─────┴─────┐
▼ ▼
┌────────┐ ┌─────────┐
│Bytecode│ │ Native │
│ VM │ │ Code │
└────────┘ └─────────┘
Fundamental Concepts
1. Lexical Analysis (Tokenization)
The lexer breaks source code into tokens—the “words” of your language:
let x = 42 + y;
Tokens:
LET "let"
IDENTIFIER "x"
EQUALS "="
NUMBER "42"
PLUS "+"
IDENTIFIER "y"
SEMICOLON ";"
Key concepts:
- Finite automata: States and transitions for recognizing patterns
- Regular expressions: Formal way to specify token patterns
- Lookahead: Sometimes you need to peek at next characters
- Position tracking: For error messages (line, column)
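A hand-written scanner is just a loop over characters that groups them into tokens. Below is a minimal sketch in C; the Lexer/Token layout and TOKEN_* names are illustrative assumptions, not from any particular implementation:
typedef enum { TOKEN_NUMBER, TOKEN_IDENT, TOKEN_PLUS, TOKEN_EOF } TokenType;
typedef struct { TokenType type; const char *start; int length; int line; } Token;
typedef struct { const char *src; int pos; int line; } Lexer;
static char peek(Lexer *lx)    { return lx->src[lx->pos]; }
static char advance(Lexer *lx) { return lx->src[lx->pos++]; }
Token scan_token(Lexer *lx) {
    while (peek(lx) == ' ' || peek(lx) == '\n')      // skip whitespace,
        if (advance(lx) == '\n') lx->line++;         // tracking line numbers
    Token t = { TOKEN_EOF, lx->src + lx->pos, 0, lx->line };
    char c = peek(lx);
    if (c == '\0') return t;                         // end of input
    if (c >= '0' && c <= '9') {                      // NUMBER: [0-9]+
        while (peek(lx) >= '0' && peek(lx) <= '9') advance(lx);
        t.type = TOKEN_NUMBER;
    } else if (c == '+') {
        advance(lx); t.type = TOKEN_PLUS;
    } else {                                         // identifiers and the other
        advance(lx); t.type = TOKEN_IDENT;           // operators elided here
    }
    t.length = (int)((lx->src + lx->pos) - t.start);
    return t;
}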
2. Parsing (Syntax Analysis)
The parser builds a tree structure (AST) from tokens:
let x = 42 + y;
AST:
LetStatement
├── name: "x"
└── value: BinaryExpr
├── op: "+"
├── left: NumberLiteral(42)
└── right: Identifier("y")
Key concepts:
- Grammar: Formal rules for valid syntax (BNF, EBNF)
- Recursive descent: Hand-written parser following grammar
- Operator precedence: 2 + 3 * 4 = 2 + (3 * 4), not (2 + 3) * 4
- Parser generators: Tools that generate parsers from grammars
3. Semantic Analysis
After parsing, we need to check meaning:
- Name resolution: What does x refer to?
- Type checking: Can you add a string and a number?
- Scope analysis: Is this variable accessible here?
fn foo() {
let x = 1;
{
let y = x + 2; // x is visible here
}
print(y); // ERROR: y not in scope
}
4. Interpretation vs Compilation
Tree-Walking Interpreter: Directly executes the AST
eval(BinaryExpr(+, 1, 2))
→ eval(1) + eval(2)
→ 1 + 2
→ 3
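In C, that dispatch might look like the sketch below (the Node layout is an assumption for illustration):
typedef enum { NODE_NUM, NODE_BINARY } NodeKind;
typedef struct Node {
    NodeKind kind;
    double value;                  // NODE_NUM
    char op;                       // NODE_BINARY: '+', '-', '*', '/'
    struct Node *left, *right;
} Node;
double eval(Node *n) {
    if (n->kind == NODE_NUM) return n->value;        // leaf: literal value
    double l = eval(n->left), r = eval(n->right);    // evaluate children first
    switch (n->op) {
        case '+': return l + r;
        case '-': return l - r;
        case '*': return l * r;
        default:  return l / r;
    }
}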
Bytecode Compiler + VM: Compiles to intermediate representation
Source: 1 + 2
Bytecode: PUSH 1
PUSH 2
ADD
VM Stack: [] → [1] → [1,2] → [3]
Native Compiler: Generates machine code
Source: 1 + 2
Assembly: mov eax, 1
add eax, 2
5. Virtual Machines
Stack-based (JVM, Python, CLR):
PUSH 5 ; stack: [5]
PUSH 3 ; stack: [5, 3]
ADD ; stack: [8]
Register-based (Lua, Dalvik):
LOAD R1, 5 ; R1 = 5
LOAD R2, 3 ; R2 = 3
ADD R0, R1, R2 ; R0 = R1 + R2 = 8
6. Memory Management
Manual (C, C++): Programmer allocates/frees
Reference counting: Track references, free when zero
Garbage collection: Automatically find and free unreachable objects
- Mark-sweep: Mark reachable, sweep unmarked
- Copying: Copy live objects, abandon old space
- Generational: Focus on young objects (they die fast)
7. Type Systems
No types (Assembly): Anything goes
Dynamic typing (Python, JS): Types checked at runtime
Static typing (C, Java): Types checked at compile time
Type inference (Haskell, Rust): Compiler deduces types
-- Hindley-Milner infers: map :: (a -> b) -> [a] -> [b]
map f [] = []
map f (x:xs) = f x : map f xs
Project List
Projects are ordered from fundamental to advanced, building on each other.
Project 1: Calculator Language (Lexer + Parser Basics)
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Python
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Lexing / Parsing / Expression Evaluation
- Software or Tool: Text editor, C compiler
- Main Book: “Crafting Interpreters” by Robert Nystrom (Chapters 4-6)
What you’ll build: A calculator that parses and evaluates mathematical expressions with proper operator precedence, parentheses, and variables—like a tiny spreadsheet formula engine.
Why it teaches language basics: This strips away all complexity to focus on the core: taking text, breaking it into tokens, building a tree, and evaluating it. Every language starts here.
Core challenges you’ll face:
- Tokenizing numbers, operators, identifiers → maps to finite state machines, character classification
- Parsing with correct precedence → maps to Pratt parsing or recursive descent with precedence climbing
- Handling parentheses → maps to recursive parsing
- Evaluating the AST → maps to tree-walking interpretation
Key Concepts:
- Lexical Analysis: “Crafting Interpreters” Chapter 4 - Robert Nystrom
- Operator Precedence: Pratt Parsing - Bob Nystrom
- Expression Evaluation: “Crafting Interpreters” Chapter 7 - Robert Nystrom
- Recursive Descent: “Engineering a Compiler” Chapter 3 - Cooper & Torczon
Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Basic C programming, understanding of recursion.
Real world outcome:
calc> 2 + 3 * 4
14
calc> (2 + 3) * 4
20
calc> x = 10
10
calc> x * 2 + 5
25
calc> sin(3.14159 / 2)
1.0
calc> sqrt(x)
3.162...
Implementation Hints:
Start with the lexer (tokenizer):
- Token Types:
- NUMBER (integer or float)
- IDENTIFIER (variable names, function names)
- PLUS, MINUS, STAR, SLASH, CARET (operators)
- LPAREN, RPAREN
- EQUALS (assignment)
- EOF (end of input)
- Lexer Structure:
- Keep track of current position in source
  - next_char(): advance and return character
  - peek(): look at next character without consuming
  - scan_token(): produce the next token
  - Skip whitespace between tokens
- Parser with Precedence (Pratt parsing approach):
  - Each operator has a “binding power” (precedence)
  - + and -: binding power 10
  - * and /: binding power 20
  - ^ (power): binding power 30, right-associative
  - Unary -: higher than binary operators
  - Parentheses: parse inner expression at binding power 0
- Recursive Descent Alternative:
  expression → term (('+' | '-') term)*
  term       → factor (('*' | '/') factor)*
  factor     → primary ('^' factor)?
  primary    → NUMBER | IDENTIFIER | '(' expression ')' | '-' primary
- Evaluation:
- For each AST node type, define how to evaluate
- Binary: evaluate left, evaluate right, apply operator
- Variable: look up in environment (hash table)
- Assignment: evaluate right side, store in environment
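A sketch of the precedence-climbing (Pratt) core in C, parsing and evaluating in one pass. The token names match the list above; peek_token, next_token, parse_primary, and apply are assumed helpers:
typedef enum { PLUS, MINUS, STAR, SLASH, CARET, END } TokenType;
// Assumed helpers (not shown): token access and primary parsing.
TokenType peek_token(void);
void next_token(void);
double parse_primary(void);   // NUMBER, IDENTIFIER, '(' expr ')', unary '-'
double apply(TokenType op, double left, double right);
int binding_power(TokenType op) {
    switch (op) {
        case PLUS: case MINUS: return 10;
        case STAR: case SLASH: return 20;
        case CARET:            return 30;
        default:               return 0;   // not a binary operator
    }
}
// Parse an expression whose operators bind tighter than min_bp.
double parse_expr(int min_bp) {
    double left = parse_primary();
    for (;;) {
        TokenType op = peek_token();
        int bp = binding_power(op);
        if (bp <= min_bp) break;          // caller owns this operator
        next_token();                     // consume the operator
        // Right-associative ^ recurses at bp - 1 so a^b^c = a^(b^c).
        double right = parse_expr(op == CARET ? bp - 1 : bp);
        left = apply(op, left, right);
    }
    return left;
}
Calling parse_expr(0) parses a full expression; the same loop handles every precedence level.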
Ask yourself: Why does * have higher precedence than +? How would you add the modulo (%) operator?
Learning milestones:
- Lexer correctly tokenizes expressions → You understand scanning
- Parser handles precedence correctly → You understand parsing
- Variables work → You understand symbol tables
- Functions like sin/cos work → You understand function calls
- Error messages show line/column → You understand error handling
Project 2: JSON Parser (Recursive Descent Mastery)
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Zig
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 1: Beginner
- Knowledge Area: Parsing / Data Structures
- Software or Tool: Text editor, C compiler
- Main Book: “Crafting Interpreters” by Robert Nystrom (Chapter 6)
What you’ll build: A complete JSON parser that handles all JSON types (objects, arrays, strings, numbers, booleans, null), including proper Unicode escape handling and error reporting.
Why it teaches recursive descent: JSON has a clean, recursive grammar that’s perfect for learning recursive descent parsing. Objects contain values, values can be objects—this recursion is the heart of parsing.
Core challenges you’ll face:
- Handling nested structures → maps to recursive parsing
- String escape sequences → maps to lexer complexity, Unicode
- Number formats → maps to handling edge cases (-0, exponents)
- Memory management → maps to building data structures while parsing
Key Concepts:
- JSON Grammar: RFC 8259
- Recursive Descent: “Crafting Interpreters” Chapter 6 - Robert Nystrom
- Unicode Handling: “The Absolute Minimum Every Developer Must Know About Unicode” - Joel Spolsky
Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Project 1 completed, understanding of recursion and data structures.
Real world outcome:
// Parse JSON and access values
JsonValue* root = json_parse(source);
if (root->type == JSON_OBJECT) {
JsonValue* name = json_object_get(root, "name");
printf("Name: %s\n", name->string_value);
JsonValue* scores = json_object_get(root, "scores");
for (int i = 0; i < scores->array_length; i++) {
printf("Score %d: %f\n", i, scores->array[i]->number_value);
}
}
json_free(root);
Implementation Hints:
JSON grammar is simple and clean:
value → object | array | string | number | "true" | "false" | "null"
object → '{' (pair (',' pair)*)? '}'
pair → string ':' value
array → '[' (value (',' value)*)? ']'
- Data Structure:
typedef enum {
    JSON_NULL, JSON_BOOL, JSON_NUMBER,
    JSON_STRING, JSON_ARRAY, JSON_OBJECT
} JsonType;

typedef struct JsonValue {
    JsonType type;
    union {
        bool bool_value;
        double number_value;
        char* string_value;
        struct { struct JsonValue** items; size_t count; } array;
        struct { char** keys; struct JsonValue** values; size_t count; } object;
    };
} JsonValue;
- Parsing Functions (one per grammar rule):
JsonValue* parse_value(Parser* p);    // Entry point
JsonValue* parse_object(Parser* p);   // Parse {...}
JsonValue* parse_array(Parser* p);    // Parse [...]
JsonValue* parse_string(Parser* p);   // Parse "..."
JsonValue* parse_number(Parser* p);   // Parse 123.45e-6
- String Escapes:
  - \" → "
  - \\ → \
  - \/ → /
  - \n, \r, \t, \b, \f → control characters
  - \uXXXX → Unicode code point (tricky!)
- Error Handling:
- Track line and column in lexer
- On error, report position and expected token
- Consider recovery (skip to next } or ])
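A sketch of the array rule in C; expect, match, peek, json_new, and json_array_append are assumed helpers on top of the structs above:
// Assumed helpers over the Parser/JsonValue definitions above:
int  peek(Parser *p);                  // next token character without consuming
int  match(Parser *p, char c);         // consume if next token is c
void expect(Parser *p, char c);        // consume c or report an error
JsonValue *json_new(JsonType type);
void json_array_append(JsonValue *arr, JsonValue *item);

// array → '[' (value (',' value)*)? ']'
JsonValue *parse_array(Parser *p) {
    expect(p, '[');                             // consume opening bracket
    JsonValue *arr = json_new(JSON_ARRAY);
    if (peek(p) != ']') {
        do {
            json_array_append(arr, parse_value(p));  // values may nest: recursion
        } while (match(p, ','));                // a comma promises another value
    }
    expect(p, ']');                             // closing bracket or error
    return arr;
}
parse_object has the same shape, except each element is a string key, a ':', then a value.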
Think about: How would you handle very deeply nested JSON (stack overflow)? How would you stream-parse a huge JSON file?
Learning milestones:
- Simple values parse (null, true, numbers) → You understand token matching
- Arrays work → You understand recursive descent
- Objects work → You understand key-value parsing
- Escape sequences work → You understand string handling
- Good error messages → You understand error recovery
Project 3: Regular Expression Engine
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Finite Automata / Pattern Matching
- Software or Tool: Text editor, C compiler
- Main Book: “Engineering a Compiler” by Cooper & Torczon (Chapter 2)
What you’ll build: A regular expression engine that compiles regex patterns to NFAs (or directly matches via backtracking), supporting common operators: concatenation, alternation (|), repetition (*, +, ?), and character classes.
Why it teaches lexer foundations: Lexers are built on regular expressions. Understanding how regex engines work—either through finite automata or backtracking—gives you insight into what makes tokenization efficient or pathological.
Core challenges you’ll face:
- Parsing regex syntax → maps to recursive descent on regex grammar
- Building NFA from regex → maps to Thompson’s construction
- NFA simulation → maps to subset construction, epsilon closure
- Handling pathological cases → maps to understanding catastrophic backtracking
Key Concepts:
- Thompson’s Construction: “Engineering a Compiler” Chapter 2 - Cooper & Torczon
- NFA to DFA: “Compilers: Principles, Techniques, and Tools” Chapter 3 - Aho et al.
- Regular Expression Matching Can Be Simple And Fast: Russ Cox’s Article
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Understanding of graphs, finite automata basics.
Real world outcome:
Regex* re = regex_compile("a(b|c)*d");
printf("%d\n", regex_match(re, "ad")); // 1 (true)
printf("%d\n", regex_match(re, "abcd")); // 1 (true)
printf("%d\n", regex_match(re, "abbbcd")); // 1 (true)
printf("%d\n", regex_match(re, "aed")); // 0 (false)
// With capturing groups
Match m = regex_search(regex_compile("(\\d+)-(\\d+)"), "range: 10-20");
printf("Start: %s, End: %s\n", m.groups[1], m.groups[2]);
// Output: Start: 10, End: 20
Implementation Hints:
Two main approaches:
Approach 1: NFA Simulation (Thompson’s Algorithm)
- Parse regex into AST:
typedef enum { CHAR, CONCAT, ALT, STAR, PLUS, QUEST } RegexType;

typedef struct RegexNode {
    RegexType type;
    char c;                           // for CHAR
    struct RegexNode *left, *right;   // for binary ops
    struct RegexNode *child;          // for unary ops
} RegexNode;
- Convert AST to NFA (Thompson’s construction):
- Each regex constructs an NFA fragment with one start and one end state
- CHAR: Single transition on that character
- CONCAT: Connect end of first to start of second
- ALT: New start with ε-transitions to both; both ends ε-transition to new end
- STAR: ε from start to end (skip); ε from end back to start (loop)
- Simulate NFA:
- Maintain set of current states (can be in multiple states due to ε)
- For each input character, compute next state set
- Match succeeds if final state set contains accept state
Approach 2: Backtracking
- Directly interpret regex AST:
- At alternation: try first option, backtrack if fails
- At repetition: try matching, save position for backtracking
- Simpler to implement but exponential worst-case
Character Classes:
- [abc] → matches a, b, or c
- [a-z] → matches lowercase letters
- [^abc] → matches anything except a, b, or c
- \d, \w, \s → digit, word char, whitespace
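A sketch of the state-set simulation in C. The State/Nfa layout is deliberately simplified, and add_state is assumed to also follow ε-transitions (the ε-closure):
#include <stdbool.h>

typedef struct State State;
struct State { int c; State *out; };   // simplified: one labeled edge per state
typedef struct { State *states[256]; int count; } StateSet;
typedef struct { State *start; State *accept; } Nfa;

void add_state(StateSet *set, State *s);   // assumed: adds s plus its ε-closure

bool nfa_match(Nfa *nfa, const char *input) {
    StateSet cur = {0}, next = {0};
    add_state(&cur, nfa->start);                // start set = ε-closure(start)
    for (const char *ch = input; *ch; ch++) {
        next.count = 0;
        for (int i = 0; i < cur.count; i++)     // advance every live state
            if (cur.states[i]->c == *ch)
                add_state(&next, cur.states[i]->out);
        StateSet tmp = cur; cur = next; next = tmp;   // swap the two sets
    }
    for (int i = 0; i < cur.count; i++)
        if (cur.states[i] == nfa->accept) return true;
    return false;
}
Because the set can never hold more than the total number of NFA states, each input character costs bounded work; there is no backtracking and no blowup.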
Think about: Why does (a+)+ take exponential time on “aaaaaaaaaaaaaaab” with backtracking but linear time with NFA simulation?
Learning milestones:
- Simple patterns match (abc, a|b) → You understand basic NFA
- Repetition works (a*, a+) → You understand loops in NFA
- Character classes work → You understand efficient matching
- No exponential blowup → You understand NFA simulation
- Capture groups work → You understand tracking positions
Project 4: Tree-Walking Interpreter (Full Language)
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, Java
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Interpreters / Language Design
- Software or Tool: Text editor, C compiler
- Main Book: “Crafting Interpreters” by Robert Nystrom (Part II)
What you’ll build: A complete tree-walking interpreter for a dynamically-typed language with variables, functions, closures, control flow, and basic data structures—similar to Lox from Crafting Interpreters.
Why it teaches interpretation: Tree-walking is the most straightforward way to execute code. You’ll implement every language feature directly, seeing exactly how variables, scope, functions, and closures work.
Core challenges you’ll face:
- Variable scoping → maps to environments, scope chains
- Function calls → maps to activation records, call stacks
- Closures → maps to capturing environment references
- Control flow → maps to early returns, break/continue
Key Concepts:
- Environments and Scope: “Crafting Interpreters” Chapter 8 - Robert Nystrom
- Functions and Closures: “Crafting Interpreters” Chapters 10-11 - Robert Nystrom
- Control Flow: “Crafting Interpreters” Chapter 9 - Robert Nystrom
Difficulty: Intermediate. Time estimate: 2-3 weeks. Prerequisites: Projects 1-2 completed, understanding of scope and functions.
Real world outcome:
// Your language can run this:
fn fib(n) {
if (n <= 1) return n;
return fib(n - 1) + fib(n - 2);
}
print(fib(20)); // 6765
// Closures work:
fn makeCounter() {
var count = 0;
fn increment() {
count = count + 1;
return count;
}
return increment;
}
var counter = makeCounter();
print(counter()); // 1
print(counter()); // 2
print(counter()); // 3
Implementation Hints:
Design your language’s features:
- Data Types:
- Nil, Boolean, Number (double), String
- Later: Arrays, Objects/Hashes
- Expressions:
- Literals, variables, binary ops, unary ops
- Grouping (parentheses)
- Function calls
- Property access (for objects)
- Statements:
  - Expression statements (foo();)
  - Print statement (print x;)
  - Variable declaration (var x = 1;)
  - Block ({ ... })
  - If/else, while, for
  - Function declaration
  - Return
- Environment (Symbol Table):
typedef struct Environment {
    struct Environment* enclosing;   // Parent scope
    HashTable values;                // Variable bindings
} Environment;

Value env_get(Environment* env, char* name) {
    if (hash_table_has(&env->values, name)) {
        return hash_table_get(&env->values, name);
    }
    if (env->enclosing != NULL) {
        return env_get(env->enclosing, name);
    }
    error("Undefined variable: %s", name);
}
- Function Values:
typedef struct {
    ASTNode* declaration;    // The function's AST
    Environment* closure;    // Environment where function was defined
} Function;
- Executing a Call:
- Create new environment with function’s closure as parent
- Bind parameters to argument values
- Execute function body in new environment
- Handle return (might need exception-like mechanism)
- Closures:
- When creating a function, capture current environment
- When calling, use captured environment as parent
- This makes inner functions “remember” outer variables
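Putting those pieces together, the call path might look like this C sketch; env_new, env_define, and eval_block are assumed helper names around the structs above, and the params field is an assumed layout:
// Assumed helpers (hypothetical names):
Environment *env_new(Environment *parent);
void env_define(Environment *env, const char *name, Value value);
Value eval_block(ASTNode *body, Environment *env);   // must propagate 'return'

Value call_function(Function *fn, Value *args, int arg_count) {
    // New scope whose parent is the environment captured at definition time,
    // not the caller's environment. This is what makes closures work.
    Environment *local = env_new(fn->closure);
    for (int i = 0; i < arg_count; i++)   // bind parameters to argument values
        env_define(local, fn->declaration->params[i], args[i]);
    return eval_block(fn->declaration->body, local);
}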
Learning milestones:
- Variables and arithmetic work → You understand environments
- Control flow works → You understand interpretation
- Functions work → You understand call semantics
- Closures work → You understand lexical scoping
- You can write recursive programs → Full interpreter working
Project 5: Bytecode Compiler + Stack VM
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Zig
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Compilation / Virtual Machines
- Software or Tool: Text editor, C compiler
- Main Book: “Crafting Interpreters” by Robert Nystrom (Part III)
What you’ll build: A bytecode compiler that translates your language to a custom instruction set, and a stack-based virtual machine to execute it—10-100x faster than tree-walking.
Why it teaches compilation: This is where the magic happens. You’ll transform high-level constructs into low-level operations, manage a virtual stack, and understand exactly how “real” VMs like the JVM work.
Core challenges you’ll face:
- Designing an instruction set → maps to what operations does the VM support
- Compiling expressions to stack ops → maps to postfix/RPN translation
- Compiling control flow → maps to jumps, patching addresses
- Compiling functions → maps to call frames, return addresses
- VM execution loop → maps to fetch-decode-execute cycle
Key Concepts:
- Bytecode Design: “Crafting Interpreters” Chapter 14 - Robert Nystrom
- Stack-Based VM: “Crafting Interpreters” Chapter 15 - Robert Nystrom
- Compiling Expressions: “Crafting Interpreters” Chapter 17 - Robert Nystrom
- Jumps and Control Flow: “Crafting Interpreters” Chapter 23 - Robert Nystrom
Difficulty: Advanced. Time estimate: 3-4 weeks. Prerequisites: Project 4 completed.
Real world outcome:
Source: 1 + 2 * 3
Bytecode:
0000 OP_CONSTANT 1
0002 OP_CONSTANT 2
0004 OP_CONSTANT 3
0006 OP_MULTIPLY
0007 OP_ADD
0008 OP_RETURN
Execution:
[]
[1] after OP_CONSTANT 1
[1, 2] after OP_CONSTANT 2
[1, 2, 3] after OP_CONSTANT 3
[1, 6] after OP_MULTIPLY
[7] after OP_ADD
Implementation Hints:
- Instruction Set Design:
typedef enum {
    OP_CONSTANT,        // Push constant onto stack
    OP_NIL, OP_TRUE, OP_FALSE,
    OP_ADD, OP_SUBTRACT, OP_MULTIPLY, OP_DIVIDE,
    OP_NEGATE, OP_NOT,
    OP_EQUAL, OP_GREATER, OP_LESS,
    OP_POP,             // Discard top of stack
    OP_GET_LOCAL,       // Push local variable
    OP_SET_LOCAL,       // Store into local variable
    OP_GET_GLOBAL,      // Push global variable
    OP_SET_GLOBAL,      // Store into global
    OP_JUMP,            // Unconditional jump
    OP_JUMP_IF_FALSE,   // Conditional jump
    OP_LOOP,            // Jump backward
    OP_CALL,            // Call function
    OP_RETURN,          // Return from function
} OpCode;
- Chunk (Bytecode Container):
typedef struct {
    uint8_t* code;          // Bytecode array
    int* lines;             // Source line for each instruction
    ValueArray constants;   // Constant pool
    int count, capacity;
} Chunk;
- Compiling Expressions:
- Numbers: emit OP_CONSTANT with value in constant pool
- Binary ops: compile left, compile right, emit operator
- Unary: compile operand, emit operator
- Variables: emit OP_GET_LOCAL/GLOBAL with index
- Compiling Control Flow:
  - if: compile condition, emit OP_JUMP_IF_FALSE with placeholder, compile then-branch, patch jump, optionally compile else
  - while: mark loop start, compile condition, emit OP_JUMP_IF_FALSE, compile body, emit OP_LOOP back to start, patch exit jump
  - Need to “patch” jump offsets after knowing where to jump (see the patching sketch just after these hints)
- VM Structure:
typedef struct {
    Chunk* chunk;
    uint8_t* ip;                    // Instruction pointer
    Value stack[STACK_MAX];
    Value* stackTop;
    CallFrame frames[FRAMES_MAX];
    int frameCount;
} VM;
- Main Execution Loop:
for (;;) {
    uint8_t instruction = *vm.ip++;
    switch (instruction) {
        case OP_CONSTANT: push(READ_CONSTANT()); break;
        case OP_ADD: {
            Value b = pop();
            Value a = pop();
            push(a + b);
        } break;
        case OP_RETURN: return;
        // ... etc
    }
}
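Jump patching from the control-flow hint, as a C sketch over the Chunk above (emit_byte is an assumed helper that appends one byte):
void emit_byte(Chunk *chunk, uint8_t byte);   // assumed helper

// Emit a jump with a placeholder offset; return where the placeholder lives.
static int emit_jump(Chunk *chunk, uint8_t op) {
    emit_byte(chunk, op);
    emit_byte(chunk, 0xff);                 // two placeholder bytes for a
    emit_byte(chunk, 0xff);                 // 16-bit forward offset
    return chunk->count - 2;
}

// Once the target is known, back-fill how many bytes to skip.
static void patch_jump(Chunk *chunk, int offset) {
    int jump = chunk->count - offset - 2;   // distance past the operand
    chunk->code[offset]     = (jump >> 8) & 0xff;
    chunk->code[offset + 1] = jump & 0xff;
}
An if then compiles as: condition, emit_jump(OP_JUMP_IF_FALSE), then-branch, patch_jump.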
Learning milestones:
- Expressions compile and run → You understand stack VMs
- Variables work → You understand locals/globals
- Control flow works → You understand jumps
- Functions work → You understand call frames
- Performance is much faster → You understand why bytecode matters
Project 6: Mark-Sweep Garbage Collector
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Zig
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Memory Management / Garbage Collection
- Software or Tool: Text editor, C compiler, Valgrind
- Main Book: “Crafting Interpreters” by Robert Nystrom (Chapter 26)
What you’ll build: An automatic memory manager using mark-sweep garbage collection—objects are automatically freed when no longer reachable from the program.
Why it teaches memory management: Every high-level language needs memory management. Understanding mark-sweep—the simplest tracing GC—teaches you how the runtime finds and frees garbage, and why GC pauses happen.
Core challenges you’ll face:
- Tracking all allocations → maps to object lists, allocation wrappers
- Finding roots → maps to understanding what keeps objects alive
- Marking reachable objects → maps to graph traversal
- Sweeping dead objects → maps to freeing memory, list management
- Triggering GC at right times → maps to GC scheduling
Key Concepts:
- Mark-Sweep Basics: “Crafting Interpreters” Chapter 26 - Robert Nystrom
- Baby’s First Garbage Collector: journal.stuffwithstuff.com - Bob Nystrom
- Garbage Collection Handbook: “The Garbage Collection Handbook” by Jones, Hosking, Moss
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Project 5 completed.
Real world outcome:
// In your language runtime:
for (int i = 0; i < 1000000; i++) {
// These temporary strings are automatically collected
var s = "hello" + i;
}
// Memory usage stays bounded, not 1 million strings!
// Circular references handled correctly:
var a = {};
var b = {};
a.ref = b;
b.ref = a;
// When a and b go out of scope, both are freed
// (reference counting would leak!)
Implementation Hints:
- Object Header:
typedef struct Object {
    struct Object* next;   // Linked list of all objects
    bool marked;           // GC mark bit
    ObjType type;          // What kind of object
} Object;

typedef struct {
    Object obj;
    double value;
} ObjNumber;

typedef struct {
    Object obj;
    int length;
    char chars[];          // Flexible array member
} ObjString;
- Tracking All Objects:
// In the VM struct: Object* objects; initialized to NULL (head of the list)

Object* allocate_object(size_t size, ObjType type) {
    Object* obj = malloc(size);
    obj->type = type;
    obj->marked = false;
    obj->next = vm.objects;    // Add to list
    vm.objects = obj;

    vm.bytesAllocated += size; // Maybe trigger GC
    if (vm.bytesAllocated > vm.nextGC) {
        collectGarbage();
    }
    return obj;
}
- Finding Roots (what’s definitely alive):
- Stack: all values on VM stack
- Globals: all global variables
- Call frames: function objects, closures
- Compiler state (during compilation)
- Mark Phase:
void markObject(Object* object) {
    if (object == NULL || object->marked) return;
    object->marked = true;
    // Add to worklist for tracing
    grayStack.push(object);
}

void traceReferences() {
    while (grayStack.count > 0) {
        Object* object = grayStack.pop();
        blackenObject(object);  // Mark objects this object references
    }
}
- Sweep Phase:
void sweep() {
    Object** object = &vm.objects;
    while (*object != NULL) {
        if (!(*object)->marked) {
            // Unreachable - free it
            Object* unreached = *object;
            *object = unreached->next;
            freeObject(unreached);
        } else {
            // Reachable - unmark for next GC cycle
            (*object)->marked = false;
            object = &(*object)->next;
        }
    }
}
- GC Scheduling:
- After collection, set next threshold based on live memory
vm.nextGC = vm.bytesAllocated * GC_HEAP_GROW_FACTOR;
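Root marking ties the root list above together. A C sketch, assuming the VM also keeps a globals table and that markValue/markTable exist alongside markObject:
void markValue(Value value);     // assumed: marks if the value holds an object
void markTable(Table *table);    // assumed: marks every key and value

static void markRoots(VM *vm) {
    // 1. Everything on the value stack is reachable.
    for (Value *slot = vm->stack; slot < vm->stackTop; slot++)
        markValue(*slot);
    // 2. Every global binding is reachable.
    markTable(&vm->globals);
    // 3. Every active call frame keeps its closure (and upvalues) alive.
    for (int i = 0; i < vm->frameCount; i++)
        markObject((Object *)vm->frames[i].closure);
}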
Learning milestones:
- All allocations tracked → You understand object lifecycle
- Mark phase traverses correctly → You understand reachability
- Sweep frees correct objects → You understand collection
- Circular references collected → Mark-sweep handles them
- Memory stays bounded → GC is working correctly
Project 7: Closures and Upvalues
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Closures / Variable Capture
- Software or Tool: Text editor, C compiler
- Main Book: “Crafting Interpreters” by Robert Nystrom (Chapter 25)
What you’ll build: A proper closure implementation where functions can capture and modify variables from enclosing scopes, even after those scopes have exited.
Why it teaches closures deeply: Closures are fundamental to functional programming but tricky to implement correctly. Variables must “escape” the stack when the enclosing function returns—this requires moving them to the heap.
Core challenges you’ll face:
- Identifying captured variables → maps to static analysis during compilation
- Creating upvalue objects → maps to heap-allocated variable storage
- Closing over stack variables → maps to detecting when to move to heap
- Sharing upvalues between closures → maps to upvalue deduplication
Key Concepts:
- Closures and Upvalues: “Crafting Interpreters” Chapter 25 - Robert Nystrom
- Closure Implementation: “Programming Language Pragmatics” Chapter 3 - Michael Scott
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Project 5 completed.
Real world outcome:
fn makeCounter() {
var count = 0;
fn increment() {
count = count + 1; // Captures 'count' from outer scope
return count;
}
return increment;
}
var c1 = makeCounter();
var c2 = makeCounter();
print(c1()); // 1
print(c1()); // 2
print(c2()); // 1 (separate counter!)
print(c1()); // 3
Implementation Hints:
The challenge: when makeCounter returns, count is on the stack. But increment needs it later!
- Upvalue Object:
typedef struct ObjUpvalue {
    Object obj;
    Value* location;            // Points to stack slot or 'closed'
    Value closed;               // Value when closed over
    struct ObjUpvalue* next;
} ObjUpvalue;
- Closure Object:
typedef struct {
    Object obj;
    ObjFunction* function;
    ObjUpvalue** upvalues;      // Array of captured variables
    int upvalueCount;
} ObjClosure;
- Compile-Time Analysis:
- Track which variables are captured by nested functions
- Mark those variables as needing upvalues
- Record in function’s bytecode which upvalues to capture
- Creating Upvalues at Runtime:
- When a closure captures a local, create an ObjUpvalue pointing to stack slot
- Keep list of open upvalues (still pointing to stack)
- Share upvalues if multiple closures capture same variable
- Closing Upvalues:
- When a local goes out of scope, check if any upvalue points to it
- If so, copy value into upvalue’s closed field
- Update location pointer to point to closed
- Accessing Upvalues:
- OP_GET_UPVALUE: get value through upvalue indirection
- OP_SET_UPVALUE: set value through upvalue indirection
- Works whether upvalue is open (stack) or closed (heap)
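Upvalue sharing is the subtle part. A C sketch in the style of Crafting Interpreters, assuming vm->openUpvalues is a list of open upvalues sorted from top of stack down and newUpvalue is an assumed constructor:
ObjUpvalue *newUpvalue(Value *slot);   // assumed constructor

// Reuse an existing open upvalue for this stack slot, or create one.
static ObjUpvalue *captureUpvalue(VM *vm, Value *local) {
    ObjUpvalue *prev = NULL;
    ObjUpvalue *uv = vm->openUpvalues;
    while (uv != NULL && uv->location > local) {   // walk past higher slots
        prev = uv;
        uv = uv->next;
    }
    if (uv != NULL && uv->location == local)
        return uv;                   // two closures now share one upvalue
    ObjUpvalue *created = newUpvalue(local);
    created->next = uv;              // splice into the sorted list
    if (prev == NULL) vm->openUpvalues = created;
    else              prev->next = created;
    return created;
}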
Learning milestones:
- Simple closures work → You understand basic capture
- Mutations visible to closure → Upvalue indirection works
- Closures outlive enclosing function → Closing works
- Multiple closures share variables → Upvalue sharing works
- Nested closures work → Full upvalue chain working
Project 8: Classes and Inheritance
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Object-Oriented Programming / Method Dispatch
- Software or Tool: Text editor, C compiler
- Main Book: “Crafting Interpreters” by Robert Nystrom (Chapters 27-29)
What you’ll build: An object-oriented extension to your language with classes, instances, methods, constructors, inheritance, and super calls.
Why it teaches OOP implementation: OOP is everywhere, but its implementation is rarely explained. You’ll understand how methods are bound to instances, how this works, and how inheritance actually dispatches to superclass methods.
Core challenges you’ll face:
- Instance memory layout → maps to objects as hash tables or fixed layouts
- Method binding → maps to methods capturing ‘this’
- Constructor chaining → maps to initializers, superclass init
- Method inheritance → maps to method tables, lookup chains
- Super calls → maps to starting lookup from superclass
Key Concepts:
- Classes and Instances: “Crafting Interpreters” Chapter 27 - Robert Nystrom
- Methods and Initializers: “Crafting Interpreters” Chapter 28 - Robert Nystrom
- Superclasses: “Crafting Interpreters” Chapter 29 - Robert Nystrom
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Project 7 completed.
Real world outcome:
class Animal {
init(name) {
this.name = name;
}
speak() {
print(this.name + " makes a sound");
}
}
class Dog < Animal {
init(name, breed) {
super.init(name);
this.breed = breed;
}
speak() {
print(this.name + " barks!");
}
describe() {
print(this.name + " is a " + this.breed);
}
}
var dog = Dog("Rex", "German Shepherd");
dog.speak(); // Rex barks!
dog.describe(); // Rex is a German Shepherd
Implementation Hints:
- Class Object:
typedef struct {
    Object obj;
    ObjString* name;
    Table methods;       // Hash table of method names → closures
} ObjClass;
- Instance Object:
typedef struct {
    Object obj;
    ObjClass* klass;
    Table fields;        // Hash table of field names → values
} ObjInstance;
- Bound Method (method + receiver):
typedef struct {
    Object obj;
    Value receiver;      // The 'this' object
    ObjClosure* method;  // The method closure
} ObjBoundMethod;
- Method Access (instance.method):
  - If instance has field with that name, return field
- Otherwise, look up in class’s method table
- Wrap in BoundMethod with instance as receiver
- Method Invocation:
- At call time, bound method unwraps
- Push receiver as first local (slot 0 = this)
- Inheritance:
- Store superclass reference in ObjClass
- Method lookup: check class, then superclass chain
- Or: copy superclass methods into subclass table at creation
- Super Calls:
  - super.method() starts lookup at the current class’s superclass
  - Need to track which class’s method is currently executing
- Compile super as upvalue + method name
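Property lookup with method binding, as a C sketch over the structs above; tableGet, newBoundMethod, and the OBJ_VAL/AS_CLOSURE value macros are assumed names in the style of Crafting Interpreters:
// instance.name: fields shadow methods; methods get bound to the receiver.
static bool getProperty(ObjInstance *instance, ObjString *name, Value *out) {
    if (tableGet(&instance->fields, name, out))
        return true;                               // a field wins
    Value method;
    if (tableGet(&instance->klass->methods, name, &method)) {
        ObjBoundMethod *bound =
            newBoundMethod(OBJ_VAL(instance), AS_CLOSURE(method));
        *out = OBJ_VAL(bound);                     // 'this' travels with it
        return true;
    }
    return false;                                  // caller reports the error
}
For inheritance, the middle lookup walks the superclass chain (or hits a merged method table) instead of a single class.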
Learning milestones:
- Classes and instances work → You understand object layout
- Methods can access ‘this’ → You understand method binding
- Constructors work → You understand initialization
- Inheritance works → You understand method lookup
- Super works → Full OOP implementation complete
Project 9: Static Type Checker
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: OCaml or Rust
- Alternative Programming Languages: Haskell, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Type Systems / Static Analysis
- Software or Tool: OCaml/Rust compiler
- Main Book: “Types and Programming Languages” by Benjamin C. Pierce
What you’ll build: A static type checker for a simple typed language, including type annotations, basic type inference, generic types (parametric polymorphism), and helpful error messages.
Why it teaches type systems: Types catch bugs before runtime. Understanding how type checkers work—unification, substitution, generalization—prepares you for advanced languages like Rust, Haskell, or TypeScript.
Core challenges you’ll face:
- Type representation → maps to algebraic data types for types
- Bidirectional type checking → maps to synthesizing vs checking modes
- Type inference → maps to unification algorithm
- Generics → maps to instantiation, generalization
- Error messages → maps to tracking source locations, explaining failures
Key Concepts:
- Type Checking Basics: “Types and Programming Languages” Chapters 8-11 - Benjamin Pierce
- Hindley-Milner: Write You a Haskell - Stephen Diehl
- Bidirectional Type Checking: “Bidirectional Typing” - Dunfield & Krishnaswami
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Project 4 completed, understanding of type theory basics.
Real world outcome:
// Your type checker catches these errors:
let x: int = "hello"; // Error: expected int, got string
fn add(a: int, b: int): int {
return a + b;
}
add(1, "two"); // Error: argument 2: expected int, got string
// Type inference works:
let y = 42; // inferred: int
let z = y + 1; // inferred: int
let s = "hi"; // inferred: string
// Generics work:
fn identity<T>(x: T): T { return x; }
let a = identity(42); // a: int
let b = identity("hi"); // b: string
Implementation Hints:
- Type Representation:
type typ =
  | TInt
  | TBool
  | TString
  | TFun of typ * typ      (* function type: arg -> return *)
  | TVar of int            (* type variable (for inference) *)
  | TForall of int * typ   (* polymorphic type: ∀α. T *)
  | TArray of typ
- Type Environment:
type env = (string * typ) list   (* variable name → type *)

let lookup name env = List.assoc_opt name env
- Bidirectional Type Checking:
- Synthesize mode: Compute the type of an expression
- Check mode: Verify expression has a given type
- Some constructs synthesize (variables, function calls)
- Some check (function bodies against declared return type)
- Unification (for type inference):
(* Make two types equal by finding a substitution *)
let rec unify t1 t2 subst =
  let t1 = apply_subst subst t1 in
  let t2 = apply_subst subst t2 in
  match t1, t2 with
  | TInt, TInt -> subst
  | TVar n, t | t, TVar n ->
      if occurs n t then error "infinite type"
      else extend_subst n t subst
  | TFun (a1, r1), TFun (a2, r2) ->
      let subst = unify a1 a2 subst in
      unify r1 r2 subst
  | _ -> error ("cannot unify " ^ show t1 ^ " with " ^ show t2)
- Generalization and Instantiation:
- Generalize: Turn monotype into polytype (close over free type vars)
- Instantiate: Turn polytype into monotype (create fresh vars)
- Error Messages:
- Track source locations through type checking
- Report expected vs actual type
- Show context (which expression, which argument)
Learning milestones:
- Basic types check correctly → You understand type checking
- Functions type check → You understand function types
- Type errors are caught → Your checker is useful
- Inference works → You understand unification
- Generics work → You understand polymorphism
Project 10: Register-Based Virtual Machine
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Zig
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Virtual Machines / Code Generation
- Software or Tool: Text editor, C compiler
- Main Book: “The Implementation of Lua 5.0” - Roberto Ierusalimschy
What you’ll build: A register-based VM (like Lua or Dalvik) with register allocation, more compact bytecode, and potentially better performance than a stack VM.
Why it teaches VM design trade-offs: Stack VMs are simple but verbose. Register VMs have fewer instructions and can be faster, but require register allocation. Comparing both teaches you VM design trade-offs.
Core challenges you’ll face:
- Register allocation → maps to which virtual registers hold which values
- Instruction encoding → maps to fixed-width vs variable-width
- Function call convention → maps to register windows, parameter passing
- Comparison with stack VM → maps to trade-off analysis
Key Concepts:
- Lua VM Design: “The Implementation of Lua 5.0” - Ierusalimschy et al.
- Register Allocation: “Engineering a Compiler” Chapter 13 - Cooper & Torczon
- VM Design Comparison: Writing Interpreters in Rust
Difficulty: Expert. Time estimate: 2-3 weeks. Prerequisites: Project 5 completed.
Real world outcome:
Source: a = b + c * d
Stack VM (6 instructions):
LOAD b
LOAD c
LOAD d
MUL
ADD
STORE a
Register VM (3 instructions), assuming b, c, d live in R1, R2, R3:
MUL R4, R2, R3   ; R4 = c * d
ADD R4, R1, R4   ; R4 = b + c * d
MOV R0, R4       ; a (in R0) = R4
Implementation Hints:
- Instruction Format (Lua-style, 32 bits):
┌────────┬────────┬────────┬────────┐
│ opcode │   A    │   B    │   C    │
│ 6 bits │ 8 bits │ 9 bits │ 9 bits │
└────────┴────────┴────────┴────────┘
or
┌────────┬────────┬───────────────────┐
│ opcode │   A    │        Bx         │
│ 6 bits │ 8 bits │      18 bits      │
└────────┴────────┴───────────────────┘
- Instruction Types:
OP_MOVE  A B     // R[A] = R[B]
OP_LOADK A Bx    // R[A] = K[Bx] (constant)
OP_ADD   A B C   // R[A] = R[B] + R[C]
OP_JMP   sBx     // pc += sBx (signed jump)
OP_CALL  A B C   // call R[A] with B-1 args, C-1 results
- Register Allocation (simple approach):
- During compilation, track “current register”
- Each expression result goes in next register
- After statement, reset register counter
- Functions have fixed register windows
- Function Calls:
- Callee’s registers start after caller’s
- Arguments: R[A+1], R[A+2], …
- Return values placed starting at R[A]
- Stack frames still needed for return address
- Comparison with Stack VM:
- Register VM: fewer instructions, more explicit operands
- Stack VM: simpler compiler, implicit operand location
- Performance: register VM often faster (fewer dispatches)
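A sketch of decoding and dispatching the 32-bit format above in C; treating Value as double and passing bare register/constant arrays are simplifying assumptions:
#include <stdint.h>
typedef double Value;
enum { OP_MOVE, OP_LOADK, OP_ADD, OP_RETURN };

#define OPCODE(i) ((i) & 0x3f)            // low 6 bits
#define A(i)      (((i) >> 6)  & 0xff)    // next 8 bits
#define B(i)      (((i) >> 14) & 0x1ff)   // next 9 bits
#define C(i)      (((i) >> 23) & 0x1ff)   // top 9 bits
#define Bx(i)     ((i) >> 14)             // B and C fused: 18 bits

void run(const uint32_t *code, Value *R, const Value *K) {
    for (const uint32_t *pc = code;;) {
        uint32_t i = *pc++;                        // fetch
        switch (OPCODE(i)) {                       // decode + execute
            case OP_MOVE:   R[A(i)] = R[B(i)];           break;
            case OP_LOADK:  R[A(i)] = K[Bx(i)];          break;
            case OP_ADD:    R[A(i)] = R[B(i)] + R[C(i)]; break;
            case OP_RETURN: return;
        }
    }
}
One OP_ADD here does the work of a stack VM's two pushes plus an add: fewer dispatches per useful operation.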
Learning milestones:
- Basic arithmetic works → You understand register operations
- Variables work with register allocation → You understand allocation
- Functions work → You understand calling conventions
- Benchmarks show difference → You understand performance
- You can compare stack vs register → Deep VM understanding
Project 11: LLVM Backend (Native Compilation)
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C++
- Alternative Programming Languages: Rust (inkwell crate)
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Compilers / Code Generation
- Software or Tool: LLVM, Clang, CMake
- Main Book: LLVM Kaleidoscope Tutorial
What you’ll build: A compiler backend that generates LLVM IR, allowing your language to compile to native machine code with optimizations—like a “real” compiler.
Why it teaches native compilation: LLVM is the infrastructure behind Rust, Swift, Julia, and many others. Using it teaches you IR design, SSA form, and how high-level constructs become machine instructions.
Core challenges you’ll face:
- LLVM IR generation → maps to understanding SSA form, basic blocks
- Type mapping → maps to translating your types to LLVM types
- Function compilation → maps to parameters, locals, return values
- Control flow → maps to basic blocks, phi nodes, terminators
- Calling C functions → maps to FFI, extern declarations
Key Concepts:
- LLVM IR: LLVM Language Reference
- Kaleidoscope Tutorial: LLVM Tutorial
- SSA Form: “Engineering a Compiler” Chapter 9 - Cooper & Torczon
- LLVM C++ API: Mukul Rathi’s Tutorial
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Projects 1-4 completed, C++ knowledge.
Real world outcome:
Source (your language):
fn fib(n: int): int {
if n <= 1 { return n; }
return fib(n-1) + fib(n-2);
}
print(fib(35));
Generated LLVM IR:
define i64 @fib(i64 %n) {
entry:
%cmp = icmp sle i64 %n, 1
br i1 %cmp, label %then, label %else
then:
ret i64 %n
else:
%n1 = sub i64 %n, 1
%fib1 = call i64 @fib(i64 %n1)
%n2 = sub i64 %n, 2
%fib2 = call i64 @fib(i64 %n2)
%sum = add i64 %fib1, %fib2
ret i64 %sum
}
Compile and run:
$ ./mylang fib.mylang | llc -o fib.s
$ clang fib.s -o fib
$ ./fib
9227465
Implementation Hints:
- Setup LLVM:
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/IRBuilder.h"

LLVMContext context;
Module module("my_module", context);
IRBuilder<> builder(context);
- Create Types:
Type* intType = Type::getInt64Ty(context);
Type* doubleType = Type::getDoubleTy(context);
Type* boolType = Type::getInt1Ty(context);
Type* voidType = Type::getVoidTy(context);
- Create a Function:
FunctionType* funcType = FunctionType::get(intType, {intType}, false);
Function* func = Function::Create(funcType, Function::ExternalLinkage,
                                  "fib", module);
BasicBlock* entry = BasicBlock::Create(context, "entry", func);
builder.SetInsertPoint(entry);
- Generate IR for Expressions:
Value* codegen(BinaryExpr* expr) {
    Value* L = codegen(expr->left);
    Value* R = codegen(expr->right);
    switch (expr->op) {
        case '+': return builder.CreateAdd(L, R, "addtmp");
        case '-': return builder.CreateSub(L, R, "subtmp");
        case '*': return builder.CreateMul(L, R, "multmp");
        case '/': return builder.CreateSDiv(L, R, "divtmp");
    }
}
- Control Flow with Basic Blocks:
// if (cond) { then } else { else }
BasicBlock* thenBB = BasicBlock::Create(context, "then", func);
BasicBlock* elseBB = BasicBlock::Create(context, "else", func);
BasicBlock* mergeBB = BasicBlock::Create(context, "merge", func);

builder.CreateCondBr(condValue, thenBB, elseBB);

builder.SetInsertPoint(thenBB);
Value* thenVal = codegen(thenExpr);
builder.CreateBr(mergeBB);

builder.SetInsertPoint(elseBB);
Value* elseVal = codegen(elseExpr);
builder.CreateBr(mergeBB);

builder.SetInsertPoint(mergeBB);
PHINode* phi = builder.CreatePHI(intType, 2);
phi->addIncoming(thenVal, thenBB);
phi->addIncoming(elseVal, elseBB);
- JIT Compilation (optional):
- Use LLVM’s ORC JIT
- Compile and run immediately without separate linking
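If the C++ API feels heavy, LLVM also ships a stable C API. A minimal sketch that builds and prints an add function (compile with the flags from `llvm-config --cflags --libs core`):
#include <llvm-c/Core.h>
#include <stdio.h>

int main(void) {
    LLVMModuleRef mod = LLVMModuleCreateWithName("demo");
    LLVMTypeRef i64 = LLVMInt64Type();
    LLVMTypeRef params[] = { i64, i64 };
    LLVMTypeRef fnType = LLVMFunctionType(i64, params, 2, 0);
    LLVMValueRef fn = LLVMAddFunction(mod, "add", fnType);

    LLVMBasicBlockRef entry = LLVMAppendBasicBlock(fn, "entry");
    LLVMBuilderRef b = LLVMCreateBuilder();
    LLVMPositionBuilderAtEnd(b, entry);
    LLVMValueRef sum = LLVMBuildAdd(b, LLVMGetParam(fn, 0),
                                       LLVMGetParam(fn, 1), "sum");
    LLVMBuildRet(b, sum);

    char *ir = LLVMPrintModuleToString(mod);   // textual IR, like the fib example
    printf("%s", ir);
    LLVMDisposeMessage(ir);
    LLVMDisposeBuilder(b);
    LLVMDisposeModule(mod);
    return 0;
}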
Learning milestones:
- Simple functions compile to IR → You understand LLVM basics
- Expressions work → You understand IR generation
- Control flow works → You understand basic blocks
- Native binary runs → Full native compilation!
- Optimizations improve code → You understand optimization passes
Project 12: Hindley-Milner Type Inference
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: OCaml or Haskell
- Alternative Programming Languages: Rust, F#
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 5: Master
- Knowledge Area: Type Theory / Type Inference
- Software or Tool: OCaml or GHC
- Main Book: “Types and Programming Languages” by Benjamin C. Pierce (Chapters 22-23)
What you’ll build: A complete Hindley-Milner type inference engine that can deduce types for a program without any type annotations—like ML, Haskell, or Rust.
Why it teaches type inference: HM is the gold standard for type inference. Understanding Algorithm W, unification, and let-polymorphism gives you deep insight into how modern type systems work.
Core challenges you’ll face:
- Fresh type variables → maps to generating unique identifiers
- Unification → maps to making type equations consistent
- Substitution → maps to applying solved types throughout
- Generalization and instantiation → maps to polymorphism
- Let-polymorphism → maps to where generalization happens
Key Concepts:
- Algorithm W: “Types and Programming Languages” Chapter 22 - Benjamin Pierce
- Unification: Write You a Haskell - Stephen Diehl
- Let-Polymorphism: Hindley-Milner Tutorial
Difficulty: Master. Time estimate: 2-3 weeks. Prerequisites: Project 9 completed, understanding of lambda calculus.
Real world outcome:
(* No type annotations needed! *)
let id = fun x -> x
(* Inferred: id : ∀α. α → α *)
let compose = fun f -> fun g -> fun x -> f (g x)
(* Inferred: compose : ∀α β γ. (β → γ) → (α → β) → (α → γ) *)
let rec map = fun f -> fun lst ->
match lst with
| [] -> []
| x :: xs -> f x :: map f xs
(* Inferred: map : ∀α β. (α → β) → [α] → [β] *)
let result = map (fun x -> x + 1) [1; 2; 3]
(* result : [int] *)
Implementation Hints:
- Type Data Structure:
type typ =
  | TVar of tvar ref        (* Type variable *)
  | TInt | TBool            (* Primitive types *)
  | TArrow of typ * typ     (* Function type *)
  | TList of typ            (* List type *)

and tvar =
  | Unbound of int * level  (* Unsolved, with unique id and level *)
  | Link of typ             (* Solved: points to another type *)
- Algorithm W Outline:
let rec infer env expr =
  match expr with
  | Var x -> instantiate (lookup x env)
  | Lam (x, body) ->
      let tv = fresh_tvar () in
      let bodyType = infer (extend env x tv) body in
      TArrow (tv, bodyType)
  | App (func, arg) ->
      let funcType = infer env func in
      let argType = infer env arg in
      let resultType = fresh_tvar () in
      unify funcType (TArrow (argType, resultType));
      resultType
  | Let (x, value, body) ->
      let valueType = infer env value in
      let genType = generalize env valueType in
      infer (extend env x genType) body
- Unification:
let rec unify t1 t2 =
  let t1 = find t1 in  (* Follow links *)
  let t2 = find t2 in
  match t1, t2 with
  | TVar { contents = Unbound (id1, _) },
    TVar { contents = Unbound (id2, _) } when id1 = id2 -> ()
  | TVar ({ contents = Unbound _ } as tv), t
  | t, TVar ({ contents = Unbound _ } as tv) ->
      occurs_check tv t;
      tv := Link t
  | TArrow (a1, r1), TArrow (a2, r2) -> unify a1 a2; unify r1 r2
  | TInt, TInt | TBool, TBool -> ()
  | _ -> error "type mismatch"
- Generalization (at let bindings):
let generalize env typ =
  let freeVars = free_tvars typ in
  let envVars = free_tvars_in_env env in
  let toGeneralize = freeVars - envVars in
  (* Wrap typ in ∀ for each variable in toGeneralize *)
- Instantiation (when using a polymorphic value):
let instantiate scheme =
  (* Replace each bound type variable with a fresh one *)
- Levels for Efficient Generalization:
- Each scope has a “level”
- Type variables remember their level
- Only generalize variables at deeper levels
Learning milestones:
- Simple expressions infer → You understand basic inference
- Functions infer → You understand arrow types
- Polymorphic let works → You understand generalization
- Type errors are caught → Inference is complete
- You can explain Algorithm W → Deep type theory understanding
Project 13: Self-Hosting Compiler
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: Your language!
- Alternative Programming Languages: (must be your language)
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Bootstrapping / Compiler Design
- Software or Tool: Your language implementation
- Main Book: “Compilers: Principles, Techniques, and Tools” by Aho et al.
What you’ll build: A compiler for your language, written in your language—the ultimate test that your language is “real” and powerful enough to build complex software.
Why it teaches language completeness: Self-hosting proves your language is capable of real work. It forces you to use every feature, find missing ones, and truly eat your own dog food.
Core challenges you’ll face:
- Bootstrapping → maps to chicken-and-egg: first compile with external impl
- Language completeness → maps to what features are you missing?
- Standard library → maps to file I/O, string manipulation
- Error handling → maps to robust parsing and analysis
- Performance → maps to is it fast enough to compile itself?
Key Concepts:
- Bootstrapping: “Compilers: Principles, Techniques, and Tools” Chapter 1 - Aho et al.
- Self-Hosting History: Ken Thompson’s “Reflections on Trusting Trust”
- Practical Bootstrapping: Rust’s bootstrapping process documentation
Difficulty: Master. Time estimate: 2-3 months. Prerequisites: All previous projects; your language must be substantial.
Real world outcome:
# Stage 0: Compile your compiler with the C implementation
$ ./mylang-c mylang.mylang -o mylang-stage1
# Stage 1: Compile your compiler with itself
$ ./mylang-stage1 mylang.mylang -o mylang-stage2
# Stage 2: Verify - should produce identical output
$ ./mylang-stage2 mylang.mylang -o mylang-stage3
$ diff mylang-stage2 mylang-stage3
# No differences - your compiler is self-hosting!
Implementation Hints:
- Required Language Features:
- Strings with manipulation (slicing, concatenation)
- Arrays or lists
- Hash tables / dictionaries
- File I/O (read source, write output)
- Sufficient control flow
- Functions (of course)
- Error reporting
- Bootstrapping Process:
- Stage 0: Use existing implementation (C, Python, etc.) to compile
- Stage 1: Compile with Stage 0 output
- Stage 2: Compile with Stage 1 output
- Verify: Stage 2 output should equal Stage 1 output
- (If not equal, you have a bug!)
- Strategy:
- Start simple: get lexer working in your language
- Then parser, then AST printer
- Then code generator
- Test each phase independently
- Common Issues:
- String handling is always more complex than expected
- Error messages are critical (debugging is hard)
- You’ll find language design flaws
- Performance may require optimization
- Trust and Verification:
- The famous “trusting trust” attack
- Cross-compilation can help verify correctness
- Diverse double-compiling
Learning milestones:
- Lexer runs in your language → Language is somewhat useful
- Parser runs → Data structures and recursion work
- Full compiler runs → Language is complete
- Self-compilation works → You’ve bootstrapped!
- Stage 1 = Stage 2 → Compiler is correct
Project 14: Incremental Parser (IDE Support)
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: Rust
- Alternative Programming Languages: C++, TypeScript
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Parsing / IDE Tooling
- Software or Tool: Your language, tree-sitter or custom
- Main Book: “Modern Compiler Implementation” by Andrew Appel
What you’ll build: A parser that can handle incomplete, erroneous code and update the AST incrementally as the user types—the foundation for IDE features.
Why it teaches practical tooling: Real-world language tooling needs to work with broken code and be fast enough for every keystroke. This is a very different challenge than batch compilation.
Core challenges you’ll face:
- Error recovery → maps to continuing after syntax errors
- Incremental parsing → maps to reusing work from previous parse
- Partial ASTs → maps to representing incomplete constructs
- Fast enough for keystrokes → maps to performance optimization
Key Concepts:
- Error Recovery: “Modern Compiler Implementation” Chapter 3 - Andrew Appel
- Incremental Parsing: tree-sitter documentation
- Red-Green Trees: Roslyn compiler architecture
- LSP: Language Server Protocol specification
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Projects 1-4 completed, understanding of parsing.
Real world outcome:
User is typing:
fn foo(x: int) {
let y = x +|
^cursor here
Parser output:
- Function "foo" (complete)
- Parameter "x" of type "int"
- Body (incomplete)
- Let statement (incomplete)
- Name: "y"
- Value: BinaryExpr (incomplete)
- Left: "x"
- Op: "+"
- Right: <missing expression>
IDE can still:
- Offer completions for right-hand side
- Show error for incomplete expression
- Navigate to definition of "x"
Implementation Hints:
- Error Recovery Strategies (see the sketch after these hints):
  - Panic mode: Skip tokens until sync point (;, }, etc.)
  - Phrase level: Insert/delete tokens to make progress
  - Error productions: Grammar rules for common mistakes
- Error Nodes in AST:
enum Expr {
    Binary { left: Box<Expr>, op: Token, right: Box<Expr> },  // boxed: Expr is recursive
    Number(f64),
    Error { message: String, range: Range },  // Placeholder node
}
- Track which parts of source changed
- Invalidate only affected nodes
- Reparse only invalidated regions
- Keep unchanged subtrees
- Tree-Sitter Style:
- Concrete syntax tree (CST) stores all tokens
- Error nodes mark problems but tree is always complete
- Queries to extract semantic info
- Built-in incremental reparsing
- Performance:
- Avoid full reparse on every keystroke
- Debounce typing events
- Parse in background thread
- Cancel stale parses
- Language Server Protocol:
- Standard protocol for IDE features
- Completion, diagnostics, goto definition
- Your incremental parser is the foundation
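A sketch of panic-mode synchronization in C; the token helpers and TOKEN_* names are assumed, and the sync points follow the recovery hint above:
TokenType peek(Parser *p);       // assumed: next token without consuming
TokenType previous(Parser *p);   // assumed: most recently consumed token
void advance(Parser *p);         // assumed: consume one token

// After a syntax error, skip tokens until a likely statement boundary so
// the parser keeps producing nodes (and diagnostics) for the rest of the file.
static void synchronize(Parser *p) {
    p->panicMode = false;
    while (peek(p) != TOKEN_EOF) {
        if (previous(p) == TOKEN_SEMICOLON) return;  // just passed a ';'
        switch (peek(p)) {
            case TOKEN_FN: case TOKEN_VAR: case TOKEN_IF:
            case TOKEN_WHILE: case TOKEN_RETURN:
                return;                              // next statement starts here
            default:
                advance(p);                          // discard the broken token
        }
    }
}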
Learning milestones:
- Parser continues after errors → You understand recovery
- AST represents incomplete code → You understand error nodes
- Fast enough for typing → You understand performance needs
- Incremental updates work → You understand incremental parsing
- LSP integration works → Full IDE support
Project Comparison Table
| Project | Difficulty | Time | Depth | Fun Factor |
|---|---|---|---|---|
| 1. Calculator | Beginner | Weekend | ⭐⭐ | ⭐⭐ |
| 2. JSON Parser | Beginner | Weekend | ⭐⭐ | ⭐⭐ |
| 3. Regex Engine | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 4. Tree-Walking Interpreter | Intermediate | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 5. Bytecode VM | Advanced | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 6. Garbage Collector | Advanced | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 7. Closures/Upvalues | Advanced | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 8. Classes/Inheritance | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 9. Static Type Checker | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| 10. Register VM | Expert | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 11. LLVM Backend | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 12. Type Inference | Master | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| 13. Self-Hosting | Master | 2-3 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 14. Incremental Parser | Expert | 3-4 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
Recommended Learning Path
Phase 1: Foundations (Weeks 1-4)
- Project 1: Calculator Language - Learn lexing and parsing
- Project 2: JSON Parser - Master recursive descent
- Project 4: Tree-Walking Interpreter - Build a complete language
Phase 2: Performance (Weeks 5-10)
- Project 5: Bytecode VM - Understand compilation
- Project 6: Garbage Collector - Master memory management
- Project 7: Closures - Understand advanced features
Phase 3: Advanced Features (Weeks 11-16)
- Project 8: Classes/Inheritance - Add OOP
- Project 9: Static Type Checker - Understand types
- Project 3: Regex Engine - Deep automata theory
Phase 4: Production Quality (Weeks 17-24)
- Project 11: LLVM Backend - Native compilation
- Project 12: Type Inference - Advanced type systems
- Project 13: Self-Hosting - Ultimate validation
Final Capstone Project: Production-Quality Language
- File: LEARN_PROGRAMMING_LANGUAGE_CONSTRUCTION_DEEP_DIVE.md
- Main Programming Language: C/C++ with LLVM, or Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Complete Language Implementation
What you’ll build: A complete, production-quality programming language combining all previous projects: bytecode VM or native compilation, garbage collection, static types with inference, classes, closures, and IDE tooling.
Features:
- Static typing with type inference
- First-class functions and closures
- Classes with single inheritance
- Garbage collection
- LLVM native code generation
- Standard library (I/O, strings, collections)
- Package manager integration
- Language server for IDE support
- Documentation generator
- REPL for interactive use
This capstone integrates everything you’ve learned into a language people could actually use.
Summary
| # | Project | Main Language |
|---|---|---|
| 1 | Calculator Language | C |
| 2 | JSON Parser | C |
| 3 | Regular Expression Engine | C |
| 4 | Tree-Walking Interpreter | C |
| 5 | Bytecode Compiler + Stack VM | C |
| 6 | Mark-Sweep Garbage Collector | C |
| 7 | Closures and Upvalues | C |
| 8 | Classes and Inheritance | C |
| 9 | Static Type Checker | OCaml/Rust |
| 10 | Register-Based Virtual Machine | C |
| 11 | LLVM Backend | C++ |
| 12 | Hindley-Milner Type Inference | OCaml/Haskell |
| 13 | Self-Hosting Compiler | Your Language |
| 14 | Incremental Parser | Rust |
| Final | Production-Quality Language | C++/Rust |
Essential Resources
Books
- “Crafting Interpreters” by Robert Nystrom - FREE online
- “Engineering a Compiler” by Cooper & Torczon - Modern compiler textbook
- “Types and Programming Languages” by Benjamin Pierce - Type theory bible
- “Compilers: Principles, Techniques, and Tools” (Dragon Book) - Classic reference
- “Writing a C Compiler” by Nora Sandler - Practical C compiler
Online Resources
- LLVM Tutorial - Build with LLVM
- Write You a Haskell - Type system implementations
- Create Your Own Programming Language with Rust
- Writing Interpreters in Rust
- Pratt Parsing
- Baby’s First Garbage Collector
- Regular Expression Matching Can Be Simple And Fast - Russ Cox
Tools
- LLVM/Clang - Compiler infrastructure
- tree-sitter - Incremental parsing
- Valgrind - Memory debugging
- GDB/LLDB - Debugging
Master these projects and you’ll understand how every programming language works—from source code to machine execution.