Understanding Compilers and Interpreters

Goal: Build a complete, end-to-end mental model of how human-readable source code becomes running behavior. You will learn to transform text into tokens, tokens into structure, structure into meaning, meaning into IR, IR into optimized instructions, and instructions into executing programs. You will also learn how runtimes, VMs, and ABIs make your compiled code actually run on real machines.


Why Compilers and Interpreters Matter (Now)

Compilers are no longer niche tooling. The same LLVM infrastructure powers the production C/C++ compiler Clang, provides optimizer and codegen libraries, and is explicitly designed to make it easier to build new languages and toolchains.

Modern language ecosystems are shaped by compiler architecture. The Swift compiler lowers to LLVM IR, and Rust relies on LLVM for code generation in its primary backend. That means the LLVM pipeline is not just an academic curiosity; it is a foundation for real-world languages.

Compilers also shape platform reach. LLVM IR is an SSA-based representation used throughout compilation phases and can be serialized as bitcode, which is key for optimizers and JIT workflows.

Compilers are how you target new platforms. Emscripten is an LLVM-to-WebAssembly compiler, and LLVM-based languages can use it to reach the web.

Security and trust are compiler concerns. Ken Thompson’s Turing Award lecture “Reflections on Trusting Trust” explains why compiler correctness and provenance are part of software security.

Language idea
   |
   v
Frontend (lexer/parser/type checker)
   |
   v
IR + Optimizer (SSA, CFG, dataflow)
   |
   v
Backend (codegen + ABI + linker)
   |
   v
Runtime/VM (stack, heap, GC) -> Running program

Big Picture / Mental Model

Think of a compiler as a series of increasingly constrained representations:

Text  ->  Tokens  ->  AST  ->  Typed AST  ->  IR/SSA  ->  Machine code  ->  Binary
 ^         ^           ^        ^             ^           ^                ^
 |         |           |        |             |           |                |
Lexing   Parsing    Syntax   Semantics     Optimizer   Codegen + ABI     Loader

Your projects will repeatedly answer one question: “What structure does this stage need, and how do I preserve meaning while changing representation?”


How to Use This Guide

  1. Read the Goal and Big Picture sections until you can narrate the pipeline in your own words.
  2. Pick a path in Recommended Learning Paths based on your background.
  3. For each project:
    • Do the Thinking Exercise before writing code.
    • Use Hints in Layers only when stuck.
    • Treat Definition of Done as a test plan.
  4. Keep a running “compiler diary” with notes on errors, parse trees, and performance surprises.
  5. When a project feels “done”, re-run it with different inputs and compare outputs across stages.

Prerequisites & Background Knowledge

Essential Prerequisites (Must Have)

  • Solid C fundamentals (pointers, structs, manual memory management)
  • Basic data structures (stacks, hash maps, trees)
  • Comfort with recursion and debugging segmentation faults

Helpful But Not Required

  • Assembly basics (x86-64 or RISC-V)
  • Graph concepts (CFG, dominators)
  • Experience with build systems (Make/CMake)

Self-Assessment Questions

  • Can you write a recursive-descent parser for a tiny grammar?
  • Can you debug a memory leak with valgrind or asan?
  • Can you explain what a stack frame is and where locals live?
  • Can you trace a simple expression evaluation by hand?

Development Environment Setup

  • C toolchain: clang or gcc, make, cmake
  • Debugging: gdb or lldb, valgrind (Linux) or leaks (macOS)
  • Optional: flex/bison for alternative parser pipelines
  • Unit tests: simple custom harness or ctest

Time Investment (Realistic)

  • Project 1: 1 weekend
  • Project 2: 2-3 weeks
  • Project 3: 2-4 weeks
  • Project 4: 1-2 months
  • Project 5: 1-2 months
  • Capstone: 3-6 months

Important Reality Check

Compilers look simple at first. The jump from “parses correctly” to “robust, correct, and debuggable” is the real challenge. Expect iterations, not a straight line.


Success Metrics (How You Know You Actually Understand)

  • You can draw the pipeline and explain what each stage needs and produces.
  • You can debug a parsing error by looking at tokens and grammar, not guessing.
  • You can explain why SSA and CFG make optimization possible.
  • You can reason about calling conventions and stack layout in a generated function.
  • You can change a language feature (e.g., add for loops) without re-architecting everything.

Quick Start (First 48 Hours)

Day 1 (2-4 hours)

  • Read the Big Picture section.
  • Implement a tiny lexer that prints tokens with line/column.
  • Manually parse two inputs on paper.

Day 2 (2-4 hours)

  • Build a recursive-descent parser for expr -> term (('+'|'-') term)*.
  • Print an AST in prefix form.
  • Add evaluation and compare results to hand-calculated outputs.

Recommended Learning Paths

  1. Practical Builder Path (fast feedback)
    • Project 1 -> Project 2 -> Project 3 -> Project 4 -> Capstone
  2. Systems-First Path (strong ABI + assembly focus)
    • Project 1 -> Project 4 -> Project 3 -> Project 2 -> Capstone
  3. Modern Toolchain Path (LLVM-centric)
    • Project 1 -> Project 2 -> Project 5 -> Project 3 -> Capstone

Core Concept Analysis

Lexical Analysis: Turning Characters into Tokens

Lexing is the first gate: raw characters become typed tokens the parser can reason about. You define the vocabulary of the language, handle whitespace/comments, and preserve source locations for error reporting.

Input stream:  "x = 12 + 3.5"
               │   │  │   │
Tokens:        [IDENT(x)] [=] [NUMBER(12)] [+] [NUMBER(3.5)]

Key idea: A lexer does not understand meaning, only text shapes.
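
A minimal sketch of what a token and a single scanning step can look like in C. The names (TokenType, Token, Lexer, next_token) are illustrative rather than from any particular codebase, and a real lexer adds string literals, comments, and multi-character operators:

#include <ctype.h>

typedef enum { TOK_IDENT, TOK_NUMBER, TOK_PLUS, TOK_EQUAL, TOK_EOF } TokenType;
typedef struct { TokenType type; const char *start; int length; int line, col; } Token;
typedef struct { const char *p; int line, col; } Lexer;

/* Return the next token, tracking the column (and, in a fuller lexer, the line) for errors. */
Token next_token(Lexer *lx) {
    while (*lx->p == ' ' || *lx->p == '\t') { lx->p++; lx->col++; }
    Token t = { TOK_EOF, lx->p, 0, lx->line, lx->col };
    if (isdigit((unsigned char)*lx->p)) {
        t.type = TOK_NUMBER;
        while (isdigit((unsigned char)*lx->p)) { lx->p++; t.length++; lx->col++; }
    } else if (isalpha((unsigned char)*lx->p)) {
        t.type = TOK_IDENT;
        while (isalnum((unsigned char)*lx->p)) { lx->p++; t.length++; lx->col++; }
    } else if (*lx->p == '+') { t.type = TOK_PLUS;  t.length = 1; lx->p++; lx->col++; }
    else if (*lx->p == '=')   { t.type = TOK_EQUAL; t.length = 1; lx->p++; lx->col++; }
    return t;
}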

Parsing and ASTs: Giving Structure to Tokens

Parsing assigns structure to a flat token stream. You pick or design a grammar, implement precedence rules, and create an AST that captures program shape.

Tokens:  x = 12 + 3 * 4
AST:
        (=)
       /   \
     x     (+)
          /   \
        12     (*)
              / \
             3   4

Key idea: The AST is the bridge between syntax and meaning.
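
One minimal way to represent that AST in C. The node shape below is an illustrative sketch (a tagged union is common in practice); Project 1 fleshes this out:

typedef enum { NODE_NUMBER, NODE_VAR, NODE_ASSIGN, NODE_BINARY } NodeKind;

typedef struct ASTNode {
    NodeKind kind;
    double number;              /* NODE_NUMBER: literal value */
    char name[32];              /* NODE_VAR / NODE_ASSIGN: identifier */
    char op;                    /* NODE_BINARY: '+', '-', '*', '/' */
    struct ASTNode *lhs, *rhs;  /* children; an assignment's value lives in rhs */
} ASTNode;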

Semantic Analysis: Meaning, Types, and Scope

Semantic analysis checks whether the AST makes sense: variables must exist, types must be compatible, and scopes must be respected. This introduces symbol tables, binding rules, and type checking.

Symbol Table (scope chain)
Global:  print -> (fn)
Local:   x -> number

Key idea: semantics are where “valid syntax” becomes “valid program.”
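
A sketch of scope-chain lookup, assuming each scope holds a simple linked list of symbols and points to its parent. The names are illustrative; real compilers usually keep a hash map per scope:

#include <string.h>
#include <stddef.h>

typedef struct Symbol { const char *name; int type; struct Symbol *next; } Symbol;
typedef struct Scope  { Symbol *symbols; struct Scope *parent; } Scope;

/* Walk from the innermost scope outward; NULL means "undefined name". */
Symbol *resolve(Scope *scope, const char *name) {
    for (; scope != NULL; scope = scope->parent)
        for (Symbol *s = scope->symbols; s != NULL; s = s->next)
            if (strcmp(s->name, name) == 0)
                return s;
    return NULL;
}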

Intermediate Representations (IR) and SSA

IR provides a language-agnostic form that is easier to optimize and target. LLVM IR is a static single assignment (SSA) based representation used across compilation phases.

AST   ->   IR (SSA)       ->   Bytecode
a+b        t0 = a + b          OP_LOAD a
                               OP_LOAD b
                               OP_ADD

Key idea: IR is the compiler’s working language.
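
A common way to model a linear, three-address IR in C, which you can later put into SSA form. The enum and field names are illustrative:

typedef enum { IR_CONST, IR_LOAD, IR_STORE, IR_ADD, IR_MUL } IROp;

typedef struct {
    IROp op;
    int dest;       /* virtual register / SSA value id being defined */
    int a, b;       /* operand value ids (or an immediate for IR_CONST) */
} IRInstr;

/* "a + b" might lower to:
       t0 = IR_LOAD a
       t1 = IR_LOAD b
       t2 = IR_ADD t0, t1                                          */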

Optimization and Correctness

Optimizations rewrite IR while preserving meaning. You learn data-flow analysis, constant folding, dead code elimination, and why correctness proofs matter.

Before:   x = 2 * 3; y = x + 0
After:    x = 6;     y = x

Key idea: optimization is controlled aggression.
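
A sketch of constant folding written directly over the illustrative ASTNode from the parsing section above. A real pass would typically run on IR, free the folded children, and treat division and overflow carefully:

/* Recursively fold "literal op literal" into a single literal node, in place. */
void fold_constants(ASTNode *n) {
    if (n == NULL) return;
    fold_constants(n->lhs);
    fold_constants(n->rhs);
    if (n->kind == NODE_BINARY &&
        n->lhs && n->lhs->kind == NODE_NUMBER &&
        n->rhs && n->rhs->kind == NODE_NUMBER) {
        double a = n->lhs->number, b = n->rhs->number;
        switch (n->op) {
            case '+': n->number = a + b; break;
            case '-': n->number = a - b; break;
            case '*': n->number = a * b; break;
            default:  return;            /* leave '/' and anything else untouched */
        }
        n->kind = NODE_NUMBER;           /* the children are now dead code */
    }
}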

Code Generation and ABI

Code generation chooses instructions, assigns registers, and respects calling conventions. The ABI (like the System V AMD64 psABI) defines how arguments, returns, and stack frames must look. The x86-64 psABI is maintained as a public specification with a canonical PDF linked from the x86-psABI project resources.

IR add -> x86-64:
mov rax, [a]
add rax, [b]

Key idea: correct code must obey the platform’s ABI.

Runtime Systems and Execution Models

Interpreters execute ASTs or bytecode; VMs manage stacks, heaps, and dispatch loops; runtimes provide memory management and built-in services.

Call Stack          Heap
┌──────────┐        ┌───────────┐
│ frame fn │ -----> │  objects  │
└──────────┘        └───────────┘

Key idea: a language is not just syntax; it is a runtime contract.
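
A sketch of how a stack-based VM might lay out its value stack and call frames. The sizes and field names are illustrative; Project 3 builds this out with a tagged Value type and a garbage-collected heap:

typedef double Value;              /* real VMs use a tagged value type */

typedef struct {
    unsigned char *ip;             /* next bytecode instruction for this call */
    Value *slots;                  /* this function's window into the value stack */
} CallFrame;

typedef struct {
    Value stack[256];              /* operand stack shared by all frames */
    Value *stack_top;
    CallFrame frames[64];          /* one frame per active function call */
    int frame_count;
    struct Obj *heap_objects;      /* heap values (strings, closures, ...) the GC tracks */
} VM;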


Concept Summary Table

Concept Cluster       | What You Need to Internalize
Lexical Analysis      | Patterns become tokens; whitespace/comments vanish, source locations remain.
Parsing + ASTs        | Grammar and precedence rules create structure; the AST is the core program model.
Semantic Analysis     | Symbol tables, scope, and types define meaning and validity.
IR and SSA            | ASTs lower to a simpler, optimizable, target-neutral form.
Optimization          | Transformations must preserve behavior while improving performance.
Code Generation + ABI | Instruction selection, register allocation, and calling conventions produce runnable code.
Runtime Systems       | Execution model, stacks/heaps, and GC define how programs actually run.

Deep Dive Reading by Concept

Frontend: Lexing and Parsing

Concept                   | Book & Chapter                            | Why This Matters
Lexical analysis          | Compilers: Principles and Practice Ch. 3  | Token design and error locations drive parser quality.
Recursive descent parsing | Language Implementation Patterns Ch. 2-3  | Maps grammar rules directly to code.
Grammar and precedence    | Writing a C Compiler Ch. 3                | Avoids ambiguous parse trees.

Semantics, Symbols, and Types

Concept                    | Book & Chapter                | Why This Matters
Symbol tables and scopes   | Engineering a Compiler Ch. 5  | Name resolution is the backbone of meaning.
Type checking fundamentals | Engineering a Compiler Ch. 6  | Prevents invalid operations early.

IR, Optimization, and Codegen

Concept                             | Book & Chapter                            | Why This Matters
SSA form and IR design              | Engineering a Compiler Ch. 9              | Enables data-flow optimizations.
Data-flow analysis                  | Compilers: Principles and Practice Ch. 9  | Lets you reason about values across CFGs.
Instruction selection and registers | Engineering a Compiler Ch. 11-13          | Turns abstract IR into hardware instructions.

Runtime and VMs

Concept         | Book & Chapter                  | Why This Matters
Bytecode VMs    | Crafting Interpreters Part III  | Teaches fast interpreter architecture.
Runtime systems | Engineering a Compiler Ch. 7    | Shows how stacks/heaps support language features.

Compilation Pipeline Walkthrough (Core Subsystem Walkthrough)

Source file
  -> Lexer (tokens w/ spans)
  -> Parser (AST)
  -> Resolver (scopes + symbols)
  -> Type Checker (annotated AST)
  -> Lowering (IR + CFG)
  -> Optimizer (SSA + passes)
  -> Codegen (machine code)
  -> Assembler/Linker (binary)
  -> Loader/Runtime (process, stack, heap)

Codebase Navigation Cheatsheet (Compiler Projects)

Typical directories for your own compiler projects:

  • src/lexer.c, src/parser.c, src/ast.c
  • src/resolve.c, src/typecheck.c
  • src/ir.c, src/ssa.c, src/opt.c
  • src/codegen.c, src/emit_x86.c
  • src/vm.c, src/bytecode.c, src/gc.c

Useful rg searches:

  • rg "TOKEN_" src/ (token definitions)
  • rg "parse_" src/ (parser entry points)
  • rg "emit_" src/ (codegen hotspots)
  • rg "resolve" src/ (scope and binding)
  • rg "phi" src/ (SSA construction)

Debugging Toolkit

  • Token dump: print tokens with line/column before parsing
  • AST printer: prefix or indented tree view
  • IR printer: show basic blocks and instructions
  • Trace flags: --trace-lexer, --trace-parser, --trace-vm
  • Memory tools: valgrind, asan, ubsan
  • GDB: set breakpoints in parse_expr, emit_binary, vm_run
  • Differential testing: compare output with a reference interpreter

Data Structure Atlas (Compiler-Specific)

Structure        | What It Represents       | Where It Matters
Token            | Lexeme + type + span     | Lexing, errors
ASTNode          | Syntax tree node         | Parsing, evaluation
Symbol           | Name binding + scope     | Semantics, resolution
Type             | Type info                | Type checking, codegen
IRInstr          | Lowered instruction      | Optimization, codegen
BasicBlock       | CFG node                 | Control flow analysis
SSAValue         | Single-assignment value  | Optimizations
Chunk / Bytecode | VM instruction stream    | Interpreter/VM
CallFrame        | Function activation      | Runtime, VM
Obj              | Heap object              | GC, runtime

Compiler Contribution Checklist

  • Reproducible test case for every bug
  • Golden output file for new language feature
  • Error messages include spans and helpful context
  • New syntax documented with examples
  • Performance impact measured (even small)
  • Fuzz test or differential test added when possible

Glossary (High-Signal)

  • Token: typed chunk of the source text (identifier, number, keyword)
  • AST: tree representation of program structure
  • SSA: static single assignment, where each value is assigned once
  • CFG: control flow graph, nodes are basic blocks
  • Lowering: transforming a high-level representation into a lower-level one
  • Bytecode: compact instruction stream for a VM
  • ABI: rules for calling conventions, stack layout, and binary interfaces
  • Closure: function with captured environment

Common Failure Signatures

Problem: “Parser accepts invalid syntax”

  • Why: ambiguous grammar or incorrect precedence handling
  • Fix: make precedence explicit or use Pratt parser
  • Quick test: parse 1 + 2 * 3 and confirm AST grouping

Problem: “Variables resolve to wrong scope”

  • Why: environment chain incorrect or shadowing mishandled
  • Fix: track scope depth and capture environment at function creation
  • Quick test: nested function example with shadowed names

Problem: “Generated assembly crashes”

  • Why: stack misalignment or calling convention violation
  • Fix: verify ABI rules, argument registers, and stack alignment
  • Quick test: call a function with 7+ args and compare to clang output

Reading Plan (4 Weeks, Light but Consistent)

  • Week 1: Lexer + parser chapters (Sandler + Parr)
  • Week 2: Semantics and symbol tables (Cooper & Torczon)
  • Week 3: IR + SSA + optimizations (Cooper & Torczon)
  • Week 4: Runtime and VM (Crafting Interpreters Part III)

Projects

Project 1: Expression Evaluator with Variables

  • File: COMPILERS_INTERPRETERS_LEARNING_PROJECTS.md
  • Programming Language: C
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Compilers / Interpreters
  • Software or Tool: Lexing/Parsing logic
  • Main Book: “Compilers: Principles and Practice” by Dave & Dave

What you’ll build: A REPL calculator that parses mathematical expressions, supports variables (x = 5 + 3), and evaluates them while showing the AST before computing results.

Why it teaches compilers/interpreters: This is the “hello world” of language implementation. You’ll implement the complete pipeline (lexer → parser → AST → evaluator) in miniature without the complexity of control flow or functions.

Core challenges you’ll face:

  • Tokenizing expressions with operators, numbers, identifiers, parentheses (lexical analysis)
  • Handling operator precedence correctly (2 + 3 * 4 must equal 14, not 20) (parsing)
  • Building and traversing an AST (IR representation)
  • Managing a symbol table for variables (semantic analysis)

Learning milestones:

  1. Tokenizer works - you understand how source becomes tokens
  2. Parser works - you understand grammars and precedence
  3. Evaluator works - you understand tree traversal and environments
  4. Error messages work - you understand the importance of source locations

Real World Outcome

You run a REPL that shows both the AST and the evaluated result. You can assign variables, reuse them, and see operator precedence reflected in the tree.

$ ./expr_repl
> x = 10 * 2
AST: (= x (* 10 2))
Result: 20
> x + 5
AST: (+ x 5)
Result: 25
> (1 + 2) * 3
AST: (* (+ 1 2) 3)
Result: 9
> y + 1
Error [line 1, col 1]: Undefined variable 'y'

The Core Question You’re Answering

“How do I turn raw text into a structured tree and evaluate it correctly, with variables and precedence?”

Concepts You Must Understand First

Stop and research these before coding:

  1. Token streams and token types
    • What makes a token “identifier” vs “number”?
    • How do you store line/column for errors?
    • Book Reference: Compilers: Principles and Practice Ch. 3
  2. Expression grammars
    • How do precedence and associativity change parse structure?
    • Why do recursive descent functions map to grammar rules?
    • Book Reference: Writing a C Compiler Ch. 3
  3. AST node design
    • What minimal fields are needed to represent expressions?
    • How do you represent unary vs binary nodes?
    • Book Reference: Engineering a Compiler Ch. 4
  4. Symbol tables for variables
    • How do you map names to values?
    • How do you handle reassignment?
    • Book Reference: Compilers: Principles and Practice Ch. 5

Questions to Guide Your Design

  1. What is the smallest grammar that supports literals, variables, and binary operators?
  2. How will you implement precedence: Pratt parser or recursive descent tiers?
  3. How will the AST printer show structure clearly (prefix, infix with parens, or tree)?
  4. What error messages help you fix malformed input fastest?

Thinking Exercise

Manually tokenize and parse this input, then draw the AST:

x = 2 + 3 * (4 - 1)

The Interview Questions They’ll Ask

  1. How does a lexer differ from a parser?
  2. What is operator precedence and how do you enforce it?
  3. Why use an AST instead of evaluating tokens directly?
  4. How would you add unary minus to your grammar?
  5. How does a symbol table work in a tiny interpreter?

Hints in Layers

Hint 1: Start with a token dump. Print all tokens before parsing to validate lexing.

Hint 2: Implement precedence in tiers. Split parsing into parse_term, parse_factor, parse_primary.

Hint 3: Add AST printing before evaluation. If the tree is wrong, evaluation will be wrong.

Hint 4: Make errors local. Emit an error as soon as you see a token you don’t expect.
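
A sketch of Hint 2’s precedence tiers, following the expr -> term -> factor grammar from the Quick Start. The Parser type, the extra token kinds, and the helpers (match, expect, prev_op, make_number, make_var, make_binary) are hypothetical names you would implement alongside the ASTNode sketch from the Core Concept section:

/* expr   -> term (('+' | '-') term)*
   term   -> factor (('*' | '/') factor)*
   factor -> NUMBER | IDENT | '(' expr ')'                     */

ASTNode *parse_expr(Parser *p);      /* forward declaration for '(' expr ')' */

ASTNode *parse_factor(Parser *p) {
    if (match(p, TOK_NUMBER)) return make_number(p);
    if (match(p, TOK_IDENT))  return make_var(p);
    expect(p, TOK_LPAREN, "expected expression");
    ASTNode *inner = parse_expr(p);
    expect(p, TOK_RPAREN, "expected ')'");
    return inner;
}

ASTNode *parse_term(Parser *p) {
    ASTNode *node = parse_factor(p);
    while (match(p, TOK_STAR) || match(p, TOK_SLASH))
        node = make_binary(prev_op(p), node, parse_factor(p));   /* left-associative */
    return node;
}

ASTNode *parse_expr(Parser *p) {
    ASTNode *node = parse_term(p);
    while (match(p, TOK_PLUS) || match(p, TOK_MINUS))
        node = make_binary(prev_op(p), node, parse_term(p));
    return node;
}

Because parse_term only ever asks parse_factor for its operands, 2 + 3 * 4 naturally parses as (+ 2 (* 3 4)).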

Books That Will Help

Topic               | Book & Chapter                            | Why This Matters
Lexing and tokens   | Compilers: Principles and Practice Ch. 3  | Canonical lexer patterns and token streams.
Parsing expressions | Writing a C Compiler Ch. 3                | Clean precedence handling.
AST design          | Engineering a Compiler Ch. 4              | Minimal node design patterns.
Symbol tables       | Compilers: Principles and Practice Ch. 5  | Correct variable binding.

Common Pitfalls & Debugging

Problem: “2 + 3 * 4” evaluates to 20

  • Why: precedence handled incorrectly
  • Fix: parse with precedence tiers or Pratt parsing
  • Quick test: ensure AST prints (+ 2 (* 3 4))

Problem: “-3 + 2” crashes

  • Why: unary operators not handled in grammar
  • Fix: add parse_unary before parse_primary
  • Quick test: parse -1 * 2 and confirm tree

Problem: Unknown variable error even after assignment

  • Why: symbol table not updated for assignment nodes
  • Fix: store new value when evaluating assignment
  • Quick test: x = 1 then x + 1

Definition of Done

  • Lexer outputs correct tokens with line/column
  • Parser builds AST that matches operator precedence
  • Evaluator handles variables, reassignment, and parentheses
  • Error messages include line/column for malformed input
  • Unit tests cover at least 10 expressions (including errors)

Project 2: Interpreter for a Programming Language (Lox or Monkey)

  • File: COMPILERS_INTERPRETERS_LEARNING_PROJECTS.md
  • Programming Language: C
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Compilers / Interpreters
  • Software or Tool: Lox / Monkey
  • Main Book: “Crafting Interpreters” by Bob Nystrom

What you’ll build: A complete tree-walking interpreter for a dynamically-typed language with variables, functions, closures, control flow, and classes/objects.

Why it teaches compilers/interpreters: This project forces you through the entire interpretation pipeline. By the end, you’ll have implemented closures (which require environments and scope chains), first-class functions, and dynamic dispatch. Crafting Interpreters explicitly walks through this pipeline.

Core challenges you’ll face:

  • Designing a scanner that handles strings, comments, and multi-character operators (lexical analysis)
  • Implementing a parser that handles statements, expressions, and declarations (parsing)
  • Creating an environment chain for lexical scoping and closures (semantic analysis)
  • Implementing control flow by recursively evaluating AST nodes (runtime execution)

Resources for key challenges:

  • “Crafting Interpreters” by Bob Nystrom (free online) - The definitive guide; walks through building Lox step-by-step
  • “Writing An Interpreter In Go” by Thorsten Ball - Practical alternative using Monkey language

Learning milestones:

  1. Expressions and statements work - you understand the execution model
  2. Functions work - you understand call stacks and return values
  3. Closures work - you truly understand lexical scope and environments
  4. Classes work - you understand dynamic dispatch and inheritance

Real World Outcome

You can run non-trivial programs in your language, see printed output, and inspect runtime behavior when errors occur.

$ ./lox
> fun makeCounter() { var i = 0; fun count(){ i = i + 1; print i; } return count; }
> var counter = makeCounter();
> counter();
1
> counter();
2
> fun fib(n) { if (n <= 1) return n; return fib(n-1) + fib(n-2); }
> print fib(8);
21

The Core Question You’re Answering

“How does a real programming language execute user-defined code with scope, functions, and objects?”

Concepts You Must Understand First

  1. Recursive descent parsing of statements
    • How do you distinguish declarations from expressions?
    • Book Reference: Crafting Interpreters Ch. 6-8
  2. Environment chains
    • How do nested scopes resolve variable names?
    • Book Reference: Language Implementation Patterns Ch. 6
  3. Closures
    • What does it mean to “capture” a variable?
    • Book Reference: Crafting Interpreters Ch. 11
  4. Runtime error handling
    • How do you include line numbers in errors?
    • Book Reference: Engineering a Compiler Ch. 5

Questions to Guide Your Design

  1. What is the minimal AST shape for statements, blocks, and functions?
  2. How will you model environments to support nested functions?
  3. How will your interpreter implement return, break, and continue?
  4. What is the simplest object model that still supports methods?

Thinking Exercise

Trace variable lookup for this program and list which scope each name resolves to:

var x = "global";
fun outer() {
  var x = "outer";
  fun inner() { print x; }
  inner();
}
outer();

The Interview Questions They’ll Ask

  1. What is lexical scoping and how do interpreters implement it?
  2. How do closures capture environment state?
  3. How do you represent a function call frame?
  4. Why does a tree-walk interpreter tend to be slow?
  5. What is the difference between dynamic and static typing at runtime?

Hints in Layers

Hint 1: Separate parse from execute. Get a correct AST before adding evaluation.

Hint 2: Represent environments as a linked list. Each scope points to its enclosing scope.

Hint 3: Treat functions as objects. Store the function body plus a pointer to its defining environment.

Hint 4: Add a debug mode. Print the AST and environment depth for each function call.
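
A sketch of Hints 2 and 3 together: a linked-list environment chain plus a function value that remembers the environment where it was defined. The fixed-size entry array and names are illustrative simplifications:

#include <string.h>

typedef struct Env {
    struct { char name[32]; double value; } entries[16];
    int count;
    struct Env *enclosing;          /* NULL for the global scope */
} Env;

/* Look a name up by walking outward through enclosing scopes. */
double *env_get(Env *env, const char *name) {
    for (; env != NULL; env = env->enclosing)
        for (int i = 0; i < env->count; i++)
            if (strcmp(env->entries[i].name, name) == 0)
                return &env->entries[i].value;
    return NULL;
}

/* A function value stores its body plus the environment it was DEFINED in;
   capturing that pointer (rather than the caller's environment) is what
   makes closures like makeCounter work. */
typedef struct {
    struct ASTNode *body;
    Env *closure;
} Function;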

Books That Will Help

Topic                    | Book & Chapter                          | Why This Matters
Interpreter end-to-end   | Crafting Interpreters Ch. 4-13          | Full pipeline in one guide.
Scopes and symbol tables | Language Implementation Patterns Ch. 6  | Correct lexical scoping.
AST evaluation           | Engineering a Compiler Ch. 5            | Walkers and semantic checks.
Object model basics      | Engineering a Compiler Ch. 7            | Runtime model design.

Common Pitfalls & Debugging

Problem: “Functions see wrong variable values”

  • Why: environment captured at call time, not definition time
  • Fix: store defining environment inside function object
  • Quick test: nested closures returning counters

Problem: “Return does not exit function”

  • Why: return is treated like a normal expression node
  • Fix: implement a return signal or long-jump style unwind
  • Quick test: fun f(){ return 1; print 2; }

Problem: “Stack overflow on recursion”

  • Why: interpreter uses host recursion for each node
  • Fix: consider iterative evaluation for heavy recursion, or add a recursion limit
  • Quick test: run a function that recurses tens of thousands of levels deep and verify graceful failure or a limit

Definition of Done

  • Programs with variables, functions, and closures execute correctly
  • Blocks and scopes resolve names correctly
  • Runtime errors include line/column and helpful context
  • At least 20 language tests pass (covering control flow and functions)
  • Debug mode prints AST and execution trace for one sample program

Project 3: Bytecode Virtual Machine

  • File: COMPILERS_INTERPRETERS_LEARNING_PROJECTS.md
  • Programming Language: C
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Virtual Machines
  • Software or Tool: Bytecode VM
  • Main Book: “Crafting Interpreters” by Bob Nystrom

What you’ll build: A stack-based VM that executes bytecode instructions, with a compiler that transforms AST → bytecode. Include a disassembler to visualize the bytecode.

Why it teaches compilers/interpreters: Tree-walking is slow. Production language implementations (CPython, Ruby, Lua, the JVM) execute bytecode instead. This project teaches you why that’s faster and how instruction dispatch works.

Core challenges you’ll face:

  • Designing an instruction set (what opcodes do you need?) (IR design)
  • Implementing the bytecode compiler that linearizes your AST (code generation)
  • Building a dispatch loop that executes instructions efficiently (runtime systems)
  • Managing a value stack and call frames (runtime memory model)

Learning milestones:

  1. Simple expressions compile and run - you understand bytecode encoding
  2. Control flow works - you understand jump instructions and patching
  3. Functions work - you understand call frames and the call stack
  4. Performance improves over tree-walking - you understand why bytecode is faster

Real World Outcome

You can disassemble bytecode, execute it step-by-step, and watch the stack evolve as instructions run.

$ ./vm samples/arith.lox --trace
== disassembly of <script> ==
0000    1 OP_CONSTANT    0 '10'
0002    | OP_CONSTANT    1 '20'
0004    | OP_ADD
0005    | OP_PRINT
0006    2 OP_RETURN

stack: [10]
stack: [10, 20]
stack: [30]
30

The Core Question You’re Answering

“How do real-world interpreters execute programs quickly without compiling to native code?”

Concepts You Must Understand First

  1. Stack machine execution model
    • How do push/pop instructions implement expressions?
    • Book Reference: Crafting Interpreters Part III
  2. Bytecode encoding
    • How do you represent constants and instructions compactly?
    • Book Reference: Engineering a Compiler Ch. 8
  3. Control flow with jumps
    • How do you patch jump offsets after parsing?
    • Book Reference: Engineering a Compiler Ch. 6
  4. Call frames
    • How does a VM represent function calls and returns?
    • Book Reference: Engineering a Compiler Ch. 7

Questions to Guide Your Design

  1. What is the minimal instruction set for expressions, variables, and control flow?
  2. How will you encode constants: constant pool + operand or inline?
  3. How will the VM keep track of call frames and local slots?
  4. Where will you add tracing or debugging hooks?

Thinking Exercise

Design bytecode for this expression and show the stack after each instruction:

(1 + 2) * (3 - 4)

The Interview Questions They’ll Ask

  1. Why does a bytecode VM outperform a tree-walk interpreter?
  2. What is a stack machine and how does it evaluate expressions?
  3. How do you implement control flow with jump instructions?
  4. What is the tradeoff between stack-based and register-based VMs?
  5. How do call frames differ from the native call stack?

Hints in Layers

Hint 1: Start with a tiny instruction set. Implement constants, add, subtract, print, and return.

Hint 2: Build a disassembler early. It will save you days of debugging.

Hint 3: Store locals in fixed slots. Index locals by slot number to avoid name lookups at runtime.

Hint 4: Add a trace flag. Print the stack state after each opcode to validate semantics.
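
A sketch of Hint 1’s tiny instruction set with the dispatch loop that executes it. The opcode names and the plain double Value are illustrative; a real VM adds call frames, a growable constant pool, and runtime error checks:

#include <stdio.h>

typedef double Value;
enum { OP_CONST, OP_ADD, OP_SUB, OP_PRINT, OP_RETURN };

/* OP_CONST is followed by one operand byte indexing into the constants array. */
void vm_run(const unsigned char *code, const Value *constants) {
    Value stack[256], *top = stack;
    for (const unsigned char *ip = code; ; ) {
        switch (*ip++) {
            case OP_CONST:  *top++ = constants[*ip++];            break;
            case OP_ADD:    top[-2] = top[-2] + top[-1]; top--;   break;
            case OP_SUB:    top[-2] = top[-2] - top[-1]; top--;   break;
            case OP_PRINT:  printf("%g\n", *--top);               break;
            case OP_RETURN: return;
        }
    }
}

/* "print 10 + 20" as bytecode: */
int main(void) {
    unsigned char code[] = { OP_CONST, 0, OP_CONST, 1, OP_ADD, OP_PRINT, OP_RETURN };
    Value constants[]    = { 10, 20 };
    vm_run(code, constants);        /* prints 30 */
    return 0;
}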

Books That Will Help

Topic                 | Book & Chapter                  | Why This Matters
Bytecode VM design    | Crafting Interpreters Part III  | Full VM architecture.
Instruction sets      | Engineering a Compiler Ch. 8    | IR to low-level instruction mapping.
Control flow lowering | Engineering a Compiler Ch. 6    | Jump and branch construction.
Runtime call frames   | Engineering a Compiler Ch. 7    | Frames + stack layout.

Common Pitfalls & Debugging

Problem: “Jumps land in the wrong place”

  • Why: patching offsets before the target location is known
  • Fix: store patch locations and backfill after code emission
  • Quick test: compile a loop and verify disassembly addresses

Problem: “Stack underflow”

  • Why: opcode pops more values than it pushes
  • Fix: add debug assertions on stack depth
  • Quick test: run a program with nested operations and trace stack

Problem: “Function locals overwrite each other”

  • Why: incorrect frame base pointer or slot indexing
  • Fix: track frame->slots base and compute local slot addresses
  • Quick test: nested function calls with locals of same name

Definition of Done

  • Disassembler prints correct opcodes and constants
  • Bytecode executes sample programs correctly
  • Trace mode shows valid stack evolution
  • Functions and closures (if included) run correctly
  • Performance is meaningfully better than tree-walk baseline

Project 4: Compiler to x86-64 Assembly

  • File: COMPILERS_INTERPRETERS_LEARNING_PROJECTS.md
  • Programming Language: C
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Compilers / Architecture
  • Software or Tool: x86-64 Assembly
  • Main Book: “Writing a C Compiler” by Nora Sandler

What you’ll build: A compiler that takes a C subset (integers, arithmetic, functions, control flow) and produces actual x86-64 assembly that runs natively.

Why it teaches compilers/interpreters: This is where you confront the real machine. You must understand calling conventions, register allocation, stack frames, and how assembly executes. The ABI rules are what make your code callable by the OS and other programs.

Core challenges you’ll face:

  • Generating correct assembly for expressions using registers (instruction selection)
  • Implementing the System V AMD64 calling convention (ABI understanding)
  • Allocating registers efficiently without running out (register allocation)
  • Generating correct code for control flow (if/else, loops) (code generation)

Learning milestones:

  1. Arithmetic expressions compile - you understand instruction selection
  2. Functions work - you understand calling conventions and stack frames
  3. Control flow works - you understand conditional jumps and labels
  4. The binary runs - you’ve created a real executable from source code

Real World Outcome

You generate x86-64 assembly, assemble and link it, and run a native binary produced by your own compiler.

$ ./mycc factorial.c -o factorial.s
$ clang factorial.s -o factorial
$ ./factorial
120
$ objdump -d factorial | sed -n '1,20p'
0000000000001139 <main>:
1139: 55                   push   %rbp
113a: 48 89 e5             mov    %rsp,%rbp
...

The Core Question You’re Answering

“How do high-level constructs become real CPU instructions that obey a platform ABI?”

Concepts You Must Understand First

  1. Basic x86-64 instruction set
    • How do mov, add, and cmp work?
    • Book Reference: Computer Systems: A Programmer’s Perspective Ch. 3
  2. Calling conventions (System V AMD64)
    • Which registers pass arguments and return values?
    • Book Reference: Writing a C Compiler Ch. 9
  3. Stack frames
    • How are locals stored and aligned?
    • Book Reference: Engineering a Compiler Ch. 6
  4. Control flow lowering
    • How do if and while become jumps?
    • Book Reference: Writing a C Compiler Ch. 8

Questions to Guide Your Design

  1. Which expressions can be compiled with registers only, and when do you spill?
  2. How will you represent temporary values during codegen?
  3. How will you handle function prologue/epilogue correctly?
  4. How will you generate labels for branches and loops?

Thinking Exercise

Manually lower this to assembly (pseudo-assembly is fine) and mark the stack layout:

int add(int a, int b) { return a + b; }

The Interview Questions They’ll Ask

  1. What is a calling convention and why does it matter?
  2. How do you allocate registers in a simple compiler?
  3. How does an if statement turn into machine code?
  4. What does stack alignment mean on x86-64?
  5. Why is instruction selection not the same as parsing?

Hints in Layers

Hint 1: Emit a tiny subset first. Start with integer literals, addition, and return.

Hint 2: Hardcode a simple calling convention. Implement only the first six integer arguments in registers.

Hint 3: Build a debug flag. Emit comments next to the assembly so you can see which AST node generated it.

Hint 4: Compare to clang. Compile the same input with clang -S -O0 and diff the assembly.
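
A sketch of Hint 3’s annotated output: helpers that emit a System V-style prologue and epilogue in the AT&T syntax that clang and objdump print, with a comment on each line. The function names and the simple rounding rule are illustrative:

#include <stdio.h>

/* Keep rsp 16-byte aligned across calls, as the System V AMD64 ABI requires:
   rsp sits at 16n+8 on entry, the push brings it back to 16n, and locals are
   rounded up to a multiple of 16. */
void emit_prologue(FILE *out, const char *name, int local_bytes) {
    int aligned = (local_bytes + 15) & ~15;
    fprintf(out, "%s:\n", name);
    fprintf(out, "    pushq %%rbp           # save caller's frame pointer\n");
    fprintf(out, "    movq  %%rsp, %%rbp    # establish this frame\n");
    if (aligned > 0)
        fprintf(out, "    subq  $%d, %%rsp      # reserve space for locals\n", aligned);
}

void emit_epilogue(FILE *out) {
    fprintf(out, "    movq  %%rbp, %%rsp    # free locals\n");
    fprintf(out, "    popq  %%rbp           # restore caller's frame pointer\n");
    fprintf(out, "    retq\n");
}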

Books That Will Help

Topic               | Book & Chapter                                       | Why This Matters
x86-64 basics       | Computer Systems: A Programmer’s Perspective Ch. 3   | Machine-level instructions.
Calling conventions | Writing a C Compiler Ch. 9                           | ABI rules for calls.
Code generation     | Engineering a Compiler Ch. 11                        | Instruction selection strategies.
Register allocation | Engineering a Compiler Ch. 13                        | Avoiding spills and clobbers.

Common Pitfalls & Debugging

Problem: “Segfault in function call”

  • Why: incorrect prologue/epilogue or stack misalignment
  • Fix: follow the System V AMD64 ABI exactly
  • Quick test: call a function with 7 integer args and compare to clang

Problem: “Wrong result only in nested expressions”

  • Why: register clobber or temp overwrites
  • Fix: implement a simple register allocator or spill strategy
  • Quick test: compile (a+b)*(c+d) and compare to reference output

Problem: “Loop never ends”

  • Why: wrong conditional jump or compare order
  • Fix: add debug comments and verify condition inversion
  • Quick test: compile a for loop and inspect generated labels

Definition of Done

  • Correct assembly for arithmetic and control flow
  • Functions obey System V AMD64 calling conventions
  • Stack alignment matches ABI requirements
  • Generated binaries run correctly for at least 10 test programs
  • Debug mode emits annotated assembly

Project 5: LLVM Frontend for a Custom Language

  • File: COMPILERS_INTERPRETERS_LEARNING_PROJECTS.md
  • Programming Language: C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Compilers
  • Software or Tool: LLVM
  • Main Book: “Engineering a Compiler” by Cooper & Torczon

What you’ll build: A compiler frontend that parses your language and emits LLVM IR, letting LLVM handle optimization and code generation for multiple architectures. LLVM is designed as a reusable optimizer and code generator, which is why this is the industry path.

Why it teaches compilers/interpreters: This is how modern compilers work: Clang, Swift’s compiler, and Rust’s rustc each build a frontend and rely on LLVM for the backend. LLVM IR is SSA-based and shared across optimization phases.

Core challenges you’ll face:

  • Learning LLVM IR and its type system (IR design)
  • Using the LLVM C API or bindings to emit IR programmatically (code generation)
  • Implementing SSA form (all values assigned exactly once) (optimization)
  • Connecting to LLVM’s optimization and codegen pipeline (backend integration)

Learning milestones:

  1. Hello World compiles via LLVM - you understand the LLVM pipeline
  2. Functions and control flow work - you understand LLVM IR structure
  3. Optimizations kick in - you see -O2 transform your code
  4. Cross-compile to different targets - you understand LLVM’s architecture

Real World Outcome

You compile a small program written in your language into LLVM IR, run it with lli, and emit a native binary for different targets.

$ ./mylangc samples/fib.ml --emit-llvm -o fib.ll
$ lli fib.ll
55
$ ./mylangc samples/fib.ml -o fib_native
$ ./fib_native
55

The Core Question You’re Answering

“How do modern compilers leverage LLVM to turn a custom frontend into production-grade binaries?”

Concepts You Must Understand First

  1. LLVM IR basics
    • What do functions, blocks, and instructions look like?
    • Book Reference: Engineering a Compiler Ch. 9
  2. SSA form
    • Why is each value assigned once?
    • Book Reference: Engineering a Compiler Ch. 9
  3. IR generation APIs
    • How do you build IR with LLVM builders?
    • Book Reference: LLVM Kaleidoscope Tutorial (official docs)
  4. Optimization pipeline
    • What does -O2 actually do to your IR?
    • Book Reference: Engineering a Compiler Ch. 10

Questions to Guide Your Design

  1. How will your AST map to LLVM types and instructions?
  2. How will you represent control flow: conditional branches and phi nodes?
  3. How will you handle runtime support for your language (printing, strings)?
  4. How will you validate and dump IR for debugging?

Thinking Exercise

Write the LLVM IR (or outline it) for:

func max(a, b) { if (a > b) return a; else return b; }

The Interview Questions They’ll Ask

  1. Why is LLVM IR in SSA form?
  2. What is the difference between your frontend and LLVM’s backend?
  3. How do phi nodes work in SSA?
  4. What are the benefits of using LLVM instead of writing your own backend?
  5. How do optimizations preserve correctness?

Hints in Layers

Hint 1: Start with expression-only programs. Emit IR for literals and arithmetic before control flow.

Hint 2: Use the verifier. Run llvm::verifyModule after each generation step.

Hint 3: Print IR early and often. Readable IR is the fastest debugging tool.

Hint 4: Compare with clang -S -emit-llvm. Compile a similar program and diff the IR structure.
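
A sketch of Hints 2 and 3 using the LLVM C API (LLVMVerifyModule is the C counterpart of llvm::verifyModule). It builds and verifies a trivial add function; a real frontend would drive the builder from your AST, and you would link against LLVM (for example via llvm-config --cflags --libs core analysis):

#include <llvm-c/Core.h>
#include <llvm-c/Analysis.h>

int main(void) {
    LLVMModuleRef mod = LLVMModuleCreateWithName("demo");

    /* declare: i32 @add(i32, i32) */
    LLVMTypeRef i32 = LLVMInt32Type();
    LLVMTypeRef params[] = { i32, i32 };
    LLVMValueRef fn = LLVMAddFunction(mod, "add", LLVMFunctionType(i32, params, 2, 0));

    /* entry block: return a + b */
    LLVMBasicBlockRef entry = LLVMAppendBasicBlock(fn, "entry");
    LLVMBuilderRef b = LLVMCreateBuilder();
    LLVMPositionBuilderAtEnd(b, entry);
    LLVMValueRef sum = LLVMBuildAdd(b, LLVMGetParam(fn, 0), LLVMGetParam(fn, 1), "sum");
    LLVMBuildRet(b, sum);

    /* Hint 2: run the verifier after every generation step */
    char *error = NULL;
    LLVMVerifyModule(mod, LLVMPrintMessageAction, &error);
    LLVMDisposeMessage(error);

    LLVMDumpModule(mod);            /* Hint 3: print the IR early and often */
    LLVMDisposeBuilder(b);
    LLVMDisposeModule(mod);
    return 0;
}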

Books That Will Help

Topic                | Book & Chapter                            | Why This Matters
LLVM IR fundamentals | Engineering a Compiler Ch. 9              | SSA and IR semantics.
SSA and IR           | Engineering a Compiler Ch. 9              | Enables analysis and optimization.
Optimization passes  | Engineering a Compiler Ch. 10             | Connects to LLVM pipeline.
Frontend design      | Language Implementation Patterns Ch. 8-9  | Clean AST to IR mapping.

Common Pitfalls & Debugging

Problem: “Verifier rejects my module”

  • Why: malformed control flow or mismatched types
  • Fix: check every basic block ends with a terminator; ensure phi inputs match
  • Quick test: run llvm-as or opt -verify on emitted IR

Problem: “Segfault in lli”

  • Why: missing runtime support or invalid pointer math
  • Fix: add runtime stubs for I/O and validate pointer types
  • Quick test: run a program that only uses integer math first

Problem: “Optimizations change output”

  • Why: undefined behavior or incorrect IR assumptions
  • Fix: avoid UB and use explicit types/flags
  • Quick test: diff outputs with -O0 and -O2

Definition of Done

  • IR generation passes LLVM verifier
  • Programs compile and run with lli and native output
  • Control flow and function calls work correctly
  • At least 5 sample programs work across -O0 and -O2
  • IR dumps show SSA correctness (phi nodes where needed)

Project Comparison Table

Project                       | Difficulty            | Time       | Depth of Understanding   | Fun Factor
Expression Evaluator          | Beginner              | Weekend    | ⭐⭐ Entry-level          | ⭐⭐⭐ Quick wins
Full Interpreter (Lox/Monkey) | Intermediate          | 2-3 weeks  | ⭐⭐⭐⭐ Core concepts     | ⭐⭐⭐⭐⭐ Magical
Bytecode VM                   | Intermediate-Advanced | 2-4 weeks  | ⭐⭐⭐⭐ Runtime systems   | ⭐⭐⭐⭐ Satisfying
x86-64 Compiler               | Advanced              | 1-2 months | ⭐⭐⭐⭐⭐ Full depth       | ⭐⭐⭐⭐⭐ Mind-blowing
LLVM Frontend                 | Advanced              | 1-2 months | ⭐⭐⭐⭐ Production skills | ⭐⭐⭐⭐ Powerful

Recommendation

Start with Project 2 (Full Interpreter) using Bob Nystrom’s “Crafting Interpreters” as your guide.

Here’s why:

  1. It’s the sweet spot - Complex enough to teach real concepts, achievable enough to finish
  2. Immediate feedback - You can run programs in YOUR language within days
  3. Foundation for everything - Every concept (lexing, parsing, scopes, closures) carries forward
  4. Free, excellent resource - The entire book is free online

After completing it, you’ll have the foundation to tackle either:

  • Project 3 if you want to understand performance and bytecode
  • Project 4 if you want to understand the machine and native compilation

Final Capstone Project: Self-Hosting Compiler

What you’ll build: A compiler for a language written in that same language. The compiler compiles itself.

Why it teaches compilers/interpreters: Self-hosting is the ultimate test. Your compiler must be correct enough and complete enough to compile its own source code. This forces you to implement every feature you need, handle every edge case, and produce efficient enough code that compilation completes in reasonable time.

Core challenges you’ll face:

  • Designing a language expressive enough to write a compiler in (language design)
  • Bootstrapping: first compile with another compiler, then with yourself (build systems)
  • Debugging compiler bugs when the compiler itself might be buggy (systematic debugging)
  • Optimizing compilation speed since you’ll compile your compiler thousands of times (performance engineering)

Learning milestones:

  1. Compiler compiles simple programs - foundation works
  2. Compiler compiles a subset of itself - bootstrapping begins
  3. Compiler fully compiles itself - self-hosting achieved
  4. Stage 2 and Stage 3 binaries are identical - compiler is correct

Real World Outcome

You compile your compiler with itself, compare the resulting binaries, and prove it can sustain its own evolution.

$ ./mycompiler stage0.src -o stage1
$ ./stage1 stage1.src -o stage2
$ ./stage2 stage1.src -o stage3
$ shasum stage2 stage3
<same hash>

The Core Question You’re Answering

“Can I build a language and compiler robust enough to compile themselves reliably?”

Concepts You Must Understand First

  1. Bootstrapping
    • How do you go from an initial compiler to a self-hosted one?
    • Book Reference: Engineering a Compiler Ch. 1
  2. Compiler testing
    • How do you validate correctness beyond unit tests?
    • Book Reference: Writing a C Compiler Ch. 20
  3. Language design trade-offs
    • Which features are essential for implementing a compiler?
    • Book Reference: Language Implementation Patterns Ch. 1
  4. Trusting trust
    • Why compiler provenance matters for security
    • Reference: Ken Thompson’s Turing Award lecture

Questions to Guide Your Design

  1. What is the minimal feature set needed to implement the compiler itself?
  2. How will you keep the compiler fast enough to recompile frequently?
  3. How will you detect codegen or optimizer regressions?
  4. How will you stage the migration from stage0 to stage1?

Thinking Exercise

List the language features you need to implement a compiler and justify each. Which ones can you postpone?

The Interview Questions They’ll Ask

  1. What does it mean for a compiler to be self-hosting?
  2. How do you validate that your compiler is correct?
  3. What is bootstrapping and why is it hard?
  4. How do you design a language so it can implement its own compiler?
  5. What is a “trusting trust” attack and how do you mitigate it?

Hints in Layers

Hint 1: Start with a stage0 compiler. Use another language to build the first version.

Hint 2: Keep the core language small. Every feature you add must be implemented and tested twice.

Hint 3: Add a deterministic build mode. Stable output helps compare binaries reliably.

Hint 4: Use differential testing. Compile the same program with stage1 and stage2 and compare outputs.

Books That Will Help

Topic                    | Book & Chapter                          | Why This Matters
Bootstrapping            | Engineering a Compiler Ch. 1            | How compilers evolve.
Compiler testing         | Writing a C Compiler Ch. 20             | Test strategies for compilers.
Language design          | Language Implementation Patterns Ch. 1  | Feature trade-offs.
Secure compiler thinking | Reflections on Trusting Trust (essay)   | Security implications.

Common Pitfalls & Debugging

Problem: “Stage2 differs from Stage3”

  • Why: nondeterministic codegen or unstable symbol ordering
  • Fix: enforce deterministic sorting and stable output
  • Quick test: compile twice and compare hashes

Problem: “Compiler crashes on its own source”

  • Why: missing feature used by the compiler itself
  • Fix: identify minimal feature set and refactor
  • Quick test: compile a reduced version of the compiler first

Problem: “Compilation time explodes”

  • Why: quadratic passes or naive optimizations
  • Fix: add profiling and optimize hot passes
  • Quick test: measure compile time as source size grows

Definition of Done

  • Stage0 compiles Stage1 successfully
  • Stage1 compiles Stage2 and Stage3 with identical hashes
  • Compiler can rebuild itself in a clean environment
  • Regression tests cover all language features used by the compiler
  • Build is deterministic across runs

This journey will transform how you see every piece of software. Every programming language, every tool, every runtime - you’ll understand how they work because you’ve built them yourself.