Project 10: C++ Compiler Frontend
Build a complete compiler frontend for a subset of C++ (or your own language) including lexical analysis, parsing, AST construction, semantic analysis, and LLVM IR code generation.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Master |
| Time Estimate | 3-6 months |
| Language | C++ |
| Prerequisites | All previous projects, theory of computation, LLVM basics |
| Key Topics | Lexing, parsing, AST, type checking, symbol tables, LLVM IR |
Table of Contents
- 1. Learning Objectives
- 2. Theoretical Foundation
- 3. Project Specification
- 4. Solution Architecture
- 5. Implementation Guide
- 6. Testing Strategy
- 7. Common Pitfalls & Debugging
- 8. Extensions & Challenges
- 9. Real-World Connections
- 10. Resources
- 11. Self-Assessment Checklist
- 12. Submission / Completion Criteria
1. Learning Objectives
By completing this project, you will:
- Master lexical analysis: Convert source code into tokens using finite automata
- Implement recursive descent parsing: Build AST from tokens using context-free grammars
- Design AST node hierarchies: Create type-safe, visitable syntax trees
- Build symbol tables: Manage scopes, declarations, and name resolution
- Implement type checking: Verify type compatibility and perform inference
- Generate intermediate representation: Produce LLVM IR or your own bytecode
- Understand compiler optimization: Learn how frontends enable backend optimization
- Debug language implementations: Develop techniques for diagnosing compiler bugs
2. Theoretical Foundation
2.1 Core Concepts
Compilation Pipeline
+-----------------------------------------------------------------------+
| COMPILATION PIPELINE |
+-----------------------------------------------------------------------+
| |
| Source Code |
| | |
| v |
| +------------------+ |
| | Lexical Analysis | "fn main() { let x = 5; }" |
| | (Lexer) | --> [FN, IDENT("main"), LPAREN, RPAREN, ...] |
| +------------------+ |
| | |
| v |
| +------------------+ |
| | Syntax Analysis | Tokens --> Abstract Syntax Tree |
| | (Parser) | |
| +------------------+ FnDecl |
| | / \ |
| | name body |
| | "main" BlockStmt |
| | | |
| v VarDecl |
| +------------------+ / \ |
| | Semantic Analysis| "x" IntLit(5) |
| | (Type Checking) | |
| +------------------+ |
| | |
| v |
| +------------------+ |
| | IR Generation | AST --> LLVM IR / Bytecode |
| +------------------+ |
| | |
| v |
| +------------------+ |
| | Optimization | (LLVM handles this) |
| +------------------+ |
| | |
| v |
| +------------------+ |
| | Code Generation | IR --> Machine Code |
| +------------------+ |
| | |
| v |
| Executable |
| |
+-----------------------------------------------------------------------+
Lexical Analysis
The lexer converts character streams into tokens:
+-----------------------------------------------------------------------+
| LEXICAL ANALYSIS |
+-----------------------------------------------------------------------+
| |
| Input: "let x = 42 + y;" |
| |
| Output: [ |
| Token { type: LET, lexeme: "let", line: 1, col: 1 }, |
| Token { type: IDENTIFIER, lexeme: "x", line: 1, col: 5 }, |
| Token { type: EQUAL, lexeme: "=", line: 1, col: 7 }, |
| Token { type: INTEGER, lexeme: "42", line: 1, col: 9 }, |
| Token { type: PLUS, lexeme: "+", line: 1, col: 12 }, |
| Token { type: IDENTIFIER, lexeme: "y", line: 1, col: 14 }, |
| Token { type: SEMICOLON, lexeme: ";", line: 1, col: 15 }, |
| Token { type: EOF, lexeme: "", line: 1, col: 16 } |
| ] |
| |
| Lexer State Machine (simplified for identifiers): |
| |
| [a-zA-Z_] |
| (Start) ---------> (Identifier) ------+ |
| ^ | [^a-zA-Z0-9_] |
| | v |
| [a-zA-Z0-9_] (Emit IDENTIFIER) |
| |
| Token Types: |
| - Keywords: fn, let, if, else, while, return, struct, true, false |
| - Identifiers: [a-zA-Z_][a-zA-Z0-9_]* |
| - Literals: integers, floats, strings, characters |
| - Operators: +, -, *, /, ==, !=, <, >, <=, >=, &&, ||, ! |
| - Delimiters: (, ), {, }, [, ], ,, ;, :, -> |
| |
+-----------------------------------------------------------------------+
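The state machine above can be turned into a hand-written scanner loop. The following is a minimal sketch, not the full lexer from Section 4.2: the cut-down `Tok` struct and `lex` function are illustrative names, and only identifiers, integer literals, and a few single-character operators are handled, with line/column tracking as described.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Cut-down token for the sketch; the full compiler uses the richer
// Token struct from Section 4.2.
struct Tok {
    enum Type { IDENT, INTEGER, PLUS, EQUAL, SEMI, END } type;
    std::string lexeme;
    int line, col;
};

// Hand-written scanner: each loop iteration emits one token,
// advancing the line/column counters as characters are consumed.
std::vector<Tok> lex(const std::string& src) {
    std::vector<Tok> out;
    int line = 1, col = 1;
    size_t i = 0;
    while (i < src.size()) {
        char c = src[i];
        int startCol = col;
        if (c == '\n') { ++line; col = 1; ++i; continue; }
        if (std::isspace((unsigned char)c)) { ++col; ++i; continue; }
        if (std::isalpha((unsigned char)c) || c == '_') {
            // [a-zA-Z_][a-zA-Z0-9_]* : stay in the Identifier state
            size_t start = i;
            while (i < src.size() &&
                   (std::isalnum((unsigned char)src[i]) || src[i] == '_')) {
                ++i; ++col;
            }
            out.push_back({Tok::IDENT, src.substr(start, i - start), line, startCol});
        } else if (std::isdigit((unsigned char)c)) {
            size_t start = i;
            while (i < src.size() && std::isdigit((unsigned char)src[i])) { ++i; ++col; }
            out.push_back({Tok::INTEGER, src.substr(start, i - start), line, startCol});
        } else {
            // Single-character operators/delimiters (a real lexer covers
            // multi-character operators like == and -> with one lookahead).
            Tok::Type t = c == '+' ? Tok::PLUS : c == '=' ? Tok::EQUAL : Tok::SEMI;
            out.push_back({t, std::string(1, c), line, startCol});
            ++i; ++col;
        }
    }
    out.push_back({Tok::END, "", line, col});
    return out;
}
```

Keywords are typically handled as a post-step: scan the run of characters as an identifier, then look the lexeme up in a keyword table to decide between IDENTIFIER and, say, FN or LET.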
Recursive Descent Parsing
Each grammar rule becomes a parsing function:
+-----------------------------------------------------------------------+
| RECURSIVE DESCENT PARSING |
+-----------------------------------------------------------------------+
| |
| Grammar (EBNF-like): |
| |
| program = declaration* |
| declaration = fnDecl | varDecl | statement |
| fnDecl = "fn" IDENT "(" params? ")" ("->" type)? block |
| varDecl = "let" IDENT (":" type)? ("=" expr)? ";" |
| statement = exprStmt | ifStmt | whileStmt | returnStmt | block |
| block = "{" declaration* "}" |
| exprStmt = expr ";" |
| ifStmt = "if" expr block ("else" block)? |
| whileStmt = "while" expr block |
| returnStmt = "return" expr? ";" |
| |
| Expression Precedence (lowest to highest): |
| assignment = IDENT "=" assignment | logic_or |
| logic_or = logic_and ("||" logic_and)* |
| logic_and = equality ("&&" equality)* |
| equality = comparison (("==" | "!=") comparison)* |
| comparison = term (("<" | ">" | "<=" | ">=") term)* |
| term = factor (("+" | "-") factor)* |
| factor = unary (("*" | "/") unary)* |
| unary = ("!" | "-") unary | call |
| call = primary ("(" args? ")" | "." IDENT)* |
| primary = INTEGER | FLOAT | STRING | "true" | "false" |
| | IDENT | "(" expr ")" |
| |
| Parsing Code Pattern: |
| |
| unique_ptr<Expr> Parser::parseTerm() { |
| auto left = parseFactor(); |
| while (match(PLUS) || match(MINUS)) { |
| Token op = previous(); |
| auto right = parseFactor(); |
| left = make_unique<BinaryExpr>(move(left), op, move(right)); |
| } |
| return left; |
| } |
| |
+-----------------------------------------------------------------------+
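The parseTerm pattern above generalizes to the whole precedence ladder: each level calls the next-tighter level for its operands. As a self-contained illustration (a toy `MiniParser` over single-digit integers that evaluates directly instead of building AST nodes; all names here are hypothetical, not part of the project's API), here is how nesting the functions encodes precedence:

```cpp
#include <cassert>
#include <string>

// Toy recursive descent parser/evaluator for single-digit integers,
// '+', '*', and parentheses. factor binds tighter than term because
// parseTerm uses parseFactor for its operands -- exactly the grammar's
// nesting in the table above.
struct MiniParser {
    std::string src;
    size_t pos = 0;

    char peek() { return pos < src.size() ? src[pos] : '\0'; }
    bool match(char c) {               // check-and-consume, like the
        if (peek() == c) { ++pos; return true; }  // parser's match()
        return false;
    }

    // primary = DIGIT | "(" term ")"
    long parsePrimary() {
        if (match('(')) {
            long v = parseTerm();
            match(')');  // a real parser would report a missing ')'
            return v;
        }
        return src[pos++] - '0';
    }

    // factor = primary ("*" primary)*
    long parseFactor() {
        long left = parsePrimary();
        while (match('*')) left *= parsePrimary();
        return left;
    }

    // term = factor ("+" factor)*
    long parseTerm() {
        long left = parseFactor();
        while (match('+')) left += parseFactor();
        return left;
    }
};

long evalExpr(const std::string& s) { return MiniParser{s}.parseTerm(); }
```

Replacing the arithmetic with `make_unique<BinaryExpr>(...)` turns this evaluator into the real parser: the call structure, and therefore the precedence, stays identical.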
Abstract Syntax Tree
+-----------------------------------------------------------------------+
| ABSTRACT SYNTAX TREE |
+-----------------------------------------------------------------------+
| |
| Source: fn factorial(n: int) -> int { |
| if n <= 1 { return 1; } |
| return n * factorial(n - 1); |
| } |
| |
| AST: |
| |
| FnDecl |
| / | \ |
| name params body |
| | | \ |
| "factorial" | BlockStmt |
| | / \ |
| Param("n", int) statements |
| / \ |
| IfStmt ReturnStmt |
| / | \ | |
| cond then else BinaryExpr(*) |
| | | null / \ |
| BinaryExpr(<=) | VarExpr CallExpr |
| / \ | "n" / \ |
| VarExpr IntLit ReturnStmt "factorial" args |
| "n" 1 | | |
| IntLit(1) BinaryExpr(-) |
| / \ |
| VarExpr IntLit |
| "n" 1 |
| |
| Node Base Class: |
| struct ASTNode { |
| SourceLocation location; |
| virtual ~ASTNode() = default; |
| virtual void accept(Visitor& v) = 0; |
| }; |
| |
+-----------------------------------------------------------------------+
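The accept method in the base class enables the Visitor pattern that the later phases rely on. The following sketch shows the double dispatch with a cut-down two-node hierarchy (the `IntLit`/`AddExpr`/`Evaluator` names are illustrative, not the full AST from Section 4.2):

```cpp
#include <cassert>
#include <memory>

struct IntLit;
struct AddExpr;

// Visitor interface: one visit overload per concrete node type.
struct Visitor {
    virtual ~Visitor() = default;
    virtual void visit(IntLit&) = 0;
    virtual void visit(AddExpr&) = 0;
};

struct Node {
    virtual ~Node() = default;
    virtual void accept(Visitor& v) = 0;
};

struct IntLit : Node {
    long value;
    explicit IntLit(long v) : value(v) {}
    void accept(Visitor& v) override { v.visit(*this); }  // dispatches to visit(IntLit&)
};

struct AddExpr : Node {
    std::unique_ptr<Node> left, right;
    AddExpr(std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : left(std::move(l)), right(std::move(r)) {}
    void accept(Visitor& v) override { v.visit(*this); }  // dispatches to visit(AddExpr&)
};

// One concrete visitor: evaluates the tree. The type checker and code
// generator follow the same shape, accumulating a Type* or an IR value
// instead of a long.
struct Evaluator : Visitor {
    long result = 0;
    void visit(IntLit& n) override { result = n.value; }
    void visit(AddExpr& n) override {
        n.left->accept(*this);
        long l = result;
        n.right->accept(*this);
        result = l + result;
    }
};

long eval(Node& root) {
    Evaluator e;
    root.accept(e);
    return e.result;
}
```

The payoff of the pattern is that adding a new pass (printer, checker, codegen) means writing one new Visitor subclass, with no changes to the node classes themselves.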
Symbol Tables and Scoping
+-----------------------------------------------------------------------+
| SYMBOL TABLES AND SCOPING |
+-----------------------------------------------------------------------+
| |
| Source: |
| fn main() { |
| let x = 10; // x declared in main's scope |
| { |
| let y = 20; // y declared in inner scope |
| let x = 30; // shadows outer x |
| print(x + y); // 30 + 20 = 50 |
| } |
| print(x); // 10 (inner x is gone) |
| } |
| |
| Scope Chain: |
| |
| +-- Global Scope -------------------------------------------------+ |
| | main: FnType | |
| | print: FnType(int) -> void | |
| | | |
| | +-- main() Scope -------------------------------------------+ | |
| | | x: int = 10 | | |
| | | | | |
| | | +-- Inner Block Scope --------------------------------+ | | |
| | | | y: int = 20 | | | |
| | | | x: int = 30 (shadows outer x) | | | |
| | | +-----------------------------------------------------+ | | |
| | +------------------------------------------------------------+ | |
| +------------------------------------------------------------------+ |
| |
| Lookup Algorithm: |
| 1. Check current scope |
| 2. If not found, check parent scope |
| 3. Continue until global scope |
| 4. If still not found, report "undefined variable" error |
| |
| Symbol Entry: |
| struct Symbol { |
| string name; |
| Type* type; |
| SymbolKind kind; // Variable, Function, Type, Parameter |
| bool isMutable; |
| SourceLocation declLocation; |
| }; |
| |
+-----------------------------------------------------------------------+
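The lookup algorithm above translates directly into a parent-chain walk. This is a simplified model, using string type tags instead of the `Type*`/`Symbol` machinery detailed in Section 4.3:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Simplified scope: maps a name to a type tag. resolve() walks the
// parent chain, so an inner declaration shadows an outer one with the
// same name -- exactly the behavior in the source example above.
struct Scope {
    std::unordered_map<std::string, std::string> symbols;
    Scope* parent = nullptr;

    bool declare(const std::string& name, const std::string& type) {
        // Redeclaration in the *same* scope is rejected; shadowing an
        // outer scope's name is fine.
        return symbols.emplace(name, type).second;
    }

    const std::string* resolve(const std::string& name) const {
        for (const Scope* s = this; s; s = s->parent) {
            auto it = s->symbols.find(name);
            if (it != s->symbols.end()) return &it->second;  // step 1-3
        }
        return nullptr;  // step 4: "undefined variable" error
    }
};
```

Note that resolve finds the *innermost* matching declaration because the walk starts at the current scope, which is what makes shadowing work without any extra mechanism.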
2.2 Why This Matters
Building a compiler frontend teaches you:
- How programming languages work: From syntax to semantics
- Software engineering at scale: Compilers are complex, well-structured systems
- Understanding optimizations: Know what makes code fast
- Language design tradeoffs: Why languages are designed the way they are
Industry relevance:
- Compiler engineers are highly valued (Google, Apple, Meta, Microsoft)
- Understanding compilers improves code quality
- DSLs (Domain-Specific Languages) use compiler techniques
- IDE features (autocomplete, refactoring) rely on frontend analysis
2.3 Historical Context
1950s: First compilers (FORTRAN by Backus)
1960s: Formal language theory (Chomsky hierarchy)
1970s: Parser generators (yacc, lex)
1980s: Retargetable compilers (GCC, 1987), optimization theory
1990s: JIT compilation (Self, Java HotSpot)
2000s: LLVM project begins
2010s: Clang becomes production-ready
2020s: Rust, Swift, Julia use LLVM
Key milestones:
- FORTRAN (1957): First optimizing compiler
- ALGOL (1960): Introduced BNF grammar notation
- C (1972): Influenced by compiler simplicity
- LLVM (2000): Modern modular compiler infrastructure
- Clang (2007): C/C++ frontend for LLVM
2.4 Common Misconceptions
| Misconception | Reality |
|---|---|
| “Compilers are too complex to build” | A simple language frontend is ~3000 lines |
| “You need a CS degree” | Compiler construction is self-teachable with the right resources |
| “Parser generators are required” | Hand-written parsers are often cleaner |
| “Type checking is the hard part” | It’s systematic once you understand it |
| “LLVM is too complex” | The IR is well-documented and learnable |
3. Project Specification
3.1 What You Will Build
A compiler frontend for a simple typed language featuring:
- Lexer: Tokenizes source code with error recovery
- Parser: Builds AST using recursive descent
- Type System: Static typing with inference
- Semantic Analyzer: Name resolution, type checking
- Code Generator: LLVM IR or tree-walking interpreter
3.2 Functional Requirements
| Requirement | Description |
|---|---|
| FR-1 | Tokenize source with keywords, identifiers, literals, operators |
| FR-2 | Parse expressions with correct precedence and associativity |
| FR-3 | Parse statements: let, if, while, return, blocks |
| FR-4 | Parse functions with typed parameters and return types |
| FR-5 | Build scoped symbol tables for name resolution |
| FR-6 | Type check expressions and statements |
| FR-7 | Generate executable code (LLVM IR or interpreter) |
| FR-8 | Report errors with source location and helpful messages |
3.3 Non-Functional Requirements
| Requirement | Description |
|---|---|
| NFR-1 | Clear error messages with line/column numbers |
| NFR-2 | Continue parsing after syntax errors (error recovery) |
| NFR-3 | Compile 10,000 lines in under 1 second |
| NFR-4 | Memory-safe AST with no leaks |
| NFR-5 | Extensible architecture for new features |
3.4 Example Usage / Output
Source file (test.my):
fn factorial(n: int) -> int {
if n <= 1 {
return 1;
}
return n * factorial(n - 1);
}
fn main() -> int {
let result = factorial(5);
print(result);
return 0;
}
Compilation output:
$ ./mycompiler test.my
[Lexer] 47 tokens
[Parser] AST constructed
[Semantic] Type checking complete
[Codegen] LLVM IR generated
[LLVM] Compiling to native...
[Done] Output: test
$ ./test
120
Error reporting:
$ ./mycompiler bad.my
error[E001]: type mismatch in binary expression
--> bad.my:5:12
|
5 | return x + "hello";
| ^~~~~~~~~~
| expected 'int', found 'string'
|
help: cannot add 'int' and 'string'
error[E002]: undefined variable 'y'
--> bad.my:8:5
|
8 | y = 10;
| ^
| variable 'y' was not declared in this scope
Interpreter mode:
$ ./mycompiler test.my --interpret
120
3.5 Real World Outcome
After completing this project, you will have:
- Working compiler: Compiles source to executable
- Deep understanding: How code becomes machine instructions
- Transferable skills: Applicable to DSLs, query languages, configs
- Portfolio piece: Impressive for compiler or language teams
You should be able to:
- Read and understand production compiler source code
- Design new language features
- Implement syntax highlighters and linters
- Build domain-specific languages
4. Solution Architecture
4.1 High-Level Design
+-----------------------------------------------------------------------+
| COMPILER ARCHITECTURE |
+-----------------------------------------------------------------------+
| |
| +------------------------------------------------------------------+ |
| | Driver | |
| | - Command-line parsing | |
| | - File I/O | |
| | - Error reporting | |
| +------------------------------------------------------------------+ |
| | |
| +----------------------+----------------------+ |
| v v v |
| +----------+ +----------+ +----------+ |
| | Lexer | --> | Parser | --> | Semantic | |
| | | tokens | | AST | Analyzer | |
| +----------+ +----------+ +----------+ |
| | |
| v |
| +--------------------------+ |
| | | |
| v v |
| +----------+ +----------+ |
| | Codegen | |Interpreter| |
| | (LLVM) | | (Tree) | |
| +----------+ +----------+ |
| | |
| v |
| Executable |
| |
+-----------------------------------------------------------------------+
4.2 Key Components
Token Types
enum class TokenType {
// Literals
INTEGER, FLOAT, STRING, CHAR,
// Identifiers and keywords
IDENTIFIER,
FN, LET, IF, ELSE, WHILE, FOR, RETURN,
STRUCT, ENUM, IMPL, TRAIT,
TRUE, FALSE, NULL_LIT,
// Types
INT, FLOAT_TYPE, BOOL, VOID, STRING_TYPE,
// Operators
PLUS, MINUS, STAR, SLASH, PERCENT,
EQUAL, EQUAL_EQUAL, BANG_EQUAL,
LESS, LESS_EQUAL, GREATER, GREATER_EQUAL,
AMPERSAND_AMPERSAND, PIPE_PIPE, BANG,
// Delimiters
LPAREN, RPAREN, LBRACE, RBRACE, LBRACKET, RBRACKET,
COMMA, SEMICOLON, COLON, ARROW, DOT,
// Special
END_OF_FILE, ERROR
};
struct Token {
TokenType type;
std::string lexeme;
std::variant<int64_t, double, std::string> literal;
int line;
int column;
};
AST Node Hierarchy
// Base classes
struct ASTNode {
SourceLocation loc;
virtual ~ASTNode() = default;
virtual void accept(ASTVisitor& visitor) = 0;
};
struct Expr : ASTNode {
Type* resolvedType = nullptr; // Set during type checking
};
struct Stmt : ASTNode {};
struct Decl : ASTNode {};
// Expression nodes
struct IntLiteral : Expr {
int64_t value;
};
struct FloatLiteral : Expr {
double value;
};
struct StringLiteral : Expr {
std::string value;
};
struct BoolLiteral : Expr {
bool value;
};
struct VarExpr : Expr {
std::string name;
Symbol* resolvedSymbol = nullptr;
};
struct BinaryExpr : Expr {
std::unique_ptr<Expr> left;
Token op;
std::unique_ptr<Expr> right;
};
struct UnaryExpr : Expr {
Token op;
std::unique_ptr<Expr> operand;
};
struct CallExpr : Expr {
std::unique_ptr<Expr> callee;
std::vector<std::unique_ptr<Expr>> arguments;
};
struct AssignExpr : Expr {
std::string name;
std::unique_ptr<Expr> value;
};
// Statement nodes
struct ExprStmt : Stmt {
std::unique_ptr<Expr> expression;
};
struct BlockStmt : Stmt {
std::vector<std::unique_ptr<Stmt>> statements;
};
struct IfStmt : Stmt {
std::unique_ptr<Expr> condition;
std::unique_ptr<Stmt> thenBranch;
std::unique_ptr<Stmt> elseBranch; // nullable
};
struct WhileStmt : Stmt {
std::unique_ptr<Expr> condition;
std::unique_ptr<Stmt> body;
};
struct ReturnStmt : Stmt {
std::unique_ptr<Expr> value; // nullable
};
// Declaration nodes
struct VarDecl : Decl {
std::string name;
Type* typeAnnotation; // nullable (inference)
std::unique_ptr<Expr> initializer;
bool isMutable;
};
struct FnDecl : Decl {
std::string name;
std::vector<Parameter> parameters;
Type* returnType;
std::unique_ptr<BlockStmt> body;
};
struct Parameter {
std::string name;
Type* type;
};
4.3 Data Structures
Type System
// Type representation
enum class TypeKind {
PRIMITIVE, // int, float, bool, void
FUNCTION, // (args) -> return
STRUCT, // { field: type, ... }
ARRAY, // [type]
POINTER, // *type
REFERENCE // &type
};
struct Type {
TypeKind kind;
virtual ~Type() = default;
virtual bool equals(const Type& other) const = 0;
virtual std::string toString() const = 0;
};
struct PrimitiveType : Type {
enum Primitive { INT, FLOAT, BOOL, VOID, STRING };
Primitive primitive;
};
struct FunctionType : Type {
std::vector<Type*> paramTypes;
Type* returnType;
};
struct StructType : Type {
std::string name;
std::vector<std::pair<std::string, Type*>> fields;
};
// Type context (owns all types, ensures uniqueness)
class TypeContext {
std::vector<std::unique_ptr<Type>> types;
public:
PrimitiveType* getIntType();
PrimitiveType* getFloatType();
PrimitiveType* getBoolType();
PrimitiveType* getVoidType();
FunctionType* getFunctionType(std::vector<Type*> params, Type* ret);
};
Symbol Table
struct Symbol {
std::string name;
Type* type;
enum Kind { VARIABLE, FUNCTION, PARAMETER, TYPE };
Kind kind;
SourceLocation declLocation;
bool isInitialized = false;
bool isMutable = true;
};
class Scope {
std::unordered_map<std::string, Symbol> symbols;
Scope* parent;
public:
Scope(Scope* parent = nullptr) : parent(parent) {}
bool declare(const std::string& name, Symbol sym);
Symbol* lookup(const std::string& name); // Current scope only
Symbol* resolve(const std::string& name); // Up scope chain
};
class SymbolTable {
std::vector<std::unique_ptr<Scope>> scopes;
Scope* current;
public:
void pushScope();
void popScope();
bool declare(const std::string& name, Symbol sym);
Symbol* resolve(const std::string& name);
};
4.4 Algorithm Overview
Type Checking Algorithm
+-----------------------------------------------------------------------+
| TYPE CHECKING ALGORITHM |
+-----------------------------------------------------------------------+
| |
| For each node in AST (post-order traversal): |
| |
| IntLiteral -> type = int |
| FloatLiteral -> type = float |
| BoolLiteral -> type = bool |
| StringLiteral -> type = string |
| |
| VarExpr -> lookup in symbol table, type = symbol.type |
| |
| BinaryExpr: |
| 1. Check left and right types |
| 2. Verify operator valid for types |
| 3. Determine result type |
| +, -, *, / : (int, int) -> int, (float, float) -> float |
| <, >, <=, >=: (numeric, numeric) -> bool |
| ==, != : (T, T) -> bool |
| &&, || : (bool, bool) -> bool |
| |
| UnaryExpr: |
| ! : bool -> bool |
| - : numeric -> numeric |
| |
| CallExpr: |
| 1. Get function type |
| 2. Check argument count matches |
| 3. Check each argument type matches parameter type |
| 4. Result type = function's return type |
| |
| AssignExpr: |
| 1. Lookup variable |
| 2. Check mutability |
| 3. Check value type matches variable type |
| |
| IfStmt: |
| 1. Condition must be bool |
| 2. Check then/else branches |
| |
| WhileStmt: |
| 1. Condition must be bool |
| 2. Check body |
| |
| ReturnStmt: |
| 1. Get enclosing function's return type |
| 2. Check value type matches (or void if no value) |
| |
| VarDecl: |
| 1. If type annotation, use it |
| 2. Else infer from initializer |
| 3. Add to symbol table |
| |
| FnDecl: |
| 1. Create function type |
| 2. Add to symbol table |
| 3. Push scope, add parameters |
| 4. Check body |
| 5. Verify all paths return (or void) |
| |
+-----------------------------------------------------------------------+
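The BinaryExpr rules above translate almost line for line into code. A sketch using a plain `Ty` enum instead of the Type hierarchy from Section 4.3, and an ERROR marker where a real checker would emit a located diagnostic (the `checkBinary` name is illustrative):

```cpp
#include <cassert>
#include <string>

enum class Ty { INT, FLOAT, BOOL, STRING, ERROR };

bool isNumeric(Ty t) { return t == Ty::INT || t == Ty::FLOAT; }

// Result type of `left op right`, mirroring the table above:
//   + - * /   : (int,int)->int, (float,float)->float
//   < > <= >= : (numeric,numeric)->bool, operands must match
//   == !=     : (T,T)->bool
//   && ||     : (bool,bool)->bool
Ty checkBinary(const std::string& op, Ty left, Ty right) {
    if (op == "+" || op == "-" || op == "*" || op == "/") {
        if (left == right && isNumeric(left)) return left;
        return Ty::ERROR;  // e.g. int + string
    }
    if (op == "<" || op == ">" || op == "<=" || op == ">=") {
        if (left == right && isNumeric(left)) return Ty::BOOL;
        return Ty::ERROR;
    }
    if (op == "==" || op == "!=") {
        return left == right ? Ty::BOOL : Ty::ERROR;
    }
    if (op == "&&" || op == "||") {
        return (left == Ty::BOOL && right == Ty::BOOL) ? Ty::BOOL : Ty::ERROR;
    }
    return Ty::ERROR;  // unknown operator
}
```

In the real checker this function becomes the visitBinaryExpr method: check both children first (the post-order traversal), then apply these rules and store the result in the node's resolvedType field.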
5. Implementation Guide
5.1 Development Environment Setup
# Install LLVM (optional, for code generation)
# Ubuntu/Debian
sudo apt install llvm-15-dev clang-15
# macOS
brew install llvm@15
# Create project
mkdir -p my-compiler/{src,include,tests}
cd my-compiler
# CMakeLists.txt
cat > CMakeLists.txt << 'EOF'
cmake_minimum_required(VERSION 3.16)
project(MyCompiler CXX)
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
# Optional LLVM
find_package(LLVM 15 CONFIG)
if(LLVM_FOUND)
message(STATUS "Found LLVM ${LLVM_PACKAGE_VERSION}")
add_definitions(-DHAS_LLVM)
include_directories(${LLVM_INCLUDE_DIRS})
link_directories(${LLVM_LIBRARY_DIRS})
endif()
add_executable(mycompiler
src/main.cpp
src/lexer.cpp
src/parser.cpp
src/ast.cpp
src/types.cpp
src/semantic.cpp
src/codegen.cpp
src/interpreter.cpp
)
if(LLVM_FOUND)
llvm_map_components_to_libnames(LLVM_LIBS core native)
target_link_libraries(mycompiler ${LLVM_LIBS})
endif()
EOF
5.2 Project Structure
my-compiler/
├── CMakeLists.txt
├── include/
│ ├── lexer.hpp
│ ├── token.hpp
│ ├── parser.hpp
│ ├── ast.hpp
│ ├── types.hpp
│ ├── semantic.hpp
│ ├── codegen.hpp
│ ├── interpreter.hpp
│ └── error.hpp
├── src/
│ ├── main.cpp
│ ├── lexer.cpp
│ ├── parser.cpp
│ ├── ast.cpp
│ ├── types.cpp
│ ├── semantic.cpp
│ ├── codegen.cpp
│ └── interpreter.cpp
├── tests/
│ ├── lexer_test.cpp
│ ├── parser_test.cpp
│ ├── semantic_test.cpp
│ └── integration/
│ ├── factorial.my
│ ├── fibonacci.my
│ └── ...
└── examples/
├── hello.my
└── ...
5.3 The Core Question You’re Answering
How do you transform human-readable source code into executable machine instructions while catching errors and preserving meaning?
This encompasses:
- Tokenizing unstructured text
- Building structured representations
- Verifying semantic correctness
- Translating to target representation
5.4 Concepts You Must Understand First
Before implementing, verify you can answer:
- What is the Chomsky hierarchy and where do programming languages fit?
- Reference: “Engineering a Compiler” Chapter 3
- How does recursive descent parsing work?
- Reference: “Crafting Interpreters” Chapter 6
- What is the difference between syntax and semantics?
- Reference: “Types and Programming Languages” Chapter 1
- How do scopes and closures work?
- Reference: “Engineering a Compiler” Chapter 5
- What is LLVM IR and how is it structured?
- Reference: LLVM Language Reference Manual
5.5 Questions to Guide Your Design
Lexer Design:
- How will you handle Unicode in identifiers and strings?
- How will you track line and column numbers?
- How will you handle unterminated strings or invalid characters?
Parser Design:
- How will you handle operator precedence?
- How will you recover from syntax errors?
- How will you represent optional AST children?
Type System:
- How will you represent function types?
- How will you handle type inference?
- How will you report type errors helpfully?
Code Generation:
- How will you handle local variables in LLVM IR?
- How will you implement control flow?
- How will you call external functions (like print)?
5.6 Thinking Exercise
Before coding, trace through compilation of:
fn max(a: int, b: int) -> int {
if a > b {
return a;
}
return b;
}
Draw:
- Token stream
- AST structure
- Symbol table after parsing
- Type annotations on AST nodes
- LLVM IR (or pseudocode)
5.7 Hints in Layers
Hint 1 - Getting Started: Start with the lexer. Get it fully working before moving to the parser. Test with simple inputs first.
Hint 2 - Parser Structure: Each grammar rule becomes a method. Parse expressions bottom-up (highest precedence first), statements top-down.
Hint 3 - Type Checking: Use the Visitor pattern. Each AST node type has a corresponding check method that returns the node’s type.
Hint 4 - LLVM IR: Study the Kaleidoscope tutorial. Create a module, function, basic blocks, then emit instructions.
5.8 The Interview Questions They’ll Ask
- “Explain the phases of compilation.”
- Expected: Lexing, parsing, semantic analysis, optimization, code generation
- “How do you implement operator precedence?”
- Expected: Pratt parsing or precedence climbing in recursive descent
- “What is the difference between static and dynamic typing?”
- Expected: Static checked at compile time, dynamic at runtime
- “How does name resolution work with nested scopes?”
- Expected: Scope chain, shadowing, capture for closures
- “What is SSA form and why is it useful?”
- Expected: Static Single Assignment; each variable is assigned exactly once, which simplifies dataflow analysis and enables optimizations
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Lexing/Parsing | “Crafting Interpreters” by Robert Nystrom | Parts 1-2 |
| Formal Theory | “Engineering a Compiler” by Cooper & Torczon | Chapters 2-4 |
| Type Systems | “Types and Programming Languages” by Pierce | Chapters 1-11 |
| Semantic Analysis | “Engineering a Compiler” by Cooper & Torczon | Chapters 5-6 |
| Code Generation | LLVM Tutorial (Kaleidoscope) | All chapters |
| Optimization | “Engineering a Compiler” by Cooper & Torczon | Chapters 8-10 |
5.10 Implementation Phases
Phase 1: Lexer (Week 1-2)
- Token types and structure
- Character stream processing
- Keyword recognition
- Literal parsing (int, float, string)
- Error handling
Phase 2: Parser (Week 3-5)
- AST node definitions
- Recursive descent for expressions
- Statement parsing
- Function declarations
- Error recovery (synchronization)
Phase 3: Semantic Analysis (Week 6-8)
- Symbol table implementation
- Scope management
- Type checking
- Name resolution
- Error reporting
Phase 4: Code Generation (Week 9-12)
- LLVM setup or interpreter
- Expression evaluation
- Control flow
- Function calls
- Runtime library
Phase 5: Polish (Week 13-16)
- Better error messages
- More language features
- Optimization
- Testing suite
5.11 Key Implementation Decisions
| Decision | Options | Recommendation |
|---|---|---|
| Backend | LLVM, Interpreter, Custom bytecode | Start with interpreter, add LLVM |
| AST ownership | unique_ptr, shared_ptr, arena | unique_ptr for simplicity |
| Error handling | Exceptions, Result type | Return error type for parser |
| Type representation | Inheritance, variant | Inheritance with TypeKind enum |
6. Testing Strategy
Unit Tests
// Lexer tests
TEST(Lexer, TokenizesKeywords) {
Lexer lexer("fn let if else while return");
EXPECT_EQ(lexer.nextToken().type, TokenType::FN);
EXPECT_EQ(lexer.nextToken().type, TokenType::LET);
EXPECT_EQ(lexer.nextToken().type, TokenType::IF);
EXPECT_EQ(lexer.nextToken().type, TokenType::ELSE);
EXPECT_EQ(lexer.nextToken().type, TokenType::WHILE);
EXPECT_EQ(lexer.nextToken().type, TokenType::RETURN);
}
TEST(Lexer, TokenizesNumbers) {
Lexer lexer("42 3.14 0xFF");
auto t1 = lexer.nextToken();
EXPECT_EQ(t1.type, TokenType::INTEGER);
EXPECT_EQ(std::get<int64_t>(t1.literal), 42);
auto t2 = lexer.nextToken();
EXPECT_EQ(t2.type, TokenType::FLOAT);
EXPECT_DOUBLE_EQ(std::get<double>(t2.literal), 3.14);
}
// Parser tests
TEST(Parser, ParsesBinaryExpr) {
Parser parser("1 + 2 * 3");
auto expr = parser.parseExpression();
// Should be 1 + (2 * 3), not (1 + 2) * 3
auto* binOp = dynamic_cast<BinaryExpr*>(expr.get());
ASSERT_NE(binOp, nullptr);
EXPECT_EQ(binOp->op.type, TokenType::PLUS);
auto* right = dynamic_cast<BinaryExpr*>(binOp->right.get());
ASSERT_NE(right, nullptr);
EXPECT_EQ(right->op.type, TokenType::STAR);
}
// Type checker tests
TEST(TypeChecker, TypeMismatch) {
auto ast = parse("let x: int = \"hello\";");
TypeChecker checker;
auto errors = checker.check(ast);
EXPECT_EQ(errors.size(), 1);
EXPECT_NE(errors[0].message.find("type mismatch"), std::string::npos);  // std::string::contains requires C++23
}
Integration Tests
#!/bin/bash
# test_programs.sh
for file in tests/integration/*.my; do
expected="${file%.my}.expected"
if ./mycompiler "$file" --interpret > /tmp/output.txt 2>&1; then
if diff -q /tmp/output.txt "$expected" > /dev/null; then
echo "PASS: $file"
else
echo "FAIL: $file (output mismatch)"
diff /tmp/output.txt "$expected"
fi
else
echo "FAIL: $file (compilation error)"
fi
done
7. Common Pitfalls & Debugging
| Problem | Symptom | Root Cause | Fix |
|---|---|---|---|
| Left recursion | Stack overflow | Grammar has left recursion | Rewrite as iteration |
| Dangling else | Wrong parse | Ambiguous grammar | Bind else to the nearest unmatched if |
| Off-by-one | Wrong location | Line/col tracking error | Test with multi-line input |
| Memory leak | Growing memory | AST nodes not freed | Use unique_ptr |
| Type confusion | Wrong codegen | AST type field not set | Check after semantic pass |
Debugging Techniques
// AST pretty printer
class ASTPrinter : public ASTVisitor {
int indent = 0;
void print(const std::string& s) {
std::cout << std::string(indent * 2, ' ') << s << '\n';
}
public:
void visitBinaryExpr(BinaryExpr& e) override {
print("BinaryExpr " + e.op.lexeme);
indent++;
e.left->accept(*this);
e.right->accept(*this);
indent--;
}
// ... other visit methods
};
// Use: ASTPrinter printer; ast->accept(printer);
8. Extensions & Challenges
After completing the base project:
- Structs and methods: User-defined types with associated functions
- Generics: Type parameters like fn identity<T>(x: T) -> T
- Closures: Anonymous functions that capture variables
- Pattern matching: match expressions
- Modules: Import system with namespaces
- Macros: Hygienic compile-time code generation
- Incremental compilation: Only recompile changed files
- LSP server: IDE support (autocomplete, go-to-definition)
- Debugging info: DWARF emission for gdb/lldb
- Optimization passes: Dead code elimination, constant folding
9. Real-World Connections
| Your Implementation | Production Compiler |
|---|---|
| Hand-written lexer | Clang’s lexer |
| Recursive descent | GCC, Clang, Rust |
| Symbol tables | All compilers |
| Type checking | TypeScript, Rust, Swift |
| LLVM IR | Clang, Rust, Swift, Julia |
Industry Examples:
- Clang/LLVM: C/C++ frontend used by Apple, Google
- Rust: Advanced type system with borrow checker
- TypeScript: Type inference and structural typing
- Swift: Protocol-oriented with value semantics
10. Resources
Primary Resources
- Crafting Interpreters - Free online book
- LLVM Kaleidoscope Tutorial - Official LLVM tutorial
- Rust Compiler Dev Guide - Real compiler internals
Academic Resources
- “Engineering a Compiler” by Cooper & Torczon
- “Types and Programming Languages” by Benjamin Pierce
- “Modern Compiler Implementation in ML” by Appel
11. Self-Assessment Checklist
Before considering this project complete:
- Can you explain the difference between lexical and syntactic analysis?
- Can you trace token stream for any input?
- Can you draw AST for complex expressions?
- Can you explain how scoping works?
- Can you describe the type checking algorithm?
- Does your compiler produce correct output for all test cases?
- Do error messages include line/column numbers?
- Can you add a new operator in under 30 minutes?
12. Submission / Completion Criteria
Your project is complete when:
- Lexer Is Complete
- Recognizes all token types
- Handles edge cases (unterminated strings, etc.)
- Reports errors with location
- Parser Is Complete
- Parses all language constructs
- Builds correct AST
- Recovers from errors
- Semantic Analysis Works
- Name resolution finds all symbols
- Type checking catches errors
- All AST nodes have types
- Code Generation Works
- Produces executable code
- Handles all expressions and statements
- Functions call correctly
- Test Suite Passes
- Unit tests for each phase
- Integration tests for complete programs
- Error case tests
Deliverables:
- Source code with clear organization
- Test suite with good coverage
- Example programs demonstrating features
- Documentation of language syntax