Project 10: C++ Compiler Frontend

Build a complete compiler frontend for a subset of C++ (or your own language) including lexical analysis, parsing, AST construction, semantic analysis, and LLVM IR code generation.


Quick Reference

Attribute      Value
Difficulty     Master
Time Estimate  3-6 months
Language       C++
Prerequisites  All previous projects, theory of computation, LLVM basics
Key Topics     Lexing, parsing, AST, type checking, symbol tables, LLVM IR

1. Learning Objectives

By completing this project, you will:

  1. Master lexical analysis: Convert source code into tokens using finite automata
  2. Implement recursive descent parsing: Build AST from tokens using context-free grammars
  3. Design AST node hierarchies: Create type-safe, visitable syntax trees
  4. Build symbol tables: Manage scopes, declarations, and name resolution
  5. Implement type checking: Verify type compatibility and perform inference
  6. Generate intermediate representation: Produce LLVM IR or your own bytecode
  7. Understand compiler optimization: Learn how frontends enable backend optimization
  8. Debug language implementations: Develop techniques for diagnosing compiler bugs

2. Theoretical Foundation

2.1 Core Concepts

Compilation Pipeline

+-----------------------------------------------------------------------+
|                      COMPILATION PIPELINE                              |
+-----------------------------------------------------------------------+
|                                                                        |
|  Source Code                                                           |
|       |                                                                |
|       v                                                                |
|  +------------------+                                                  |
|  | Lexical Analysis |  "fn main() { let x = 5; }"                      |
|  |    (Lexer)       |  --> [FN, IDENT("main"), LPAREN, RPAREN, ...]   |
|  +------------------+                                                  |
|       |                                                                |
|       v                                                                |
|  +------------------+                                                  |
|  | Syntax Analysis  |  Tokens --> Abstract Syntax Tree                 |
|  |    (Parser)      |                                                  |
|  +------------------+            FnDecl                                |
|       |                         /     \                                |
|       |                      name    body                              |
|       |                     "main"   BlockStmt                         |
|       |                               |                                |
|       v                             VarDecl                            |
|  +------------------+              /    \                              |
|  | Semantic Analysis|           "x"    IntLit(5)                       |
|  | (Type Checking)  |                                                  |
|  +------------------+                                                  |
|       |                                                                |
|       v                                                                |
|  +------------------+                                                  |
|  | IR Generation    |  AST --> LLVM IR / Bytecode                      |
|  +------------------+                                                  |
|       |                                                                |
|       v                                                                |
|  +------------------+                                                  |
|  | Optimization     |  (LLVM handles this)                             |
|  +------------------+                                                  |
|       |                                                                |
|       v                                                                |
|  +------------------+                                                  |
|  | Code Generation  |  IR --> Machine Code                             |
|  +------------------+                                                  |
|       |                                                                |
|       v                                                                |
|  Executable                                                            |
|                                                                        |
+-----------------------------------------------------------------------+

Lexical Analysis

The lexer converts character streams into tokens:

+-----------------------------------------------------------------------+
|                      LEXICAL ANALYSIS                                  |
+-----------------------------------------------------------------------+
|                                                                        |
|  Input: "let x = 42 + y;"                                              |
|                                                                        |
|  Output: [                                                             |
|    Token { type: LET,        lexeme: "let", line: 1, col: 1 },         |
|    Token { type: IDENTIFIER, lexeme: "x",   line: 1, col: 5 },         |
|    Token { type: EQUAL,      lexeme: "=",   line: 1, col: 7 },         |
|    Token { type: INTEGER,    lexeme: "42",  line: 1, col: 9 },         |
|    Token { type: PLUS,       lexeme: "+",   line: 1, col: 12 },        |
|    Token { type: IDENTIFIER, lexeme: "y",   line: 1, col: 14 },        |
|    Token { type: SEMICOLON,  lexeme: ";",   line: 1, col: 15 },        |
|    Token { type: EOF,        lexeme: "",    line: 1, col: 16 }         |
|  ]                                                                     |
|                                                                        |
|  Lexer State Machine (simplified for identifiers):                     |
|                                                                        |
|        [a-zA-Z_]                                                       |
|  (Start) ---------> (Identifier) ------+                               |
|                          ^             | [^a-zA-Z0-9_]                 |
|                          |             v                               |
|                     [a-zA-Z0-9_]   (Emit IDENTIFIER)                   |
|                                                                        |
|  Token Types:                                                          |
|  - Keywords: fn, let, if, else, while, return, struct, true, false     |
|  - Identifiers: [a-zA-Z_][a-zA-Z0-9_]*                                  |
|  - Literals: integers, floats, strings, characters                     |
|  - Operators: +, -, *, /, ==, !=, <, >, <=, >=, &&, ||, !              |
|  - Delimiters: (, ), {, }, [, ], ,, ;, :, ->                           |
|                                                                        |
+-----------------------------------------------------------------------+
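
To make the state machine above concrete, here is a minimal sketch of identifier and keyword scanning. It assumes the Token and TokenType definitions given later in Section 4.2; the function shape and names are illustrative, not prescribed.

#include <cctype>
#include <string>
#include <unordered_map>

// Sketch: scan one identifier-or-keyword starting at src[pos].
// The while loop is the self-loop of the "Identifier" state above.
Token scanIdentifier(const std::string& src, size_t& pos, int line, int& col) {
    static const std::unordered_map<std::string, TokenType> keywords = {
        {"fn", TokenType::FN},       {"let", TokenType::LET},
        {"if", TokenType::IF},       {"else", TokenType::ELSE},
        {"while", TokenType::WHILE}, {"return", TokenType::RETURN},
        {"true", TokenType::TRUE},   {"false", TokenType::FALSE},
    };

    size_t start = pos;
    int startCol = col;
    while (pos < src.size() &&
           (std::isalnum(static_cast<unsigned char>(src[pos])) || src[pos] == '_')) {
        ++pos;
        ++col;
    }
    std::string lexeme = src.substr(start, pos - start);
    auto it = keywords.find(lexeme);
    TokenType type = (it != keywords.end()) ? it->second : TokenType::IDENTIFIER;
    return Token{type, lexeme, {}, line, startCol};
}

A single keyword-table lookup after scanning the whole word is simpler (and usually faster) than baking every keyword into the state machine by hand.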

Recursive Descent Parsing

Each grammar rule becomes a parsing function:

+-----------------------------------------------------------------------+
|                    RECURSIVE DESCENT PARSING                           |
+-----------------------------------------------------------------------+
|                                                                        |
|  Grammar (EBNF-like):                                                  |
|                                                                        |
|  program     = declaration*                                            |
|  declaration = fnDecl | varDecl | statement                            |
|  fnDecl      = "fn" IDENT "(" params? ")" ("->" type)? block           |
|  varDecl     = "let" IDENT (":" type)? ("=" expr)? ";"                 |
|  statement   = exprStmt | ifStmt | whileStmt | returnStmt | block      |
|  block       = "{" declaration* "}"                                    |
|  exprStmt    = expr ";"                                                |
|  ifStmt      = "if" expr block ("else" block)?                         |
|  whileStmt   = "while" expr block                                      |
|  returnStmt  = "return" expr? ";"                                      |
|                                                                        |
|  Expression Precedence (lowest to highest):                            |
|  assignment  = IDENT "=" assignment | logic_or                         |
|  logic_or    = logic_and ("||" logic_and)*                             |
|  logic_and   = equality ("&&" equality)*                               |
|  equality    = comparison (("==" | "!=") comparison)*                  |
|  comparison  = term (("<" | ">" | "<=" | ">=") term)*                  |
|  term        = factor (("+" | "-") factor)*                            |
|  factor      = unary (("*" | "/") unary)*                              |
|  unary       = ("!" | "-") unary | call                                |
|  call        = primary ("(" args? ")" | "." IDENT)*                    |
|  primary     = INTEGER | FLOAT | STRING | "true" | "false"             |
|              | IDENT | "(" expr ")"                                    |
|                                                                        |
|  Parsing Code Pattern:                                                 |
|                                                                        |
|  unique_ptr<Expr> Parser::parseTerm() {                                |
|      auto left = parseFactor();                                        |
|      while (match(PLUS) || match(MINUS)) {                             |
|          Token op = previous();                                        |
|          auto right = parseFactor();                                   |
|          left = make_unique<BinaryExpr>(move(left), op, move(right));  |
|      }                                                                 |
|      return left;                                                      |
|  }                                                                     |
|                                                                        |
+-----------------------------------------------------------------------+
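
The parseTerm pattern above leans on a few small token-stream helpers. One common way to write them, sketched here under the assumption that the Parser holds a std::vector<Token> tokens and a size_t current index (names are illustrative):

// Token-stream helpers used by the recursive descent pattern above.
bool Parser::check(TokenType type) const {
    return tokens[current].type == type;          // Peek at the current token
}

bool Parser::isAtEnd() const {
    return check(TokenType::END_OF_FILE);
}

const Token& Parser::previous() const {
    return tokens[current - 1];                   // Most recently consumed token
}

const Token& Parser::advance() {
    if (!isAtEnd()) ++current;                    // Consume one token
    return previous();
}

bool Parser::match(TokenType type) {
    if (check(type)) { advance(); return true; }  // Consume only on a match
    return false;
}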

Abstract Syntax Tree

+-----------------------------------------------------------------------+
|                    ABSTRACT SYNTAX TREE                                |
+-----------------------------------------------------------------------+
|                                                                        |
|  Source: fn factorial(n: int) -> int {                                 |
|              if n <= 1 { return 1; }                                   |
|              return n * factorial(n - 1);                              |
|          }                                                             |
|                                                                        |
|  AST:                                                                  |
|                                                                        |
|                      FnDecl                                            |
|                    /   |   \                                           |
|                  name params body                                      |
|                   |     |     \                                        |
|              "factorial" |   BlockStmt                                 |
|                         |    /      \                                  |
|                  Param("n", int)    statements                         |
|                                     /         \                        |
|                               IfStmt        ReturnStmt                 |
|                              /   |   \           |                     |
|                           cond  then  else   BinaryExpr(*)             |
|                            |     |    null    /       \                |
|                  BinaryExpr(<=)  |        VarExpr    CallExpr          |
|                   /      \       |          "n"     /      \           |
|               VarExpr  IntLit  ReturnStmt        "factorial" args      |
|                 "n"      1        |                          |         |
|                               IntLit(1)              BinaryExpr(-)     |
|                                                       /       \        |
|                                                   VarExpr   IntLit     |
|                                                     "n"       1        |
|                                                                        |
|  Node Base Class:                                                      |
|  struct ASTNode {                                                      |
|      SourceLocation location;                                          |
|      virtual ~ASTNode() = default;                                     |
|      virtual void accept(Visitor& v) = 0;                              |
|  };                                                                    |
|                                                                        |
+-----------------------------------------------------------------------+
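
The accept(Visitor&) hook is one half of the visitor pattern used throughout the rest of this guide. A minimal sketch of the other half, assuming the node types introduced in Section 4.2:

// Forward declarations of the concrete node types (see Section 4.2).
struct IntLiteral;
struct VarExpr;
struct BinaryExpr;

// One visit method per concrete node type.
struct ASTVisitor {
    virtual ~ASTVisitor() = default;
    virtual void visitIntLiteral(IntLiteral&) = 0;
    virtual void visitVarExpr(VarExpr&) = 0;
    virtual void visitBinaryExpr(BinaryExpr&) = 0;
    // ... one method per remaining node type
};

// Each concrete node implements accept by calling back into "its" method,
// e.g. inside BinaryExpr:
//     void accept(ASTVisitor& v) override { v.visitBinaryExpr(*this); }

This double dispatch lets the type checker, code generator, and pretty printer each live in their own class instead of being scattered across the node hierarchy.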

Symbol Tables and Scoping

+-----------------------------------------------------------------------+
|                    SYMBOL TABLES AND SCOPING                           |
+-----------------------------------------------------------------------+
|                                                                        |
|  Source:                                                               |
|  fn main() {                                                           |
|      let x = 10;        // x declared in main's scope                  |
|      {                                                                 |
|          let y = 20;    // y declared in inner scope                   |
|          let x = 30;    // shadows outer x                             |
|          print(x + y);  // 30 + 20 = 50                                |
|      }                                                                 |
|      print(x);          // 10 (inner x is gone)                        |
|  }                                                                     |
|                                                                        |
|  Scope Chain:                                                          |
|                                                                        |
|  +-- Global Scope -------------------------------------------------+   |
|  |  main: FnType                                                   |   |
|  |  print: FnType(int) -> void                                     |   |
|  |                                                                  |   |
|  |  +-- main() Scope -------------------------------------------+  |   |
|  |  |  x: int = 10                                              |  |   |
|  |  |                                                            |  |   |
|  |  |  +-- Inner Block Scope --------------------------------+  |  |   |
|  |  |  |  y: int = 20                                        |  |  |   |
|  |  |  |  x: int = 30  (shadows outer x)                     |  |  |   |
|  |  |  +-----------------------------------------------------+  |  |   |
|  |  +------------------------------------------------------------+  |   |
|  +------------------------------------------------------------------+   |
|                                                                        |
|  Lookup Algorithm:                                                     |
|  1. Check current scope                                                |
|  2. If not found, check parent scope                                   |
|  3. Continue until global scope                                        |
|  4. If still not found, report "undefined variable" error              |
|                                                                        |
|  Symbol Entry:                                                         |
|  struct Symbol {                                                       |
|      string name;                                                      |
|      Type* type;                                                       |
|      SymbolKind kind;  // Variable, Function, Type, Parameter          |
|      bool isMutable;                                                   |
|      SourceLocation declLocation;                                      |
|  };                                                                    |
|                                                                        |
+-----------------------------------------------------------------------+
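
The lookup algorithm above translates almost line for line into code. A sketch, assuming the Scope class declared in Section 4.3:

// Sketch of the scope-chain operations described above.
bool Scope::declare(const std::string& name, Symbol sym) {
    // Reject duplicates in the *same* scope; shadowing an outer scope's
    // name is allowed because only this scope's map is consulted.
    return symbols.emplace(name, std::move(sym)).second;
}

Symbol* Scope::lookup(const std::string& name) {
    auto it = symbols.find(name);
    return it != symbols.end() ? &it->second : nullptr;
}

Symbol* Scope::resolve(const std::string& name) {
    // Steps 1-3: walk outward from the current scope to the global scope.
    for (Scope* s = this; s != nullptr; s = s->parent) {
        if (Symbol* sym = s->lookup(name)) return sym;
    }
    return nullptr;  // Step 4: caller reports "undefined variable"
}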

2.2 Why This Matters

Building a compiler frontend teaches you:

  • How programming languages work: From syntax to semantics
  • Software engineering at scale: Compilers are complex, well-structured systems
  • Understanding optimizations: Know what makes code fast
  • Language design tradeoffs: Why languages are designed the way they are

Industry relevance:

  • Compiler engineers are highly valued (Google, Apple, Meta, Microsoft)
  • Understanding compilers improves code quality
  • DSLs (Domain-Specific Languages) use compiler techniques
  • IDE features (autocomplete, refactoring) rely on frontend analysis

2.3 Historical Context

1950s: First compilers (FORTRAN by Backus)
1960s: Formal language theory (Chomsky hierarchy)
1970s: Parser generators (yacc, lex)
1980s: Retargetable compilers (GCC), optimization theory matures
1990s: JIT compilation (Self, Java HotSpot)
2000s: LLVM project begins
2010s: Clang becomes production-ready
2020s: Rust, Swift, and Julia all build on LLVM

Key milestones:

  • FORTRAN (1957): First optimizing compiler
  • ALGOL (1960): Introduced BNF grammar notation
  • C (1972): Designed around a small, portable compiler
  • LLVM (2000): Modern modular compiler infrastructure
  • Clang (2007): C/C++ frontend for LLVM

2.4 Common Misconceptions

Misconception                           Reality
“Compilers are too complex to build”    A simple language frontend is ~3000 lines
“You need a CS degree”                  You can learn this on your own with the right resources
“Parser generators are required”        Hand-written parsers are often cleaner
“Type checking is the hard part”        It’s systematic once you understand it
“LLVM is too complex”                   The IR is well-documented and learnable

3. Project Specification

3.1 What You Will Build

A compiler frontend for a simple typed language featuring:

  1. Lexer: Tokenizes source code with error recovery
  2. Parser: Builds AST using recursive descent
  3. Type System: Static typing with inference
  4. Semantic Analyzer: Name resolution, type checking
  5. Code Generator: LLVM IR or tree-walking interpreter

3.2 Functional Requirements

Requirement  Description
FR-1         Tokenize source with keywords, identifiers, literals, operators
FR-2         Parse expressions with correct precedence and associativity
FR-3         Parse statements: let, if, while, return, blocks
FR-4         Parse functions with typed parameters and return types
FR-5         Build scoped symbol tables for name resolution
FR-6         Type check expressions and statements
FR-7         Generate executable code (LLVM IR or interpreter)
FR-8         Report errors with source location and helpful messages

3.3 Non-Functional Requirements

Requirement  Description
NFR-1        Clear error messages with line/column numbers
NFR-2        Continue parsing after syntax errors (error recovery)
NFR-3        Compile 10,000 lines in under 1 second
NFR-4        Memory-safe AST with no leaks
NFR-5        Extensible architecture for new features

3.4 Example Usage / Output

Source file (test.my):

fn factorial(n: int) -> int {
    if n <= 1 {
        return 1;
    }
    return n * factorial(n - 1);
}

fn main() -> int {
    let result = factorial(5);
    print(result);
    return 0;
}

Compilation output:

$ ./mycompiler test.my
[Lexer] 47 tokens
[Parser] AST constructed
[Semantic] Type checking complete
[Codegen] LLVM IR generated
[LLVM] Compiling to native...
[Done] Output: test

$ ./test
120

Error reporting:

$ ./mycompiler bad.my
error[E001]: type mismatch in binary expression
  --> bad.my:5:12
   |
 5 |     return x + "hello";
   |            ^~~~~~~~~~
   |            expected 'int', found 'string'
   |
help: cannot add 'int' and 'string'

error[E002]: undefined variable 'y'
  --> bad.my:8:5
   |
 8 |     y = 10;
   |     ^
   |     variable 'y' was not declared in this scope

Interpreter mode:

$ ./mycompiler test.my --interpret
120
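
Producing caret-style diagnostics like the ones above is mostly bookkeeping once every token and AST node carries a source location. A minimal sketch of a printer (the Diagnostic struct and its field names are illustrative, not part of the specification; it also assumes single-digit line numbers for brevity):

#include <iostream>
#include <string>

struct Diagnostic {
    std::string code;        // e.g. "E001"
    std::string message;     // e.g. "type mismatch in binary expression"
    std::string file;
    int line = 0;
    int column = 0;          // 1-based column of the offending token
    int length = 1;          // number of characters to underline
    std::string sourceLine;  // the offending line of source text
    std::string note;        // e.g. "expected 'int', found 'string'"
};

void report(const Diagnostic& d) {
    std::cerr << "error[" << d.code << "]: " << d.message << '\n'
              << "  --> " << d.file << ':' << d.line << ':' << d.column << '\n'
              << "   |\n"
              << ' ' << d.line << " | " << d.sourceLine << '\n'
              << "   | " << std::string(d.column - 1, ' ') << '^'
              << std::string(d.length > 1 ? d.length - 1 : 0, '~') << '\n'
              << "   | " << std::string(d.column - 1, ' ') << d.note << '\n';
}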

3.5 Real World Outcome

After completing this project, you will have:

  1. Working compiler: Compiles source to executable
  2. Deep understanding: How code becomes machine instructions
  3. Transferable skills: Applicable to DSLs, query languages, configs
  4. Portfolio piece: Impressive for compiler or language teams

You should be able to:

  • Read and understand production compiler source code
  • Design new language features
  • Implement syntax highlighters and linters
  • Build domain-specific languages

4. Solution Architecture

4.1 High-Level Design

+-----------------------------------------------------------------------+
|                    COMPILER ARCHITECTURE                               |
+-----------------------------------------------------------------------+
|                                                                        |
|  +------------------------------------------------------------------+  |
|  |                        Driver                                     |  |
|  |  - Command-line parsing                                          |  |
|  |  - File I/O                                                       |  |
|  |  - Error reporting                                                |  |
|  +------------------------------------------------------------------+  |
|                              |                                         |
|       +----------------------+----------------------+                  |
|       v                      v                      v                  |
|  +----------+          +----------+          +----------+              |
|  |  Lexer   |    -->   |  Parser  |    -->   | Semantic |              |
|  |          |  tokens  |          |   AST    | Analyzer |              |
|  +----------+          +----------+          +----------+              |
|                                                    |                   |
|                                                    v                   |
|                        +--------------------------+                    |
|                        |                          |                    |
|                        v                          v                    |
|                  +----------+              +----------+                |
|                  | Codegen  |              |Interpreter|               |
|                  | (LLVM)   |              |  (Tree)   |               |
|                  +----------+              +----------+                |
|                        |                                               |
|                        v                                               |
|                  Executable                                            |
|                                                                        |
+-----------------------------------------------------------------------+
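
A sketch of the Driver tying these stages together. The class and method names (tokenize, parseProgram, check, printDiagnostics, emit) are placeholders for whatever interfaces you design; only the order of the phases matters here.

#include <fstream>
#include <iostream>
#include <sstream>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: mycompiler <file>\n"; return 1; }

    std::ifstream in(argv[1]);
    std::stringstream buffer;
    buffer << in.rdbuf();                     // Read the whole source file

    Lexer lexer(buffer.str());
    auto tokens = lexer.tokenize();           // Phase 1: characters -> tokens

    Parser parser(std::move(tokens));
    auto program = parser.parseProgram();     // Phase 2: tokens -> AST

    SemanticAnalyzer sema;
    if (!sema.check(*program)) {              // Phase 3: names + types
        sema.printDiagnostics(std::cerr);
        return 1;
    }

    CodeGenerator codegen;
    codegen.emit(*program, "out.ll");         // Phase 4: AST -> LLVM IR
    return 0;
}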

4.2 Key Components

Token Types

enum class TokenType {
    // Literals
    INTEGER, FLOAT, STRING, CHAR,

    // Identifiers and keywords
    IDENTIFIER,
    FN, LET, IF, ELSE, WHILE, FOR, RETURN,
    STRUCT, ENUM, IMPL, TRAIT,
    TRUE, FALSE, NULL_LIT,

    // Types
    INT, FLOAT_TYPE, BOOL, VOID, STRING_TYPE,

    // Operators
    PLUS, MINUS, STAR, SLASH, PERCENT,
    EQUAL, EQUAL_EQUAL, BANG_EQUAL,
    LESS, LESS_EQUAL, GREATER, GREATER_EQUAL,
    AMPERSAND_AMPERSAND, PIPE_PIPE, BANG,

    // Delimiters
    LPAREN, RPAREN, LBRACE, RBRACE, LBRACKET, RBRACKET,
    COMMA, SEMICOLON, COLON, ARROW, DOT,

    // Special
    END_OF_FILE, ERROR
};

struct Token {
    TokenType type;
    std::string lexeme;
    std::variant<int64_t, double, std::string> literal;
    int line;
    int column;
};

AST Node Hierarchy

// Base classes
struct ASTNode {
    SourceLocation loc;
    virtual ~ASTNode() = default;
    virtual void accept(ASTVisitor& visitor) = 0;
};

struct Expr : ASTNode {
    Type* resolvedType = nullptr;  // Set during type checking
};

struct Stmt : ASTNode {};

struct Decl : ASTNode {};

// Expression nodes
struct IntLiteral : Expr {
    int64_t value;
};

struct FloatLiteral : Expr {
    double value;
};

struct StringLiteral : Expr {
    std::string value;
};

struct BoolLiteral : Expr {
    bool value;
};

struct VarExpr : Expr {
    std::string name;
    Symbol* resolvedSymbol = nullptr;
};

struct BinaryExpr : Expr {
    std::unique_ptr<Expr> left;
    Token op;
    std::unique_ptr<Expr> right;
};

struct UnaryExpr : Expr {
    Token op;
    std::unique_ptr<Expr> operand;
};

struct CallExpr : Expr {
    std::unique_ptr<Expr> callee;
    std::vector<std::unique_ptr<Expr>> arguments;
};

struct AssignExpr : Expr {
    std::string name;
    std::unique_ptr<Expr> value;
};

// Statement nodes
struct ExprStmt : Stmt {
    std::unique_ptr<Expr> expression;
};

struct BlockStmt : Stmt {
    std::vector<std::unique_ptr<Stmt>> statements;
};

struct IfStmt : Stmt {
    std::unique_ptr<Expr> condition;
    std::unique_ptr<Stmt> thenBranch;
    std::unique_ptr<Stmt> elseBranch;  // nullable
};

struct WhileStmt : Stmt {
    std::unique_ptr<Expr> condition;
    std::unique_ptr<Stmt> body;
};

struct ReturnStmt : Stmt {
    std::unique_ptr<Expr> value;  // nullable
};

// Declaration nodes
struct VarDecl : Decl {
    std::string name;
    Type* typeAnnotation;  // nullable (inference)
    std::unique_ptr<Expr> initializer;
    bool isMutable;
};

struct Parameter {
    std::string name;
    Type* type;
};

struct FnDecl : Decl {
    std::string name;
    std::vector<Parameter> parameters;   // Parameter must be declared first
    Type* returnType;
    std::unique_ptr<BlockStmt> body;
};

4.3 Data Structures

Type System

// Type representation
enum class TypeKind {
    PRIMITIVE,    // int, float, bool, void
    FUNCTION,     // (args) -> return
    STRUCT,       // { field: type, ... }
    ARRAY,        // [type]
    POINTER,      // *type
    REFERENCE     // &type
};

struct Type {
    TypeKind kind;

    virtual ~Type() = default;
    virtual bool equals(const Type& other) const = 0;
    virtual std::string toString() const = 0;
};

struct PrimitiveType : Type {
    enum Primitive { INT, FLOAT, BOOL, VOID, STRING };
    Primitive primitive;
};

struct FunctionType : Type {
    std::vector<Type*> paramTypes;
    Type* returnType;
};

struct StructType : Type {
    std::string name;
    std::vector<std::pair<std::string, Type*>> fields;
};

// Type context (owns all types, ensures uniqueness)
class TypeContext {
    std::vector<std::unique_ptr<Type>> types;

public:
    PrimitiveType* getIntType();
    PrimitiveType* getFloatType();
    PrimitiveType* getBoolType();
    PrimitiveType* getVoidType();
    FunctionType* getFunctionType(std::vector<Type*> params, Type* ret);
};
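
One way TypeContext can guarantee a single canonical object per primitive type is to cache the first instance it creates. A sketch, which assumes TypeContext gains one cached pointer per primitive (e.g. a PrimitiveType* intType member) and that PrimitiveType and FunctionType override the pure virtual equals()/toString():

// Sketch: hand out the same PrimitiveType* forever so that pointer
// equality doubles as type equality for primitives.
PrimitiveType* TypeContext::getIntType() {
    if (!intType) {                           // intType: assumed cached member
        auto t = std::make_unique<PrimitiveType>();
        t->kind = TypeKind::PRIMITIVE;
        t->primitive = PrimitiveType::INT;
        intType = t.get();
        types.push_back(std::move(t));        // TypeContext owns every Type
    }
    return intType;
}

FunctionType* TypeContext::getFunctionType(std::vector<Type*> params, Type* ret) {
    auto t = std::make_unique<FunctionType>();
    t->kind = TypeKind::FUNCTION;
    t->paramTypes = std::move(params);
    t->returnType = ret;
    FunctionType* raw = t.get();
    types.push_back(std::move(t));            // Uniquing function types is optional
    return raw;
}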

Symbol Table

struct Symbol {
    std::string name;
    Type* type;
    enum Kind { VARIABLE, FUNCTION, PARAMETER, TYPE };
    Kind kind;
    SourceLocation declLocation;
    bool isInitialized = false;
    bool isMutable = true;
};

class Scope {
    std::unordered_map<std::string, Symbol> symbols;
    Scope* parent;

public:
    Scope(Scope* parent = nullptr) : parent(parent) {}

    bool declare(const std::string& name, Symbol sym);
    Symbol* lookup(const std::string& name);      // Current scope only
    Symbol* resolve(const std::string& name);     // Up scope chain
};

class SymbolTable {
    std::vector<std::unique_ptr<Scope>> scopes;
    Scope* current;

public:
    void pushScope();
    void popScope();
    bool declare(const std::string& name, Symbol sym);
    Symbol* resolve(const std::string& name);
};

4.4 Algorithm Overview

Type Checking Algorithm

+-----------------------------------------------------------------------+
|                    TYPE CHECKING ALGORITHM                             |
+-----------------------------------------------------------------------+
|                                                                        |
|  For each node in AST (post-order traversal):                          |
|                                                                        |
|  IntLiteral      -> type = int                                         |
|  FloatLiteral    -> type = float                                       |
|  BoolLiteral     -> type = bool                                        |
|  StringLiteral   -> type = string                                      |
|                                                                        |
|  VarExpr         -> lookup in symbol table, type = symbol.type         |
|                                                                        |
|  BinaryExpr:                                                           |
|    1. Check left and right types                                       |
|    2. Verify operator valid for types                                  |
|    3. Determine result type                                            |
|       +, -, *, /  : (int, int) -> int, (float, float) -> float         |
|       <, >, <=, >=: (numeric, numeric) -> bool                         |
|       ==, !=     : (T, T) -> bool                                      |
|       &&, ||     : (bool, bool) -> bool                                |
|                                                                        |
|  UnaryExpr:                                                            |
|    !  : bool -> bool                                                   |
|    -  : numeric -> numeric                                             |
|                                                                        |
|  CallExpr:                                                             |
|    1. Get function type                                                |
|    2. Check argument count matches                                     |
|    3. Check each argument type matches parameter type                  |
|    4. Result type = function's return type                             |
|                                                                        |
|  AssignExpr:                                                           |
|    1. Lookup variable                                                  |
|    2. Check mutability                                                 |
|    3. Check value type matches variable type                           |
|                                                                        |
|  IfStmt:                                                               |
|    1. Condition must be bool                                           |
|    2. Check then/else branches                                         |
|                                                                        |
|  WhileStmt:                                                            |
|    1. Condition must be bool                                           |
|    2. Check body                                                       |
|                                                                        |
|  ReturnStmt:                                                           |
|    1. Get enclosing function's return type                             |
|    2. Check value type matches (or void if no value)                   |
|                                                                        |
|  VarDecl:                                                              |
|    1. If type annotation, use it                                       |
|    2. Else infer from initializer                                      |
|    3. Add to symbol table                                              |
|                                                                        |
|  FnDecl:                                                               |
|    1. Create function type                                             |
|    2. Add to symbol table                                              |
|    3. Push scope, add parameters                                       |
|    4. Check body                                                       |
|    5. Verify all paths return (or void)                                |
|                                                                        |
+-----------------------------------------------------------------------+
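
To give a flavor of how the table above becomes code, here is a sketch of the BinaryExpr case. It assumes a TypeChecker visitor holding a TypeContext named types, plus error() and isNumeric() helpers; none of these names are prescribed.

// Sketch of the BinaryExpr rule from the table above.
void TypeChecker::visitBinaryExpr(BinaryExpr& e) {
    e.left->accept(*this);                    // Post-order: check operands first
    e.right->accept(*this);
    Type* lhs = e.left->resolvedType;
    Type* rhs = e.right->resolvedType;

    switch (e.op.type) {
    case TokenType::PLUS: case TokenType::MINUS:
    case TokenType::STAR: case TokenType::SLASH:
        // Arithmetic: identical numeric operands -> same type
        if (lhs && rhs && lhs->equals(*rhs) && isNumeric(lhs))
            e.resolvedType = lhs;
        else
            error(e.loc, "invalid operands to arithmetic operator");
        break;
    case TokenType::EQUAL_EQUAL: case TokenType::BANG_EQUAL:
        // Equality: same type on both sides -> bool
        if (lhs && rhs && lhs->equals(*rhs))
            e.resolvedType = types.getBoolType();
        else
            error(e.loc, "cannot compare values of different types");
        break;
    case TokenType::AMPERSAND_AMPERSAND: case TokenType::PIPE_PIPE:
        // Logical: bool op bool -> bool
        if (lhs == types.getBoolType() && rhs == types.getBoolType())
            e.resolvedType = types.getBoolType();
        else
            error(e.loc, "logical operators require bool operands");
        break;
    default:
        // <, >, <=, >= follow the same pattern with a bool result
        break;
    }
}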

5. Implementation Guide

5.1 Development Environment Setup

# Install LLVM (optional, for code generation)
# Ubuntu/Debian
sudo apt install llvm-15-dev clang-15

# macOS
brew install llvm@15

# Create project
mkdir -p my-compiler/{src,include,tests}
cd my-compiler

# CMakeLists.txt
cat > CMakeLists.txt << 'EOF'
cmake_minimum_required(VERSION 3.16)
project(MyCompiler CXX)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

# Optional LLVM
find_package(LLVM 15 CONFIG)
if(LLVM_FOUND)
    message(STATUS "Found LLVM ${LLVM_PACKAGE_VERSION}")
    add_definitions(-DHAS_LLVM)
    include_directories(${LLVM_INCLUDE_DIRS})
    link_directories(${LLVM_LIBRARY_DIRS})
endif()

add_executable(mycompiler
    src/main.cpp
    src/lexer.cpp
    src/parser.cpp
    src/ast.cpp
    src/types.cpp
    src/semantic.cpp
    src/codegen.cpp
    src/interpreter.cpp
)

if(LLVM_FOUND)
    llvm_map_components_to_libnames(LLVM_LIBS core native)
    target_link_libraries(mycompiler ${LLVM_LIBS})
endif()
EOF

5.2 Project Structure

my-compiler/
├── CMakeLists.txt
├── include/
│   ├── lexer.hpp
│   ├── token.hpp
│   ├── parser.hpp
│   ├── ast.hpp
│   ├── types.hpp
│   ├── semantic.hpp
│   ├── codegen.hpp
│   ├── interpreter.hpp
│   └── error.hpp
├── src/
│   ├── main.cpp
│   ├── lexer.cpp
│   ├── parser.cpp
│   ├── ast.cpp
│   ├── types.cpp
│   ├── semantic.cpp
│   ├── codegen.cpp
│   └── interpreter.cpp
├── tests/
│   ├── lexer_test.cpp
│   ├── parser_test.cpp
│   ├── semantic_test.cpp
│   └── integration/
│       ├── factorial.my
│       ├── fibonacci.my
│       └── ...
└── examples/
    ├── hello.my
    └── ...

5.3 The Core Question You’re Answering

How do you transform human-readable source code into executable machine instructions while catching errors and preserving meaning?

This encompasses:

  • Tokenizing unstructured text
  • Building structured representations
  • Verifying semantic correctness
  • Translating to target representation

5.4 Concepts You Must Understand First

Before implementing, verify you can answer:

  1. What is the Chomsky hierarchy and where do programming languages fit?
    • Reference: “Engineering a Compiler” Chapter 3
  2. How does recursive descent parsing work?
    • Reference: “Crafting Interpreters” Chapter 6
  3. What is the difference between syntax and semantics?
    • Reference: “Types and Programming Languages” Chapter 1
  4. How do scopes and closures work?
    • Reference: “Engineering a Compiler” Chapter 5
  5. What is LLVM IR and how is it structured?
    • Reference: LLVM Language Reference Manual

5.5 Questions to Guide Your Design

Lexer Design:

  • How will you handle Unicode in identifiers and strings?
  • How will you track line and column numbers?
  • How will you handle unterminated strings or invalid characters?

Parser Design:

  • How will you handle operator precedence?
  • How will you recover from syntax errors?
  • How will you represent optional AST children?

Type System:

  • How will you represent function types?
  • How will you handle type inference?
  • How will you report type errors helpfully?

Code Generation:

  • How will you handle local variables in LLVM IR?
  • How will you implement control flow?
  • How will you call external functions (like print)?

5.6 Thinking Exercise

Before coding, trace through compilation of:

fn max(a: int, b: int) -> int {
    if a > b {
        return a;
    }
    return b;
}

Draw:

  1. Token stream
  2. AST structure
  3. Symbol table after parsing
  4. Type annotations on AST nodes
  5. LLVM IR (or pseudocode)

5.7 Hints in Layers

Hint 1 - Getting Started: Start with the lexer. Get it fully working before moving to the parser. Test with simple inputs first.

Hint 2 - Parser Structure: Each grammar rule becomes a method. Parse expressions bottom-up (highest precedence first), statements top-down.

Hint 3 - Type Checking: Use the Visitor pattern. Each AST node type has a corresponding check method that returns the node’s type.

Hint 4 - LLVM IR: Study the Kaleidoscope tutorial. Create a module, function, basic blocks, then emit instructions.
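
For orientation, the shape of LLVM's C++ API looks roughly like the following sketch, which builds and prints IR for an add(i32, i32) function. Treat it as a starting point only and check it against the Kaleidoscope tutorial and your installed LLVM version (it also needs to be linked against the LLVM libraries, as in the CMake setup above).

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/raw_ostream.h"

int main() {
    llvm::LLVMContext ctx;
    llvm::Module module("demo", ctx);          // One module per compilation unit
    llvm::IRBuilder<> builder(ctx);

    // Declare: i32 @add(i32, i32)
    auto* i32 = builder.getInt32Ty();
    auto* fnType = llvm::FunctionType::get(i32, {i32, i32}, /*isVarArg=*/false);
    auto* fn = llvm::Function::Create(fnType, llvm::Function::ExternalLinkage,
                                      "add", &module);

    // Every function body is a graph of basic blocks; start with "entry".
    auto* entry = llvm::BasicBlock::Create(ctx, "entry", fn);
    builder.SetInsertPoint(entry);

    // Emit: %sum = add i32 %a, %b ; ret i32 %sum
    llvm::Value* sum = builder.CreateAdd(fn->getArg(0), fn->getArg(1), "sum");
    builder.CreateRet(sum);

    llvm::verifyFunction(*fn);                 // Catch malformed IR early
    module.print(llvm::outs(), nullptr);       // Dump the textual IR
    return 0;
}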

5.8 The Interview Questions They’ll Ask

  1. “Explain the phases of compilation.”
    • Expected: Lexing, parsing, semantic analysis, optimization, code generation
  2. “How do you implement operator precedence?”
    • Expected: Pratt parsing or precedence climbing in recursive descent
  3. “What is the difference between static and dynamic typing?”
    • Expected: Static checked at compile time, dynamic at runtime
  4. “How does name resolution work with nested scopes?”
    • Expected: Scope chain, shadowing, capture for closures
  5. “What is SSA form and why is it useful?”
    • Expected: Single Static Assignment, enables optimizations

5.9 Books That Will Help

Topic              Book                                            Chapters
Lexing/Parsing     “Crafting Interpreters” by Robert Nystrom       Parts 1-2
Formal Theory      “Engineering a Compiler” by Cooper & Torczon    Chapters 2-4
Type Systems       “Types and Programming Languages” by Pierce     Chapters 1-11
Semantic Analysis  “Engineering a Compiler” by Cooper & Torczon    Chapters 5-6
Code Generation    LLVM Tutorial (Kaleidoscope)                    All chapters
Optimization       “Engineering a Compiler” by Cooper & Torczon    Chapters 8-10

5.10 Implementation Phases

Phase 1: Lexer (Week 1-2)

  • Token types and structure
  • Character stream processing
  • Keyword recognition
  • Literal parsing (int, float, string)
  • Error handling

Phase 2: Parser (Week 3-5)

  • AST node definitions
  • Recursive descent for expressions
  • Statement parsing
  • Function declarations
  • Error recovery (synchronization)

Phase 3: Semantic Analysis (Week 6-8)

  • Symbol table implementation
  • Scope management
  • Type checking
  • Name resolution
  • Error reporting

Phase 4: Code Generation (Week 9-12)

  • LLVM setup or interpreter
  • Expression evaluation
  • Control flow
  • Function calls
  • Runtime library

Phase 5: Polish (Week 13-16)

  • Better error messages
  • More language features
  • Optimization
  • Testing suite

5.11 Key Implementation Decisions

Decision             Options                              Recommendation
Backend              LLVM, interpreter, custom bytecode   Start with interpreter, add LLVM
AST ownership        unique_ptr, shared_ptr, arena        unique_ptr for simplicity
Error handling       Exceptions, result type              Return error type from the parser
Type representation  Inheritance, variant                 Inheritance with TypeKind enum
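
Since the table recommends starting with a tree-walking interpreter, here is a sketch of its core idea: evaluate each AST node recursively and return a runtime value. The Value alias and the environment.get() call are illustrative; the semantic pass is assumed to have already rejected ill-typed operands.

#include <cstdint>
#include <stdexcept>
#include <string>
#include <variant>

// Illustrative runtime value for the interpreter backend.
using Value = std::variant<int64_t, double, bool, std::string>;

// Sketch: recursive evaluation of expression nodes.
Value Interpreter::evaluate(Expr& expr) {
    if (auto* lit = dynamic_cast<IntLiteral*>(&expr)) {
        return lit->value;
    }
    if (auto* var = dynamic_cast<VarExpr*>(&expr)) {
        return environment.get(var->name);        // Runtime scope lookup (assumed)
    }
    if (auto* bin = dynamic_cast<BinaryExpr*>(&expr)) {
        Value lhs = evaluate(*bin->left);
        Value rhs = evaluate(*bin->right);
        switch (bin->op.type) {
        case TokenType::PLUS:
            return std::get<int64_t>(lhs) + std::get<int64_t>(rhs);
        case TokenType::STAR:
            return std::get<int64_t>(lhs) * std::get<int64_t>(rhs);
        case TokenType::LESS_EQUAL:
            return std::get<int64_t>(lhs) <= std::get<int64_t>(rhs);
        default:
            break;  // The remaining operators follow the same pattern
        }
    }
    throw std::runtime_error("unhandled expression node");
}

Dispatching through the ASTVisitor instead of dynamic_cast works just as well and keeps the interpreter symmetric with the type checker and code generator.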

6. Testing Strategy

Unit Tests

// Lexer tests
TEST(Lexer, TokenizesKeywords) {
    Lexer lexer("fn let if else while return");
    EXPECT_EQ(lexer.nextToken().type, TokenType::FN);
    EXPECT_EQ(lexer.nextToken().type, TokenType::LET);
    EXPECT_EQ(lexer.nextToken().type, TokenType::IF);
    EXPECT_EQ(lexer.nextToken().type, TokenType::ELSE);
    EXPECT_EQ(lexer.nextToken().type, TokenType::WHILE);
    EXPECT_EQ(lexer.nextToken().type, TokenType::RETURN);
}

TEST(Lexer, TokenizesNumbers) {
    Lexer lexer("42 3.14 0xFF");
    auto t1 = lexer.nextToken();
    EXPECT_EQ(t1.type, TokenType::INTEGER);
    EXPECT_EQ(std::get<int64_t>(t1.literal), 42);

    auto t2 = lexer.nextToken();
    EXPECT_EQ(t2.type, TokenType::FLOAT);
    EXPECT_DOUBLE_EQ(std::get<double>(t2.literal), 3.14);
}

// Parser tests
TEST(Parser, ParsesBinaryExpr) {
    Parser parser("1 + 2 * 3");
    auto expr = parser.parseExpression();

    // Should be 1 + (2 * 3), not (1 + 2) * 3
    auto* binOp = dynamic_cast<BinaryExpr*>(expr.get());
    ASSERT_NE(binOp, nullptr);
    EXPECT_EQ(binOp->op.type, TokenType::PLUS);

    auto* right = dynamic_cast<BinaryExpr*>(binOp->right.get());
    ASSERT_NE(right, nullptr);
    EXPECT_EQ(right->op.type, TokenType::STAR);
}

// Type checker tests
TEST(TypeChecker, TypeMismatch) {
    auto ast = parse("let x: int = \"hello\";");
    TypeChecker checker;
    auto errors = checker.check(ast);

    EXPECT_EQ(errors.size(), 1);
    EXPECT_NE(errors[0].message.find("type mismatch"), std::string::npos);
}

Integration Tests

#!/bin/bash
# test_programs.sh

for file in tests/integration/*.my; do
    expected="${file%.my}.expected"
    if ./mycompiler "$file" --interpret > /tmp/output.txt 2>&1; then
        if diff -q /tmp/output.txt "$expected" > /dev/null; then
            echo "PASS: $file"
        else
            echo "FAIL: $file (output mismatch)"
            diff /tmp/output.txt "$expected"
        fi
    else
        echo "FAIL: $file (compilation error)"
    fi
done

7. Common Pitfalls & Debugging

Problem         Symptom         Root Cause                  Fix
Left recursion  Stack overflow  Grammar has left recursion  Rewrite the rule as iteration
Dangling else   Wrong parse     Ambiguous grammar           Bind else to the nearest if
Off-by-one      Wrong location  Line/col tracking error     Test with multi-line input
Memory leak     Growing memory  AST nodes not freed         Use unique_ptr ownership
Type confusion  Wrong codegen   AST type field not set      Verify types after the semantic pass

Debugging Techniques

// AST pretty printer
class ASTPrinter : public ASTVisitor {
    int indent = 0;

    void print(const std::string& s) {
        std::cout << std::string(indent * 2, ' ') << s << '\n';
    }

public:
    void visitBinaryExpr(BinaryExpr& e) override {
        print("BinaryExpr " + e.op.lexeme);
        indent++;
        e.left->accept(*this);
        e.right->accept(*this);
        indent--;
    }
    // ... other visit methods
};

// Use: ASTPrinter printer; ast->accept(printer);

8. Extensions & Challenges

After completing the base project:

  1. Structs and methods: User-defined types with associated functions
  2. Generics: Type parameters like fn identity<T>(x: T) -> T
  3. Closures: Anonymous functions that capture variables
  4. Pattern matching: match expressions
  5. Modules: Import system with namespaces
  6. Macros: Hygienic compile-time code generation
  7. Incremental compilation: Only recompile changed files
  8. LSP server: IDE support (autocomplete, go-to-definition)
  9. Debugging info: DWARF emission for gdb/lldb
  10. Optimization passes: Dead code elimination, constant folding

9. Real-World Connections

Your Implementation  Production Compiler
Hand-written lexer   Clang’s lexer
Recursive descent    GCC, Clang, Rust
Symbol tables        All compilers
Type checking        TypeScript, Rust, Swift
LLVM IR              Clang, Rust, Swift, Julia

Industry Examples:

  • Clang/LLVM: C/C++ frontend used by Apple, Google
  • Rust: Advanced type system with borrow checker
  • TypeScript: Type inference and structural typing
  • Swift: Protocol-oriented with value semantics

10. Resources

Academic Resources

  • “Engineering a Compiler” by Cooper & Torczon
  • “Types and Programming Languages” by Benjamin Pierce
  • “Modern Compiler Implementation in ML” by Appel

11. Self-Assessment Checklist

Before considering this project complete:

  • Can you explain the difference between lexical and syntactic analysis?
  • Can you trace token stream for any input?
  • Can you draw AST for complex expressions?
  • Can you explain how scoping works?
  • Can you describe the type checking algorithm?
  • Does your compiler produce correct output for all test cases?
  • Do error messages include line/column numbers?
  • Can you add a new operator in under 30 minutes?

12. Submission / Completion Criteria

Your project is complete when:

  1. Lexer Is Complete
    • Recognizes all token types
    • Handles edge cases (unterminated strings, etc.)
    • Reports errors with location
  2. Parser Is Complete
    • Parses all language constructs
    • Builds correct AST
    • Recovers from errors
  3. Semantic Analysis Works
    • Name resolution finds all symbols
    • Type checking catches errors
    • All AST nodes have types
  4. Code Generation Works
    • Produces executable code
    • Handles all expressions and statements
    • Functions call correctly
  5. Test Suite Passes
    • Unit tests for each phase
    • Integration tests for complete programs
    • Error case tests

Deliverables:

  • Source code with clear organization
  • Test suite with good coverage
  • Example programs demonstrating features
  • Documentation of language syntax