Project 10: C++ Compiler Frontend

Build a complete compiler frontend for a subset of C++ (or your own language) including lexical analysis, parsing, AST construction, semantic analysis, and LLVM IR code generation.


Quick Reference

Attribute      Value
Difficulty     Master
Time Estimate  3-6 months
Language       C++
Prerequisites  All previous projects, theory of computation, LLVM basics
Key Topics     Lexing, parsing, AST, type checking, symbol tables, LLVM IR

1. Learning Objectives

By completing this project, you will:

  1. Master lexical analysis: Convert source code into tokens using finite automata
  2. Implement recursive descent parsing: Build AST from tokens using context-free grammars
  3. Design AST node hierarchies: Create type-safe, visitable syntax trees
  4. Build symbol tables: Manage scopes, declarations, and name resolution
  5. Implement type checking: Verify type compatibility and perform inference
  6. Generate intermediate representation: Produce LLVM IR or your own bytecode
  7. Understand compiler optimization: Learn how frontends enable backend optimization
  8. Debug language implementations: Develop techniques for diagnosing compiler bugs

2. Theoretical Foundation

2.1 Core Concepts

Compilation Pipeline

+-----------------------------------------------------------------------+
|                      COMPILATION PIPELINE                              |
+-----------------------------------------------------------------------+
|                                                                        |
|  Source Code                                                           |
|       |                                                                |
|       v                                                                |
|  +------------------+                                                  |
|  | Lexical Analysis |  "fn main() { let x = 5; }"                      |
|  |    (Lexer)       |  --> [FN, IDENT("main"), LPAREN, RPAREN, ...]   |
|  +------------------+                                                  |
|       |                                                                |
|       v                                                                |
|  +------------------+                                                  |
|  | Syntax Analysis  |  Tokens --> Abstract Syntax Tree                 |
|  |    (Parser)      |                                                  |
|  +------------------+            FnDecl                                |
|       |                         /     \                                |
|       |                      name    body                              |
|       |                     "main"   BlockStmt                         |
|       |                               |                                |
|       v                             VarDecl                            |
|  +------------------+              /    \                              |
|  | Semantic Analysis|           "x"    IntLit(5)                       |
|  | (Type Checking)  |                                                  |
|  +------------------+                                                  |
|       |                                                                |
|       v                                                                |
|  +------------------+                                                  |
|  | IR Generation    |  AST --> LLVM IR / Bytecode                      |
|  +------------------+                                                  |
|       |                                                                |
|       v                                                                |
|  +------------------+                                                  |
|  | Optimization     |  (LLVM handles this)                             |
|  +------------------+                                                  |
|       |                                                                |
|       v                                                                |
|  +------------------+                                                  |
|  | Code Generation  |  IR --> Machine Code                             |
|  +------------------+                                                  |
|       |                                                                |
|       v                                                                |
|  Executable                                                            |
|                                                                        |
+-----------------------------------------------------------------------+

Lexical Analysis

The lexer converts character streams into tokens:

+-----------------------------------------------------------------------+
|                      LEXICAL ANALYSIS                                  |
+-----------------------------------------------------------------------+
|                                                                        |
|  Input: "let x = 42 + y;"                                              |
|                                                                        |
|  Output: [                                                             |
|    Token { type: LET,        lexeme: "let", line: 1, col: 1 },         |
|    Token { type: IDENTIFIER, lexeme: "x",   line: 1, col: 5 },         |
|    Token { type: EQUAL,      lexeme: "=",   line: 1, col: 7 },         |
|    Token { type: INTEGER,    lexeme: "42",  line: 1, col: 9 },         |
|    Token { type: PLUS,       lexeme: "+",   line: 1, col: 12 },        |
|    Token { type: IDENTIFIER, lexeme: "y",   line: 1, col: 14 },        |
|    Token { type: SEMICOLON,  lexeme: ";",   line: 1, col: 15 },        |
|    Token { type: EOF,        lexeme: "",    line: 1, col: 16 }         |
|  ]                                                                     |
|                                                                        |
|  Lexer State Machine (simplified for identifiers):                     |
|                                                                        |
|        [a-zA-Z_]                                                       |
|  (Start) ---------> (Identifier) ------+                               |
|                          ^             | [^a-zA-Z0-9_]                 |
|                          |             v                               |
|                     [a-zA-Z0-9_]   (Emit IDENTIFIER)                   |
|                                                                        |
|  Token Types:                                                          |
|  - Keywords: fn, let, if, else, while, return, struct, true, false     |
|  - Identifiers: [a-zA-Z_][a-zA-Z0-9_]*                                  |
|  - Literals: integers, floats, strings, characters                     |
|  - Operators: +, -, *, /, ==, !=, <, >, <=, >=, &&, ||, !              |
|  - Delimiters: (, ), {, }, [, ], ,, ;, :, ->                           |
|                                                                        |
+-----------------------------------------------------------------------+
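
To make the state machine above concrete, here is a minimal sketch of identifier and keyword scanning. It assumes the Token and TokenType definitions given later in Section 4.2; the function shape and names are illustrative, not prescribed.

#include <cctype>
#include <string>
#include <unordered_map>

// Sketch: scan one identifier-or-keyword starting at src[pos].
// The while loop is the self-loop of the "Identifier" state above.
Token scanIdentifier(const std::string& src, size_t& pos, int line, int& col) {
    static const std::unordered_map<std::string, TokenType> keywords = {
        {"fn", TokenType::FN},       {"let", TokenType::LET},
        {"if", TokenType::IF},       {"else", TokenType::ELSE},
        {"while", TokenType::WHILE}, {"return", TokenType::RETURN},
        {"true", TokenType::TRUE},   {"false", TokenType::FALSE},
    };

    size_t start = pos;
    int startCol = col;
    while (pos < src.size() &&
           (std::isalnum(static_cast<unsigned char>(src[pos])) || src[pos] == '_')) {
        ++pos;
        ++col;
    }
    std::string lexeme = src.substr(start, pos - start);
    auto it = keywords.find(lexeme);
    TokenType type = (it != keywords.end()) ? it->second : TokenType::IDENTIFIER;
    return Token{type, lexeme, {}, line, startCol};
}

A single keyword-table lookup after scanning the whole word is simpler (and usually faster) than baking every keyword into the state machine by hand.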

Recursive Descent Parsing

Each grammar rule becomes a parsing function:

+-----------------------------------------------------------------------+
|                    RECURSIVE DESCENT PARSING                           |
+-----------------------------------------------------------------------+
|                                                                        |
|  Grammar (EBNF-like):                                                  |
|                                                                        |
|  program     = declaration*                                            |
|  declaration = fnDecl | varDecl | statement                            |
|  fnDecl      = "fn" IDENT "(" params? ")" ("->" type)? block           |
|  varDecl     = "let" IDENT (":" type)? ("=" expr)? ";"                 |
|  statement   = exprStmt | ifStmt | whileStmt | returnStmt | block      |
|  block       = "{" declaration* "}"                                    |
|  exprStmt    = expr ";"                                                |
|  ifStmt      = "if" expr block ("else" block)?                         |
|  whileStmt   = "while" expr block                                      |
|  returnStmt  = "return" expr? ";"                                      |
|                                                                        |
|  Expression Precedence (lowest to highest):                            |
|  assignment  = IDENT "=" assignment | logic_or                         |
|  logic_or    = logic_and ("||" logic_and)*                             |
|  logic_and   = equality ("&&" equality)*                               |
|  equality    = comparison (("==" | "!=") comparison)*                  |
|  comparison  = term (("<" | ">" | "<=" | ">=") term)*                  |
|  term        = factor (("+" | "-") factor)*                            |
|  factor      = unary (("*" | "/") unary)*                              |
|  unary       = ("!" | "-") unary | call                                |
|  call        = primary ("(" args? ")" | "." IDENT)*                    |
|  primary     = INTEGER | FLOAT | STRING | "true" | "false"             |
|              | IDENT | "(" expr ")"                                    |
|                                                                        |
|  Parsing Code Pattern:                                                 |
|                                                                        |
|  unique_ptr<Expr> Parser::parseTerm() {                                |
|      auto left = parseFactor();                                        |
|      while (match(PLUS) || match(MINUS)) {                             |
|          Token op = previous();                                        |
|          auto right = parseFactor();                                   |
|          left = make_unique<BinaryExpr>(move(left), op, move(right));  |
|      }                                                                 |
|      return left;                                                      |
|  }                                                                     |
|                                                                        |
+-----------------------------------------------------------------------+
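
The parseTerm pattern above leans on a few small token-stream helpers. One common way to write them, sketched here under the assumption that the Parser holds a std::vector<Token> tokens and a size_t current index (names are illustrative):

// Token-stream helpers used by the recursive descent pattern above.
bool Parser::check(TokenType type) const {
    return tokens[current].type == type;          // Peek at the current token
}

bool Parser::isAtEnd() const {
    return check(TokenType::END_OF_FILE);
}

const Token& Parser::previous() const {
    return tokens[current - 1];                   // Most recently consumed token
}

const Token& Parser::advance() {
    if (!isAtEnd()) ++current;                    // Consume one token
    return previous();
}

bool Parser::match(TokenType type) {
    if (check(type)) { advance(); return true; }  // Consume only on a match
    return false;
}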

Abstract Syntax Tree

+-----------------------------------------------------------------------+
|                    ABSTRACT SYNTAX TREE                                |
+-----------------------------------------------------------------------+
|                                                                        |
|  Source: fn factorial(n: int) -> int {                                 |
|              if n <= 1 { return 1; }                                   |
|              return n * factorial(n - 1);                              |
|          }                                                             |
|                                                                        |
|  AST:                                                                  |
|                                                                        |
|                      FnDecl                                            |
|                    /   |   \                                           |
|                  name params body                                      |
|                   |     |     \                                        |
|              "factorial" |   BlockStmt                                 |
|                         |    /      \                                  |
|                  Param("n", int)    statements                         |
|                                     /         \                        |
|                               IfStmt        ReturnStmt                 |
|                              /   |   \           |                     |
|                           cond  then  else   BinaryExpr(*)             |
|                            |     |    null    /       \                |
|                  BinaryExpr(<=)  |        VarExpr    CallExpr          |
|                   /      \       |          "n"     /      \           |
|               VarExpr  IntLit  ReturnStmt        "factorial" args      |
|                 "n"      1        |                          |         |
|                               IntLit(1)              BinaryExpr(-)     |
|                                                       /       \        |
|                                                   VarExpr   IntLit     |
|                                                     "n"       1        |
|                                                                        |
|  Node Base Class:                                                      |
|  struct ASTNode {                                                      |
|      SourceLocation location;                                          |
|      virtual ~ASTNode() = default;                                     |
|      virtual void accept(Visitor& v) = 0;                              |
|  };                                                                    |
|                                                                        |
+-----------------------------------------------------------------------+
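
The accept(Visitor&) hook is one half of the visitor pattern used throughout the rest of this guide. A minimal sketch of the other half, assuming the node types introduced in Section 4.2:

// Forward declarations of the concrete node types (see Section 4.2).
struct IntLiteral;
struct VarExpr;
struct BinaryExpr;

// One visit method per concrete node type.
struct ASTVisitor {
    virtual ~ASTVisitor() = default;
    virtual void visitIntLiteral(IntLiteral&) = 0;
    virtual void visitVarExpr(VarExpr&) = 0;
    virtual void visitBinaryExpr(BinaryExpr&) = 0;
    // ... one method per remaining node type
};

// Each concrete node implements accept by calling back into "its" method,
// e.g. inside BinaryExpr:
//     void accept(ASTVisitor& v) override { v.visitBinaryExpr(*this); }

This double dispatch lets the type checker, code generator, and pretty printer each live in their own class instead of being scattered across the node hierarchy.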

Symbol Tables and Scoping

+-----------------------------------------------------------------------+
|                    SYMBOL TABLES AND SCOPING                           |
+-----------------------------------------------------------------------+
|                                                                        |
|  Source:                                                               |
|  fn main() {                                                           |
|      let x = 10;        // x declared in main's scope                  |
|      {                                                                 |
|          let y = 20;    // y declared in inner scope                   |
|          let x = 30;    // shadows outer x                             |
|          print(x + y);  // 30 + 20 = 50                                |
|      }                                                                 |
|      print(x);          // 10 (inner x is gone)                        |
|  }                                                                     |
|                                                                        |
|  Scope Chain:                                                          |
|                                                                        |
|  +-- Global Scope -------------------------------------------------+   |
|  |  main: FnType                                                   |   |
|  |  print: FnType(int) -> void                                     |   |
|  |                                                                  |   |
|  |  +-- main() Scope -------------------------------------------+  |   |
|  |  |  x: int = 10                                              |  |   |
|  |  |                                                            |  |   |
|  |  |  +-- Inner Block Scope --------------------------------+  |  |   |
|  |  |  |  y: int = 20                                        |  |  |   |
|  |  |  |  x: int = 30  (shadows outer x)                     |  |  |   |
|  |  |  +-----------------------------------------------------+  |  |   |
|  |  +------------------------------------------------------------+  |   |
|  +------------------------------------------------------------------+   |
|                                                                        |
|  Lookup Algorithm:                                                     |
|  1. Check current scope                                                |
|  2. If not found, check parent scope                                   |
|  3. Continue until global scope                                        |
|  4. If still not found, report "undefined variable" error              |
|                                                                        |
|  Symbol Entry:                                                         |
|  struct Symbol {                                                       |
|      string name;                                                      |
|      Type* type;                                                       |
|      SymbolKind kind;  // Variable, Function, Type, Parameter          |
|      bool isMutable;                                                   |
|      SourceLocation declLocation;                                      |
|  };                                                                    |
|                                                                        |
+-----------------------------------------------------------------------+
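
The lookup algorithm above translates almost line for line into code. A sketch, assuming the Scope class declared in Section 4.3:

// Sketch of the scope-chain operations described above.
bool Scope::declare(const std::string& name, Symbol sym) {
    // Reject duplicates in the *same* scope; shadowing an outer scope's
    // name is allowed because only this scope's map is consulted.
    return symbols.emplace(name, std::move(sym)).second;
}

Symbol* Scope::lookup(const std::string& name) {
    auto it = symbols.find(name);
    return it != symbols.end() ? &it->second : nullptr;
}

Symbol* Scope::resolve(const std::string& name) {
    // Steps 1-3: walk outward from the current scope to the global scope.
    for (Scope* s = this; s != nullptr; s = s->parent) {
        if (Symbol* sym = s->lookup(name)) return sym;
    }
    return nullptr;  // Step 4: caller reports "undefined variable"
}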

2.2 Why This Matters

Building a compiler frontend teaches you:

  • How programming languages work: From syntax to semantics
  • Software engineering at scale: Compilers are complex, well-structured systems
  • Understanding optimizations: Know what makes code fast
  • Language design tradeoffs: Why languages are designed the way they are

Industry relevance:

  • Compiler engineers are highly valued (Google, Apple, Meta, Microsoft)
  • Understanding compilers improves code quality
  • DSLs (Domain-Specific Languages) use compiler techniques
  • IDE features (autocomplete, refactoring) rely on frontend analysis

2.3 Historical Context

1950s: First compilers (FORTRAN by Backus)
1960s: Formal language theory (Chomsky hierarchy)
1970s: Parser generators (yacc, lex)
1980s: Retargetable compilers (GCC), optimization theory matures
1990s: JIT compilation (Self, Java HotSpot)
2000s: LLVM project begins
2010s: Clang becomes production-ready
2020s: Rust, Swift, and Julia all build on LLVM

Key milestones:

  • FORTRAN (1957): First optimizing compiler
  • ALGOL (1960): Introduced BNF grammar notation
  • C (1972): Designed around a small, portable compiler
  • LLVM (2000): Modern modular compiler infrastructure
  • Clang (2007): C/C++ frontend for LLVM

2.4 Common Misconceptions

Misconception                           Reality
“Compilers are too complex to build”    A simple language frontend is ~3000 lines
“You need a CS degree”                  You can learn this on your own with the right resources
“Parser generators are required”        Hand-written parsers are often cleaner
“Type checking is the hard part”        It’s systematic once you understand it
“LLVM is too complex”                   The IR is well-documented and learnable

3. Project Specification

3.1 What You Will Build

A compiler frontend for a simple typed language featuring:

  1. Lexer: Tokenizes source code with error recovery
  2. Parser: Builds AST using recursive descent
  3. Type System: Static typing with inference
  4. Semantic Analyzer: Name resolution, type checking
  5. Code Generator: LLVM IR or tree-walking interpreter

3.2 Functional Requirements

Requirement  Description
FR-1         Tokenize source with keywords, identifiers, literals, operators
FR-2         Parse expressions with correct precedence and associativity
FR-3         Parse statements: let, if, while, return, blocks
FR-4         Parse functions with typed parameters and return types
FR-5         Build scoped symbol tables for name resolution
FR-6         Type check expressions and statements
FR-7         Generate executable code (LLVM IR or interpreter)
FR-8         Report errors with source location and helpful messages

3.3 Non-Functional Requirements

Requirement  Description
NFR-1        Clear error messages with line/column numbers
NFR-2        Continue parsing after syntax errors (error recovery)
NFR-3        Compile 10,000 lines in under 1 second
NFR-4        Memory-safe AST with no leaks
NFR-5        Extensible architecture for new features

3.4 Example Usage / Output

Source file (test.my):

fn factorial(n: int) -> int {
    if n <= 1 {
        return 1;
    }
    return n * factorial(n - 1);
}

fn main() -> int {
    let result = factorial(5);
    print(result);
    return 0;
}

Compilation output:

$ ./mycompiler test.my
[Lexer] 47 tokens
[Parser] AST constructed
[Semantic] Type checking complete
[Codegen] LLVM IR generated
[LLVM] Compiling to native...
[Done] Output: test

$ ./test
120

Error reporting:

$ ./mycompiler bad.my
error[E001]: type mismatch in binary expression
  --> bad.my:5:12
   |
 5 |     return x + "hello";
   |            ^~~~~~~~~~
   |            expected 'int', found 'string'
   |
help: cannot add 'int' and 'string'

error[E002]: undefined variable 'y'
  --> bad.my:8:5
   |
 8 |     y = 10;
   |     ^
   |     variable 'y' was not declared in this scope

Interpreter mode:

$ ./mycompiler test.my --interpret
120
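
Producing caret-style diagnostics like the ones above is mostly bookkeeping once every token and AST node carries a source location. A minimal sketch of a printer (the Diagnostic struct and its field names are illustrative, not part of the specification; it also assumes single-digit line numbers for brevity):

#include <iostream>
#include <string>

struct Diagnostic {
    std::string code;        // e.g. "E001"
    std::string message;     // e.g. "type mismatch in binary expression"
    std::string file;
    int line = 0;
    int column = 0;          // 1-based column of the offending token
    int length = 1;          // number of characters to underline
    std::string sourceLine;  // the offending line of source text
    std::string note;        // e.g. "expected 'int', found 'string'"
};

void report(const Diagnostic& d) {
    std::cerr << "error[" << d.code << "]: " << d.message << '\n'
              << "  --> " << d.file << ':' << d.line << ':' << d.column << '\n'
              << "   |\n"
              << ' ' << d.line << " | " << d.sourceLine << '\n'
              << "   | " << std::string(d.column - 1, ' ') << '^'
              << std::string(d.length > 1 ? d.length - 1 : 0, '~') << '\n'
              << "   | " << std::string(d.column - 1, ' ') << d.note << '\n';
}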

3.5 Real World Outcome

After completing this project, you will have:

  1. Working compiler: Compiles source to executable
  2. Deep understanding: How code becomes machine instructions
  3. Transferable skills: Applicable to DSLs, query languages, configs
  4. Portfolio piece: Impressive for compiler or language teams

You should be able to:

  • Read and understand production compiler source code
  • Design new language features
  • Implement syntax highlighters and linters
  • Build domain-specific languages

4. Solution Architecture

4.1 High-Level Design

+-----------------------------------------------------------------------+
|                    COMPILER ARCHITECTURE                               |
+-----------------------------------------------------------------------+
|                                                                        |
|  +------------------------------------------------------------------+  |
|  |                        Driver                                     |  |
|  |  - Command-line parsing                                          |  |
|  |  - File I/O                                                       |  |
|  |  - Error reporting                                                |  |
|  +------------------------------------------------------------------+  |
|                              |                                         |
|       +----------------------+----------------------+                  |
|       v                      v                      v                  |
|  +----------+          +----------+          +----------+              |
|  |  Lexer   |    -->   |  Parser  |    -->   | Semantic |              |
|  |          |  tokens  |          |   AST    | Analyzer |              |
|  +----------+          +----------+          +----------+              |
|                                                    |                   |
|                                                    v                   |
|                        +--------------------------+                    |
|                        |                          |                    |
|                        v                          v                    |
|                  +----------+              +----------+                |
|                  | Codegen  |              |Interpreter|               |
|                  | (LLVM)   |              |  (Tree)   |               |
|                  +----------+              +----------+                |
|                        |                                               |
|                        v                                               |
|                  Executable                                            |
|                                                                        |
+-----------------------------------------------------------------------+
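
A sketch of the Driver tying these stages together. The class and method names (tokenize, parseProgram, check, printDiagnostics, emit) are placeholders for whatever interfaces you design; only the order of the phases matters here.

#include <fstream>
#include <iostream>
#include <sstream>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: mycompiler <file>\n"; return 1; }

    std::ifstream in(argv[1]);
    std::stringstream buffer;
    buffer << in.rdbuf();                     // Read the whole source file

    Lexer lexer(buffer.str());
    auto tokens = lexer.tokenize();           // Phase 1: characters -> tokens

    Parser parser(std::move(tokens));
    auto program = parser.parseProgram();     // Phase 2: tokens -> AST

    SemanticAnalyzer sema;
    if (!sema.check(*program)) {              // Phase 3: names + types
        sema.printDiagnostics(std::cerr);
        return 1;
    }

    CodeGenerator codegen;
    codegen.emit(*program, "out.ll");         // Phase 4: AST -> LLVM IR
    return 0;
}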

4.2 Key Components

Token Types

enum class TokenType {
    // Literals
    INTEGER, FLOAT, STRING, CHAR,

    // Identifiers and keywords
    IDENTIFIER,
    FN, LET, IF, ELSE, WHILE, FOR, RETURN,
    STRUCT, ENUM, IMPL, TRAIT,
    TRUE, FALSE, NULL_LIT,

    // Types
    INT, FLOAT_TYPE, BOOL, VOID, STRING_TYPE,

    // Operators
    PLUS, MINUS, STAR, SLASH, PERCENT,
    EQUAL, EQUAL_EQUAL, BANG_EQUAL,
    LESS, LESS_EQUAL, GREATER, GREATER_EQUAL,
    AMPERSAND_AMPERSAND, PIPE_PIPE, BANG,

    // Delimiters
    LPAREN, RPAREN, LBRACE, RBRACE, LBRACKET, RBRACKET,
    COMMA, SEMICOLON, COLON, ARROW, DOT,

    // Special
    END_OF_FILE, ERROR
};

struct Token {
    TokenType type;
    std::string lexeme;
    std::variant<int64_t, double, std::string> literal;
    int line;
    int column;
};

AST Node Hierarchy

// Base classes
struct ASTNode {
    SourceLocation loc;
    virtual ~ASTNode() = default;
    virtual void accept(ASTVisitor& visitor) = 0;
};

struct Expr : ASTNode {
    Type* resolvedType = nullptr;  // Set during type checking
};

struct Stmt : ASTNode {};

struct Decl : ASTNode {};

// Expression nodes
struct IntLiteral : Expr {
    int64_t value;
};

struct FloatLiteral : Expr {
    double value;
};

struct StringLiteral : Expr {
    std::string value;
};

struct BoolLiteral : Expr {
    bool value;
};

struct VarExpr : Expr {
    std::string name;
    Symbol* resolvedSymbol = nullptr;
};

struct BinaryExpr : Expr {
    std::unique_ptr<Expr> left;
    Token op;
    std::unique_ptr<Expr> right;
};

struct UnaryExpr : Expr {
    Token op;
    std::unique_ptr<Expr> operand;
};

struct CallExpr : Expr {
    std::unique_ptr<Expr> callee;
    std::vector<std::unique_ptr<Expr>> arguments;
};

struct AssignExpr : Expr {
    std::string name;
    std::unique_ptr<Expr> value;
};

// Statement nodes
struct ExprStmt : Stmt {
    std::unique_ptr<Expr> expression;
};

struct BlockStmt : Stmt {
    std::vector<std::unique_ptr<Stmt>> statements;
};

struct IfStmt : Stmt {
    std::unique_ptr<Expr> condition;
    std::unique_ptr<Stmt> thenBranch;
    std::unique_ptr<Stmt> elseBranch;  // nullable
};

struct WhileStmt : Stmt {
    std::unique_ptr<Expr> condition;
    std::unique_ptr<Stmt> body;
};

struct ReturnStmt : Stmt {
    std::unique_ptr<Expr> value;  // nullable
};

// Declaration nodes
struct VarDecl : Decl {
    std::string name;
    Type* typeAnnotation;  // nullable (inference)
    std::unique_ptr<Expr> initializer;
    bool isMutable;
};

struct Parameter {
    std::string name;
    Type* type;
};

struct FnDecl : Decl {
    std::string name;
    std::vector<Parameter> parameters;   // Parameter must be declared first
    Type* returnType;
    std::unique_ptr<BlockStmt> body;
};

4.3 Data Structures

Type System

// Type representation
enum class TypeKind {
    PRIMITIVE,    // int, float, bool, void
    FUNCTION,     // (args) -> return
    STRUCT,       // { field: type, ... }
    ARRAY,        // [type]
    POINTER,      // *type
    REFERENCE     // &type
};

struct Type {
    TypeKind kind;

    virtual ~Type() = default;
    virtual bool equals(const Type& other) const = 0;
    virtual std::string toString() const = 0;
};

struct PrimitiveType : Type {
    enum Primitive { INT, FLOAT, BOOL, VOID, STRING };
    Primitive primitive;
};

struct FunctionType : Type {
    std::vector<Type*> paramTypes;
    Type* returnType;
};

struct StructType : Type {
    std::string name;
    std::vector<std::pair<std::string, Type*>> fields;
};

// Type context (owns all types, ensures uniqueness)
class TypeContext {
    std::vector<std::unique_ptr<Type>> types;

public:
    PrimitiveType* getIntType();
    PrimitiveType* getFloatType();
    PrimitiveType* getBoolType();
    PrimitiveType* getVoidType();
    FunctionType* getFunctionType(std::vector<Type*> params, Type* ret);
};
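
One way TypeContext can guarantee a single canonical object per primitive type is to cache the first instance it creates. A sketch, which assumes TypeContext gains one cached pointer per primitive (e.g. a PrimitiveType* intType member) and that PrimitiveType and FunctionType override the pure virtual equals()/toString():

// Sketch: hand out the same PrimitiveType* forever so that pointer
// equality doubles as type equality for primitives.
PrimitiveType* TypeContext::getIntType() {
    if (!intType) {                           // intType: assumed cached member
        auto t = std::make_unique<PrimitiveType>();
        t->kind = TypeKind::PRIMITIVE;
        t->primitive = PrimitiveType::INT;
        intType = t.get();
        types.push_back(std::move(t));        // TypeContext owns every Type
    }
    return intType;
}

FunctionType* TypeContext::getFunctionType(std::vector<Type*> params, Type* ret) {
    auto t = std::make_unique<FunctionType>();
    t->kind = TypeKind::FUNCTION;
    t->paramTypes = std::move(params);
    t->returnType = ret;
    FunctionType* raw = t.get();
    types.push_back(std::move(t));            // Uniquing function types is optional
    return raw;
}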

Symbol Table

struct Symbol {
    std::string name;
    Type* type;
    enum Kind { VARIABLE, FUNCTION, PARAMETER, TYPE };
    Kind kind;
    SourceLocation declLocation;
    bool isInitialized = false;
    bool isMutable = true;
};

class Scope {
    std::unordered_map<std::string, Symbol> symbols;
    Scope* parent;

public:
    Scope(Scope* parent = nullptr) : parent(parent) {}

    bool declare(const std::string& name, Symbol sym);
    Symbol* lookup(const std::string& name);      // Current scope only
    Symbol* resolve(const std::string& name);     // Up scope chain
};

class SymbolTable {
    std::vector<std::unique_ptr<Scope>> scopes;
    Scope* current;

public:
    void pushScope();
    void popScope();
    bool declare(const std::string& name, Symbol sym);
    Symbol* resolve(const std::string& name);
};

4.4 Algorithm Overview

Type Checking Algorithm

+-----------------------------------------------------------------------+
|                    TYPE CHECKING ALGORITHM                             |
+-----------------------------------------------------------------------+
|                                                                        |
|  For each node in AST (post-order traversal):                          |
|                                                                        |
|  IntLiteral      -> type = int                                         |
|  FloatLiteral    -> type = float                                       |
|  BoolLiteral     -> type = bool                                        |
|  StringLiteral   -> type = string                                      |
|                                                                        |
|  VarExpr         -> lookup in symbol table, type = symbol.type         |
|                                                                        |
|  BinaryExpr:                                                           |
|    1. Check left and right types                                       |
|    2. Verify operator valid for types                                  |
|    3. Determine result type                                            |
|       +, -, *, /  : (int, int) -> int, (float, float) -> float         |
|       <, >, <=, >=: (numeric, numeric) -> bool                         |
|       ==, !=     : (T, T) -> bool                                      |
|       &&, ||     : (bool, bool) -> bool                                |
|                                                                        |
|  UnaryExpr:                                                            |
|    !  : bool -> bool                                                   |
|    -  : numeric -> numeric                                             |
|                                                                        |
|  CallExpr:                                                             |
|    1. Get function type                                                |
|    2. Check argument count matches                                     |
|    3. Check each argument type matches parameter type                  |
|    4. Result type = function's return type                             |
|                                                                        |
|  AssignExpr:                                                           |
|    1. Lookup variable                                                  |
|    2. Check mutability                                                 |
|    3. Check value type matches variable type                           |
|                                                                        |
|  IfStmt:                                                               |
|    1. Condition must be bool                                           |
|    2. Check then/else branches                                         |
|                                                                        |
|  WhileStmt:                                                            |
|    1. Condition must be bool                                           |
|    2. Check body                                                       |
|                                                                        |
|  ReturnStmt:                                                           |
|    1. Get enclosing function's return type                             |
|    2. Check value type matches (or void if no value)                   |
|                                                                        |
|  VarDecl:                                                              |
|    1. If type annotation, use it                                       |
|    2. Else infer from initializer                                      |
|    3. Add to symbol table                                              |
|                                                                        |
|  FnDecl:                                                               |
|    1. Create function type                                             |
|    2. Add to symbol table                                              |
|    3. Push scope, add parameters                                       |
|    4. Check body                                                       |
|    5. Verify all paths return (or void)                                |
|                                                                        |
+-----------------------------------------------------------------------+
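
To give a flavor of how the table above becomes code, here is a sketch of the BinaryExpr case. It assumes a TypeChecker visitor holding a TypeContext named types, plus error() and isNumeric() helpers; none of these names are prescribed.

// Sketch of the BinaryExpr rule from the table above.
void TypeChecker::visitBinaryExpr(BinaryExpr& e) {
    e.left->accept(*this);                    // Post-order: check operands first
    e.right->accept(*this);
    Type* lhs = e.left->resolvedType;
    Type* rhs = e.right->resolvedType;

    switch (e.op.type) {
    case TokenType::PLUS: case TokenType::MINUS:
    case TokenType::STAR: case TokenType::SLASH:
        // Arithmetic: identical numeric operands -> same type
        if (lhs && rhs && lhs->equals(*rhs) && isNumeric(lhs))
            e.resolvedType = lhs;
        else
            error(e.loc, "invalid operands to arithmetic operator");
        break;
    case TokenType::EQUAL_EQUAL: case TokenType::BANG_EQUAL:
        // Equality: same type on both sides -> bool
        if (lhs && rhs && lhs->equals(*rhs))
            e.resolvedType = types.getBoolType();
        else
            error(e.loc, "cannot compare values of different types");
        break;
    case TokenType::AMPERSAND_AMPERSAND: case TokenType::PIPE_PIPE:
        // Logical: bool op bool -> bool
        if (lhs == types.getBoolType() && rhs == types.getBoolType())
            e.resolvedType = types.getBoolType();
        else
            error(e.loc, "logical operators require bool operands");
        break;
    default:
        // <, >, <=, >= follow the same pattern with a bool result
        break;
    }
}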

5. Implementation Guide

5.1 Development Environment Setup

# Install LLVM (optional, for code generation)
# Ubuntu/Debian
sudo apt install llvm-15-dev clang-15

# macOS
brew install llvm@15

# Create project
mkdir -p my-compiler/{src,include,tests}
cd my-compiler

# CMakeLists.txt
cat > CMakeLists.txt << 'EOF'
cmake_minimum_required(VERSION 3.16)
project(MyCompiler CXX)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

# Optional LLVM
find_package(LLVM 15 CONFIG)
if(LLVM_FOUND)
    message(STATUS "Found LLVM ${LLVM_PACKAGE_VERSION}")
    add_definitions(-DHAS_LLVM)
    include_directories(${LLVM_INCLUDE_DIRS})
    link_directories(${LLVM_LIBRARY_DIRS})
endif()

add_executable(mycompiler
    src/main.cpp
    src/lexer.cpp
    src/parser.cpp
    src/ast.cpp
    src/types.cpp
    src/semantic.cpp
    src/codegen.cpp
    src/interpreter.cpp
)

if(LLVM_FOUND)
    llvm_map_components_to_libnames(LLVM_LIBS core native)
    target_link_libraries(mycompiler ${LLVM_LIBS})
endif()
EOF

5.2 Project Structure

my-compiler/
├── CMakeLists.txt
├── include/
│   ├── lexer.hpp
│   ├── token.hpp
│   ├── parser.hpp
│   ├── ast.hpp
│   ├── types.hpp
│   ├── semantic.hpp
│   ├── codegen.hpp
│   ├── interpreter.hpp
│   └── error.hpp
├── src/
│   ├── main.cpp
│   ├── lexer.cpp
│   ├── parser.cpp
│   ├── ast.cpp
│   ├── types.cpp
│   ├── semantic.cpp
│   ├── codegen.cpp
│   └── interpreter.cpp
├── tests/
│   ├── lexer_test.cpp
│   ├── parser_test.cpp
│   ├── semantic_test.cpp
│   └── integration/
│       ├── factorial.my
│       ├── fibonacci.my
│       └── ...
└── examples/
    ├── hello.my
    └── ...

5.3 The Core Question You’re Answering

How do you transform human-readable source code into executable machine instructions while catching errors and preserving meaning?

This encompasses:

  • Tokenizing unstructured text
  • Building structured representations
  • Verifying semantic correctness
  • Translating to target representation

5.4 Concepts You Must Understand First

Before implementing, verify you can answer:

  1. What is the Chomsky hierarchy and where do programming languages fit?
    • Reference: “Engineering a Compiler” Chapter 3
  2. How does recursive descent parsing work?
    • Reference: “Crafting Interpreters” Chapter 6
  3. What is the difference between syntax and semantics?
    • Reference: “Types and Programming Languages” Chapter 1
  4. How do scopes and closures work?
    • Reference: “Engineering a Compiler” Chapter 5
  5. What is LLVM IR and how is it structured?
    • Reference: LLVM Language Reference Manual

5.5 Questions to Guide Your Design

Lexer Design:

  • How will you handle Unicode in identifiers and strings?
  • How will you track line and column numbers?
  • How will you handle unterminated strings or invalid characters?

Parser Design:

  • How will you handle operator precedence?
  • How will you recover from syntax errors?
  • How will you represent optional AST children?

Type System:

  • How will you represent function types?
  • How will you handle type inference?
  • How will you report type errors helpfully?

Code Generation:

  • How will you handle local variables in LLVM IR?
  • How will you implement control flow?
  • How will you call external functions (like print)?

5.6 Thinking Exercise

Before coding, trace through compilation of:

fn max(a: int, b: int) -> int {
    if a > b {
        return a;
    }
    return b;
}

Draw:

  1. Token stream
  2. AST structure
  3. Symbol table after parsing
  4. Type annotations on AST nodes
  5. LLVM IR (or pseudocode)

5.7 Hints in Layers

Hint 1 - Getting Started: Start with the lexer. Get it fully working before moving to the parser. Test with simple inputs first.

Hint 2 - Parser Structure: Each grammar rule becomes a method. Parse expressions bottom-up (highest precedence first), statements top-down.

Hint 3 - Type Checking: Use the Visitor pattern. Each AST node type has a corresponding check method that returns the node’s type.

Hint 4 - LLVM IR: Study the Kaleidoscope tutorial. Create a module, function, basic blocks, then emit instructions.
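
For orientation, the shape of LLVM's C++ API looks roughly like the following sketch, which builds and prints IR for an add(i32, i32) function. Treat it as a starting point only and check it against the Kaleidoscope tutorial and your installed LLVM version (it also needs to be linked against the LLVM libraries, as in the CMake setup above).

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/raw_ostream.h"

int main() {
    llvm::LLVMContext ctx;
    llvm::Module module("demo", ctx);          // One module per compilation unit
    llvm::IRBuilder<> builder(ctx);

    // Declare: i32 @add(i32, i32)
    auto* i32 = builder.getInt32Ty();
    auto* fnType = llvm::FunctionType::get(i32, {i32, i32}, /*isVarArg=*/false);
    auto* fn = llvm::Function::Create(fnType, llvm::Function::ExternalLinkage,
                                      "add", &module);

    // Every function body is a graph of basic blocks; start with "entry".
    auto* entry = llvm::BasicBlock::Create(ctx, "entry", fn);
    builder.SetInsertPoint(entry);

    // Emit: %sum = add i32 %a, %b ; ret i32 %sum
    llvm::Value* sum = builder.CreateAdd(fn->getArg(0), fn->getArg(1), "sum");
    builder.CreateRet(sum);

    llvm::verifyFunction(*fn);                 // Catch malformed IR early
    module.print(llvm::outs(), nullptr);       // Dump the textual IR
    return 0;
}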

5.8 The Interview Questions They’ll Ask

  1. “Explain the phases of compilation.”
    • Expected: Lexing, parsing, semantic analysis, optimization, code generation
  2. “How do you implement operator precedence?”
    • Expected: Pratt parsing or precedence climbing in recursive descent
  3. “What is the difference between static and dynamic typing?”
    • Expected: Static checked at compile time, dynamic at runtime
  4. “How does name resolution work with nested scopes?”
    • Expected: Scope chain, shadowing, capture for closures
  5. “What is SSA form and why is it useful?”
    • Expected: Single Static Assignment, enables optimizations

5.9 Books That Will Help

Topic              Book                                            Chapters
Lexing/Parsing     “Crafting Interpreters” by Robert Nystrom       Parts 1-2
Formal Theory      “Engineering a Compiler” by Cooper & Torczon    Chapters 2-4
Type Systems       “Types and Programming Languages” by Pierce     Chapters 1-11
Semantic Analysis  “Engineering a Compiler” by Cooper & Torczon    Chapters 5-6
Code Generation    LLVM Tutorial (Kaleidoscope)                    All chapters
Optimization       “Engineering a Compiler” by Cooper & Torczon    Chapters 8-10

5.10 Implementation Phases

Phase 1: Lexer (Week 1-2)

  • Token types and structure
  • Character stream processing
  • Keyword recognition
  • Literal parsing (int, float, string)
  • Error handling

Phase 2: Parser (Week 3-5)

  • AST node definitions
  • Recursive descent for expressions
  • Statement parsing
  • Function declarations
  • Error recovery (synchronization)

Phase 3: Semantic Analysis (Week 6-8)

  • Symbol table implementation
  • Scope management
  • Type checking
  • Name resolution
  • Error reporting

Phase 4: Code Generation (Week 9-12)

  • LLVM setup or interpreter
  • Expression evaluation
  • Control flow
  • Function calls
  • Runtime library

Phase 5: Polish (Week 13-16)

  • Better error messages
  • More language features
  • Optimization
  • Testing suite

5.11 Key Implementation Decisions

Decision             Options                              Recommendation
Backend              LLVM, interpreter, custom bytecode   Start with interpreter, add LLVM
AST ownership        unique_ptr, shared_ptr, arena        unique_ptr for simplicity
Error handling       Exceptions, result type              Return error type from the parser
Type representation  Inheritance, variant                 Inheritance with TypeKind enum
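
Since the table recommends starting with a tree-walking interpreter, here is a sketch of its core idea: evaluate each AST node recursively and return a runtime value. The Value alias and the environment.get() call are illustrative; the semantic pass is assumed to have already rejected ill-typed operands.

#include <cstdint>
#include <stdexcept>
#include <string>
#include <variant>

// Illustrative runtime value for the interpreter backend.
using Value = std::variant<int64_t, double, bool, std::string>;

// Sketch: recursive evaluation of expression nodes.
Value Interpreter::evaluate(Expr& expr) {
    if (auto* lit = dynamic_cast<IntLiteral*>(&expr)) {
        return lit->value;
    }
    if (auto* var = dynamic_cast<VarExpr*>(&expr)) {
        return environment.get(var->name);        // Runtime scope lookup (assumed)
    }
    if (auto* bin = dynamic_cast<BinaryExpr*>(&expr)) {
        Value lhs = evaluate(*bin->left);
        Value rhs = evaluate(*bin->right);
        switch (bin->op.type) {
        case TokenType::PLUS:
            return std::get<int64_t>(lhs) + std::get<int64_t>(rhs);
        case TokenType::STAR:
            return std::get<int64_t>(lhs) * std::get<int64_t>(rhs);
        case TokenType::LESS_EQUAL:
            return std::get<int64_t>(lhs) <= std::get<int64_t>(rhs);
        default:
            break;  // The remaining operators follow the same pattern
        }
    }
    throw std::runtime_error("unhandled expression node");
}

Dispatching through the ASTVisitor instead of dynamic_cast works just as well and keeps the interpreter symmetric with the type checker and code generator.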

6. Testing Strategy

Unit Tests

// Lexer tests
TEST(Lexer, TokenizesKeywords) {
    Lexer lexer("fn let if else while return");
    EXPECT_EQ(lexer.nextToken().type, TokenType::FN);
    EXPECT_EQ(lexer.nextToken().type, TokenType::LET);
    EXPECT_EQ(lexer.nextToken().type, TokenType::IF);
    EXPECT_EQ(lexer.nextToken().type, TokenType::ELSE);
    EXPECT_EQ(lexer.nextToken().type, TokenType::WHILE);
    EXPECT_EQ(lexer.nextToken().type, TokenType::RETURN);
}

TEST(Lexer, TokenizesNumbers) {
    Lexer lexer("42 3.14 0xFF");
    auto t1 = lexer.nextToken();
    EXPECT_EQ(t1.type, TokenType::INTEGER);
    EXPECT_EQ(std::get<int64_t>(t1.literal), 42);

    auto t2 = lexer.nextToken();
    EXPECT_EQ(t2.type, TokenType::FLOAT);
    EXPECT_DOUBLE_EQ(std::get<double>(t2.literal), 3.14);
}

// Parser tests
TEST(Parser, ParsesBinaryExpr) {
    Parser parser("1 + 2 * 3");
    auto expr = parser.parseExpression();

    // Should be 1 + (2 * 3), not (1 + 2) * 3
    auto* binOp = dynamic_cast<BinaryExpr*>(expr.get());
    ASSERT_NE(binOp, nullptr);
    EXPECT_EQ(binOp->op.type, TokenType::PLUS);

    auto* right = dynamic_cast<BinaryExpr*>(binOp->right.get());
    ASSERT_NE(right, nullptr);
    EXPECT_EQ(right->op.type, TokenType::STAR);
}

// Type checker tests
TEST(TypeChecker, TypeMismatch) {
    auto ast = parse("let x: int = \"hello\";");
    TypeChecker checker;
    auto errors = checker.check(ast);

    EXPECT_EQ(errors.size(), 1);
    EXPECT_NE(errors[0].message.find("type mismatch"), std::string::npos);
}

Integration Tests

#!/bin/bash
# test_programs.sh

for file in tests/integration/*.my; do
    expected="${file%.my}.expected"
    if ./mycompiler "$file" --interpret > /tmp/output.txt 2>&1; then
        if diff -q /tmp/output.txt "$expected" > /dev/null; then
            echo "PASS: $file"
        else
            echo "FAIL: $file (output mismatch)"
            diff /tmp/output.txt "$expected"
        fi
    else
        echo "FAIL: $file (compilation error)"
    fi
done

7. Common Pitfalls & Debugging

Problem         Symptom         Root Cause                  Fix
Left recursion  Stack overflow  Grammar has left recursion  Rewrite the rule as iteration
Dangling else   Wrong parse     Ambiguous grammar           Bind else to the nearest if
Off-by-one      Wrong location  Line/col tracking error     Test with multi-line input
Memory leak     Growing memory  AST nodes not freed         Use unique_ptr ownership
Type confusion  Wrong codegen   AST type field not set      Verify types after the semantic pass

Debugging Techniques

// AST pretty printer
class ASTPrinter : public ASTVisitor {
    int indent = 0;

    void print(const std::string& s) {
        std::cout << std::string(indent * 2, ' ') << s << '\n';
    }

public:
    void visitBinaryExpr(BinaryExpr& e) override {
        print("BinaryExpr " + e.op.lexeme);
        indent++;
        e.left->accept(*this);
        e.right->accept(*this);
        indent--;
    }
    // ... other visit methods
};

// Use: ASTPrinter printer; ast->accept(printer);

8. Extensions & Challenges

After completing the base project:

  1. Structs and methods: User-defined types with associated functions
  2. Generics: Type parameters like fn identity<T>(x: T) -> T
  3. Closures: Anonymous functions that capture variables
  4. Pattern matching: match expressions
  5. Modules: Import system with namespaces
  6. Macros: Hygienic compile-time code generation
  7. Incremental compilation: Only recompile changed files
  8. LSP server: IDE support (autocomplete, go-to-definition)
  9. Debugging info: DWARF emission for gdb/lldb
  10. Optimization passes: Dead code elimination, constant folding

9. Real-World Connections

Your Implementation  Production Compiler
Hand-written lexer   Clang’s lexer
Recursive descent    GCC, Clang, Rust
Symbol tables        All compilers
Type checking        TypeScript, Rust, Swift
LLVM IR              Clang, Rust, Swift, Julia

Industry Examples:

  • Clang/LLVM: C/C++ frontend used by Apple, Google
  • Rust: Advanced type system with borrow checker
  • TypeScript: Type inference and structural typing
  • Swift: Protocol-oriented with value semantics

10. Resources

Academic Resources

  • “Engineering a Compiler” by Cooper & Torczon
  • “Types and Programming Languages” by Benjamin Pierce
  • “Modern Compiler Implementation in ML” by Appel

11. Self-Assessment Checklist

Before considering this project complete:

  • Can you explain the difference between lexical and syntactic analysis?
  • Can you trace token stream for any input?
  • Can you draw AST for complex expressions?
  • Can you explain how scoping works?
  • Can you describe the type checking algorithm?
  • Does your compiler produce correct output for all test cases?
  • Do error messages include line/column numbers?
  • Can you add a new operator in under 30 minutes?

12. Submission / Completion Criteria

Your project is complete when:

  1. Lexer Is Complete
    • Recognizes all token types
    • Handles edge cases (unterminated strings, etc.)
    • Reports errors with location
  2. Parser Is Complete
    • Parses all language constructs
    • Builds correct AST
    • Recovers from errors
  3. Semantic Analysis Works
    • Name resolution finds all symbols
    • Type checking catches errors
    • All AST nodes have types
  4. Code Generation Works
    • Produces executable code
    • Handles all expressions and statements
    • Functions call correctly
  5. Test Suite Passes
    • Unit tests for each phase
    • Integration tests for complete programs
    • Error case tests

Deliverables:

  • Source code with clear organization
  • Test suite with good coverage
  • Example programs demonstrating features
  • Documentation of language syntax