Project 13: Compiler Frontend
Build a lexer and parser for a small language and output an AST.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 4-6 weeks |
| Language | C |
| Prerequisites | Parsing, data structures |
| Key Topics | Lexing, parsing, AST, semantics |
1. Learning Objectives
By completing this project, you will:
- Implement a tokenizer for identifiers, numbers, and symbols.
- Build a recursive-descent parser for a small grammar.
- Construct an AST with node types.
- Add basic semantic checks (undefined variables, types).
2. Theoretical Foundation
2.1 Core Concepts
- Lexing: Convert raw text into tokens.
- Parsing: Convert tokens into a syntax tree.
- AST: Tree structure representing program meaning.
2.2 Why This Matters
Compilers teach disciplined parsing, tree manipulation, and error reporting. These skills apply to interpreters, configs, and DSLs.
2.3 Historical Context / Background
Classic compiler frontends follow the lexer-parser-AST pipeline. Many modern tools still use these foundations.
2.4 Common Misconceptions
- “Parsing is just splitting”: Grammar structure matters.
- “AST equals parse tree”: AST is usually simplified.
3. Project Specification
3.1 What You Will Build
A frontend for a tiny language:
let x = 3 + 4 * 2;
print x;
Features: variables, arithmetic, assignment, print.
3.2 Functional Requirements
- Tokenize identifiers, numbers, operators, and keywords.
- Parse expressions with precedence.
- Build an AST for statements.
- Report syntax errors with line/column.
3.3 Non-Functional Requirements
- Reliability: Clear error messages.
- Usability: Print AST in readable form.
- Maintainability: Clean separation of lexer and parser.
3.4 Example Usage / Output
$ ./frontend program.lang
(AST)
Assign(x,
  Add(Number(3),
    Mul(Number(4), Number(2))))
Print(Var(x))
3.5 Real World Outcome
You can parse a small language into a structured AST and explain how a compiler processes source code.
4. Solution Architecture
4.1 High-Level Design
source -> lexer -> tokens -> parser -> AST -> printer
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Lexer | Tokenize input | Track line/col |
| Parser | Build AST | Recursive descent |
| AST nodes | Represent language | Tagged union |
| Error reporter | Syntax errors | Include context |
4.3 Data Structures
typedef enum {
    TOK_NUM, TOK_ID,                          /* literals and identifiers */
    TOK_PLUS, TOK_MINUS, TOK_STAR, TOK_SLASH, /* arithmetic operators */
    TOK_EQUALS, TOK_SEMI,                     /* '=' and ';' */
    TOK_LET, TOK_PRINT, TOK_EOF               /* keywords and end of input */
} TokenType;
typedef struct {
    TokenType type;
    char *lexeme;   /* heap-owned copy of the token's source text */
    int line;
    int col;
} Token;
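The components table above lists the AST nodes as a tagged union. A minimal sketch of what that could look like for this language; the node kinds and field names are illustrative, not prescribed by the project:

```c
/* Sketch of an AST node as a tagged union; names are illustrative. */
typedef enum { NODE_NUM, NODE_VAR, NODE_BINOP, NODE_LET, NODE_PRINT } NodeKind;

typedef struct Node {
    NodeKind kind;
    union {
        long num;                                             /* NODE_NUM   */
        char *name;                                           /* NODE_VAR   */
        struct { char op; struct Node *lhs, *rhs; } binop;    /* NODE_BINOP */
        struct { char *name; struct Node *value; } let;       /* NODE_LET   */
        struct Node *print_expr;                              /* NODE_PRINT */
    } as;
} Node;
```

A tagged union keeps every node the same size and makes exhaustive switch statements over `kind` natural.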
4.4 Algorithm Overview
Key Algorithm: Recursive descent
- `parse_expr` handles addition and subtraction.
- `parse_term` handles multiplication and division.
- `parse_factor` handles numbers and identifiers.
Complexity Analysis:
- Time: O(n) tokens
- Space: O(n) AST size
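Here is a hedged sketch of these three functions over a buffered token array, assuming the `Token` definitions and the `Node` sketch from section 4.3; parser state is reduced to two globals and error handling is elided:

```c
#include <stdlib.h>

static Token *toks;   /* token array filled by the lexer */
static size_t pos;    /* index of the next unconsumed token */

static Token peek(void)    { return toks[pos]; }
static Token advance(void) { return toks[pos++]; }

static Node *parse_term(void);
static Node *parse_factor(void);

/* Small AST constructors over the Node sketch in section 4.3. */
static Node *new_node(NodeKind k) {
    Node *n = calloc(1, sizeof *n);
    n->kind = k;
    return n;
}
static Node *new_num(long v)     { Node *n = new_node(NODE_NUM); n->as.num = v; return n; }
static Node *new_var(char *name) { Node *n = new_node(NODE_VAR); n->as.name = name; return n; }
static Node *new_binop(char op, Node *l, Node *r) {
    Node *n = new_node(NODE_BINOP);
    n->as.binop.op = op; n->as.binop.lhs = l; n->as.binop.rhs = r;
    return n;
}

/* expr -> term (('+' | '-') term)* */
static Node *parse_expr(void) {
    Node *left = parse_term();
    while (peek().type == TOK_PLUS || peek().type == TOK_MINUS) {
        char op = advance().type == TOK_PLUS ? '+' : '-';
        left = new_binop(op, left, parse_term());
    }
    return left;
}

/* term -> factor (('*' | '/') factor)* */
static Node *parse_term(void) {
    Node *left = parse_factor();
    while (peek().type == TOK_STAR || peek().type == TOK_SLASH) {
        char op = advance().type == TOK_STAR ? '*' : '/';
        left = new_binop(op, left, parse_factor());
    }
    return left;
}

/* factor -> NUMBER | IDENTIFIER (parentheses are listed as an extension) */
static Node *parse_factor(void) {
    Token t = advance();
    if (t.type == TOK_NUM) return new_num(strtol(t.lexeme, NULL, 10));
    if (t.type == TOK_ID)  return new_var(t.lexeme);
    return NULL;  /* syntax error reporting elided in this sketch */
}
```

Notice that each level of precedence gets its own function, and higher-precedence operators sit deeper in the call chain.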
5. Implementation Guide
5.1 Development Environment Setup
cc -Wall -Wextra -O2 -g -o frontend src/lexer.c src/parser.c src/ast.c src/frontend.c
5.2 Project Structure
frontend/
├── src/
│ ├── lexer.c
│ ├── parser.c
│ ├── ast.c
│ └── frontend.c
├── tests/
│ └── test_frontend.sh
└── README.md
5.3 The Core Question You’re Answering
“How do I turn raw source text into structured meaning?”
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Token types
  - What is a token, and why separate lexing from parsing?
- Precedence
  - Why does `*` bind tighter than `+`?
- AST design
  - How do you represent statements and expressions?
5.5 Questions to Guide Your Design
Before implementing, think through these:
- Will you use a single token buffer or streaming lexer?
- How will you store string lexemes safely? (a sketch follows this list)
- What errors should halt parsing?
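For the lexeme question above, one safe approach is to copy each token's slice of the source into its own heap buffer, so tokens remain valid even after the source buffer is freed. A minimal sketch (the helper name is my own):

```c
#include <stdlib.h>
#include <string.h>

/* Copy `len` bytes of source text into a fresh, NUL-terminated heap buffer. */
static char *copy_lexeme(const char *start, size_t len) {
    char *s = malloc(len + 1);
    if (!s) return NULL;            /* caller decides how to handle allocation failure */
    memcpy(s, start, len);
    s[len] = '\0';
    return s;
}
```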
5.6 Thinking Exercise
Grammar Design
Write the grammar rules for expr, term, and factor. How do you prevent left recursion?
5.7 The Interview Questions They’ll Ask
Prepare to answer these:
- “What is the difference between lexing and parsing?”
- “How do you implement operator precedence?”
- “How do you represent AST nodes in C?”
5.8 Hints in Layers
Hint 1: Implement the lexer first. Make sure tokens are correct before parsing.
Hint 2: Parse expressions only at first. Add statements later.
Hint 3: Add error recovery. Skip to the next semicolon on a syntax error.
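Hint 3 is usually implemented as panic-mode recovery: report the error, then discard tokens until a statement boundary so later statements can still be checked. A sketch, assuming the `peek`/`advance` helpers and the `TOK_SEMI`/`TOK_EOF` tokens used in the earlier sketches:

```c
/* Panic-mode recovery: skip past the next ';' (or stop at end of input),
 * then resume parsing at the following statement. */
static void synchronize(void) {
    while (peek().type != TOK_EOF) {
        if (advance().type == TOK_SEMI)
            return;
    }
}
```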
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Parsing | “Crafting Interpreters” | Ch. 5-8 |
| Compilers | “Engineering a Compiler” | Frontend chapters |
5.10 Implementation Phases
Phase 1: Foundation (7-10 days)
Goals:
- Lexer and tokens
Tasks:
- Tokenize identifiers, numbers, operators.
Checkpoint: Token stream matches input.
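If it helps to see the shape of this phase, here is a hedged sketch of a single scanning step; `copy_lexeme()` is the hypothetical helper from section 5.5, and keyword promotion (`let`/`print`) plus error reporting are left as Phase 1 work:

```c
#include <ctype.h>
#include <stddef.h>

/* One scanning step: skip whitespace, then return the next token.
 * Assumes the Token/TokenType definitions from section 4.3. */
static Token next_token(const char **src, int *line, int *col) {
    for (;;) {                                   /* skip whitespace, tracking position */
        if (**src == ' ' || **src == '\t') { (*src)++; (*col)++; }
        else if (**src == '\n')            { (*src)++; (*line)++; *col = 1; }
        else break;
    }
    Token t = {0};
    t.line = *line;
    t.col  = *col;
    const char *start = *src;
    if (isdigit((unsigned char)**src)) {
        while (isdigit((unsigned char)**src)) { (*src)++; (*col)++; }
        t.type = TOK_NUM;
    } else if (isalpha((unsigned char)**src) || **src == '_') {
        while (isalnum((unsigned char)**src) || **src == '_') { (*src)++; (*col)++; }
        t.type = TOK_ID;     /* promote to TOK_LET/TOK_PRINT by comparing the lexeme */
    } else {
        switch (**src) {
        case '+':  t.type = TOK_PLUS;   break;
        case '-':  t.type = TOK_MINUS;  break;
        case '*':  t.type = TOK_STAR;   break;
        case '/':  t.type = TOK_SLASH;  break;
        case '=':  t.type = TOK_EQUALS; break;
        case ';':  t.type = TOK_SEMI;   break;
        case '\0': t.type = TOK_EOF;    return t;
        default:   t.type = TOK_EOF;    return t;   /* report the illegal character here */
        }
        (*src)++; (*col)++;
    }
    t.lexeme = copy_lexeme(start, (size_t)(*src - start));
    return t;
}
```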
Phase 2: Core Functionality (10-14 days)
Goals:
- Expression parsing and AST
Tasks:
- Implement recursive descent for expressions.
- Build AST nodes.
Checkpoint: AST prints correctly.
Phase 3: Statements and Semantics (7-10 days)
Goals:
- Variables and statements
Tasks:
- Add `let` and `print` statements.
- Track variables and validate usage.
Checkpoint: Errors for undefined variables.
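The undefined-variable check only needs to remember which names a `let` has declared. A minimal sketch using a linked list (a hash table is unnecessary at this scale; the names here are illustrative):

```c
#include <stdlib.h>
#include <string.h>

/* Tiny symbol table: a linked list of names declared with `let`. */
typedef struct Sym {
    char *name;
    struct Sym *next;
} Sym;

static Sym *symbols = NULL;

static void declare(const char *name) {
    Sym *s = malloc(sizeof *s);
    if (!s) return;
    s->name = strdup(name);         /* strdup is POSIX; copy manually if unavailable */
    s->next = symbols;
    symbols = s;
}

static int is_declared(const char *name) {
    for (Sym *s = symbols; s; s = s->next)
        if (strcmp(s->name, name) == 0) return 1;
    return 0;                       /* report "undefined variable" at the use site */
}
```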
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Parser | Recursive descent vs parser generator | Recursive descent | Educational |
| AST storage | Heap nodes vs arena | Heap nodes | Simpler to implement |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Lexer Tests | Token correctness | Numbers, identifiers |
| Parser Tests | AST shape | Expression precedence |
| Error Tests | Syntax errors | Missing semicolon |
6.2 Critical Test Cases
- Operator precedence: `3 + 4 * 2` parses as `3 + (4 * 2)` (a test sketch follows this list).
- Bad tokens: Illegal characters produce errors.
- Undefined variable: Proper error message.
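The precedence case can be automated by asserting on the AST shape directly. A sketch, assuming the parser and `Node` sketches from earlier sections and a hypothetical `lex_string()` that fills the token buffer:

```c
#include <assert.h>

/* Checks that '*' binds tighter than '+': the root must be an Add node
 * with a Mul node as its right child. */
static void test_precedence(void) {
    lex_string("3 + 4 * 2;");       /* hypothetical: tokenizes into toks/pos */
    Node *n = parse_expr();
    assert(n->kind == NODE_BINOP && n->as.binop.op == '+');
    assert(n->as.binop.rhs->kind == NODE_BINOP && n->as.binop.rhs->as.binop.op == '*');
}
```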
6.3 Test Data
let x = 1 + 2 * 3;
print x;
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Left recursion | Infinite recursion | Refactor grammar (see sketch below) |
| Token buffer misuse | Skipped tokens | Implement peek/advance |
| Leaking AST nodes | Memory leaks | Free AST on exit |
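The left-recursion pitfall is easiest to see in code. A literal translation of the left-recursive rule `expr -> expr '+' term` recurses before consuming any token; rewriting it as `expr -> term ('+' term)*` and looping fixes it (names follow the parser sketch in section 4.4):

```c
/* BROKEN: literal translation of the left-recursive rule expr -> expr '+' term.
 * The function recurses before consuming anything, so it never terminates. */
static Node *parse_expr_broken(void) {
    Node *left = parse_expr_broken();   /* immediate infinite recursion */
    /* ... never reached ... */
    return left;
}

/* FIXED: rewrite the rule as expr -> term ('+' term)* and iterate instead. */
static Node *parse_expr_fixed(void) {
    Node *left = parse_term();
    while (peek().type == TOK_PLUS) {
        advance();
        left = new_binop('+', left, parse_term());
    }
    return left;
}
```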
7.2 Debugging Strategies
- Print tokens before parsing.
- Print AST with indentation.
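An indented AST printer is a short recursive function. A sketch over the tagged-union `Node` from section 4.3; the output format loosely follows the sample in section 3.4:

```c
#include <stdio.h>

/* Print one node per line, indented two spaces per level of tree depth. */
static void ast_print(const Node *n, int depth) {
    printf("%*s", depth * 2, "");       /* leading indentation */
    switch (n->kind) {
    case NODE_NUM:
        printf("Number(%ld)\n", n->as.num);
        break;
    case NODE_VAR:
        printf("Var(%s)\n", n->as.name);
        break;
    case NODE_BINOP:
        printf("%s\n", n->as.binop.op == '+' ? "Add" : n->as.binop.op == '-' ? "Sub" :
                       n->as.binop.op == '*' ? "Mul" : "Div");
        ast_print(n->as.binop.lhs, depth + 1);
        ast_print(n->as.binop.rhs, depth + 1);
        break;
    case NODE_LET:
        printf("Assign(%s)\n", n->as.let.name);
        ast_print(n->as.let.value, depth + 1);
        break;
    case NODE_PRINT:
        printf("Print\n");
        ast_print(n->as.print_expr, depth + 1);
        break;
    }
}
```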
7.3 Performance Traps
None critical at this scale; focus on correctness and clarity.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add unary operators.
- Add parentheses in expressions.
8.2 Intermediate Extensions
- Add `if` and `while` statements.
- Implement function calls.
8.3 Advanced Extensions
- Generate bytecode.
- Add type checking for int/bool.
9. Real-World Connections
9.1 Industry Applications
- Compilers: Frontends translate code to IR.
- DSLs: Many config languages use similar parsing.
9.2 Related Open Source Projects
- TinyCC: Small C compiler.
9.3 Interview Relevance
Compiler frontends demonstrate algorithmic reasoning and system design.
10. Resources
10.1 Essential Reading
- “Crafting Interpreters” - Ch. 5-8
- “Engineering a Compiler” - Frontend chapters
10.2 Video Resources
- Compiler construction lectures
10.3 Tools & Documentation
- Flex/Bison docs (for comparison)
10.4 Related Projects in This Series
- Build System: Feeds source files into compilation.
- Debugger: Uses symbol info from compiler output.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain lexing vs parsing.
- I can implement precedence parsing.
- I can design an AST.
11.2 Implementation
- Lexer and parser work correctly.
- AST output matches expectations.
- Errors are clear and precise.
11.3 Growth
- I can generate bytecode.
- I can explain this project in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Lexer + parser for expressions.
Full Completion:
- Statements and AST printing.
Excellence (Going Above & Beyond):
- Bytecode generation and semantic checks.
This guide was generated from C_PROGRAMMING_COMPLETE_MASTERY.md. For the complete learning path, see the parent directory.