LLVM LEARNING PROJECTS
Learning LLVM Through Real-World Projects
Core Concept Analysis
LLVM is a compiler infrastructure that has revolutionized how compilers are built. To truly understand it, you need to grapple with these fundamental building blocks:
| Concept | What It Is | Why It Matters |
|---|---|---|
| LLVM IR | A typed, SSA-based intermediate representation | The “universal assembly language” that all LLVM-based languages compile to |
| Front-end | Lexing, parsing, AST generation | How source code becomes structured data |
| Optimization Passes | Modular transformations on IR | Where the “magic” of compiler optimization happens |
| Back-end/Code Generation | IR → machine code | How abstract code becomes executable instructions |
| Clang Tooling | APIs for analyzing/transforming C/C++ | Building developer tools on top of industrial-strength parsing |
Project 1: Calculator Language → LLVM IR Compiler
- File: LLVM_LEARNING_PROJECTS.md
- Programming Language: C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Compilers / LLVM
- Software or Tool: LLVM Core
- Main Book: “Writing a C Compiler” by Nora Sandler
What you’ll build: A compiler for a simple expression language that generates LLVM IR and produces a native executable you can run.
Why it teaches LLVM: This is the “hello world” of LLVM—you’ll touch every layer: lexing input, building an AST, emitting LLVM IR via the C++ API, and watching llc turn it into machine code. You’ll understand why LLVM IR exists and how it serves as the bridge between languages and machines.
Core challenges you’ll face:
- Designing a grammar and building a recursive descent parser (maps to front-end architecture)
- Using LLVM’s `IRBuilder` to emit typed IR instructions (maps to LLVM IR semantics); a minimal sketch follows after this list
- Understanding SSA form and why variables work differently in IR (maps to SSA fundamentals)
- Linking with LLVM libraries and managing the build (maps to LLVM toolchain integration)
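To make the `IRBuilder` challenge concrete, here is a minimal sketch, not taken from any particular tutorial and with error handling omitted, that emits IR for the constant expression 5 + 3 * 2 into a `main` function and prints the textual module:

```cpp
// Illustrative sketch: emit IR for 5 + 3 * 2 and print the module as
// textual LLVM IR. Build flags come from
// `llvm-config --cxxflags --ldflags --libs core`.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  llvm::LLVMContext ctx;
  llvm::Module mod("calc", ctx);
  llvm::IRBuilder<> builder(ctx);

  // Create `i32 main()` and position the builder in its entry block.
  auto *fnType = llvm::FunctionType::get(builder.getInt32Ty(), /*isVarArg=*/false);
  auto *fn = llvm::Function::Create(fnType, llvm::Function::ExternalLinkage,
                                    "main", mod);
  builder.SetInsertPoint(llvm::BasicBlock::Create(ctx, "entry", fn));

  // 5 + 3 * 2: constants fold here, but these are the same calls a codegen
  // visitor makes for arbitrary expressions.
  llvm::Value *mul = builder.CreateMul(builder.getInt32(3), builder.getInt32(2), "mul");
  llvm::Value *sum = builder.CreateAdd(builder.getInt32(5), mul, "sum");
  builder.CreateRet(sum);

  mod.print(llvm::outs(), nullptr);  // dump the .ll text to stdout
  return 0;
}
```

Compile it with `clang++` plus the flags from `llvm-config`, run it, and feed the printed IR to `llc` to get native assembly. A real front-end usually handles mutable variables by emitting `alloca`, `load`, and `store` and letting the `mem2reg` pass promote them to SSA registers, which is why variables feel different at the IR level.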
Resources for key challenges:
- “Writing a C Compiler” by Nora Sandler, Ch. 1-3 - Clear progression from lexer to code generation
- LLVM’s Kaleidoscope Tutorial (Ch. 1-3) - The canonical introduction, but better after you’ve struggled a bit
- “Engineering a Compiler” by Cooper & Torczon, Ch. 4 - Deep dive into intermediate representations
Key Concepts:
- Lexical Analysis: “Writing a C Compiler” Ch. 1 - Nora Sandler
- Recursive Descent Parsing: “Compilers: Principles and Practice” Ch. 4 - Parag H. Dave
- SSA Form: “Engineering a Compiler” Ch. 9 - Cooper & Torczon
- LLVM IRBuilder API: LLVM Programmer’s Manual - llvm.org/docs
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Comfortable with C++, basic understanding of what compilers do.
Real world outcome: You’ll write programs like:
let x = 5 + 3 * 2;
print(x);
Compile them with YOUR compiler, and run the resulting binary to see 11 printed to the terminal. You can inspect the generated .ll file to see exactly what IR your compiler produced.
Learning milestones:
- Lexer + Parser working → You understand how source text becomes structured data
- First IR emitted → You grasp LLVM IR syntax and SSA form
- Native binary runs → You’ve connected front-end to back-end through LLVM’s infrastructure
Project 2: Custom Clang Static Analysis Checker
- File: LLVM_LEARNING_PROJECTS.md
- Programming Language: C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Static Analysis / Compilers
- Software or Tool: Clang LibTooling
- Main Book: “Clean Code” by Robert C. Martin (for understanding what to check)
What you’ll build: A custom linter that analyzes C/C++ source code and reports specific code patterns (e.g., “function too long”, “potential null dereference”, “deprecated API usage”).
Why it teaches LLVM/Clang: Clang’s static analyzer is built on the same infrastructure as the compiler. You’ll navigate the AST (Abstract Syntax Tree) that Clang produces, understand how real compilers represent code internally, and learn the visitor pattern that powers code analysis tools.
Core challenges you’ll face:
- Setting up a Clang tool with `LibTooling` (maps to Clang infrastructure)
- Navigating `RecursiveASTVisitor` to find code patterns (maps to AST structure); see the sketch after this list
- Extracting source location information for diagnostics (maps to source management)
- Handling the complexity of C++ AST nodes (maps to language representation)
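For the visitor challenges above, the sketch below shows the usual LibTooling skeleton: a `RecursiveASTVisitor` that warns on functions longer than 50 lines, wired into an `ASTConsumer`, a `FrontendAction`, and a `ClangTool`. All class names here are invented for illustration, and edge cases (templates, macros, headers) are ignored:

```cpp
// Illustrative LibTooling checker: warn when a function body spans more
// than 50 lines.
#include "clang/AST/ASTConsumer.h"
#include "clang/AST/RecursiveASTVisitor.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Frontend/FrontendAction.h"
#include "clang/Tooling/CommonOptionsParser.h"
#include "clang/Tooling/Tooling.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>

using namespace clang;

class LongFunctionVisitor : public RecursiveASTVisitor<LongFunctionVisitor> {
public:
  explicit LongFunctionVisitor(ASTContext &ctx) : ctx_(ctx) {}

  bool VisitFunctionDecl(FunctionDecl *fd) {
    if (!fd->hasBody()) return true;  // skip pure declarations
    const SourceManager &sm = ctx_.getSourceManager();
    unsigned first = sm.getSpellingLineNumber(fd->getBody()->getBeginLoc());
    unsigned last = sm.getSpellingLineNumber(fd->getBody()->getEndLoc());
    if (last - first > 50)
      llvm::errs() << sm.getFilename(fd->getLocation()) << ": warning: function '"
                   << fd->getNameAsString() << "' has " << (last - first)
                   << " lines (max recommended: 50)\n";
    return true;  // keep traversing the rest of the AST
  }

private:
  ASTContext &ctx_;
};

class LongFunctionConsumer : public ASTConsumer {
public:
  void HandleTranslationUnit(ASTContext &ctx) override {
    LongFunctionVisitor visitor(ctx);
    visitor.TraverseDecl(ctx.getTranslationUnitDecl());
  }
};

class LongFunctionAction : public ASTFrontendAction {
public:
  std::unique_ptr<ASTConsumer> CreateASTConsumer(CompilerInstance &,
                                                 StringRef) override {
    return std::make_unique<LongFunctionConsumer>();
  }
};

static llvm::cl::OptionCategory ToolCategory("long-function-checker");

int main(int argc, const char **argv) {
  auto options = tooling::CommonOptionsParser::create(argc, argv, ToolCategory);
  if (!options) { llvm::errs() << options.takeError(); return 1; }
  tooling::ClangTool tool(options->getCompilations(), options->getSourcePathList());
  return tool.run(tooling::newFrontendActionFactory<LongFunctionAction>().get());
}
```

Build it against the Clang libraries (a CMake project using `find_package(Clang)` is the usual route) and invoke it as `./long-function-checker file.cpp --`; the trailing `--` lets the tool run without a compilation database.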
Resources for key challenges:
- Clang documentation: “LibTooling” and “How to write RecursiveASTVisitor” - Official but essential
- “Clang Tidy: How to write your own checks” - LLVM YouTube - Visual walkthrough of the process
Key Concepts:
- AST Traversal: Clang Internals Manual - llvm.org/docs/ClangInternals
- Visitor Pattern: “Design Patterns” Ch. 5 - Gamma et al.
- C/C++ Grammar Complexity: “The C++ Programming Language” Appendix A - Bjarne Stroustrup
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Solid C++ knowledge; familiarity with the visitor pattern helpful.
Real world outcome: Run your checker on any C/C++ codebase and get output like:
src/utils.cpp:47:1: warning: function 'processData' has 150 lines (max recommended: 50)
src/main.cpp:23:5: warning: calling deprecated API 'oldFunction', use 'newFunction' instead
Found 12 issues in 8 files.
This is exactly how production tools like clang-tidy work.
Learning milestones:
- Tool compiles and runs on source → You understand Clang’s build system and tooling setup
- AST traversal finds patterns → You can navigate compiler data structures
- Actionable warnings emitted → You’ve built a real developer tool
Project 3: LLVM Optimization Pass
- File: LLVM_LEARNING_PROJECTS.md
- Main Programming Language: C++
- Alternative Programming Languages: Rust, C
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 1. The “Resume Gold” (Educational/Personal Brand)
- Difficulty: Level 3: Advanced (The Engineer)
- Knowledge Area: Compilers, Optimization
- Software or Tool: LLVM, Clang
- Main Book: “Engineering a Compiler” - Cooper & Torczon
What you’ll build: A custom optimization pass that transforms LLVM IR to produce faster/smaller code for a specific pattern (e.g., strength reduction, dead code elimination for a custom pattern, or loop unrolling heuristics).
Why it teaches LLVM: Optimization passes are the heart of LLVM’s power. By writing one yourself, you’ll understand how compilers reason about code, what information is available at the IR level, and how transformations must preserve program semantics while improving performance.
Core challenges you’ll face:
- Registering a pass with LLVM’s new PassManager (maps to pass infrastructure)
- Analyzing IR to identify optimization opportunities (maps to IR analysis)
- Safely modifying IR while maintaining correctness (maps to transformation safety)
- Measuring the impact of your optimization (maps to benchmarking)
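For the registration challenge above, the sketch below shows the shape of a new-PassManager plugin. It is deliberately an analysis-only example (it just counts instructions, so nothing is invalidated), and the pass name `count-instrs` plus the file names in the comments are made up:

```cpp
// Illustrative new-PassManager plugin.
#include "llvm/IR/PassManager.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Passes/PassPlugin.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
// Analysis-only pass: prints an instruction count for every function it sees.
struct CountInstrsPass : PassInfoMixin<CountInstrsPass> {
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &) {
    unsigned count = 0;
    for (BasicBlock &BB : F)
      count += BB.size();
    errs() << F.getName() << ": " << count << " IR instructions\n";
    return PreservedAnalyses::all();  // nothing was modified
  }
};
} // namespace

// Entry point so the shared object can be loaded with
//   opt -load-pass-plugin=./MyPass.so -passes=count-instrs input.ll -disable-output
// (loading via `clang -fpass-plugin=...` additionally needs the pass inserted
// through a pipeline callback such as registerOptimizerLastEPCallback).
extern "C" PassPluginLibraryInfo llvmGetPassPluginInfo() {
  return {LLVM_PLUGIN_API_VERSION, "count-instrs", "v0.1",
          [](PassBuilder &PB) {
            PB.registerPipelineParsingCallback(
                [](StringRef name, FunctionPassManager &FPM,
                   ArrayRef<PassBuilder::PipelineElement>) {
                  if (name != "count-instrs")
                    return false;
                  FPM.addPass(CountInstrsPass());
                  return true;
                });
          }};
}
```

Build it as a shared library with `-shared -fPIC` plus the flags from `llvm-config --cxxflags`, then run it through `opt` as shown in the comment. A real transformation would mutate the IR in `run` and return a `PreservedAnalyses` set that reflects what it changed.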
Resources for key challenges:
- “Writing an LLVM Pass” - LLVM official docs (updated for new PassManager)
- “LLVM Code Generation” by Quentin Colombet - Deep dive into LLVM’s optimization pipeline
- “Engineering a Compiler” Ch. 8-10 - Theoretical foundation for optimization
Key Concepts:
- Data-flow Analysis: “Engineering a Compiler” Ch. 9 - Cooper & Torczon
- SSA-based Optimization: “Engineering a Compiler” Ch. 9 - Cooper & Torczon
- LLVM Pass Manager: LLVM Programmer’s Manual - llvm.org/docs
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Completed Project 1 or equivalent LLVM IR familiarity.
Real world outcome: You’ll compile a benchmark program with and without your pass:
$ clang -O2 benchmark.c -o baseline
$ clang -O2 -fpass-plugin=./MyPass.so benchmark.c -o optimized
$ time ./baseline # 2.3s
$ time ./optimized # 1.8s (22% faster!)
You can prove your optimization works with measurable performance improvements.
Learning milestones:
- Pass registered and runs → You understand LLVM’s modular architecture
- IR correctly transformed → You can reason about program semantics at IR level
- Measurable speedup achieved → You’ve done what professional compiler engineers do
Project 4: JIT-Compiled REPL for a Toy Language
- File: LLVM_LEARNING_PROJECTS.md
- Programming Language: C++
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 4: Expert
- Knowledge Area: Virtual Machines / JIT
- Software or Tool: LLVM ORC JIT
- Main Book: “Engineering a Compiler” by Cooper & Torczon
What you’ll build: An interactive Read-Eval-Print Loop where you type expressions, they’re JIT-compiled to native code, executed, and results printed—all in milliseconds.
Why it teaches LLVM: LLVM’s JIT (Just-In-Time) compilation via ORC is what powers Julia’s performance, parts of JavaScript engines, and database query compilation. You’ll understand dynamic code generation and the difference between AOT and JIT compilation.
Core challenges you’ll face:
- Setting up LLVM’s ORC JIT engine (maps to JIT infrastructure)
- Managing symbols and linking at runtime (maps to dynamic linking)
- Handling the JIT compilation lifecycle (maps to execution engine)
- Integrating with the host language (calling C functions from JIT code) (maps to FFI)
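As a starting point for the ORC setup challenge above, here is a minimal sketch that builds one trivial module, hands it to `LLJIT`, and calls the compiled function. The symbol-lookup API has shifted across releases, so treat the `toPtr` call as the LLVM 15+ form and check the ORC docs for your version:

```cpp
// Illustrative LLJIT usage; error handling is collapsed into cantFail.
#include "llvm/ExecutionEngine/Orc/LLJIT.h"
#include "llvm/ExecutionEngine/Orc/ThreadSafeModule.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/TargetSelect.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>

using namespace llvm;
using namespace llvm::orc;

int main() {
  // The JIT emits real machine code, so the native target must be registered.
  InitializeNativeTarget();
  InitializeNativeTargetAsmPrinter();

  // Build a trivial module: i32 answer() { return 42; }
  auto ctx = std::make_unique<LLVMContext>();
  auto mod = std::make_unique<Module>("repl", *ctx);
  IRBuilder<> b(*ctx);
  auto *fn = Function::Create(FunctionType::get(b.getInt32Ty(), false),
                              Function::ExternalLinkage, "answer", *mod);
  b.SetInsertPoint(BasicBlock::Create(*ctx, "entry", fn));
  b.CreateRet(b.getInt32(42));

  // Hand the module to ORC and look up the freshly compiled symbol.
  auto jit = cantFail(LLJITBuilder().create());
  cantFail(jit->addIRModule(ThreadSafeModule(std::move(mod), std::move(ctx))));
  auto addr = cantFail(jit->lookup("answer"));
  int (*answer)() = addr.toPtr<int (*)()>();  // ExecutorAddr::toPtr, LLVM 15+
  outs() << answer() << "\n";                 // prints 42
  return 0;
}
```

A REPL repeats the middle section for every line the user types: each entry becomes a fresh module added to the same `LLJIT` instance, and earlier definitions remain callable because their symbols stay registered with the JIT.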
Resources for key challenges:
- LLVM Kaleidoscope Tutorial Ch. 4 - “Adding JIT and Optimizer Support” - Essential starting point
- “Building a JIT in LLVM” - LLVM official docs for ORC
- “How Julia Uses LLVM” - Various talks by Jameson Nash - Real-world JIT at scale
Key Concepts:
- JIT vs AOT Compilation: “Engineering a Compiler” Ch. 1 - Cooper & Torczon
- Dynamic Symbol Resolution: “Computer Systems: A Programmer’s Perspective” Ch. 7 - Bryant & O’Hallaron
- ORC JIT Architecture: LLVM ORC Design Document - llvm.org/docs
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Completed Project 1; comfortable with C++ and memory management.
Real world outcome:
> 2 + 3 * 4
14
> def add(x, y) x + y
defined function 'add'
> add(10, 20)
30
> def fib(n) if n < 2 then n else fib(n-1) + fib(n-2)
defined function 'fib'
> fib(35)
9227465 (computed in 0.8s with native performance!)
An interactive programming environment with the speed of compiled code.
Learning milestones:
- JIT engine initialized → You understand LLVM’s execution environment
- Code executes on-demand → You’ve bridged compile-time and runtime
- Functions callable across REPL entries → You’ve managed symbol resolution dynamically
Project 5: Source-to-Source Refactoring Tool
- File: LLVM_LEARNING_PROJECTS.md
- Main Programming Language: C++
- Alternative Programming Languages: Python, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool” (Solo-Preneur Potential)
- Difficulty: Level 3: Advanced (The Engineer)
- Knowledge Area: Compilers, Code Transformation
- Software or Tool: LLVM, Clang, libTooling
- Main Book: “Refactoring: Improving the Design of Existing Code” - Martin Fowler
What you’ll build: A tool that automatically transforms C/C++ code—renaming functions, modernizing syntax (e.g., NULL → nullptr), or applying custom transformations across an entire codebase.
Why it teaches LLVM/Clang: This is how real refactoring tools work. You’ll use Clang’s Rewriter to modify source code while preserving formatting, comments, and correctness. This teaches the difference between AST manipulation and source text manipulation.
Core challenges you’ll face:
- Using `ASTMatchers` to find specific code patterns (maps to pattern matching); see the sketch after this list
- Applying source rewrites without breaking code (maps to source preservation)
- Handling macros and preprocessor complexity (maps to preprocessing)
- Processing multiple files with consistent transformations (maps to tooling at scale)
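To make the `ASTMatchers` challenge concrete, here is a minimal sketch that matches calls to a function named `oldFunction` (a hypothetical name) and prints the `Replacement` that would rename each call site; a complete tool would collect these replacements and apply them, for example via `RefactoringTool`:

```cpp
// Illustrative AST-matcher tool; names like "oldFunction" are placeholders.
#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/ASTMatchers/ASTMatchers.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Tooling/CommonOptionsParser.h"
#include "clang/Tooling/Core/Replacement.h"
#include "clang/Tooling/Tooling.h"
#include "llvm/Support/raw_ostream.h"

using namespace clang;
using namespace clang::ast_matchers;

// Declarative query: any call whose callee is a function named "oldFunction".
static const auto DeprecatedCall =
    callExpr(callee(functionDecl(hasName("oldFunction")))).bind("call");

class RenameCallback : public MatchFinder::MatchCallback {
public:
  void run(const MatchFinder::MatchResult &result) override {
    const auto *call = result.Nodes.getNodeAs<CallExpr>("call");
    if (!call) return;
    // Describe the edit: replace the callee's source range with the new name.
    tooling::Replacement repl(
        *result.SourceManager,
        CharSourceRange::getTokenRange(call->getCallee()->getSourceRange()),
        "newFunction");
    llvm::errs() << repl.toString() << "\n";
  }
};

static llvm::cl::OptionCategory ToolCategory("my-modernizer");

int main(int argc, const char **argv) {
  auto options = tooling::CommonOptionsParser::create(argc, argv, ToolCategory);
  if (!options) { llvm::errs() << options.takeError(); return 1; }
  tooling::ClangTool tool(options->getCompilations(), options->getSourcePathList());
  RenameCallback callback;
  MatchFinder finder;
  finder.addMatcher(DeprecatedCall, &callback);
  return tool.run(tooling::newFrontendActionFactory(&finder).get());
}
```

The same skeleton covers other transformations; the NULL-to-nullptr upgrade, for instance, matches pointer-typed uses of the `NULL` macro and rewrites the token to `nullptr`.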
Resources for key challenges:
- “Clang AST Matchers Tutorial” - LLVM docs - DSL for finding code patterns
- “clang-rename” source code - How LLVM’s own tools do it
- “Refactoring: Improving the Design of Existing Code” by Martin Fowler - The “why” behind refactoring
Key Concepts:
- AST Matching: Clang AST Matchers Reference - clang.llvm.org/docs
- Source Rewriting: Clang Rewriter Class Documentation - clang.llvm.org/doxygen
- Safe Refactoring: “Refactoring” Ch. 1-3 - Martin Fowler
Difficulty: Intermediate-Advanced. Time estimate: 2 weeks. Prerequisites: Completed Project 2 or familiarity with the Clang AST.
Real world outcome:
$ ./my-modernizer --transform=nullptr-upgrade ./src/
Processed 47 files:
- Replaced 234 instances of 'NULL' with 'nullptr'
- Replaced 12 instances of '0' (null pointer context) with 'nullptr'
Your tool can process real codebases and produce valid, improved code.
Learning milestones:
- Patterns matched in AST → You can express code queries declaratively
- Single file transformed correctly → You understand source location management
- Codebase-wide transformation works → You’ve built production-quality tooling
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| Calculator → LLVM IR | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Custom Static Checker | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Optimization Pass | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| JIT REPL | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Refactoring Tool | Intermediate-Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
Recommendation
Start with Project 1 (Calculator → LLVM IR). Here’s why:
- It’s the canonical path - Almost everyone who learns LLVM starts here, which means the most resources and community help exist
- Tangible output immediately - You’ll have a working compiler that produces executables within days
- Foundation for everything else - Projects 3 and 4 directly build on the IR knowledge you’ll gain
If you’re already comfortable with compiler basics and want something immediately practical, Project 2 (Static Checker) is excellent—you’ll have a useful tool within a week that you can run on real codebases.
Final Capstone Project: A Complete Programming Language
What you’ll build: A full programming language implementation with:
- Custom syntax (your language design)
- Type checking
- Multiple optimization passes
- Both AOT and JIT compilation modes
- Integration with C libraries (FFI)
- A standard library with basic I/O
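One bullet above, integration with C libraries (FFI), is mostly a code-generation concern: your compiler declares the C function in its module and emits an ordinary call, and the system linker (or the JIT's search of the host process) supplies the definition. A minimal sketch, assuming an `IRBuilder` already positioned inside a function:

```cpp
// Illustrative helper: emit the equivalent of puts("Hello, World!").
// `puts` is only declared here; libc provides the definition at link time.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

void emitHello(llvm::Module &mod, llvm::IRBuilder<> &b) {
  auto *putsType = llvm::FunctionType::get(
      b.getInt32Ty(),
      {llvm::PointerType::getUnqual(b.getInt8Ty())},  // i32 puts(i8*)
      /*isVarArg=*/false);
  llvm::FunctionCallee putsFn = mod.getOrInsertFunction("puts", putsType);
  llvm::Value *text = b.CreateGlobalStringPtr("Hello, World!");
  b.CreateCall(putsFn, {text});
}
```

Your language's built-in `print` can be a thin wrapper around a call like this. With opaque pointers (the default in recent LLVM) the parameter type prints as `ptr` rather than `i8*`, but the builder calls are unchanged.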
Why it’s the ultimate LLVM project: This synthesizes everything: front-end design, IR generation, optimization, JIT, and tooling. Languages like Rust, Swift, and Julia are built on LLVM—you’ll understand how.
Core challenges you’ll face:
- Designing a coherent type system (maps to type theory basics)
- Implementing semantic analysis (maps to compiler middle-end)
- Building a debug info generator for GDB/LLDB support (maps to DWARF format)
- Creating a build system and package format (maps to language ecosystem)
- Writing meaningful error messages (maps to UX of compilers)
Resources for key challenges:
- “Crafting Interpreters” by Bob Nystrom - The best book on language implementation, period
- “Writing a C Compiler” by Nora Sandler - Practical C-to-assembly, adaptable to C-to-LLVM
- “Types and Programming Languages” by Benjamin Pierce - If you want rigorous type system knowledge
- “Language Implementation Patterns” by Terence Parr - Patterns you’ll use repeatedly
Key Concepts:
- Language Design: “Crafting Interpreters” full book - Bob Nystrom
- Type Systems: “Types and Programming Languages” Ch. 1-11 - Benjamin Pierce
- Error Recovery: “Engineering a Compiler” Ch. 3 - Cooper & Torczon
- Debug Information: DWARF Debugging Standard - dwarfstd.org
Difficulty: Advanced. Time estimate: 1-3 months. Prerequisites: Completed Project 1 and either Project 3 or 4.
Real world outcome:
$ cat hello.mylang
fn main() {
let name = "World";
print("Hello, " + name + "!");
}
$ mylang build hello.mylang -o hello
$ ./hello
Hello, World!
$ mylang run hello.mylang # JIT mode
Hello, World!
$ mylang check hello.mylang # Type checking only
✓ No errors found
You’ll have created a language that others can actually use to write programs.
Learning milestones:
- Parser and type checker complete → You’ve built a language front-end
- Compiled programs run correctly → Your code generation works
- JIT mode with acceptable latency → You’ve mastered LLVM’s execution engine
- Someone else writes a program in your language → You’ve created something real
Getting Started Checklist
Before diving in, ensure you have:
- LLVM/Clang installed (version 15+ recommended): `brew install llvm` or build from source
- CMake familiarity (LLVM uses CMake extensively)
- C++17 comfort (LLVM's codebase uses modern C++)
- `llvm-config` in your PATH (for linking)
- A test C file to examine: run `clang -emit-llvm -S test.c -o test.ll` and read the IR
The LLVM documentation is notoriously dense but comprehensive—expect to reference it constantly. The Kaleidoscope tutorial is your friend for the first project.