Project 4: Simple JIT Compiler for a Bytecode VM
Build a Just-In-Time compiler that dynamically translates bytecode to native x86-64 machine code at runtime, achieving dramatic speedups for hot code paths. This is the same technique that powers QEMU’s TCG, the JVM, V8, and LuaJIT.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 3-4 weeks |
| Language | C |
| Alternative Languages | C++, Rust |
| Prerequisites | Projects 1-3 (emulation basics), x86-64 assembly, understanding of stack machines |
| Key Topics | JIT compilation, dynamic code generation, executable memory (mmap), hot path detection, code cache management |
| Coolness Level | Level 5: Pure Magic (Super Cool) |
| Portfolio Value | Resume Gold |
| Main Book | “Language Implementation Patterns” by Terence Parr |
1. Learning Objectives
By completing this project, you will:
- Understand JIT compilation fundamentals: Explain the difference between interpretation and JIT compilation, and when each approach is appropriate
- Master runtime code generation: Write C code that generates executable x86-64 machine code at runtime using mmap with PROT_EXEC
- Implement hot path detection: Build a profiling mechanism that identifies frequently-executed code worth compiling
- Design a code cache: Manage compiled code fragments with proper invalidation and memory limits
- Handle stack-to-register mapping: Transform stack-based bytecode operations into efficient register-based machine code
- Implement interpreter/JIT switching: Seamlessly transition between interpreted and compiled execution
- Build the foundation for understanding QEMU’s TCG: this project uses the same core technique as QEMU’s Tiny Code Generator for cross-architecture emulation
2. Theoretical Foundation
2.1 What is JIT Compilation?
A Just-In-Time (JIT) compiler translates code at runtime rather than ahead-of-time. Unlike traditional compilers that produce executables before the program runs, a JIT compiles code while the program executes.
COMPILATION STRATEGIES COMPARISON
Ahead-of-Time (AOT) Compilation:
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ Source Code │───►│ Compiler │───►│ Executable │
│ (foo.c) │ │ (gcc/clang)│ │ (foo.exe) │
└─────────────┘ └──────────────┘ └─────────────┘
│
Compile Time
│
─────────────────────────────────────────────────────
│
Run Time
│
▼
┌─────────────────────┐
│ Native Execution │
│ (Already compiled)│
└─────────────────────┘
Just-In-Time (JIT) Compilation:
┌─────────────┐ ┌──────────────┐
│ Source Code │───►│ Bytecode │
│ (App.java)│ │ Compiler │
└─────────────┘ └──────┬───────┘
│
Compile Time (AOT to bytecode)
│
▼
┌─────────────────────┐
│ Bytecode │
│ (App.class) │
└─────────────────────┘
─────────────────────────────────────────────────────
│
Run Time
│
▼
┌─────────────────────────────────────────────────────┐
│ JVM / VM Runtime │
│ │
│ ┌───────────────────┐ ┌────────────────────┐ │
│ │ Interpreter │ │ JIT Compiler │ │
│ │ (Cold code path) │ │ (Hot code path) │ │
│ └─────────┬─────────┘ └──────────┬─────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────┐ ┌────────────────────┐ │
│ │ Execute slowly │ │ Compile to native │ │
│ │ (but starts │ │ code, then execute │ │
│ │ immediately) │ │ at full speed │ │
│ └────────────────┘ └────────────────────┘ │
└─────────────────────────────────────────────────────┘
2.2 Why JIT Compilation?
The key insight is that not all code is equally important. Most programs spend 90% of their time in 10% of the code (hot paths). JIT compilation optimizes this by:
- Start Fast: Begin interpreting immediately - no compilation delay
- Profile: Identify which code runs frequently (hot paths)
- Compile Hot Paths: Convert frequently-executed bytecode to native code
- Run Fast: Execute compiled code at near-native speed
EXECUTION TIME DISTRIBUTION
Program Execution:
────────────────────────────────────────────────────────────────────
Time →
┌──────────┬───────────────────────────────────────────────────────┐
│Startup │ Main Execution │
│ (10%) │ (90%) │
└──────────┴───────────────────────────────────────────────────────┘
Code Path Distribution:
┌─────────────────────────────────────────────────┐
│ │
│ Hot Paths (10% of code) │
│ ┌─────────────────────────────────────────┐ │
│ │ ████████████████████████████████████ │ │ 90% of time
│ └─────────────────────────────────────────┘ │
│ │
│ Cold Paths (90% of code) │
│ ┌─────────────────────────────────────────┐ │
│ │ ████ │ │ 10% of time
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────┘
JIT Strategy: Only compile the hot paths (10% of code, 90% of time)
- Interpret cold paths (acceptable speed for rare code)
- Maximum benefit with minimal compilation overhead
2.3 How QEMU’s TCG Uses JIT
QEMU (Quick EMUlator) uses its Tiny Code Generator (TCG) to translate guest architecture instructions to host architecture code at runtime. This is the same concept we’re implementing:
QEMU TCG Architecture
Guest Code (ARM) TCG (Intermediate) Host Code (x86-64)
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ ADD R0, R1, R2 │ │ load_i32 t0, r1 │ │ mov eax, [rbx+4]│
│ SUB R0, R0, #1 │ ───► │ load_i32 t1, r2 │ ───► │ mov ecx, [rbx+8]│
│ BNE loop │ │ add_i32 t2,t0,t1│ │ add eax, ecx │
└─────────────────┘ │ store_i32 r0,t2 │ │ sub eax, 1 │
│ ... │ │ jne loop │
└─────────────────┘ └─────────────────┘
OUR PROJECT: Simplified Version
Bytecode (Stack-based) Direct Translation Host Code (x86-64)
┌─────────────────┐ ┌─────────────────┐
│ PUSH 10 │ │ push 10 │
│ PUSH 20 │ No intermediate! │ push 20 │
│ ADD │ ─────────────────► │ pop rcx │
│ STORE 0 │ (Direct bytecode │ pop rax │
│ JNZ -4 │ to x86-64) │ add rax, rcx │
└─────────────────┘ │ mov [vars], rax │
│ test rax, rax │
│ jnz loop │
└─────────────────┘
2.4 The Stack Machine Model
Our bytecode VM uses a stack-based architecture (like the JVM, Python VM, and WebAssembly). Understanding this is crucial:
STACK MACHINE vs REGISTER MACHINE
Stack Machine (Our Bytecode): Register Machine (x86-64):
┌──────────────────────────────┐ ┌──────────────────────────────┐
│ │ │ │
│ a + b * c │ │ a + b * c │
│ │ │ │
│ Bytecode: │ │ x86-64: │
│ PUSH a ; [a] │ │ mov rax, [a] │
│ PUSH b ; [a, b] │ │ mov rcx, [b] │
│ PUSH c ; [a, b, c] │ │ mov rdx, [c] │
│ MUL ; [a, b*c] │ │ imul rcx, rdx ; rcx=b*c │
│ ADD ; [a+b*c] │ │ add rax, rcx ; result │
│ │ │ │
│ Operands implicit (stack) │ │ Operands explicit (regs) │
│ Simple bytecode format │ │ Complex instruction encoding│
│ Easy to compile from source │ │ Hard to compile from source │
│ Slow to interpret │ │ Fast to execute │
│ │ │ │
└──────────────────────────────┘ └──────────────────────────────┘
JIT CHALLENGE: Convert stack operations to register operations
2.5 Executable Memory: The mmap Magic
The key to JIT compilation is creating memory that the CPU can execute. This requires special permissions:
MEMORY PERMISSIONS
Normal heap allocation (malloc):
┌─────────────────────────────────────────────────────┐
│ void* p = malloc(4096); │
│ Permissions: READ + WRITE │
│ Cannot execute code here - CPU will fault! │
└─────────────────────────────────────────────────────┘
Executable allocation (mmap):
┌─────────────────────────────────────────────────────┐
│ void* code = mmap(NULL, 4096, │
│ PROT_READ | PROT_WRITE | PROT_EXEC│
│ MAP_PRIVATE | MAP_ANONYMOUS, │
│ -1, 0); │
│ │
│ Permissions: READ + WRITE + EXECUTE │
│ CAN execute code here - CPU will run it! │
└─────────────────────────────────────────────────────┘
SECURITY NOTE:
- W^X (Write XOR Execute) is a security principle
- Many systems restrict RWX memory
- Production JITs: write code, then mprotect() to RX
2.6 Hot Path Detection
Not every piece of code deserves JIT compilation. We need to identify hot paths:
HOT PATH DETECTION STRATEGIES
Strategy 1: Invocation Counting (Method-based JIT)
┌─────────────────────────────────────────────────────┐
│ function_call_count[func_id]++; │
│ if (function_call_count[func_id] > THRESHOLD) { │
│ compile_function(func_id); │
│ } │
│ │
│ Used by: JVM (HotSpot), V8 (initial tier) │
└─────────────────────────────────────────────────────┘
Strategy 2: Loop Detection (Trace-based JIT)
┌─────────────────────────────────────────────────────┐
│ // Backward branch = potential loop │
│ if (branch_target < current_pc) { │
│ backward_branch_count[current_pc]++; │
│ if (backward_branch_count[current_pc] > THRESH) │
│ compile_trace_from(branch_target); │
│ } │
│ │
│ Used by: LuaJIT, TraceMonkey (old Firefox) │
└─────────────────────────────────────────────────────┘
OUR PROJECT: Uses Strategy 2 (Loop Detection)
- Simple to implement
- Highly effective (loops are where programs spend time)
- Threshold: 1000 backward branches before compilation
2.7 The JIT Compilation Pipeline
JIT COMPILATION PIPELINE
┌─────────────────────────────────────────────────────────────────┐
│ JIT COMPILER │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. HOT PATH DETECTION │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Interpreter tracks execution counts │ │
│ │ Backward branch counter > 1000? → Trigger JIT │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 2. TRACE RECORDING │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Record bytecode sequence from loop start │ │
│ │ Stop at: loop end, function call, or threshold │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 3. CODE GENERATION │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ For each bytecode instruction: │ │
│ │ - Map stack operations to registers │ │
│ │ - Emit corresponding x86-64 bytes │ │
│ │ - Handle control flow (branches, loops) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 4. CODE INSTALLATION │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ mmap() executable memory │ │
│ │ Copy generated code │ │
│ │ Update dispatch table: bytecode_addr → native_addr │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 5. EXECUTION │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Next time interpreter hits this address: │ │
│ │ → Jump to native code instead of interpreting │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
2.8 Why This Matters for Virtualization
JIT compilation is the heart of software-based virtualization:
| System | JIT Technique | Purpose |
|---|---|---|
| QEMU TCG | Binary translation | Run ARM/MIPS/RISC-V code on x86 |
| Rosetta 2 | Binary translation | Run x86 apps on Apple Silicon |
| JVM HotSpot | Method JIT | Run Java bytecode at native speed |
| V8/SpiderMonkey | Tiered JIT | Run JavaScript at native speed |
| LuaJIT | Trace JIT | Run Lua at C-like speed |
| WebAssembly | AOT + JIT | Run web apps at near-native speed |
| eBPF | Verification + JIT | Run safe code in Linux kernel |
Understanding JIT compilation means understanding how ALL of these systems work internally.
2.9 Common Misconceptions
Misconception 1: “JIT is always faster than interpretation”
- Reality: JIT has compilation overhead. For code that runs once, interpretation is faster. JIT only wins for frequently-executed code.
Misconception 2: “JIT generates optimal code”
- Reality: JIT must balance compilation time vs. code quality. It generates “good enough” code quickly, not optimal code slowly.
Misconception 3: “Writing a JIT is extremely complex”
- Reality: A basic JIT (like this project) is ~1000 lines. Production JITs are complex because of optimizations, not the core concept.
Misconception 4: “You need an IR (Intermediate Representation)”
- Reality: QEMU’s TCG uses an IR, but simple JITs can translate directly. We’ll translate bytecode directly to x86-64.
Misconception 5: “JIT code must be position-independent”
- Reality: We know exactly where our code will be. Absolute addresses are fine (and simpler).
3. Project Specification
3.1 What You Will Build
A complete JIT compiler system that:
- Interprets a stack-based bytecode VM with 15 instructions
- Detects hot loops using backward branch counting
- JIT compiles hot paths to native x86-64 machine code
- Manages a code cache with memory limits
- Seamlessly switches between interpreted and JIT-compiled execution
- Achieves 10-50x speedup on computational loops
3.2 Bytecode Instruction Set
Your VM will support these 15 instructions:
| Opcode | Instruction | Stack Effect | Description |
|---|---|---|---|
| 0x01 | PUSH n | [] -> [n] | Push immediate value onto stack |
| 0x02 | POP | [a] -> [] | Discard top of stack |
| 0x03 | DUP | [a] -> [a, a] | Duplicate top of stack |
| 0x10 | ADD | [a, b] -> [a+b] | Add top two values |
| 0x11 | SUB | [a, b] -> [a-b] | Subtract (second - top) |
| 0x12 | MUL | [a, b] -> [a*b] | Multiply top two values |
| 0x13 | DIV | [a, b] -> [a/b] | Integer divide (second / top) |
| 0x20 | LOAD addr | [] -> [mem[addr]] | Load from memory |
| 0x21 | STORE addr | [a] -> [] | Store to memory |
| 0x30 | JMP offset | [] -> [] | Unconditional jump |
| 0x31 | JZ offset | [a] -> [] | Jump if zero |
| 0x32 | JNZ offset | [a] -> [] | Jump if not zero |
| 0x40 | CALL addr | [] -> [] | Call subroutine |
| 0x41 | RET | [] -> [] | Return from subroutine |
| 0xFF | HALT | [] -> [] | Stop execution |
3.3 Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR1 | Interpreter correctly executes all 15 bytecode instructions | Must Have |
| FR2 | Hot path detection triggers after 1000 backward branches | Must Have |
| FR3 | JIT compiles arithmetic operations (ADD, SUB, MUL, DIV) | Must Have |
| FR4 | JIT compiles control flow (JMP, JZ, JNZ) | Must Have |
| FR5 | JIT compiles memory operations (LOAD, STORE, PUSH, POP) | Must Have |
| FR6 | Code cache limits memory usage to configurable size | Should Have |
| FR7 | Deoptimization falls back to interpreter gracefully | Should Have |
| FR8 | Performance logging shows compilation events | Should Have |
| FR9 | CALL/RET handled (interpret or compile) | Could Have |
| FR10 | Multiple hot regions can be compiled independently | Could Have |
3.4 Non-Functional Requirements
| Requirement | Target | Measurement |
|---|---|---|
| Interpreter speed | > 10M bytecode ops/sec | Benchmark tight loop |
| JIT speedup | > 10x over interpreter | Compare Mandelbrot times |
| Compilation time | < 1ms for typical loop | Measure with clock_gettime |
| Code cache size | Configurable, default 64KB | Track allocations |
| Memory overhead | < 100 bytes per cached region | Profile memory usage |
3.5 Example Usage / Output
$ ./jitvm mandelbrot.bc
[JIT] Detected hot loop at bytecode offset 0x42
[JIT] Compiling 23 bytecode ops to 156 bytes of x86-64
[JIT] Code cached at 0x7f4a12340000
++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++ +++++++++++++++++++
++++++++++++++++++++ +++++++++++++++++
+++++++++++++++++++ ++++++++++++++++
++++++++++++++++++ +++++++++++++++
++++++++++++++ ++ +++++++++++++
+++++++++++ +++++++++++
+++++++++++ +++++++++++
++++++++++++++ ++ +++++++++++++
++++++++++++++++++ +++++++++++++++
+++++++++++++++++++ ++++++++++++++++
++++++++++++++++++++ +++++++++++++++++
++++++++++++++++++++ +++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++
Execution time: 0.089s (JIT enabled)
Execution time: 4.231s (interpreter only)
Speedup: 47.5x
3.6 Real World Outcome
When you complete this project:
- You will have built a working JIT compiler - actual native code generation
- You will understand QEMU’s TCG at a fundamental level
- You will know how V8, JVM HotSpot, and LuaJIT work internally
- You will have resume-worthy evidence of advanced systems programming skills
- You will be able to explain JIT compilation in technical interviews
- You will have built the foundation for hardware-assisted virtualization projects
4. Solution Architecture
4.1 High-Level Design
┌──────────────────────────────────────────────────────────────────────────┐
│ JIT VM ARCHITECTURE │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ BYTECODE LOADER │ │
│ │ Load .bc file → Parse header → Validate → Store in code[] │ │
│ └───────────────────────────────────┬─────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐│
│ │ EXECUTION ENGINE ││
│ │ ││
│ │ ┌─────────────────────────────────────────────────────────────────┐ ││
│ │ │ DISPATCH LOOP │ ││
│ │ │ │ ││
│ │ │ while (running) { │ ││
│ │ │ if (jit_code[pc] != NULL) { │ ││
│ │ │ // Execute JIT-compiled native code │ ││
│ │ │ jit_code[pc](&vm_state); │ ││
│ │ │ } else { │ ││
│ │ │ // Interpret bytecode │ ││
│ │ │ interpret_one(&vm_state); │ ││
│ │ │ check_hot_path(pc); │ ││
│ │ │ } │ ││
│ │ │ } │ ││
│ │ │ │ ││
│ │ └─────────────────────────────────────────────────────────────────┘ ││
│ │ │ │ ││
│ │ ┌────────────┘ └────────────┐ ││
│ │ ▼ ▼ ││
│ │ ┌──────────────────────────┐ ┌──────────────────────────┐ ││
│ │ │ INTERPRETER │ │ JIT COMPILER │ ││
│ │ │ │ │ │ ││
│ │ │ switch (opcode) { │ │ 1. Trace recording │ ││
│ │ │ case PUSH: ... │ │ 2. Bytecode → x86-64 │ ││
│ │ │ case ADD: ... │ │ 3. Emit to code cache │ ││
│ │ │ case JNZ: ... │ │ 4. Update dispatch │ ││
│ │ │ } │ │ │ ││
│ │ │ │ │ │ ││
│ │ └──────────────────────────┘ └──────────────────────────┘ ││
│ │ │ ││
│ │ ▼ ││
│ │ ┌──────────────────────────┐ ││
│ │ │ CODE CACHE │ ││
│ │ │ │ ││
│ │ │ mmap() RWX memory │ ││
│ │ │ [region1][region2]... │ ││
│ │ │ Max 64KB, LRU eviction │ ││
│ │ │ │ ││
│ │ └──────────────────────────┘ ││
│ │ ││
│ └───────────────────────────────────────────────────────────────────────┘│
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐│
│ │ VM STATE ││
│ │ ││
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ││
│ │ │ Stack │ │ Memory │ │ Registers │ ││
│ │ │ [0..255] │ │ [0..4095] │ │ PC, SP, FP │ ││
│ │ └──────────────┘ └──────────────┘ └──────────────┘ ││
│ │ ││
│ └───────────────────────────────────────────────────────────────────────┘│
│ │
└───────────────────────────────────────────────────────────────────────────┘
4.2 Key Components
| Component | Responsibility | Performance Requirement |
|---|---|---|
| Bytecode Loader | Parse and validate bytecode file | O(n) for n bytes |
| Interpreter | Execute bytecode instructions | > 10M ops/sec |
| Hot Path Detector | Track backward branch counts | O(1) per branch |
| JIT Compiler | Translate bytecode to x86-64 | < 1ms per region |
| Code Emitter | Write x86-64 machine code bytes | O(n) for n instructions |
| Code Cache | Store and manage compiled code | O(1) lookup |
| Dispatch Table | Map bytecode address to native | O(1) lookup |
4.3 Data Structures
// VM State - passed to JIT-compiled code
typedef struct {
int64_t stack[256]; // Operand stack
int stack_top; // Stack pointer
int64_t memory[4096]; // VM memory (variables)
uint8_t* code; // Bytecode array
int pc; // Program counter
int running; // Execution flag
} VMState;
// Hot path tracking
typedef struct {
int execution_count; // Number of times executed
int bytecode_start; // Starting PC of region
int bytecode_end; // Ending PC of region
void* native_code; // Pointer to compiled code
size_t code_size; // Size of compiled code
} CompiledRegion;
// Code cache management
typedef struct {
void* memory; // mmap'd executable memory
size_t capacity; // Total size
size_t used; // Bytes used
CompiledRegion regions[256];
int num_regions;
} CodeCache;
// Dispatch table
typedef void (*JitFunc)(VMState*);
JitFunc dispatch[65536]; // Index by bytecode PC
4.4 x86-64 Code Generation
The JIT compiler translates each bytecode instruction to x86-64. Here’s the mapping:
BYTECODE TO x86-64 TRANSLATION
PUSH immediate:
Bytecode: [0x01] [value:8]
x86-64: mov rax, <value> ; 48 B8 <8 bytes>
push rax ; 50
POP:
Bytecode: [0x02]
x86-64: add rsp, 8 ; 48 83 C4 08
ADD:
Bytecode: [0x10]
x86-64: pop rcx ; 59
pop rax ; 58
add rax, rcx ; 48 01 C8
push rax ; 50
SUB:
Bytecode: [0x11]
x86-64: pop rcx ; 59
pop rax ; 58
sub rax, rcx ; 48 29 C8
push rax ; 50
MUL:
Bytecode: [0x12]
x86-64: pop rcx ; 59
pop rax ; 58
imul rax, rcx ; 48 0F AF C1
push rax ; 50
LOAD addr:
Bytecode: [0x20] [addr:2]
x86-64: mov rax, [rdi + offset] ; 48 8B 87 <offset>
push rax ; 50
(where rdi points to vm->memory, offset = addr * 8)
STORE addr:
Bytecode: [0x21] [addr:2]
x86-64: pop rax ; 58
mov [rdi + offset], rax ; 48 89 87 <offset>
JNZ offset (backward branch - loop):
Bytecode: [0x32] [offset:2]
x86-64: pop rax ; 58
test rax, rax ; 48 85 C0
jnz <relative_addr> ; 0F 85 <4 bytes>
4.5 Algorithm Overview
ALGORITHM: JIT Compilation Pipeline
function interpret_with_jit(vm):
while vm.running:
pc = vm.pc
// Check if JIT code exists for this address
if dispatch[pc] != NULL:
dispatch[pc](&vm)
continue
// Interpret one instruction
opcode = vm.code[pc]
switch opcode:
case PUSH: ...
case ADD: ...
case JNZ:
// HOT PATH DETECTION
target = vm.pc + offset
if target < pc: // Backward branch (loop)
branch_count[pc]++
if branch_count[pc] > JIT_THRESHOLD:
compile_region(target, pc)
function compile_region(start_pc, end_pc):
// 1. Allocate code buffer
buffer = alloc_executable(4096)
emit_ptr = buffer
// 2. Emit prologue
emit_prologue(&emit_ptr)
// 3. Translate each bytecode instruction
for pc = start_pc to end_pc:
opcode = bytecode[pc]
switch opcode:
case PUSH:
emit_push_immediate(&emit_ptr, get_immediate(pc))
case ADD:
emit_add(&emit_ptr)
case JNZ:
emit_conditional_jump(&emit_ptr, target)
...
// 4. Emit epilogue
emit_epilogue(&emit_ptr)
// 5. Install in dispatch table
dispatch[start_pc] = buffer
// 6. Log compilation
log("[JIT] Compiled %d bytes at %p", emit_ptr - buffer, buffer)
5. Implementation Guide
5.1 Development Environment Setup
# Install required tools
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install gcc gdb nasm bsdmainutils # bsdmainutils provides hexdump
# macOS
brew install gcc nasm
# Verify installation
gcc --version # Need GCC 7+ for C11
nasm --version # For understanding x86 encoding
gdb --version # For debugging
# Create project structure
mkdir -p jitvm/{src,include,tests,bytecode}
cd jitvm
5.2 Project Structure
jitvm/
├── Makefile
├── src/
│ ├── main.c # Entry point, argument parsing
│ ├── vm.c # VM state management
│ ├── interpreter.c # Bytecode interpreter
│ ├── jit.c # JIT compiler core
│ ├── codegen.c # x86-64 code generation
│ ├── codecache.c # Code cache management
│ └── loader.c # Bytecode loader
├── include/
│ ├── vm.h # VM structures
│ ├── opcodes.h # Bytecode definitions
│ ├── jit.h # JIT interfaces
│ └── codegen.h # Code generation helpers
├── tests/
│ ├── test_interpreter.c # Unit tests
│ └── test_jit.c # JIT tests
├── bytecode/
│ ├── simple.bc # Simple test program
│ ├── fibonacci.bc # Fibonacci benchmark
│ └── mandelbrot.bc # Mandelbrot set renderer
└── tools/
└── assembler.py # Simple bytecode assembler
5.3 The Core Question You’re Answering
“How do you generate executable machine code at runtime, and when is it worth the compilation overhead?”
This question captures the essence of JIT compilation:
- Runtime code generation: Using mmap with PROT_EXEC to create executable memory, then writing valid x86-64 instruction bytes
- When it’s worth it: Only for hot paths - code that executes frequently enough that compilation cost is amortized
This is the same question answered by every JIT system from the JVM to QEMU’s TCG.
5.4 Concepts You Must Understand First
Before starting implementation, verify your understanding:
| Concept | Self-Assessment Question | Where to Learn |
|---|---|---|
| Stack machines | How does PUSH 5; PUSH 3; ADD leave the stack? | “Language Implementation Patterns” Ch. 5 |
| x86-64 registers | What’s the difference between RAX, EAX, AX, AL? | Intel SDM Vol. 1, Ch. 3 |
| x86-64 encoding | What bytes encode mov rax, 0x1234567890ABCDEF? | Intel SDM Vol. 2 |
| Memory mapping | What does mmap with PROT_EXEC do? | “TLPI” Ch. 49 |
| Calling conventions | How are arguments passed in x86-64 Linux/macOS? | System V ABI |
| Branch prediction | Why are backward branches usually predicted taken? | CS:APP Ch. 5 |
5.5 Questions to Guide Your Design
Interpreter Design:
- How will you represent the bytecode instruction set?
- What data structure will hold the VM stack?
- How will you handle control flow (jumps, branches)?
Hot Path Detection:
- What threshold triggers compilation?
- How will you identify loop boundaries?
- What data structure tracks execution counts?
Code Generation:
- How will you map VM stack operations to x86-64 registers?
- How will you handle memory operands (LOAD/STORE)?
- How will you encode conditional branches?
Code Cache:
- How will you allocate executable memory?
- What’s your eviction policy when cache is full?
- How will you invalidate stale code (if needed)?
Integration:
- How does the interpreter transfer control to JIT code?
- How does JIT code return control to the interpreter?
- How do you handle operations JIT doesn’t support?
5.6 Thinking Exercise
Before writing any code, trace through this bytecode execution:
Program: Sum 1 to 100
Address Bytecode Stack (after) Memory
0x00 PUSH 0 [0]
0x02 STORE 0 [] M[0]=0 (sum)
0x05 PUSH 1 [1]
0x07 STORE 1 [] M[1]=1 (counter)
0x0A LOAD 0 [0] <- LOOP START
0x0D LOAD 1 [0, 1]
0x10 ADD [1]
0x11 STORE 0 [] M[0]=1
0x14 LOAD 1 [1]
0x17 PUSH 1 [1, 1]
0x19 ADD [2]
0x1A DUP [2, 2]
0x1B STORE 1 [2] M[1]=2
0x1E PUSH 100 [2, 100]
0x21 SUB [-98]
0x22 JNZ -0x18 [] Jump to 0x0A if != 0
0x25 LOAD 0 [5050]
0x28 HALT
Questions to answer:
- How many times does the loop execute?
- Which bytecode addresses constitute the “hot region”?
- What x86-64 instructions would you emit for
LOAD 0; LOAD 1; ADD; STORE 0? - How would you encode
JNZ -0x18in x86-64?
5.7 Hints in Layers
Use these progressive hints only when stuck.
Hint 1: Starting with the Interpreter
Start with a working interpreter before adding JIT:
void interpret(VMState* vm) {
while (vm->running) {
uint8_t opcode = vm->code[vm->pc++];
switch (opcode) {
case OP_PUSH: {
int64_t value = read_i64(vm); // Read 8-byte immediate
vm->stack[vm->stack_top++] = value;
break;
}
case OP_ADD: {
int64_t b = vm->stack[--vm->stack_top];
int64_t a = vm->stack[--vm->stack_top];
vm->stack[vm->stack_top++] = a + b;
break;
}
case OP_JNZ: {
int16_t offset = read_i16(vm); // Read 2-byte offset
int64_t cond = vm->stack[--vm->stack_top];
if (cond != 0) {
vm->pc += offset - 3; // -3 for instruction size
}
break;
}
case OP_HALT:
vm->running = 0;
break;
// ... other cases
}
}
}
Hint 2: Hot Path Detection
Track backward branches to find loops:
#define JIT_THRESHOLD 1000
int branch_counts[65536] = {0}; // Per-PC branch counter
void check_hot_path(VMState* vm, int target_pc) {
int current_pc = vm->pc;
// Only track backward branches (loops)
if (target_pc < current_pc) {
int loop_header = target_pc;
branch_counts[loop_header]++;
if (branch_counts[loop_header] >= JIT_THRESHOLD) {
printf("[JIT] Hot loop detected at 0x%04X\n", loop_header);
compile_region(vm, loop_header, current_pc);
branch_counts[loop_header] = 0; // Reset counter
}
}
}
Hint 3: Allocating Executable Memory
Use mmap to create RWX memory:
#include <sys/mman.h>
void* alloc_executable(size_t size) {
void* mem = mmap(
NULL, // Let kernel choose address
size, // Size in bytes
PROT_READ | PROT_WRITE | PROT_EXEC, // RWX permissions
MAP_PRIVATE | MAP_ANONYMOUS, // Private, not file-backed
-1, // No file descriptor
0 // Offset
);
if (mem == MAP_FAILED) {
perror("mmap");
return NULL;
}
return mem;
}
void free_executable(void* mem, size_t size) {
munmap(mem, size);
}
Security Note: On some systems (macOS with SIP, SELinux), you may not be able to map RWX memory directly. Use the W^X-friendly two-step approach:
- Write the code first with PROT_READ | PROT_WRITE
- Then mprotect() the region to PROT_READ | PROT_EXEC
// Safer two-step approach
void* mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// ... write code to mem ...
mprotect(mem, size, PROT_READ | PROT_EXEC);
Hint 4: Code Generation Basics
Emit x86-64 bytes directly to a buffer:
typedef struct {
uint8_t* buffer;
uint8_t* ptr;
size_t size;
} CodeBuffer;
// Emit single byte
void emit8(CodeBuffer* cb, uint8_t b) {
*cb->ptr++ = b;
}
// Emit 32-bit value (little-endian)
void emit32(CodeBuffer* cb, uint32_t v) {
*cb->ptr++ = v & 0xFF;
*cb->ptr++ = (v >> 8) & 0xFF;
*cb->ptr++ = (v >> 16) & 0xFF;
*cb->ptr++ = (v >> 24) & 0xFF;
}
// Emit 64-bit value (little-endian)
void emit64(CodeBuffer* cb, uint64_t v) {
emit32(cb, v & 0xFFFFFFFF);
emit32(cb, v >> 32);
}
// Example: mov rax, immediate64
void emit_mov_rax_imm64(CodeBuffer* cb, int64_t value) {
emit8(cb, 0x48); // REX.W prefix
emit8(cb, 0xB8); // mov rax, imm64
emit64(cb, value);
}
// Example: push rax
void emit_push_rax(CodeBuffer* cb) {
emit8(cb, 0x50);
}
// Example: pop rax
void emit_pop_rax(CodeBuffer* cb) {
emit8(cb, 0x58);
}
// Example: pop rcx
void emit_pop_rcx(CodeBuffer* cb) {
emit8(cb, 0x59);
}
// Example: add rax, rcx
void emit_add_rax_rcx(CodeBuffer* cb) {
emit8(cb, 0x48); // REX.W
emit8(cb, 0x01); // add r/m64, r64
emit8(cb, 0xC8); // ModRM: rcx -> rax
}
Hint 5: Compiling a Simple Operation
Translate bytecode ADD to x86-64:
// Bytecode: ADD
// Stack: [a, b] -> [a + b]
// x86-64: pop rcx; pop rax; add rax, rcx; push rax
void compile_add(CodeBuffer* cb) {
// pop rcx (second operand)
emit8(cb, 0x59);
// pop rax (first operand)
emit8(cb, 0x58);
// add rax, rcx
emit8(cb, 0x48); // REX.W
emit8(cb, 0x01); // ADD r/m64, r64
emit8(cb, 0xC8); // ModRM: rcx -> rax
// push rax (result)
emit8(cb, 0x50);
}
For PUSH immediate:
// Bytecode: PUSH value
// Stack: [] -> [value]
// x86-64: mov rax, value; push rax
void compile_push(CodeBuffer* cb, int64_t value) {
// mov rax, imm64
emit8(cb, 0x48); // REX.W
emit8(cb, 0xB8); // mov rax, imm64
emit64(cb, value);
// push rax
emit8(cb, 0x50);
}
Hint 6: Handling LOAD/STORE
The VM’s memory array is passed via register (rdi in System V ABI):
// Before calling JIT code:
// rdi = &vm->memory[0] (base of VM memory)
// Bytecode: LOAD addr
// Load from VM memory, push to hardware stack
void compile_load(CodeBuffer* cb, int addr) {
int offset = addr * 8; // int64_t array
// mov rax, [rdi + offset]
emit8(cb, 0x48); // REX.W
emit8(cb, 0x8B); // mov r64, r/m64
emit8(cb, 0x87); // ModRM: [rdi + disp32]
emit32(cb, offset);
// push rax
emit8(cb, 0x50);
}
// Bytecode: STORE addr
// Pop from hardware stack, store to VM memory
void compile_store(CodeBuffer* cb, int addr) {
int offset = addr * 8;
// pop rax
emit8(cb, 0x58);
// mov [rdi + offset], rax
emit8(cb, 0x48); // REX.W
emit8(cb, 0x89); // mov r/m64, r64
emit8(cb, 0x87); // ModRM: [rdi + disp32]
emit32(cb, offset);
}
Hint 7: Compiling Conditional Branches
JNZ (Jump if Not Zero) requires careful handling:
// Bytecode: JNZ offset
// Pop value, jump if != 0
// For loops: offset is negative (backward jump)
void compile_jnz(CodeBuffer* cb, int target_offset, uint8_t* loop_start) {
// pop rax (condition)
emit8(cb, 0x58);
// test rax, rax
emit8(cb, 0x48); // REX.W
emit8(cb, 0x85); // test r/m64, r64
emit8(cb, 0xC0); // ModRM: rax, rax
// jnz rel32
emit8(cb, 0x0F);
emit8(cb, 0x85);
// Calculate relative offset to loop start
// JNZ is 6 bytes total, so calculate from end of instruction
uint8_t* jnz_end = cb->ptr + 4; // After we emit the 4-byte offset
int32_t rel = (int32_t)(loop_start - jnz_end);
emit32(cb, rel);
}
Key insight: x86-64 relative jumps are calculated from the END of the instruction, not the beginning.
Hint 8: Complete Compilation Example
Compiling a simple loop:
void compile_region(VMState* vm, int start_pc, int end_pc) {
CodeBuffer cb;
cb.buffer = alloc_executable(4096);
cb.ptr = cb.buffer;
// Remember start for loop jumps
uint8_t* loop_start = cb.ptr;
// Emit prologue: save callee-saved registers, get vm_memory pointer
// Assume: rdi = &vm->memory[0] (passed by caller)
// Translate bytecode
int pc = start_pc;
while (pc <= end_pc) {
uint8_t opcode = vm->code[pc++];
switch (opcode) {
case OP_PUSH: {
int64_t value = read_i64_at(vm, pc);
pc += 8;
compile_push(&cb, value);
break;
}
case OP_ADD:
compile_add(&cb);
break;
case OP_LOAD: {
int addr = read_i16_at(vm, pc);
pc += 2;
compile_load(&cb, addr);
break;
}
case OP_STORE: {
int addr = read_i16_at(vm, pc);
pc += 2;
compile_store(&cb, addr);
break;
}
case OP_JNZ: {
compile_jnz(&cb, 0, loop_start); // simplified: assume the only branch targets loop_start
pc += 2; // skip the i16 offset operand
break;
}
// ... other cases
}
}
// Emit epilogue: ret
emit8(&cb, 0xC3); // ret
// Calculate size
size_t code_size = cb.ptr - cb.buffer;
// Install in dispatch table
typedef void (*JitFunc)(int64_t* memory);
dispatch[start_pc] = (JitFunc)cb.buffer;
printf("[JIT] Compiled %zu bytes at %p\n", code_size, cb.buffer);
}
Hint 9: Calling JIT Code from Interpreter
The dispatch loop calls JIT code:
typedef void (*JitFunc)(int64_t* memory);
JitFunc dispatch[65536] = {NULL};
void run_with_jit(VMState* vm) {
while (vm->running) {
int pc = vm->pc;
// Check for JIT code
if (dispatch[pc] != NULL) {
// Call JIT code
// Pass pointer to VM memory
dispatch[pc](vm->memory);
// JIT code handles the loop internally
// and returns when loop exits
// We need to sync VM state...
// (This is simplified - real impl needs more care)
} else {
// Interpret as usual
interpret_one(vm);
}
}
}
Note: A complete implementation needs to handle:
- Syncing VM stack with hardware stack
- Updating VM PC after JIT returns
- Handling exits from JIT (loop exits, exceptions)
Hint 10: Full x86-64 Instruction Reference
Reference for all instructions you need:
INSTRUCTION ENCODING NOTES
─────────────────────────────────────────────────────────────────
mov rax, imm64 48 B8 [8 bytes] Load 64-bit immediate
mov rax, [rdi+off] 48 8B 87 [4 bytes] Load from memory
mov [rdi+off], rax 48 89 87 [4 bytes] Store to memory
push rax 50 Push to stack
pop rax 58 Pop from stack
pop rcx 59 Pop to rcx
pop rdx 5A Pop to rdx
add rax, rcx 48 01 C8 rax = rax + rcx
sub rax, rcx 48 29 C8 rax = rax - rcx
imul rax, rcx 48 0F AF C1 rax = rax * rcx
cqo 48 99 Sign-extend rax to rdx:rax
idiv rcx 48 F7 F9 rdx:rax / rcx, result in rax
test rax, rax 48 85 C0 Set flags based on rax
cmp rax, rcx 48 39 C8 Set flags: rax - rcx
jmp rel32 E9 [4 bytes] Unconditional jump
jnz rel32 0F 85 [4 bytes] Jump if not zero (ZF=0)
jz rel32 0F 84 [4 bytes] Jump if zero (ZF=1)
ret C3 Return from function
Relative addressing: All jump offsets are calculated from the END of the instruction.
5.8 The Interview Questions They’ll Ask
After completing this project, you should be able to answer:
- “What is JIT compilation and when is it better than interpretation?”
- Expected: JIT compiles code at runtime, better for hot paths that execute frequently
- Bonus: Discuss compilation overhead vs. execution speed tradeoff
- “How do you generate executable code at runtime?”
- Expected: mmap with PROT_EXEC, write machine code bytes, call via function pointer
- Bonus: Discuss W^X security, mprotect separation
- “How do you detect hot paths?”
- Expected: Backward branch counting (loops), method invocation counting
- Bonus: Discuss tracing JITs vs. method JITs
- “How does QEMU’s TCG relate to JIT compilation?”
- Expected: TCG translates guest ISA to host ISA at runtime using JIT techniques
- Bonus: Discuss IR-based translation, code caching
- “What’s the difference between a stack machine and a register machine?”
- Expected: Stack machines use implicit stack, register machines use explicit registers
- Bonus: Discuss why bytecode is often stack-based (simpler) but native code is register-based (faster)
- “How do you handle branches in JIT-compiled code?”
- Expected: Calculate relative offsets from instruction end, emit conditional jumps
- Bonus: Discuss branch prediction, trace compilation
- “What are the challenges of JIT code cache management?”
- Expected: Memory limits, invalidation on code changes, fragmentation
- Bonus: Discuss GC integration, OSR (on-stack replacement)
- “How would you debug JIT-generated code?”
- Expected: Dump generated bytes, use gdb to disassemble, step through execution
- Bonus: Discuss JIT map files, perf integration
5.9 Books That Will Help
| Topic | Book | Specific Chapter | Why It Helps |
|---|---|---|---|
| Stack-based VMs | “Language Implementation Patterns” by Terence Parr | Chapter 10-11 | Core VM design patterns |
| JIT techniques | “Engineering a Compiler” by Cooper & Torczon | Chapter 7, 13 | Code generation and optimization |
| x86-64 encoding | “Modern X86 Assembly” by Daniel Kusswurm | Chapters 1-5 | Instruction encoding reference |
| Memory mapping | “The Linux Programming Interface” by Kerrisk | Chapter 49 | mmap, mprotect details |
| QEMU internals | QEMU documentation | TCG documentation | How production binary translation works |
| LuaJIT | LuaJIT source code | lj_jit.c, lj_asm.c | Real trace JIT implementation |
5.10 Implementation Phases
Phase 1: Bytecode Interpreter (Days 1-3)
Goals:
- Implement all 14 bytecode instructions
- Create bytecode loader
- Test with simple programs
Tasks:
- Define opcodes and bytecode format
- Implement VMState structure
- Write interpreter dispatch loop
- Implement each instruction
- Create test bytecode by hand
- Verify with simple programs (add two numbers, loop to 10)
Checkpoint: Can run sum_to_100.bc and get correct result.
Phase 2: Hot Path Detection (Days 4-5)
Goals:
- Track backward branches
- Identify when threshold is crossed
- Log hot path detection events
Tasks:
- Add branch count array
- Modify JNZ handler to track backward branches
- Add threshold check
- Log when hot path is detected
- Calculate loop region (start_pc, end_pc)
Checkpoint: “Hot loop detected at 0xXX” message appears after 1000 iterations.
Phase 3: Basic Code Generation (Days 6-10)
Goals:
- Allocate executable memory
- Generate code for PUSH, ADD, SUB, MUL
- Execute simplest JIT code
Tasks:
- Implement mmap wrapper for executable memory
- Create CodeBuffer structure
- Write emit_* functions for basic instructions
- Compile simplest case: PUSH; PUSH; ADD
- Call JIT code from interpreter
- Verify result is correct
Checkpoint: JIT-compiled 5 + 3 returns 8.
Phase 4: Full Loop Compilation (Days 11-15)
Goals:
- Compile LOAD, STORE
- Compile JNZ with backward branch
- Handle VM memory access from JIT code
Tasks:
- Pass VM memory pointer to JIT code (via rdi)
- Implement compile_load, compile_store
- Implement compile_jnz with loop back
- Track code generation position for jump targets
- Test complete loop compilation
Checkpoint: Sum 1-to-100 loop runs entirely in JIT code.
Phase 5: Integration & Benchmarking (Days 16-21)
Goals:
- Seamless interpreter/JIT switching
- Code cache management
- Performance measurement
Tasks:
- Implement dispatch table
- Add interpreter fallback for uncompiled code
- Implement code cache with size limit
- Create Mandelbrot benchmark
- Measure and compare interpreter vs. JIT
- Tune JIT threshold
Checkpoint: 10x+ speedup on Mandelbrot, working demo.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Stack representation | VM stack vs. x86 stack | x86 stack | Simpler code generation, hardware optimized |
| Memory passing | Global pointer vs. parameter | Parameter (rdi) | Clean interface, follows ABI |
| Jump handling | Relative vs. absolute | Relative | Standard x86-64 encoding |
| Code cache | Fixed vs. growing | Fixed 64KB | Simpler management, sufficient for demo |
| Deoptimization | Full vs. none | Simple fallback | Keep scope manageable |
6. Testing Strategy
6.1 Unit Tests
| Test Category | What to Test |
|---|---|
| Interpreter | Each opcode in isolation |
| Code emission | Each emit_* function produces correct bytes |
| Code execution | Simple JIT sequences return correct values |
| Hot path detection | Threshold triggers at correct count |
| Memory access | LOAD/STORE work correctly from JIT code |
6.2 Integration Tests
Test 1: Simple Addition
Bytecode: PUSH 5; PUSH 3; ADD; HALT
Expected: Stack contains [8]
Test 2: Loop Counter
Bytecode: PUSH 0; STORE 0; PUSH 10; STORE 1;
LOAD 0; PUSH 1; ADD; STORE 0;
LOAD 1; PUSH 1; SUB; DUP; STORE 1; JNZ -20;
LOAD 0; HALT
Expected: Memory[0] = 10 (loop body ran 10 times; the DUP keeps a copy of the counter on the stack for JNZ after STORE pops it)
Test 3: Sum 1 to N
Run sum_to_100 with JIT disabled: result = 5050
Run sum_to_100 with JIT enabled: result = 5050
Verify: Same result, different speed
Test 4: JIT Triggering
Loop that iterates exactly 1001 times
Verify: JIT compilation message appears at iteration 1000
Test 5: Multiple Hot Regions
Two separate loops in one program
Verify: Both get compiled independently
6.3 Benchmark Tests
Benchmark 1: Sum 1 to 1,000,000
Interpreter: ~2 seconds
JIT: ~0.05 seconds
Expected speedup: 40x+
Benchmark 2: Fibonacci(40)
Interpreter: ~5 seconds
JIT: ~0.2 seconds
Expected speedup: 25x+
Benchmark 3: Mandelbrot 80x24
Interpreter: ~4 seconds
JIT: ~0.1 seconds
Expected speedup: 40x+
7. Common Pitfalls & Debugging
| Problem | Symptom | Cause | Solution | Verification |
|---|---|---|---|---|
| Segfault on JIT call | SIGSEGV immediately | Wrong calling convention | Check rdi points to valid memory | gdb: x/10gx $rdi |
| Segfault in JIT code | SIGSEGV during execution | Bad machine code encoding | Disassemble generated code | objdump -D -b binary -m i386:x86-64 dump.bin |
| Wrong computation result | Incorrect output | Stack order or encoding error | Add debug prints, trace execution | Compare interpreter vs. JIT step by step |
| mmap fails | MAP_FAILED return | SELinux, SIP restrictions | Use PROT_READ \| PROT_WRITE, then mprotect | Check dmesg for denials |
| JIT not triggering | Always interprets | Threshold not reached or wrong detection | Add logging to branch handler | Print branch count every 100 iterations |
| Infinite loop in JIT | Hangs | Jump offset calculated wrong | Print jump target address | gdb breakpoint before jump |
| Memory corruption | Random crashes | Buffer overflow in emit | Add bounds checking to CodeBuffer | Valgrind: valgrind ./jitvm |
| Performance not improved | Same speed | JIT code not being called | Add logging to dispatch | Print "entering JIT" message |
7.1 Debugging Techniques
Disassemble generated code:
# Dump JIT buffer to file
# In code: write(fd, cb.buffer, cb.ptr - cb.buffer);
# Then disassemble:
objdump -D -b binary -m i386:x86-64 jit_dump.bin
GDB debugging:
gdb ./jitvm
(gdb) break *0x7ffff7ff0000 # Break at JIT code address
(gdb) run test.bc
(gdb) disassemble $pc, $pc+50 # Show generated code
(gdb) info registers
(gdb) x/10gx $rdi # Examine VM memory
Add verification:
void verify_emit(CodeBuffer* cb, const char* description) {
printf("[EMIT] %s: ", description);
for (uint8_t* p = cb->buffer; p < cb->ptr; p++) {
printf("%02X ", *p);
}
printf("\n");
}
8. Extensions & Challenges
8.1 Beginner Extensions
- More operations: Add AND, OR, XOR, NOT, MOD
- Better diagnostics: Print disassembly of generated code
- Statistics: Track compilation time, cache hit rate
- Multiple loops: Support multiple hot regions in one program
8.2 Intermediate Extensions
- Function calls: Compile CALL/RET instructions
- Tiered compilation: Fast first-tier JIT, optimized second tier
- Register allocation: Map VM stack slots to x86 registers
- Code cache eviction: LRU policy when cache is full
8.3 Advanced Extensions
- On-stack replacement (OSR): Enter JIT mid-execution
- Deoptimization: Fall back to interpreter when invariants break
- Inline caching: Optimize repeated operations
- Vectorization: Use SIMD for parallel operations
- Different architectures: Add ARM64 backend
9. Real-World Connections
9.1 How Production JITs Differ
| Aspect | This Project | Production JIT (V8, HotSpot) |
|---|---|---|
| Optimization level | None | Multiple tiers, heavy optimization |
| IR usage | Direct translation | SSA-based IR with optimizations |
| Register allocation | Stack-based | Graph coloring, linear scan |
| Garbage collection | None | Full GC integration |
| Debugging support | Minimal | Source maps, profiler integration |
| OSR | None | Full on-stack replacement |
| Code size | ~1000 lines | 100,000+ lines |
9.2 Real-World Systems Using This Technique
QEMU TCG:
- Translates guest ISA to host ISA
- Uses intermediate TCG ops
- Supports dozens of guest/host combinations
- Handles privileged instructions, memory protection
V8 TurboFan:
- JavaScript JIT compiler
- Sea of Nodes IR
- Speculative optimization
- Deoptimization on type changes
LuaJIT:
- Trace-based JIT for Lua
- Records executed traces
- Generates highly optimized code
- Competitive with C on some workloads
HotSpot C2:
- JVM server compiler
- Extensive optimizations
- Profile-guided optimization
- Escape analysis, inlining
9.3 Open Source References
| Project | What to Learn |
|---|---|
| QEMU | tcg/ directory - real binary translator |
| LuaJIT | src/lj_asm.c - trace JIT |
| Copy | Tiny JIT in 1000 lines |
| LibJIT | JIT library for multiple backends |
10. Resources
10.1 Essential Reading
- “Language Implementation Patterns” by Terence Parr - Chapters 10-11 on VMs
- “Engineering a Compiler” by Cooper & Torczon - Chapter 7 on code generation
- Intel 64 and IA-32 Architectures Software Developer’s Manual - Volume 2 for encoding
- QEMU Documentation - TCG ops and translation concepts
10.2 Online Tutorials
10.3 Reference Materials
11. Self-Assessment Checklist
Understanding
- I can explain what JIT compilation is and when it’s beneficial
- I understand how mmap with PROT_EXEC creates executable memory
- I can describe how a stack machine differs from a register machine
- I know how to encode basic x86-64 instructions
- I understand how QEMU’s TCG relates to this project
- I can explain hot path detection strategies
Implementation
- My interpreter correctly executes all bytecode instructions
- Hot path detection triggers after the configured threshold
- JIT compilation produces correct machine code
- Generated code executes and produces correct results
- The dispatch table correctly routes to JIT or interpreter
- Benchmarks show significant speedup (10x+)
Growth
- I profiled my code and understand the performance characteristics
- I can debug JIT-generated code using gdb
- I could extend this to support new instructions
- I could explain my implementation in a technical interview
12. Submission / Completion Criteria
Minimum Viable Completion
- Interpreter executes PUSH, POP, ADD, SUB, MUL, LOAD, STORE, JNZ, HALT correctly
- Hot path detection triggers on backward branches
- At least arithmetic operations are JIT compiled
- JIT code executes and produces correct results
- Demo shows measurable speedup
Full Completion
- All 14 bytecode instructions interpreted correctly
- JIT compiles arithmetic, memory, and control flow operations
- Code cache manages memory with configurable size
- Mandelbrot benchmark shows 10x+ speedup
- Clean interpreter/JIT switching
Excellence (Going Above and Beyond)
- CALL/RET compiled in JIT
- Register allocation optimization
- Multiple compilation tiers
- Deoptimization support
- Performance comparable to LuaJIT-like systems
13. Learning Milestones
Track your progress through these milestones:
- mmap+PROT_EXEC works - Can execute generated code
- Test: Generate mov rax, 42; ret and call it
- Verify: Function returns 42
- Simple arithmetic JITs correctly - Basic code generation works
- Test: PUSH 5; PUSH 3; ADD returns 8 from JIT code
- Verify: Same result as interpreter
- Loops JIT correctly - Control flow handled
- Test: Loop counting to 100 runs entirely in JIT
- Verify: Memory[0] contains 100 after execution
- 10x speedup on hot loops - Effective JIT
- Test: Sum 1-to-1M benchmark
- Verify: JIT version is 10x+ faster
- Interpreter/JIT switching seamless - Full integration
- Test: Program with cold startup and hot loop
- Verify: Correct result, smooth transition, good performance
This guide was expanded from HYPERVISOR_VIRTUALIZATION_DEEP_DIVE_PROJECTS.md. For the complete learning path, see the main file which covers 15 projects from basic emulation to full hypervisor development.