Project 4: Simple JIT Compiler for a Bytecode VM

Build a Just-In-Time compiler that dynamically translates bytecode to native x86-64 machine code at runtime, achieving dramatic speedups for hot code paths - the same technique that powers QEMU’s TCG, the JVM, V8, and LuaJIT.

Quick Reference

Attribute              Value
─────────────────────────────────────────────────────────────────────
Difficulty             Advanced
Time Estimate          3-4 weeks
Language               C
Alternative Languages  C++, Rust
Prerequisites          Projects 1-3 (emulation basics), x86-64 assembly,
                       understanding of stack machines
Key Topics             JIT compilation, dynamic code generation, executable
                       memory (mmap), hot path detection, code cache management
Coolness Level         Level 5: Pure Magic (Super Cool)
Portfolio Value        Resume Gold
Main Book              “Language Implementation Patterns” by Terence Parr

1. Learning Objectives

By completing this project, you will:

  1. Understand JIT compilation fundamentals: Explain the difference between interpretation and JIT compilation, and when each approach is appropriate
  2. Master runtime code generation: Write C code that generates executable x86-64 machine code at runtime using mmap with PROT_EXEC
  3. Implement hot path detection: Build a profiling mechanism that identifies frequently-executed code worth compiling
  4. Design a code cache: Manage compiled code fragments with proper invalidation and memory limits
  5. Handle stack-to-register mapping: Transform stack-based bytecode operations into efficient register-based machine code
  6. Implement interpreter/JIT switching: Seamlessly transition between interpreted and compiled execution
  7. Build the foundation for understanding QEMU’s TCG: This is the same core technique QEMU’s Tiny Code Generator uses for cross-architecture emulation (TCG adds an IR layer, as noted in section 2.9)

2. Theoretical Foundation

2.1 What is JIT Compilation?

A Just-In-Time (JIT) compiler translates code at runtime rather than ahead-of-time. Unlike traditional compilers that produce executables before the program runs, a JIT compiles code while the program executes.

COMPILATION STRATEGIES COMPARISON

Ahead-of-Time (AOT) Compilation:
┌─────────────┐    ┌──────────────┐    ┌─────────────┐
│ Source Code │───►│   Compiler   │───►│ Executable  │
│   (foo.c)   │    │   (gcc/clang)│    │  (foo.exe)  │
└─────────────┘    └──────────────┘    └─────────────┘
                          │
                   Compile Time
                          │
    ─────────────────────────────────────────────────────
                          │
                    Run Time
                          │
                          ▼
              ┌─────────────────────┐
              │   Native Execution  │
              │   (Already compiled)│
              └─────────────────────┘


Just-In-Time (JIT) Compilation:
┌─────────────┐    ┌──────────────┐
│ Source Code │───►│ Bytecode     │
│   (App.java)│    │ Compiler     │
└─────────────┘    └──────┬───────┘
                          │
                   Compile Time (AOT to bytecode)
                          │
                          ▼
              ┌─────────────────────┐
              │     Bytecode        │
              │   (App.class)       │
              └─────────────────────┘
    ─────────────────────────────────────────────────────
                          │
                    Run Time
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│                    JVM / VM Runtime                  │
│                                                      │
│  ┌───────────────────┐     ┌────────────────────┐   │
│  │    Interpreter    │     │   JIT Compiler     │   │
│  │  (Cold code path) │     │  (Hot code path)   │   │
│  └─────────┬─────────┘     └──────────┬─────────┘   │
│            │                          │             │
│            ▼                          ▼             │
│    ┌────────────────┐      ┌────────────────────┐  │
│    │ Execute slowly │      │ Compile to native  │  │
│    │ (but starts    │      │ code, then execute │  │
│    │  immediately)  │      │ at full speed      │  │
│    └────────────────┘      └────────────────────┘  │
└─────────────────────────────────────────────────────┘

2.2 Why JIT Compilation?

The key insight is that not all code is equally important. Most programs spend 90% of their time in 10% of the code (hot paths). JIT compilation optimizes this by:

  1. Start Fast: Begin interpreting immediately - no compilation delay
  2. Profile: Identify which code runs frequently (hot paths)
  3. Compile Hot Paths: Convert frequently-executed bytecode to native code
  4. Run Fast: Execute compiled code at near-native speed

EXECUTION TIME DISTRIBUTION

Program Execution:
────────────────────────────────────────────────────────────────────
Time →

┌──────────┬───────────────────────────────────────────────────────┐
│Startup   │                    Main Execution                      │
│ (10%)    │                        (90%)                           │
└──────────┴───────────────────────────────────────────────────────┘

Code Path Distribution:
┌─────────────────────────────────────────────────┐
│                                                  │
│  Hot Paths (10% of code)                        │
│  ┌─────────────────────────────────────────┐    │
│  │ ████████████████████████████████████    │    │ 90% of time
│  └─────────────────────────────────────────┘    │
│                                                  │
│  Cold Paths (90% of code)                       │
│  ┌─────────────────────────────────────────┐    │
│  │ ████                                    │    │ 10% of time
│  └─────────────────────────────────────────┘    │
│                                                  │
└─────────────────────────────────────────────────┘

JIT Strategy: Only compile the hot paths (10% of code, 90% of time)
             - Interpret cold paths (acceptable speed for rare code)
             - Maximum benefit with minimal compilation overhead
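
A rough break-even estimate with illustrative (not measured) numbers: if an interpreted bytecode op costs about 100 ns, its compiled equivalent about 3 ns, and compiling the loop costs about 1 ms, compilation pays for itself after roughly 1,000,000 / (100 - 3) ≈ 10,300 executions of that op - a count a hot loop reaches almost immediately and run-once code never does.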

2.3 How QEMU’s TCG Uses JIT

QEMU (Quick EMUlator) uses its Tiny Code Generator (TCG) to translate guest architecture instructions to host architecture code at runtime. This is the same concept we’re implementing:

QEMU TCG Architecture

Guest Code (ARM)              TCG (Intermediate)           Host Code (x86-64)
┌─────────────────┐          ┌─────────────────┐          ┌─────────────────┐
│ ADD R0, R1, R2  │          │ load_i32 t0, r1 │          │ mov eax, [rbx+4]│
│ SUB R0, R0, #1  │    ───►  │ load_i32 t1, r2 │    ───►  │ mov ecx, [rbx+8]│
│ BNE loop        │          │ add_i32 t2,t0,t1│          │ add eax, ecx    │
└─────────────────┘          │ store_i32 r0,t2 │          │ sub eax, 1      │
                             │ ...             │          │ jne loop        │
                             └─────────────────┘          └─────────────────┘

OUR PROJECT: Simplified Version

Bytecode (Stack-based)        Direct Translation           Host Code (x86-64)
┌─────────────────┐                                       ┌─────────────────┐
│ PUSH 10         │                                       │ push 10         │
│ PUSH 20         │          No intermediate!             │ push 20         │
│ ADD             │          ─────────────────►           │ pop rcx         │
│ STORE 0         │          (Direct bytecode             │ pop rax         │
│ JNZ -4          │           to x86-64)                  │ add rax, rcx    │
└─────────────────┘                                       │ mov [vars], rax │
                                                          │ test rax, rax   │
                                                          │ jnz loop        │
                                                          └─────────────────┘

2.4 The Stack Machine Model

Our bytecode VM uses a stack-based architecture (like the JVM, Python VM, and WebAssembly). Understanding this is crucial:

STACK MACHINE vs REGISTER MACHINE

Stack Machine (Our Bytecode):        Register Machine (x86-64):
┌──────────────────────────────┐    ┌──────────────────────────────┐
│                              │    │                              │
│  a + b * c                   │    │  a + b * c                   │
│                              │    │                              │
│  Bytecode:                   │    │  x86-64:                     │
│    PUSH a     ; [a]          │    │    mov rax, [a]              │
│    PUSH b     ; [a, b]       │    │    mov rcx, [b]              │
│    PUSH c     ; [a, b, c]    │    │    mov rdx, [c]              │
│    MUL        ; [a, b*c]     │    │    imul rcx, rdx  ; rcx=b*c  │
│    ADD        ; [a+b*c]      │    │    add rax, rcx   ; result   │
│                              │    │                              │
│  Operands implicit (stack)   │    │  Operands explicit (regs)    │
│  Simple bytecode format      │    │  Complex instruction encoding│
│  Easy to compile from source │    │  Hard to compile from source │
│  Slow to interpret           │    │  Fast to execute             │
│                              │    │                              │
└──────────────────────────────┘    └──────────────────────────────┘

JIT CHALLENGE: Convert stack operations to register operations

2.5 Executable Memory: The mmap Magic

The key to JIT compilation is creating memory that the CPU can execute. This requires special permissions:

MEMORY PERMISSIONS

Normal heap allocation (malloc):
┌─────────────────────────────────────────────────────┐
│ void* p = malloc(4096);                             │
│ Permissions: READ + WRITE                           │
│ Cannot execute code here - CPU will fault!          │
└─────────────────────────────────────────────────────┘

Executable allocation (mmap):
┌─────────────────────────────────────────────────────┐
│ void* code = mmap(NULL, 4096,                       │
│                   PROT_READ | PROT_WRITE | PROT_EXEC│
│                   MAP_PRIVATE | MAP_ANONYMOUS,      │
│                   -1, 0);                           │
│                                                     │
│ Permissions: READ + WRITE + EXECUTE                 │
│ CAN execute code here - CPU will run it!            │
└─────────────────────────────────────────────────────┘

SECURITY NOTE:
- W^X (Write XOR Execute) is a security principle
- Many systems restrict RWX memory
- Production JITs: write code, then mprotect() to RX
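
To make the idea concrete, here is a minimal self-contained sketch (Linux/x86-64 assumed) that allocates memory, writes the machine code for mov eax, 42; ret into it, flips the page to read+execute, and calls it:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    // mov eax, 42  (B8 2A 00 00 00), then ret (C3)
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    void* mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    memcpy(mem, code, sizeof(code));
    mprotect(mem, 4096, PROT_READ | PROT_EXEC);   // W^X: drop write before executing

    int (*fn)(void) = (int (*)(void))mem;
    printf("JIT says: %d\n", fn());               // prints 42

    munmap(mem, 4096);
    return 0;
}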

2.6 Hot Path Detection

Not every piece of code deserves JIT compilation. We need to identify hot paths:

HOT PATH DETECTION STRATEGIES

Strategy 1: Invocation Counting (Method-based JIT)
┌─────────────────────────────────────────────────────┐
│ function_call_count[func_id]++;                     │
│ if (function_call_count[func_id] > THRESHOLD) {    │
│     compile_function(func_id);                      │
│ }                                                   │
│                                                     │
│ Used by: JVM (HotSpot), V8 (initial tier)          │
└─────────────────────────────────────────────────────┘

Strategy 2: Loop Detection (Trace-based JIT)
┌─────────────────────────────────────────────────────┐
│ // Backward branch = potential loop                 │
│ if (branch_target < current_pc) {                   │
│     backward_branch_count[current_pc]++;            │
│     if (backward_branch_count[current_pc] > THRESH) │
│         compile_trace_from(branch_target);          │
│ }                                                   │
│                                                     │
│ Used by: LuaJIT, TraceMonkey (old Firefox)         │
└─────────────────────────────────────────────────────┘

OUR PROJECT: Uses Strategy 2 (Loop Detection)
- Simple to implement
- Highly effective (loops are where programs spend time)
- Threshold: 1000 backward branches before compilation

2.7 The JIT Compilation Pipeline

JIT COMPILATION PIPELINE

┌─────────────────────────────────────────────────────────────────┐
│                        JIT COMPILER                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. HOT PATH DETECTION                                           │
│     ┌─────────────────────────────────────────────────────┐     │
│     │ Interpreter tracks execution counts                  │     │
│     │ Backward branch counter > 1000? → Trigger JIT       │     │
│     └─────────────────────────────────────────────────────┘     │
│                              │                                   │
│                              ▼                                   │
│  2. TRACE RECORDING                                              │
│     ┌─────────────────────────────────────────────────────┐     │
│     │ Record bytecode sequence from loop start            │     │
│     │ Stop at: loop end, function call, or threshold      │     │
│     └─────────────────────────────────────────────────────┘     │
│                              │                                   │
│                              ▼                                   │
│  3. CODE GENERATION                                              │
│     ┌─────────────────────────────────────────────────────┐     │
│     │ For each bytecode instruction:                       │     │
│     │   - Map stack operations to registers               │     │
│     │   - Emit corresponding x86-64 bytes                 │     │
│     │   - Handle control flow (branches, loops)           │     │
│     └─────────────────────────────────────────────────────┘     │
│                              │                                   │
│                              ▼                                   │
│  4. CODE INSTALLATION                                            │
│     ┌─────────────────────────────────────────────────────┐     │
│     │ mmap() executable memory                            │     │
│     │ Copy generated code                                 │     │
│     │ Update dispatch table: bytecode_addr → native_addr │     │
│     └─────────────────────────────────────────────────────┘     │
│                              │                                   │
│                              ▼                                   │
│  5. EXECUTION                                                    │
│     ┌─────────────────────────────────────────────────────┐     │
│     │ Next time interpreter hits this address:            │     │
│     │   → Jump to native code instead of interpreting    │     │
│     └─────────────────────────────────────────────────────┘     │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

2.8 Why This Matters for Virtualization

JIT compilation is the heart of software-based virtualization:

System            JIT Technique        Purpose
──────────────────────────────────────────────────────────────────
QEMU TCG          Binary translation   Run ARM/MIPS/RISC-V code on x86
Rosetta 2         Binary translation   Run x86 apps on Apple Silicon
JVM HotSpot       Method JIT           Run Java bytecode at native speed
V8/SpiderMonkey   Tiered JIT           Run JavaScript at native speed
LuaJIT            Trace JIT            Run Lua at C-like speed
WebAssembly       AOT + JIT            Run web apps at near-native speed
eBPF              Verification + JIT   Run safe code in Linux kernel

Understanding JIT compilation means understanding how ALL of these systems work internally.

2.9 Common Misconceptions

Misconception 1: “JIT is always faster than interpretation”

  • Reality: JIT has compilation overhead. For code that runs once, interpretation is faster. JIT only wins for frequently-executed code.

Misconception 2: “JIT generates optimal code”

  • Reality: JIT must balance compilation time vs. code quality. It generates “good enough” code quickly, not optimal code slowly.

Misconception 3: “Writing a JIT is extremely complex”

  • Reality: A basic JIT (like this project) is ~1000 lines. Production JITs are complex because of optimizations, not the core concept.

Misconception 4: “You need an IR (Intermediate Representation)”

  • Reality: QEMU’s TCG uses an IR, but simple JITs can translate directly. We’ll translate bytecode directly to x86-64.

Misconception 5: “JIT code must be position-independent”

  • Reality: We know exactly where our code will be. Absolute addresses are fine (and simpler).

3. Project Specification

3.1 What You Will Build

A complete JIT compiler system that:

  1. Interprets a stack-based bytecode VM with 15 instructions
  2. Detects hot loops using backward branch counting
  3. JIT compiles hot paths to native x86-64 machine code
  4. Manages a code cache with memory limits
  5. Seamlessly switches between interpreted and JIT-compiled execution
  6. Achieves 10-50x speedup on computational loops

3.2 Bytecode Instruction Set

Your VM will support these 15 instructions:

Opcode  Instruction  Stack Effect        Description
─────────────────────────────────────────────────────────────────
0x01    PUSH n       [] -> [n]           Push immediate value onto stack
0x02    POP          [a] -> []           Discard top of stack
0x03    DUP          [a] -> [a, a]       Duplicate top of stack
0x10    ADD          [a, b] -> [a+b]     Add top two values
0x11    SUB          [a, b] -> [a-b]     Subtract (second - top)
0x12    MUL          [a, b] -> [a*b]     Multiply top two values
0x13    DIV          [a, b] -> [a/b]     Integer divide
0x20    LOAD addr    [] -> [mem[addr]]   Load from memory
0x21    STORE addr   [a] -> []           Store to memory
0x30    JMP offset   [] -> []            Unconditional jump
0x31    JZ offset    [a] -> []           Jump if zero
0x32    JNZ offset   [a] -> []           Jump if not zero
0x40    CALL addr    [] -> []            Call subroutine
0x41    RET          [] -> []            Return from subroutine
0xFF    HALT         [] -> []            Stop execution
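
A minimal opcodes.h sketch for these values (the names are suggestions, and the operand widths follow the hints in section 5.7, not a fixed specification):

/* opcodes.h - bytecode opcode values */
#ifndef OPCODES_H
#define OPCODES_H

enum {
    OP_PUSH  = 0x01,   // followed by an immediate (8 bytes in Hint 1's interpreter)
    OP_POP   = 0x02,
    OP_DUP   = 0x03,
    OP_ADD   = 0x10,
    OP_SUB   = 0x11,
    OP_MUL   = 0x12,
    OP_DIV   = 0x13,
    OP_LOAD  = 0x20,   // followed by a 2-byte address
    OP_STORE = 0x21,   // followed by a 2-byte address
    OP_JMP   = 0x30,   // followed by a 2-byte signed offset
    OP_JZ    = 0x31,   // followed by a 2-byte signed offset
    OP_JNZ   = 0x32,   // followed by a 2-byte signed offset
    OP_CALL  = 0x40,   // followed by a 2-byte address
    OP_RET   = 0x41,
    OP_HALT  = 0xFF
};

#endif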

3.3 Functional Requirements

ID Requirement Priority
FR1 Interpreter correctly executes all 15 bytecode instructions Must Have
FR2 Hot path detection triggers after 1000 backward branches Must Have
FR3 JIT compiles arithmetic operations (ADD, SUB, MUL, DIV) Must Have
FR4 JIT compiles control flow (JMP, JZ, JNZ) Must Have
FR5 JIT compiles memory operations (LOAD, STORE, PUSH, POP) Must Have
FR6 Code cache limits memory usage to configurable size Should Have
FR7 Deoptimization falls back to interpreter gracefully Should Have
FR8 Performance logging shows compilation events Should Have
FR9 CALL/RET handled (interpret or compile) Could Have
FR10 Multiple hot regions can be compiled independently Could Have

3.4 Non-Functional Requirements

Requirement Target Measurement
Interpreter speed > 10M bytecode ops/sec Benchmark tight loop
JIT speedup > 10x over interpreter Compare Mandelbrot times
Compilation time < 1ms for typical loop Measure with clock_gettime
Code cache size Configurable, default 64KB Track allocations
Memory overhead < 100 bytes per cached region Profile memory usage

3.5 Example Usage / Output

$ ./jitvm mandelbrot.bc
[JIT] Detected hot loop at bytecode offset 0x42
[JIT] Compiling 23 bytecode ops to 156 bytes of x86-64
[JIT] Code cached at 0x7f4a12340000

++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++ +++++++++++++++++++
++++++++++++++++++++  +++++++++++++++++
+++++++++++++++++++    ++++++++++++++++
++++++++++++++++++      +++++++++++++++
++++++++++++++  ++       +++++++++++++
+++++++++++                +++++++++++
+++++++++++                +++++++++++
++++++++++++++  ++       +++++++++++++
++++++++++++++++++      +++++++++++++++
+++++++++++++++++++    ++++++++++++++++
++++++++++++++++++++  +++++++++++++++++
++++++++++++++++++++ +++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++

Execution time: 0.089s (JIT enabled)
Execution time: 4.231s (interpreter only)
Speedup: 47.5x

3.6 Real World Outcome

When you complete this project:

  1. You will have built a working JIT compiler - actual native code generation
  2. You will understand QEMU’s TCG at a fundamental level
  3. You will know how V8, JVM HotSpot, and LuaJIT work internally
  4. You will have resume-worthy evidence of advanced systems programming skills
  5. You will be able to explain JIT compilation in technical interviews
  6. You will have built the foundation for hardware-assisted virtualization projects

4. Solution Architecture

4.1 High-Level Design

┌──────────────────────────────────────────────────────────────────────────┐
│                          JIT VM ARCHITECTURE                              │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐ │
│  │                         BYTECODE LOADER                              │ │
│  │   Load .bc file → Parse header → Validate → Store in code[]         │ │
│  └───────────────────────────────────┬─────────────────────────────────┘ │
│                                      │                                    │
│                                      ▼                                    │
│  ┌──────────────────────────────────────────────────────────────────────┐│
│  │                        EXECUTION ENGINE                               ││
│  │                                                                       ││
│  │  ┌─────────────────────────────────────────────────────────────────┐ ││
│  │  │                    DISPATCH LOOP                                 │ ││
│  │  │                                                                  │ ││
│  │  │   while (running) {                                              │ ││
│  │  │       if (jit_code[pc] != NULL) {                               │ ││
│  │  │           // Execute JIT-compiled native code                    │ ││
│  │  │           jit_code[pc](&vm_state);                              │ ││
│  │  │       } else {                                                   │ ││
│  │  │           // Interpret bytecode                                  │ ││
│  │  │           interpret_one(&vm_state);                             │ ││
│  │  │           check_hot_path(pc);                                   │ ││
│  │  │       }                                                          │ ││
│  │  │   }                                                              │ ││
│  │  │                                                                  │ ││
│  │  └─────────────────────────────────────────────────────────────────┘ ││
│  │                     │                           │                     ││
│  │        ┌────────────┘                           └────────────┐        ││
│  │        ▼                                                     ▼        ││
│  │  ┌──────────────────────────┐           ┌──────────────────────────┐ ││
│  │  │      INTERPRETER         │           │      JIT COMPILER         │ ││
│  │  │                          │           │                           │ ││
│  │  │  switch (opcode) {       │           │  1. Trace recording       │ ││
│  │  │    case PUSH: ...        │           │  2. Bytecode → x86-64     │ ││
│  │  │    case ADD: ...         │           │  3. Emit to code cache    │ ││
│  │  │    case JNZ: ...         │           │  4. Update dispatch       │ ││
│  │  │  }                       │           │                           │ ││
│  │  │                          │           │                           │ ││
│  │  └──────────────────────────┘           └──────────────────────────┘ ││
│  │                                                     │                 ││
│  │                                                     ▼                 ││
│  │                                         ┌──────────────────────────┐ ││
│  │                                         │      CODE CACHE          │ ││
│  │                                         │                          │ ││
│  │                                         │  mmap() RWX memory       │ ││
│  │                                         │  [region1][region2]...   │ ││
│  │                                         │  Max 64KB, LRU eviction  │ ││
│  │                                         │                          │ ││
│  │                                         └──────────────────────────┘ ││
│  │                                                                       ││
│  └───────────────────────────────────────────────────────────────────────┘│
│                                                                           │
│  ┌──────────────────────────────────────────────────────────────────────┐│
│  │                          VM STATE                                     ││
│  │                                                                       ││
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                ││
│  │  │   Stack      │  │   Memory     │  │  Registers   │                ││
│  │  │   [0..255]   │  │   [0..4095]  │  │  PC, SP, FP  │                ││
│  │  └──────────────┘  └──────────────┘  └──────────────┘                ││
│  │                                                                       ││
│  └───────────────────────────────────────────────────────────────────────┘│
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘

4.2 Key Components

Component Responsibility Performance Requirement
Bytecode Loader Parse and validate bytecode file O(n) for n bytes
Interpreter Execute bytecode instructions > 10M ops/sec
Hot Path Detector Track backward branch counts O(1) per branch
JIT Compiler Translate bytecode to x86-64 < 1ms per region
Code Emitter Write x86-64 machine code bytes O(n) for n instructions
Code Cache Store and manage compiled code O(1) lookup
Dispatch Table Map bytecode address to native O(1) lookup

4.3 Data Structures

// VM State - passed to JIT-compiled code
typedef struct {
    int64_t stack[256];      // Operand stack
    int stack_top;           // Stack pointer
    int64_t memory[4096];    // VM memory (variables)
    uint8_t* code;           // Bytecode array
    int pc;                  // Program counter
    int running;             // Execution flag
} VMState;

// Hot path tracking
typedef struct {
    int execution_count;     // Number of times executed
    int bytecode_start;      // Starting PC of region
    int bytecode_end;        // Ending PC of region
    void* native_code;       // Pointer to compiled code
    size_t code_size;        // Size of compiled code
} CompiledRegion;

// Code cache management
typedef struct {
    void* memory;            // mmap'd executable memory
    size_t capacity;         // Total size
    size_t used;             // Bytes used
    CompiledRegion regions[256];
    int num_regions;
} CodeCache;

// Dispatch table
typedef void (*JitFunc)(VMState*);
JitFunc dispatch[65536];     // Index by bytecode PC

4.4 x86-64 Code Generation

The JIT compiler translates each bytecode instruction to x86-64. Here’s the mapping:

BYTECODE TO x86-64 TRANSLATION

PUSH immediate:
  Bytecode: [0x01] [value:8]
  x86-64:   mov rax, <value>        ; 48 B8 <8 bytes>
            push rax                 ; 50

POP:
  Bytecode: [0x02]
  x86-64:   add rsp, 8              ; 48 83 C4 08

ADD:
  Bytecode: [0x10]
  x86-64:   pop rcx                 ; 59
            pop rax                 ; 58
            add rax, rcx            ; 48 01 C8
            push rax                ; 50

SUB:
  Bytecode: [0x11]
  x86-64:   pop rcx                 ; 59
            pop rax                 ; 58
            sub rax, rcx            ; 48 29 C8
            push rax                ; 50

MUL:
  Bytecode: [0x12]
  x86-64:   pop rcx                 ; 59
            pop rax                 ; 58
            imul rax, rcx           ; 48 0F AF C1
            push rax                ; 50

LOAD addr:
  Bytecode: [0x20] [addr:2]
  x86-64:   mov rax, [rdi + offset] ; 48 8B 87 <offset>
            push rax                ; 50
  (where rdi points to vm->memory, offset = addr * 8)

STORE addr:
  Bytecode: [0x21] [addr:2]
  x86-64:   pop rax                 ; 58
            mov [rdi + offset], rax ; 48 89 87 <offset>

JNZ offset (backward branch - loop):
  Bytecode: [0x32] [offset:2]
  x86-64:   pop rax                 ; 58
            test rax, rax           ; 48 85 C0
            jnz <relative_addr>     ; 0F 85 <4 bytes>
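
DIV is not shown above; a translation consistent with the reference table in Hint 10 (signed division, second / top, quotient pushed) would be:

DIV:
  Bytecode: [0x13]
  x86-64:   pop rcx                 ; 59    (divisor  = top of stack)
            pop rax                 ; 58    (dividend = second)
            cqo                     ; 48 99 (sign-extend rax into rdx:rax)
            idiv rcx                ; 48 F7 F9
            push rax                ; 50    (quotient)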

4.5 Algorithm Overview

ALGORITHM: JIT Compilation Pipeline

function interpret_with_jit(vm):
    while vm.running:
        pc = vm.pc

        // Check if JIT code exists for this address
        if dispatch[pc] != NULL:
            dispatch[pc](&vm)
            continue

        // Interpret one instruction
        opcode = vm.code[pc]
        switch opcode:
            case PUSH: ...
            case ADD: ...
            case JNZ:
                // HOT PATH DETECTION
                target = vm.pc + offset
                if target < pc:  // Backward branch (loop)
                    branch_count[pc]++
                    if branch_count[pc] > JIT_THRESHOLD:
                        compile_region(target, pc)

function compile_region(start_pc, end_pc):
    // 1. Allocate code buffer
    buffer = alloc_executable(4096)
    emit_ptr = buffer

    // 2. Emit prologue
    emit_prologue(&emit_ptr)

    // 3. Translate each bytecode instruction
    for pc = start_pc to end_pc:
        opcode = bytecode[pc]
        switch opcode:
            case PUSH:
                emit_push_immediate(&emit_ptr, get_immediate(pc))
            case ADD:
                emit_add(&emit_ptr)
            case JNZ:
                emit_conditional_jump(&emit_ptr, target)
            ...

    // 4. Emit epilogue
    emit_epilogue(&emit_ptr)

    // 5. Install in dispatch table
    dispatch[start_pc] = buffer

    // 6. Log compilation
    log("[JIT] Compiled %d bytes at %p", emit_ptr - buffer, buffer)

5. Implementation Guide

5.1 Development Environment Setup

# Install required tools
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install gcc gdb nasm   # hexdump is usually preinstalled (bsdmainutils/util-linux)

# macOS
brew install gcc nasm

# Verify installation
gcc --version    # Any reasonably recent GCC works (needs C11 support)
nasm --version   # For understanding x86 encoding
gdb --version    # For debugging

# Create project structure
mkdir -p jitvm/{src,include,tests,bytecode}
cd jitvm

5.2 Project Structure

jitvm/
├── Makefile
├── src/
│   ├── main.c              # Entry point, argument parsing
│   ├── vm.c                # VM state management
│   ├── interpreter.c       # Bytecode interpreter
│   ├── jit.c               # JIT compiler core
│   ├── codegen.c           # x86-64 code generation
│   ├── codecache.c         # Code cache management
│   └── loader.c            # Bytecode loader
├── include/
│   ├── vm.h                # VM structures
│   ├── opcodes.h           # Bytecode definitions
│   ├── jit.h               # JIT interfaces
│   └── codegen.h           # Code generation helpers
├── tests/
│   ├── test_interpreter.c  # Unit tests
│   └── test_jit.c          # JIT tests
├── bytecode/
│   ├── simple.bc           # Simple test program
│   ├── fibonacci.bc        # Fibonacci benchmark
│   └── mandelbrot.bc       # Mandelbrot set renderer
└── tools/
    └── assembler.py        # Simple bytecode assembler

5.3 The Core Question You’re Answering

“How do you generate executable machine code at runtime, and when is it worth the compilation overhead?”

This question captures the essence of JIT compilation:

  1. Runtime code generation: Using mmap with PROT_EXEC to create executable memory, then writing valid x86-64 instruction bytes
  2. When it’s worth it: Only for hot paths - code that executes frequently enough that compilation cost is amortized

This is the same question answered by every JIT system from the JVM to QEMU’s TCG.

5.4 Concepts You Must Understand First

Before starting implementation, verify your understanding:

Concept Self-Assessment Question Where to Learn
Stack machines How does PUSH 5; PUSH 3; ADD leave the stack? “Language Implementation Patterns” Ch. 5
x86-64 registers What’s the difference between RAX, EAX, AX, AL? Intel SDM Vol. 1, Ch. 3
x86-64 encoding What bytes encode mov rax, 0x1234567890ABCDEF? Intel SDM Vol. 2
Memory mapping What does mmap with PROT_EXEC do? “TLPI” Ch. 49
Calling conventions How are arguments passed in x86-64 Linux/macOS? System V ABI
Branch prediction Why are backward branches usually predicted taken? CS:APP Ch. 5

5.5 Questions to Guide Your Design

Interpreter Design:

  1. How will you represent the bytecode instruction set?
  2. What data structure will hold the VM stack?
  3. How will you handle control flow (jumps, branches)?

Hot Path Detection:

  1. What threshold triggers compilation?
  2. How will you identify loop boundaries?
  3. What data structure tracks execution counts?

Code Generation:

  1. How will you map VM stack operations to x86-64 registers?
  2. How will you handle memory operands (LOAD/STORE)?
  3. How will you encode conditional branches?

Code Cache:

  1. How will you allocate executable memory?
  2. What’s your eviction policy when cache is full?
  3. How will you invalidate stale code (if needed)?

Integration:

  1. How does the interpreter transfer control to JIT code?
  2. How does JIT code return control to the interpreter?
  3. How do you handle operations JIT doesn’t support?

5.6 Thinking Exercise

Before writing any code, trace through this bytecode execution:

Program: Sum 1 to 100
Address  Bytecode        Stack (after)    Memory
0x00     PUSH 0          [0]
0x02     STORE 0         []               M[0]=0 (sum)
0x05     PUSH 1          [1]
0x07     STORE 1         []               M[1]=1 (counter)
0x0A     LOAD 0          [0]              <- LOOP START
0x0D     LOAD 1          [0, 1]
0x10     ADD             [1]
0x11     STORE 0         []               M[0]=1
0x14     LOAD 1          [1]
0x17     PUSH 1          [1, 1]
0x19     ADD             [2]
0x1A     DUP             [2, 2]
0x1B     STORE 1         [2]              M[1]=2
0x1E     PUSH 101        [2, 101]
0x21     SUB             [-99]
0x22     JNZ -0x18       []               Jump to 0x0A if != 0
0x25     LOAD 0          [5050]
0x28     HALT

Questions to answer:

  1. How many times does the loop execute?
  2. Which bytecode addresses constitute the “hot region”?
  3. What x86-64 instructions would you emit for LOAD 0; LOAD 1; ADD; STORE 0?
  4. How would you encode JNZ -0x18 in x86-64?

5.7 Hints in Layers

Use these progressive hints only when stuck.

Hint 1: Starting with the Interpreter

Start with a working interpreter before adding JIT:

void interpret(VMState* vm) {
    while (vm->running) {
        uint8_t opcode = vm->code[vm->pc++];

        switch (opcode) {
            case OP_PUSH: {
                int64_t value = read_i64(vm);  // Read 8-byte immediate
                vm->stack[vm->stack_top++] = value;
                break;
            }
            case OP_ADD: {
                int64_t b = vm->stack[--vm->stack_top];
                int64_t a = vm->stack[--vm->stack_top];
                vm->stack[vm->stack_top++] = a + b;
                break;
            }
            case OP_JNZ: {
                int16_t offset = read_i16(vm);  // Read 2-byte offset
                int64_t cond = vm->stack[--vm->stack_top];
                if (cond != 0) {
                    vm->pc += offset - 3;  // offset is relative to the JNZ opcode; pc has already advanced 3 bytes past it
                }
                break;
            }
            case OP_HALT:
                vm->running = 0;
                break;
            // ... other cases
        }
    }
}
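
The read_i64/read_i16 helpers are not shown above; here is a minimal sketch, assuming immediates and offsets are stored little-endian directly in the bytecode stream:

// Read an 8-byte little-endian immediate at vm->pc and advance past it
static int64_t read_i64(VMState* vm) {
    int64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v |= (int64_t)vm->code[vm->pc + i] << (8 * i);
    }
    vm->pc += 8;
    return v;
}

// Read a 2-byte little-endian signed offset/address at vm->pc and advance past it
static int16_t read_i16(VMState* vm) {
    int16_t v = (int16_t)(vm->code[vm->pc] | (vm->code[vm->pc + 1] << 8));
    vm->pc += 2;
    return v;
}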
Hint 2: Hot Path Detection

Track backward branches to find loops:

#define JIT_THRESHOLD 1000

int branch_counts[65536] = {0};  // Per-PC branch counter

void check_hot_path(VMState* vm, int target_pc) {
    int current_pc = vm->pc;

    // Only track backward branches (loops)
    if (target_pc < current_pc) {
        int loop_header = target_pc;
        branch_counts[loop_header]++;

        if (branch_counts[loop_header] >= JIT_THRESHOLD) {
            printf("[JIT] Hot loop detected at 0x%04X\n", loop_header);
            compile_region(vm, loop_header, current_pc);
            branch_counts[loop_header] = 0;  // Reset counter
        }
    }
}
Hint 3: Allocating Executable Memory

Use mmap to create RWX memory:

#include <sys/mman.h>

void* alloc_executable(size_t size) {
    void* mem = mmap(
        NULL,                           // Let kernel choose address
        size,                           // Size in bytes
        PROT_READ | PROT_WRITE | PROT_EXEC,  // RWX permissions
        MAP_PRIVATE | MAP_ANONYMOUS,    // Private, not file-backed
        -1,                             // No file descriptor
        0                               // Offset
    );

    if (mem == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }

    return mem;
}

void free_executable(void* mem, size_t size) {
    munmap(mem, size);
}

Security Note: On some systems (macOS, especially Apple Silicon where JIT memory needs the MAP_JIT flag, or Linux with SELinux/W^X policies), you may need to:

  1. Write code first (PROT_READ | PROT_WRITE)
  2. Then mprotect() to (PROT_READ | PROT_EXEC)
// Safer two-step approach
void* mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// ... write code to mem ...
mprotect(mem, size, PROT_READ | PROT_EXEC);
Hint 4: Code Generation Basics

Emit x86-64 bytes directly to a buffer:

typedef struct {
    uint8_t* buffer;
    uint8_t* ptr;
    size_t size;
} CodeBuffer;

// Emit single byte
void emit8(CodeBuffer* cb, uint8_t b) {
    *cb->ptr++ = b;
}

// Emit 32-bit value (little-endian)
void emit32(CodeBuffer* cb, uint32_t v) {
    *cb->ptr++ = v & 0xFF;
    *cb->ptr++ = (v >> 8) & 0xFF;
    *cb->ptr++ = (v >> 16) & 0xFF;
    *cb->ptr++ = (v >> 24) & 0xFF;
}

// Emit 64-bit value (little-endian)
void emit64(CodeBuffer* cb, uint64_t v) {
    emit32(cb, v & 0xFFFFFFFF);
    emit32(cb, v >> 32);
}

// Example: mov rax, immediate64
void emit_mov_rax_imm64(CodeBuffer* cb, int64_t value) {
    emit8(cb, 0x48);  // REX.W prefix
    emit8(cb, 0xB8);  // mov rax, imm64
    emit64(cb, value);
}

// Example: push rax
void emit_push_rax(CodeBuffer* cb) {
    emit8(cb, 0x50);
}

// Example: pop rax
void emit_pop_rax(CodeBuffer* cb) {
    emit8(cb, 0x58);
}

// Example: pop rcx
void emit_pop_rcx(CodeBuffer* cb) {
    emit8(cb, 0x59);
}

// Example: add rax, rcx
void emit_add_rax_rcx(CodeBuffer* cb) {
    emit8(cb, 0x48);  // REX.W
    emit8(cb, 0x01);  // add r/m64, r64
    emit8(cb, 0xC8);  // ModRM: rcx -> rax
}
Hint 5: Compiling a Simple Operation

Translate bytecode ADD to x86-64:

// Bytecode: ADD
// Stack:    [a, b] -> [a + b]
// x86-64:   pop rcx; pop rax; add rax, rcx; push rax

void compile_add(CodeBuffer* cb) {
    // pop rcx (second operand)
    emit8(cb, 0x59);

    // pop rax (first operand)
    emit8(cb, 0x58);

    // add rax, rcx
    emit8(cb, 0x48);  // REX.W
    emit8(cb, 0x01);  // ADD r/m64, r64
    emit8(cb, 0xC8);  // ModRM: rcx -> rax

    // push rax (result)
    emit8(cb, 0x50);
}

For PUSH immediate:

// Bytecode: PUSH value
// Stack:    [] -> [value]
// x86-64:   mov rax, value; push rax

void compile_push(CodeBuffer* cb, int64_t value) {
    // mov rax, imm64
    emit8(cb, 0x48);  // REX.W
    emit8(cb, 0xB8);  // mov rax, imm64
    emit64(cb, value);

    // push rax
    emit8(cb, 0x50);
}
Hint 6: Handling LOAD/STORE

The VM’s memory array is passed via register (rdi in System V ABI):

// Before calling JIT code:
// rdi = &vm->memory[0]  (base of VM memory)

// Bytecode: LOAD addr
// Load from VM memory, push to hardware stack

void compile_load(CodeBuffer* cb, int addr) {
    int offset = addr * 8;  // int64_t array

    // mov rax, [rdi + offset]
    emit8(cb, 0x48);  // REX.W
    emit8(cb, 0x8B);  // mov r64, r/m64
    emit8(cb, 0x87);  // ModRM: [rdi + disp32]
    emit32(cb, offset);

    // push rax
    emit8(cb, 0x50);
}

// Bytecode: STORE addr
// Pop from hardware stack, store to VM memory

void compile_store(CodeBuffer* cb, int addr) {
    int offset = addr * 8;

    // pop rax
    emit8(cb, 0x58);

    // mov [rdi + offset], rax
    emit8(cb, 0x48);  // REX.W
    emit8(cb, 0x89);  // mov r/m64, r64
    emit8(cb, 0x87);  // ModRM: [rdi + disp32]
    emit32(cb, offset);
}
Hint 7: Compiling Conditional Branches

JNZ (Jump if Not Zero) requires careful handling:

// Bytecode: JNZ offset
// Pop value, jump if != 0
// For loops: offset is negative (backward jump)
// Note: target_offset is unused in this simplified version, which always
// jumps back to loop_start.

void compile_jnz(CodeBuffer* cb, int target_offset, uint8_t* loop_start) {
    // pop rax (condition)
    emit8(cb, 0x58);

    // test rax, rax
    emit8(cb, 0x48);  // REX.W
    emit8(cb, 0x85);  // test r/m64, r64
    emit8(cb, 0xC0);  // ModRM: rax, rax

    // jnz rel32
    emit8(cb, 0x0F);
    emit8(cb, 0x85);

    // Calculate relative offset to loop start
    // JNZ is 6 bytes total, so calculate from end of instruction
    uint8_t* jnz_end = cb->ptr + 4;  // After we emit the 4-byte offset
    int32_t rel = (int32_t)(loop_start - jnz_end);
    emit32(cb, rel);
}

Key insight: x86-64 relative jumps are calculated from the END of the instruction, not the beginning.
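
Worked example (illustrative addresses): if loop_start is at buffer+0x10 and the 0F 85 opcode bytes begin at buffer+0x4A, the 6-byte jnz instruction ends at buffer+0x50, so the emitted rel32 is 0x10 - 0x50 = -0x40 (-64).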

Hint 8: Complete Compilation Example

Compiling a simple loop:

void compile_region(VMState* vm, int start_pc, int end_pc) {
    CodeBuffer cb;
    cb.buffer = alloc_executable(4096);
    cb.ptr = cb.buffer;

    // Remember start for loop jumps
    uint8_t* loop_start = cb.ptr;

    // Emit prologue: save callee-saved registers, get vm_memory pointer
    // Assume: rdi = &vm->memory[0] (passed by caller)

    // Translate bytecode
    int pc = start_pc;
    while (pc <= end_pc) {
        uint8_t opcode = vm->code[pc++];

        switch (opcode) {
            case OP_PUSH: {
                int64_t value = read_i64_at(vm, pc);
                pc += 8;
                compile_push(&cb, value);
                break;
            }
            case OP_ADD:
                compile_add(&cb);
                break;
            case OP_LOAD: {
                int addr = read_i16_at(vm, pc);
                pc += 2;
                compile_load(&cb, addr);
                break;
            }
            case OP_STORE: {
                int addr = read_i16_at(vm, pc);
                pc += 2;
                compile_store(&cb, addr);
                break;
            }
            case OP_JNZ: {
                compile_jnz(&cb, 0, loop_start);
                pc += 2;  // Skip offset bytes
                break;
            }
            // ... other cases
        }
    }

    // Emit epilogue: ret
    emit8(&cb, 0xC3);  // ret

    // Calculate size
    size_t code_size = cb.ptr - cb.buffer;

    // Install in dispatch table
    typedef void (*JitFunc)(int64_t* memory);
    dispatch[start_pc] = (JitFunc)cb.buffer;

    printf("[JIT] Compiled %zu bytes at %p\n", code_size, cb.buffer);
}
Hint 9: Calling JIT Code from Interpreter

The dispatch loop calls JIT code:

typedef void (*JitFunc)(int64_t* memory);
JitFunc dispatch[65536] = {NULL};

void run_with_jit(VMState* vm) {
    while (vm->running) {
        int pc = vm->pc;

        // Check for JIT code
        if (dispatch[pc] != NULL) {
            // Call JIT code
            // Pass pointer to VM memory
            dispatch[pc](vm->memory);

            // JIT code handles the loop internally
            // and returns when loop exits
            // We need to sync VM state...
            // (This is simplified - real impl needs more care)
        } else {
            // Interpret as usual
            interpret_one(vm);
        }
    }
}

Note: A complete implementation needs to handle:

  • Syncing VM stack with hardware stack
  • Updating VM PC after JIT returns (one approach is sketched below)
  • Handling exits from JIT (loop exits, exceptions)
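
One simple way to handle the PC update (a design sketch, not the only option): have each compiled region return the bytecode address at which interpretation should resume. The epilogue becomes mov rax, <exit_pc>; ret (rax carries the return value in the System V ABI), and Hint 9's loop changes accordingly. The emit helpers are the ones from Hint 4; exit_pc stands for the bytecode address just past the compiled region.

#include <stdint.h>

// Sketch: compiled regions return the bytecode PC at which to resume
typedef int64_t (*JitFunc)(int64_t* memory);
extern JitFunc dispatch[65536];

// In compile_region, the epilogue would become:
//     emit_mov_rax_imm64(&cb, exit_pc);   // rax = resume PC
//     emit8(&cb, 0xC3);                   // ret

void run_with_jit(VMState* vm) {
    while (vm->running) {
        if (dispatch[vm->pc] != NULL) {
            vm->pc = (int)dispatch[vm->pc](vm->memory);  // resume where JIT left off
        } else {
            interpret_one(vm);
        }
    }
}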
Hint 10: Full x86-64 Instruction Reference

Reference for all instructions you need:

INSTRUCTION          ENCODING                    NOTES
─────────────────────────────────────────────────────────────────
mov rax, imm64      48 B8 [8 bytes]             Load 64-bit immediate
mov rax, [rdi+off]  48 8B 87 [4 bytes]          Load from memory
mov [rdi+off], rax  48 89 87 [4 bytes]          Store to memory

push rax            50                           Push to stack
pop rax             58                           Pop from stack
pop rcx             59                           Pop to rcx
pop rdx             5A                           Pop to rdx

add rax, rcx        48 01 C8                    rax = rax + rcx
sub rax, rcx        48 29 C8                    rax = rax - rcx
imul rax, rcx       48 0F AF C1                 rax = rax * rcx

cqo                 48 99                        Sign-extend rax to rdx:rax
idiv rcx            48 F7 F9                    rdx:rax / rcx, result in rax

test rax, rax       48 85 C0                    Set flags based on rax
cmp rax, rcx        48 39 C8                    Set flags: rax - rcx

jmp rel32           E9 [4 bytes]                Unconditional jump
jnz rel32           0F 85 [4 bytes]             Jump if not zero (ZF=0)
jz rel32            0F 84 [4 bytes]             Jump if zero (ZF=1)

ret                 C3                           Return from function

Relative addressing: All jump offsets are calculated from the END of the instruction.

5.8 The Interview Questions They’ll Ask

After completing this project, you should be able to answer:

  1. “What is JIT compilation and when is it better than interpretation?”
    • Expected: JIT compiles code at runtime, better for hot paths that execute frequently
    • Bonus: Discuss compilation overhead vs. execution speed tradeoff
  2. “How do you generate executable code at runtime?”
    • Expected: mmap with PROT_EXEC, write machine code bytes, call via function pointer
    • Bonus: Discuss W^X security, mprotect separation
  3. “How do you detect hot paths?”
    • Expected: Backward branch counting (loops), method invocation counting
    • Bonus: Discuss tracing JITs vs. method JITs
  4. “How does QEMU’s TCG relate to JIT compilation?”
    • Expected: TCG translates guest ISA to host ISA at runtime using JIT techniques
    • Bonus: Discuss IR-based translation, code caching
  5. “What’s the difference between a stack machine and a register machine?”
    • Expected: Stack machines use implicit stack, register machines use explicit registers
    • Bonus: Discuss why bytecode is often stack-based (simpler) but native code is register-based (faster)
  6. “How do you handle branches in JIT-compiled code?”
    • Expected: Calculate relative offsets from instruction end, emit conditional jumps
    • Bonus: Discuss branch prediction, trace compilation
  7. “What are the challenges of JIT code cache management?”
    • Expected: Memory limits, invalidation on code changes, fragmentation
    • Bonus: Discuss GC integration, OSR (on-stack replacement)
  8. “How would you debug JIT-generated code?”
    • Expected: Dump generated bytes, use gdb to disassemble, step through execution
    • Bonus: Discuss JIT map files, perf integration

5.9 Books That Will Help

Topic Book Specific Chapter Why It Helps
Stack-based VMs “Language Implementation Patterns” by Terence Parr Chapter 10-11 Core VM design patterns
JIT techniques “Engineering a Compiler” by Cooper & Torczon Chapter 7, 13 Code generation and optimization
x86-64 encoding “Modern X86 Assembly” by Daniel Kusswurm Chapters 1-5 Instruction encoding reference
Memory mapping “The Linux Programming Interface” by Kerrisk Chapter 49 mmap, mprotect details
QEMU internals QEMU documentation TCG documentation How production binary translation works
LuaJIT LuaJIT source code lj_jit.c, lj_asm.c Real trace JIT implementation

5.10 Implementation Phases

Phase 1: Bytecode Interpreter (Days 1-3)

Goals:

  • Implement all 15 bytecode instructions
  • Create bytecode loader
  • Test with simple programs

Tasks:

  1. Define opcodes and bytecode format
  2. Implement VMState structure
  3. Write interpreter dispatch loop
  4. Implement each instruction
  5. Create test bytecode by hand
  6. Verify with simple programs (add two numbers, loop to 10)

Checkpoint: Can run sum_to_100.bc and get correct result.

Phase 2: Hot Path Detection (Days 4-5)

Goals:

  • Track backward branches
  • Identify when threshold is crossed
  • Log hot path detection events

Tasks:

  1. Add branch count array
  2. Modify JNZ handler to track backward branches
  3. Add threshold check
  4. Log when hot path is detected
  5. Calculate loop region (start_pc, end_pc)

Checkpoint: “Hot loop detected at 0xXX” message appears after 1000 iterations.

Phase 3: Basic Code Generation (Days 6-10)

Goals:

  • Allocate executable memory
  • Generate code for PUSH, ADD, SUB, MUL
  • Execute simplest JIT code

Tasks:

  1. Implement mmap wrapper for executable memory
  2. Create CodeBuffer structure
  3. Write emit_* functions for basic instructions
  4. Compile simplest case: PUSH; PUSH; ADD
  5. Call JIT code from interpreter
  6. Verify result is correct

Checkpoint: JIT-compiled 5 + 3 returns 8.

Phase 4: Full Loop Compilation (Days 11-15)

Goals:

  • Compile LOAD, STORE
  • Compile JNZ with backward branch
  • Handle VM memory access from JIT code

Tasks:

  1. Pass VM memory pointer to JIT code (via rdi)
  2. Implement compile_load, compile_store
  3. Implement compile_jnz with loop back
  4. Track code generation position for jump targets
  5. Test complete loop compilation

Checkpoint: Sum 1-to-100 loop runs entirely in JIT code.

Phase 5: Integration & Benchmarking (Days 16-21)

Goals:

  • Seamless interpreter/JIT switching
  • Code cache management
  • Performance measurement

Tasks:

  1. Implement dispatch table
  2. Add interpreter fallback for uncompiled code
  3. Implement code cache with size limit
  4. Create Mandelbrot benchmark
  5. Measure and compare interpreter vs. JIT
  6. Tune JIT threshold

Checkpoint: 10x+ speedup on Mandelbrot, working demo.

5.11 Key Implementation Decisions

Decision              Options                       Recommendation   Rationale
──────────────────────────────────────────────────────────────────────────────────────────
Stack representation  VM stack vs. x86 stack        x86 stack        Simpler code generation, hardware optimized
Memory passing        Global pointer vs. parameter  Parameter (rdi)  Clean interface, follows the ABI
Jump handling         Relative vs. absolute         Relative         Standard x86-64 encoding
Code cache            Fixed vs. growing             Fixed 64KB       Simpler management, sufficient for demo
Deoptimization        Full vs. none                 Simple fallback  Keeps scope manageable

6. Testing Strategy

6.1 Unit Tests

Test Category What to Test
Interpreter Each opcode in isolation
Code emission Each emit_* function produces correct bytes (sketch below)
Code execution Simple JIT sequences return correct values
Hot path detection Threshold triggers at correct count
Memory access LOAD/STORE work correctly from JIT code
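
A sketch of one byte-level emission test referenced above (assumes the CodeBuffer struct and emit_mov_rax_imm64() from Hint 4; plain assert instead of a test framework):

#include <assert.h>
#include <stdint.h>
#include <string.h>

void test_emit_mov_rax_imm64(void) {
    uint8_t buf[16];
    CodeBuffer cb = { .buffer = buf, .ptr = buf, .size = sizeof(buf) };

    emit_mov_rax_imm64(&cb, 0x1122334455667788);

    // Expect REX.W + B8, then the immediate in little-endian byte order
    const uint8_t expected[] = { 0x48, 0xB8, 0x88, 0x77, 0x66, 0x55,
                                 0x44, 0x33, 0x22, 0x11 };
    assert((size_t)(cb.ptr - cb.buffer) == sizeof(expected));
    assert(memcmp(cb.buffer, expected, sizeof(expected)) == 0);
}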

6.2 Integration Tests

Test 1: Simple Addition
  Bytecode: PUSH 5; PUSH 3; ADD; HALT
  Expected: Stack contains [8]   (a C driver for this test is sketched below)

Test 2: Loop Counter
  Bytecode: PUSH 0; STORE 0; PUSH 10; STORE 1;
           LOAD 0; PUSH 1; ADD; DUP; STORE 0;
           LOAD 1; PUSH 1; SUB; DUP; STORE 1; JNZ -20;
           LOAD 0; HALT
  Expected: Memory[0] = 10 (counted 10 times)

Test 3: Sum 1 to N
  Run sum_to_100 with JIT disabled: result = 5050
  Run sum_to_100 with JIT enabled: result = 5050
  Verify: Same result, different speed

Test 4: JIT Triggering
  Loop that iterates exactly 1001 times
  Verify: JIT compilation message appears at iteration 1000

Test 5: Multiple Hot Regions
  Two separate loops in one program
  Verify: Both get compiled independently
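
A sketch of a C driver for Test 1 (assumes the VMState struct and interpret() from Hint 1, 8-byte little-endian PUSH immediates, and the opcode values from section 3.2):

#include <assert.h>
#include <stdint.h>

void test_simple_addition(void) {
    uint8_t program[] = {
        0x01, 5, 0, 0, 0, 0, 0, 0, 0,   // PUSH 5
        0x01, 3, 0, 0, 0, 0, 0, 0, 0,   // PUSH 3
        0x10,                           // ADD
        0xFF                            // HALT
    };

    VMState vm = {0};
    vm.code = program;
    vm.running = 1;

    interpret(&vm);

    assert(vm.stack_top == 1);          // exactly one value left
    assert(vm.stack[0] == 8);           // 5 + 3
}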

6.3 Benchmark Tests

Benchmark 1: Sum 1 to 1,000,000
  Interpreter: ~2 seconds
  JIT: ~0.05 seconds
  Expected speedup: 40x+

Benchmark 2: Fibonacci(40)
  Interpreter: ~5 seconds
  JIT: ~0.2 seconds
  Expected speedup: 25x+

Benchmark 3: Mandelbrot 80x24
  Interpreter: ~4 seconds
  JIT: ~0.1 seconds
  Expected speedup: 40x+
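
The non-functional requirements in section 3.4 suggest clock_gettime for timing; a minimal wall-clock helper sketch:

#include <time.h>

// Current CLOCK_MONOTONIC time in seconds (nanosecond resolution)
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

// Usage sketch:
//     double t0 = now_seconds();
//     run_with_jit(&vm);
//     printf("Execution time: %.3fs\n", now_seconds() - t0);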

7. Common Pitfalls & Debugging

Problem: Segfault on JIT call
  Symptom:      SIGSEGV immediately
  Cause:        Wrong calling convention
  Solution:     Check that rdi points to valid memory
  Verification: gdb: x/10gx $rdi

Problem: Segfault in JIT code
  Symptom:      SIGSEGV during execution
  Cause:        Bad machine code encoding
  Solution:     Disassemble the generated code
  Verification: objdump -D -b binary -m i386:x86-64 dump.bin

Problem: Wrong computation result
  Symptom:      Incorrect output
  Cause:        Stack order or encoding error
  Solution:     Add debug prints, trace execution
  Verification: Compare interpreter vs. JIT step by step

Problem: mmap fails
  Symptom:      MAP_FAILED return
  Cause:        SELinux / SIP restrictions
  Solution:     Use PROT_READ | PROT_WRITE first, then mprotect
  Verification: Check dmesg for denials

Problem: JIT not triggering
  Symptom:      Always interprets
  Cause:        Threshold not reached or wrong detection
  Solution:     Add logging to the branch handler
  Verification: Print the branch count every 100 iterations

Problem: Infinite loop in JIT code
  Symptom:      Hangs
  Cause:        Jump offset calculated wrong
  Solution:     Print the jump target address
  Verification: gdb breakpoint before the jump

Problem: Memory corruption
  Symptom:      Random crashes
  Cause:        Buffer overflow in emit functions
  Solution:     Add bounds checking to CodeBuffer
  Verification: valgrind ./jitvm

Problem: Performance not improved
  Symptom:      Same speed
  Cause:        JIT code not being called
  Solution:     Add logging to the dispatch loop
  Verification: Print an “entering JIT” message

7.1 Debugging Techniques

Disassemble generated code:

# Dump JIT buffer to file
# In code: write(fd, cb.buffer, cb.ptr - cb.buffer);

# Then disassemble:
objdump -D -b binary -m i386:x86-64 jit_dump.bin
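
A small helper to produce that dump file (a sketch; assumes the CodeBuffer struct from Hint 4, and the filename is arbitrary):

#include <stdio.h>

// Write the raw JIT bytes to a file so objdump can disassemble them offline
static void dump_jit_code(const CodeBuffer* cb, const char* path) {
    FILE* f = fopen(path, "wb");
    if (!f) { perror("fopen"); return; }
    fwrite(cb->buffer, 1, (size_t)(cb->ptr - cb->buffer), f);
    fclose(f);
}

// Usage: dump_jit_code(&cb, "jit_dump.bin");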

GDB debugging:

gdb ./jitvm
(gdb) break *0x7ffff7ff0000  # Break at JIT code address
(gdb) run test.bc
(gdb) disassemble $pc, $pc+50  # Show generated code
(gdb) info registers
(gdb) x/10gx $rdi  # Examine VM memory

Add verification:

void verify_emit(CodeBuffer* cb, const char* description) {
    printf("[EMIT] %s: ", description);
    for (uint8_t* p = cb->buffer; p < cb->ptr; p++) {
        printf("%02X ", *p);
    }
    printf("\n");
}

8. Extensions & Challenges

8.1 Beginner Extensions

  • More operations: Add AND, OR, XOR, NOT, MOD
  • Better diagnostics: Print disassembly of generated code
  • Statistics: Track compilation time, cache hit rate
  • Multiple loops: Support multiple hot regions in one program

8.2 Intermediate Extensions

  • Function calls: Compile CALL/RET instructions
  • Tiered compilation: Fast first-tier JIT, optimized second tier
  • Register allocation: Map VM stack slots to x86 registers
  • Code cache eviction: LRU policy when cache is full

8.3 Advanced Extensions

  • On-stack replacement (OSR): Enter JIT mid-execution
  • Deoptimization: Fall back to interpreter when invariants break
  • Inline caching: Optimize repeated operations
  • Vectorization: Use SIMD for parallel operations
  • Different architectures: Add ARM64 backend

9. Real-World Connections

9.1 How Production JITs Differ

Aspect This Project Production JIT (V8, HotSpot)
Optimization level None Multiple tiers, heavy optimization
IR usage Direct translation SSA-based IR with optimizations
Register allocation Stack-based Graph coloring, linear scan
Garbage collection None Full GC integration
Debugging support Minimal Source maps, profiler integration
OSR None Full on-stack replacement
Code size ~1000 lines 100,000+ lines

9.2 Real-World Systems Using This Technique

QEMU TCG:

  • Translates guest ISA to host ISA
  • Uses intermediate TCG ops
  • Supports dozens of guest/host combinations
  • Handles privileged instructions, memory protection

V8 TurboFan:

  • JavaScript JIT compiler
  • Sea of Nodes IR
  • Speculative optimization
  • Deoptimization on type changes

LuaJIT:

  • Trace-based JIT for Lua
  • Records executed traces
  • Generates highly optimized code
  • Often faster than C for some workloads

HotSpot C2:

  • JVM server compiler
  • Extensive optimizations
  • Profile-guided optimization
  • Escape analysis, inlining

9.3 Open Source References

Project What to Learn
QEMU tcg/ directory - real binary translator
LuaJIT src/lj_asm.c - trace JIT
Copy Tiny JIT in 1000 lines
LibJIT JIT library for multiple backends

10. Resources

10.1 Essential Reading

  • “Language Implementation Patterns” by Terence Parr - Chapters 10-11 on VMs
  • “Engineering a Compiler” by Cooper & Torczon - Chapter 7 on code generation
  • Intel 64 and IA-32 Architectures Software Developer’s Manual - Volume 2 for encoding
  • QEMU Documentation - TCG ops and translation concepts

10.2 Online Tutorials

10.3 Reference Materials


11. Self-Assessment Checklist

Understanding

  • I can explain what JIT compilation is and when it’s beneficial
  • I understand how mmap with PROT_EXEC creates executable memory
  • I can describe how a stack machine differs from a register machine
  • I know how to encode basic x86-64 instructions
  • I understand how QEMU’s TCG relates to this project
  • I can explain hot path detection strategies

Implementation

  • My interpreter correctly executes all bytecode instructions
  • Hot path detection triggers after the configured threshold
  • JIT compilation produces correct machine code
  • Generated code executes and produces correct results
  • The dispatch table correctly routes to JIT or interpreter
  • Benchmarks show significant speedup (10x+)

Growth

  • I profiled my code and understand the performance characteristics
  • I can debug JIT-generated code using gdb
  • I could extend this to support new instructions
  • I could explain my implementation in a technical interview

12. Submission / Completion Criteria

Minimum Viable Completion

  • Interpreter executes PUSH, POP, ADD, SUB, MUL, LOAD, STORE, JNZ, HALT correctly
  • Hot path detection triggers on backward branches
  • At least arithmetic operations are JIT compiled
  • JIT code executes and produces correct results
  • Demo shows measurable speedup

Full Completion

  • All 15 bytecode instructions interpreted correctly
  • JIT compiles arithmetic, memory, and control flow operations
  • Code cache manages memory with configurable size
  • Mandelbrot benchmark shows 10x+ speedup
  • Clean interpreter/JIT switching

Excellence (Going Above and Beyond)

  • CALL/RET compiled in JIT
  • Register allocation optimization
  • Multiple compilation tiers
  • Deoptimization support
  • Performance comparable to LuaJIT-like systems

13. Learning Milestones

Track your progress through these milestones:

  1. mmap+PROT_EXEC works - Can execute generated code
    • Test: Generate mov rax, 42; ret and call it
    • Verify: Function returns 42
  2. Simple arithmetic JITs correctly - Basic code generation works
    • Test: PUSH 5; PUSH 3; ADD returns 8 from JIT code
    • Verify: Same result as interpreter
  3. Loops JIT correctly - Control flow handled
    • Test: Loop counting to 100 runs entirely in JIT
    • Verify: Memory[0] contains 100 after execution
  4. 10x speedup on hot loops - Effective JIT
    • Test: Sum 1-to-1M benchmark
    • Verify: JIT version is 10x+ faster
  5. Interpreter/JIT switching seamless - Full integration
    • Test: Program with cold startup and hot loop
    • Verify: Correct result, smooth transition, good performance

This guide was expanded from HYPERVISOR_VIRTUALIZATION_DEEP_DIVE_PROJECTS.md. For the complete learning path, see the main file which covers 15 projects from basic emulation to full hypervisor development.