Project 4: Simple JIT Compiler for a Bytecode VM
Build a Just-In-Time compiler that dynamically translates bytecode to native x86-64 machine code at runtime, achieving dramatic speedups for hot code paths. This is the same technique that powers QEMU’s TCG, the JVM, V8, and LuaJIT.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 3-4 weeks |
| Language | C |
| Alternative Languages | C++, Rust |
| Prerequisites | Projects 1-3 (emulation basics), x86-64 assembly, understanding of stack machines |
| Key Topics | JIT compilation, dynamic code generation, executable memory (mmap), hot path detection, code cache management |
| Coolness Level | Level 5: Pure Magic (Super Cool) |
| Portfolio Value | Resume Gold |
| Main Book | “Language Implementation Patterns” by Terence Parr |
1. Learning Objectives
By completing this project, you will:
- Understand JIT compilation fundamentals: Explain the difference between interpretation and JIT compilation, and when each approach is appropriate
- Master runtime code generation: Write C code that generates executable x86-64 machine code at runtime using mmap with PROT_EXEC
- Implement hot path detection: Build a profiling mechanism that identifies frequently-executed code worth compiling
- Design a code cache: Manage compiled code fragments with proper invalidation and memory limits
- Handle stack-to-register mapping: Transform stack-based bytecode operations into efficient register-based machine code
- Implement interpreter/JIT switching: Seamlessly transition between interpreted and compiled execution
- Build the foundation for understanding QEMU’s TCG: this project uses the same core technique as QEMU’s Tiny Code Generator for cross-architecture emulation
2. Theoretical Foundation
2.1 What is JIT Compilation?
A Just-In-Time (JIT) compiler translates code at runtime rather than ahead-of-time. Unlike traditional compilers that produce executables before the program runs, a JIT compiles code while the program executes.
COMPILATION STRATEGIES COMPARISON
Ahead-of-Time (AOT) Compilation:
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ Source Code │───►│ Compiler │───►│ Executable │
│ (foo.c) │ │ (gcc/clang)│ │ (foo.exe) │
└─────────────┘ └──────────────┘ └─────────────┘
│
Compile Time
│
─────────────────────────────────────────────────────
│
Run Time
│
▼
┌─────────────────────┐
│ Native Execution │
│ (Already compiled)│
└─────────────────────┘
Just-In-Time (JIT) Compilation:
┌─────────────┐ ┌──────────────┐
│ Source Code │───►│ Bytecode │
│ (App.java)│ │ Compiler │
└─────────────┘ └──────┬───────┘
│
Compile Time (AOT to bytecode)
│
▼
┌─────────────────────┐
│ Bytecode │
│ (App.class) │
└─────────────────────┘
─────────────────────────────────────────────────────
│
Run Time
│
▼
┌─────────────────────────────────────────────────────┐
│ JVM / VM Runtime │
│ │
│ ┌───────────────────┐ ┌────────────────────┐ │
│ │ Interpreter │ │ JIT Compiler │ │
│ │ (Cold code path) │ │ (Hot code path) │ │
│ └─────────┬─────────┘ └──────────┬─────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────┐ ┌────────────────────┐ │
│ │ Execute slowly │ │ Compile to native │ │
│ │ (but starts │ │ code, then execute │ │
│ │ immediately) │ │ at full speed │ │
│ └────────────────┘ └────────────────────┘ │
└─────────────────────────────────────────────────────┘
2.2 Why JIT Compilation?
The key insight is that not all code is equally important. Most programs spend 90% of their time in 10% of the code (hot paths). JIT compilation optimizes this by:
- Start Fast: Begin interpreting immediately - no compilation delay
- Profile: Identify which code runs frequently (hot paths)
- Compile Hot Paths: Convert frequently-executed bytecode to native code
- Run Fast: Execute compiled code at near-native speed
EXECUTION TIME DISTRIBUTION
Program Execution:
────────────────────────────────────────────────────────────────────
Time →
┌──────────┬───────────────────────────────────────────────────────┐
│Startup │ Main Execution │
│ (10%) │ (90%) │
└──────────┴───────────────────────────────────────────────────────┘
Code Path Distribution:
┌─────────────────────────────────────────────────┐
│ │
│ Hot Paths (10% of code) │
│ ┌─────────────────────────────────────────┐ │
│ │ ████████████████████████████████████ │ │ 90% of time
│ └─────────────────────────────────────────┘ │
│ │
│ Cold Paths (90% of code) │
│ ┌─────────────────────────────────────────┐ │
│ │ ████ │ │ 10% of time
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────┘
JIT Strategy: Only compile the hot paths (10% of code, 90% of time)
- Interpret cold paths (acceptable speed for rare code)
- Maximum benefit with minimal compilation overhead
2.3 How QEMU’s TCG Uses JIT
QEMU (Quick EMUlator) uses its Tiny Code Generator (TCG) to translate guest architecture instructions to host architecture code at runtime. This is the same concept we’re implementing:
QEMU TCG Architecture
Guest Code (ARM) TCG (Intermediate) Host Code (x86-64)
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ ADD R0, R1, R2 │ │ load_i32 t0, r1 │ │ mov eax, [rbx+4]│
│ SUB R0, R0, #1 │ ───► │ load_i32 t1, r2 │ ───► │ mov ecx, [rbx+8]│
│ BNE loop │ │ add_i32 t2,t0,t1│ │ add eax, ecx │
└─────────────────┘ │ store_i32 r0,t2 │ │ sub eax, 1 │
│ ... │ │ jne loop │
└─────────────────┘ └─────────────────┘
OUR PROJECT: Simplified Version
Bytecode (Stack-based) Direct Translation Host Code (x86-64)
┌─────────────────┐ ┌─────────────────┐
│ PUSH 10 │ │ push 10 │
│ PUSH 20 │ No intermediate! │ push 20 │
│ ADD │ ─────────────────► │ pop rcx │
│ STORE 0 │ (Direct bytecode │ pop rax │
│ JNZ -4 │ to x86-64) │ add rax, rcx │
└─────────────────┘ │ mov [vars], rax │
│ test rax, rax │
│ jnz loop │
└─────────────────┘
2.4 The Stack Machine Model
Our bytecode VM uses a stack-based architecture (like the JVM, Python VM, and WebAssembly). Understanding this is crucial:
STACK MACHINE vs REGISTER MACHINE
Stack Machine (Our Bytecode): Register Machine (x86-64):
┌──────────────────────────────┐ ┌──────────────────────────────┐
│ │ │ │
│ a + b * c │ │ a + b * c │
│ │ │ │
│ Bytecode: │ │ x86-64: │
│ PUSH a ; [a] │ │ mov rax, [a] │
│ PUSH b ; [a, b] │ │ mov rcx, [b] │
│ PUSH c ; [a, b, c] │ │ mov rdx, [c] │
│ MUL ; [a, b*c] │ │ imul rcx, rdx ; rcx=b*c │
│ ADD ; [a+b*c] │ │ add rax, rcx ; result │
│ │ │ │
│ Operands implicit (stack) │ │ Operands explicit (regs) │
│ Simple bytecode format │ │ Complex instruction encoding│
│ Easy to compile from source │ │ Hard to compile from source │
│ Slow to interpret │ │ Fast to execute │
│ │ │ │
└──────────────────────────────┘ └──────────────────────────────┘
JIT CHALLENGE: Convert stack operations to register operations
2.5 Executable Memory: The mmap Magic
The key to JIT compilation is creating memory that the CPU can execute. This requires special permissions:
MEMORY PERMISSIONS
Normal heap allocation (malloc):
┌─────────────────────────────────────────────────────┐
│ void* p = malloc(4096); │
│ Permissions: READ + WRITE │
│ Cannot execute code here - CPU will fault! │
└─────────────────────────────────────────────────────┘
Executable allocation (mmap):
┌─────────────────────────────────────────────────────┐
│ void* code = mmap(NULL, 4096, │
│ PROT_READ | PROT_WRITE | PROT_EXEC│
│ MAP_PRIVATE | MAP_ANONYMOUS, │
│ -1, 0); │
│ │
│ Permissions: READ + WRITE + EXECUTE │
│ CAN execute code here - CPU will run it! │
└─────────────────────────────────────────────────────┘
SECURITY NOTE:
- W^X (Write XOR Execute) is a security principle
- Many systems restrict RWX memory
- Production JITs: write code, then mprotect() to RX
2.6 Hot Path Detection
Not every piece of code deserves JIT compilation. We need to identify hot paths:
HOT PATH DETECTION STRATEGIES
Strategy 1: Invocation Counting (Method-based JIT)
┌─────────────────────────────────────────────────────┐
│ function_call_count[func_id]++; │
│ if (function_call_count[func_id] > THRESHOLD) { │
│ compile_function(func_id); │
│ } │
│ │
│ Used by: JVM (HotSpot), V8 (initial tier) │
└─────────────────────────────────────────────────────┘
Strategy 2: Loop Detection (Trace-based JIT)
┌─────────────────────────────────────────────────────┐
│ // Backward branch = potential loop │
│ if (branch_target < current_pc) { │
│ backward_branch_count[current_pc]++; │
│ if (backward_branch_count[current_pc] > THRESH) │
│ compile_trace_from(branch_target); │
│ } │
│ │
│ Used by: LuaJIT, TraceMonkey (old Firefox) │
└─────────────────────────────────────────────────────┘
OUR PROJECT: Uses Strategy 2 (Loop Detection)
- Simple to implement
- Highly effective (loops are where programs spend time)
- Threshold: 1000 backward branches before compilation
2.7 The JIT Compilation Pipeline
JIT COMPILATION PIPELINE
┌─────────────────────────────────────────────────────────────────┐
│ JIT COMPILER │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. HOT PATH DETECTION │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Interpreter tracks execution counts │ │
│ │ Backward branch counter > 1000? → Trigger JIT │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 2. TRACE RECORDING │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Record bytecode sequence from loop start │ │
│ │ Stop at: loop end, function call, or threshold │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 3. CODE GENERATION │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ For each bytecode instruction: │ │
│ │ - Map stack operations to registers │ │
│ │ - Emit corresponding x86-64 bytes │ │
│ │ - Handle control flow (branches, loops) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 4. CODE INSTALLATION │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ mmap() executable memory │ │
│ │ Copy generated code │ │
│ │ Update dispatch table: bytecode_addr → native_addr │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 5. EXECUTION │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Next time interpreter hits this address: │ │
│ │ → Jump to native code instead of interpreting │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
2.8 Why This Matters for Virtualization
JIT compilation is the heart of software-based virtualization:
| System | JIT Technique | Purpose |
|---|---|---|
| QEMU TCG | Binary translation | Run ARM/MIPS/RISC-V code on x86 |
| Rosetta 2 | Binary translation | Run x86 apps on Apple Silicon |
| JVM HotSpot | Method JIT | Run Java bytecode at native speed |
| V8/SpiderMonkey | Tiered JIT | Run JavaScript at native speed |
| LuaJIT | Trace JIT | Run Lua at C-like speed |
| WebAssembly | AOT + JIT | Run web apps at near-native speed |
| eBPF | Verification + JIT | Run safe code in Linux kernel |
Understanding JIT compilation means understanding how ALL of these systems work internally.
2.9 Common Misconceptions
Misconception 1: “JIT is always faster than interpretation”
- Reality: JIT has compilation overhead. For code that runs once, interpretation is faster. JIT only wins for frequently-executed code.
Misconception 2: “JIT generates optimal code”
- Reality: JIT must balance compilation time vs. code quality. It generates “good enough” code quickly, not optimal code slowly.
Misconception 3: “Writing a JIT is extremely complex”
- Reality: A basic JIT (like this project) is ~1000 lines. Production JITs are complex because of optimizations, not the core concept.
Misconception 4: “You need an IR (Intermediate Representation)”
- Reality: QEMU’s TCG uses an IR, but simple JITs can translate directly. We’ll translate bytecode directly to x86-64.
Misconception 5: “JIT code must be position-independent”
- Reality: We know exactly where our code will be. Absolute addresses are fine (and simpler).
3. Project Specification
3.1 What You Will Build
A complete JIT compiler system that:
- Interprets a stack-based bytecode VM with 15 instructions
- Detects hot loops using backward branch counting
- JIT compiles hot paths to native x86-64 machine code
- Manages a code cache with memory limits
- Seamlessly switches between interpreted and JIT-compiled execution
- Achieves 10-50x speedup on computational loops
3.2 Bytecode Instruction Set
Your VM will support these 15 instructions:
| Opcode | Instruction | Stack Effect | Description |
|---|---|---|---|
| 0x01 | PUSH n | [] -> [n] | Push immediate value onto stack |
| 0x02 | POP | [a] -> [] | Discard top of stack |
| 0x03 | DUP | [a] -> [a, a] | Duplicate top of stack |
| 0x10 | ADD | [a, b] -> [a+b] | Add top two values |
| 0x11 | SUB | [a, b] -> [a-b] | Subtract (second - top) |
| 0x12 | MUL | [a, b] -> [a*b] | Multiply top two values |
| 0x13 | DIV | [a, b] -> [a/b] | Integer divide (second / top) |
| 0x20 | LOAD addr | [] -> [mem[addr]] | Load from memory |
| 0x21 | STORE addr | [a] -> [] | Store to memory |
| 0x30 | JMP offset | [] -> [] | Unconditional jump |
| 0x31 | JZ offset | [a] -> [] | Jump if zero |
| 0x32 | JNZ offset | [a] -> [] | Jump if not zero |
| 0x40 | CALL addr | [] -> [] | Call subroutine |
| 0x41 | RET | [] -> [] | Return from subroutine |
| 0xFF | HALT | [] -> [] | Stop execution |
3.3 Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR1 | Interpreter correctly executes all 15 bytecode instructions | Must Have |
| FR2 | Hot path detection triggers after 1000 backward branches | Must Have |
| FR3 | JIT compiles arithmetic operations (ADD, SUB, MUL, DIV) | Must Have |
| FR4 | JIT compiles control flow (JMP, JZ, JNZ) | Must Have |
| FR5 | JIT compiles memory operations (LOAD, STORE, PUSH, POP) | Must Have |
| FR6 | Code cache limits memory usage to configurable size | Should Have |
| FR7 | Deoptimization falls back to interpreter gracefully | Should Have |
| FR8 | Performance logging shows compilation events | Should Have |
| FR9 | CALL/RET handled (interpret or compile) | Could Have |
| FR10 | Multiple hot regions can be compiled independently | Could Have |
3.4 Non-Functional Requirements
| Requirement | Target | Measurement |
|---|---|---|
| Interpreter speed | > 10M bytecode ops/sec | Benchmark tight loop |
| JIT speedup | > 10x over interpreter | Compare Mandelbrot times |
| Compilation time | < 1ms for typical loop | Measure with clock_gettime |
| Code cache size | Configurable, default 64KB | Track allocations |
| Memory overhead | < 100 bytes per cached region | Profile memory usage |
3.5 Example Usage / Output
$ ./jitvm mandelbrot.bc
[JIT] Detected hot loop at bytecode offset 0x42
[JIT] Compiling 23 bytecode ops to 156 bytes of x86-64
[JIT] Code cached at 0x7f4a12340000
++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++ +++++++++++++++++++
++++++++++++++++++++ +++++++++++++++++
+++++++++++++++++++ ++++++++++++++++
++++++++++++++++++ +++++++++++++++
++++++++++++++ ++ +++++++++++++
+++++++++++ +++++++++++
+++++++++++ +++++++++++
++++++++++++++ ++ +++++++++++++
++++++++++++++++++ +++++++++++++++
+++++++++++++++++++ ++++++++++++++++
++++++++++++++++++++ +++++++++++++++++
++++++++++++++++++++ +++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++
Execution time: 0.089s (JIT enabled)
Execution time: 4.231s (interpreter only)
Speedup: 47.5x
3.6 Real World Outcome
When you complete this project:
- You will have built a working JIT compiler - actual native code generation
- You will understand QEMU’s TCG at a fundamental level
- You will know how V8, JVM HotSpot, and LuaJIT work internally
- You will have resume-worthy evidence of advanced systems programming skills
- You will be able to explain JIT compilation in technical interviews
- You will have built the foundation for hardware-assisted virtualization projects
4. Solution Architecture
4.1 High-Level Design
┌──────────────────────────────────────────────────────────────────────────┐
│ JIT VM ARCHITECTURE │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ BYTECODE LOADER │ │
│ │ Load .bc file → Parse header → Validate → Store in code[] │ │
│ └───────────────────────────────────┬─────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐│
│ │ EXECUTION ENGINE ││
│ │ ││
│ │ ┌─────────────────────────────────────────────────────────────────┐ ││
│ │ │ DISPATCH LOOP │ ││
│ │ │ │ ││
│ │ │ while (running) { │ ││
│ │ │ if (jit_code[pc] != NULL) { │ ││
│ │ │ // Execute JIT-compiled native code │ ││
│ │ │ jit_code[pc](&vm_state); │ ││
│ │ │ } else { │ ││
│ │ │ // Interpret bytecode │ ││
│ │ │ interpret_one(&vm_state); │ ││
│ │ │ check_hot_path(pc); │ ││
│ │ │ } │ ││
│ │ │ } │ ││
│ │ │ │ ││
│ │ └─────────────────────────────────────────────────────────────────┘ ││
│ │ │ │ ││
│ │ ┌────────────┘ └────────────┐ ││
│ │ ▼ ▼ ││
│ │ ┌──────────────────────────┐ ┌──────────────────────────┐ ││
│ │ │ INTERPRETER │ │ JIT COMPILER │ ││
│ │ │ │ │ │ ││
│ │ │ switch (opcode) { │ │ 1. Trace recording │ ││
│ │ │ case PUSH: ... │ │ 2. Bytecode → x86-64 │ ││
│ │ │ case ADD: ... │ │ 3. Emit to code cache │ ││
│ │ │ case JNZ: ... │ │ 4. Update dispatch │ ││
│ │ │ } │ │ │ ││
│ │ │ │ │ │ ││
│ │ └──────────────────────────┘ └──────────────────────────┘ ││
│ │ │ ││
│ │ ▼ ││
│ │ ┌──────────────────────────┐ ││
│ │ │ CODE CACHE │ ││
│ │ │ │ ││
│ │ │ mmap() RWX memory │ ││
│ │ │ [region1][region2]... │ ││
│ │ │ Max 64KB, LRU eviction │ ││
│ │ │ │ ││
│ │ └──────────────────────────┘ ││
│ │ ││
│ └───────────────────────────────────────────────────────────────────────┘│
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐│
│ │ VM STATE ││
│ │ ││
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ││
│ │ │ Stack │ │ Memory │ │ Registers │ ││
│ │ │ [0..255] │ │ [0..4095] │ │ PC, SP, FP │ ││
│ │ └──────────────┘ └──────────────┘ └──────────────┘ ││
│ │ ││
│ └───────────────────────────────────────────────────────────────────────┘│
│ │
└───────────────────────────────────────────────────────────────────────────┘
4.2 Key Components
| Component | Responsibility | Performance Requirement |
|---|---|---|
| Bytecode Loader | Parse and validate bytecode file | O(n) for n bytes |
| Interpreter | Execute bytecode instructions | > 10M ops/sec |
| Hot Path Detector | Track backward branch counts | O(1) per branch |
| JIT Compiler | Translate bytecode to x86-64 | < 1ms per region |
| Code Emitter | Write x86-64 machine code bytes | O(n) for n instructions |
| Code Cache | Store and manage compiled code | O(1) lookup |
| Dispatch Table | Map bytecode address to native | O(1) lookup |
4.3 Data Structures
// VM State - passed to JIT-compiled code
typedef struct {
int64_t stack[256]; // Operand stack
int stack_top; // Stack pointer
int64_t memory[4096]; // VM memory (variables)
uint8_t* code; // Bytecode array
int pc; // Program counter
int running; // Execution flag
} VMState;
// Hot path tracking
typedef struct {
int execution_count; // Number of times executed
int bytecode_start; // Starting PC of region
int bytecode_end; // Ending PC of region
void* native_code; // Pointer to compiled code
size_t code_size; // Size of compiled code
} CompiledRegion;
// Code cache management
typedef struct {
void* memory; // mmap'd executable memory
size_t capacity; // Total size
size_t used; // Bytes used
CompiledRegion regions[256];
int num_regions;
} CodeCache;
// Dispatch table
typedef void (*JitFunc)(VMState*);
JitFunc dispatch[65536]; // Index by bytecode PC
4.4 x86-64 Code Generation
The JIT compiler translates each bytecode instruction to x86-64. Here’s the mapping:
BYTECODE TO x86-64 TRANSLATION
PUSH immediate:
Bytecode: [0x01] [value:8]
x86-64: mov rax, <value> ; 48 B8 <8 bytes>
push rax ; 50
POP:
Bytecode: [0x02]
x86-64: add rsp, 8 ; 48 83 C4 08
ADD:
Bytecode: [0x10]
x86-64: pop rcx ; 59
pop rax ; 58
add rax, rcx ; 48 01 C8
push rax ; 50
SUB:
Bytecode: [0x11]
x86-64: pop rcx ; 59
pop rax ; 58
sub rax, rcx ; 48 29 C8
push rax ; 50
MUL:
Bytecode: [0x12]
x86-64: pop rcx ; 59
pop rax ; 58
imul rax, rcx ; 48 0F AF C1
push rax ; 50
LOAD addr:
Bytecode: [0x20] [addr:2]
x86-64: mov rax, [rdi + offset] ; 48 8B 87 <offset>
push rax ; 50
(where rdi points to vm->memory, offset = addr * 8)
STORE addr:
Bytecode: [0x21] [addr:2]
x86-64: pop rax ; 58
mov [rdi + offset], rax ; 48 89 87 <offset>
JNZ offset (backward branch - loop):
Bytecode: [0x32] [offset:2]
x86-64: pop rax ; 58
test rax, rax ; 48 85 C0
jnz <relative_addr> ; 0F 85 <4 bytes>
4.5 Algorithm Overview
ALGORITHM: JIT Compilation Pipeline
function interpret_with_jit(vm):
while vm.running:
pc = vm.pc
// Check if JIT code exists for this address
if dispatch[pc] != NULL:
dispatch[pc](&vm)
continue
// Interpret one instruction
opcode = vm.code[pc]
switch opcode:
case PUSH: ...
case ADD: ...
case JNZ:
// HOT PATH DETECTION
target = vm.pc + offset
if target < pc: // Backward branch (loop)
branch_count[pc]++
if branch_count[pc] > JIT_THRESHOLD:
compile_region(target, pc)
function compile_region(start_pc, end_pc):
// 1. Allocate code buffer
buffer = alloc_executable(4096)
emit_ptr = buffer
// 2. Emit prologue
emit_prologue(&emit_ptr)
// 3. Translate each bytecode instruction
for pc = start_pc to end_pc:
opcode = bytecode[pc]
switch opcode:
case PUSH:
emit_push_immediate(&emit_ptr, get_immediate(pc))
case ADD:
emit_add(&emit_ptr)
case JNZ:
emit_conditional_jump(&emit_ptr, target)
...
// 4. Emit epilogue
emit_epilogue(&emit_ptr)
// 5. Install in dispatch table
dispatch[start_pc] = buffer
// 6. Log compilation
log("[JIT] Compiled %d bytes at %p", emit_ptr - buffer, buffer)
5. Implementation Guide
5.1 Development Environment Setup
# Install required tools
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install gcc gdb nasm bsdmainutils # bsdmainutils provides hexdump
# macOS
brew install gcc nasm
# Verify installation
gcc --version # Need GCC 7+ for C11
nasm --version # For understanding x86 encoding
gdb --version # For debugging
# Create project structure
mkdir -p jitvm/{src,include,tests,bytecode}
cd jitvm
5.2 Project Structure
jitvm/
├── Makefile
├── src/
│ ├── main.c # Entry point, argument parsing
│ ├── vm.c # VM state management
│ ├── interpreter.c # Bytecode interpreter
│ ├── jit.c # JIT compiler core
│ ├── codegen.c # x86-64 code generation
│ ├── codecache.c # Code cache management
│ └── loader.c # Bytecode loader
├── include/
│ ├── vm.h # VM structures
│ ├── opcodes.h # Bytecode definitions
│ ├── jit.h # JIT interfaces
│ └── codegen.h # Code generation helpers
├── tests/
│ ├── test_interpreter.c # Unit tests
│ └── test_jit.c # JIT tests
├── bytecode/
│ ├── simple.bc # Simple test program
│ ├── fibonacci.bc # Fibonacci benchmark
│ └── mandelbrot.bc # Mandelbrot set renderer
└── tools/
└── assembler.py # Simple bytecode assembler
5.3 The Core Question You’re Answering
“How do you generate executable machine code at runtime, and when is it worth the compilation overhead?”
This question captures the essence of JIT compilation:
- Runtime code generation: Using mmap with PROT_EXEC to create executable memory, then writing valid x86-64 instruction bytes
- When it’s worth it: Only for hot paths - code that executes frequently enough that compilation cost is amortized
This is the same question answered by every JIT system from the JVM to QEMU’s TCG.
5.4 Concepts You Must Understand First
Before starting implementation, verify your understanding:
| Concept | Self-Assessment Question | Where to Learn |
|---|---|---|
| Stack machines | How does PUSH 5; PUSH 3; ADD leave the stack? | “Language Implementation Patterns” Ch. 5 |
| x86-64 registers | What’s the difference between RAX, EAX, AX, AL? | Intel SDM Vol. 1, Ch. 3 |
| x86-64 encoding | What bytes encode mov rax, 0x1234567890ABCDEF? | Intel SDM Vol. 2 |
| Memory mapping | What does mmap with PROT_EXEC do? | “TLPI” Ch. 49 |
| Calling conventions | How are arguments passed in x86-64 Linux/macOS? | System V ABI |
| Branch prediction | Why are backward branches usually predicted taken? | CS:APP Ch. 5 |
5.5 Questions to Guide Your Design
Interpreter Design:
- How will you represent the bytecode instruction set?
- What data structure will hold the VM stack?
- How will you handle control flow (jumps, branches)?
Hot Path Detection:
- What threshold triggers compilation?
- How will you identify loop boundaries?
- What data structure tracks execution counts?
Code Generation:
- How will you map VM stack operations to x86-64 registers?
- How will you handle memory operands (LOAD/STORE)?
- How will you encode conditional branches?
Code Cache:
- How will you allocate executable memory?
- What’s your eviction policy when cache is full?
- How will you invalidate stale code (if needed)?
Integration:
- How does the interpreter transfer control to JIT code?
- How does JIT code return control to the interpreter?
- How do you handle operations JIT doesn’t support?
5.6 Thinking Exercise
Before writing any code, trace through this bytecode execution:
Program: Sum 1 to 100
Address Bytecode Stack (after) Memory
0x00 PUSH 0 [0]
0x02 STORE 0 [] M[0]=0 (sum)
0x05 PUSH 1 [1]
0x07 STORE 1 [] M[1]=1 (counter)
0x0A LOAD 0 [0] <- LOOP START
0x0D LOAD 1 [0, 1]
0x10 ADD [1]
0x11 STORE 0 [] M[0]=1
0x14 LOAD 1 [1]
0x17 PUSH 1 [1, 1]
0x19 ADD [2]
0x1A DUP [2, 2]
0x1B STORE 1 [2] M[1]=2
0x1E PUSH 100 [2, 100]
0x21 SUB [-98]
0x22 JNZ -0x18 [] Jump to 0x0A if != 0
0x25 LOAD 0 [5050]
0x28 HALT
Questions to answer:
- How many times does the loop execute?
- Which bytecode addresses constitute the “hot region”?
- What x86-64 instructions would you emit for
LOAD 0; LOAD 1; ADD; STORE 0? - How would you encode
JNZ -0x18in x86-64?
5.7 Hints in Layers
Use these progressive hints only when stuck.
Hint 1: Starting with the Interpreter
Start with a working interpreter before adding JIT:
void interpret(VMState* vm) {
while (vm->running) {
uint8_t opcode = vm->code[vm->pc++];
switch (opcode) {
case OP_PUSH: {
int64_t value = read_i64(vm); // Read 8-byte immediate
vm->stack[vm->stack_top++] = value;
break;
}
case OP_ADD: {
int64_t b = vm->stack[--vm->stack_top];
int64_t a = vm->stack[--vm->stack_top];
vm->stack[vm->stack_top++] = a + b;
break;
}
case OP_JNZ: {
int16_t offset = read_i16(vm); // Read 2-byte offset
int64_t cond = vm->stack[--vm->stack_top];
if (cond != 0) {
vm->pc += offset - 3; // -3 for instruction size
}
break;
}
case OP_HALT:
vm->running = 0;
break;
// ... other cases
}
}
}
Hint 2: Hot Path Detection
Track backward branches to find loops:
#define JIT_THRESHOLD 1000
int branch_counts[65536] = {0}; // Per-PC branch counter
void check_hot_path(VMState* vm, int target_pc) {
int current_pc = vm->pc;
// Only track backward branches (loops)
if (target_pc < current_pc) {
int loop_header = target_pc;
branch_counts[loop_header]++;
if (branch_counts[loop_header] >= JIT_THRESHOLD) {
printf("[JIT] Hot loop detected at 0x%04X\n", loop_header);
compile_region(vm, loop_header, current_pc);
branch_counts[loop_header] = 0; // Reset counter
}
}
}
Hint 3: Allocating Executable Memory
Use mmap to create RWX memory:
#include <sys/mman.h>
void* alloc_executable(size_t size) {
void* mem = mmap(
NULL, // Let kernel choose address
size, // Size in bytes
PROT_READ | PROT_WRITE | PROT_EXEC, // RWX permissions
MAP_PRIVATE | MAP_ANONYMOUS, // Private, not file-backed
-1, // No file descriptor
0 // Offset
);
if (mem == MAP_FAILED) {
perror("mmap");
return NULL;
}
return mem;
}
void free_executable(void* mem, size_t size) {
munmap(mem, size);
}
Security Note: On some systems (macOS with SIP, SELinux), you may not be able to map RWX memory directly. Use the W^X-friendly two-step approach:
- Write the code first with PROT_READ | PROT_WRITE
- Then mprotect() the region to PROT_READ | PROT_EXEC
// Safer two-step approach
void* mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// ... write code to mem ...
mprotect(mem, size, PROT_READ | PROT_EXEC);
Hint 4: Code Generation Basics
Emit x86-64 bytes directly to a buffer:
typedef struct {
uint8_t* buffer;
uint8_t* ptr;
size_t size;
} CodeBuffer;
// Emit single byte
void emit8(CodeBuffer* cb, uint8_t b) {
*cb->ptr++ = b;
}
// Emit 32-bit value (little-endian)
void emit32(CodeBuffer* cb, uint32_t v) {
*cb->ptr++ = v & 0xFF;
*cb->ptr++ = (v >> 8) & 0xFF;
*cb->ptr++ = (v >> 16) & 0xFF;
*cb->ptr++ = (v >> 24) & 0xFF;
}
// Emit 64-bit value (little-endian)
void emit64(CodeBuffer* cb, uint64_t v) {
emit32(cb, v & 0xFFFFFFFF);
emit32(cb, v >> 32);
}
// Example: mov rax, immediate64
void emit_mov_rax_imm64(CodeBuffer* cb, int64_t value) {
emit8(cb, 0x48); // REX.W prefix
emit8(cb, 0xB8); // mov rax, imm64
emit64(cb, value);
}
// Example: push rax
void emit_push_rax(CodeBuffer* cb) {
emit8(cb, 0x50);
}
// Example: pop rax
void emit_pop_rax(CodeBuffer* cb) {
emit8(cb, 0x58);
}
// Example: pop rcx
void emit_pop_rcx(CodeBuffer* cb) {
emit8(cb, 0x59);
}
// Example: add rax, rcx
void emit_add_rax_rcx(CodeBuffer* cb) {
emit8(cb, 0x48); // REX.W
emit8(cb, 0x01); // add r/m64, r64
emit8(cb, 0xC8); // ModRM: rcx -> rax
}
Hint 5: Compiling a Simple Operation
Translate bytecode ADD to x86-64:
// Bytecode: ADD
// Stack: [a, b] -> [a + b]
// x86-64: pop rcx; pop rax; add rax, rcx; push rax
void compile_add(CodeBuffer* cb) {
// pop rcx (second operand)
emit8(cb, 0x59);
// pop rax (first operand)
emit8(cb, 0x58);
// add rax, rcx
emit8(cb, 0x48); // REX.W
emit8(cb, 0x01); // ADD r/m64, r64
emit8(cb, 0xC8); // ModRM: rcx -> rax
// push rax (result)
emit8(cb, 0x50);
}
For PUSH immediate:
// Bytecode: PUSH value
// Stack: [] -> [value]
// x86-64: mov rax, value; push rax
void compile_push(CodeBuffer* cb, int64_t value) {
// mov rax, imm64
emit8(cb, 0x48); // REX.W
emit8(cb, 0xB8); // mov rax, imm64
emit64(cb, value);
// push rax
emit8(cb, 0x50);
}
Hint 6: Handling LOAD/STORE
The VM’s memory array is passed via register (rdi in System V ABI):
// Before calling JIT code:
// rdi = &vm->memory[0] (base of VM memory)
// Bytecode: LOAD addr
// Load from VM memory, push to hardware stack
void compile_load(CodeBuffer* cb, int addr) {
int offset = addr * 8; // int64_t array
// mov rax, [rdi + offset]
emit8(cb, 0x48); // REX.W
emit8(cb, 0x8B); // mov r64, r/m64
emit8(cb, 0x87); // ModRM: [rdi + disp32]
emit32(cb, offset);
// push rax
emit8(cb, 0x50);
}
// Bytecode: STORE addr
// Pop from hardware stack, store to VM memory
void compile_store(CodeBuffer* cb, int addr) {
int offset = addr * 8;
// pop rax
emit8(cb, 0x58);
// mov [rdi + offset], rax
emit8(cb, 0x48); // REX.W
emit8(cb, 0x89); // mov r/m64, r64
emit8(cb, 0x87); // ModRM: [rdi + disp32]
emit32(cb, offset);
}
Hint 7: Compiling Conditional Branches
JNZ (Jump if Not Zero) requires careful handling:
// Bytecode: JNZ offset
// Pop value, jump if != 0
// For loops: offset is negative (backward jump)
void compile_jnz(CodeBuffer* cb, int target_offset, uint8_t* loop_start) {
// pop rax (condition)
emit8(cb, 0x58);
// test rax, rax
emit8(cb, 0x48); // REX.W
emit8(cb, 0x85); // test r/m64, r64
emit8(cb, 0xC0); // ModRM: rax, rax
// jnz rel32
emit8(cb, 0x0F);
emit8(cb, 0x85);
// Calculate relative offset to loop start
// JNZ is 6 bytes total, so calculate from end of instruction
uint8_t* jnz_end = cb->ptr + 4; // After we emit the 4-byte offset
int32_t rel = (int32_t)(loop_start - jnz_end);
emit32(cb, rel);
}
Key insight: x86-64 relative jumps are calculated from the END of the instruction, not the beginning.
Hint 8: Complete Compilation Example
Compiling a simple loop:
void compile_region(VMState* vm, int start_pc, int end_pc) {
CodeBuffer cb;
cb.buffer = alloc_executable(4096);
cb.ptr = cb.buffer;
// Remember start for loop jumps
uint8_t* loop_start = cb.ptr;
// Emit prologue: save callee-saved registers, get vm_memory pointer
// Assume: rdi = &vm->memory[0] (passed by caller)
// Translate bytecode
int pc = start_pc;
while (pc <= end_pc) {
uint8_t opcode = vm->code[pc++];
switch (opcode) {
case OP_PUSH: {
int64_t value = read_i64_at(vm, pc);
pc += 8;
compile_push(&cb, value);
break;
}
case OP_ADD:
compile_add(&cb);
break;
case OP_LOAD: {
int addr = read_i16_at(vm, pc);
pc += 2;
compile_load(&cb, addr);
break;
}
case OP_STORE: {
int addr = read_i16_at(vm, pc);
pc += 2;
compile_store(&cb, addr);
break;
}
case OP_JNZ: {
compile_jnz(&cb, 0, loop_start); // simplified: assume the only branch targets loop_start
pc += 2; // skip the i16 offset operand
break;
}
// ... other cases
}
}
// Emit epilogue: ret
emit8(&cb, 0xC3); // ret
// Calculate size
size_t code_size = cb.ptr - cb.buffer;
// Install in dispatch table
typedef void (*JitFunc)(int64_t* memory);
dispatch[start_pc] = (JitFunc)cb.buffer;
printf("[JIT] Compiled %zu bytes at %p\n", code_size, cb.buffer);
}
Hint 9: Calling JIT Code from Interpreter
The dispatch loop calls JIT code:
typedef void (*JitFunc)(int64_t* memory);
JitFunc dispatch[65536] = {NULL};
void run_with_jit(VMState* vm) {
while (vm->running) {
int pc = vm->pc;
// Check for JIT code
if (dispatch[pc] != NULL) {
// Call JIT code
// Pass pointer to VM memory
dispatch[pc](vm->memory);
// JIT code handles the loop internally
// and returns when loop exits
// We need to sync VM state...
// (This is simplified - real impl needs more care)
} else {
// Interpret as usual
interpret_one(vm);
}
}
}
Note: A complete implementation needs to handle:
- Syncing VM stack with hardware stack
- Updating VM PC after JIT returns
- Handling exits from JIT (loop exits, exceptions)
Hint 10: Full x86-64 Instruction Reference
Reference for all instructions you need:
INSTRUCTION ENCODING NOTES
─────────────────────────────────────────────────────────────────
mov rax, imm64 48 B8 [8 bytes] Load 64-bit immediate
mov rax, [rdi+off] 48 8B 87 [4 bytes] Load from memory
mov [rdi+off], rax 48 89 87 [4 bytes] Store to memory
push rax 50 Push to stack
pop rax 58 Pop from stack
pop rcx 59 Pop to rcx
pop rdx 5A Pop to rdx
add rax, rcx 48 01 C8 rax = rax + rcx
sub rax, rcx 48 29 C8 rax = rax - rcx
imul rax, rcx 48 0F AF C1 rax = rax * rcx
cqo 48 99 Sign-extend rax to rdx:rax
idiv rcx 48 F7 F9 rdx:rax / rcx, result in rax
test rax, rax 48 85 C0 Set flags based on rax
cmp rax, rcx 48 39 C8 Set flags: rax - rcx
jmp rel32 E9 [4 bytes] Unconditional jump
jnz rel32 0F 85 [4 bytes] Jump if not zero (ZF=0)
jz rel32 0F 84 [4 bytes] Jump if zero (ZF=1)
ret C3 Return from function
Relative addressing: All jump offsets are calculated from the END of the instruction.
5.8 The Interview Questions They’ll Ask
After completing this project, you should be able to answer:
- “What is JIT compilation and when is it better than interpretation?”
- Expected: JIT compiles code at runtime, better for hot paths that execute frequently
- Bonus: Discuss compilation overhead vs. execution speed tradeoff
- “How do you generate executable code at runtime?”
- Expected: mmap with PROT_EXEC, write machine code bytes, call via function pointer
- Bonus: Discuss W^X security, mprotect separation
- “How do you detect hot paths?”
- Expected: Backward branch counting (loops), method invocation counting
- Bonus: Discuss tracing JITs vs. method JITs
- “How does QEMU’s TCG relate to JIT compilation?”
- Expected: TCG translates guest ISA to host ISA at runtime using JIT techniques
- Bonus: Discuss IR-based translation, code caching
- “What’s the difference between a stack machine and a register machine?”
- Expected: Stack machines use implicit stack, register machines use explicit registers
- Bonus: Discuss why bytecode is often stack-based (simpler) but native code is register-based (faster)
- “How do you handle branches in JIT-compiled code?”
- Expected: Calculate relative offsets from instruction end, emit conditional jumps
- Bonus: Discuss branch prediction, trace compilation
- “What are the challenges of JIT code cache management?”
- Expected: Memory limits, invalidation on code changes, fragmentation
- Bonus: Discuss GC integration, OSR (on-stack replacement)
- “How would you debug JIT-generated code?”
- Expected: Dump generated bytes, use gdb to disassemble, step through execution
- Bonus: Discuss JIT map files, perf integration
5.9 Books That Will Help
| Topic | Book | Specific Chapter | Why It Helps |
|---|---|---|---|
| Stack-based VMs | “Language Implementation Patterns” by Terence Parr | Chapter 10-11 | Core VM design patterns |
| JIT techniques | “Engineering a Compiler” by Cooper & Torczon | Chapter 7, 13 | Code generation and optimization |
| x86-64 encoding | “Modern X86 Assembly” by Daniel Kusswurm | Chapters 1-5 | Instruction encoding reference |
| Memory mapping | “The Linux Programming Interface” by Kerrisk | Chapter 49 | mmap, mprotect details |
| QEMU internals | QEMU documentation | TCG documentation | How production binary translation works |
| LuaJIT | LuaJIT source code | lj_jit.c, lj_asm.c | Real trace JIT implementation |
5.10 Implementation Phases
Phase 1: Bytecode Interpreter (Days 1-3)
Goals:
- Implement all 14 bytecode instructions
- Create bytecode loader
- Test with simple programs
Tasks:
- Define opcodes and bytecode format
- Implement VMState structure
- Write interpreter dispatch loop
- Implement each instruction
- Create test bytecode by hand
- Verify with simple programs (add two numbers, loop to 10)
Checkpoint: Can run sum_to_100.bc and get correct result.
Phase 2: Hot Path Detection (Days 4-5)
Goals:
- Track backward branches
- Identify when threshold is crossed
- Log hot path detection events
Tasks:
- Add branch count array
- Modify JNZ handler to track backward branches
- Add threshold check
- Log when hot path is detected
- Calculate loop region (start_pc, end_pc)
Checkpoint: “Hot loop detected at 0xXX” message appears after 1000 iterations.
Phase 3: Basic Code Generation (Days 6-10)
Goals:
- Allocate executable memory
- Generate code for PUSH, ADD, SUB, MUL
- Execute simplest JIT code
Tasks:
- Implement mmap wrapper for executable memory
- Create CodeBuffer structure
- Write emit_* functions for basic instructions
- Compile simplest case: PUSH; PUSH; ADD
- Call JIT code from interpreter
- Verify result is correct
Checkpoint: JIT-compiled 5 + 3 returns 8.
Phase 4: Full Loop Compilation (Days 11-15)
Goals:
- Compile LOAD, STORE
- Compile JNZ with backward branch
- Handle VM memory access from JIT code
Tasks:
- Pass VM memory pointer to JIT code (via rdi)
- Implement compile_load, compile_store
- Implement compile_jnz with loop back
- Track code generation position for jump targets
- Test complete loop compilation
Checkpoint: Sum 1-to-100 loop runs entirely in JIT code.
Phase 5: Integration & Benchmarking (Days 16-21)
Goals:
- Seamless interpreter/JIT switching
- Code cache management
- Performance measurement
Tasks:
- Implement dispatch table
- Add interpreter fallback for uncompiled code
- Implement code cache with size limit
- Create Mandelbrot benchmark
- Measure and compare interpreter vs. JIT
- Tune JIT threshold
Checkpoint: 10x+ speedup on Mandelbrot, working demo.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Stack representation | VM stack vs. x86 stack | x86 stack | Simpler code generation, hardware optimized |
| Memory passing | Global pointer vs. parameter | Parameter (rdi) | Clean interface, follows ABI |
| Jump handling | Relative vs. absolute | Relative | Standard x86-64 encoding |
| Code cache | Fixed vs. growing | Fixed 64KB | Simpler management, sufficient for demo |
| Deoptimization | Full vs. none | Simple fallback | Keep scope manageable |
6. Testing Strategy
6.1 Unit Tests
| Test Category | What to Test |
|---|---|
| Interpreter | Each opcode in isolation |
| Code emission | Each emit_* function produces correct bytes |
| Code execution | Simple JIT sequences return correct values |
| Hot path detection | Threshold triggers at correct count |
| Memory access | LOAD/STORE work correctly from JIT code |
6.2 Integration Tests
Test 1: Simple Addition
Bytecode: PUSH 5; PUSH 3; ADD; HALT
Expected: Stack contains [8]
Test 2: Loop Counter
Bytecode: PUSH 0; STORE 0; PUSH 10; STORE 1;
LOAD 0; PUSH 1; ADD; STORE 0;
LOAD 1; PUSH 1; SUB; DUP; STORE 1; JNZ -20;
LOAD 0; HALT
Expected: Memory[0] = 10 (loop body ran 10 times; the DUP keeps a copy of the counter on the stack for JNZ after STORE pops it)
Test 3: Sum 1 to N
Run sum_to_100 with JIT disabled: result = 5050
Run sum_to_100 with JIT enabled: result = 5050
Verify: Same result, different speed
Test 4: JIT Triggering
Loop that iterates exactly 1001 times
Verify: JIT compilation message appears at iteration 1000
Test 5: Multiple Hot Regions
Two separate loops in one program
Verify: Both get compiled independently
6.3 Benchmark Tests
Benchmark 1: Sum 1 to 1,000,000
Interpreter: ~2 seconds
JIT: ~0.05 seconds
Expected speedup: 40x+
Benchmark 2: Fibonacci(40)
Interpreter: ~5 seconds
JIT: ~0.2 seconds
Expected speedup: 25x+
Benchmark 3: Mandelbrot 80x24
Interpreter: ~4 seconds
JIT: ~0.1 seconds
Expected speedup: 40x+
7. Common Pitfalls & Debugging
| Problem | Symptom | Cause | Solution | Verification |
|---|---|---|---|---|
| Segfault on JIT call | SIGSEGV immediately | Wrong calling convention | Check rdi points to valid memory | gdb: x/10gx $rdi |
| Segfault in JIT code | SIGSEGV during execution | Bad machine code encoding | Disassemble generated code | objdump -D -b binary -m i386:x86-64 dump.bin |
| Wrong computation result | Incorrect output | Stack order or encoding error | Add debug prints, trace execution | Compare interpreter vs. JIT step by step |
| mmap fails | MAP_FAILED return | SELinux, SIP restrictions | Use PROT_READ \| PROT_WRITE, then mprotect | Check dmesg for denials |
| JIT not triggering | Always interprets | Threshold not reached or wrong detection | Add logging to branch handler | Print branch count every 100 iterations |
| Infinite loop in JIT | Hangs | Jump offset calculated wrong | Print jump target address | gdb breakpoint before jump |
| Memory corruption | Random crashes | Buffer overflow in emit | Add bounds checking to CodeBuffer | Valgrind: valgrind ./jitvm |
| Performance not improved | Same speed | JIT code not being called | Add logging to dispatch | Print "entering JIT" message |
7.1 Debugging Techniques
Disassemble generated code:
# Dump JIT buffer to file
# In code: write(fd, cb.buffer, cb.ptr - cb.buffer);
# Then disassemble:
objdump -D -b binary -m i386:x86-64 jit_dump.bin
GDB debugging:
gdb ./jitvm
(gdb) break *0x7ffff7ff0000 # Break at JIT code address
(gdb) run test.bc
(gdb) disassemble $pc, $pc+50 # Show generated code
(gdb) info registers
(gdb) x/10gx $rdi # Examine VM memory
Add verification:
void verify_emit(CodeBuffer* cb, const char* description) {
printf("[EMIT] %s: ", description);
for (uint8_t* p = cb->buffer; p < cb->ptr; p++) {
printf("%02X ", *p);
}
printf("\n");
}
8. Extensions & Challenges
8.1 Beginner Extensions
- More operations: Add AND, OR, XOR, NOT, MOD
- Better diagnostics: Print disassembly of generated code
- Statistics: Track compilation time, cache hit rate
- Multiple loops: Support multiple hot regions in one program
8.2 Intermediate Extensions
- Function calls: Compile CALL/RET instructions
- Tiered compilation: Fast first-tier JIT, optimized second tier
- Register allocation: Map VM stack slots to x86 registers
- Code cache eviction: LRU policy when cache is full
8.3 Advanced Extensions
- On-stack replacement (OSR): Enter JIT mid-execution
- Deoptimization: Fall back to interpreter when invariants break
- Inline caching: Optimize repeated operations
- Vectorization: Use SIMD for parallel operations
- Different architectures: Add ARM64 backend
9. Real-World Connections
9.1 How Production JITs Differ
| Aspect | This Project | Production JIT (V8, HotSpot) |
|---|---|---|
| Optimization level | None | Multiple tiers, heavy optimization |
| IR usage | Direct translation | SSA-based IR with optimizations |
| Register allocation | Stack-based | Graph coloring, linear scan |
| Garbage collection | None | Full GC integration |
| Debugging support | Minimal | Source maps, profiler integration |
| OSR | None | Full on-stack replacement |
| Code size | ~1000 lines | 100,000+ lines |
9.2 Real-World Systems Using This Technique
QEMU TCG:
- Translates guest ISA to host ISA
- Uses intermediate TCG ops
- Supports dozens of guest/host combinations
- Handles privileged instructions, memory protection
V8 TurboFan:
- JavaScript JIT compiler
- Sea of Nodes IR
- Speculative optimization
- Deoptimization on type changes
LuaJIT:
- Trace-based JIT for Lua
- Records executed traces
- Generates highly optimized code
- Competitive with C on some workloads
HotSpot C2:
- JVM server compiler
- Extensive optimizations
- Profile-guided optimization
- Escape analysis, inlining
9.3 Open Source References
| Project | What to Learn |
|---|---|
| QEMU | tcg/ directory - real binary translator |
| LuaJIT | src/lj_asm.c - trace JIT |
| Copy | Tiny JIT in 1000 lines |
| LibJIT | JIT library for multiple backends |
10. Resources
10.1 Essential Reading
- “Language Implementation Patterns” by Terence Parr - Chapters 10-11 on VMs
- “Engineering a Compiler” by Cooper & Torczon - Chapter 7 on code generation
- Intel 64 and IA-32 Architectures Software Developer’s Manual - Volume 2 for encoding
- QEMU Documentation - TCG ops and translation concepts
10.2 Online Tutorials
10.3 Reference Materials
11. Self-Assessment Checklist
Understanding
- I can explain what JIT compilation is and when it’s beneficial
- I understand how mmap with PROT_EXEC creates executable memory
- I can describe how a stack machine differs from a register machine
- I know how to encode basic x86-64 instructions
- I understand how QEMU’s TCG relates to this project
- I can explain hot path detection strategies
Implementation
- My interpreter correctly executes all bytecode instructions
- Hot path detection triggers after the configured threshold
- JIT compilation produces correct machine code
- Generated code executes and produces correct results
- The dispatch table correctly routes to JIT or interpreter
- Benchmarks show significant speedup (10x+)
Growth
- I profiled my code and understand the performance characteristics
- I can debug JIT-generated code using gdb
- I could extend this to support new instructions
- I could explain my implementation in a technical interview
12. Submission / Completion Criteria
Minimum Viable Completion
- Interpreter executes PUSH, POP, ADD, SUB, MUL, LOAD, STORE, JNZ, HALT correctly
- Hot path detection triggers on backward branches
- At least arithmetic operations are JIT compiled
- JIT code executes and produces correct results
- Demo shows measurable speedup
Full Completion
- All 14 bytecode instructions interpreted correctly
- JIT compiles arithmetic, memory, and control flow operations
- Code cache manages memory with configurable size
- Mandelbrot benchmark shows 10x+ speedup
- Clean interpreter/JIT switching
Excellence (Going Above and Beyond)
- CALL/RET compiled in JIT
- Register allocation optimization
- Multiple compilation tiers
- Deoptimization support
- Performance comparable to LuaJIT-like systems
13. Learning Milestones
Track your progress through these milestones:
- mmap+PROT_EXEC works - Can execute generated code
- Test: Generate mov rax, 42; ret and call it
- Verify: Function returns 42
- Simple arithmetic JITs correctly - Basic code generation works
- Test: PUSH 5; PUSH 3; ADD returns 8 from JIT code
- Verify: Same result as interpreter
- Loops JIT correctly - Control flow handled
- Test: Loop counting to 100 runs entirely in JIT
- Verify: Memory[0] contains 100 after execution
- 10x speedup on hot loops - Effective JIT
- Test: Sum 1-to-1M benchmark
- Verify: JIT version is 10x+ faster
- Interpreter/JIT switching seamless - Full integration
- Test: Program with cold startup and hot loop
- Verify: Correct result, smooth transition, good performance
This guide was expanded from HYPERVISOR_VIRTUALIZATION_DEEP_DIVE_PROJECTS.md. For the complete learning path, see the main file which covers 15 projects from basic emulation to full hypervisor development.