Project 3: Basic Block Binary Translator
Build a static binary translator that converts RISC-V basic blocks to x86-64 machine code, creating executables that run natively with 100-300x speedup over interpretation - the core technique behind QEMU’s Tiny Code Generator.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 3-4 weeks |
| Language | C |
| Alternative Languages | C++, Rust |
| Prerequisites | Projects 1-2 (emulation basics), x86-64 assembly knowledge, understanding of instruction encoding |
| Key Topics | Binary translation, basic block identification, register mapping, instruction selection, code emission, control flow graphs |
| Coolness Level | Level 5: Pure Magic (Super Cool) |
| Portfolio Value | Resume Gold |
| Main Book | “Engineering a Compiler” by Cooper & Torczon |
1. Learning Objectives
By completing this project, you will:
- Understand binary translation fundamentals: Explain how guest code can be converted to host code for native execution
- Master basic block identification: Identify instruction sequences with single entry points and single exit points
- Implement register mapping strategies: Map guest registers to host registers efficiently, with spillover to memory
- Build an instruction selector: Choose optimal host instructions for each guest operation
- Generate valid x86-64 machine code: Emit raw bytes that the CPU can execute directly
- Handle control flow translation: Chain translated blocks together and handle branches
- Understand QEMU’s TCG at a fundamental level: This is exactly how QEMU’s Tiny Code Generator works for cross-architecture emulation
2. Theoretical Foundation
2.1 What is Binary Translation?
Binary translation is the process of converting machine code from one instruction set architecture (ISA) to another. Unlike interpretation (which decodes and executes each instruction one at a time), binary translation converts entire sequences of instructions at once, then executes the translated code natively.
INTERPRETATION vs BINARY TRANSLATION
Interpretation (Project 2):
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ RISC-V Instruction Interpreter Loop │
│ ┌─────────────────┐ ┌────────────────────────────────────┐ │
│ │ add x10, x11, x12│ ──► │ 1. Fetch instruction (memory read) │ │
│ └─────────────────┘ │ 2. Decode opcode (bit extraction) │ │
│ │ 3. Execute (switch statement) │ │
│ │ 4. Update PC │ │
│ │ 5. Repeat for EVERY instruction │ │
│ └────────────────────────────────────┘ │
│ │
│ Cost: 50-200 host instructions PER guest instruction │
│ │
└─────────────────────────────────────────────────────────────────────┘
Binary Translation (This Project):
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ RISC-V Basic Block Translation x86-64 Block │
│ ┌─────────────────┐ ┌───────────┐ ┌──────────────┐ │
│ │ add x10, x11, x12│ │ │ │ mov eax, ebx │ │
│ │ add x13, x10, x14│ ──► │ One-time │ ──► │ add eax, ecx │ │
│ │ sw x13, 0(x15) │ │ Translation│ │ add eax, r8d │ │
│ │ bne x10, x0, -12│ │ │ │ mov [r9], eax│ │
│ └─────────────────┘ └───────────┘ │ test eax, eax│ │
│ │ jne target │ │
│ └──────────────┘ │
│ │ │
│ ▼ │
│ Execute natively, full │
│ speed, many times! │
│ │
│ Cost: Translation is expensive, but execution is nearly free │
│ │
└─────────────────────────────────────────────────────────────────────┘
Performance Comparison:
┌────────────────────────────────────────────────────────────────────┐
│ │
│ Method │ Startup │ Per-Instruction │ Overall │
│ ─────────────────────┼─────────┼─────────────────┼────────────────│
│ Interpretation │ Fast │ ~100 cycles │ Slow │
│ Binary Translation │ Slow │ ~1-5 cycles │ Fast │
│ │
│ Break-even point: ~50-100 executions of the same block │
│ Hot loops: 10,000x better with translation! │
│ │
└────────────────────────────────────────────────────────────────────┘
2.2 Basic Blocks: The Unit of Translation
A basic block is a sequence of instructions with:
- One entry point: Execution always enters at the first instruction
- One exit point: Execution always leaves at the last instruction
- No internal branches: Instructions execute sequentially
BASIC BLOCK IDENTIFICATION
Source Code (C): Basic Blocks:
┌────────────────────────────┐ ┌──────────────────────────────────┐
│ int sum = 0; │ │ │
│ for (int i = 0; i < n; i++)│ │ Block 1: Setup │
│ { │ │ ┌──────────────────────────┐ │
│ sum += arr[i]; │ │ │ li x10, 0 # sum = 0 │ │
│ } │ │ │ li x11, 0 # i = 0 │ │
│ return sum; │ │ └──────────────────────────┘ │
└────────────────────────────┘ │ │ │
│ ▼ │
│ Block 2: Loop Header │
│ ┌──────────────────────────┐ │
│ │ bge x11, x12, exit │ │
│ └──────────────────────────┘ │
│ │ │
│ ▼ │
│ Block 3: Loop Body │
│ ┌──────────────────────────┐ │
│ │ slli x13, x11, 2 │ │
│ │ add x13, x13, x14 │ │
│ │ lw x15, 0(x13) │ │
│ │ add x10, x10, x15 │ │
│ │ addi x11, x11, 1 │ │
│ │ j loop_header │ │
│ └──────────────────────────┘ │
│ │
│ Block 4: Exit │
│ ┌──────────────────────────┐ │
│ │ ret │ │
│ └──────────────────────────┘ │
└──────────────────────────────────┘
Block Boundaries:
- A block STARTS after: function entry, branch target, instruction after branch
- A block ENDS with: branch, jump, return, call, or before a branch target
2.3 Register Mapping Strategies
One of the key challenges in binary translation is mapping guest registers to host registers efficiently:
REGISTER MAPPING CHALLENGE
RISC-V has 32 registers: x0-x31
x86-64 has 16 registers: rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, r8-r15
Problem: More guest registers than host registers!
STRATEGY 1: Direct Mapping (Partial)
┌─────────────────────────────────────────────────────────────────────┐
│ Map frequently-used RISC-V registers to x86-64 registers: │
│ │
│ RISC-V Purpose x86-64 Notes │
│ ───────────────────────────────────────────────────────────────────│
│ x10 (a0) Return value/arg1 rax Natural for return value │
│ x11 (a1) Argument 2 rbx Callee-saved │
│ x12 (a2) Argument 3 rcx Caller-saved │
│ x13 (a3) Argument 4 rdx Caller-saved │
│ x14 (a4) Argument 5 r8 Caller-saved │
│ x15 (a5) Argument 6 r9 Caller-saved │
│ x5 (t0) Temporary r10 Caller-saved │
│ x6 (t1) Temporary r11 Caller-saved │
│ │
│ All others: Spill to memory array │
└─────────────────────────────────────────────────────────────────────┘
STRATEGY 2: Memory Array + Load/Store
┌─────────────────────────────────────────────────────────────────────┐
│ Keep ALL guest registers in memory, load/store as needed: │
│ │
│ struct cpu_state { │
│ uint64_t x[32]; // Guest register file │
│ }; │
│ │
│ Translation of "add x10, x11, x12": │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ mov rax, [rbp + 11*8] ; Load x11 from memory │ │
│ │ add rax, [rbp + 12*8] ; Add x12 from memory │ │
│ │ mov [rbp + 10*8], rax ; Store to x10 in memory │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ Pros: Simple, correct, works for all registers │
│ Cons: Memory traffic, slower │
└─────────────────────────────────────────────────────────────────────┘
STRATEGY 3: Hybrid (Recommended)
┌─────────────────────────────────────────────────────────────────────┐
│ Map hot registers to host registers, spill cold ones to memory: │
│ │
│ 1. Direct map x10-x17 (a0-a7) to rax, rbx, rcx, rdx, r8-r11 │
│ 2. Keep rbp pointing to cpu_state structure │
│ 3. For x0-x9 and x18-x31: load/store from memory │
│ │
│ Translation of "add x10, x11, x12" (all mapped): │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ mov rax, rbx ; x10 = x11 (mapped to rax, rbx) │ │
│ │ add rax, rcx ; x10 += x12 (mapped to rcx) │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ Translation of "add x20, x21, x22" (all spilled): │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ mov r12, [rbp + 21*8] ; Load x21 (temporary host register) │ │
│ │ add r12, [rbp + 22*8] ; Add x22 │ │
│ │ mov [rbp + 20*8], r12 ; Store to x20 │ │
│ └────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
2.4 x86-64 Instruction Encoding
To emit machine code, you must understand x86-64 instruction encoding:
x86-64 INSTRUCTION ENCODING PRIMER
General Format:
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ [Prefixes] [REX] [Opcode] [ModR/M] [SIB] [Displacement] [Immediate]│
│ 0-4 bytes 0-1 1-3 0-1 0-1 0-4 0-4 │
│ │
└─────────────────────────────────────────────────────────────────────┘
REX Prefix (for 64-bit operations and extended registers):
┌──────────────────────────────────────────────────────────────────────┐
│ Bit: 7 6 5 4 3 2 1 0 │
│ 0 1 0 0 W R X B │
│ │ │ │ │ │
│ │ │ │ └── Extension of ModR/M.rm │
│ │ │ │ or SIB.base │
│ │ │ └────── Extension of SIB.index │
│ │ └────────── Extension of ModR/M.reg │
│ └────────────── 1 = 64-bit operand │
│ │
│ REX.W (48h): Use 64-bit operand size │
│ REX.R (44h): Access r8-r15 in ModR/M.reg │
│ REX.B (41h): Access r8-r15 in ModR/M.rm │
└──────────────────────────────────────────────────────────────────────┘
ModR/M Byte:
┌──────────────────────────────────────────────────────────────────────┐
│ Bit: 7 6 5 4 3 2 1 0 │
│ └─mod─┘ └───reg───┘ └───r/m───┘ │
│ │ │ │ │
│ │ │ └── Register or memory operand │
│ │ └────────────── Register operand │
│ └─────────────────────── Addressing mode: │
│ 00 = [reg] │
│ 01 = [reg + disp8] │
│ 10 = [reg + disp32] │
│ 11 = reg (register direct) │
└──────────────────────────────────────────────────────────────────────┘
COMMON ENCODINGS YOU'LL NEED:
Instruction Encoding Bytes
────────────────────────────────────────────────────────────────
mov rax, rbx 48 89 D8 3 ; REX.W + MOV + ModR/M
mov rax, rcx 48 89 C8 3
mov rax, imm64 48 B8 + 8 bytes 10 ; REX.W + MOV + immediate
mov eax, imm32 B8 + 4 bytes 5 ; MOV + immediate (32-bit)
add rax, rbx 48 01 D8 3 ; REX.W + ADD + ModR/M
add rax, imm32 48 05 + 4 bytes 6 ; REX.W + ADD + immediate
sub rax, rbx 48 29 D8 3
and rax, rbx 48 21 D8 3
or rax, rbx 48 09 D8 3
xor rax, rbx 48 31 D8 3
cmp rax, rbx 48 39 D8 3
test rax, rax 48 85 C0 3
mov rax, [rbp+disp8] 48 8B 45 XX 4 ; Load from memory
mov [rbp+disp8], rax 48 89 45 XX 4 ; Store to memory
jmp rel32 E9 + 4 bytes 5 ; Near jump
je rel32 0F 84 + 4 bytes 6 ; Jump if equal
jne rel32 0F 85 + 4 bytes 6 ; Jump if not equal
ret C3 1 ; Return
REGISTER ENCODING:
x86-64 Register Code With REX.R/B
───────────────────────────────────────
rax/eax 000 r8/r8d
rcx/ecx 001 r9/r9d
rdx/edx 010 r10/r10d
rbx/ebx 011 r11/r11d
rsp/esp 100 r12/r12d
rbp/ebp 101 r13/r13d
rsi/esi 110 r14/r14d
rdi/edi 111 r15/r15d
2.5 Why This Matters for QEMU
HOW QEMU'S TCG WORKS (SIMPLIFIED)
QEMU's Tiny Code Generator (TCG) does exactly what you're building:
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ Guest Code TCG Frontend TCG IR TCG Backend │
│ (ARM, RISC-V, (Per-target) (Portable) (Per-host) │
│ MIPS, x86...) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ ARM │ │ ARM │ │ TCG ops │ │ x86-64 │ │
│ │ add r0, │ ──► │ frontend │ ──► │ add_i32 │ ──► │ backend │ │
│ │ r1, r2 │ │ │ │ tmp0, │ │ │ │
│ └──────────┘ └──────────┘ │ t1, t2 │ └──────────┘ │
│ └──────────┘ │ │
│ ▼ │
│ ┌──────────┐ │
│ │ x86-64 │ │
│ │ add eax, │ │
│ │ ebx, ecx │ │
│ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Your Project: Simplified version (RISC-V → x86-64 directly)
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ RISC-V Code Your Translator x86-64 Code │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ add x10, │ │ │ │ mov eax, ebx │ │
│ │ x11, x12 │ ────► │ Direct │ ────► │ add eax, ecx │ │
│ │ │ │ Translation │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
After this project, you'll understand:
- Why QEMU can emulate any architecture on any host
- How TCG achieves 10-30x better performance than interpretation
- The tradeoffs between translation quality and compilation time
- Why hardware virtualization (VT-x) is still preferred when available
2.6 Common Misconceptions
Misconception 1: “Binary translation is just like a compiler”
- Reality: Compilers have source code with types and structure. Binary translators work with raw machine code where intent must be inferred.
Misconception 2: “Translation is always better than interpretation”
- Reality: Translation has startup cost. For code executed only once, interpretation is faster. This is why JIT compilers use interpretation initially.
Misconception 3: “You can always achieve native speed with translation”
- Reality: Semantic differences between architectures (memory models, flags, etc.) require extra instructions. 70-90% native speed is typical.
Misconception 4: “x86-64 is simpler than RISC-V because it’s newer”
- Reality: x86-64 is far more complex due to backwards compatibility. RISC-V’s regularity makes it easier to translate FROM.
3. Project Specification
3.1 What You Will Build
A static binary translator that:
- Reads a RISC-V RV32I binary file
- Identifies basic blocks in the code
- Translates each basic block to x86-64 machine code
- Links the translated blocks together
- Outputs a native x86-64 executable
SYSTEM ARCHITECTURE
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ Input: fibonacci.rv32 │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ 00000000: 00100513 addi x10, x0, 1 # n = 1 │ │
│ │ 00000004: 00100593 addi x11, x0, 1 # a = 1 │ │
│ │ 00000008: 00000613 addi x12, x0, 0 # b = 0 │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ BINARY TRANSLATOR │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ 1. Load │──►│ 2. Find │──►│ 3. Translate │ │ │
│ │ │ Binary │ │ Blocks │ │ Each Block │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ Parse RV32I │ │ Build CFG │ │ Emit x86-64 │ │ │
│ │ │ opcodes │ │ (Control │ │ machine code │ │ │
│ │ │ │ │ Flow Graph) │ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ 4. Link Blocks │ │ │
│ │ │ - Patch branch targets │ │ │
│ │ │ - Add runtime support code │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ 5. Generate Executable │ │ │
│ │ │ - Write ELF header │ │ │
│ │ │ - Write translated code │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Output: fibonacci.native │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Native x86-64 executable that runs directly on the host CPU │ │
│ │ Fibonacci(30) = 832040 in 0.003s (vs 0.892s interpreted) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
3.2 Functional Requirements
- Load RV32I binary files
- Parse simple raw binary format or minimal ELF
- Extract code section and entry point
- Identify basic blocks
- Find block boundaries (branches, jump targets)
- Build control flow graph
- Translate RV32I instructions to x86-64
- Support core RV32I instructions (ADD, SUB, AND, OR, XOR, SLL, SRL, SRA)
- Support memory operations (LW, SW, LB, SB, LH, SH)
- Support branches (BEQ, BNE, BLT, BGE, BLTU, BGEU)
- Support jumps (JAL, JALR)
- Support immediates (ADDI, ANDI, ORI, XORI, SLTI, LUI, AUIPC)
- Generate valid x86-64 code
- Emit correct instruction encodings
- Handle register mapping and spilling
- Manage return address and stack
- Link translated blocks
- Patch branch targets after all blocks are translated
- Handle forward references
- Output executable
- Generate minimal ELF executable or raw code with loader
3.3 Non-Functional Requirements
- Performance: Translated code should run 10-100x faster than interpretation
- Correctness: Produce identical results to the interpreter for all inputs
- Code quality: Clean, well-documented translation logic
- Debuggability: Option to emit assembly text alongside machine code
3.4 Example Usage / Output
# Compile a simple RISC-V program
$ cat fibonacci.c
int fib(int n) {
if (n <= 1) return n;
return fib(n-1) + fib(n-2);
}
int main() {
return fib(30);
}
$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -O2 -o fibonacci.rv32 fibonacci.c
# Run the translator
$ ./translator fibonacci.rv32 -o fibonacci.native -v
Binary Translator v0.1
======================
Loading fibonacci.rv32...
Code size: 1,248 bytes
Entry point: 0x10074
Finding basic blocks...
Block 0x10074: 12 instructions (function: main)
Block 0x10098: 8 instructions (function: fib)
Block 0x100b4: 4 instructions (fib: base case)
Block 0x100c0: 15 instructions (fib: recursive case)
Block 0x10100: 3 instructions (fib: return)
... 47 total blocks
Translating blocks...
Block 0x10074: 12 RISC-V → 8 x86-64 instructions
Block 0x10098: 8 RISC-V → 6 x86-64 instructions
Block 0x100b4: 4 RISC-V → 3 x86-64 instructions
Block 0x100c0: 15 RISC-V → 11 x86-64 instructions
Block 0x10100: 3 RISC-V → 2 x86-64 instructions
...
Linking blocks...
Patching 52 branch targets
Adding runtime support code
Generating executable...
Output: fibonacci.native (2,891 bytes)
Translation statistics:
RISC-V instructions: 423
x86-64 instructions: 312
Code expansion: 0.74x (smaller!)
Translation time: 0.012s
# Run the translated program
$ ./fibonacci.native
Fibonacci(30) = 832040
# Compare performance
$ time ./interpreter fibonacci.rv32
Fibonacci(30) = 832040
real 0m0.892s
$ time ./fibonacci.native
Fibonacci(30) = 832040
real 0m0.003s
# Speedup: 297x!
4. Solution Architecture
4.1 High-Level Design
TRANSLATOR COMPONENT ARCHITECTURE
┌─────────────────────────────────────────────────────────────────────┐
│ main.c │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ int main(int argc, char **argv) { │ │
│ │ Binary *bin = load_binary(input_file); │ │
│ │ CFG *cfg = find_basic_blocks(bin); │ │
│ │ TranslatedCode *code = translate_all_blocks(cfg); │ │
│ │ link_blocks(code); │ │
│ │ generate_executable(code, output_file); │ │
│ │ } │ │
│ └─────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┬────┘
│
┌────────────────┬───────────────┬──────────────────┼────────────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌───────────────────┐ ┌─────────────┐ ┌─────────────────┐ ┌─────────────┐ ┌─────────────┐
│ loader.c │ │ cfg.c │ │ translator.c │ │ linker.c │ │ emitter.c │
│ │ │ │ │ │ │ │ │ │
│ - load_binary() │ │ - find_ │ │ - translate_ │ │ - link_ │ │ - emit_ │
│ - parse_elf() │ │ blocks() │ │ block() │ │ blocks() │ │ prologue()│
│ - get_section() │ │ - build_ │ │ - translate_ │ │ - patch_ │ │ - emit_ │
│ │ │ cfg() │ │ instruction() │ │ branches()│ │ epilogue()│
│ │ │ │ │ - reg_alloc() │ │ │ │ - write_ │
│ │ │ │ │ │ │ │ │ elf() │
└───────────────────┘ └─────────────┘ └─────────────────┘ └─────────────┘ └─────────────┘
│
│
┌────────────────┴────────────────┐
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ rv32_decode.c │ │ x86_encode.c │
│ │ │ │
│ - decode_r_type() │ │ - emit_mov_reg() │
│ - decode_i_type() │ │ - emit_add_reg() │
│ - decode_s_type() │ │ - emit_jmp() │
│ - decode_b_type() │ │ - emit_call() │
│ - decode_u_type() │ │ - emit_ret() │
│ - decode_j_type() │ │ - ModRM encoding │
└─────────────────────┘ └─────────────────────┘
4.2 Key Components
Component 1: Loader (loader.c)
- Reads binary file into memory
- Parses ELF headers (or simple raw format)
- Extracts code, data, entry point
Component 2: CFG Builder (cfg.c)
- Scans code for branch/jump instructions
- Identifies basic block boundaries
- Builds control flow graph with edges
Component 3: Translator (translator.c)
- Main translation engine
- Manages register allocation state
- Coordinates instruction translation
Component 4: RISC-V Decoder (rv32_decode.c)
- Decodes RISC-V instruction formats
- Extracts opcode, rd, rs1, rs2, immediates
- Validates instruction encoding
Component 5: x86-64 Encoder (x86_encode.c)
- Emits x86-64 machine code bytes
- Handles REX prefixes, ModR/M, SIB
- Manages code buffer
Component 6: Linker (linker.c)
- Patches branch targets after translation
- Handles forward references
- Adds runtime support code
Component 7: Emitter (emitter.c)
- Generates final executable
- Writes ELF headers
- Manages sections
4.3 Data Structures
// Basic block representation
typedef struct BasicBlock {
uint32_t start_addr; // RISC-V address
uint32_t end_addr;
int num_instructions;
uint32_t *rv_instructions; // Original RISC-V code
uint8_t *x86_code; // Translated x86-64 code
int x86_code_size;
uint64_t x86_addr; // Address in output
// Control flow
struct BasicBlock *fall_through;
struct BasicBlock *branch_target;
int branch_type; // BRANCH_COND, BRANCH_UNCOND, BRANCH_CALL, etc.
// Linking
int num_fixups;
struct BranchFixup *fixups;
} BasicBlock;
// Control flow graph
typedef struct CFG {
BasicBlock *blocks;
int num_blocks;
BasicBlock *entry_block;
// Block lookup by RISC-V address
struct BlockMap *addr_map;
} CFG;
// Register allocation state
typedef struct RegAlloc {
int rv_to_x86[32]; // Guest reg -> host reg mapping (-1 = spilled)
int x86_to_rv[16]; // Host reg -> guest reg mapping (-1 = free)
bool dirty[32]; // Does guest reg need writeback?
// Spill management
uint64_t spill_base; // Base address of spill area
} RegAlloc;
// Branch fixup for linking
typedef struct BranchFixup {
uint64_t patch_offset; // Offset in x86 code to patch
uint32_t target_rv_addr; // RISC-V address of target
int fixup_type; // FIXUP_REL32, FIXUP_ABS64
} BranchFixup;
// Code buffer
typedef struct CodeBuffer {
uint8_t *code;
int size;
int capacity;
} CodeBuffer;
4.4 Algorithm Overview
TRANSLATION ALGORITHM
For each basic block:
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ 1. DECODE all RISC-V instructions in block │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ for each instruction: │ │
│ │ decode opcode, funct3, funct7 │ │
│ │ extract rd, rs1, rs2, immediate │ │
│ │ store in instruction array │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ 2. TRANSLATE each instruction │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ for each instruction: │ │
│ │ allocate host registers for operands │ │
│ │ emit x86-64 code for operation │ │
│ │ update register allocation state │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ 3. HANDLE block terminator (branch/jump) │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ if unconditional jump: │ │
│ │ emit JMP (target to be patched) │ │
│ │ add fixup entry │ │
│ │ │ │
│ │ if conditional branch: │ │
│ │ emit comparison (CMP/TEST) │ │
│ │ emit conditional jump (JE/JNE/JL/JGE) │ │
│ │ add fixup entry │ │
│ │ fall through to next block │ │
│ │ │ │
│ │ if return: │ │
│ │ emit RET │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ 4. RECORD block information │
│ - x86 code bytes │
│ - size │
│ - fixup locations │
│ │
└─────────────────────────────────────────────────────────────────────┘
After all blocks translated:
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ 5. ASSIGN addresses to all blocks │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ base_addr = code_start │ │
│ │ for each block: │ │
│ │ block.x86_addr = base_addr │ │
│ │ base_addr += block.x86_code_size │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ 6. PATCH all branch fixups │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ for each block: │ │
│ │ for each fixup: │ │
│ │ target_block = lookup(fixup.target_rv_addr) │ │
│ │ rel32 = target_block.x86_addr - (fixup_addr + 4) │ │
│ │ write rel32 at fixup.patch_offset │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ 7. GENERATE executable │
│ - ELF header │
│ - Program headers │
│ - Code section (all translated blocks) │
│ - Data section (guest memory, spill area) │
│ │
└─────────────────────────────────────────────────────────────────────┘
5. Implementation Guide
5.1 Development Environment Setup
# Install required tools
# On Ubuntu/Debian:
sudo apt-get install build-essential
sudo apt-get install binutils # For objdump to verify output
# RISC-V toolchain for test programs:
sudo apt-get install gcc-riscv64-unknown-elf
# On macOS:
brew install riscv-gnu-toolchain
# Verify installation
riscv64-unknown-elf-gcc --version
5.2 Project Structure
basic_block_translator/
├── Makefile
├── include/
│ ├── translator.h # Main header
│ ├── rv32_decode.h # RISC-V decoder
│ ├── x86_encode.h # x86-64 encoder
│ ├── cfg.h # Control flow graph
│ └── loader.h # Binary loader
├── src/
│ ├── main.c # Entry point
│ ├── loader.c # Binary loading
│ ├── cfg.c # Basic block finder
│ ├── translator.c # Translation engine
│ ├── rv32_decode.c # RISC-V instruction decoder
│ ├── x86_encode.c # x86-64 code emitter
│ ├── linker.c # Block linking
│ └── emitter.c # Executable generation
├── test/
│ ├── test_decode.c # Unit tests for decoder
│ ├── test_encode.c # Unit tests for encoder
│ ├── simple.c # Simple test program
│ ├── fibonacci.c # Recursive test
│ └── loops.c # Loop test
└── runtime/
└── support.S # Runtime support code (optional)
5.3 The Core Question You’re Answering
“How can we transform code written for one CPU architecture into code that runs natively on a different architecture, achieving near-native performance?”
This is the fundamental question behind:
- QEMU’s cross-architecture emulation
- Apple’s Rosetta 2 (x86 on ARM)
- Wine’s Windows-on-Linux support
- Game console emulators
5.4 Concepts You Must Understand First
Self-assessment questions before starting:
-
RISC-V Instruction Formats: Can you decode
0x00A58533intoadd x10, x11, x10by hand? -
x86-64 Encoding: Can you write the bytes for
mov rax, rbxwithout looking it up? (Answer: 48 89 D8) -
Register Allocation: Why is it important to minimize memory accesses? What is “spilling”?
-
Control Flow: How is a conditional branch different from a function call at the machine level?
-
Relocations: Why can’t we know branch target addresses until all code is generated?
5.5 Questions to Guide Your Design
Architecture Questions:
- How will you represent basic blocks in memory?
- How will you handle forward branches (to blocks not yet translated)?
- Should you translate all blocks upfront, or lazily on first execution?
Register Allocation Questions:
- Which RISC-V registers map to which x86-64 registers?
- What happens when you need more host registers than available?
- How do you handle x0 (which is always zero in RISC-V)?
Code Generation Questions:
- How do you handle RISC-V’s signed/unsigned branch variants?
- How do you implement LUI+AUIPC for building 32-bit constants?
- What x86 instructions correspond to RISC-V’s shift operations?
Performance Questions:
- Is it faster to emit multiple simple instructions or fewer complex ones?
- When should you use memory operands vs. explicit loads?
- How can you minimize code size while maintaining performance?
5.6 Thinking Exercise
Before writing any code, trace through this translation by hand:
RISC-V Basic Block:
┌────────────────────────────────────────────────────────────────────┐
│ 0x1000: addi x10, x0, 5 # x10 = 5 │
│ 0x1004: addi x11, x0, 3 # x11 = 3 │
│ 0x1008: add x12, x10, x11 # x12 = x10 + x11 │
│ 0x100c: sw x12, 0(x1) # store x12 to address in x1 │
│ 0x1010: bne x12, x0, -16 # if x12 != 0, jump back to 0x1000 │
└────────────────────────────────────────────────────────────────────┘
Questions:
1. What are the basic block boundaries?
2. Using this register mapping:
x10 → rax, x11 → rbx, x12 → rcx, x1 → r8
What x86-64 instructions would you generate?
3. How do you handle the branch target (0x1000)?
4. What if x1 needs to be loaded from memory first?
Write out your expected x86-64 translation before reading the hints!
5.7 Hints in Layers
Hint 1 - Starting Point (Conceptual Direction)
Start simple: first make a “transpiler” that outputs C code instead of machine code. This lets you verify translation logic without worrying about x86 encoding.
// Instead of emitting bytes, emit printf:
void translate_add(int rd, int rs1, int rs2) {
printf(" cpu.x[%d] = cpu.x[%d] + cpu.x[%d];\n", rd, rs1, rs2);
}
Then compile and run the C output. Once correct, switch to emitting actual machine code.
Hint 2 - Next Level (More Specific Guidance)
For x86 code emission, create helper functions for common patterns:
// Emit: mov reg1, reg2
void emit_mov_rr(CodeBuffer *buf, int dst, int src) {
emit_byte(buf, 0x48); // REX.W
emit_byte(buf, 0x89); // MOV r/m64, r64
emit_byte(buf, 0xC0 | (src << 3) | dst); // ModR/M
}
// Emit: add reg1, reg2
void emit_add_rr(CodeBuffer *buf, int dst, int src) {
emit_byte(buf, 0x48); // REX.W
emit_byte(buf, 0x01); // ADD r/m64, r64
emit_byte(buf, 0xC0 | (src << 3) | dst); // ModR/M
}
Build up your x86 encoder from these primitives.
Hint 3 - Technical Details (Approach/Pseudocode)
Translation for common instructions:
void translate_instruction(TransState *s, uint32_t insn) {
uint32_t opcode = insn & 0x7F;
switch (opcode) {
case 0x33: // R-type (ADD, SUB, etc.)
int rd = (insn >> 7) & 0x1F;
int rs1 = (insn >> 15) & 0x1F;
int rs2 = (insn >> 20) & 0x1F;
int funct3 = (insn >> 12) & 0x7;
int funct7 = (insn >> 25) & 0x7F;
// Get host registers
int h_rs1 = get_or_load(s, rs1); // Load rs1 to host reg
int h_rs2 = get_or_load(s, rs2); // Load rs2 to host reg
int h_rd = alloc_for_write(s, rd); // Allocate for rd
if (h_rd != h_rs1) {
emit_mov_rr(s->code, h_rd, h_rs1);
}
switch (funct7 << 3 | funct3) {
case 0x00: // ADD
emit_add_rr(s->code, h_rd, h_rs2);
break;
case 0x100: // SUB
emit_sub_rr(s->code, h_rd, h_rs2);
break;
// ... other operations
}
break;
case 0x63: // B-type (BEQ, BNE, etc.)
// ... branch handling
break;
// ... other instruction types
}
}
Hint 4 - Tools/Debugging (Verification Methods)
Use objdump to verify your generated code:
# Disassemble your output
objdump -d fibonacci.native
# Compare with expected:
# Should see x86-64 instructions that implement the RISC-V logic
# Test with a known program:
echo 'int main() { return 42; }' > test.c
riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 test.c -o test.rv32
./translator test.rv32 -o test.native
./test.native
echo $? # Should print 42
Use GDB to step through translated code:
gdb ./fibonacci.native
(gdb) break *0x401000 # First instruction
(gdb) run
(gdb) si # Single step
(gdb) info registers
5.8 The Interview Questions They’ll Ask
- “How does binary translation differ from interpretation, and when would you use each?”
- Expected: Discuss startup cost vs execution speed tradeoff, hot path optimization
- “What is a basic block and why is it the unit of translation?”
- Expected: Single entry, single exit, no internal branches, enables optimization
- “How do you handle register mapping when the guest has more registers than the host?”
- Expected: Discuss spilling, allocation policies, liveness analysis
- “What happens when translated code needs to call back into the runtime?”
- Expected: ABI concerns, context saving, mode switching
- “How would you extend this translator to support self-modifying code?”
- Expected: Code cache invalidation, checking for writes to translated regions
- “Why is x86-64 instruction encoding so complex compared to RISC-V?”
- Expected: Variable length, legacy prefixes, multiple addressing modes, backwards compatibility
- “How does QEMU’s TCG handle cross-architecture translation?”
- Expected: Intermediate representation, frontend/backend separation, runtime helpers
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Basic blocks and CFG | Engineering a Compiler | Chapter 8 |
| Instruction selection | Engineering a Compiler | Chapter 11 |
| Register allocation | Engineering a Compiler | Chapter 13 |
| x86-64 encoding | Intel SDM Volume 2 | Instruction Reference |
| x86-64 assembly | Modern X86 Assembly Language Programming | Chapters 1-4 |
| RISC-V ISA | RISC-V Specification | Chapters 2, 12 |
| Binary translation | Virtual Machines (Smith & Nair) | Chapter 2 |
5.10 Implementation Phases
Phase 1: RISC-V Decoder (3-4 days)
- Implement instruction format parsing
- Test with known instruction bytes
- Build opcode dispatch table
Phase 2: Basic Block Finder (2-3 days)
- Scan for branches and jump targets
- Build basic block list
- Verify with simple programs
Phase 3: x86-64 Encoder (4-5 days)
- Implement core instruction encodings
- Test each instruction in isolation
- Verify with objdump
Phase 4: Translator Core (5-7 days)
- Implement instruction translation
- Handle register allocation
- Test arithmetic programs
Phase 5: Control Flow (3-4 days)
- Implement branch translation
- Add block linking
- Test with loops
Phase 6: Executable Generation (2-3 days)
- Generate ELF headers
- Write code section
- Test complete programs
Phase 7: Optimization (Optional, 3-5 days)
- Peephole optimizations
- Improved register allocation
- Benchmark and profile
5.11 Key Implementation Decisions
Decision 1: Register Allocation Strategy
- Simple: All guest registers in memory, load/store for each use
- Better: Map hot registers (x10-x17) to host registers, spill others
- Trade-off: Complexity vs performance
Decision 2: Code Layout
- Option A: Translate all blocks upfront, then link
- Option B: Lazy translation on first execution (like JIT)
- Trade-off: Startup time vs implementation complexity
Decision 3: Branch Handling
- Option A: Always indirect through dispatch table
- Option B: Direct branches with patching
- Trade-off: Flexibility vs performance
Decision 4: Memory Access Translation
- Option A: Inline address calculation
- Option B: Call helper functions
- Trade-off: Code size vs simplicity
6. Testing Strategy
6.1 Unit Tests
// test_decode.c
void test_decode_add() {
// add x10, x11, x12 = 0x00C58533
RVInsn insn = decode(0x00C58533);
assert(insn.opcode == OP_ADD);
assert(insn.rd == 10);
assert(insn.rs1 == 11);
assert(insn.rs2 == 12);
}
// test_encode.c
void test_encode_mov() {
CodeBuffer buf = {0};
emit_mov_rr(&buf, RAX, RBX);
assert(buf.size == 3);
assert(buf.code[0] == 0x48);
assert(buf.code[1] == 0x89);
assert(buf.code[2] == 0xD8);
}
6.2 Integration Tests
// Simple arithmetic
int test_add() {
int a = 5;
int b = 3;
return a + b; // Should return 8
}
// Loops
int test_sum() {
int sum = 0;
for (int i = 0; i < 10; i++) {
sum += i;
}
return sum; // Should return 45
}
// Recursion
int test_fib(int n) {
if (n <= 1) return n;
return test_fib(n-1) + test_fib(n-2);
}
// test_fib(10) should return 55
6.3 Verification Process
# For each test:
# 1. Compile to RISC-V
riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 test.c -o test.rv32
# 2. Run with interpreter (reference)
./rv32_interpreter test.rv32
# Record result
# 3. Translate
./translator test.rv32 -o test.native
# 4. Run translated
./test.native
# Compare result
# 5. Verify correctness
if [ $interpreter_result -eq $translated_result ]; then
echo "PASS"
else
echo "FAIL"
fi
7. Common Pitfalls & Debugging
| Problem | Root Cause | Fix | Verification |
|---|---|---|---|
| “Segmentation fault on execution” | Invalid x86 encoding | Check REX prefix, ModR/M byte | objdump -d output.native |
| “Wrong arithmetic results” | Sign extension issues | Use MOVSX for signed loads | Test with negative numbers |
| “Branches go to wrong places” | Relative offset calculation wrong | Remember: offset is from AFTER the jump instruction | Single-step in GDB |
| “Infinite loops” | Back-edge not properly linked | Check fixup patching for negative offsets | Add loop counter limit |
| “Works for small programs, crashes on large” | Code buffer overflow | Dynamic buffer growth | Test with 10K+ instruction programs |
| “Registers corrupted across blocks” | Not saving/restoring at block boundaries | Ensure consistent register state | Print registers at block entry/exit |
| “Stack corruption” | RSP modified incorrectly | Keep RSP aligned to 16 bytes | Check RSP in GDB |
Debugging Tips
# 1. Add verbose output during translation
./translator --verbose test.rv32
# Shows each instruction translation
# 2. Emit assembly alongside binary
./translator --emit-asm test.rv32
# Generates test.native.s for review
# 3. Use GDB effectively
gdb ./test.native
(gdb) layout asm # Show disassembly
(gdb) layout regs # Show registers
(gdb) break *0x401000 # Break at start of translated code
(gdb) x/20i $rip # Examine next 20 instructions
(gdb) si # Single step
(gdb) p/x $rax # Print register
# 4. Compare with reference
./rv32_interpreter --trace test.rv32 > interp.log
./test.native --trace > native.log # (if you add tracing)
diff interp.log native.log
8. Extensions & Challenges
Extension 1: Optimization Passes
Add peephole optimizations:
- Remove redundant MOV instructions
- Combine constant loads
- Strength reduction (MUL by power of 2 → shift)
Extension 2: Better Register Allocation
Implement graph-coloring register allocation:
- Build interference graph
- Color with k colors (k = available registers)
- Spill lowest-priority registers
Extension 3: Block Chaining
Instead of returning to dispatcher after each block:
- Patch translated blocks to jump directly to each other
- Only return to dispatcher for indirect branches
Extension 4: Dynamic Translation
Convert to JIT:
- Start with interpreter
- Detect hot loops (backward branches executed N times)
- Compile hot blocks on demand
Extension 5: More Instructions
Add support for:
- RV32M (multiply/divide)
- RV32F (floating point)
- RV32A (atomics)
- Compressed instructions (RVC)
Extension 6: Self-Modifying Code
Handle code that writes to code regions:
- Write-protect translated code regions
- Invalidate on write
- Re-translate modified blocks
9. Real-World Connections
QEMU’s TCG (Tiny Code Generator)
Your project implements a simplified version of QEMU’s TCG:
QEMU TCG Architecture:
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ Your Project QEMU TCG │
│ ──────────── ──────── │
│ RISC-V → x86-64 (direct) Guest → TCG IR → Host │
│ Static translation Dynamic (JIT) │
│ Single target Multiple targets │
│ Basic blocks only Translation blocks + chaining │
│ │
│ Key Similarities: │
│ - Basic block as unit of translation │
│ - Register mapping guest → host │
│ - Patching branch targets │
│ - Code cache management │
│ │
└─────────────────────────────────────────────────────────────────────┘
Study QEMU’s source:
tcg/tcg.c- TCG coretcg/i386/tcg-target.c.inc- x86 backendtarget/riscv/translate.c- RISC-V frontend
Apple’s Rosetta 2
Rosetta 2 translates x86-64 to ARM64, the reverse of your project:
- Ahead-of-time translation for binaries
- JIT for dynamic code
- Handles complex x86 semantics (flags, memory ordering)
Other Systems
- FEX-Emu: Linux x86-64 on ARM64
- Box64: x86-64 emulator for ARM
- Dynamo: HP’s dynamic optimization system
- Pin: Intel’s dynamic instrumentation tool
10. Resources
Online References
- RISC-V Specification - Official ISA reference
- Intel SDM Volume 2 - x86 instruction reference
- x86-64 Instruction Encoding - OSDev Wiki
- Introduction to Dynamic Recompilation - Zenogais tutorial
Source Code to Study
- QEMU TCG:
git clone https://gitlab.com/qemu-project/qemu.git - rv8: RISC-V to x86 translator:
https://github.com/rv8-io/rv8 - TinyEMU: Small RISC-V emulator:
https://bellard.org/tinyemu/
Academic Papers
- “Binary Translation” - Sites, Chernoff (1993) - The classic paper
- “Efficient Method for Binary Translation” - Cifuentes, Malhotra (1996)
- “QEMU, a Fast and Portable Dynamic Translator” - Bellard (2005)
11. Self-Assessment Checklist
Milestone 1: Foundation Complete
- Can decode all RV32I instruction formats
- Can emit basic x86-64 instructions (MOV, ADD, SUB)
- Unit tests pass for decoder and encoder
Milestone 2: Single Block Translation
- Can translate a basic block with arithmetic instructions
- Register allocation works for mapped registers
- Translated code executes and produces correct results
Milestone 3: Control Flow
- Can identify basic blocks in a program
- Branch instructions translate correctly
- Block linking works for simple programs
Milestone 4: Complete Programs
- Can translate and run fibonacci(30)
- Performance is 10x+ better than interpretation
- All test programs produce correct results
Milestone 5: Robustness
- Handles all RV32I instructions
- Works with GCC-compiled programs
- No memory leaks or crashes on large programs
12. Submission / Completion Criteria
You have successfully completed this project when:
- Correctness: Your translator produces executables that compute the same results as the RISC-V interpreter for all test programs:
- Simple arithmetic
- Loops (for, while)
- Recursion (fibonacci, factorial)
- Array access
- Function calls
- Performance: Translated code runs at least 10x faster than interpretation:
```
fibonacci(30):
- Interpreter: > 0.5 seconds
- Translated: < 0.05 seconds ```
- Code Quality:
- Clean separation between components
- Well-documented translation logic
- Comprehensive test suite
- Understanding: You can explain:
- How basic blocks are identified
- Why register mapping matters
- How branch patching works
- The trade-offs between translation and interpretation
- Demonstration: You can show:
# Compile RISC-V program $ riscv64-unknown-elf-gcc -march=rv32i test.c -o test.rv32 # Translate to native $ ./translator test.rv32 -o test.native # Run and verify $ ./test.native Expected output: [correct result]
13. What’s Next?
After completing this project, you’re ready for:
- Project 4: Simple JIT Compiler - Add runtime compilation and hot path detection
- Project 5: Shadow Page Table Simulator - Understand memory virtualization
- Project 10-12: Hardware Virtualization - Use Intel VT-x for near-native performance
You now understand the core of how QEMU achieves cross-architecture emulation. The next step is dynamic translation (JIT), where code is compiled at runtime rather than ahead-of-time. This enables optimizations based on runtime behavior and handles dynamically-generated code.
After completing this project, you’ll understand why binary translation exists, how it achieves massive speedups over interpretation, and how systems like QEMU can run software for any architecture on any host. This is foundational knowledge for anyone working on virtualization, emulators, or low-level systems.