Project 3: Basic Block Binary Translator

Build a static binary translator that converts RISC-V basic blocks to x86-64 machine code, creating executables that run natively with 100-300x speedup over interpretation - the core technique behind QEMU’s Tiny Code Generator.

Quick Reference

Attribute	Value
Difficulty	Advanced
Time Estimate	3-4 weeks
Language	C
Alternative Languages	C++, Rust
Prerequisites	Projects 1-2 (emulation basics), x86-64 assembly knowledge, understanding of instruction encoding
Key Topics	Binary translation, basic block identification, register mapping, instruction selection, code emission, control flow graphs
Coolness Level	Level 5: Pure Magic (Super Cool)
Portfolio Value	Resume Gold
Main Book	“Engineering a Compiler” by Cooper & Torczon

1. Learning Objectives

By completing this project, you will:

Understand binary translation fundamentals: Explain how guest code can be converted to host code for native execution
Master basic block identification: Identify instruction sequences with single entry points and single exit points
Implement register mapping strategies: Map guest registers to host registers efficiently, with spillover to memory
Build an instruction selector: Choose optimal host instructions for each guest operation
Generate valid x86-64 machine code: Emit raw bytes that the CPU can execute directly
Handle control flow translation: Chain translated blocks together and handle branches
Understand QEMU’s TCG at a fundamental level: This is exactly how QEMU’s Tiny Code Generator works for cross-architecture emulation

2. Theoretical Foundation

2.1 What is Binary Translation?

Binary translation is the process of converting machine code from one instruction set architecture (ISA) to another. Unlike interpretation (which decodes and executes each instruction one at a time), binary translation converts entire sequences of instructions at once, then executes the translated code natively.

INTERPRETATION vs BINARY TRANSLATION

Interpretation (Project 2):
┌─────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  RISC-V Instruction          Interpreter Loop                        │
│  ┌─────────────────┐        ┌────────────────────────────────────┐  │
│  │ add x10, x11, x12│   ──►  │ 1. Fetch instruction (memory read) │  │
│  └─────────────────┘        │ 2. Decode opcode (bit extraction)  │  │
│                             │ 3. Execute (switch statement)       │  │
│                             │ 4. Update PC                        │  │
│                             │ 5. Repeat for EVERY instruction     │  │
│                             └────────────────────────────────────┘  │
│                                                                      │
│  Cost: 50-200 host instructions PER guest instruction               │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Binary Translation (This Project):
┌─────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  RISC-V Basic Block         Translation            x86-64 Block     │
│  ┌─────────────────┐       ┌───────────┐         ┌──────────────┐  │
│  │ add x10, x11, x12│       │           │         │ mov eax, ebx │  │
│  │ add x13, x10, x14│  ──►  │ One-time  │   ──►   │ add eax, ecx │  │
│  │ sw x13, 0(x15)  │       │ Translation│         │ add eax, r8d │  │
│  │ bne x10, x0, -12│       │           │         │ mov [r9], eax│  │
│  └─────────────────┘       └───────────┘         │ test eax, eax│  │
│                                                   │ jne target   │  │
│                                                   └──────────────┘  │
│                                                          │          │
│                                                          ▼          │
│                                            Execute natively, full   │
│                                            speed, many times!       │
│                                                                      │
│  Cost: Translation is expensive, but execution is nearly free       │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Performance Comparison:
┌────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  Method               │ Startup │ Per-Instruction │ Overall        │
│  ─────────────────────┼─────────┼─────────────────┼────────────────│
│  Interpretation       │ Fast    │ ~100 cycles     │ Slow           │
│  Binary Translation   │ Slow    │ ~1-5 cycles     │ Fast           │
│                                                                     │
│  Break-even point: ~50-100 executions of the same block            │
│  Hot loops: 10,000x better with translation!                       │
│                                                                     │
└────────────────────────────────────────────────────────────────────┘

2.2 Basic Blocks: The Unit of Translation

A basic block is a sequence of instructions with:

One entry point: Execution always enters at the first instruction
One exit point: Execution always leaves at the last instruction
No internal branches: Instructions execute sequentially

BASIC BLOCK IDENTIFICATION

Source Code (C):                     Basic Blocks:
┌────────────────────────────┐      ┌──────────────────────────────────┐
│ int sum = 0;               │      │                                   │
│ for (int i = 0; i < n; i++)│      │  Block 1: Setup                  │
│ {                          │      │  ┌──────────────────────────┐    │
│     sum += arr[i];         │      │  │ li x10, 0     # sum = 0  │    │
│ }                          │      │  │ li x11, 0     # i = 0    │    │
│ return sum;                │      │  └──────────────────────────┘    │
└────────────────────────────┘      │            │                      │
                                    │            ▼                      │
                                    │  Block 2: Loop Header            │
                                    │  ┌──────────────────────────┐    │
                                    │  │ bge x11, x12, exit       │    │
                                    │  └──────────────────────────┘    │
                                    │            │                      │
                                    │            ▼                      │
                                    │  Block 3: Loop Body              │
                                    │  ┌──────────────────────────┐    │
                                    │  │ slli x13, x11, 2         │    │
                                    │  │ add x13, x13, x14        │    │
                                    │  │ lw x15, 0(x13)           │    │
                                    │  │ add x10, x10, x15        │    │
                                    │  │ addi x11, x11, 1         │    │
                                    │  │ j loop_header            │    │
                                    │  └──────────────────────────┘    │
                                    │                                   │
                                    │  Block 4: Exit                   │
                                    │  ┌──────────────────────────┐    │
                                    │  │ ret                      │    │
                                    │  └──────────────────────────┘    │
                                    └──────────────────────────────────┘

Block Boundaries:
- A block STARTS after: function entry, branch target, instruction after branch
- A block ENDS with: branch, jump, return, call, or before a branch target

2.3 Register Mapping Strategies

One of the key challenges in binary translation is mapping guest registers to host registers efficiently:

REGISTER MAPPING CHALLENGE

RISC-V has 32 registers: x0-x31
x86-64 has 16 registers: rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, r8-r15

Problem: More guest registers than host registers!

STRATEGY 1: Direct Mapping (Partial)
┌─────────────────────────────────────────────────────────────────────┐
│ Map frequently-used RISC-V registers to x86-64 registers:          │
│                                                                     │
│ RISC-V     Purpose              x86-64    Notes                    │
│ ───────────────────────────────────────────────────────────────────│
│ x10 (a0)   Return value/arg1   rax       Natural for return value │
│ x11 (a1)   Argument 2          rbx       Callee-saved             │
│ x12 (a2)   Argument 3          rcx       Caller-saved             │
│ x13 (a3)   Argument 4          rdx       Caller-saved             │
│ x14 (a4)   Argument 5          r8        Caller-saved             │
│ x15 (a5)   Argument 6          r9        Caller-saved             │
│ x5  (t0)   Temporary           r10       Caller-saved             │
│ x6  (t1)   Temporary           r11       Caller-saved             │
│                                                                     │
│ All others: Spill to memory array                                  │
└─────────────────────────────────────────────────────────────────────┘

STRATEGY 2: Memory Array + Load/Store
┌─────────────────────────────────────────────────────────────────────┐
│ Keep ALL guest registers in memory, load/store as needed:          │
│                                                                     │
│ struct cpu_state {                                                  │
│     uint64_t x[32];  // Guest register file                        │
│ };                                                                  │
│                                                                     │
│ Translation of "add x10, x11, x12":                                │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ mov rax, [rbp + 11*8]   ; Load x11 from memory                │ │
│ │ add rax, [rbp + 12*8]   ; Add x12 from memory                 │ │
│ │ mov [rbp + 10*8], rax   ; Store to x10 in memory              │ │
│ └────────────────────────────────────────────────────────────────┘ │
│                                                                     │
│ Pros: Simple, correct, works for all registers                     │
│ Cons: Memory traffic, slower                                       │
└─────────────────────────────────────────────────────────────────────┘

STRATEGY 3: Hybrid (Recommended)
┌─────────────────────────────────────────────────────────────────────┐
│ Map hot registers to host registers, spill cold ones to memory:    │
│                                                                     │
│ 1. Direct map x10-x17 (a0-a7) to rax, rbx, rcx, rdx, r8-r11       │
│ 2. Keep rbp pointing to cpu_state structure                        │
│ 3. For x0-x9 and x18-x31: load/store from memory                  │
│                                                                     │
│ Translation of "add x10, x11, x12" (all mapped):                   │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ mov rax, rbx           ; x10 = x11 (mapped to rax, rbx)       │ │
│ │ add rax, rcx           ; x10 += x12 (mapped to rcx)           │ │
│ └────────────────────────────────────────────────────────────────┘ │
│                                                                     │
│ Translation of "add x20, x21, x22" (all spilled):                  │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ mov r12, [rbp + 21*8]  ; Load x21 (temporary host register)   │ │
│ │ add r12, [rbp + 22*8]  ; Add x22                              │ │
│ │ mov [rbp + 20*8], r12  ; Store to x20                         │ │
│ └────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

2.4 x86-64 Instruction Encoding

To emit machine code, you must understand x86-64 instruction encoding:

x86-64 INSTRUCTION ENCODING PRIMER

General Format:
┌─────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  [Prefixes] [REX] [Opcode] [ModR/M] [SIB] [Displacement] [Immediate]│
│  0-4 bytes  0-1   1-3      0-1      0-1   0-4            0-4        │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

REX Prefix (for 64-bit operations and extended registers):
┌──────────────────────────────────────────────────────────────────────┐
│  Bit:    7   6   5   4   3   2   1   0                               │
│          0   1   0   0   W   R   X   B                               │
│                          │   │   │   │                               │
│                          │   │   │   └── Extension of ModR/M.rm     │
│                          │   │   │       or SIB.base                │
│                          │   │   └────── Extension of SIB.index     │
│                          │   └────────── Extension of ModR/M.reg    │
│                          └────────────── 1 = 64-bit operand         │
│                                                                      │
│  REX.W (48h): Use 64-bit operand size                               │
│  REX.R (44h): Access r8-r15 in ModR/M.reg                          │
│  REX.B (41h): Access r8-r15 in ModR/M.rm                           │
└──────────────────────────────────────────────────────────────────────┘

ModR/M Byte:
┌──────────────────────────────────────────────────────────────────────┐
│  Bit:    7   6   5   4   3   2   1   0                               │
│          └─mod─┘ └───reg───┘ └───r/m───┘                            │
│             │        │           │                                   │
│             │        │           └── Register or memory operand     │
│             │        └────────────── Register operand               │
│             └─────────────────────── Addressing mode:               │
│                                      00 = [reg]                     │
│                                      01 = [reg + disp8]             │
│                                      10 = [reg + disp32]            │
│                                      11 = reg (register direct)     │
└──────────────────────────────────────────────────────────────────────┘

COMMON ENCODINGS YOU'LL NEED:

Instruction             Encoding          Bytes
────────────────────────────────────────────────────────────────
mov rax, rbx           48 89 D8          3    ; REX.W + MOV + ModR/M
mov rax, rcx           48 89 C8          3
mov rax, imm64         48 B8 + 8 bytes   10   ; REX.W + MOV + immediate
mov eax, imm32         B8 + 4 bytes      5    ; MOV + immediate (32-bit)
add rax, rbx           48 01 D8          3    ; REX.W + ADD + ModR/M
add rax, imm32         48 05 + 4 bytes   6    ; REX.W + ADD + immediate
sub rax, rbx           48 29 D8          3
and rax, rbx           48 21 D8          3
or  rax, rbx           48 09 D8          3
xor rax, rbx           48 31 D8          3
cmp rax, rbx           48 39 D8          3
test rax, rax          48 85 C0          3
mov rax, [rbp+disp8]   48 8B 45 XX       4    ; Load from memory
mov [rbp+disp8], rax   48 89 45 XX       4    ; Store to memory
jmp rel32              E9 + 4 bytes      5    ; Near jump
je  rel32              0F 84 + 4 bytes   6    ; Jump if equal
jne rel32              0F 85 + 4 bytes   6    ; Jump if not equal
ret                    C3                1    ; Return

REGISTER ENCODING:

x86-64 Register   Code   With REX.R/B
───────────────────────────────────────
rax/eax           000    r8/r8d
rcx/ecx           001    r9/r9d
rdx/edx           010    r10/r10d
rbx/ebx           011    r11/r11d
rsp/esp           100    r12/r12d
rbp/ebp           101    r13/r13d
rsi/esi           110    r14/r14d
rdi/edi           111    r15/r15d

2.5 Why This Matters for QEMU

HOW QEMU'S TCG WORKS (SIMPLIFIED)

QEMU's Tiny Code Generator (TCG) does exactly what you're building:

┌─────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  Guest Code        TCG Frontend       TCG IR         TCG Backend    │
│  (ARM, RISC-V,     (Per-target)      (Portable)     (Per-host)     │
│   MIPS, x86...)                                                     │
│                                                                      │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐   │
│  │ ARM      │     │ ARM      │     │ TCG ops  │     │ x86-64   │   │
│  │ add r0,  │ ──► │ frontend │ ──► │ add_i32  │ ──► │ backend  │   │
│  │  r1, r2  │     │          │     │ tmp0,    │     │          │   │
│  └──────────┘     └──────────┘     │ t1, t2   │     └──────────┘   │
│                                    └──────────┘           │         │
│                                                           ▼         │
│                                                    ┌──────────┐    │
│                                                    │ x86-64   │    │
│                                                    │ add eax, │    │
│                                                    │ ebx, ecx │    │
│                                                    └──────────┘    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Your Project: Simplified version (RISC-V → x86-64 directly)

┌─────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  RISC-V Code              Your Translator            x86-64 Code    │
│                                                                      │
│  ┌──────────────┐        ┌──────────────┐        ┌──────────────┐  │
│  │ add x10,     │        │              │        │ mov eax, ebx │  │
│  │  x11, x12    │  ────► │  Direct      │  ────► │ add eax, ecx │  │
│  │              │        │  Translation │        │              │  │
│  └──────────────┘        └──────────────┘        └──────────────┘  │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

After this project, you'll understand:
- Why QEMU can emulate any architecture on any host
- How TCG achieves 10-30x better performance than interpretation
- The tradeoffs between translation quality and compilation time
- Why hardware virtualization (VT-x) is still preferred when available

2.6 Common Misconceptions

Misconception 1: “Binary translation is just like a compiler”

Reality: Compilers have source code with types and structure. Binary translators work with raw machine code where intent must be inferred.

Misconception 2: “Translation is always better than interpretation”

Reality: Translation has startup cost. For code executed only once, interpretation is faster. This is why JIT compilers use interpretation initially.

Misconception 3: “You can always achieve native speed with translation”

Reality: Semantic differences between architectures (memory models, flags, etc.) require extra instructions. 70-90% native speed is typical.

Misconception 4: “x86-64 is simpler than RISC-V because it’s newer”

Reality: x86-64 is far more complex due to backwards compatibility. RISC-V’s regularity makes it easier to translate FROM.

3. Project Specification

3.1 What You Will Build

A static binary translator that:

Reads a RISC-V RV32I binary file
Identifies basic blocks in the code
Translates each basic block to x86-64 machine code
Links the translated blocks together
Outputs a native x86-64 executable

SYSTEM ARCHITECTURE

┌─────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  Input: fibonacci.rv32                                              │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ 00000000: 00100513  addi x10, x0, 1      # n = 1             │   │
│  │ 00000004: 00100593  addi x11, x0, 1      # a = 1             │   │
│  │ 00000008: 00000613  addi x12, x0, 0      # b = 0             │   │
│  │ ...                                                          │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              ▼                                       │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                    BINARY TRANSLATOR                          │   │
│  │                                                               │   │
│  │  ┌─────────────┐   ┌─────────────┐   ┌─────────────────────┐ │   │
│  │  │ 1. Load     │──►│ 2. Find     │──►│ 3. Translate        │ │   │
│  │  │    Binary   │   │    Blocks   │   │    Each Block       │ │   │
│  │  └─────────────┘   └─────────────┘   └─────────────────────┘ │   │
│  │         │                │                      │             │   │
│  │         ▼                ▼                      ▼             │   │
│  │  ┌─────────────┐   ┌─────────────┐   ┌─────────────────────┐ │   │
│  │  │ Parse RV32I │   │ Build CFG   │   │ Emit x86-64         │ │   │
│  │  │ opcodes     │   │ (Control    │   │ machine code        │ │   │
│  │  │             │   │ Flow Graph) │   │                     │ │   │
│  │  └─────────────┘   └─────────────┘   └─────────────────────┘ │   │
│  │                                                               │   │
│  │  ┌─────────────────────────────────────────────────────────┐ │   │
│  │  │ 4. Link Blocks                                          │ │   │
│  │  │    - Patch branch targets                               │ │   │
│  │  │    - Add runtime support code                           │ │   │
│  │  └─────────────────────────────────────────────────────────┘ │   │
│  │                                                               │   │
│  │  ┌─────────────────────────────────────────────────────────┐ │   │
│  │  │ 5. Generate Executable                                  │ │   │
│  │  │    - Write ELF header                                   │ │   │
│  │  │    - Write translated code                              │ │   │
│  │  └─────────────────────────────────────────────────────────┘ │   │
│  │                                                               │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              ▼                                       │
│  Output: fibonacci.native                                           │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │ Native x86-64 executable that runs directly on the host CPU  │   │
│  │ Fibonacci(30) = 832040 in 0.003s (vs 0.892s interpreted)     │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

3.2 Functional Requirements

Load RV32I binary files
- Parse simple raw binary format or minimal ELF
- Extract code section and entry point
Identify basic blocks
- Find block boundaries (branches, jump targets)
- Build control flow graph
Translate RV32I instructions to x86-64
- Support core RV32I instructions (ADD, SUB, AND, OR, XOR, SLL, SRL, SRA)
- Support memory operations (LW, SW, LB, SB, LH, SH)
- Support branches (BEQ, BNE, BLT, BGE, BLTU, BGEU)
- Support jumps (JAL, JALR)
- Support immediates (ADDI, ANDI, ORI, XORI, SLTI, LUI, AUIPC)
Generate valid x86-64 code
- Emit correct instruction encodings
- Handle register mapping and spilling
- Manage return address and stack
Link translated blocks
- Patch branch targets after all blocks are translated
- Handle forward references
Output executable
- Generate minimal ELF executable or raw code with loader

3.3 Non-Functional Requirements

Performance: Translated code should run 10-100x faster than interpretation
Correctness: Produce identical results to the interpreter for all inputs
Code quality: Clean, well-documented translation logic
Debuggability: Option to emit assembly text alongside machine code

3.4 Example Usage / Output

# Compile a simple RISC-V program
$ cat fibonacci.c
int fib(int n) {
    if (n <= 1) return n;
    return fib(n-1) + fib(n-2);
}
int main() {
    return fib(30);
}

$ riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -O2 -o fibonacci.rv32 fibonacci.c

# Run the translator
$ ./translator fibonacci.rv32 -o fibonacci.native -v
Binary Translator v0.1
======================
Loading fibonacci.rv32...
  Code size: 1,248 bytes
  Entry point: 0x10074

Finding basic blocks...
  Block 0x10074: 12 instructions (function: main)
  Block 0x10098: 8 instructions (function: fib)
  Block 0x100b4: 4 instructions (fib: base case)
  Block 0x100c0: 15 instructions (fib: recursive case)
  Block 0x10100: 3 instructions (fib: return)
  ... 47 total blocks

Translating blocks...
  Block 0x10074: 12 RISC-V → 8 x86-64 instructions
  Block 0x10098: 8 RISC-V → 6 x86-64 instructions
  Block 0x100b4: 4 RISC-V → 3 x86-64 instructions
  Block 0x100c0: 15 RISC-V → 11 x86-64 instructions
  Block 0x10100: 3 RISC-V → 2 x86-64 instructions
  ...

Linking blocks...
  Patching 52 branch targets
  Adding runtime support code

Generating executable...
  Output: fibonacci.native (2,891 bytes)

Translation statistics:
  RISC-V instructions: 423
  x86-64 instructions: 312
  Code expansion: 0.74x (smaller!)
  Translation time: 0.012s

# Run the translated program
$ ./fibonacci.native
Fibonacci(30) = 832040

# Compare performance
$ time ./interpreter fibonacci.rv32
Fibonacci(30) = 832040
real    0m0.892s

$ time ./fibonacci.native
Fibonacci(30) = 832040
real    0m0.003s

# Speedup: 297x!

4. Solution Architecture

4.1 High-Level Design

TRANSLATOR COMPONENT ARCHITECTURE

┌─────────────────────────────────────────────────────────────────────┐
│                           main.c                                     │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │ int main(int argc, char **argv) {                           │    │
│  │     Binary *bin = load_binary(input_file);                  │    │
│  │     CFG *cfg = find_basic_blocks(bin);                      │    │
│  │     TranslatedCode *code = translate_all_blocks(cfg);       │    │
│  │     link_blocks(code);                                      │    │
│  │     generate_executable(code, output_file);                 │    │
│  │ }                                                           │    │
│  └─────────────────────────────────────────────────────────────┘    │
└────────────────────────────────────────────────────────────────┬────┘
                                                                 │
            ┌────────────────┬───────────────┬──────────────────┼────────────────┐
            │                │               │                  │                │
            ▼                ▼               ▼                  ▼                ▼
┌───────────────────┐ ┌─────────────┐ ┌─────────────────┐ ┌─────────────┐ ┌─────────────┐
│   loader.c        │ │  cfg.c      │ │  translator.c   │ │  linker.c   │ │  emitter.c  │
│                   │ │             │ │                 │ │             │ │             │
│ - load_binary()   │ │ - find_     │ │ - translate_    │ │ - link_     │ │ - emit_     │
│ - parse_elf()     │ │   blocks()  │ │   block()       │ │   blocks()  │ │   prologue()│
│ - get_section()   │ │ - build_    │ │ - translate_    │ │ - patch_    │ │ - emit_     │
│                   │ │   cfg()     │ │   instruction() │ │   branches()│ │   epilogue()│
│                   │ │             │ │ - reg_alloc()   │ │             │ │ - write_    │
│                   │ │             │ │                 │ │             │ │   elf()     │
└───────────────────┘ └─────────────┘ └─────────────────┘ └─────────────┘ └─────────────┘
                                               │
                                               │
                              ┌────────────────┴────────────────┐
                              │                                 │
                              ▼                                 ▼
                    ┌─────────────────────┐         ┌─────────────────────┐
                    │   rv32_decode.c     │         │   x86_encode.c      │
                    │                     │         │                     │
                    │ - decode_r_type()   │         │ - emit_mov_reg()    │
                    │ - decode_i_type()   │         │ - emit_add_reg()    │
                    │ - decode_s_type()   │         │ - emit_jmp()        │
                    │ - decode_b_type()   │         │ - emit_call()       │
                    │ - decode_u_type()   │         │ - emit_ret()        │
                    │ - decode_j_type()   │         │ - ModRM encoding    │
                    └─────────────────────┘         └─────────────────────┘

4.2 Key Components

Component 1: Loader (`loader.c`)

Reads binary file into memory
Parses ELF headers (or simple raw format)
Extracts code, data, entry point

Component 2: CFG Builder (`cfg.c`)

Scans code for branch/jump instructions
Identifies basic block boundaries
Builds control flow graph with edges

Component 3: Translator (`translator.c`)

Main translation engine
Manages register allocation state
Coordinates instruction translation

Component 4: RISC-V Decoder (`rv32_decode.c`)

Decodes RISC-V instruction formats
Extracts opcode, rd, rs1, rs2, immediates
Validates instruction encoding

Component 5: x86-64 Encoder (`x86_encode.c`)

Emits x86-64 machine code bytes
Handles REX prefixes, ModR/M, SIB
Manages code buffer

Component 6: Linker (`linker.c`)

Patches branch targets after translation
Handles forward references
Adds runtime support code

Component 7: Emitter (`emitter.c`)

Generates final executable
Writes ELF headers
Manages sections

4.3 Data Structures

// Basic block representation
typedef struct BasicBlock {
    uint32_t start_addr;      // RISC-V address
    uint32_t end_addr;
    int num_instructions;

    uint32_t *rv_instructions; // Original RISC-V code

    uint8_t *x86_code;        // Translated x86-64 code
    int x86_code_size;
    uint64_t x86_addr;        // Address in output

    // Control flow
    struct BasicBlock *fall_through;
    struct BasicBlock *branch_target;
    int branch_type;          // BRANCH_COND, BRANCH_UNCOND, BRANCH_CALL, etc.

    // Linking
    int num_fixups;
    struct BranchFixup *fixups;
} BasicBlock;

// Control flow graph
typedef struct CFG {
    BasicBlock *blocks;
    int num_blocks;
    BasicBlock *entry_block;

    // Block lookup by RISC-V address
    struct BlockMap *addr_map;
} CFG;

// Register allocation state
typedef struct RegAlloc {
    int rv_to_x86[32];        // Guest reg -> host reg mapping (-1 = spilled)
    int x86_to_rv[16];        // Host reg -> guest reg mapping (-1 = free)
    bool dirty[32];           // Does guest reg need writeback?

    // Spill management
    uint64_t spill_base;      // Base address of spill area
} RegAlloc;

// Branch fixup for linking
typedef struct BranchFixup {
    uint64_t patch_offset;    // Offset in x86 code to patch
    uint32_t target_rv_addr;  // RISC-V address of target
    int fixup_type;           // FIXUP_REL32, FIXUP_ABS64
} BranchFixup;

// Code buffer
typedef struct CodeBuffer {
    uint8_t *code;
    int size;
    int capacity;
} CodeBuffer;

4.4 Algorithm Overview

TRANSLATION ALGORITHM

For each basic block:
┌─────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  1. DECODE all RISC-V instructions in block                         │
│     ┌────────────────────────────────────────────────────────────┐  │
│     │ for each instruction:                                       │  │
│     │   decode opcode, funct3, funct7                            │  │
│     │   extract rd, rs1, rs2, immediate                          │  │
│     │   store in instruction array                                │  │
│     └────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  2. TRANSLATE each instruction                                      │
│     ┌────────────────────────────────────────────────────────────┐  │
│     │ for each instruction:                                       │  │
│     │   allocate host registers for operands                     │  │
│     │   emit x86-64 code for operation                           │  │
│     │   update register allocation state                         │  │
│     └────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  3. HANDLE block terminator (branch/jump)                           │
│     ┌────────────────────────────────────────────────────────────┐  │
│     │ if unconditional jump:                                      │  │
│     │   emit JMP (target to be patched)                          │  │
│     │   add fixup entry                                          │  │
│     │                                                             │  │
│     │ if conditional branch:                                      │  │
│     │   emit comparison (CMP/TEST)                               │  │
│     │   emit conditional jump (JE/JNE/JL/JGE)                    │  │
│     │   add fixup entry                                          │  │
│     │   fall through to next block                               │  │
│     │                                                             │  │
│     │ if return:                                                  │  │
│     │   emit RET                                                 │  │
│     └────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  4. RECORD block information                                        │
│     - x86 code bytes                                                │
│     - size                                                          │
│     - fixup locations                                               │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

After all blocks translated:
┌─────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  5. ASSIGN addresses to all blocks                                  │
│     ┌────────────────────────────────────────────────────────────┐  │
│     │ base_addr = code_start                                      │  │
│     │ for each block:                                             │  │
│     │   block.x86_addr = base_addr                               │  │
│     │   base_addr += block.x86_code_size                         │  │
│     └────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  6. PATCH all branch fixups                                         │
│     ┌────────────────────────────────────────────────────────────┐  │
│     │ for each block:                                             │  │
│     │   for each fixup:                                          │  │
│     │     target_block = lookup(fixup.target_rv_addr)            │  │
│     │     rel32 = target_block.x86_addr - (fixup_addr + 4)       │  │
│     │     write rel32 at fixup.patch_offset                      │  │
│     └────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  7. GENERATE executable                                             │
│     - ELF header                                                    │
│     - Program headers                                               │
│     - Code section (all translated blocks)                          │
│     - Data section (guest memory, spill area)                       │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

5. Implementation Guide

5.1 Development Environment Setup

# Install required tools
# On Ubuntu/Debian:
sudo apt-get install build-essential
sudo apt-get install binutils  # For objdump to verify output

# RISC-V toolchain for test programs:
sudo apt-get install gcc-riscv64-unknown-elf

# On macOS:
brew install riscv-gnu-toolchain

# Verify installation
riscv64-unknown-elf-gcc --version

5.2 Project Structure

basic_block_translator/
├── Makefile
├── include/
│   ├── translator.h      # Main header
│   ├── rv32_decode.h     # RISC-V decoder
│   ├── x86_encode.h      # x86-64 encoder
│   ├── cfg.h             # Control flow graph
│   └── loader.h          # Binary loader
├── src/
│   ├── main.c            # Entry point
│   ├── loader.c          # Binary loading
│   ├── cfg.c             # Basic block finder
│   ├── translator.c      # Translation engine
│   ├── rv32_decode.c     # RISC-V instruction decoder
│   ├── x86_encode.c      # x86-64 code emitter
│   ├── linker.c          # Block linking
│   └── emitter.c         # Executable generation
├── test/
│   ├── test_decode.c     # Unit tests for decoder
│   ├── test_encode.c     # Unit tests for encoder
│   ├── simple.c          # Simple test program
│   ├── fibonacci.c       # Recursive test
│   └── loops.c           # Loop test
└── runtime/
    └── support.S         # Runtime support code (optional)

5.3 The Core Question You’re Answering

“How can we transform code written for one CPU architecture into code that runs natively on a different architecture, achieving near-native performance?”

This is the fundamental question behind:

QEMU’s cross-architecture emulation
Apple’s Rosetta 2 (x86 on ARM)
Wine’s Windows-on-Linux support
Game console emulators

5.4 Concepts You Must Understand First

Self-assessment questions before starting:

RISC-V Instruction Formats: Can you decode 0x00A58533 into add x10, x11, x10 by hand?
x86-64 Encoding: Can you write the bytes for mov rax, rbx without looking it up? (Answer: 48 89 D8)
Register Allocation: Why is it important to minimize memory accesses? What is “spilling”?
Control Flow: How is a conditional branch different from a function call at the machine level?
Relocations: Why can’t we know branch target addresses until all code is generated?

5.5 Questions to Guide Your Design

Architecture Questions:

How will you represent basic blocks in memory?
How will you handle forward branches (to blocks not yet translated)?
Should you translate all blocks upfront, or lazily on first execution?

Register Allocation Questions:

Which RISC-V registers map to which x86-64 registers?
What happens when you need more host registers than available?
How do you handle x0 (which is always zero in RISC-V)?

Code Generation Questions:

How do you handle RISC-V’s signed/unsigned branch variants?
How do you implement LUI+AUIPC for building 32-bit constants?
What x86 instructions correspond to RISC-V’s shift operations?

Performance Questions:

Is it faster to emit multiple simple instructions or fewer complex ones?
When should you use memory operands vs. explicit loads?
How can you minimize code size while maintaining performance?

5.6 Thinking Exercise

Before writing any code, trace through this translation by hand:

RISC-V Basic Block:
┌────────────────────────────────────────────────────────────────────┐
│ 0x1000: addi x10, x0, 5     # x10 = 5                             │
│ 0x1004: addi x11, x0, 3     # x11 = 3                             │
│ 0x1008: add  x12, x10, x11  # x12 = x10 + x11                     │
│ 0x100c: sw   x12, 0(x1)     # store x12 to address in x1          │
│ 0x1010: bne  x12, x0, -16   # if x12 != 0, jump back to 0x1000    │
└────────────────────────────────────────────────────────────────────┘

Questions:
1. What are the basic block boundaries?
2. Using this register mapping:
   x10 → rax, x11 → rbx, x12 → rcx, x1 → r8
   What x86-64 instructions would you generate?
3. How do you handle the branch target (0x1000)?
4. What if x1 needs to be loaded from memory first?

Write out your expected x86-64 translation before reading the hints!

5.7 Hints in Layers

Hint 1 - Starting Point (Conceptual Direction)

Start simple: first make a “transpiler” that outputs C code instead of machine code. This lets you verify translation logic without worrying about x86 encoding.

// Instead of emitting bytes, emit printf:
void translate_add(int rd, int rs1, int rs2) {
    printf("    cpu.x[%d] = cpu.x[%d] + cpu.x[%d];\n", rd, rs1, rs2);
}

Then compile and run the C output. Once correct, switch to emitting actual machine code.

Hint 2 - Next Level (More Specific Guidance)

For x86 code emission, create helper functions for common patterns:

// Emit: mov reg1, reg2
void emit_mov_rr(CodeBuffer *buf, int dst, int src) {
    emit_byte(buf, 0x48);  // REX.W
    emit_byte(buf, 0x89);  // MOV r/m64, r64
    emit_byte(buf, 0xC0 | (src << 3) | dst);  // ModR/M
}

// Emit: add reg1, reg2
void emit_add_rr(CodeBuffer *buf, int dst, int src) {
    emit_byte(buf, 0x48);  // REX.W
    emit_byte(buf, 0x01);  // ADD r/m64, r64
    emit_byte(buf, 0xC0 | (src << 3) | dst);  // ModR/M
}

Build up your x86 encoder from these primitives.

Hint 3 - Technical Details (Approach/Pseudocode)

Translation for common instructions:

void translate_instruction(TransState *s, uint32_t insn) {
    uint32_t opcode = insn & 0x7F;

    switch (opcode) {
    case 0x33:  // R-type (ADD, SUB, etc.)
        int rd = (insn >> 7) & 0x1F;
        int rs1 = (insn >> 15) & 0x1F;
        int rs2 = (insn >> 20) & 0x1F;
        int funct3 = (insn >> 12) & 0x7;
        int funct7 = (insn >> 25) & 0x7F;

        // Get host registers
        int h_rs1 = get_or_load(s, rs1);  // Load rs1 to host reg
        int h_rs2 = get_or_load(s, rs2);  // Load rs2 to host reg
        int h_rd = alloc_for_write(s, rd); // Allocate for rd

        if (h_rd != h_rs1) {
            emit_mov_rr(s->code, h_rd, h_rs1);
        }

        switch (funct7 << 3 | funct3) {
        case 0x00:  // ADD
            emit_add_rr(s->code, h_rd, h_rs2);
            break;
        case 0x100: // SUB
            emit_sub_rr(s->code, h_rd, h_rs2);
            break;
        // ... other operations
        }
        break;

    case 0x63:  // B-type (BEQ, BNE, etc.)
        // ... branch handling
        break;

    // ... other instruction types
    }
}

Hint 4 - Tools/Debugging (Verification Methods)

Use objdump to verify your generated code:

# Disassemble your output
objdump -d fibonacci.native

# Compare with expected:
# Should see x86-64 instructions that implement the RISC-V logic

# Test with a known program:
echo 'int main() { return 42; }' > test.c
riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 test.c -o test.rv32
./translator test.rv32 -o test.native
./test.native
echo $?  # Should print 42

Use GDB to step through translated code:

gdb ./fibonacci.native
(gdb) break *0x401000  # First instruction
(gdb) run
(gdb) si  # Single step
(gdb) info registers

5.8 The Interview Questions They’ll Ask

“How does binary translation differ from interpretation, and when would you use each?”
- Expected: Discuss startup cost vs execution speed tradeoff, hot path optimization
“What is a basic block and why is it the unit of translation?”
- Expected: Single entry, single exit, no internal branches, enables optimization
“How do you handle register mapping when the guest has more registers than the host?”
- Expected: Discuss spilling, allocation policies, liveness analysis
“What happens when translated code needs to call back into the runtime?”
- Expected: ABI concerns, context saving, mode switching
“How would you extend this translator to support self-modifying code?”
- Expected: Code cache invalidation, checking for writes to translated regions
“Why is x86-64 instruction encoding so complex compared to RISC-V?”
- Expected: Variable length, legacy prefixes, multiple addressing modes, backwards compatibility
“How does QEMU’s TCG handle cross-architecture translation?”
- Expected: Intermediate representation, frontend/backend separation, runtime helpers

5.9 Books That Will Help

Topic	Book	Chapter
Basic blocks and CFG	Engineering a Compiler	Chapter 8
Instruction selection	Engineering a Compiler	Chapter 11
Register allocation	Engineering a Compiler	Chapter 13
x86-64 encoding	Intel SDM Volume 2	Instruction Reference
x86-64 assembly	Modern X86 Assembly Language Programming	Chapters 1-4
RISC-V ISA	RISC-V Specification	Chapters 2, 12
Binary translation	Virtual Machines (Smith & Nair)	Chapter 2

5.10 Implementation Phases

Phase 1: RISC-V Decoder (3-4 days)

Implement instruction format parsing
Test with known instruction bytes
Build opcode dispatch table

Phase 2: Basic Block Finder (2-3 days)

Scan for branches and jump targets
Build basic block list
Verify with simple programs

Phase 3: x86-64 Encoder (4-5 days)

Implement core instruction encodings
Test each instruction in isolation
Verify with objdump

Phase 4: Translator Core (5-7 days)

Implement instruction translation
Handle register allocation
Test arithmetic programs

Phase 5: Control Flow (3-4 days)

Implement branch translation
Add block linking
Test with loops

Phase 6: Executable Generation (2-3 days)

Generate ELF headers
Write code section
Test complete programs

Phase 7: Optimization (Optional, 3-5 days)

Peephole optimizations
Improved register allocation
Benchmark and profile

5.11 Key Implementation Decisions

Decision 1: Register Allocation Strategy

Simple: All guest registers in memory, load/store for each use
Better: Map hot registers (x10-x17) to host registers, spill others
Trade-off: Complexity vs performance

Decision 2: Code Layout

Option A: Translate all blocks upfront, then link
Option B: Lazy translation on first execution (like JIT)
Trade-off: Startup time vs implementation complexity

Decision 3: Branch Handling

Option A: Always indirect through dispatch table
Option B: Direct branches with patching
Trade-off: Flexibility vs performance

Decision 4: Memory Access Translation

Option A: Inline address calculation
Option B: Call helper functions
Trade-off: Code size vs simplicity

6. Testing Strategy

6.1 Unit Tests

// test_decode.c
void test_decode_add() {
    // add x10, x11, x12 = 0x00C58533
    RVInsn insn = decode(0x00C58533);
    assert(insn.opcode == OP_ADD);
    assert(insn.rd == 10);
    assert(insn.rs1 == 11);
    assert(insn.rs2 == 12);
}

// test_encode.c
void test_encode_mov() {
    CodeBuffer buf = {0};
    emit_mov_rr(&buf, RAX, RBX);
    assert(buf.size == 3);
    assert(buf.code[0] == 0x48);
    assert(buf.code[1] == 0x89);
    assert(buf.code[2] == 0xD8);
}

6.2 Integration Tests

// Simple arithmetic
int test_add() {
    int a = 5;
    int b = 3;
    return a + b;  // Should return 8
}

// Loops
int test_sum() {
    int sum = 0;
    for (int i = 0; i < 10; i++) {
        sum += i;
    }
    return sum;  // Should return 45
}

// Recursion
int test_fib(int n) {
    if (n <= 1) return n;
    return test_fib(n-1) + test_fib(n-2);
}
// test_fib(10) should return 55

6.3 Verification Process

# For each test:
# 1. Compile to RISC-V
riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 test.c -o test.rv32

# 2. Run with interpreter (reference)
./rv32_interpreter test.rv32
# Record result

# 3. Translate
./translator test.rv32 -o test.native

# 4. Run translated
./test.native
# Compare result

# 5. Verify correctness
if [ $interpreter_result -eq $translated_result ]; then
    echo "PASS"
else
    echo "FAIL"
fi

7. Common Pitfalls & Debugging

Problem	Root Cause	Fix	Verification
“Segmentation fault on execution”	Invalid x86 encoding	Check REX prefix, ModR/M byte	objdump -d output.native
“Wrong arithmetic results”	Sign extension issues	Use MOVSX for signed loads	Test with negative numbers
“Branches go to wrong places”	Relative offset calculation wrong	Remember: offset is from AFTER the jump instruction	Single-step in GDB
“Infinite loops”	Back-edge not properly linked	Check fixup patching for negative offsets	Add loop counter limit
“Works for small programs, crashes on large”	Code buffer overflow	Dynamic buffer growth	Test with 10K+ instruction programs
“Registers corrupted across blocks”	Not saving/restoring at block boundaries	Ensure consistent register state	Print registers at block entry/exit
“Stack corruption”	RSP modified incorrectly	Keep RSP aligned to 16 bytes	Check RSP in GDB

Debugging Tips

# 1. Add verbose output during translation
./translator --verbose test.rv32
# Shows each instruction translation

# 2. Emit assembly alongside binary
./translator --emit-asm test.rv32
# Generates test.native.s for review

# 3. Use GDB effectively
gdb ./test.native
(gdb) layout asm           # Show disassembly
(gdb) layout regs          # Show registers
(gdb) break *0x401000      # Break at start of translated code
(gdb) x/20i $rip           # Examine next 20 instructions
(gdb) si                   # Single step
(gdb) p/x $rax             # Print register

# 4. Compare with reference
./rv32_interpreter --trace test.rv32 > interp.log
./test.native --trace > native.log  # (if you add tracing)
diff interp.log native.log

8. Extensions & Challenges

Extension 1: Optimization Passes

Add peephole optimizations:

Remove redundant MOV instructions
Combine constant loads
Strength reduction (MUL by power of 2 → shift)

Extension 2: Better Register Allocation

Implement graph-coloring register allocation:

Build interference graph
Color with k colors (k = available registers)
Spill lowest-priority registers

Extension 3: Block Chaining

Instead of returning to dispatcher after each block:

Patch translated blocks to jump directly to each other
Only return to dispatcher for indirect branches

Extension 4: Dynamic Translation

Convert to JIT:

Start with interpreter
Detect hot loops (backward branches executed N times)
Compile hot blocks on demand

Extension 5: More Instructions

Add support for:

RV32M (multiply/divide)
RV32F (floating point)
RV32A (atomics)
Compressed instructions (RVC)

Extension 6: Self-Modifying Code

Handle code that writes to code regions:

Write-protect translated code regions
Invalidate on write
Re-translate modified blocks

9. Real-World Connections

QEMU’s TCG (Tiny Code Generator)

Your project implements a simplified version of QEMU’s TCG:

QEMU TCG Architecture:
┌─────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  Your Project                        QEMU TCG                       │
│  ────────────                        ────────                       │
│  RISC-V → x86-64 (direct)           Guest → TCG IR → Host          │
│  Static translation                  Dynamic (JIT)                  │
│  Single target                       Multiple targets               │
│  Basic blocks only                   Translation blocks + chaining  │
│                                                                      │
│  Key Similarities:                                                  │
│  - Basic block as unit of translation                               │
│  - Register mapping guest → host                                    │
│  - Patching branch targets                                          │
│  - Code cache management                                            │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Study QEMU’s source:

tcg/tcg.c - TCG core
tcg/i386/tcg-target.c.inc - x86 backend
target/riscv/translate.c - RISC-V frontend

Apple’s Rosetta 2

Rosetta 2 translates x86-64 to ARM64, the reverse of your project:

Ahead-of-time translation for binaries
JIT for dynamic code
Handles complex x86 semantics (flags, memory ordering)

Other Systems

FEX-Emu: Linux x86-64 on ARM64
Box64: x86-64 emulator for ARM
Dynamo: HP’s dynamic optimization system
Pin: Intel’s dynamic instrumentation tool

10. Resources

Online References

RISC-V Specification - Official ISA reference
Intel SDM Volume 2 - x86 instruction reference
x86-64 Instruction Encoding - OSDev Wiki
Introduction to Dynamic Recompilation - Zenogais tutorial

Source Code to Study

QEMU TCG: git clone https://gitlab.com/qemu-project/qemu.git
rv8: RISC-V to x86 translator: https://github.com/rv8-io/rv8
TinyEMU: Small RISC-V emulator: https://bellard.org/tinyemu/

Academic Papers

“Binary Translation” - Sites, Chernoff (1993) - The classic paper
“Efficient Method for Binary Translation” - Cifuentes, Malhotra (1996)
“QEMU, a Fast and Portable Dynamic Translator” - Bellard (2005)

11. Self-Assessment Checklist

Milestone 1: Foundation Complete

Can decode all RV32I instruction formats
Can emit basic x86-64 instructions (MOV, ADD, SUB)
Unit tests pass for decoder and encoder

Milestone 2: Single Block Translation

Can translate a basic block with arithmetic instructions
Register allocation works for mapped registers
Translated code executes and produces correct results

Milestone 3: Control Flow

Can identify basic blocks in a program
Branch instructions translate correctly
Block linking works for simple programs

Milestone 4: Complete Programs

Can translate and run fibonacci(30)
Performance is 10x+ better than interpretation
All test programs produce correct results

Milestone 5: Robustness

Handles all RV32I instructions
Works with GCC-compiled programs
No memory leaks or crashes on large programs

12. Submission / Completion Criteria

You have successfully completed this project when:

Correctness: Your translator produces executables that compute the same results as the RISC-V interpreter for all test programs:
- Simple arithmetic
- Loops (for, while)
- Recursion (fibonacci, factorial)
- Array access
- Function calls
Performance: Translated code runs at least 10x faster than interpretation: ``` fibonacci(30):
- Interpreter: > 0.5 seconds
- Translated: < 0.05 seconds ```
Code Quality:
- Clean separation between components
- Well-documented translation logic
- Comprehensive test suite
Understanding: You can explain:
- How basic blocks are identified
- Why register mapping matters
- How branch patching works
- The trade-offs between translation and interpretation

Demonstration: You can show:

# Compile RISC-V program
$ riscv64-unknown-elf-gcc -march=rv32i test.c -o test.rv32

# Translate to native
$ ./translator test.rv32 -o test.native

# Run and verify
$ ./test.native
Expected output: [correct result]

13. What’s Next?

After completing this project, you’re ready for:

Project 4: Simple JIT Compiler - Add runtime compilation and hot path detection
Project 5: Shadow Page Table Simulator - Understand memory virtualization
Project 10-12: Hardware Virtualization - Use Intel VT-x for near-native performance

You now understand the core of how QEMU achieves cross-architecture emulation. The next step is dynamic translation (JIT), where code is compiled at runtime rather than ahead-of-time. This enables optimizations based on runtime behavior and handles dynamically-generated code.

After completing this project, you’ll understand why binary translation exists, how it achieves massive speedups over interpretation, and how systems like QEMU can run software for any architecture on any host. This is foundational knowledge for anyone working on virtualization, emulators, or low-level systems.