
LEARN DISASSEMBLY FROM SCRATCH

Learn Disassembly: By Building an x86-64 Disassembler in C

Goal: To deeply understand how disassembly works by building your own disassembler from scratch. You will learn how human-readable assembly instructions are encoded into the raw bytes of machine code and write a tool to translate them back.


Why Build a Disassembler?

Using a disassembler like objdump, Ghidra, or IDA is easy. But how do they work? How do they turn a sea of bytes like 48 89 e5 into the instruction mov rbp, rsp? This process is not magic, but a well-defined decoding of a complex instruction format.

Building your own disassembler is one of the most direct ways to learn this process. You will not be building a full-featured tool, but a simple engine that forces you to understand:

  • The x86-64 Instruction Format: The core of this project. You’ll learn about opcodes, prefixes, and the complex ModR/M and SIB bytes that describe operands.
  • Variable-Length Instructions: The primary challenge of x86 disassembly. An instruction can be anywhere from 1 to 15 bytes long.
  • CPU Operating Modes: You’ll see how 64-bit mode changes decoding: the REX prefix selects 64-bit operand sizes and gives access to the extra registers R8-R15.
  • Disassembly Algorithms: You’ll start with a simple “Linear Sweep” and understand its limitations.

After completing this project, you will never look at an executable file the same way again. You will see the raw bytes and have a fundamental understanding of what they mean.


Core Concept Analysis: The x86-64 Instruction Format

The central challenge is that x86-64 instructions are variable-length. Your disassembler’s main job is to figure out, for each instruction, “how many bytes does this instruction consume?” To do this, you must parse the instruction format, which is composed of several optional and mandatory parts in a specific order.

[Legacy Prefixes] [REX Prefix] [Opcode]    [ModR/M]   [SIB]      [Displacement] [Immediate]
   (0-4 bytes)     (0-1 byte)  (1-3 bytes) (0-1 byte) (0-1 byte)  (0-8 bytes)    (0-8 bytes)
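
One way to keep this layout in view while coding is a small record of how many bytes each part occupies. The sketch below is illustrative only: `insn_layout`, `insn_length`, and `X86_MAX_INSN_LEN` are hypothetical names chosen for this example, and the size comments restate the ranges from the diagram above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: byte count of each part of one encoded instruction. */
typedef struct {
    uint8_t prefixes;     /* legacy prefixes: 0-4 */
    uint8_t rex;          /* REX prefix: 0 or 1 */
    uint8_t opcode;       /* opcode: 1-3 */
    uint8_t modrm;        /* ModR/M: 0 or 1 */
    uint8_t sib;          /* SIB: 0 or 1 */
    uint8_t displacement; /* 0-8 */
    uint8_t immediate;    /* 0-8 */
} insn_layout;

#define X86_MAX_INSN_LEN 15

/* Sums the parts; returns 0 if the total is outside the legal 1-15 range. */
int insn_length(const insn_layout *l) {
    assert(l != NULL);
    int n = l->prefixes + l->rex + l->opcode + l->modrm +
            l->sib + l->displacement + l->immediate;
    return (n >= 1 && n <= X86_MAX_INSN_LEN) ? n : 0;
}
```

For 48 89 e5 the layout would be one REX byte, one opcode byte, and one ModR/M byte, so `insn_length` reports 3.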

Key Components You Will Decode

  1. Opcode (The Verb): This is the main part of the instruction, telling the CPU what to do (e.g., MOV, ADD, JMP). It’s 1-3 bytes long. Your disassembler will have a giant lookup table or switch statement for this.
  2. ModR/M Byte (The Nouns): This incredibly important byte tells the CPU where to find its operands. It specifies whether the operands are registers or memory locations. It is divided into three fields:
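    • Mod (2 bits): Selects the addressing mode. 11 means R/M names a register; 00, 01, and 10 typically mean a memory operand with no, an 8-bit, or a 32-bit displacement.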
    • Reg (3 bits): Identifies a register operand.
    • R/M (3 bits): Identifies a register or memory operand.
  3. REX Prefix (The 64-bit Extender): In 64-bit mode, this optional byte (0x40 to 0x4F) is used to access the extended registers (R8-R15) and to specify the use of 64-bit operand sizes. Your disassembler must check for this prefix to correctly identify registers and operand sizes.
  4. SIB Byte (Scale-Index-Base): An optional byte that follows the ModR/M byte to define more complex memory addressing, like [rax + rcx*8].
  5. Displacement & Immediate: Raw byte values embedded in the instruction. A displacement is part of a memory address calculation (e.g., the 0x20 in [rbp - 0x20]). An immediate is a constant value used in an operation (e.g., the 0x5 in add eax, 5).

The Project: Build “Deconstruct”, an x86-64 Disassembler

This is a single project broken into stages. Each stage will add support for a new part of the x86-64 instruction format. You will build a C program that reads raw bytes from an array and prints the corresponding assembly mnemonics.

Main Programming Language: C (C99 or later)

Main “Book”: The Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM), Volume 2: Instruction Set Reference. This is the authoritative source. No other resource is as complete. Have it open at all times.

Helpful Resource: The OSDev.org x86-64 Opcode List provides a friendlier, though less complete, view.


Stage 1: The Basic Loop and Single-Byte Opcodes

  • File: LEARN_DISASSEMBLY_FROM_SCRATCH.md
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: C Programming / CPU Architecture
  • Goal: Build the main loop and handle the simplest instructions.

What you’ll build: A C program that takes a byte array, loops through it, and decodes only single-byte opcodes.

Why it teaches disassembly: This establishes the core “read-decode-advance” loop. It forces you to create your first opcode table and see the direct byte-to-instruction relationship in its simplest form.

Core challenges you’ll face:

  • The Main Loop → maps to for (int pc = 0; pc < len; ) where pc is advanced by the size of the decoded instruction
  • Opcode Lookup Table → maps to a giant switch statement or an array of structs/strings
  • Decoding an instruction → maps to a function int decode_instruction(uint8_t* buffer, int pc) that returns the number of bytes consumed

Key Concepts:

  • Instruction Pointer: The concept of a program counter (pc) that points to the start of the next instruction to be decoded.
  • Opcode Map: For single-byte opcodes, a direct mapping from a byte value to a mnemonic.

Real world outcome: Your program takes an array like uint8_t code[] = {0x90, 0xc3, 0x50, 0x58}; and prints:

0x0000: 90          nop
0x0001: c3          ret
0x0002: 50          push rax
0x0003: 58          pop rax
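
The read-decode-advance loop behind a listing like this could be sketched as follows. This is a minimal illustration, not a fixed design: `decode_one` and `disassemble` are hypothetical names, and the decoder only knows the four opcodes from the sample array.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical Stage 1 decoder: stores the mnemonic, returns bytes consumed. */
int decode_one(const uint8_t *buf, int pc, const char **mnemonic) {
    switch (buf[pc]) {
        case 0x90: *mnemonic = "nop";       return 1;
        case 0xc3: *mnemonic = "ret";       return 1;
        case 0x50: *mnemonic = "push rax";  return 1;
        case 0x58: *mnemonic = "pop rax";   return 1;
        default:   *mnemonic = "(unknown)"; return 1; /* skip one byte */
    }
}

/* Linear loop over the buffer; returns how many instructions were printed. */
int disassemble(const uint8_t *buf, int len, FILE *out) {
    assert(buf != NULL && out != NULL);
    int count = 0;
    for (int pc = 0; pc < len; ) {
        const char *mnemonic;
        int size = decode_one(buf, pc, &mnemonic);

        fprintf(out, "0x%04x: ", (unsigned)pc);
        for (int i = 0; i < size; i++)
            fprintf(out, "%02x ", buf[pc + i]);     /* raw bytes */

        int pad = 12 - 3 * size;                    /* align the mnemonic column */
        if (pad < 1) pad = 1;
        fprintf(out, "%*s%s\n", pad, "", mnemonic);

        pc += size;                                 /* advance by decoded length */
        count++;
    }
    return count;
}
```

Calling `disassemble(code, sizeof code, stdout)` on the four-byte array above walks it one instruction at a time. Returning the length from the decoder, rather than advancing inside it, keeps the loop unchanged as later stages add multi-byte instructions.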

Implementation Hints: Your main decode function will start simple:

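#include <stdint.h>
#include <stdio.h>

// Decodes the instruction at buffer[pc], prints its mnemonic,
// and returns the number of bytes it occupies.
int decode_instruction(const uint8_t* buffer, int pc) {
    uint8_t opcode = buffer[pc];
    switch (opcode) {
        case 0x90: printf("nop\n"); return 1;
        case 0xc3: printf("ret\n"); return 1;
        // Opcodes 0x50 through 0x57 are PUSH register
        case 0x50: printf("push rax\n"); return 1;
        // ... more cases
        default: printf("unknown\n"); return 1; // skip one byte and keep going
    }
}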

Notice how instructions for push on different registers have sequential opcodes. You can use this to your advantage: if (opcode >= 0x50 && opcode <= 0x57) ....
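
A range check like that might be combined with a register-name table so that one branch covers eight opcodes. The following is a sketch under that assumption; `decode_push_pop` and `REG64` are hypothetical names, and REX-extended registers are left for Stage 2.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* 64-bit register names indexed by their 3-bit encoding. */
static const char *const REG64[8] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"
};

/* Formats PUSH r64 (0x50-0x57) or POP r64 (0x58-0x5F) into out.
   Returns the number of bytes consumed, or 0 if the opcode is neither. */
int decode_push_pop(uint8_t opcode, char *out, size_t out_size) {
    assert(out != NULL);
    if (opcode >= 0x50 && opcode <= 0x57) {
        /* the low three bits of the opcode select the register */
        snprintf(out, out_size, "push %s", REG64[opcode & 7]);
        return 1;
    }
    if (opcode >= 0x58 && opcode <= 0x5f) {
        snprintf(out, out_size, "pop %s", REG64[opcode & 7]);
        return 1;
    }
    return 0;
}
```

With this helper, 0x55 formats as push rbp and 0x58 as pop rax, without a separate case for each register.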


Stage 2: The REX Prefix and 64-bit Operands

  • File: LEARN_DISASSEMBLY_FROM_SCRATCH.md
  • Difficulty: Level 2: Intermediate
  • Goal: Add support for 64-bit mode extensions.

What you’ll build: You will augment your decoder to first check for a REX prefix (0x40 - 0x4F). If present, you’ll parse its bits to determine if the operand size is 64-bit and if extended registers (R8-R15) are being used.

Why it teaches disassembly: The REX prefix is fundamental to x86-64. This stage teaches you that an instruction is not just its opcode; optional prefixes can fundamentally change its meaning and the registers it operates on.

Core challenges you’ll face:

  • Detecting the REX prefix → maps to checking if the byte at pc is in the 0x40-0x4F range
  • Parsing REX bits → maps to using bitwise operations to check the W (64-bit), R (ModR/M Reg extension), X (SIB Index extension), and B (ModR/M R/M extension) bits
  • Storing the REX info → maps to passing a rex_info struct or flags through your decoding functions

Key Concepts:

  • Intel SDM, Vol 2, Section 2.2.1: REX Prefixes.
  • Bitwise Operations: Using & and >> to extract information from a byte.

Real world outcome: Your program can now distinguish between 50 (push rax) and 41 50 (push r8).

// REX byte is 0100WRXB in binary
// W bit (bit 3): 1 = 64-bit operand size
// R bit (bit 2): Extends ModR/M 'reg' field
// X bit (bit 1): Extends SIB 'index' field
// B bit (bit 0): Extends ModR/M 'r/m' field, SIB 'base' field, or 'reg' field in opcode

// In your decode function, before the switch (bool needs <stdbool.h>):
bool rex_w = false;
bool rex_b = false;
//...
if (buffer[pc] >= 0x40 && buffer[pc] <= 0x4F) {
    uint8_t rex = buffer[pc];
    rex_w = (rex & 0x08) != 0; // hex masks: 0b... literals are not standard C99
    rex_b = (rex & 0x01) != 0;
    // ...
    pc++; // Consume the REX prefix
}
uint8_t opcode = buffer[pc];
//...
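
To pass the REX information through later decoding functions, the flags could be collected in a small struct. This is a sketch of one possible shape; `rex_info`, `parse_rex`, and `opcode_reg` are hypothetical names, and the code assumes 64-bit mode, where bytes 0x40-0x4F are always REX prefixes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the four REX flag bits. */
typedef struct {
    bool present;
    bool w; /* 64-bit operand size */
    bool r; /* extends ModR/M reg */
    bool x; /* extends SIB index */
    bool b; /* extends ModR/M r/m, SIB base, or opcode reg */
} rex_info;

/* If buf[*pc] is a REX prefix, parses it and advances *pc past it. */
rex_info parse_rex(const uint8_t *buf, int len, int *pc) {
    assert(buf != NULL && pc != NULL);
    rex_info rex = { false, false, false, false, false };
    if (*pc < len && (buf[*pc] & 0xF0) == 0x40) {
        uint8_t byte = buf[*pc];
        rex.present = true;
        rex.w = (byte & 0x08) != 0;
        rex.r = (byte & 0x04) != 0;
        rex.x = (byte & 0x02) != 0;
        rex.b = (byte & 0x01) != 0;
        (*pc)++;                    /* consume the prefix */
    }
    return rex;
}

/* Register number (0-15) for opcodes that embed a register in bits 0-2. */
int opcode_reg(uint8_t opcode, rex_info rex) {
    return (opcode & 7) | (rex.b ? 8 : 0);
}
```

For 41 50, `parse_rex` reports B=1 and leaves pc at the opcode, so `opcode_reg` yields 8 (r8); for a bare 50 it yields 0 (rax).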

Stage 3: The ModR/M Byte and Register Operands

  • File: LEARN_DISASSEMBLY_FROM_SCRATCH.md
  • Difficulty: Level 3: Advanced
  • Goal: Decode instructions with two register operands.

What you’ll build: Support for opcodes that are followed by a ModR/M byte. You will start by only handling the case where Mod == 11 (binary, i.e. 3), which specifies a register-to-register operation.

Why it teaches disassembly: This is the biggest leap in complexity. The ModR/M byte is the key to understanding most x86 instructions. Mastering this stage unlocks a huge portion of the instruction set.

Core challenges you’ll face:

  • Parsing the ModR/M byte → maps to bitwise operations to extract the Mod, Reg, and R/M fields
  • Mapping field values to registers → maps to creating a register lookup table (e.g., const char* registers[] = {"rax", "rcx", ...})
  • Handling the instruction’s direction → maps to some opcodes have a ‘d’ bit that determines if the destination is reg or r/m

Key Concepts:

  • Intel SDM, Vol 2, Section 2.1.3: ModR/M and SIB Bytes.
  • Register Encoding: Understanding that 000 is AL/AX/EAX/RAX, 001 is CL/CX/ECX/RCX, etc. The REX prefix extends this to R8-R15: the R bit adds a fourth bit to the Reg field and the B bit adds one to the R/M field.

Real world outcome: Your program can now decode 48 89 c8 into mov rax, rcx.

  • 48: REX prefix with W=1 (64-bit operand).
  • 89: MOV r/m64, r64 opcode.
  • c8: ModR/M byte.
    • Mod = 11 (binary) -> register-to-register.
    • Reg = 001 (binary) -> rcx.
    • R/M = 000 (binary) -> rax.
    • The direction bit (bit 1 of the opcode) is 0 in 89, so the destination is R/M and the source is Reg. So, mov rax, rcx.
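
The breakdown above can be turned into code fairly directly. The sketch below handles only REX.W + 89 with Mod == 11; `modrm_fields`, `split_modrm`, and `decode_mov_reg_reg` are hypothetical names for this example, not an established API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint8_t mod, reg, rm; } modrm_fields;

/* Splits a ModR/M byte into its 2-bit, 3-bit and 3-bit fields. */
modrm_fields split_modrm(uint8_t byte) {
    modrm_fields m;
    m.mod = (byte >> 6) & 3;
    m.reg = (byte >> 3) & 7;
    m.rm  = byte & 7;
    return m;
}

static const char *const REG64[16] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15"
};

/* Decodes REX.W + 89 /r (MOV r/m64, r64) in its register-to-register form.
   insn must start at the REX byte. Returns 3 on success, 0 otherwise. */
int decode_mov_reg_reg(const uint8_t *insn, int len, char *out, size_t out_size) {
    assert(insn != NULL && out != NULL);
    if (len < 3 || (insn[0] & 0xF8) != 0x48 || insn[1] != 0x89)
        return 0;                               /* needs REX with W=1, then 89 */
    uint8_t rex = insn[0];
    modrm_fields m = split_modrm(insn[2]);
    if (m.mod != 3)
        return 0;                               /* memory forms come in Stage 4 */
    int reg = m.reg | ((rex & 0x04) ? 8 : 0);   /* REX.R extends Reg */
    int rm  = m.rm  | ((rex & 0x01) ? 8 : 0);   /* REX.B extends R/M */
    /* For 89 the destination is R/M and the source is Reg. */
    snprintf(out, out_size, "mov %s, %s", REG64[rm], REG64[reg]);
    return 3;
}
```

Feeding it 48 89 c8 produces mov rax, rcx, and 48 89 e5 produces mov rbp, rsp. A fuller decoder would look the opcode up in a table instead of hard-coding 89.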

Stage 4: Memory Operands (ModR/M + Displacement)

  • File: LEARN_DISASSEMBLY_FROM_SCRATCH.md
  • Difficulty: Level 3: Advanced
  • Goal: Decode instructions that read from or write to memory.

What you’ll build: Extend your ModR/M parsing to handle cases where Mod is 00, 01, or 10. This involves reading an optional displacement of 1 or 4 bytes that follows the ModR/M byte.

Why it teaches disassembly: This teaches you how memory addressing is encoded. You’ll see the direct link between the bytes and assembly syntax like [rbp], [rax + 0x50], and [rip + 0x1234].

Core challenges you’ll face:

  • Handling different Mod values → maps to an if/else or switch on the Mod field
  • Reading the displacement → maps to reading 1 or 4 bytes from the buffer after the ModR/M byte and advancing your program counter
  • RIP-relative addressing → maps to a special case where Mod=00 and R/M=101 means a 32-bit displacement is added to the address of the next instruction (the value RIP holds while this one executes)

Real world outcome: Your program can decode 48 8b 45 f8 into mov rax, [rbp-0x8].

  • 48: REX.W prefix.
  • 8b: MOV r64, r/m64 opcode.
  • 45: ModR/M byte.
    • Mod = 01 -> memory mode with 8-bit displacement.
    • Reg = 000 -> rax.
    • R/M = 101 -> rbp.
  • f8: The 8-bit displacement, which is -8 in two’s complement.
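
A sketch of the Mod 00/01/10 handling described in this stage follows. It formats only the memory operand, starting at the ModR/M byte, and leaves the SIB case (R/M = 100) and REX-extended registers out; `format_mem_operand` and `read_i32` are hypothetical names, and the signed casts assume a two's-complement target.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static const char *const REG64[8] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"
};

/* Reads a little-endian signed 32-bit value. */
static int32_t read_i32(const uint8_t *p) {
    uint32_t v = (uint32_t)p[0] | (uint32_t)p[1] << 8 |
                 (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    return (int32_t)v;
}

/* p points at the ModR/M byte. Returns bytes consumed (ModR/M plus
   displacement), or 0 for forms this sketch does not handle. */
int format_mem_operand(const uint8_t *p, int len, char *out, size_t out_size) {
    assert(p != NULL && out != NULL);
    if (len < 1) return 0;
    uint8_t mod = (p[0] >> 6) & 3, rm = p[0] & 7;
    if (mod == 3 || rm == 4) return 0;      /* register form, or SIB follows */

    /* Mod=01: disp8. Mod=10: disp32. Mod=00 with R/M=101: RIP + disp32. */
    int disp_size = (mod == 1) ? 1 : (mod == 2 || rm == 5) ? 4 : 0;
    if (len < 1 + disp_size) return 0;

    int32_t disp = 0;
    if (disp_size == 1) disp = (int8_t)p[1];        /* sign-extend 8 bits */
    else if (disp_size == 4) disp = read_i32(p + 1);

    const char *base = (mod == 0 && rm == 5) ? "rip" : REG64[rm];
    if (disp_size == 0)
        snprintf(out, out_size, "[%s]", base);
    else if (disp < 0)
        snprintf(out, out_size, "[%s-0x%x]", base, (unsigned)(-(int64_t)disp));
    else
        snprintf(out, out_size, "[%s+0x%x]", base, (unsigned)disp);
    return 1 + disp_size;
}
```

For the bytes 45 f8 this yields [rbp-0x8] and consumes two bytes; for 05 34 12 00 00 it yields [rip+0x1234]. Resolving a RIP-relative operand to an absolute address would additionally need the address of the next instruction.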

Stage 5: Complex Memory Operands (The SIB Byte)

  • File: LEARN_DISASSEMBLY_FROM_SCRATCH.md
  • Difficulty: Level 4: Expert
  • Goal: Handle complex, indexed memory addressing.

What you’ll build: Support for the SIB byte. When Mod is not 11 and the ModR/M byte’s R/M field is 100, an SIB byte follows, which you must then parse.

Why it teaches disassembly: This completes your understanding of memory addressing. The SIB byte allows for [base + index*scale] addressing, which is common in loops and array processing.

Core challenges you’ll face:

  • Detecting the need for an SIB byte → maps to checking modrm.mod != 3 && modrm.rm == 4
  • Parsing the SIB byte → maps to bitwise operations to extract Scale (2 bits), Index (3 bits), and Base (3 bits) fields
  • Combining REX, ModR/M, and SIB → maps to the full, complex logic where REX bits extend the registers found in the SIB byte

Key Concepts:

  • Intel SDM, Vol 2, Sections 2.1.3 (ModR/M and SIB Bytes) and 2.1.5 (Addressing-Mode Encoding of ModR/M and SIB Bytes).
  • Scaled Indexing: The concept of base + index * scale.

Real world outcome: Your program can decode 48 8b 04 c5 00 00 00 00 into mov rax, [rax*8].

  • 48: REX.W prefix.
  • 8b: MOV r64, r/m64 opcode.
  • 04: ModR/M byte. R/M = 100 signals an SIB byte follows. Reg = 000 is rax.
  • c5: SIB byte.
    • Scale = 11 -> *8.
    • Index = 000 -> rax.
    • Base = 101 -> With Mod = 00, this means no base register; a 32-bit displacement follows instead.
  • 00 00 00 00: The 32-bit displacement.
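
The SIB rules above could be sketched as a single operand formatter. This is an illustration under stated assumptions: `format_sib_operand` is a hypothetical name, the caller passes the REX byte (or 0 if there was none), and the signed casts assume a two's-complement target.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static const char *const REG64[16] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15"
};

/* p points at the ModR/M byte; handles only Mod != 11 with R/M == 100.
   Returns bytes consumed (ModR/M + SIB + displacement), or 0. */
int format_sib_operand(const uint8_t *p, int len, uint8_t rex,
                       char *out, size_t out_size) {
    assert(p != NULL && out != NULL);
    if (len < 2) return 0;
    uint8_t mod = (p[0] >> 6) & 3, rm = p[0] & 7;
    if (mod == 3 || rm != 4) return 0;          /* no SIB byte in these cases */

    uint8_t sib = p[1];
    int scale = 1 << ((sib >> 6) & 3);          /* 1, 2, 4 or 8 */
    int index = ((sib >> 3) & 7) | ((rex & 0x02) ? 8 : 0);  /* REX.X */
    int base  = (sib & 7) | ((rex & 0x01) ? 8 : 0);         /* REX.B */
    int has_index = (index != 4);               /* 100 without REX.X: no index */
    int has_base  = !(mod == 0 && (sib & 7) == 5);

    int disp_size = (mod == 1) ? 1 : (mod == 2 || !has_base) ? 4 : 0;
    if (len < 2 + disp_size) return 0;
    int64_t disp = 0;
    if (disp_size == 1)
        disp = (int8_t)p[2];
    else if (disp_size == 4)
        disp = (int32_t)((uint32_t)p[2] | (uint32_t)p[3] << 8 |
                         (uint32_t)p[4] << 16 | (uint32_t)p[5] << 24);

    char bas[8] = "", idx[16] = "";
    if (has_base)
        snprintf(bas, sizeof bas, "%s", REG64[base]);
    if (has_index)
        snprintf(idx, sizeof idx, "%s%s*%d", has_base ? "+" : "",
                 REG64[index], scale);

    if (disp_size == 0)
        snprintf(out, out_size, "[%s%s]", bas, idx);
    else if (disp < 0)
        snprintf(out, out_size, "[%s%s-0x%llx]", bas, idx,
                 (unsigned long long)-disp);
    else
        snprintf(out, out_size, "[%s%s%s0x%llx]", bas, idx,
                 (has_base || has_index) ? "+" : "", (unsigned long long)disp);
    return 2 + disp_size;
}
```

For 04 c5 00 00 00 00 this sketch prints the displacement explicitly, giving [rax*8+0x0]; for 04 98 it gives [rax+rbx*4]. Whether to hide a zero displacement is a presentation choice.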

Stage 6: Immediate Operands and Jumps

  • File: LEARN_DISASSEMBLY_FROM_SCRATCH.md
  • Difficulty: Level 3: Advanced
  • Goal: Handle instructions with embedded constant values and control flow.

What you’ll build: Support for instructions that take an immediate value (e.g., ADD rax, 0x10) and for relative jumps (JMP, JNE, etc.).

Why it teaches disassembly: This completes the picture for most common instructions. Immediates are the data, and jumps are the control flow. Handling relative jumps requires you to use your instruction pointer’s position to calculate the target address.

Core challenges you’ll face:

  • Determining immediate size → maps to the opcode itself often dictates the size of the immediate that follows
  • Reading the immediate value → maps to reading 1, 2, 4, or 8 bytes from the end of the instruction, after any ModR/M, SIB, and displacement bytes
  • Calculating relative jump targets → maps to reading the offset (e.g., 32-bit), and adding it to the address of the *next instruction*

Real world outcome:

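  • Decode 48 83 c0 05 into add rax, 0x5. Opcode 83 takes an 8-bit immediate that is sign-extended to the operand size, and the ModR/M Reg field (000 here) selects ADD from the opcode’s group.
  • Decode e9 78 56 34 12 into a jmp whose 32-bit offset is 0x12345678 (stored little-endian); the target is that offset plus the address of the next instruction.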
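
Both outcomes rest on reading a little-endian signed value and, for jumps, adding it to the address of the next instruction. A minimal sketch, with `read_signed_le` and `jump_target` as hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads a little-endian value of 1, 2, 4 or 8 bytes and sign-extends it. */
int64_t read_signed_le(const uint8_t *p, int size) {
    assert(p != NULL && (size == 1 || size == 2 || size == 4 || size == 8));
    uint64_t v = 0;
    for (int i = 0; i < size; i++)
        v |= (uint64_t)p[i] << (8 * i);         /* least significant byte first */
    if (size < 8 && (v & (1ULL << (8 * size - 1))))
        v |= ~0ULL << (8 * size);               /* top bit set: fill with ones */
    return (int64_t)v;
}

/* Relative jump target: address of the next instruction plus the offset. */
uint64_t jump_target(uint64_t insn_addr, int insn_len, int64_t rel) {
    return insn_addr + (uint64_t)insn_len + (uint64_t)rel;
}
```

For e9 78 56 34 12 located at a hypothetical address 0x401000, the offset reads as 0x12345678 and the target is 0x401000 + 5 + 0x12345678. A short jump eb fe has offset -2 and length 2, so it targets its own address.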

Stage 7 (Advanced): A Full Disassembler Tool

  • File: LEARN_DISASSEMBLY_FROM_SCRATCH.md
  • Difficulty: Level 4: Expert
  • Goal: Turn your library into a usable command-line tool.

What you’ll build: A tool that can read an executable file (e.g., an ELF file), find its .text section (where the code lives), and run your disassembler on it.

Why it teaches disassembly: This connects your theoretical decoder to the real world. You’ll learn the basics of executable file formats and how to apply your disassembler to actual programs.

Core challenges you’ll face:

  • Parsing an ELF header → maps to reading the file format to find the section header table
  • Finding the .text section → maps to iterating the section headers to find the one named “.text” and getting its offset and size
  • Linear Sweep vs. Recursive Traversal → maps to choosing a disassembly strategy. Linear sweep is easier to start.

Key Concepts:

  • ELF File Format: The standard executable format for Linux.
  • Disassembly Algorithms: The strategies for navigating code.
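
The ELF steps listed under the core challenges could be sketched as below. Several assumptions apply: the build host provides the Linux/glibc `<elf.h>` header, the file is a 64-bit ELF with the same byte order as the host, and the whole file has already been read into memory. `find_text_section` is a hypothetical name.

```c
#include <assert.h>
#include <elf.h>      /* Linux/glibc ELF definitions (assumed available) */
#include <stdint.h>
#include <string.h>

/* Locates .text in a 64-bit ELF image held in memory. On success stores the
   section's file offset, size and virtual address and returns 0; else -1. */
int find_text_section(const uint8_t *image, size_t image_size,
                      uint64_t *offset, uint64_t *size, uint64_t *vaddr) {
    assert(image != NULL && offset != NULL && size != NULL && vaddr != NULL);
    Elf64_Ehdr eh;
    if (image_size < sizeof eh) return -1;
    memcpy(&eh, image, sizeof eh);
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;                              /* not a 64-bit ELF file */
    if (eh.e_shentsize != sizeof(Elf64_Shdr) || eh.e_shstrndx >= eh.e_shnum)
        return -1;
    if (eh.e_shoff > image_size ||
        (uint64_t)eh.e_shnum * sizeof(Elf64_Shdr) > image_size - eh.e_shoff)
        return -1;                              /* section table out of bounds */

    /* The section-name string table is itself a section. */
    Elf64_Shdr strtab;
    memcpy(&strtab, image + eh.e_shoff + (uint64_t)eh.e_shstrndx * sizeof strtab,
           sizeof strtab);
    if (strtab.sh_offset > image_size ||
        strtab.sh_size > image_size - strtab.sh_offset)
        return -1;
    const char *names = (const char *)image + strtab.sh_offset;

    for (unsigned i = 0; i < eh.e_shnum; i++) {
        Elf64_Shdr sh;
        memcpy(&sh, image + eh.e_shoff + (uint64_t)i * sizeof sh, sizeof sh);
        /* need room for ".text" plus its terminating NUL */
        if (sh.sh_name >= strtab.sh_size || strtab.sh_size - sh.sh_name < 6)
            continue;
        if (memcmp(names + sh.sh_name, ".text", 6) == 0) {
            if (sh.sh_offset > image_size ||
                sh.sh_size > image_size - sh.sh_offset)
                return -1;
            *offset = sh.sh_offset;
            *size = sh.sh_size;
            *vaddr = sh.sh_addr;
            return 0;
        }
    }
    return -1;
}
```

The bytes to disassemble are then `image + offset` for `size` bytes, and `vaddr` gives the address to print next to the first instruction. A linear sweep over that range is the simplest starting point.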

Real world outcome: ./deconstruct my_program

Disassembly of .text section:
0x401000: 55                      push rbp
0x401001: 48 89 e5                mov rbp, rsp
0x401004: 48 83 ec 10             sub rsp, 0x10
...

You have built a functional disassembler. You now understand what those bytes truly mean.


Summary of Stages

| Stage | Goal | Key Challenge | Unlocks |
|---|---|---|---|
| 1. Single-Byte | Build the main loop | Opcode table | nop, ret |
| 2. REX Prefix | Handle 64-bit mode | Bitwise parsing | push r8, 64-bit ops |
| 3. ModR/M (Reg) | Decode register operands | ModR/M parsing | mov rax, rcx |
| 4. ModR/M (Mem) | Decode memory operands | Displacement parsing | mov rax, [rbp-8] |
| 5. SIB Byte | Decode complex memory | SIB parsing | mov rax, [rax+rbx*4] |
| 6. Immediates/Jumps | Handle constants & flow | Relative addressing | add rax, 5, jmp |
| 7. Full Tool | Disassemble real files | ELF parsing | A usable tool |