LEARN DISASSEMBLY FROM SCRATCH
Learn Disassembly: By Building an x86-64 Disassembler in C
Goal: To deeply understand how disassembly works by building your own disassembler from scratch. You will learn how human-readable assembly instructions are encoded into the raw bytes of machine code and write a tool to translate them back.
Why Build a Disassembler?
Using a disassembler like objdump, Ghidra, or IDA is easy. But how do they work? How do they turn a sea of bytes like 48 89 e5 into the instruction mov rbp, rsp? This process is not magic, but a well-defined decoding of a complex instruction format.
Building your own disassembler is one of the most direct ways to learn this process. You will not be building a full-featured tool, but a simple engine that forces you to understand:
- The x86-64 Instruction Format: The core of this project. You’ll learn about opcodes, prefixes, and the complex ModR/M and SIB bytes that describe operands.
- Variable-Length Instructions: The primary challenge of x86 disassembly. An instruction can be anywhere from 1 to 15 bytes long.
- CPU Operating Modes: You’ll see how 64-bit mode adds the REX prefix, which extends the register fields and selects 64-bit operand sizes.
- Disassembly Algorithms: You’ll start with a simple “Linear Sweep” and understand its limitations.
After completing this project, you will never look at an executable file the same way again. You will see the raw bytes and have a fundamental understanding of what they mean.
Core Concept Analysis: The x86-64 Instruction Format
The central challenge is that x86-64 instructions are variable-length. Your disassembler’s main job is to figure out, for each instruction, “how many bytes does this instruction consume?” To do this, you must parse the instruction format, which is composed of several optional and mandatory parts in a specific order.
[Legacy Prefixes] [REX Prefix] [Opcode] [ModR/M] [SIB] [Displacement] [Immediate]
(0-4 bytes) (0-1 byte) (1-3 bytes) (0-1 byte) (0-1 byte) (0-8 bytes) (0-8 bytes)
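To make the layout above concrete, here is a minimal sketch of how a decoder might skip the optional prefixes to find the first opcode byte. The names is_legacy_prefix, prefix_scan, and scan_prefixes are hypothetical helpers invented for this illustration. The sketch assumes 64-bit mode and does not enforce the 15-byte instruction limit or reject repeated prefixes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if `b` is one of the legacy prefix bytes. */
static bool is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0xF0:             /* LOCK */
    case 0xF2: case 0xF3:  /* REPNE / REP */
    case 0x2E: case 0x36:  /* CS / SS segment override */
    case 0x3E: case 0x26:  /* DS / ES segment override */
    case 0x64: case 0x65:  /* FS / GS segment override */
    case 0x66:             /* operand-size override */
    case 0x67:             /* address-size override */
        return true;
    default:
        return false;
    }
}

/* Result of scanning the prefix area of one instruction (hypothetical type). */
struct prefix_scan {
    size_t  legacy_count;   /* number of legacy prefix bytes seen */
    bool    has_rex;        /* true if a REX byte precedes the opcode */
    uint8_t rex;            /* the REX byte itself, valid if has_rex */
    size_t  opcode_offset;  /* index of the first opcode byte */
};

/* Skip legacy prefixes, then an optional REX byte (0x40-0x4F).
 * REX only counts when it sits immediately before the opcode,
 * which is why it is checked after the legacy prefixes. */
static struct prefix_scan scan_prefixes(const uint8_t *buf, size_t len)
{
    struct prefix_scan s = {0, false, 0, 0};
    size_t i = 0;

    while (i < len && is_legacy_prefix(buf[i]))
        i++;
    s.legacy_count = i;

    if (i < len && (buf[i] & 0xF0) == 0x40) {
        s.has_rex = true;
        s.rex = buf[i];
        i++;
    }
    s.opcode_offset = i;
    return s;
}
```

For the bytes 66 48 89 c8, this reports one legacy prefix, a REX byte of 0x48, and an opcode at offset 2.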
Key Components You Will Decode
- Opcode (The Verb): This is the main part of the instruction, telling the CPU what to do (e.g., MOV, ADD, JMP). It’s 1-3 bytes long. Your disassembler will have a giant lookup table or switch statement for this.
- ModR/M Byte (The Nouns): This incredibly important byte tells the CPU where to find its operands. It specifies whether the operands are registers or memory locations. It is divided into three fields:
  - Mod (2 bits): Specifies register-vs-memory.
  - Reg (3 bits): Identifies a register operand.
  - R/M (3 bits): Identifies a register or memory operand.
- REX Prefix (The 64-bit Extender): In 64-bit mode, this optional byte (0x40 to 0x4F) is used to access the extended registers (R8-R15) and to specify the use of 64-bit operand sizes. Your disassembler must check for this prefix to correctly identify registers and operand sizes.
- SIB Byte (Scale-Index-Base): An optional byte that follows the ModR/M byte to define more complex memory addressing, like [rax + rcx*8].
- Displacement & Immediate: Raw byte values embedded in the instruction. A displacement is part of a memory address calculation (e.g., the 0x20 in [rbp - 0x20]). An immediate is a constant value used in an operation (e.g., the 0x5 in add eax, 5).
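As a minimal sketch of how the three ModR/M fields described above could be pulled apart with shifts and masks (struct modrm and split_modrm are hypothetical names chosen only for this illustration):

```c
#include <stdint.h>

/* The three fields of a ModR/M byte, laid out as  mod:2 | reg:3 | rm:3. */
struct modrm {
    uint8_t mod;  /* bits 7-6 */
    uint8_t reg;  /* bits 5-3 */
    uint8_t rm;   /* bits 2-0 */
};

/* Split a raw ModR/M byte into its fields. */
static struct modrm split_modrm(uint8_t byte)
{
    struct modrm m;
    m.mod = (uint8_t)((byte >> 6) & 0x3);
    m.reg = (uint8_t)((byte >> 3) & 0x7);
    m.rm  = (uint8_t)(byte & 0x7);
    return m;
}
```

For the byte 0xc8 (binary 11 001 000) this yields mod = 3, reg = 1, rm = 0.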
The Project: Build “Deconstruct”, an x86-64 Disassembler
This is a single project broken into stages. Each stage will add support for a new part of the x86-64 instruction format. You will build a C program that reads raw bytes from an array and prints the corresponding assembly mnemonics.
Main Programming Language: C (C99 or later)
Main “Book”: The Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM), Volume 2: Instruction Set Reference. This is the authoritative source. No other resource is as complete. Have it open at all times.
Helpful Resource: The OSDev.org x86-64 Opcode List provides a friendlier, though less complete, view.
Stage 1: The Basic Loop and Single-Byte Opcodes
- File: LEARN_DISASSEMBLY_FROM_SCRATCH.md
- Difficulty: Level 2: Intermediate
- Knowledge Area: C Programming / CPU Architecture
- Goal: Build the main loop and handle the simplest instructions.
What you’ll build: A C program that takes a byte array, loops through it, and decodes only single-byte opcodes.
Why it teaches disassembly: This establishes the core “read-decode-advance” loop. It forces you to create your first opcode table and see the direct byte-to-instruction relationship in its simplest form.
Core challenges you’ll face:
- The Main Loop → maps to for (int pc = 0; pc < len; ) where pc is advanced by the size of the decoded instruction
- Opcode Lookup Table → maps to a giant switch statement or an array of structs/strings
- Decoding an instruction → maps to a function int decode_instruction(uint8_t* buffer, int pc) that returns the number of bytes consumed
Key Concepts:
- Instruction Pointer: The concept of a program counter (pc) that points to the start of the next instruction to be executed.
- Opcode Map: The 1:1 mapping of a byte value to a mnemonic.
Real world outcome:
Your program takes an array like uint8_t code[] = {0x90, 0xc3, 0x50, 0x58}; and prints:
0x0000: 90 nop
0x0001: c3 ret
0x0002: 50 push rax
0x0003: 58 pop rax
Implementation Hints:
Your main decode function will start simple:
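#include <stdint.h>
#include <stdio.h>

// Decodes the instruction at buffer[pc], prints its mnemonic,
// and returns the number of bytes it occupies.
int decode(uint8_t* buffer, int pc) {
    uint8_t opcode = buffer[pc];
    switch (opcode) {
        case 0x90: printf("nop\n"); return 1;
        case 0xc3: printf("ret\n"); return 1;
        // Opcodes 0x50 through 0x57 are PUSH register
        case 0x50: printf("push rax\n"); return 1;
        // ... more cases
        default: printf("unknown\n"); return 1; // skip one byte and keep going
    }
}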
Notice how instructions for push on different registers have sequential opcodes. You can use this to your advantage: if (opcode >= 0x50 && opcode <= 0x57) ....
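Putting the pieces of this stage together, the read-decode-advance loop might look like the following minimal sketch. The names REG64, decode_one, and linear_sweep are hypothetical; the sketch handles only nop, ret, and the push/pop register ranges, and writes the mnemonic into a caller-supplied buffer instead of printing it directly so the loop can print the address and raw bytes first.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* 64-bit register names in encoding order (index = low 3 opcode bits). */
static const char *const REG64[8] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"
};

/* Decode one single-byte instruction at buf[pc] into `out`.
 * Returns the number of bytes consumed (always 1 at this stage). */
static size_t decode_one(const uint8_t *buf, size_t pc, char *out, size_t out_size)
{
    uint8_t op = buf[pc];

    if (op == 0x90)
        snprintf(out, out_size, "nop");
    else if (op == 0xc3)
        snprintf(out, out_size, "ret");
    else if (op >= 0x50 && op <= 0x57)      /* PUSH r64: 0x50 + register */
        snprintf(out, out_size, "push %s", REG64[op - 0x50]);
    else if (op >= 0x58 && op <= 0x5f)      /* POP r64: 0x58 + register */
        snprintf(out, out_size, "pop %s", REG64[op - 0x58]);
    else
        snprintf(out, out_size, "(unknown)");
    return 1;
}

/* Linear sweep: decode back-to-back instructions from the start of buf.
 * Prints one line per instruction and returns how many were decoded. */
static size_t linear_sweep(const uint8_t *buf, size_t len)
{
    size_t count = 0;

    for (size_t pc = 0; pc < len; count++) {
        char text[32];
        size_t n = decode_one(buf, pc, text, sizeof text);

        printf("0x%04zx: ", pc);
        for (size_t i = 0; i < n; i++)
            printf("%02x ", buf[pc + i]);
        printf("%s\n", text);

        pc += n;  /* advance by the size of the decoded instruction */
    }
    return count;
}
```

Calling linear_sweep on the four-byte array from the example above decodes four instructions. Note that pc only moves forward by the length decode_one reports, which is the property every later stage has to preserve.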
Stage 2: The REX Prefix and 64-bit Operands
- File: LEARN_DISASSEMBLY_FROM_SCRATCH.md
- Difficulty: Level 2: Intermediate
- Goal: Add support for 64-bit mode extensions.
What you’ll build: You will augment your decoder to first check for a REX prefix (0x40 - 0x4F). If present, you’ll parse its bits to determine if the operand size is 64-bit and if extended registers (R8-R15) are being used.
Why it teaches disassembly: The REX prefix is fundamental to x86-64. This stage teaches you that an instruction is not just its opcode; optional prefixes can fundamentally change its meaning and the registers it operates on.
Core challenges you’ll face:
- Detecting the REX prefix → maps to checking if the byte at pc is in the 0x40-0x4F range
- Parsing REX bits → maps to using bitwise operations to check the W (64-bit), R (ModR/M Reg extension), X (SIB Index extension), and B (ModR/M R/M extension) bits
- Storing the REX info → maps to passing a rex_info struct or flags through your decoding functions
Key Concepts:
- Intel SDM, Vol 2, Section 2.2.1: REX Prefixes.
- Bitwise Operations: Using & and >> to extract information from a byte.
Real world outcome:
Your program can now distinguish between 50 (push rax) and 41 50 (push r8).
// REX byte is 0b0100WRXB
// W bit (bit 3): 1 = 64-bit operand size
// R bit (bit 2): Extends ModR/M 'reg' field
// X bit (bit 1): Extends SIB 'index' field
// B bit (bit 0): Extends ModR/M 'r/m' field, SIB 'base' field, or 'reg' field in opcode
// In your decode function, before the switch (bool needs #include <stdbool.h>):
bool rex_w = false;
bool rex_b = false;
//...
if (buffer[pc] >= 0x40 && buffer[pc] <= 0x4F) {
    uint8_t rex = buffer[pc];
    rex_w = (rex & 0x08) != 0; // hex masks: 0b... literals are not standard C before C23
    rex_b = (rex & 0x01) != 0;
    // ...
    pc++; // Consume the REX prefix (and count it in the length you return)
}
uint8_t opcode = buffer[pc];
//...
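As one way to carry the REX information around, the sketch below parses all four bits into a small struct and uses the B bit to pick the register for a push. The names rex_info, parse_rex, REG64, and decode_push are hypothetical, and only the 0x50-0x57 opcode range is handled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded REX prefix bits (hypothetical helper type). */
struct rex_info {
    bool present;  /* a REX byte was found */
    bool w;        /* bit 3: 64-bit operand size */
    bool r;        /* bit 2: extends ModR/M reg */
    bool x;        /* bit 1: extends SIB index */
    bool b;        /* bit 0: extends ModR/M r/m, SIB base, or opcode reg */
};

/* Interpret `byte` as a REX prefix if it is in the 0x40-0x4F range. */
static struct rex_info parse_rex(uint8_t byte)
{
    struct rex_info rex = {false, false, false, false, false};

    if ((byte & 0xF0) == 0x40) {
        rex.present = true;
        rex.w = (byte & 0x08) != 0;
        rex.r = (byte & 0x04) != 0;
        rex.x = (byte & 0x02) != 0;
        rex.b = (byte & 0x01) != 0;
    }
    return rex;
}

/* All sixteen 64-bit general-purpose registers in encoding order. */
static const char *const REG64[16] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15"
};

/* Decode PUSH r64 (opcode 0x50 + register), with an optional REX prefix.
 * Returns the instruction length (1 or 2), or 0 if the bytes are not a
 * push from this range. On success *reg_name points at the register name. */
static size_t decode_push(const uint8_t *buf, size_t len, const char **reg_name)
{
    size_t pc = 0;
    struct rex_info rex = parse_rex(len > 0 ? buf[0] : 0);

    if (rex.present)
        pc++;                                   /* consume the REX byte */
    if (pc >= len || buf[pc] < 0x50 || buf[pc] > 0x57)
        return 0;

    /* The low 3 opcode bits pick the register; REX.B supplies bit 3. */
    *reg_name = REG64[(buf[pc] & 0x07) | (rex.b ? 8 : 0)];
    return pc + 1;
}
```

With this in place, 50 decodes as a one-byte push of rax, while 41 50 decodes as a two-byte push of r8.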
Stage 3: The ModR/M Byte and Register Operands
- File: LEARN_DISASSEMBLY_FROM_SCRATCH.md
- Difficulty: Level 3: Advanced
- Goal: Decode instructions with two register operands.
What you’ll build: Support for opcodes that are followed by a ModR/M byte. You will start by only handling the case where Mod == 11, which specifies a register-to-register operation.
Why it teaches disassembly: This is the biggest leap in complexity. The ModR/M byte is the key to understanding most x86 instructions. Mastering this stage unlocks a huge portion of the instruction set.
Core challenges you’ll face:
- Parsing the ModR/M byte → maps to bitwise operations to extract the Mod, Reg, and R/M fields
- Mapping field values to registers → maps to creating a register lookup table (e.g., const char* registers[] = {"rax", "rcx", ...})
- Handling the instruction’s direction → maps to some opcodes have a ‘d’ bit that determines if the destination is reg or r/m
Key Concepts:
- Intel SDM, Vol 2, Section 2.1.3: ModR/M and SIB Bytes.
- Register Encoding: Understanding that 000 is AL/AX/EAX/RAX, 001 is CL/CX/ECX/RCX, etc. The REX prefix’s R bit extends the Reg field and its B bit extends the R/M field, which is how R8-R15 are reached.
Real world outcome:
Your program can now decode 48 89 c8 into mov rax, rcx.
- 48: REX prefix with W=1 (64-bit operand).
- 89: MOV r/m64, r64 opcode.
- c8: ModR/M byte.
  - Mod=11 (binary) -> register-to-register.
  - Reg=001 (binary) -> rcx.
  - R/M=000 (binary) -> rax.
- Direction bit in opcode means the destination is R/M, source is Reg. So, mov rax, rcx.
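The walkthrough above can be sketched in code as follows. This is only an illustration under narrow assumptions: it requires a REX prefix with W=1, handles just the two MOV opcodes 0x89 and 0x8B, and rejects anything that is not Mod == 11. The names REG64 and decode_mov_reg_reg are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static const char *const REG64[16] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15"
};

/* Decode "REX.W + 89 /r" (MOV r/m64, r64) or "REX.W + 8B /r"
 * (MOV r64, r/m64) in its register-to-register form.
 * Writes the text into `out` and returns the instruction length (3),
 * or 0 if the bytes do not match this narrow pattern. */
static size_t decode_mov_reg_reg(const uint8_t *buf, size_t len,
                                 char *out, size_t out_size)
{
    if (len < 3)
        return 0;

    uint8_t rex = buf[0], opcode = buf[1], modrm = buf[2];

    if ((rex & 0xF8) != 0x48)                 /* need REX with W = 1 */
        return 0;
    if (opcode != 0x89 && opcode != 0x8B)
        return 0;
    if ((modrm >> 6) != 3)                    /* Mod must be 11 */
        return 0;

    /* REX.R (bit 2) extends Reg; REX.B (bit 0) extends R/M. */
    int reg = ((modrm >> 3) & 7) | ((rex & 0x04) ? 8 : 0);
    int rm  = (modrm & 7)        | ((rex & 0x01) ? 8 : 0);

    /* Opcode bit 1 is the direction bit: 0 -> R/M is the destination,
     * 1 -> Reg is the destination. */
    bool reg_is_dst = (opcode & 0x02) != 0;
    int dst = reg_is_dst ? reg : rm;
    int src = reg_is_dst ? rm : reg;

    snprintf(out, out_size, "mov %s, %s", REG64[dst], REG64[src]);
    return 3;
}
```

Feeding it 48 89 c8 produces mov rax, rcx, and flipping the opcode to 8b (48 8b c8) swaps the operands to mov rcx, rax. With REX bits R and B set (4d 89 c8) the same ModR/M byte selects r9 and r8 instead.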
Stage 4: Memory Operands (ModR/M + Displacement)
- File: LEARN_DISASSEMBLY_FROM_SCRATCH.md
- Difficulty: Level 3: Advanced
- Goal: Decode instructions that read from or write to memory.
What you’ll build: Extend your ModR/M parsing to handle cases where Mod is 00, 01, or 10. This involves reading an optional displacement of 1 or 4 bytes that follows the ModR/M byte.
Why it teaches disassembly: This teaches you how memory addressing is encoded. You’ll see the direct link between the bytes and assembly syntax like [rbp], [rax + 0x50], and [rip + 0x1234].
Core challenges you’ll face:
- Handling different Mod values → maps to an if/else or switch on the Mod field
- Reading the displacement → maps to reading 1 or 4 bytes from the buffer after the ModR/M byte and advancing your program counter (2-byte displacements belong to 16-bit addressing, which 64-bit mode does not offer)
- RIP-relative addressing → maps to a special case where Mod=00 and R/M=101 means a 32-bit displacement follows and the address is relative to the address of the next instruction
Real world outcome:
Your program can decode 48 8b 45 f8 into mov rax, [rbp-0x8].
- 48: REX.W prefix.
- 8b: MOV r64, r/m64 opcode.
- 45: ModR/M byte.
  - Mod=01 -> memory mode with 8-bit displacement.
  - Reg=000 -> rax.
  - R/M=101 -> rbp.
- f8: The 8-bit displacement, which is -8 in two’s complement.
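A minimal sketch of the Mod handling for this stage is shown below. It starts at the ModR/M byte, reports how many bytes the ModR/M byte plus displacement occupy, and leaves text formatting to the caller. The names mem_operand, read_le32, and decode_mem_operand are hypothetical, and the SIB case (R/M = 100) is deliberately rejected because it belongs to the next stage.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A decoded memory operand without SIB (hypothetical helper type). */
struct mem_operand {
    bool    rip_relative;  /* Mod=00, R/M=101: [rip + disp32] */
    int     base;          /* base register index 0-15 (ignored if rip_relative) */
    int32_t disp;          /* sign-extended displacement, 0 if none */
    size_t  length;        /* bytes used by ModR/M + displacement */
};

/* Read a little-endian 32-bit value. The final cast relies on the
 * two's-complement conversion that mainstream compilers provide. */
static int32_t read_le32(const uint8_t *p)
{
    uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return (int32_t)v;
}

/* Decode the memory operand starting at the ModR/M byte p[0].
 * rex_b is the REX.B bit. Returns false for register operands
 * (Mod=11), for the SIB case (R/M=100), or if bytes are missing. */
static bool decode_mem_operand(const uint8_t *p, size_t len, bool rex_b,
                               struct mem_operand *out)
{
    if (len < 1)
        return false;

    uint8_t mod = p[0] >> 6;
    uint8_t rm  = p[0] & 7;
    if (mod == 3 || rm == 4)
        return false;

    /* Mod=01 -> disp8, Mod=10 -> disp32, Mod=00 -> none ... */
    size_t disp_size = (mod == 1) ? 1 : (mod == 2) ? 4 : 0;

    /* ... except Mod=00 with R/M=101, which is RIP-relative + disp32. */
    out->rip_relative = (mod == 0 && rm == 5);
    if (out->rip_relative)
        disp_size = 4;

    if (len < 1 + disp_size)
        return false;

    out->base = rm | (rex_b ? 8 : 0);
    if (disp_size == 1)
        out->disp = (int8_t)p[1];          /* sign-extend 8 -> 32 bits */
    else if (disp_size == 4)
        out->disp = read_le32(p + 1);
    else
        out->disp = 0;
    out->length = 1 + disp_size;
    return true;
}
```

For the bytes 45 f8 from the example, this reports base register 5 (rbp), a displacement of -8, and a length of 2, which the caller can print as [rbp-0x8].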
Stage 5: Complex Memory Operands (The SIB Byte)
- File: LEARN_DISASSEMBLY_FROM_SCRATCH.md
- Difficulty: Level 4: Expert
- Goal: Handle complex, indexed memory addressing.
What you’ll build: Support for the SIB byte. When the ModR/M byte’s Mod field is not 11 and its R/M field is 100, it signals that an SIB byte follows, which you must then parse.
Why it teaches disassembly: This completes your understanding of memory addressing. The SIB byte allows for [base + index*scale] addressing, which is common in loops and array processing.
Core challenges you’ll face:
- Detecting the need for an SIB byte → maps to checking modrm.mod != 3 && modrm.rm == 4
- Parsing the SIB byte → maps to bitwise operations to extract Scale (2 bits), Index (3 bits), and Base (3 bits) fields
- Combining REX, ModR/M, and SIB → maps to the full, complex logic where REX bits extend the registers found in the SIB byte
Key Concepts:
- Intel SDM, Vol 2, Section 2.1.3 (ModR/M and SIB Bytes) and Section 2.1.5 (Addressing-Mode Encoding of ModR/M and SIB Bytes).
- Scaled Indexing: The concept of base + index * scale.
Real world outcome:
Your program can decode 48 8b 04 c5 00 00 00 00 into mov rax, [rax*8].
- 48: REX.W prefix.
- 8b: MOV r64, r/m64 opcode.
- 04: ModR/M byte.
  - Mod = 00 -> memory mode with no displacement of its own.
  - Reg = 000 is rax.
  - R/M = 100 signals an SIB byte follows.
- c5: SIB byte.
  - Scale=11 -> *8.
  - Index=000 -> rax.
  - Base=101 (with Mod=00) -> No base, requires a 32-bit displacement.
- 00 00 00 00: The 32-bit displacement.
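The SIB rules used in this walkthrough can be sketched as a small function. It takes the Mod field from the preceding ModR/M byte, the SIB byte, and the REX.X and REX.B bits, and reports which parts of base + index*scale + disp are present. The names sib_operand and decode_sib are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded SIB addressing form (hypothetical helper type). */
struct sib_operand {
    int    scale;      /* 1, 2, 4, or 8 */
    bool   has_index;  /* false when the index field means "none" */
    int    index;      /* index register 0-15 */
    bool   has_base;   /* false for the "disp32 only" form */
    int    base;       /* base register 0-15 */
    size_t disp_size;  /* displacement bytes that follow: 0, 1, or 4 */
};

/* Decode an SIB byte. `mod` is the Mod field (0-2) of the ModR/M byte
 * that announced the SIB byte; rex_x and rex_b are the REX.X and REX.B bits. */
static struct sib_operand decode_sib(uint8_t mod, uint8_t sib,
                                     bool rex_x, bool rex_b)
{
    struct sib_operand s;

    s.scale = 1 << (sib >> 6);                       /* 00,01,10,11 -> 1,2,4,8 */
    s.index = ((sib >> 3) & 7) | (rex_x ? 8 : 0);
    s.base  = (sib & 7)        | (rex_b ? 8 : 0);

    /* Index 100 without REX.X means "no index"; with REX.X it is r12. */
    s.has_index = (s.index != 4);

    /* Base field 101 with Mod=00 means "no base, disp32 follows",
     * regardless of REX.B. */
    s.has_base = !(mod == 0 && (sib & 7) == 5);

    if (mod == 1)
        s.disp_size = 1;
    else if (mod == 2 || !s.has_base)
        s.disp_size = 4;
    else
        s.disp_size = 0;
    return s;
}
```

For the example bytes (Mod = 00, SIB = c5) this gives scale 8, index rax, no base, and a 4-byte displacement. A common second test case is SIB = 24 with Mod = 00, which encodes plain [rsp]: no index, base register 4, no displacement.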
Stage 6: Immediate Operands and Jumps
- File: LEARN_DISASSEMBLY_FROM_SCRATCH.md
- Difficulty: Level 3: Advanced
- Goal: Handle instructions with embedded constant values and control flow.
What you’ll build: Support for instructions that take an immediate value (e.g., ADD rax, 0x10) and for relative jumps (JMP, JNE, etc.).
Why it teaches disassembly: This completes the picture for most common instructions. Immediates are the data, and jumps are the control flow. Handling relative jumps requires you to use your instruction pointer’s position to calculate the target address.
Core challenges you’ll face:
- Determining immediate size → maps to the opcode itself often dictates the size of the immediate that follows
- Reading the immediate value → maps to reading 1, 2, 4, or 8 bytes from the end of the instruction (the immediate is its last field)
- Calculating relative jump targets → maps to reading the offset (e.g., 32-bit), and adding it to the address of the *next instruction*
Real world outcome:
- Decode 48 83 c0 05 into add rax, 0x5.
- Decode e9 78 56 34 12 into a jmp whose target is 0x12345678 plus the address of the next instruction.
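The jump-target arithmetic can be sketched as below. The sketch covers only the two unconditional forms, JMP rel8 (opcode EB) and JMP rel32 (opcode E9); read_le32 and jump_target are hypothetical names, and insn_addr stands for whatever address your tool assigns to the first byte of the instruction.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit value. The final cast relies on the
 * two's-complement conversion that mainstream compilers provide. */
static int32_t read_le32(const uint8_t *p)
{
    uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return (int32_t)v;
}

/* Compute the target of a relative JMP located at address insn_addr.
 * On success stores the instruction length and the absolute target,
 * and returns true. Returns false for any other opcode. */
static bool jump_target(const uint8_t *insn, size_t len, uint64_t insn_addr,
                        size_t *insn_len, uint64_t *target)
{
    int64_t rel;

    if (len >= 2 && insn[0] == 0xEB) {          /* JMP rel8 */
        *insn_len = 2;
        rel = (int8_t)insn[1];                  /* sign-extend 8 bits */
    } else if (len >= 5 && insn[0] == 0xE9) {   /* JMP rel32 */
        *insn_len = 5;
        rel = read_le32(insn + 1);              /* sign-extend 32 bits */
    } else {
        return false;
    }

    /* The offset is relative to the address of the NEXT instruction.
     * Unsigned arithmetic wraps, so negative offsets work as expected. */
    *target = insn_addr + *insn_len + (uint64_t)rel;
    return true;
}
```

If e9 78 56 34 12 sits at address 0x401000, the next instruction starts at 0x401005, so the target is 0x401005 + 0x12345678 = 0x1274667d. The two-byte sequence eb fe has an offset of -2 and therefore jumps back to its own address.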
Stage 7 (Advanced): A Full Disassembler Tool
- File: LEARN_DISASSEMBLY_FROM_SCRATCH.md
- Difficulty: Level 4: Expert
- Goal: Turn your library into a usable command-line tool.
What you’ll build: A tool that can read an executable file (e.g., an ELF file), find its .text section (where the code lives), and run your disassembler on it.
Why it teaches disassembly: This connects your theoretical decoder to the real world. You’ll learn the basics of executable file formats and how to apply your disassembler to actual programs.
Core challenges you’ll face:
- Parsing an ELF header → maps to reading the file format to find the section header table
- Finding the .text section → maps to iterating the section headers to find the one named “.text” and getting its offset and size
- Linear Sweep vs. Recursive Traversal → maps to choosing a disassembly strategy. Linear sweep is easier to start.
Key Concepts:
- ELF File Format: The standard executable format for Linux.
- Disassembly Algorithms: The strategies for navigating code.
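As a starting point for the ELF side, here is a minimal sketch that locates .text in a file image already loaded into memory. It assumes a Linux system that provides <elf.h>, a 64-bit x86-64 ELF file, and a little-endian host so the headers can be copied straight into the Elf64 structs. The names text_section and find_text_section are hypothetical, and the bounds checks are only the basic ones needed to avoid reading outside the buffer.

```c
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Where the code lives (hypothetical helper type). */
struct text_section {
    uint64_t file_offset;  /* offset of .text within the file */
    uint64_t size;         /* size of .text in bytes */
    uint64_t address;      /* virtual address of its first byte */
};

/* Find the section named ".text" in an ELF64 image held in memory. */
static bool find_text_section(const uint8_t *image, size_t image_size,
                              struct text_section *out)
{
    Elf64_Ehdr eh;

    if (image_size < sizeof eh)
        return false;
    memcpy(&eh, image, sizeof eh);

    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64 ||
        eh.e_machine != EM_X86_64 ||
        eh.e_shentsize != sizeof(Elf64_Shdr) ||
        eh.e_shstrndx >= eh.e_shnum)
        return false;

    /* The whole section header table must lie inside the image. */
    if (eh.e_shoff > image_size ||
        (image_size - eh.e_shoff) / sizeof(Elf64_Shdr) < eh.e_shnum)
        return false;

    /* Section names are offsets into the section-name string table. */
    Elf64_Shdr names_hdr;
    memcpy(&names_hdr,
           image + eh.e_shoff + (size_t)eh.e_shstrndx * sizeof names_hdr,
           sizeof names_hdr);
    if (names_hdr.sh_offset > image_size ||
        names_hdr.sh_size > image_size - names_hdr.sh_offset)
        return false;
    const uint8_t *names = image + names_hdr.sh_offset;

    for (size_t i = 0; i < eh.e_shnum; i++) {
        Elf64_Shdr sh;
        memcpy(&sh, image + eh.e_shoff + i * sizeof sh, sizeof sh);

        /* Compare 6 bytes: ".text" plus its terminating NUL. */
        if (sh.sh_name >= names_hdr.sh_size ||
            names_hdr.sh_size - sh.sh_name < 6 ||
            memcmp(names + sh.sh_name, ".text", 6) != 0)
            continue;

        if (sh.sh_offset > image_size ||
            sh.sh_size > image_size - sh.sh_offset)
            return false;

        out->file_offset = sh.sh_offset;
        out->size = sh.sh_size;
        out->address = sh.sh_addr;
        return true;
    }
    return false;
}
```

After reading the whole file into a buffer, a call to find_text_section gives you the byte range to hand to your linear sweep, and address is the value to print in the left-hand column.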
Real world outcome:
./deconstruct my_program
Disassembly of .text section:
0x401000: 55 push rbp
0x401001: 48 89 e5 mov rbp, rsp
0x401004: 48 83 ec 10 sub rsp, 0x10
...
You have built a functional disassembler. You now understand what those bytes truly mean.
Summary of Stages
| Stage | Goal | Key Challenge | Unlocks |
|---|---|---|---|
| 1. Single-Byte | Build the main loop | Opcode table | nop, ret |
| 2. REX Prefix | Handle 64-bit mode | Bitwise parsing | push r8, 64-bit ops |
| 3. ModR/M (Reg) | Decode register operands | ModR/M parsing | mov rax, rcx |
| 4. ModR/M (Mem) | Decode memory operands | Displacement parsing | mov rax, [rbp-8] |
| 5. SIB Byte | Decode complex memory | SIB parsing | mov rax, [rax+rbx*4] |
| 6. Immediates/Jumps | Handle constants & flow | Relative addressing | add rax, 5, jmp |
| 7. Full Tool | Disassemble real files | ELF parsing | A usable tool |