Project 3: Build a Simple Disassembler

Expanded deep-dive guide for Project 3 from the Binary Analysis sprint.

Quick Reference

Attribute	Value
Difficulty	Level 3: Advanced
Time Estimate	2-4 weeks
Main Programming Language	C
Alternative Programming Languages	Python (with Capstone), Rust
Coolness Level	Level 4: Hardcore Tech Flex
Business Potential	1. The “Resume Gold”
Knowledge Area	Disassembly / x86 Instruction Encoding
Software or Tool	Intel manuals, Capstone engine
Main Book	“Intel 64 and IA-32 Architectures Software Developer’s Manual”

1. Learning Objectives

Build a working implementation with reproducible outputs.
Justify key design choices with binary-analysis principles.
Produce an evidence-backed report of findings and limitations.
Document hardening or next-step improvements.

2. All Theory Needed (Per-Concept Breakdown)

This project depends on concepts from the main sprint primer: loader semantics, control/data-flow recovery, runtime observation, and mitigation-aware vulnerability reasoning. Before implementation, restate the project’s core assumptions in your own words and define how you will validate them.

3. Project Specification

3.1 What You Will Build

A disassembler that converts x86/x64 machine code into human-readable assembly instructions.

3.2 Functional Requirements

Accept the target binary/input and validate format assumptions.
Produce analyzable outputs (console report and/or artifacts).
Handle malformed inputs safely with explicit errors.

3.3 Non-Functional Requirements

Reproducibility: same input should produce equivalent findings.
Safety: unknown samples run only in isolated lab contexts.
Clarity: separate facts, hypotheses, and inferred conclusions.

3.4 Expanded Project Brief

File: P03-build-a-simple-disassembler.md
Main Programming Language: C
Alternative Programming Languages: Python (with Capstone), Rust
Coolness Level: Level 4: Hardcore Tech Flex
Business Potential: 1. The “Resume Gold”
Difficulty: Level 3: Advanced
Knowledge Area: Disassembly / x86 Instruction Encoding
Software or Tool: Intel manuals, Capstone engine
Main Book: “Intel 64 and IA-32 Architectures Software Developer’s Manual”

What you’ll build: A disassembler that converts x86/x64 machine code into human-readable assembly instructions.

Why it teaches binary analysis: Understanding how machine code maps to assembly is fundamental. Building a disassembler forces you to understand instruction encoding.

Core challenges you’ll face:

Variable-length instructions → maps to instruction lengths of 1-15 bytes
Prefixes and REX bytes → maps to operand size, 64-bit registers
ModR/M and SIB bytes → maps to addressing modes
Immediate and displacement → maps to constants and offsets

Resources for key challenges:

MyDisassembler (GitHub) - Reference implementation
Capstone Engine - If you want to use a library
Intel SDM Volume 2 - Instruction Set Reference

Key Concepts:

x86 Instruction Format: Intel SDM Volume 2, Chapter 2
ModR/M Encoding: X86 Opcode Reference
Linear vs Recursive Descent: “Practical Binary Analysis” Ch. 6

Difficulty: Advanced
Time estimate: 2-4 weeks
Prerequisites: Projects 1-2, solid x86 assembly knowledge

Real World Outcome

Deliverables:

Analysis output or tooling scripts
Report with control/data flow notes

Validation checklist:

Parses sample binaries correctly
Findings are reproducible in debugger

No unsafe execution outside lab

Example session (illustrative; the call target is left elided):

$ ./disasm program.bin
00000000: 55                    push rbp
00000001: 48 89 e5              mov rbp, rsp
00000004: 48 83 ec 40           sub rsp, 0x40
00000008: 48 8d 45 c0           lea rax, [rbp-0x40]
0000000c: 48 89 c7              mov rdi, rax
0000000f: e8 xx xx xx xx        call 0x????????
00000014: 31 c0                 xor eax, eax
00000016: c9                    leave
00000017: c3                    ret
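The listing above follows an offset, hex bytes, mnemonic layout. A minimal sketch of a formatter for one such line is shown here; the function name format_listing_line and the 22-character hex column are assumptions made for illustration, not part of the project specification.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Width of the hex-byte column: an assumption that fits instructions up to 7 bytes. */
#define HEX_COLUMN_WIDTH 22

/*
 * Format one listing line as "OFFSET: HEX BYTES   TEXT" into out.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if len exceeds 15 or out is too small.
 */
int format_listing_line(uint32_t offset, const uint8_t *bytes, size_t len,
                        const char *text, char *out, size_t outsz)
{
    assert(bytes != NULL && text != NULL && out != NULL);
    if (len > 15)
        return -1;                     /* longer than any legal x86 instruction */

    char hex[46] = "";                 /* 15 bytes -> 44 characters plus NUL */
    size_t pos = 0;
    for (size_t i = 0; i < len; i++) {
        pos += (size_t)snprintf(hex + pos, sizeof hex - pos,
                                i ? " %02x" : "%02x", bytes[i]);
    }

    /* %-*s left-justifies the hex string and pads it to the column width. */
    int n = snprintf(out, outsz, "%08x: %-*s%s",
                     (unsigned)offset, HEX_COLUMN_WIDTH, hex, text);
    return (n < 0 || (size_t)n >= outsz) ? -1 : n;
}
```

Instructions longer than seven bytes overflow the column in this sketch, so a real tool would either widen the column or wrap the hex bytes onto a second line.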

Hints in Layers

x86 instruction format:

[Prefixes] [REX] [Opcode] [ModR/M] [SIB] [Displacement] [Immediate]
   0-4       0-1    1-3      0-1     0-1      0-4           0-8

Start simple:

Handle single-byte opcodes first (push, pop, ret, nop)
Add instructions with ModR/M byte (mov, add, sub)
Add REX prefix support for 64-bit
Add SIB byte for complex addressing
Handle prefixes (operand size, segment override)
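The first step in the list can be sketched as a small lookup over one-byte opcodes. This is an illustrative subset for 64-bit mode with no prefixes; the function name decode_single_byte and the choice of opcodes are assumptions, and a fuller decoder would be table-driven.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 64-bit register names indexed by the low 3 bits of the opcode (no REX.B). */
static const char *const REG64[8] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"
};

/*
 * Decode a one-byte instruction from a tiny subset (64-bit mode, no prefixes).
 * Writes the text into out and returns 1, or returns 0 if the opcode is
 * outside the subset.
 */
int decode_single_byte(uint8_t op, char *out, size_t outsz)
{
    assert(out != NULL && outsz > 0);

    if (op >= 0x50 && op <= 0x57) {            /* 50+r: push r64 */
        snprintf(out, outsz, "push %s", REG64[op & 7]);
    } else if (op >= 0x58 && op <= 0x5F) {     /* 58+r: pop r64 */
        snprintf(out, outsz, "pop %s", REG64[op & 7]);
    } else if (op == 0x90) {
        snprintf(out, outsz, "nop");
    } else if (op == 0xC3) {
        snprintf(out, outsz, "ret");
    } else if (op == 0xC9) {
        snprintf(out, outsz, "leave");
    } else {
        return 0;                              /* not handled by this sketch */
    }
    return 1;
}
```

Returning 0 for unknown bytes lets the caller decide whether to print the raw byte or stop, which keeps the policy for invalid opcodes out of the decoder itself.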

Questions to consider:

How do you distinguish mov eax, ebx from mov eax, [ebx]?
What does the REX.W prefix do?
How do you handle instructions with the same opcode but different meanings?

Learning milestones:

Disassemble basic instructions → Single-byte opcodes work
Handle ModR/M byte → Register and memory operands
Support 64-bit mode → REX prefix parsing
Handle all addressing modes → SIB byte, displacements

The Core Question You Are Answering

How does a CPU decode variable-length instruction streams into executable operations, and why is x86 considered one of the most complex instruction sets to disassemble?

Disassembly is the inverse of assembly. You’re recreating human-readable assembly from the raw bytes the CPU executes. Unlike fixed-width RISC architectures, x86/x64 instructions range from 1 to 15 bytes, so an instruction’s boundaries are known only after decoding it, and every decode depends on starting at the right offset.

Concepts You Must Understand First

1. Instruction Encoding and Variable-Length Instructions

x86 is a CISC (Complex Instruction Set Computer) architecture. One instruction might be 1 byte (ret), while another reaches the 15-byte limit once prefixes, a SIB byte, a displacement, and an immediate are all present.

Guiding questions:

Why doesn’t x86 use fixed-width instructions like ARM or MIPS?
How does the CPU know where one instruction ends and the next begins?
What happens if you try to disassemble from the wrong offset (misaligned)?

Key reading: “Computer Systems: A Programmer’s Perspective” Ch. 3.5 (Instruction Encoding), Intel SDM Volume 2A Ch. 2 (Instruction Format)

2. Opcode Tables and Instruction Prefixes

The opcode byte (or bytes) determines what an instruction does, but prefixes placed before it can change operand size, address size, segment, and repeat or lock behavior.

Guiding questions:

What’s the difference between a one-byte opcode and a two-byte opcode (0x0F escape)?
How many prefix bytes can one instruction have?
What does the LOCK prefix do?

Key reading: Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2, “Low-Level Programming” Ch. 3.5 (x86-64 Assembly Language)
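A prefix scanner is usually the first stage of a decoder. The sketch below recognizes the legacy prefix bytes and counts how many lead an instruction; the function names are hypothetical, and it does not enforce the 15-byte instruction limit or detect redundant prefixes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if b is one of the legacy prefix bytes. */
int is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0xF0: case 0xF2: case 0xF3:                 /* LOCK, REPNE, REP */
    case 0x2E: case 0x36: case 0x3E: case 0x26:      /* CS, SS, DS, ES   */
    case 0x64: case 0x65:                            /* FS, GS           */
    case 0x66:                                       /* operand size     */
    case 0x67:                                       /* address size     */
        return 1;
    default:
        return 0;
    }
}

/*
 * Count leading legacy prefix bytes in buf[0..len).
 * The returned index is where a REX byte or the opcode would start.
 */
size_t count_legacy_prefixes(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    size_t i = 0;
    while (i < len && is_legacy_prefix(buf[i]))
        i++;
    return i;
}
```

For the bytes 66 48 89 E5 the count is 1: the 0x66 operand-size prefix is consumed, and the 0x48 that follows is a REX byte, which is handled separately.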

3. ModR/M and SIB Bytes: Operand Encoding

After the opcode comes ModR/M (Mod-Reg-R/M), which encodes register and memory operands. Sometimes a SIB (Scale-Index-Base) byte follows.

Guiding questions:

How does ModR/M encode mov eax, ebx vs mov eax, [ebx]?
When do you need a SIB byte?
What do the Mod field values (00, 01, 10, 11) mean?

Key reading: Intel SDM Volume 2A Section 2.1.5 (ModR/M and SIB Bytes), “Practical Binary Analysis” Ch. 6.2.2 (Linear Disassembly)
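Splitting the ModR/M byte into its three fields is a matter of shifts and masks. The sketch below uses hypothetical names (modrm_t, split_modrm, modrm_needs_sib) and assumes 32-bit or 64-bit addressing.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t mod;  /* bits 7-6: addressing mode */
    uint8_t reg;  /* bits 5-3: register or opcode extension */
    uint8_t rm;   /* bits 2-0: register or memory operand */
} modrm_t;

/* Split a ModR/M byte into its Mod, Reg, and R/M fields. */
modrm_t split_modrm(uint8_t b)
{
    modrm_t m = { (uint8_t)(b >> 6), (uint8_t)((b >> 3) & 7), (uint8_t)(b & 7) };
    return m;
}

/* A SIB byte follows when the operand is memory (mod != 3) and rm == 4. */
int modrm_needs_sib(modrm_t m)
{
    return m.mod != 3 && m.rm == 4;
}
```

With these fields, 0xD8 splits into mod=3, reg=3, rm=0 (a register-to-register form), while a byte with mod below 3 describes a memory operand.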

4. Displacement and Immediate Values

Many instructions have trailing bytes for offsets (displacements) or constants (immediates).

Guiding questions:

How do you know if an instruction has a displacement?
What’s the difference between an 8-bit and 32-bit immediate?
How are signed immediates handled?

Key reading: Intel SDM Volume 2A Section 2.1.4 (Displacement and Immediate Bytes)

5. REX Prefix and 64-bit Mode

x86-64 added REX prefixes to access 64-bit registers (RAX, RBX, etc.) and extended registers (R8-R15).

Guiding questions:

How does the REX.W bit change instruction behavior?
What do REX.R, REX.X, REX.B extend?
Can you have multiple REX prefixes? (No!)

Key reading: “Low-Level Programming” Ch. 8 (x86-64 Architecture), Intel SDM Volume 2A Section 2.2.1 (REX Prefixes)

6. Linear vs. Recursive Descent Disassembly

Two strategies: start at the beginning and decode sequentially (linear), or follow control flow (recursive descent).

Guiding questions:

What are the advantages of linear disassembly?
When does linear disassembly fail? (Hint: inline data)
Why is recursive descent more accurate but incomplete?

Key reading: “Practical Binary Analysis” Ch. 6.2 (Disassembly Algorithms)
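The difference between the two strategies shows up as soon as junk bytes follow an unconditional jump. The sketch below uses a deliberately tiny opcode subset and hypothetical function names. Its flow-following pass handles only jmp rel8 and fall-through, whereas real recursive descent also queues call targets and both sides of conditional branches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Length of the instruction at buf[off] for an illustrative subset:
 * nop (90), ret (C3), push rbp (55), jmp rel8 (EB), call rel32 (E8).
 * Returns 0 if the opcode is unknown or the instruction is truncated.
 */
size_t insn_length(const uint8_t *buf, size_t len, size_t off)
{
    if (off >= len) return 0;
    size_t n;
    switch (buf[off]) {
    case 0x90: case 0xC3: case 0x55: n = 1; break;
    case 0xEB: n = 2; break;
    case 0xE8: n = 5; break;
    default:   return 0;
    }
    return (off + n <= len) ? n : 0;
}

/* Linear sweep: decode back to back from offset 0 and stop at the first failure.
 * Returns the number of instructions decoded. */
size_t linear_sweep(const uint8_t *buf, size_t len)
{
    size_t off = 0, count = 0, n;
    while ((n = insn_length(buf, len, off)) != 0) {
        off += n;
        count++;
    }
    return count;
}

/* Flow-following pass: at a jmp rel8, continue at the target instead of the
 * next byte. Marks each decoded instruction start in seen[] (len entries,
 * zero-initialized by the caller) and returns how many were decoded. */
size_t follow_flow(const uint8_t *buf, size_t len, size_t off, uint8_t *seen)
{
    size_t count = 0, n;
    while (off < len && !seen[off] && (n = insn_length(buf, len, off)) != 0) {
        seen[off] = 1;
        count++;
        if (buf[off] == 0xC3)
            break;                                  /* ret ends this path */
        if (buf[off] == 0xEB) {
            long target = (long)(off + 2) + (int8_t)buf[off + 1];
            if (target < 0)
                break;                              /* target lies before the buffer */
            off = (size_t)target;
        } else {
            off += n;                               /* fall through */
        }
    }
    return count;
}
```

On the bytes EB 01 E8 90 C3 (jump over one junk byte, then nop and ret), the linear sweep decodes the jump and then fails on the junk 0xE8, which claims four operand bytes that are not there. The flow-following pass skips the junk and decodes all three real instructions.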

7. Addressing Modes

x86 has incredibly complex addressing modes: [base + index*scale + displacement].

Guiding questions:

How is mov rax, [rbx + rcx*8 + 0x10] encoded?
Which addressing modes require a SIB byte?
What’s RIP-relative addressing? (x64 only)

Key reading: Intel SDM Volume 1 Section 3.7 (Operand Addressing), “Computer Systems: A Programmer’s Perspective” Ch. 3.5.1 (Operand Specifiers)
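The SIB byte splits the same way as ModR/M, except that the top two bits are a shift count for the scale. The sketch below uses hypothetical names and also shows the address arithmetic the CPU performs, given register values supplied by the caller. It ignores the special cases where index 100 means "no index" and base 101 with Mod=00 means "no base".

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t scale;  /* multiplier: 1, 2, 4, or 8 */
    uint8_t index;  /* bits 5-3 */
    uint8_t base;   /* bits 2-0 */
} sib_t;

/* Split a SIB byte; the 2-bit scale field is stored as a multiplier. */
sib_t split_sib(uint8_t b)
{
    sib_t s = { (uint8_t)(1u << (b >> 6)), (uint8_t)((b >> 3) & 7), (uint8_t)(b & 7) };
    return s;
}

/* Effective address for [base + index*scale + disp], given register values. */
uint64_t effective_address(uint64_t base, uint64_t index, uint8_t scale, int32_t disp)
{
    /* The displacement is sign-extended to 64 bits before the addition. */
    return base + index * scale + (uint64_t)(int64_t)disp;
}
```

A disassembler only needs split_sib; effective_address is included to make the meaning of the printed operand concrete.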

8. Opcode Extensions and Group Encodings

Some opcodes are “groups” where the Reg field of ModR/M selects the actual instruction.

Guiding questions:

What is an opcode extension?
How do you decode 0xF7 /0 vs 0xF7 /4? (test vs mul)
Why does x86 use this complexity?

Key reading: Intel SDM Volume 2 Appendix A (Opcode Map), “Practical Binary Analysis” Ch. 6.2.2
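For a group opcode, the Reg field of ModR/M indexes a second table instead of naming a register. A sketch for the 0xF7 group follows; the names GROUP3 and group3_mnemonic are hypothetical, and the /1 slot is left empty on the assumption that the decoder follows Intel's opcode map, which does not assign it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mnemonics for opcode 0xF7, indexed by ModR/M reg (/0 through /7). */
static const char *const GROUP3[8] = {
    "test", NULL, "not", "neg", "mul", "imul", "div", "idiv"
};

/* Return the mnemonic selected by the ModR/M byte, or NULL if unassigned. */
const char *group3_mnemonic(uint8_t modrm)
{
    return GROUP3[(modrm >> 3) & 7];
}
```

Only /0 (test) is followed by an immediate in this group, so the table entry also determines how many bytes remain to be read, which a fuller table would record alongside the mnemonic.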

Questions to Guide Your Design

Will you build your own opcode tables or use a library? Capstone is comprehensive, but building tables teaches you deeply. Which path aligns with your goals?
How will you handle invalid or undocumented opcodes? Should you show raw bytes, throw an error, or use heuristics?
What output format will you produce? Intel syntax (mov eax, ebx) or AT&T syntax (movl %ebx, %eax)? Both have audiences.
Will you support only one architecture (x86-64) or multiple? Supporting x86, x86-64, ARM, etc. requires modular design.
How will you display operands? Show registers by name (RAX) or encoding (0x0)? Hex or decimal for immediates?
What’s your strategy for multi-byte opcodes? x86 has 1-byte, 2-byte (0x0F), and 3-byte (0x0F 0x38/0x3A) opcodes.
Will you implement linear or recursive descent? Or both as a comparative tool?
How will you handle instruction prefixes? Prefixes modify opcodes—do you show them separately or integrate into the instruction?

Thinking Exercise

Before coding, manually disassemble these byte sequences:

Exercise 1: Simple Instructions

Given bytes: 55 48 89 E5 48 83 EC 40

Using Intel SDM:

55 → Look up in opcode table → push rbp (or push ebp in 32-bit)
48 89 E5 → REX.W prefix, opcode 0x89, ModR/M 0xE5
- REX.W → 64-bit operands
- 0x89 → MOV r/m, r
- ModR/M 0xE5 → Mod=11 (register), Reg=100 (ESP/RSP), R/M=101 (EBP/RBP)
- Result: mov rbp, rsp
Continue for remaining bytes

Write out each step. This cements the decode process.

Exercise 2: Memory Operands

Bytes: 48 8D 45 C0

Decode:

48 → REX.W (64-bit)
8D → LEA (Load Effective Address)
45 C0 → ModR/M + Displacement
- ModR/M 0x45 → Mod=01 (8-bit disp), Reg=000 (RAX), R/M=101 (RBP)
- Displacement: 0xC0 = -64 (signed byte)
Result: lea rax, [rbp-0x40]

Exercise 3: SIB Byte Usage

Bytes: 48 89 8C CD 00 00 00 00

Decode manually:

REX prefix?
Opcode?
ModR/M byte → triggers SIB?
SIB byte → Scale, Index, Base?
Displacement?

Expected: Something like mov [rbp+rcx*8], rcx (the encoding carries a 32-bit displacement of zero, which some tools print explicitly as +0x0)

Exercise 4: Compare Tools

echo -ne '\x55\x48\x89\xe5\x48\x83\xec\x40' > test.bin
objdump -D -b binary -m i386:x86-64 test.bin

Compare your manual work to objdump. Where do they differ? Why?

Also try:

ndisasm -b64 test.bin

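Exercise 5: Misalignment Experiment

Take a known instruction sequence. Start disassembling from offset+1 instead of offset 0.

What happens? You usually get nonsense for at least a few instructions. x86 instructions have no alignment requirement, so the issue is not alignment as such but starting on a true instruction boundary. This is also why “desynchronization” attacks work on linear disassemblers.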

The Interview Questions They’ll Ask

“What’s the difference between linear and recursive descent disassembly?”
- Linear: Start at the beginning of a code region and decode every byte sequentially. Fast and complete in coverage, but fooled by inline data or obfuscation. Recursive descent: Follow control flow (jumps, calls) and disassemble only reachable code. More accurate, but misses code reached only through indirect jumps or calls.
“How do you handle x86’s variable-length instructions?”
- Parse byte-by-byte: decode prefixes, opcode, ModR/M, SIB, displacement, immediate. Each field’s presence depends on previous fields. Requires state machine or careful offset tracking.
“What’s the REX prefix and why is it necessary?”
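- REX is a one-byte prefix (0x40-0x4F) available only in 64-bit mode. REX.W selects 64-bit operands. REX.R, REX.X, and REX.B extend the ModR/M Reg field, the SIB Index field, and the ModR/M R/M field (or SIB Base, or the register encoded in the opcode) to access R8-R15 registers.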
“Explain ModR/M encoding with an example.”
- ModR/M has 3 fields: Mod (2 bits), Reg (3 bits), R/M (3 bits). Example: mov eax, ebx (0x89 0xD8). 0x89 = MOV r/m, r. 0xD8 = Mod:11, Reg:011 (EBX), R/M:000 (EAX). Result: move EBX to EAX.
“When is a SIB byte present?”
- When ModR/M R/M field = 100 (binary) and Mod ≠ 11. SIB allows complex addressing: [base + index*scale + disp].
“How do you disassemble encrypted or packed code?”
- You can’t—encrypted bytes are meaningless until decrypted. Dynamic analysis: run the code, let it decrypt itself, then dump and disassemble memory.
“What are opcode extensions and why do they exist?”
- Some opcodes (like 0xF7) use ModR/M Reg field to select the actual instruction. 0xF7 /0 = TEST, /4 = MUL, /6 = DIV. Saves opcode space.
“How does x86 differ from ARM for disassembly?”
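- Classic ARM (A32) and AArch64 use fixed 32-bit instructions, and Thumb mixes 16-bit and 32-bit encodings, so finding instruction boundaries is straightforward. x86 is variable-length (1-15 bytes) with optional prefixes, so each instruction must be decoded to find where the next one starts.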
“What’s the challenge with self-modifying code?”
- Code that changes its own bytes at runtime. Your static disassembly is wrong after modification. Requires dynamic disassembly (disassemble from memory, not file).
“Why would a malware author use opaque predicates or junk bytes?”
- To break linear disassemblers. Insert jmp label; [garbage bytes]; label:. Linear disassemblers try to decode garbage. Recursive descent skips it.

Books That Will Help

Topic	Book	Chapter/Section
x86 Instruction Format	Intel 64/IA-32 Software Developer’s Manual Vol. 2A	Ch. 2: Instruction Format
Instruction Encoding	“Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron	Ch. 3.5: Arithmetic and Logical Operations (encoding examples)
Disassembly Algorithms	“Practical Binary Analysis” by Dennis Andriesse	Ch. 6.2: Static Disassembly (Linear vs Recursive Descent)
x86-64 Architecture	“Low-Level Programming” by Igor Zhirkov	Ch. 3: Assembly Language, Ch. 8: x86-64
ModR/M and SIB Bytes	Intel SDM Volume 2A	Section 2.1.3-2.1.5: ModR/M, SIB, and Displacement
REX Prefix	Intel SDM Volume 2A	Section 2.2.1: REX Prefixes
Opcode Map	Intel SDM Volume 2	Appendix A: Opcode Map
Addressing Modes	“Computer Systems: A Programmer’s Perspective”	Ch. 3.5.1: Operand Specifiers
Assembly Syntax	“Low-Level Programming”	Ch. 3.2: Assembly Language Syntax
Disassembly Tools	“Practical Binary Analysis”	Ch. 5: Basic Binary Analysis in Linux
Instruction Reference	Intel SDM Volume 2B-2D	Instruction Set Reference (A-Z)
Anti-Disassembly	“Practical Malware Analysis” by Sikorski & Honig	Ch. 15: Anti-Disassembly
Obfuscation Techniques	“Practical Binary Analysis”	Ch. 6.2.5: Code Obfuscation
Building Disassemblers	“Engineering a Compiler” by Cooper & Torczon	Ch. 4: Intermediate Representations (related concepts)

ASCII Diagram: x86-64 Instruction Structure

Maximum instruction length: 15 bytes

+----------+-----+-----+--------+-------+-----+--------------+-----------+
| Prefixes | REX | Opc | ModR/M |  SIB  | Dsp |  Immediate   |  Total    |
+----------+-----+-----+--------+-------+-----+--------------+-----------+
| 0-4 bytes| 0-1 | 1-3 |  0-1   |  0-1  | 0-4 |    0-8       | 1-15 bytes|
+----------+-----+-----+--------+-------+-----+--------------+-----------+
| Optional | Opt | Req | Opt    | Opt   | Opt |   Optional   |           |
+----------+-----+-----+--------+-------+-----+--------------+-----------+

Prefixes (0-4 bytes):
  - Lock and Repeat: F0, F2, F3
  - Segment Override: 2E, 36, 3E, 26, 64, 65
  - Operand-size Override: 66
  - Address-size Override: 67

REX Prefix (x64 only, 0-1 byte):
  0100WRXB
    W = 1: 64-bit operand size
    R = extends ModR/M Reg field
    X = extends SIB Index field
    B = extends ModR/M R/M or SIB Base field

Opcode (1-3 bytes):
  - 1-byte: Most common (add, mov, push, pop, etc.)
  - 2-byte: 0x0F escape code + opcode (syscall, movss, etc.)
  - 3-byte: 0x0F 0x38/0x3A + opcode (SSSE3, SSE4; AVX reuses these maps through the VEX prefix)

ModR/M (0-1 byte): Present for most instructions
  +----+----+----+
  |Mod |Reg |R/M |  (2 bits | 3 bits | 3 bits)
  +----+----+----+
  Mod: Addressing mode
    00 = [R/M]           (exception: R/M=101 means disp32 only; RIP-relative in 64-bit mode)
    01 = [R/M + disp8]
    10 = [R/M + disp32]
    11 = R/M (register direct)
  Reg: Register operand or opcode extension
  R/M: Register or memory operand

SIB (0-1 byte): Present when ModR/M R/M = 100 and Mod ≠ 11
  +-----+-----+------+
  |Scale|Index| Base |  (2 bits | 3 bits | 3 bits)
  +-----+-----+------+
  Encodes: [Base + Index*Scale + Displacement]
  Scale: 1, 2, 4, or 8

Displacement (0-4 bytes):
  - 0 bytes: None
  - 1 byte: disp8 (signed -128 to +127)
  - 4 bytes: disp32 (signed)

Immediate (0-8 bytes):
  - 1, 2, 4, or 8 bytes depending on instruction
  - Constants in mov, add, sub, cmp, etc.

Example Instruction Breakdown: mov rax, [rbp+rcx*8-0x40]

Bytes: 48 8B 44 CD C0

48        = REX.W (64-bit operands)
8B        = Opcode (MOV r64, r/m64)
44        = ModR/M (Mod=01, Reg=000 (RAX), R/M=100 (needs SIB))
CD        = SIB (Scale=11 (8), Index=001 (RCX), Base=101 (RBP))
C0        = Displacement (-0x40 as signed byte)

Decoding:
  - REX.W → 64-bit operation
  - Opcode 0x8B → MOV destination, source (r, r/m)
  - ModR/M: Mod=01 (disp8), Reg=000 (RAX), R/M=100 (SIB follows)
  - SIB: Scale=11 (×8), Index=001 (RCX), Base=101 (RBP)
  - Displacement: 0xC0 = -64 decimal

Result: mov rax, [rbp + rcx*8 - 0x40]
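The breakdown above can be turned into a small operand formatter. This is a sketch under stated assumptions: 64-bit addressing only, no address-size or segment prefixes, Intel-style output, and zero displacements omitted from the text. The function name format_mem_operand is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static const char *const REG64[16] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
    "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15"
};

/*
 * Format the memory operand starting at the ModR/M byte p[0].
 * rex is the REX byte, or 0 if there was none. Returns the number of bytes
 * consumed (ModR/M + SIB + displacement), or 0 on truncation, on a too-small
 * out buffer, or for a register operand (mod == 3), which is out of scope here.
 */
size_t format_mem_operand(const uint8_t *p, size_t len, uint8_t rex,
                          char *out, size_t outsz)
{
    assert(p != NULL && out != NULL);
    if (len < 1) return 0;

    uint8_t mod = p[0] >> 6, rm = p[0] & 7;
    int rex_x = (rex >> 1) & 1, rex_b = rex & 1;
    if (mod == 3) return 0;

    const char *base = NULL, *index = NULL;
    int scale = 1;
    size_t used = 1;
    size_t dsize = (mod == 1) ? 1 : (mod == 2) ? 4 : 0;

    if (rm == 4) {                                  /* SIB byte follows */
        if (len < 2) return 0;
        uint8_t sib = p[used++];
        int idx = ((sib >> 3) & 7) | (rex_x << 3);
        int b = sib & 7;
        scale = 1 << (sib >> 6);
        if (idx != 4) index = REG64[idx];           /* index 100 without REX.X: no index */
        if (b == 5 && mod == 0) dsize = 4;          /* no base register, disp32 only */
        else base = REG64[b | (rex_b << 3)];
    } else if (rm == 5 && mod == 0) {
        base = "rip";                               /* RIP-relative disp32 */
        dsize = 4;
    } else {
        base = REG64[rm | (rex_b << 3)];
    }

    if (len < used + dsize) return 0;               /* displacement is truncated */
    int32_t disp = 0;
    if (dsize == 1) {
        disp = (int8_t)p[used];
    } else if (dsize == 4) {
        disp = (int32_t)((uint32_t)p[used] | ((uint32_t)p[used + 1] << 8) |
                         ((uint32_t)p[used + 2] << 16) | ((uint32_t)p[used + 3] << 24));
    }
    used += dsize;

    /* Build the index and displacement pieces, then join them in one call. */
    char idxbuf[16] = "", dispbuf[16] = "";
    if (index)
        snprintf(idxbuf, sizeof idxbuf, "%s%s*%d", base ? "+" : "", index, scale);
    if (dsize && (disp != 0 || (!base && !index))) {
        long long d = disp;
        snprintf(dispbuf, sizeof dispbuf, "%s0x%llx",
                 d < 0 ? "-" : (base || index) ? "+" : "",
                 (unsigned long long)(d < 0 ? -d : d));
    }
    int w = snprintf(out, outsz, "[%s%s%s]", base ? base : "", idxbuf, dispbuf);
    return (w < 0 || (size_t)w >= outsz) ? 0 : used;
}
```

Called on the bytes 44 CD C0 with REX byte 0x48, this produces [rbp+rcx*8-0x40] and reports three bytes consumed, matching the manual decode. Omitting zero displacements is a presentation choice; a tool that aims for byte-exact round-tripping would print them.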

Key Insight: Disassembly is deterministic at each byte but context-dependent across the stream. Starting from the wrong offset produces garbage. This is why malware uses “desynchronization” attacks—embedding unreachable bytes that look like valid instructions to confuse linear disassemblers.

Common Pitfalls and Debugging

Problem 1: “Your interpretation does not match runtime behavior”

Why: Static analysis can hide runtime-resolved addresses, lazy binding, and input-dependent branches.
Fix: Reproduce the path with debugger or tracer, then compare static assumptions against live register/memory state.
Quick test: Run the same sample through both your static workflow and a debugger transcript, and confirm control-flow decisions align.

Problem 2: “Tool output is inconsistent across machines”

Why: ASLR, tool version drift, and different binary build flags (PIE, RELRO, symbols stripped) change observed addresses and metadata.
Fix: Pin tool versions, capture checksec/metadata, and document environment assumptions in your report.
Quick test: Re-run analysis in a container or VM with pinned tools and compare hashes of generated outputs.

Problem 3: “Analysis accidentally executes unsafe code”

Why: Dynamic workflows run binaries in host context without sufficient isolation.
Fix: Use disposable snapshots, no-network execution, and non-privileged users for all unknown samples.
Quick test: Validate isolation controls first (network disabled, snapshot active, unprivileged user), then execute sample.

Definition of Done

Core functionality works on reference inputs
Edge cases are tested and documented
Results are reproducible (same binary, same tools, same report output)
Analysis notes clearly separate observations, assumptions, and conclusions
Lab safety controls were applied for any dynamic execution

4. Solution Architecture

Input Artifact -> Parse/Decode -> Analysis Engine -> Validation Layer -> Report

Design each stage so intermediate artifacts are inspectable (JSON/text/notes), which makes debugging and peer review much easier.

5. Implementation Phases

Phase 1: Foundation

Define input assumptions and format checks.
Produce a minimal golden output on one known sample.

Phase 2: Core Functionality

Implement full analysis pass for normal cases.
Add validation against an external ground-truth tool.

Phase 3: Hard Cases and Reporting

Add malformed/edge-case handling.
Finalize report template and reproducibility notes.

6. Testing Strategy

Unit-level checks for parser/decoder helpers.
Integration checks against known binaries/challenges.
Regression tests for previously failing cases.

7. Extensions & Challenges

Add automation for batch analysis and comparative reports.
Add confidence scoring for each major finding.
Add export formats suitable for CI/security pipelines.

8. Production Reflection

Map your project output to a production analogue: what reliability, observability, and security controls would be required to run this continuously in an engineering organization?