Project 15: Build a Decompiler
Expanded deep-dive guide for Project 15 from the Binary Analysis sprint.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 5: Master |
| Time Estimate | 2-3 months |
| Main Programming Language | Python |
| Alternative Programming Languages | C++, Rust |
| Coolness Level | Level 5: Pure Magic (Super Cool) |
| Business Potential | 4. The “Open Core” Infrastructure |
| Knowledge Area | Program Analysis / Code Generation |
| Software or Tool | Your disassembler, LLVM (optional) |
| Main Book | “Compilers: Principles, Techniques, and Tools” (Dragon Book) |
1. Learning Objectives
- Build a working implementation with reproducible outputs.
- Justify key design choices with binary-analysis principles.
- Produce an evidence-backed report of findings and limitations.
- Document hardening or next-step improvements.
2. All Theory Needed (Per-Concept Breakdown)
This project depends on concepts from the main sprint primer: loader semantics, control/data-flow recovery, runtime observation, and mitigation-aware vulnerability reasoning. Before implementation, restate the project’s core assumptions in your own words and define how you will validate them.
3. Project Specification
3.1 What You Will Build
A decompiler that converts assembly/IR back into readable C-like pseudocode.
3.2 Functional Requirements
- Accept the target binary/input and validate format assumptions.
- Produce analyzable outputs (console report and/or artifacts).
- Handle malformed inputs safely with explicit errors.
3.3 Non-Functional Requirements
- Reproducibility: same input should produce equivalent findings.
- Safety: unknown samples run only in isolated lab contexts.
- Clarity: separate facts, hypotheses, and inferred conclusions.
3.4 Expanded Project Brief
- File: P15-build-a-decompiler.md
- Main Programming Language: Python
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 5: Master
- Knowledge Area: Program Analysis / Code Generation
- Software or Tool: Your disassembler, LLVM (optional)
- Main Book: “Compilers: Principles, Techniques, and Tools” (Dragon Book)
What you’ll build: A decompiler that converts assembly/IR back into readable C-like pseudocode.
Why it teaches binary analysis: Decompilation is the ultimate reverse engineering skill. Building one means understanding control flow, data flow, and type recovery.
Core challenges you’ll face:
- Control flow recovery → maps to if/else, loops from jumps
- Data flow analysis → maps to variable identification
- Type inference → maps to int vs pointer vs struct
- Code generation → maps to producing readable output
Resources for key challenges:
- Hex-Rays Decompiler Introduction
- What is Decompilation? (PNF Software)
- “Engineering a Compiler” Ch. 8-9
Key Concepts:
- Control Flow Graphs: “Engineering a Compiler” Ch. 8
- SSA Form: “Engineering a Compiler” Ch. 9
- Type Recovery: Academic papers on type inference
- Difficulty: Master
- Time estimate: 2-3 months
- Prerequisites: All previous projects, compiler theory
Real World Outcome
Deliverables:
- Analysis output or tooling scripts
- Report with control/data flow notes
Validation checklist:
- Parses sample binaries correctly
- Findings are reproducible in debugger
- No unsafe execution outside lab

```
Input (disassembly):
push rbp
mov rbp, rsp
sub rsp, 0x20
mov [rbp-0x14], edi
mov [rbp-0x20], rsi
cmp [rbp-0x14], 1
jle .fail
mov rax, [rbp-0x20]
mov rdi, [rax+8]
call atoi
cmp eax, 0x539
jne .fail
lea rdi, [success_msg]
call puts
jmp .end
.fail:
lea rdi, [fail_msg]
call puts
.end:
xor eax, eax
leave
ret

Output (decompiled):
int main(int argc, char **argv) {
    int input;
    if (argc <= 1) {
        puts("Wrong!");
        return 0;
    }
    input = atoi(argv[1]);
    if (input != 1337) {
        puts("Wrong!");
        return 0;
    }
    puts("Correct!");
    return 0;
}
```
Hints in Layers
Decompilation phases:
- Disassembly: Convert bytes to instructions
- Control Flow Graph: Build graph of basic blocks
- Data Flow Analysis: Track value flow through registers
- Type Analysis: Infer types from usage
- Control Flow Structuring: Convert jumps to if/while
- Code Generation: Output C-like code
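The basic-block step of the pipeline above can be sketched with the classic "leader" algorithm. The three-field instruction tuples below are a hypothetical simplified format, not any real disassembler's output:

```python
# Sketch: splitting a linear instruction list into basic blocks using the
# classic "leader" algorithm. Instruction format (hypothetical):
# (address, mnemonic, jump_target_or_None).
def find_basic_blocks(instructions):
    """Return a list of basic blocks, each a list of instructions."""
    branch_ops = {"jmp", "je", "jne", "jle", "jg", "ret"}
    leaders = {instructions[0][0]}                   # rule 1: first instruction
    for i, (addr, op, target) in enumerate(instructions):
        if op in branch_ops:
            if target is not None:
                leaders.add(target)                  # rule 2: branch target
            if i + 1 < len(instructions):
                leaders.add(instructions[i + 1][0])  # rule 3: fall-through
    blocks, current = [], []
    for ins in instructions:
        if ins[0] in leaders and current:            # a leader starts a new block
            blocks.append(current)
            current = []
        current.append(ins)
    if current:
        blocks.append(current)
    return blocks

code = [
    (0x00, "mov", None), (0x04, "cmp", None), (0x08, "jle", 0x14),
    (0x0c, "call", None), (0x10, "jmp", 0x18),
    (0x14, "call", None), (0x18, "ret", None),
]
print(len(find_basic_blocks(code)))  # 4 blocks
```

Note that `call` is deliberately not treated as a block terminator here; many decompilers keep calls inside a block because control returns to the next instruction.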
Control flow structuring algorithms:
- If-then-else: Look for diamond patterns
- While loops: Back edges in CFG
- For loops: Canonical form with counter
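A minimal sketch of diamond matching, assuming a CFG stored as a dict of successor lists (the block names are hypothetical):

```python
# Sketch: recognising an if-then-else "diamond" in a CFG represented as a
# dict mapping block name -> list of successor block names.
def match_diamond(cfg, head):
    """Return (then, else, join) if `head` starts a diamond, else None."""
    succs = cfg.get(head, [])
    if len(succs) != 2:                  # head must be a two-way branch
        return None
    then_b, else_b = succs
    # Both arms must be straight-line blocks converging on the same join.
    if len(cfg.get(then_b, [])) == 1 and len(cfg.get(else_b, [])) == 1 \
            and cfg[then_b][0] == cfg[else_b][0]:
        return then_b, else_b, cfg[then_b][0]
    return None

cfg = {"cond": ["T", "F"], "T": ["join"], "F": ["join"], "join": []}
print(match_diamond(cfg, "cond"))  # ('T', 'F', 'join')
```

A real structurer also has to handle the if-then case (one arm jumps straight to the join) and nested diamonds, usually by collapsing matched regions and repeating.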
Questions to consider:
- How do you detect loop vs if-else?
- How do you recover variable names?
- How do you handle optimized code?
- How do you represent structs?
Start simple:
- Handle single-block functions
- Add if-else handling
- Add while loop detection
- Add function call recovery
- Add type inference
Learning milestones:
- Build CFG from assembly → Basic blocks and edges
- Detect if-else → Diamond pattern recognition
- Detect loops → Back edge identification
- Generate readable code → Produce C-like output
The Core Question You Are Answering
“How do you transform low-level assembly instructions back into high-level readable code, and what makes decompilation fundamentally harder than compilation?”
This project tackles one of the most challenging problems in reverse engineering: recovering source-like code from compiled binaries. Unlike disassembly (which just translates machine code to assembly), decompilation attempts to reconstruct higher-level abstractions like if/while statements, function calls, and even variable types. This is the technology behind IDA’s Hex-Rays and Ghidra’s decompiler.
Concepts You Must Understand First
- Control Flow Graph (CFG) Construction
- CFG is a directed graph where nodes are basic blocks and edges represent jumps
- Basic block: maximal sequence of instructions with single entry and single exit
- CFG is the foundation for all decompilation—it represents program structure
Guiding Questions:
- How do you identify basic block boundaries from assembly?
- What happens to the CFG when indirect jumps (jump tables) are present?
- How do you handle overlapping code or self-modifying code?
Book References:
- “Engineering a Compiler” by Cooper & Torczon - Ch 8: Introduction to Optimization
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Ch 3: Machine-Level Representation
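Once blocks are identified, wiring them into a CFG is mostly about taken-branch and fall-through edges. The block triples below are a hypothetical simplified representation:

```python
# Sketch: connecting basic blocks into a CFG. Blocks are assumed to be
# (start_addr, last_op, jump_target) triples laid out in address order.
def build_cfg(blocks):
    """Map each block start address to its successor start addresses."""
    cfg = {start: [] for start, _, _ in blocks}
    for i, (start, last_op, target) in enumerate(blocks):
        if target is not None:
            cfg[start].append(target)               # taken-branch edge
        falls_through = last_op not in ("jmp", "ret")
        if falls_through and i + 1 < len(blocks):
            cfg[start].append(blocks[i + 1][0])     # fall-through edge
    return cfg

blocks = [(0x00, "jle", 0x14), (0x0c, "jmp", 0x18),
          (0x14, "call", None), (0x18, "ret", None)]
print(build_cfg(blocks))
# {0: [20, 12], 12: [24], 20: [24], 24: []}
```

Indirect jumps are exactly where this sketch breaks down: a `jmp rax` has no static `target`, which is why jump-table recovery needs its own analysis.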
- Static Single Assignment (SSA) Form
- SSA: each variable assigned exactly once, making data flow explicit
- Phi functions merge values at control flow join points
- SSA simplifies many analyses: dead code elimination, constant propagation, type inference
Guiding Questions:
- Why is SSA form useful for decompilation?
- How do you convert assembly to SSA form?
- What are phi functions and when do you need them?
Book References:
- “Engineering a Compiler” by Cooper & Torczon - Ch 9: Data-Flow Analysis
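The renaming half of SSA construction can be sketched for straight-line code; phi functions only arise at control-flow joins, which this toy deliberately ignores. The `(dest, op, srcs)` statement tuples are a hypothetical mini-IR:

```python
# Sketch: renaming straight-line assignments into SSA form. Phi functions
# only appear at control-flow joins, so this handles only the straight-line
# part of the conversion.
from collections import defaultdict

def to_ssa(stmts):
    """Rename (dest, op, srcs) tuples so each dest is assigned exactly once."""
    version = defaultdict(int)
    out = []
    for dest, op, srcs in stmts:
        # Uses refer to the current version of each source variable.
        new_srcs = [f"{s}{version[s]}" if isinstance(s, str) else s
                    for s in srcs]
        version[dest] += 1            # each definition gets a fresh name
        out.append((f"{dest}{version[dest]}", op, new_srcs))
    return out

stmts = [("x", "const", [1]),
         ("x", "add", ["x", 2]),
         ("y", "mul", ["x", "x"])]
for s in to_ssa(stmts):
    print(s)
# ('x1', 'const', [1])
# ('x2', 'add', ['x1', 2])
# ('y1', 'mul', ['x2', 'x2'])
```

With branches, a use at a join point could see either `x1` or `x2`, which is exactly the situation a phi function `x3 = phi(x1, x2)` resolves.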
- Control Flow Structuring
- Converting arbitrary jumps into if/else, while, for, switch statements
- Some CFGs cannot be perfectly structured (irreducible graphs)
- Algorithms: interval analysis, structural analysis, the Phoenix algorithm
Guiding Questions:
- How do you recognize an if-then-else pattern in a CFG (diamond shape)?
- How do you detect loops (back edges in the CFG)?
- What do you do with goto-spaghetti that can’t be structured?
Book References:
- “Compilers: Principles, Techniques, and Tools” (Dragon Book) - Ch 9: Machine-Independent Optimizations
- Research papers on control flow structuring algorithms
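Back-edge detection, the heart of loop recovery, can be sketched with a DFS: an edge whose target is still on the DFS stack closes a cycle, and for reducible CFGs that target is the loop header. The CFG dict below is hypothetical:

```python
# Sketch: finding loop back edges with a depth-first search. An edge
# u -> v is a back edge when v is still on the DFS stack.
def back_edges(cfg, entry):
    edges, on_stack, visited = [], set(), set()

    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for succ in cfg.get(node, []):
            if succ in on_stack:
                edges.append((node, succ))   # back edge: loop detected
            elif succ not in visited:
                dfs(succ)
        on_stack.remove(node)

    dfs(entry)
    return edges

# entry -> loop header; the body jumps back to the header (a while loop).
cfg = {"entry": ["head"], "head": ["body", "exit"],
       "body": ["head"], "exit": []}
print(back_edges(cfg, "entry"))  # [('body', 'head')]
```

A production structurer additionally checks that the header dominates the back edge's source; without that check, irreducible graphs yield "back edges" that do not correspond to any while/for loop.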
- Type Inference and Recovery
- Assembly has no types—everything is bits and bytes
- Type inference uses data flow, operations, and usage patterns
- Challenge: distinguishing int from pointer from struct
Guiding Questions:
- If a value is dereferenced, what does that tell you about its type?
- How do you recover struct layouts from memory access patterns?
- Can you perfectly recover types, or is it fundamentally ambiguous?
Book References:
- Research papers on type inference in binary analysis
- “Practical Binary Analysis” by Dennis Andriesse - Ch 7: Advanced Static Analysis
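The usage-pattern idea above can be sketched as a rule table plus a small evidence lattice; the observation tuples and the two-level lattice are simplifying assumptions:

```python
# Sketch: rule-based type guessing from usage observations. Each
# (variable, usage) pair records one piece of evidence from the assembly.
def infer_types(observations):
    """observations: (var, usage) pairs; returns var -> best-guess type."""
    RULES = {
        "dereferenced": "pointer",   # [rax] means rax holds an address
        "lea_source":   "pointer",   # lea rdi, [x] takes an address
        "arith":        "int",       # add/sub/imul suggest an integer
        "cmp_constant": "int",
    }
    RANK = {"unknown": 0, "int": 1, "pointer": 2}  # pointer evidence wins
    types = {}
    for var, usage in observations:
        guess = RULES.get(usage, "unknown")
        current = types.get(var, "unknown")
        if RANK[guess] > RANK[current]:
            types[var] = guess
    return types

obs = [("rax", "arith"), ("rax", "dereferenced"), ("ecx", "cmp_constant")]
print(infer_types(obs))  # {'rax': 'pointer', 'ecx': 'int'}
```

Note the conflict in the example: rax is used in arithmetic *and* dereferenced, which is legal C (pointer arithmetic) and exactly why binary type recovery is fundamentally ambiguous.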
- Data Flow Analysis
- Tracking how data moves through the program
- Reaching definitions, live variables, available expressions
- Used for variable name recovery and optimization
Guiding Questions:
- How do you identify that two register uses refer to the same logical variable?
- What is def-use chain analysis?
- How does data flow analysis help with decompilation quality?
Book References:
- “Engineering a Compiler” by Cooper & Torczon - Ch 9: Data-Flow Analysis
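Def-use chains are especially simple over SSA-style statements, because each name has exactly one definition. The statement tuples reuse the same hypothetical `(dest, op, srcs)` mini-IR:

```python
# Sketch: building def-use chains over SSA-style statements. Each SSA
# name is defined exactly once, so every use maps to a unique definition.
def def_use_chains(stmts):
    defs = {}                        # SSA name -> index of its definition
    uses = {}                        # SSA name -> indices where it is used
    for i, (dest, _op, srcs) in enumerate(stmts):
        defs[dest] = i
        for s in srcs:
            if isinstance(s, str):   # constants are not variables
                uses.setdefault(s, []).append(i)
    return {v: (defs[v], uses.get(v, [])) for v in defs}

stmts = [("x1", "const", [1]),
         ("x2", "add", ["x1", 2]),
         ("y1", "mul", ["x2", "x2"])]
print(def_use_chains(stmts))
# {'x1': (0, [1]), 'x2': (1, [2, 2]), 'y1': (2, [])}
```

A name with an empty use list (here y1) is immediately dead code unless it escapes the function, which is one reason SSA makes dead-code elimination cheap.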
- Code Generation
- Converting structured control flow and typed variables into readable C-like code
- Pretty-printing, variable naming, comment generation
- Balancing accuracy vs readability
Guiding Questions:
- How do you generate readable variable names when originals are lost?
- Should you preserve all assembly details or simplify for readability?
- How do you handle assembly idioms (e.g., xor eax, eax for zeroing)?
Book References:
- “Compilers: Principles, Techniques, and Tools” (Dragon Book) - Ch 8: Code Generation
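The final emission step can be sketched as a recursive pretty-printer over a structured AST; the node shapes ("if", "while", "call", "assign") are a hypothetical mini-AST, not any real decompiler's IR:

```python
# Sketch: pretty-printing a tiny structured AST into C-like text.
def emit(node, indent=0):
    pad = "    " * indent
    kind = node[0]
    if kind == "assign":
        return f"{pad}{node[1]} = {node[2]};"
    if kind == "call":
        return f"{pad}{node[1]}({', '.join(node[2])});"
    if kind == "if":
        _, cond, then_body = node
        body = "\n".join(emit(s, indent + 1) for s in then_body)
        return f"{pad}if ({cond}) {{\n{body}\n{pad}}}"
    if kind == "while":
        _, cond, body_nodes = node
        body = "\n".join(emit(s, indent + 1) for s in body_nodes)
        return f"{pad}while ({cond}) {{\n{body}\n{pad}}}"
    raise ValueError(f"unknown node: {kind}")

ast = ("while", "i <= 9", [("call", "print_number", ["i"]),
                           ("assign", "i", "i + 1")])
print(emit(ast))
```

Keeping emission as a separate pass over a structured AST is what lets you tune readability (naming, idiom simplification) without touching the analyses.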
Questions to Guide Your Design
- What IR (Intermediate Representation)? Use LLVM IR, a custom IR, or work directly on assembly?
- How much do you simplify? Preserve every assembly detail or aggressively simplify to C-like code?
- Handling irreducible control flow? Use goto statements or try to restructure?
- Type system depth? Simple (int/pointer), or full (structs, arrays, function pointers)?
- Variable naming strategy? Generic (var1, var2) or heuristic-based (counter, buffer)?
- Testing approach? Compile simple C programs, decompile, and compare with the source?
- Performance vs accuracy? Fast but imperfect, or slow but highly accurate?
- Scope of support? Single functions or whole-program analysis with interprocedural optimization?
Thinking Exercise
Manual decompilation exercise:
- Given this assembly:
```
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 0    ; local var at rbp-4
.L2:
cmp DWORD PTR [rbp-4], 9
jg .L3
mov eax, DWORD PTR [rbp-4]
mov edi, eax
call print_number
add DWORD PTR [rbp-4], 1
jmp .L2
.L3:
leave
ret
```
- Manual decompilation steps:
- Identify basic blocks (entry, loop body, exit)
- Draw the CFG (entry → loop → exit, with back edge)
- Recognize the loop pattern (back edge: the jmp .L2 at the end of the body returns to the .L2 header)
- Identify loop counter ([rbp-4])
- Translate to C:
```
void function() {
    int i = 0;
    while (i <= 9) {
        print_number(i);
        i++;
    }
}
```
- Document:
- CFG: 3 blocks, 1 back edge
- Loop type: while loop (could be for loop)
- Variables: i (int, local at rbp-4)
- Function calls: print_number(int)
The Interview Questions They’ll Ask
- “What’s the difference between disassembly and decompilation?”
- Disassembly: machine code → assembly (1:1 mapping). Decompilation: assembly → high-level code (many:1, lossy).
- “Explain SSA form and why it’s used in decompilers.”
- SSA: each variable assigned once. Simplifies data flow analysis, makes variable usage explicit, enables optimizations.
- “How do you detect a for loop vs while loop vs do-while in assembly?”
- Pattern recognition in CFG: for has initialization, condition, increment. While: condition at start. Do-while: condition at end.
- “What makes control flow structuring hard?”
- Irreducible graphs (can’t be structured without goto), optimizations create complex patterns, jump tables are indirect.
- “How would you infer that a variable is a pointer vs an integer?”
- Pointer: dereferenced, used in lea, compared to addresses. Integer: used in arithmetic, compared to constants.
- “What’s a phi function in SSA form?”
- Merges values from different control flow paths. Example: at loop header, phi(initial_value, updated_value).
- “Explain how you’d recover a struct from memory accesses.”
- Group accesses by base pointer + offset. Offsets reveal field positions. Access types (byte/word/qword) reveal field sizes.
- “Why can’t decompilation be perfect?”
- Information loss: variable names, comments, types, and macros are gone. Optimization obscures structure. Many different source programs compile to the same assembly.
- “How would you handle switch statements with jump tables?”
- Detect: computed jump through table. Extract table from data section. Each entry is a case. Reconstruct switch statement.
- “Walk me through decompiling a simple function from scratch.”
- Disassemble → build CFG → identify control structures → convert to SSA → type inference → code generation → pretty print.
Books That Will Help
| Topic | Book | Chapters |
|---|---|---|
| Control Flow Analysis | “Engineering a Compiler” by Cooper & Torczon | Ch 8-9 |
| Compiler Fundamentals | “Compilers: Principles, Techniques, and Tools” (Dragon Book) | Ch 8-9 |
| Binary Analysis | “Practical Binary Analysis” by Dennis Andriesse | Ch 6-7 |
| Machine-Level Details | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch 3 |
| Assembly Language | “Low-Level Programming” by Igor Zhirkov | Ch 4-5 |
| Advanced Topics | Research papers on decompilation | “Native x86 Decompilation Using Semantics-Preserving Structural Analysis” “No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring” |
Common Pitfalls and Debugging
Problem 1: “Your interpretation does not match runtime behavior”
- Why: Static analysis can hide runtime-resolved addresses, lazy binding, and input-dependent branches.
- Fix: Reproduce the path with debugger or tracer, then compare static assumptions against live register/memory state.
- Quick test: Run the same sample through both your static workflow and a debugger transcript, and confirm control-flow decisions align.
Problem 2: “Tool output is inconsistent across machines”
- Why: ASLR, tool version drift, and different binary build flags (PIE, RELRO, symbols stripped) change observed addresses and metadata.
- Fix: Pin tool versions, capture checksec/metadata, and document environment assumptions in your report.
- Quick test: Re-run analysis in a container or VM with pinned tools and compare hashes of generated outputs.
Problem 3: “Analysis accidentally executes unsafe code”
- Why: Dynamic workflows run binaries in host context without sufficient isolation.
- Fix: Use disposable snapshots, no-network execution, and non-privileged users for all unknown samples.
- Quick test: Validate isolation controls first (network disabled, snapshot active, unprivileged user), then execute sample.
Definition of Done
- Core functionality works on reference inputs
- Edge cases are tested and documented
- Results are reproducible (same binary, same tools, same report output)
- Analysis notes clearly separate observations, assumptions, and conclusions
- Lab safety controls were applied for any dynamic execution
4. Solution Architecture
Input Artifact -> Parse/Decode -> Analysis Engine -> Validation Layer -> Report
Design each stage so intermediate artifacts are inspectable (JSON/text/notes), which makes debugging and peer review much easier.
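The stage pipeline above can be sketched as a driver that persists one JSON artifact per stage; the three stage functions here are hypothetical stand-ins for your real decoder, analysis engine, and validation layer:

```python
# Sketch: a stage pipeline where every intermediate result is written out
# as an inspectable JSON artifact. Stage bodies are placeholder stand-ins.
import json
import pathlib

def run_pipeline(binary_path, workdir="artifacts"):
    out = pathlib.Path(workdir)
    out.mkdir(exist_ok=True)
    stages = [
        ("decoded",   lambda _:    {"instructions": ["push rbp", "ret"]}),
        ("analysis",  lambda prev: {"blocks": 1, "edges": 0}),
        ("validated", lambda prev: {"ok": prev["blocks"] >= 1}),
    ]
    artifact = {"input": str(binary_path)}
    for name, stage in stages:
        artifact = stage(artifact)
        # Persist every intermediate result so peers can inspect and diff it.
        (out / f"{name}.json").write_text(json.dumps(artifact, indent=2))
    return artifact

print(run_pipeline("sample.bin"))  # {'ok': True}
```

Diffing two runs' artifact directories is then a one-command reproducibility check, which feeds directly into the validation checklist above.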
5. Implementation Phases
Phase 1: Foundation
- Define input assumptions and format checks.
- Produce a minimal golden output on one known sample.
Phase 2: Core Functionality
- Implement full analysis pass for normal cases.
- Add validation against an external ground-truth tool.
Phase 3: Hard Cases and Reporting
- Add malformed/edge-case handling.
- Finalize report template and reproducibility notes.
6. Testing Strategy
- Unit-level checks for parser/decoder helpers.
- Integration checks against known binaries/challenges.
- Regression tests for previously failing cases.
7. Extensions & Challenges
- Add automation for batch analysis and comparative reports.
- Add confidence scoring for each major finding.
- Add export formats suitable for CI/security pipelines.
8. Production Reflection
Map your project output to a production analogue: what reliability, observability, and security controls would be required to run this continuously in an engineering organization?