Project 15: Build a Decompiler
Expanded deep-dive guide for Project 15 from the Binary Analysis sprint.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 5: Master |
| Time Estimate | 2-3 months |
| Main Programming Language | Python |
| Alternative Programming Languages | C++, Rust |
| Coolness Level | Level 5: Pure Magic (Super Cool) |
| Business Potential | 4. The “Open Core” Infrastructure |
| Knowledge Area | Program Analysis / Code Generation |
| Software or Tool | Your disassembler, LLVM (optional) |
| Main Book | “Compilers: Principles, Techniques, and Tools” (Dragon Book) |
1. Learning Objectives
- Build a working implementation with reproducible outputs.
- Justify key design choices with binary-analysis principles.
- Produce an evidence-backed report of findings and limitations.
- Document hardening or next-step improvements.
2. All Theory Needed (Per-Concept Breakdown)
This project depends on concepts from the main sprint primer: loader semantics, control/data-flow recovery, runtime observation, and mitigation-aware vulnerability reasoning. Before implementation, restate the project’s core assumptions in your own words and define how you will validate them.
3. Project Specification
3.1 What You Will Build
A decompiler that converts assembly/IR back into readable C-like pseudocode.
3.2 Functional Requirements
- Accept the target binary/input and validate format assumptions.
- Produce analyzable outputs (console report and/or artifacts).
- Handle malformed inputs safely with explicit errors.
3.3 Non-Functional Requirements
- Reproducibility: same input should produce equivalent findings.
- Safety: unknown samples run only in isolated lab contexts.
- Clarity: separate facts, hypotheses, and inferred conclusions.
3.4 Expanded Project Brief
- File: P15-build-a-decompiler.md
- Main Programming Language: Python
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 5: Master
- Knowledge Area: Program Analysis / Code Generation
- Software or Tool: Your disassembler, LLVM (optional)
- Main Book: “Compilers: Principles, Techniques, and Tools” (Dragon Book)
What you’ll build: A decompiler that converts assembly/IR back into readable C-like pseudocode.
Why it teaches binary analysis: Decompilation is the ultimate reverse engineering skill. Building one means understanding control flow, data flow, and type recovery.
Core challenges you’ll face:
- Control flow recovery → maps to if/else, loops from jumps
- Data flow analysis → maps to variable identification
- Type inference → maps to int vs pointer vs struct
- Code generation → maps to producing readable output
Resources for key challenges:
- Hex-Rays Decompiler Introduction
- What is Decompilation? (PNF Software)
- “Engineering a Compiler” Ch. 8-9
Key Concepts:
- Control Flow Graphs: “Engineering a Compiler” Ch. 8
- SSA Form: “Engineering a Compiler” Ch. 9
- Type Recovery: Academic papers on type inference
- Difficulty: Master
- Time estimate: 2-3 months
- Prerequisites: All previous projects, compiler theory
Real World Outcome
Deliverables:
- Analysis output or tooling scripts
- Report with control/data flow notes
Validation checklist:
- Parses sample binaries correctly
- Findings are reproducible in debugger
- No unsafe execution outside lab

```
Input (disassembly):
push rbp
mov rbp, rsp
sub rsp, 0x20
mov [rbp-0x14], edi
mov [rbp-0x20], rsi
cmp [rbp-0x14], 1
jle .fail
mov rax, [rbp-0x20]
mov rdi, [rax+8]
call atoi
cmp eax, 0x539
jne .fail
lea rdi, [success_msg]
call puts
jmp .end
.fail:
lea rdi, [fail_msg]
call puts
.end:
xor eax, eax
leave
ret

Output (decompiled):
int main(int argc, char **argv) {
    int input;
    if (argc <= 1) {
        puts("Wrong!");
        return 0;
    }
    input = atoi(argv[1]);
    if (input != 1337) {
        puts("Wrong!");
        return 0;
    }
    puts("Correct!");
    return 0;
}
```
Hints in Layers
Decompilation phases:
- Disassembly: Convert bytes to instructions
- Control Flow Graph: Build graph of basic blocks
- Data Flow Analysis: Track value flow through registers
- Type Analysis: Infer types from usage
- Control Flow Structuring: Convert jumps to if/while
- Code Generation: Output C-like code
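The basic-block step of the pipeline above can be sketched with the classic "leader" algorithm. The three-field instruction tuples below are a hypothetical simplified format, not any real disassembler's output:

```python
# Sketch: splitting a linear instruction list into basic blocks using the
# classic "leader" algorithm. Instruction format (hypothetical):
# (address, mnemonic, jump_target_or_None).
def find_basic_blocks(instructions):
    """Return a list of basic blocks, each a list of instructions."""
    branch_ops = {"jmp", "je", "jne", "jle", "jg", "ret"}
    leaders = {instructions[0][0]}                   # rule 1: first instruction
    for i, (addr, op, target) in enumerate(instructions):
        if op in branch_ops:
            if target is not None:
                leaders.add(target)                  # rule 2: branch target
            if i + 1 < len(instructions):
                leaders.add(instructions[i + 1][0])  # rule 3: fall-through
    blocks, current = [], []
    for ins in instructions:
        if ins[0] in leaders and current:            # a leader starts a new block
            blocks.append(current)
            current = []
        current.append(ins)
    if current:
        blocks.append(current)
    return blocks

code = [
    (0x00, "mov", None), (0x04, "cmp", None), (0x08, "jle", 0x14),
    (0x0c, "call", None), (0x10, "jmp", 0x18),
    (0x14, "call", None), (0x18, "ret", None),
]
print(len(find_basic_blocks(code)))  # 4 blocks
```

Note that `call` is deliberately not treated as a block terminator here; many decompilers keep calls inside a block because control returns to the next instruction.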
Control flow structuring algorithms:
- If-then-else: Look for diamond patterns
- While loops: Back edges in CFG
- For loops: Canonical form with counter
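A minimal sketch of diamond matching, assuming a CFG stored as a dict of successor lists (the block names are hypothetical):

```python
# Sketch: recognising an if-then-else "diamond" in a CFG represented as a
# dict mapping block name -> list of successor block names.
def match_diamond(cfg, head):
    """Return (then, else, join) if `head` starts a diamond, else None."""
    succs = cfg.get(head, [])
    if len(succs) != 2:                  # head must be a two-way branch
        return None
    then_b, else_b = succs
    # Both arms must be straight-line blocks converging on the same join.
    if len(cfg.get(then_b, [])) == 1 and len(cfg.get(else_b, [])) == 1 \
            and cfg[then_b][0] == cfg[else_b][0]:
        return then_b, else_b, cfg[then_b][0]
    return None

cfg = {"cond": ["T", "F"], "T": ["join"], "F": ["join"], "join": []}
print(match_diamond(cfg, "cond"))  # ('T', 'F', 'join')
```

A real structurer also has to handle the if-then case (one arm jumps straight to the join) and nested diamonds, usually by collapsing matched regions and repeating.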
Questions to consider:
- How do you detect loop vs if-else?
- How do you recover variable names?
- How do you handle optimized code?
- How do you represent structs?
Start simple:
- Handle single-block functions
- Add if-else handling
- Add while loop detection
- Add function call recovery
- Add type inference
Learning milestones:
- Build CFG from assembly → Basic blocks and edges
- Detect if-else → Diamond pattern recognition
- Detect loops → Back edge identification
- Generate readable code → Produce C-like output
The Core Question You Are Answering
“How do you transform low-level assembly instructions back into high-level readable code, and what makes decompilation fundamentally harder than compilation?”
This project tackles one of the most challenging problems in reverse engineering: recovering source-like code from compiled binaries. Unlike disassembly (which just translates machine code to assembly), decompilation attempts to reconstruct higher-level abstractions like if/while statements, function calls, and even variable types. This is the technology behind IDA’s Hex-Rays and Ghidra’s decompiler.
Concepts You Must Understand First
- Control Flow Graph (CFG) Construction
- CFG is a directed graph where nodes are basic blocks and edges represent jumps
- Basic block: maximal sequence of instructions with single entry and single exit
- CFG is the foundation for all decompilation—it represents program structure
Guiding Questions:
- How do you identify basic block boundaries from assembly?
- What happens to the CFG when indirect jumps (jump tables) are present?
- How do you handle overlapping code or self-modifying code?
Book References:
- “Engineering a Compiler” by Cooper & Torczon - Ch 8: Introduction to Optimization
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Ch 3: Machine-Level Representation
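Once blocks are identified, wiring them into a CFG is mostly about taken-branch and fall-through edges. The block triples below are a hypothetical simplified representation:

```python
# Sketch: connecting basic blocks into a CFG. Blocks are assumed to be
# (start_addr, last_op, jump_target) triples laid out in address order.
def build_cfg(blocks):
    """Map each block start address to its successor start addresses."""
    cfg = {start: [] for start, _, _ in blocks}
    for i, (start, last_op, target) in enumerate(blocks):
        if target is not None:
            cfg[start].append(target)               # taken-branch edge
        falls_through = last_op not in ("jmp", "ret")
        if falls_through and i + 1 < len(blocks):
            cfg[start].append(blocks[i + 1][0])     # fall-through edge
    return cfg

blocks = [(0x00, "jle", 0x14), (0x0c, "jmp", 0x18),
          (0x14, "call", None), (0x18, "ret", None)]
print(build_cfg(blocks))
# {0: [20, 12], 12: [24], 20: [24], 24: []}
```

Indirect jumps are exactly where this sketch breaks down: a `jmp rax` has no static `target`, which is why jump-table recovery needs its own analysis.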
- Static Single Assignment (SSA) Form
- SSA: each variable assigned exactly once, making data flow explicit
- Phi functions merge values at control flow join points
- SSA simplifies many analyses: dead code elimination, constant propagation, type inference
Guiding Questions:
- Why is SSA form useful for decompilation?
- How do you convert assembly to SSA form?
- What are phi functions and when do you need them?
Book References:
- “Engineering a Compiler” by Cooper & Torczon - Ch 9: Data-Flow Analysis
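The renaming half of SSA construction can be sketched for straight-line code; phi functions only arise at control-flow joins, which this toy deliberately ignores. The `(dest, op, srcs)` statement tuples are a hypothetical mini-IR:

```python
# Sketch: renaming straight-line assignments into SSA form. Phi functions
# only appear at control-flow joins, so this handles only the straight-line
# part of the conversion.
from collections import defaultdict

def to_ssa(stmts):
    """Rename (dest, op, srcs) tuples so each dest is assigned exactly once."""
    version = defaultdict(int)
    out = []
    for dest, op, srcs in stmts:
        # Uses refer to the current version of each source variable.
        new_srcs = [f"{s}{version[s]}" if isinstance(s, str) else s
                    for s in srcs]
        version[dest] += 1            # each definition gets a fresh name
        out.append((f"{dest}{version[dest]}", op, new_srcs))
    return out

stmts = [("x", "const", [1]),
         ("x", "add", ["x", 2]),
         ("y", "mul", ["x", "x"])]
for s in to_ssa(stmts):
    print(s)
# ('x1', 'const', [1])
# ('x2', 'add', ['x1', 2])
# ('y1', 'mul', ['x2', 'x2'])
```

With branches, a use at a join point could see either `x1` or `x2`, which is exactly the situation a phi function `x3 = phi(x1, x2)` resolves.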
- Control Flow Structuring
- Converting arbitrary jumps into if/else, while, for, switch statements
- Some CFGs cannot be perfectly structured (irreducible graphs)
- Algorithms: interval analysis, structural analysis, the Phoenix algorithm
Guiding Questions:
- How do you recognize an if-then-else pattern in a CFG (diamond shape)?
- How do you detect loops (back edges in the CFG)?
- What do you do with goto-spaghetti that can’t be structured?
Book References:
- “Compilers: Principles, Techniques, and Tools” (Dragon Book) - Ch 9: Machine-Independent Optimizations
- Research papers on control flow structuring algorithms
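Back-edge detection, the heart of loop recovery, can be sketched with a DFS: an edge whose target is still on the DFS stack closes a cycle, and for reducible CFGs that target is the loop header. The CFG dict below is hypothetical:

```python
# Sketch: finding loop back edges with a depth-first search. An edge
# u -> v is a back edge when v is still on the DFS stack.
def back_edges(cfg, entry):
    edges, on_stack, visited = [], set(), set()

    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for succ in cfg.get(node, []):
            if succ in on_stack:
                edges.append((node, succ))   # back edge: loop detected
            elif succ not in visited:
                dfs(succ)
        on_stack.remove(node)

    dfs(entry)
    return edges

# entry -> loop header; the body jumps back to the header (a while loop).
cfg = {"entry": ["head"], "head": ["body", "exit"],
       "body": ["head"], "exit": []}
print(back_edges(cfg, "entry"))  # [('body', 'head')]
```

A production structurer additionally checks that the header dominates the back edge's source; without that check, irreducible graphs yield "back edges" that do not correspond to any while/for loop.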
- Type Inference and Recovery
- Assembly has no types—everything is bits and bytes
- Type inference uses data flow, operations, and usage patterns
- Challenge: distinguishing int from pointer from struct
Guiding Questions:
- If a value is dereferenced, what does that tell you about its type?
- How do you recover struct layouts from memory access patterns?
- Can you perfectly recover types, or is it fundamentally ambiguous?
Book References:
- Research papers on type inference in binary analysis
- “Practical Binary Analysis” by Dennis Andriesse - Ch 7: Advanced Static Analysis
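The usage-pattern idea above can be sketched as a rule table plus a small evidence lattice; the observation tuples and the two-level lattice are simplifying assumptions:

```python
# Sketch: rule-based type guessing from usage observations. Each
# (variable, usage) pair records one piece of evidence from the assembly.
def infer_types(observations):
    """observations: (var, usage) pairs; returns var -> best-guess type."""
    RULES = {
        "dereferenced": "pointer",   # [rax] means rax holds an address
        "lea_source":   "pointer",   # lea rdi, [x] takes an address
        "arith":        "int",       # add/sub/imul suggest an integer
        "cmp_constant": "int",
    }
    RANK = {"unknown": 0, "int": 1, "pointer": 2}  # pointer evidence wins
    types = {}
    for var, usage in observations:
        guess = RULES.get(usage, "unknown")
        current = types.get(var, "unknown")
        if RANK[guess] > RANK[current]:
            types[var] = guess
    return types

obs = [("rax", "arith"), ("rax", "dereferenced"), ("ecx", "cmp_constant")]
print(infer_types(obs))  # {'rax': 'pointer', 'ecx': 'int'}
```

Note the conflict in the example: rax is used in arithmetic *and* dereferenced, which is legal C (pointer arithmetic) and exactly why binary type recovery is fundamentally ambiguous.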
- Data Flow Analysis
- Tracking how data moves through the program
- Reaching definitions, live variables, available expressions
- Used for variable name recovery and optimization
Guiding Questions:
- How do you identify that two register uses refer to the same logical variable?
- What is def-use chain analysis?
- How does data flow analysis help with decompilation quality?
Book References:
- “Engineering a Compiler” by Cooper & Torczon - Ch 9: Data-Flow Analysis
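Def-use chains are especially simple over SSA-style statements, because each name has exactly one definition. The statement tuples reuse the same hypothetical `(dest, op, srcs)` mini-IR:

```python
# Sketch: building def-use chains over SSA-style statements. Each SSA
# name is defined exactly once, so every use maps to a unique definition.
def def_use_chains(stmts):
    defs = {}                        # SSA name -> index of its definition
    uses = {}                        # SSA name -> indices where it is used
    for i, (dest, _op, srcs) in enumerate(stmts):
        defs[dest] = i
        for s in srcs:
            if isinstance(s, str):   # constants are not variables
                uses.setdefault(s, []).append(i)
    return {v: (defs[v], uses.get(v, [])) for v in defs}

stmts = [("x1", "const", [1]),
         ("x2", "add", ["x1", 2]),
         ("y1", "mul", ["x2", "x2"])]
print(def_use_chains(stmts))
# {'x1': (0, [1]), 'x2': (1, [2, 2]), 'y1': (2, [])}
```

A name with an empty use list (here y1) is immediately dead code unless it escapes the function, which is one reason SSA makes dead-code elimination cheap.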
- Code Generation
- Converting structured control flow and typed variables into readable C-like code
- Pretty-printing, variable naming, comment generation
- Balancing accuracy vs readability
Guiding Questions:
- How do you generate readable variable names when originals are lost?
- Should you preserve all assembly details or simplify for readability?
- How do you handle assembly idioms (e.g., xor eax, eax for zeroing)?
Book References:
- “Compilers: Principles, Techniques, and Tools” (Dragon Book) - Ch 8: Code Generation
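The final emission step can be sketched as a recursive pretty-printer over a structured AST; the node shapes ("if", "while", "call", "assign") are a hypothetical mini-AST, not any real decompiler's IR:

```python
# Sketch: pretty-printing a tiny structured AST into C-like text.
def emit(node, indent=0):
    pad = "    " * indent
    kind = node[0]
    if kind == "assign":
        return f"{pad}{node[1]} = {node[2]};"
    if kind == "call":
        return f"{pad}{node[1]}({', '.join(node[2])});"
    if kind == "if":
        _, cond, then_body = node
        body = "\n".join(emit(s, indent + 1) for s in then_body)
        return f"{pad}if ({cond}) {{\n{body}\n{pad}}}"
    if kind == "while":
        _, cond, body_nodes = node
        body = "\n".join(emit(s, indent + 1) for s in body_nodes)
        return f"{pad}while ({cond}) {{\n{body}\n{pad}}}"
    raise ValueError(f"unknown node: {kind}")

ast = ("while", "i <= 9", [("call", "print_number", ["i"]),
                           ("assign", "i", "i + 1")])
print(emit(ast))
```

Keeping emission as a separate pass over a structured AST is what lets you tune readability (naming, idiom simplification) without touching the analyses.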
Questions to Guide Your Design
- What IR (Intermediate Representation)? Use LLVM IR, a custom IR, or work directly on assembly?
- How much do you simplify? Preserve every assembly detail or aggressively simplify to C-like code?
- Handling irreducible control flow? Use goto statements or try to restructure?
- Type system depth? Simple (int/pointer), or full (structs, arrays, function pointers)?
- Variable naming strategy? Generic (var1, var2) or heuristic-based (counter, buffer)?
- Testing approach? Compile simple C programs, decompile, and compare with the source?
- Performance vs accuracy? Fast but imperfect, or slow but highly accurate?
- Scope of support? Single functions or whole-program analysis with interprocedural optimization?
Thinking Exercise
Manual decompilation exercise:
- Given this assembly:
```
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 0    ; local var at rbp-4
.L2:
cmp DWORD PTR [rbp-4], 9
jg .L3
mov eax, DWORD PTR [rbp-4]
mov edi, eax
call print_number
add DWORD PTR [rbp-4], 1
jmp .L2
.L3:
leave
ret
```
- Manual decompilation steps:
- Identify basic blocks (entry, loop body, exit)
- Draw the CFG (entry → loop → exit, with back edge)
- Recognize the loop pattern (back edge: the jmp .L2 at the end of the body returns to the .L2 header)
- Identify loop counter ([rbp-4])
- Translate to C:
```
void function() {
    int i = 0;
    while (i <= 9) {
        print_number(i);
        i++;
    }
}
```
- Document:
- CFG: 3 blocks, 1 back edge
- Loop type: while loop (could be for loop)
- Variables: i (int, local at rbp-4)
- Function calls: print_number(int)
The Interview Questions They’ll Ask
- “What’s the difference between disassembly and decompilation?”
- Disassembly: machine code → assembly (1:1 mapping). Decompilation: assembly → high-level code (many:1, lossy).
- “Explain SSA form and why it’s used in decompilers.”
- SSA: each variable assigned once. Simplifies data flow analysis, makes variable usage explicit, enables optimizations.
- “How do you detect a for loop vs while loop vs do-while in assembly?”
- Pattern recognition in CFG: for has initialization, condition, increment. While: condition at start. Do-while: condition at end.
- “What makes control flow structuring hard?”
- Irreducible graphs (can’t be structured without goto), optimizations create complex patterns, jump tables are indirect.
- “How would you infer that a variable is a pointer vs an integer?”
- Pointer: dereferenced, used in lea, compared to addresses. Integer: used in arithmetic, compared to constants.
- “What’s a phi function in SSA form?”
- Merges values from different control flow paths. Example: at loop header, phi(initial_value, updated_value).
- “Explain how you’d recover a struct from memory accesses.”
- Group accesses by base pointer + offset. Offsets reveal field positions. Access types (byte/word/qword) reveal field sizes.
- “Why can’t decompilation be perfect?”
- Information loss: variable names, comments, types, and macros are gone. Optimization obscures structure. Many different source programs compile to the same assembly.
- “How would you handle switch statements with jump tables?”
- Detect: computed jump through table. Extract table from data section. Each entry is a case. Reconstruct switch statement.
- “Walk me through decompiling a simple function from scratch.”
- Disassemble → build CFG → identify control structures → convert to SSA → type inference → code generation → pretty print.
Books That Will Help
| Topic | Book | Chapters |
|---|---|---|
| Control Flow Analysis | “Engineering a Compiler” by Cooper & Torczon | Ch 8-9 |
| Compiler Fundamentals | “Compilers: Principles, Techniques, and Tools” (Dragon Book) | Ch 8-9 |
| Binary Analysis | “Practical Binary Analysis” by Dennis Andriesse | Ch 6-7 |
| Machine-Level Details | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch 3 |
| Assembly Language | “Low-Level Programming” by Igor Zhirkov | Ch 4-5 |
| Advanced Topics | Research papers on decompilation | “Native x86 Decompilation Using Semantics-Preserving Structural Analysis” “No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring” |
Common Pitfalls and Debugging
Problem 1: “Your interpretation does not match runtime behavior”
- Why: Static analysis can hide runtime-resolved addresses, lazy binding, and input-dependent branches.
- Fix: Reproduce the path with debugger or tracer, then compare static assumptions against live register/memory state.
- Quick test: Run the same sample through both your static workflow and a debugger transcript, and confirm control-flow decisions align.
Problem 2: “Tool output is inconsistent across machines”
- Why: ASLR, tool version drift, and different binary build flags (PIE, RELRO, symbols stripped) change observed addresses and metadata.
- Fix: Pin tool versions, capture checksec/metadata, and document environment assumptions in your report.
- Quick test: Re-run analysis in a container or VM with pinned tools and compare hashes of generated outputs.
Problem 3: “Analysis accidentally executes unsafe code”
- Why: Dynamic workflows run binaries in host context without sufficient isolation.
- Fix: Use disposable snapshots, no-network execution, and non-privileged users for all unknown samples.
- Quick test: Validate isolation controls first (network disabled, snapshot active, unprivileged user), then execute sample.
Definition of Done
- Core functionality works on reference inputs
- Edge cases are tested and documented
- Results are reproducible (same binary, same tools, same report output)
- Analysis notes clearly separate observations, assumptions, and conclusions
- Lab safety controls were applied for any dynamic execution
4. Solution Architecture
Input Artifact -> Parse/Decode -> Analysis Engine -> Validation Layer -> Report
Design each stage so intermediate artifacts are inspectable (JSON/text/notes), which makes debugging and peer review much easier.
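The stage pipeline above can be sketched as a driver that persists one JSON artifact per stage; the three stage functions here are hypothetical stand-ins for your real decoder, analysis engine, and validation layer:

```python
# Sketch: a stage pipeline where every intermediate result is written out
# as an inspectable JSON artifact. Stage bodies are placeholder stand-ins.
import json
import pathlib

def run_pipeline(binary_path, workdir="artifacts"):
    out = pathlib.Path(workdir)
    out.mkdir(exist_ok=True)
    stages = [
        ("decoded",   lambda _:    {"instructions": ["push rbp", "ret"]}),
        ("analysis",  lambda prev: {"blocks": 1, "edges": 0}),
        ("validated", lambda prev: {"ok": prev["blocks"] >= 1}),
    ]
    artifact = {"input": str(binary_path)}
    for name, stage in stages:
        artifact = stage(artifact)
        # Persist every intermediate result so peers can inspect and diff it.
        (out / f"{name}.json").write_text(json.dumps(artifact, indent=2))
    return artifact

print(run_pipeline("sample.bin"))  # {'ok': True}
```

Diffing two runs' artifact directories is then a one-command reproducibility check, which feeds directly into the validation checklist above.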
5. Implementation Phases
Phase 1: Foundation
- Define input assumptions and format checks.
- Produce a minimal golden output on one known sample.
Phase 2: Core Functionality
- Implement full analysis pass for normal cases.
- Add validation against an external ground-truth tool.
Phase 3: Hard Cases and Reporting
- Add malformed/edge-case handling.
- Finalize report template and reproducibility notes.
6. Testing Strategy
- Unit-level checks for parser/decoder helpers.
- Integration checks against known binaries/challenges.
- Regression tests for previously failing cases.
7. Extensions & Challenges
- Add automation for batch analysis and comparative reports.
- Add confidence scoring for each major finding.
- Add export formats suitable for CI/security pipelines.
8. Production Reflection
Map your project output to a production analogue: what reliability, observability, and security controls would be required to run this continuously in an engineering organization?