Project 1: ELF File Parser
Expanded deep-dive guide for Project 1 from the Binary Analysis sprint.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Python, Rust, Go |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 1. The “Resume Gold” |
| Knowledge Area | Binary Formats / File Parsing |
| Software or Tool | ELF binaries, hex editor |
| Main Book | “Practical Binary Analysis” by Dennis Andriesse |
1. Learning Objectives
- Build a working implementation with reproducible outputs.
- Justify key design choices with binary-analysis principles.
- Produce an evidence-backed report of findings and limitations.
- Document hardening or next-step improvements.
2. All Theory Needed (Per-Concept Breakdown)
This project depends on concepts from the main sprint primer: loader semantics, control/data-flow recovery, runtime observation, and mitigation-aware vulnerability reasoning. Before implementation, restate the project’s core assumptions in your own words and define how you will validate them.
3. Project Specification
3.1 What You Will Build
A command-line tool that parses ELF files and displays all headers, sections, segments, symbols, and relocations in a human-readable format—like a simplified readelf.
3.2 Functional Requirements
- Accept the target binary/input and validate format assumptions.
- Produce analyzable outputs (console report and/or artifacts).
- Handle malformed inputs safely with explicit errors.
3.3 Non-Functional Requirements
- Reproducibility: same input should produce equivalent findings.
- Safety: unknown samples run only in isolated lab contexts.
- Clarity: separate facts, hypotheses, and inferred conclusions.
3.4 Expanded Project Brief
- File: P01-elf-file-parser.md
- Main Programming Language: C
- Alternative Programming Languages: Python, Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Binary Formats / File Parsing
- Software or Tool: ELF binaries, hex editor
- Main Book: “Practical Binary Analysis” by Dennis Andriesse
What you’ll build: A command-line tool that parses ELF files and displays all headers, sections, segments, symbols, and relocations in a human-readable format—like a simplified readelf.
Why it teaches binary analysis: Every reverse engineering task starts with understanding the file format. Building a parser forces you to understand every byte of the ELF structure.
Core challenges you’ll face:
- Parsing the ELF header → maps to understanding magic bytes, class (32/64-bit), endianness
- Reading program headers → maps to segments, what gets loaded into memory
- Reading section headers → maps to sections, symbols, strings
- Handling different architectures → maps to x86, ARM, MIPS variations
Resources for key challenges:
- Linux Audit - ELF Binaries - Excellent overview
- “Practical Binary Analysis” Chapter 2 - Comprehensive ELF explanation
- `man elf` - The elf(5) manual page, a concise reference for the ELF structures
Key Concepts:
- ELF Header Structure: “Practical Binary Analysis” Ch. 2 - Andriesse
- Program vs Section Headers: elf(5) man page
- Symbol Tables: “Learning ELF” - Can Ozkan (Medium)
Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: C programming, understanding of pointers and structs, familiarity with hexadecimal
Real World Outcome
Deliverables:
- Analysis output or tooling scripts
- Report with control/data flow notes
Validation checklist:
- Parses sample binaries correctly
- Findings are reproducible in debugger
- No unsafe execution outside lab

```bash
$ ./elf_parser /bin/ls
ELF Header:
  Magic:    7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:    ELF64
  Data:     2's complement, little endian
  Version:  1 (current)
  OS/ABI:   UNIX - System V
  Type:     DYN (Shared object file)
  Machine:  AMD x86-64
  Entry:    0x6b10

Program Headers:
  Type    Offset    VirtAddr            FileSiz   MemSiz    Flg
  PHDR    0x000040  0x0000000000000040  0x0002d8  0x0002d8  R
  INTERP  0x000318  0x0000000000000318  0x00001c  0x00001c  R
  LOAD    0x000000  0x0000000000000000  0x003510  0x003510  R
  …

Sections:
  [Nr] Name                Type      Address             Size
  [ 0]                     NULL      0x0000000000000000  0x0
  [ 1] .interp             PROGBITS  0x0000000000000318  0x1c
  [ 2] .note.gnu.build-id  NOTE      0x0000000000000338  0x24
  …

Symbols:
  Num:  Value             Size  Type  Bind    Name
    1:  0000000000000000     0  FUNC  GLOBAL  printf@GLIBC_2.2.5
    2:  0000000000006b10   123  FUNC  GLOBAL  main
  …
```

#### Hints in Layers
Start by mapping the ELF header structure:

```c
// Don't write code, but understand this structure:
// Elf64_Ehdr contains:
//   e_ident[16] - Magic number and other info
//   e_type      - Object file type (ET_EXEC, ET_DYN, etc.)
//   e_machine   - Architecture (EM_X86_64, EM_ARM, etc.)
//   e_entry     - Entry point virtual address
//   e_phoff     - Program header table file offset
//   e_shoff     - Section header table file offset
//   e_phnum     - Number of program headers
//   e_shnum     - Number of section headers
```
Questions to guide your implementation:
- How do you detect if a file is 32-bit or 64-bit ELF?
- How do you find the string table section to get section names?
- What’s the difference between `.dynsym` and `.symtab`?
- How do program headers map sections to memory segments?
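
The first question above can be answered from the 16 `e_ident` bytes alone. The following is a minimal sketch, assuming a Linux host where `<elf.h>` is available; the `ident_status` enum and the `check_ident` function are hypothetical names chosen for illustration, not part of any library.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch. */
typedef enum {
    IDENT_OK, IDENT_TOO_SHORT, IDENT_BAD_MAGIC,
    IDENT_BAD_CLASS, IDENT_BAD_DATA
} ident_status;

/* Inspect the first EI_NIDENT bytes of a file image.
 * On success, *is64 is 1 for ELFCLASS64 and *big_endian is 1 for
 * ELFDATA2MSB. The outputs are only meaningful when IDENT_OK is returned. */
static ident_status check_ident(const unsigned char *buf, size_t len,
                                int *is64, int *big_endian)
{
    if (len < EI_NIDENT)
        return IDENT_TOO_SHORT;
    if (memcmp(buf, ELFMAG, SELFMAG) != 0)   /* "\177ELF" */
        return IDENT_BAD_MAGIC;

    switch (buf[EI_CLASS]) {
    case ELFCLASS32: *is64 = 0; break;
    case ELFCLASS64: *is64 = 1; break;
    default:         return IDENT_BAD_CLASS;
    }

    switch (buf[EI_DATA]) {
    case ELFDATA2LSB: *big_endian = 0; break;
    case ELFDATA2MSB: *big_endian = 1; break;
    default:          return IDENT_BAD_DATA;
    }
    return IDENT_OK;
}
```

Returning a distinct status for each failure keeps the "explicit errors" requirement cheap to satisfy: the caller can print exactly which identification byte was rejected.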
Learning milestones:
- Parse ELF header correctly → Understand file identification
- Iterate program headers → Understand runtime memory layout
- Iterate section headers → Understand linking and symbols
- Resolve symbol names → Understand string tables
The Core Question You Are Answering
How does the operating system transform a static file on disk into a running process in memory, and what information does it need from the binary format to make this transformation?
This question drives everything in binary analysis. The ELF format exists to bridge the gap between storage and execution—understanding it means understanding how programs come to life.
Concepts You Must Understand First
1. Binary File Formats vs. In-Memory Representations
A binary file is just structured data on disk. When executed, the OS loader reads this file and creates a completely different structure in memory. Understanding the distinction is critical.
Guiding questions:
- Why can’t the OS just load a file directly into memory and jump to it?
- What transformations must happen between disk and memory?
- How does the loader know where to place code vs data in memory?
Key reading: “Computer Systems: A Programmer’s Perspective” Ch. 7 (Linking), “Practical Binary Analysis” Ch. 2 (The ELF Format)
2. Virtual Memory and Address Spaces
Every process believes it has the entire address space to itself. The ELF file tells the OS where to map segments in this virtual space.
Guiding questions:
- What’s the difference between a file offset and a virtual address?
- Why do ELF files specify both `p_offset` and `p_vaddr`?
- How does the loader handle position-independent executables (PIE)?
Key reading: “Computer Systems: A Programmer’s Perspective” Ch. 9 (Virtual Memory), “Low-Level Programming” Ch. 4 (Virtual Memory)
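
To make the file offset versus virtual address distinction concrete, here is a hedged sketch that translates a virtual address back to a file offset by walking `PT_LOAD` program headers. It assumes `<elf.h>` is available and that the program headers were already validated against the file size; `vaddr_to_offset` is a hypothetical helper name.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Translate a virtual address to a file offset using PT_LOAD entries.
 * Returns 1 and stores the offset in *off on success. Returns 0 if the
 * address is not backed by file bytes: either no segment covers it, or it
 * falls in the zero-filled tail where p_memsz exceeds p_filesz.
 * Assumption: p_offset + p_filesz was already checked for overflow. */
static int vaddr_to_offset(const Elf64_Phdr *ph, size_t phnum,
                           uint64_t vaddr, uint64_t *off)
{
    for (size_t i = 0; i < phnum; i++) {
        if (ph[i].p_type != PT_LOAD)
            continue;
        if (vaddr < ph[i].p_vaddr)
            continue;
        uint64_t delta = vaddr - ph[i].p_vaddr;
        if (delta < ph[i].p_filesz) {
            *off = ph[i].p_offset + delta;
            return 1;
        }
    }
    return 0;
}
```

The same arithmetic run in the other direction (offset to address) is what the loader effectively performs when it maps each segment.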
3. Linking: Static, Dynamic, and Runtime
Programs rarely stand alone—they call library functions. ELF contains metadata for three types of linking.
Guiding questions:
- What’s in `.symtab` vs `.dynsym` and why do we need both?
- How does the dynamic linker find `printf` at runtime?
- What happens during relocation?
Key reading: “Computer Systems: A Programmer’s Perspective” Ch. 7.7-7.10 (Dynamic Linking), “Practical Binary Analysis” Ch. 2.3 (Symbols and Relocations)
4. Sections vs. Segments: A Critical Distinction
Sections are for linking (link time), segments are for loading (run time). This is the most confusing aspect of ELF.
Guiding questions:
- Can multiple sections map to one segment?
- Why does `readelf` show both section headers and program headers?
- Which is more important for reverse engineering: sections or segments?
Key reading: “Practical Binary Analysis” Ch. 2.2.4 (Sections and Segments), man elf (NOTES section)
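
One way to explore the "multiple sections per segment" question is to compute, for each section, which loadable segment contains it. The sketch below is a simplified containment test, assuming `<elf.h>`; `section_to_segment` is a hypothetical name, and real tools such as `readelf` apply additional rules (for example for TLS and zero-size sections) that are not reproduced here.

```c
#include <elf.h>
#include <stddef.h>

/* Return the index of the first PT_LOAD segment whose memory range fully
 * contains the section, or -1 if none does.
 * Only SHF_ALLOC sections occupy memory at run time, so sections such as
 * .symtab or .strtab never match. */
static int section_to_segment(const Elf64_Shdr *sh,
                              const Elf64_Phdr *ph, size_t phnum)
{
    if (!(sh->sh_flags & SHF_ALLOC))
        return -1;

    for (size_t i = 0; i < phnum; i++) {
        if (ph[i].p_type != PT_LOAD)
            continue;
        if (sh->sh_addr < ph[i].p_vaddr)
            continue;
        /* Distance of the section start from the segment start. */
        Elf64_Xword start = sh->sh_addr - ph[i].p_vaddr;
        /* Written as a subtraction to avoid overflow in start + sh_size. */
        if (start < ph[i].p_memsz && sh->sh_size <= ph[i].p_memsz - start)
            return (int)i;
    }
    return -1;
}
```

Running this over every section header and grouping by the returned index gives a rough equivalent of the "Section to Segment mapping" that `readelf -l` prints.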
5. Byte Order (Endianness)
Binary formats encode multi-byte integers. The byte order matters when reading file structures.
Guiding questions:
- How do you detect endianness from the ELF header?
- What happens if you parse a big-endian ELF on a little-endian machine?
- Which fields in Elf64_Ehdr are multi-byte?
Key reading: “Computer Systems: A Programmer’s Perspective” Ch. 2.1 (Information Storage), “Hacking: The Art of Exploitation” Ch. 2 (Programming)
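
A common way to stay independent of the host byte order is to assemble each multi-byte field from individual bytes instead of casting the buffer to a struct. The helpers below are a sketch of that approach; `rd16`, `rd32`, and `rd64` are hypothetical names, and the `big_endian` flag is assumed to come from `e_ident[EI_DATA]`.

```c
#include <stdint.h>

/* Read a 16-bit unsigned value from p in the file's byte order. */
static uint16_t rd16(const unsigned char *p, int big_endian)
{
    return big_endian ? (uint16_t)((p[0] << 8) | p[1])
                      : (uint16_t)((p[1] << 8) | p[0]);
}

/* Read a 32-bit unsigned value, most significant byte first in v. */
static uint32_t rd32(const unsigned char *p, int big_endian)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) {
        int idx = big_endian ? i : 3 - i;
        v = (v << 8) | p[idx];
    }
    return v;
}

/* Read a 64-bit unsigned value using the same scheme. */
static uint64_t rd64(const unsigned char *p, int big_endian)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        int idx = big_endian ? i : 7 - i;
        v = (v << 8) | p[idx];
    }
    return v;
}
```

Because the result never depends on how the host stores integers, the same parser binary can read a big-endian MIPS ELF on an x86-64 machine. Byte-wise reads also sidestep alignment problems when the buffer comes from `mmap()`.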
6. String Tables and Symbol Resolution
Strings in ELF aren’t stored inline—they’re in dedicated string table sections referenced by offset.
Guiding questions:
- Why use offsets into `.strtab` instead of embedding strings?
- How do you find the name of a section?
- What’s the relationship between `.symtab` and `.strtab`?
Key reading: “Practical Binary Analysis” Ch. 2.3.1 (The Symbol Table), man elf (String Table section)
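
Since names are stored as offsets, every lookup is a chance to read out of bounds on a malformed file. The following sketch shows one defensive lookup; `strtab_get` is a hypothetical helper, and `tab`/`size` are assumed to describe a string table section that was already bounds-checked against the file.

```c
#include <stddef.h>
#include <string.h>

/* Return a pointer to the NUL-terminated string at `off` inside a string
 * table of `size` bytes, or NULL if the offset is out of range or the
 * string is not terminated before the end of the table. */
static const char *strtab_get(const char *tab, size_t size, size_t off)
{
    if (tab == NULL || off >= size)
        return NULL;
    if (memchr(tab + off, '\0', size - off) == NULL)
        return NULL;
    return tab + off;
}
```

The same function serves section names (offset `sh_name` into the section name string table) and symbol names (offset `st_name` into the table referenced by the symbol table's `sh_link`).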
7. Position-Independent Code (PIC) and ASLR
Modern systems randomize addresses. ELF supports this through relocations and GOT/PLT.
Guiding questions:
- How can you tell if an ELF is position-independent?
- What’s the difference between `ET_EXEC` and `ET_DYN`?
- Why do some binaries have a base address of 0x400000 and others 0x0?
Key reading: “Practical Binary Analysis” Ch. 5.4 (Position-Independent Code), “Computer Systems: A Programmer’s Perspective” Ch. 7.12 (Position-Independent Code)
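
A parser usually surfaces this distinction when it prints `e_type`. The sketch below maps the common values to display strings in the style of the sample output in this guide; `etype_name` and `is_position_independent` are hypothetical names, and the second function is only a first-pass heuristic.

```c
#include <elf.h>
#include <stdint.h>

/* Map e_type to a human-readable label. */
static const char *etype_name(uint16_t e_type)
{
    switch (e_type) {
    case ET_NONE: return "NONE (No file type)";
    case ET_REL:  return "REL (Relocatable file)";
    case ET_EXEC: return "EXEC (Executable file)";
    case ET_DYN:  return "DYN (Shared object file)";
    case ET_CORE: return "CORE (Core file)";
    default:      return "unknown";
    }
}

/* Heuristic only: ET_DYN covers both shared libraries and PIE executables.
 * Telling the two apart needs more evidence than e_type, for example
 * whether a PT_INTERP program header is present. */
static int is_position_independent(uint16_t e_type)
{
    return e_type == ET_DYN;
}
```

This is why `/bin/ls` in the sample output is reported as `DYN` even though it is an executable: it was built as a PIE.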
Questions to Guide Your Design
- How will you handle both 32-bit and 64-bit ELF files? The structures are different (`Elf32_Ehdr` vs `Elf64_Ehdr`). Will you use compile-time selection or runtime detection?
- What’s your error handling strategy? What if the file claims to have 50 section headers but the file is too small? Corrupted binaries are common in malware analysis.
- How will you deal with endianness? Will you support parsing big-endian ELF files on little-endian hosts?
- Should you use `mmap()` or `read()`? Memory-mapping the file vs reading it into a buffer has different implications for large files.
- How will you represent and display multi-byte values? Should you show `e_machine` as `0x3e` or `EM_X86_64` or `AMD x86-64`?
- What level of validation will you implement? Check magic bytes only, or validate every offset and size field?
- How will you handle stripped binaries? What if `.symtab` is missing but `.dynsym` exists?
- Should your parser be a library or a standalone tool? Consider reusability for future projects.
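
For the error handling and validation questions above, one reusable building block is a range check applied before touching any header table. This is a minimal sketch; `table_in_file` is a hypothetical helper, and the arguments are assumed to come straight from untrusted header fields such as `e_shoff`, `e_shnum`, and `e_shentsize`.

```c
#include <stdint.h>

/* Return 1 if a table of `count` entries of `entsize` bytes starting at
 * `offset` lies entirely within a file of `file_size` bytes, else 0.
 * The comparison is arranged as a division so that neither
 * count * entsize nor offset + size can overflow. */
static int table_in_file(uint64_t offset, uint64_t count,
                         uint64_t entsize, uint64_t file_size)
{
    if (offset > file_size)
        return 0;
    if (count == 0)
        return 1;            /* an empty table is trivially in range */
    if (entsize == 0)
        return 0;            /* entries with no size cannot be parsed */
    return count <= (file_size - offset) / entsize;
}
```

With this check, the "50 section headers in a file that is too small" case becomes an explicit error instead of an out-of-bounds read.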
Thinking Exercise
Before writing any code, perform these manual exercises:
Exercise 1: Hex Dump Analysis
xxd -l 128 /bin/ls
Using only the hex dump and the ELF specification (man elf):
- Identify the magic number
- Determine if it’s 32-bit or 64-bit
- Find the entry point address
- Locate the program header table offset
- Count the number of program headers
Write down the byte offsets and values. This forces you to understand the exact layout.
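
After doing the exercise by hand, the offsets you wrote down can be checked against a small decoder. The sketch below hard-codes the `Elf64_Ehdr` field offsets and handles only little-endian ELF64 headers; the `ehdr64_fields` struct and the `le` and `read_ehdr64_le` functions are hypothetical names for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offsets of selected Elf64_Ehdr fields (ELF64 only). */
enum {
    OFF_E_TYPE    = 0x10,  /* 2 bytes */
    OFF_E_MACHINE = 0x12,  /* 2 bytes */
    OFF_E_ENTRY   = 0x18,  /* 8 bytes */
    OFF_E_PHOFF   = 0x20,  /* 8 bytes */
    OFF_E_SHOFF   = 0x28,  /* 8 bytes */
    OFF_E_PHNUM   = 0x38,  /* 2 bytes */
    OFF_E_SHNUM   = 0x3c,  /* 2 bytes */
    EHDR64_SIZE   = 0x40
};

/* Hypothetical summary of the fields Exercise 1 asks for. */
struct ehdr64_fields {
    uint16_t type, machine, phnum, shnum;
    uint64_t entry, phoff, shoff;
};

/* Read n bytes (1..8) at p as a little-endian unsigned integer. */
static uint64_t le(const unsigned char *p, size_t n)
{
    uint64_t v = 0;
    while (n-- > 0)
        v = (v << 8) | p[n];
    return v;
}

/* Fill `out` from a little-endian ELF64 header image.
 * Returns 0 if the buffer is shorter than the 64-byte header. */
static int read_ehdr64_le(const unsigned char *hdr, size_t len,
                          struct ehdr64_fields *out)
{
    if (len < EHDR64_SIZE)
        return 0;
    out->type    = (uint16_t)le(hdr + OFF_E_TYPE, 2);
    out->machine = (uint16_t)le(hdr + OFF_E_MACHINE, 2);
    out->entry   = le(hdr + OFF_E_ENTRY, 8);
    out->phoff   = le(hdr + OFF_E_PHOFF, 8);
    out->shoff   = le(hdr + OFF_E_SHOFF, 8);
    out->phnum   = (uint16_t)le(hdr + OFF_E_PHNUM, 2);
    out->shnum   = (uint16_t)le(hdr + OFF_E_SHNUM, 2);
    return 1;
}
```

Feeding it the first 64 bytes of the `xxd` dump should reproduce the values you identified manually; any mismatch points at a misread offset or byte order.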
Exercise 2: Compare readelf Output
readelf -h /bin/ls
readelf -l /bin/ls
readelf -S /bin/ls
Create a mapping:
- Which bytes in the hex dump correspond to “Entry point address”?
- How does `readelf` calculate the “Start of section headers”?
- Why is “Number of section headers” sometimes wrong? (Hint: large binaries)
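
The hint in the last question refers to ELF's extended numbering: `e_shnum` is only 16 bits wide, so a file with too many sections stores 0 there and puts the real count in `sh_size` of section header 0. Likewise, when `e_shstrndx` holds `SHN_XINDEX`, the real index is in `sh_link` of section header 0. A sketch of resolving both, assuming `<elf.h>` and hypothetical helper names `real_shnum` and `real_shstrndx`:

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Resolve the real number of section headers.
 * `first` is section header 0, or NULL if it could not be read. */
static uint64_t real_shnum(const Elf64_Ehdr *eh, const Elf64_Shdr *first)
{
    if (eh->e_shnum != 0)
        return eh->e_shnum;
    if (eh->e_shoff == 0 || first == NULL)
        return 0;                  /* no section header table at all */
    return first->sh_size;         /* extended numbering */
}

/* Resolve the real index of the section name string table. */
static uint32_t real_shstrndx(const Elf64_Ehdr *eh, const Elf64_Shdr *first)
{
    if (eh->e_shstrndx != SHN_XINDEX)
        return eh->e_shstrndx;
    return first != NULL ? first->sh_link : SHN_UNDEF;
}
```

The resolved count still comes from untrusted input, so it should be range-checked against the file size before the table is iterated.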
Exercise 3: Trace the String Table
Using readelf -x .strtab /bin/ls, manually:
- Find a symbol name in `.symtab`
- Extract its `st_name` offset
- Navigate to that offset in `.strtab`
- Verify the null-terminated string
This teaches you how indirection works in binary formats.
Exercise 4: Draw the Memory Map
Using readelf -l, draw a diagram showing:
- Which segments get loaded where in virtual memory
- How segments overlap or abut
- Where the `.text` and `.data` sections end up
Virtual Memory:
0x0000000000400000 +------------------+
| LOAD (R+X) | <- .text, .rodata
0x0000000000600000 +------------------+
| LOAD (RW) | <- .data, .bss
0x0000000000601000 +------------------+
The Interview Questions They’ll Ask
- “What’s the difference between a section and a segment in ELF?”
  - Sections are for linking (used by `ld`), segments are for loading (used by `execve`). One segment can contain multiple sections.
- “How does the dynamic linker know which libraries to load?”
  - The `DT_NEEDED` entries in the `.dynamic` section list required libraries. The linker searches paths in `DT_RPATH`, `LD_LIBRARY_PATH`, and default system paths.
- “Can you explain the GOT and PLT?”
  - Global Offset Table (GOT) stores addresses of external symbols. Procedure Linkage Table (PLT) provides lazy binding: a function is resolved only when it is first called.
- “What happens when you execute a PIE binary?”
  - The kernel chooses a random base address (ASLR), loads all `LOAD` segments relative to that base, and passes the resulting addresses to user space through the auxiliary vector (for example `AT_PHDR` and `AT_ENTRY`).
- “How do you find the main() function in a stripped binary?”
  - Even stripped, `_start` is the entry point. Disassemble it: it calls `__libc_start_main` with `main` as an argument. That argument is the address of `main`.
- “What’s the significance of the `.interp` section?”
  - It specifies the path to the dynamic linker (e.g., `/lib64/ld-linux-x86-64.so.2`). Without it, dynamically linked programs can’t run.
- “Explain how relocations work.”
  - Relocations are fixups applied by the linker/loader. They adjust addresses based on where code is actually loaded. `R_X86_64_RELATIVE` writes the load base plus the relocation’s addend into the target field.
- “Why do some binaries have two symbol tables (`.symtab` and `.dynsym`)?”
  - `.dynsym` contains only symbols needed for dynamic linking (kept in release builds). `.symtab` has all symbols (often stripped from release builds).
- “How can you detect if a binary is packed or encrypted?”
  - Look for high entropy in sections (should be code, but looks random), unusual section names, small `.text` sections with large writable sections, or UPX headers.
- “What’s the difference between `ET_EXEC`, `ET_DYN`, and `ET_REL`?”
  - `ET_EXEC`: executable linked at fixed addresses (non-PIE); it may still be dynamically linked. `ET_DYN`: shared object or PIE executable. `ET_REL`: relocatable object file (`.o` files).
Books That Will Help
| Topic | Book | Chapter/Section |
|---|---|---|
| ELF Format Overview | “Practical Binary Analysis” by Dennis Andriesse | Ch. 2: The ELF Format |
| ELF Loading Process | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 7.9: Loading Executable Object Files |
| ELF Headers and Structures | “Low-Level Programming” by Igor Zhirkov | Ch. 12: System Calls, Ch. 13: Models of Computation |
| Symbol Tables | “Computer Systems: A Programmer’s Perspective” | Ch. 7.5: Symbols and Symbol Tables |
| Dynamic Linking | “Computer Systems: A Programmer’s Perspective” | Ch. 7.7-7.12: Dynamic Linking |
| Relocations | “Practical Binary Analysis” | Ch. 2.3.3: Relocations |
| Virtual Memory | “Computer Systems: A Programmer’s Perspective” | Ch. 9: Virtual Memory |
| File I/O in C | “Hacking: The Art of Exploitation” by Jon Erickson | Ch. 2: Programming (File Access section) |
| Binary Data Structures | “Low-Level Programming” by Igor Zhirkov | Ch. 3: Assembly Language, Ch. 4: Virtual Memory |
| GOT/PLT Internals | “Practical Binary Analysis” | Ch. 2.3.4: Dynamic Linking |
| Position-Independent Code | “Computer Systems: A Programmer’s Perspective” | Ch. 7.12: Position-Independent Code (PIC) |
| ASLR and Security | “Hacking: The Art of Exploitation” | Ch. 5: Shellcode (ASLR section) |
| Stripped Binary Analysis | “Practical Malware Analysis” by Sikorski & Honig | Ch. 6: Recognizing C Code Constructs |
| Reference: ELF Specification | man elf (Linux manual) | All sections |
ASCII Diagram: ELF File Structure
+---------------------------+
| ELF Header | <-- Always at offset 0
| e_ident[16] | Contains magic number, class, endianness
| e_type, e_machine | File type and architecture
| e_entry | Entry point virtual address
| e_phoff, e_phnum | --> Points to Program Header Table
| e_shoff, e_shnum | --> Points to Section Header Table
+---------------------------+
| |
| Program Header Table | <-- For loader (runtime)
| [Elf64_Phdr entries] | Describes segments
| - LOAD (code) | e.g., map file offset X to vaddr Y
| - LOAD (data) | with permissions RWX
| - DYNAMIC |
| - INTERP |
+---------------------------+
| |
| .text section | <-- Executable code
| (machine code bytes) |
+---------------------------+
| .rodata section | <-- Read-only data (strings)
| "Hello, world\0" |
+---------------------------+
| .data section | <-- Initialized writable data
| global variables |
+---------------------------+
| .bss section | <-- Uninitialized data (zero-filled)
| (no bytes on disk!) | Only occupies memory at runtime
+---------------------------+
| .symtab section | <-- Symbol table (often stripped)
| [Elf64_Sym entries] | Function/variable names & addresses
+---------------------------+
| .strtab section | <-- String table for .symtab
| "\0printf\0main\0..." | Null-separated strings
+---------------------------+
| .dynsym section | <-- Dynamic symbols (not stripped)
+---------------------------+
| .dynstr section | <-- String table for .dynsym
+---------------------------+
| |
| Section Header Table | <-- For linker (link-time)
| [Elf64_Shdr entries] | Describes sections
| - sh_name, sh_type | Name offset, section type
| - sh_addr, sh_offset | Virtual addr, file offset
| - sh_size, sh_link | Size, link to related section
+---------------------------+
Key insight: Program headers (segments) are what matter at runtime. Section headers are metadata for tools like ld and gdb; the loader does not need them, so a binary whose section header table has been removed still runs fine.
Common Pitfalls and Debugging
Problem 1: “Your interpretation does not match runtime behavior”
- Why: Static analysis can hide runtime-resolved addresses, lazy binding, and input-dependent branches.
- Fix: Reproduce the path with debugger or tracer, then compare static assumptions against live register/memory state.
- Quick test: Run the same sample through both your static workflow and a debugger transcript, and confirm control-flow decisions align.
Problem 2: “Tool output is inconsistent across machines”
- Why: ASLR, tool version drift, and different binary build flags (PIE, RELRO, symbols stripped) change observed addresses and metadata.
- Fix: Pin tool versions, capture `checksec`/metadata, and document environment assumptions in your report.
- Quick test: Re-run analysis in a container or VM with pinned tools and compare hashes of generated outputs.
Problem 3: “Analysis accidentally executes unsafe code”
- Why: Dynamic workflows run binaries in host context without sufficient isolation.
- Fix: Use disposable snapshots, no-network execution, and non-privileged users for all unknown samples.
- Quick test: Validate isolation controls first (network disabled, snapshot active, unprivileged user), then execute sample.
Definition of Done
- Core functionality works on reference inputs
- Edge cases are tested and documented
- Results are reproducible (same binary, same tools, same report output)
- Analysis notes clearly separate observations, assumptions, and conclusions
- Lab safety controls were applied for any dynamic execution
4. Solution Architecture
Input Artifact -> Parse/Decode -> Analysis Engine -> Validation Layer -> Report
Design each stage so intermediate artifacts are inspectable (JSON/text/notes), which makes debugging and peer review much easier.
5. Implementation Phases
Phase 1: Foundation
- Define input assumptions and format checks.
- Produce a minimal golden output on one known sample.
Phase 2: Core Functionality
- Implement full analysis pass for normal cases.
- Add validation against an external ground-truth tool.
Phase 3: Hard Cases and Reporting
- Add malformed/edge-case handling.
- Finalize report template and reproducibility notes.
6. Testing Strategy
- Unit-level checks for parser/decoder helpers.
- Integration checks against known binaries/challenges.
- Regression tests for previously failing cases.
7. Extensions & Challenges
- Add automation for batch analysis and comparative reports.
- Add confidence scoring for each major finding.
- Add export formats suitable for CI/security pipelines.
8. Production Reflection
Map your project output to a production analogue: what reliability, observability, and security controls would be required to run this continuously in an engineering organization?