Project 1: ELF File Parser

Expanded deep-dive guide for Project 1 from the Binary Analysis sprint.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Python, Rust, Go |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 1. The “Resume Gold” |
| Knowledge Area | Binary Formats / File Parsing |
| Software or Tool | ELF binaries, hex editor |
| Main Book | “Practical Binary Analysis” by Dennis Andriesse |

1. Learning Objectives

  1. Build a working implementation with reproducible outputs.
  2. Justify key design choices with binary-analysis principles.
  3. Produce an evidence-backed report of findings and limitations.
  4. Document hardening or next-step improvements.

2. All Theory Needed (Per-Concept Breakdown)

This project depends on concepts from the main sprint primer: loader semantics, control/data-flow recovery, runtime observation, and mitigation-aware vulnerability reasoning. Before implementation, restate the project’s core assumptions in your own words and define how you will validate them.

3. Project Specification

3.1 What You Will Build

A command-line tool that parses ELF files and displays all headers, sections, segments, symbols, and relocations in a human-readable format—like a simplified readelf.

3.2 Functional Requirements

  1. Accept the target binary/input and validate format assumptions.
  2. Produce analyzable outputs (console report and/or artifacts).
  3. Handle malformed inputs safely with explicit errors.

3.3 Non-Functional Requirements

  • Reproducibility: same input should produce equivalent findings.
  • Safety: unknown samples run only in isolated lab contexts.
  • Clarity: separate facts, hypotheses, and inferred conclusions.

3.4 Expanded Project Brief

  • File: P01-elf-file-parser.md

What you’ll build: A command-line tool that parses ELF files and displays all headers, sections, segments, symbols, and relocations in a human-readable format—like a simplified readelf.

Why it teaches binary analysis: Every reverse engineering task starts with understanding the file format. Building a parser forces you to understand every byte of the ELF structure.

Core challenges you’ll face:

  • Parsing the ELF header → maps to understanding magic bytes, class (32/64-bit), endianness
  • Reading program headers → maps to segments, what gets loaded into memory
  • Reading section headers → maps to sections, symbols, strings
  • Handling different architectures → maps to x86, ARM, MIPS variations
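
For the architecture challenge, one low-effort starting point is a lookup table from e_machine to a display name. The sketch below assumes a Linux host where `<elf.h>` provides the EM_* constants. The function name `machine_name` and the display strings are illustrative choices, and the table covers only a handful of values.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: map e_machine to a display name.
 * Only a few common values are listed; extend the table as needed. */
const char *machine_name(uint16_t e_machine)
{
    static const struct { uint16_t id; const char *name; } table[] = {
        { EM_386,     "Intel 80386" },
        { EM_MIPS,    "MIPS" },
        { EM_ARM,     "ARM" },
        { EM_X86_64,  "AMD x86-64" },
        { EM_AARCH64, "AArch64" },
    };

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (table[i].id == e_machine)
            return table[i].name;
    }
    return "Unknown";   /* still print the raw number alongside this */
}
```

Printing the raw value next to the name (for example `0x3e (AMD x86-64)`) keeps the output useful for machine types the table does not know.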

Resources for key challenges:

  • Linux Audit - ELF Binaries - Excellent overview
  • “Practical Binary Analysis” Chapter 2 - Comprehensive ELF explanation
  • man elf - The elf(5) manual page, a compact reference for the ELF structures and field meanings

Key Concepts:

  • ELF Header Structure: “Practical Binary Analysis” Ch. 2 - Andriesse
  • Program vs Section Headers: elf(5) man page
  • Symbol Tables: “Learning ELF” - Can Ozkan (Medium)

  • Difficulty: Intermediate
  • Time estimate: 1-2 weeks
  • Prerequisites: C programming, understanding of pointers and structs, familiarity with hexadecimal

Real World Outcome

Deliverables:

  • Analysis output or tooling scripts
  • Report with control/data flow notes

Validation checklist:

  • Parses sample binaries correctly
  • Findings are reproducible in debugger
  • No unsafe execution outside lab

```bash
$ ./elf_parser /bin/ls
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:   ELF64
  Data:    2's complement, little endian
  Version: 1 (current)
  OS/ABI:  UNIX - System V
  Type:    DYN (Shared object file)
  Machine: AMD x86-64
  Entry:   0x6b10

Program Headers:
  Type    Offset    VirtAddr            FileSiz   MemSiz    Flg
  PHDR    0x000040  0x0000000000000040  0x0002d8  0x0002d8  R
  INTERP  0x000318  0x0000000000000318  0x00001c  0x00001c  R
  LOAD    0x000000  0x0000000000000000  0x003510  0x003510  R
  …

Sections:
  [Nr] Name                Type      Address             Size
  [ 0]                     NULL      0x0000000000000000  0x0
  [ 1] .interp             PROGBITS  0x0000000000000318  0x1c
  [ 2] .note.gnu.build-id  NOTE      0x0000000000000338  0x24
  …

Symbols:
  Num:  Value             Size  Type  Bind    Name
    1:  0000000000000000     0  FUNC  GLOBAL  printf@GLIBC_2.2.5
    2:  0000000000006b10   123  FUNC  GLOBAL  main
  …
```

#### Hints in Layers
Start by mapping the ELF header structure:

```c
// Don't write code, but understand this structure:
// Elf64_Ehdr contains:
//   e_ident[16]  - Magic number and other info
//   e_type       - Object file type (ET_EXEC, ET_DYN, etc.)
//   e_machine    - Architecture (EM_X86_64, EM_ARM, etc.)
//   e_entry      - Entry point virtual address
//   e_phoff      - Program header table file offset
//   e_shoff      - Section header table file offset
//   e_phnum      - Number of program headers
//   e_shnum      - Number of section headers
```

Questions to guide your implementation:

  1. How do you detect if a file is 32-bit or 64-bit ELF?
  2. How do you find the string table section to get section names?
  3. What’s the difference between .dynsym and .symtab?
  4. How do program headers map sections to memory segments?
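
Question 1 can be answered from the first 16 bytes of the file. A minimal sketch, assuming those bytes are already in a buffer: `elf_bits` and the enum names are hypothetical, while the byte position (EI_CLASS = 4) and the values (ELFCLASS32 = 1, ELFCLASS64 = 2) come from the e_ident layout described in man elf.

```c
#include <stddef.h>
#include <string.h>

/* Position and values defined by the ELF format for e_ident[]. */
enum { IDX_CLASS = 4 };                 /* EI_CLASS */
enum { CLASS_32 = 1, CLASS_64 = 2 };    /* ELFCLASS32, ELFCLASS64 */

/* Hypothetical helper: return 32 or 64 for a valid ELF identification,
 * or -1 if the buffer is too short, the magic is wrong, or the class
 * byte is not one of the two defined values. */
int elf_bits(const unsigned char *buf, size_t len)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    if (buf == NULL || len < 16)        /* e_ident is 16 bytes long */
        return -1;
    if (memcmp(buf, magic, sizeof magic) != 0)
        return -1;

    switch (buf[IDX_CLASS]) {
    case CLASS_32: return 32;
    case CLASS_64: return 64;
    default:       return -1;
    }
}
```

Deciding the class at runtime like this lets one binary of your parser handle both Elf32_Ehdr and Elf64_Ehdr inputs.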

Learning milestones:

  1. Parse ELF header correctly → Understand file identification
  2. Iterate program headers → Understand runtime memory layout
  3. Iterate section headers → Understand linking and symbols
  4. Resolve symbol names → Understand string tables

The Core Question You Are Answering

How does the operating system transform a static file on disk into a running process in memory, and what information does it need from the binary format to make this transformation?

This question drives everything in binary analysis. The ELF format exists to bridge the gap between storage and execution—understanding it means understanding how programs come to life.

Concepts You Must Understand First

1. Binary File Formats vs. In-Memory Representations

A binary file is just structured data on disk. When executed, the OS loader reads this file and creates a completely different structure in memory. Understanding the distinction is critical.

Guiding questions:

  • Why can’t the OS just load a file directly into memory and jump to it?
  • What transformations must happen between disk and memory?
  • How does the loader know where to place code vs data in memory?

Key reading: “Computer Systems: A Programmer’s Perspective” Ch. 7 (Linking), “Practical Binary Analysis” Ch. 2 (The ELF Format)

2. Virtual Memory and Address Spaces

Every process believes it has the entire address space to itself. The ELF file tells the OS where to map segments in this virtual space.

Guiding questions:

  • What’s the difference between a file offset and a virtual address?
  • Why do ELF files specify both p_offset and p_vaddr?
  • How does the loader handle position-independent executables (PIE)?

Key reading: “Computer Systems: A Programmer’s Perspective” Ch. 9 (Virtual Memory), “Low-Level Programming” Ch. 4 (Virtual Memory)
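
The relationship between p_offset and p_vaddr can be made concrete with an address translation routine. This is a sketch under the assumption that the program header table has already been read into an array of Elf64_Phdr (from `<elf.h>` on Linux). `vaddr_to_offset` is a hypothetical name.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: translate a virtual address to a file offset
 * using the PT_LOAD entries of an already-parsed program header table.
 * Returns 0 on success, or -1 if no segment backs the address with
 * file bytes (for example the zero-filled tail where p_memsz exceeds
 * p_filesz). */
int vaddr_to_offset(const Elf64_Phdr *phdrs, size_t count,
                    uint64_t vaddr, uint64_t *offset_out)
{
    for (size_t i = 0; i < count; i++) {
        const Elf64_Phdr *ph = &phdrs[i];

        if (ph->p_type != PT_LOAD)
            continue;
        /* Subtract before comparing so the range check cannot overflow. */
        if (vaddr >= ph->p_vaddr && vaddr - ph->p_vaddr < ph->p_filesz) {
            *offset_out = ph->p_offset + (vaddr - ph->p_vaddr);
            return 0;
        }
    }
    return -1;
}
```

A parser that reads the returned offset should still check it against the real file size, since p_offset comes from untrusted input.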

3. Linking: Static, Dynamic, and Runtime

Programs rarely stand alone—they call library functions. ELF contains metadata for three types of linking.

Guiding questions:

  • What’s in .symtab vs .dynsym and why do we need both?
  • How does the dynamic linker find printf at runtime?
  • What happens during relocation?

Key reading: “Computer Systems: A Programmer’s Perspective” Ch. 7.7-7.10 (Dynamic Linking), “Practical Binary Analysis” Ch. 2.3 (Symbols and Relocations)
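
For a parser, "what happens during relocation" starts with decoding each entry. The sketch below assumes a Linux `<elf.h>` and 64-bit RELA-style entries. `struct rela_view`, `decode_rela`, and `x86_64_reloc_name` are hypothetical names, and only four x86-64 relocation types are named.

```c
#include <elf.h>
#include <stdint.h>

/* Decoded view of one Elf64_Rela entry (hypothetical display struct). */
struct rela_view {
    uint64_t offset;     /* where the fixup is applied */
    uint32_t type;       /* for example R_X86_64_RELATIVE */
    uint32_t sym_index;  /* index into the associated symbol table */
    int64_t  addend;
};

/* r_info packs the symbol index (high 32 bits) and type (low 32 bits). */
struct rela_view decode_rela(const Elf64_Rela *r)
{
    struct rela_view v;

    v.offset    = r->r_offset;
    v.type      = (uint32_t)ELF64_R_TYPE(r->r_info);
    v.sym_index = (uint32_t)ELF64_R_SYM(r->r_info);
    v.addend    = r->r_addend;
    return v;
}

/* Names for a few x86-64 relocation types; everything else is "other". */
const char *x86_64_reloc_name(uint32_t type)
{
    switch (type) {
    case R_X86_64_64:        return "R_X86_64_64";
    case R_X86_64_GLOB_DAT:  return "R_X86_64_GLOB_DAT";
    case R_X86_64_JUMP_SLOT: return "R_X86_64_JUMP_SLOT";
    case R_X86_64_RELATIVE:  return "R_X86_64_RELATIVE";
    default:                 return "other";
    }
}
```

The symbol index refers to the table named by the relocation section's sh_link field, which is usually .dynsym for .rela.dyn and .rela.plt.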

4. Sections vs. Segments: A Critical Distinction

Sections are for linking (link time); segments are for loading (runtime). This distinction is the most confusing aspect of ELF.

Guiding questions:

  • Can multiple sections map to one segment?
  • Why does readelf show both section headers and program headers?
  • Which is more important for reverse engineering: sections or segments?

Key reading: “Practical Binary Analysis” Ch. 2.2.4 (Sections and Segments), man elf (NOTES section)
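
One way to explore the first guiding question is to test, for every section and segment pair, whether the section lies inside the segment. The check below is a simplified sketch and not readelf's exact rule, which handles more special cases. It assumes Linux `<elf.h>` types, and `section_in_segment` is a hypothetical name.

```c
#include <elf.h>
#include <stdbool.h>

/* Hypothetical helper: does the section lie inside the segment?
 * Sections with file contents are compared by file offset. SHT_NOBITS
 * sections (.bss) occupy no file bytes, so they are compared by
 * virtual address against p_memsz instead. */
bool section_in_segment(const Elf64_Shdr *sh, const Elf64_Phdr *ph)
{
    if (sh->sh_type == SHT_NOBITS) {
        return (sh->sh_flags & SHF_ALLOC) &&
               sh->sh_addr >= ph->p_vaddr &&
               sh->sh_addr - ph->p_vaddr <= ph->p_memsz &&
               sh->sh_size <= ph->p_memsz - (sh->sh_addr - ph->p_vaddr);
    }
    return sh->sh_offset >= ph->p_offset &&
           sh->sh_offset - ph->p_offset <= ph->p_filesz &&
           sh->sh_size <= ph->p_filesz - (sh->sh_offset - ph->p_offset);
}
```

Running this over all pairs reproduces the idea behind the "Section to Segment mapping" table that readelf -l prints, which shows several sections sharing one LOAD segment.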

5. Byte Order (Endianness)

Binary formats encode multi-byte integers. The byte order matters when reading file structures.

Guiding questions:

  • How do you detect endianness from the ELF header?
  • What happens if you parse a big-endian ELF on a little-endian machine?
  • Which fields in Elf64_Ehdr are multi-byte?

Key reading: “Computer Systems: A Programmer’s Perspective” Ch. 2.1 (Information Storage), “Hacking: The Art of Exploitation” Ch. 2 (Programming)
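
A common way to stay independent of the host's byte order is to assemble every multi-byte field one byte at a time. The sketch below assumes the caller passes the value of e_ident[EI_DATA] (byte 5 of the file: 1 for little-endian, 2 for big-endian). The function and enum names are hypothetical.

```c
#include <stdint.h>

/* Values of e_ident[EI_DATA] defined by the ELF format. */
enum { DATA_LSB = 1, DATA_MSB = 2 };   /* ELFDATA2LSB, ELFDATA2MSB */

/* Hypothetical helpers: build integers byte by byte so the result does
 * not depend on the byte order of the machine running the parser.
 * Any encoding other than DATA_MSB is treated as little-endian. */
uint16_t read_u16(const unsigned char *p, int data_encoding)
{
    if (data_encoding == DATA_MSB)
        return (uint16_t)((p[0] << 8) | p[1]);
    return (uint16_t)((p[1] << 8) | p[0]);
}

uint32_t read_u32(const unsigned char *p, int data_encoding)
{
    uint32_t v = 0;

    for (int i = 0; i < 4; i++) {
        int idx = (data_encoding == DATA_MSB) ? i : 3 - i;
        v = (v << 8) | p[idx];      /* most significant byte first */
    }
    return v;
}

uint64_t read_u64(const unsigned char *p, int data_encoding)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        int idx = (data_encoding == DATA_MSB) ? i : 7 - i;
        v = (v << 8) | p[idx];
    }
    return v;
}
```

Casting the file buffer straight to Elf64_Ehdr only works when the file and the host agree on byte order, and helpers like these remove that assumption.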

6. String Tables and Symbol Resolution

Strings in ELF aren’t stored inline—they’re in dedicated string table sections referenced by offset.

Guiding questions:

  • Why use offsets into .strtab instead of embedding strings?
  • How do you find the name of a section?
  • What’s the relationship between .symtab and .strtab?

Key reading: “Practical Binary Analysis” Ch. 2.3.1 (The Symbol Table), man elf (String Table section)
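
Because a name is only an offset into a string table, every lookup is a chance to read out of bounds on a malformed file. A minimal sketch of a checked lookup, assuming the table's bytes and size have already been located through its section header (`strtab_get` is a hypothetical name):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the string at `offset` inside a string
 * table of `size` bytes, or NULL if the offset is out of range or the
 * string is not NUL-terminated within the table. */
const char *strtab_get(const char *strtab, size_t size, size_t offset)
{
    if (strtab == NULL || offset >= size)
        return NULL;
    if (memchr(strtab + offset, '\0', size - offset) == NULL)
        return NULL;                /* would run past the table's end */
    return strtab + offset;
}
```

The same function serves section names (sh_name into the table selected by e_shstrndx) and symbol names (st_name into .strtab or .dynstr).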

7. Position-Independent Code (PIC) and ASLR

Modern systems randomize addresses. ELF supports this through relocations and GOT/PLT.

Guiding questions:

  • How can you tell if an ELF is position-independent?
  • What’s the difference between ET_EXEC and ET_DYN?
  • Why do some binaries have a base address of 0x400000 and others 0x0?

Key reading: “Practical Binary Analysis” Ch. 5.4 (Position-Independent Code), “Computer Systems: A Programmer’s Perspective” Ch. 7.12 (Position-Independent Code)

Questions to Guide Your Design

  1. How will you handle both 32-bit and 64-bit ELF files? The structures are different (Elf32_Ehdr vs Elf64_Ehdr). Will you use compile-time selection or runtime detection?

  2. What’s your error handling strategy? What if the file claims to have 50 section headers but the file is too small? Corrupted binaries are common in malware analysis.

  3. How will you deal with endianness? Will you support parsing big-endian ELF files on little-endian hosts?

  4. Should you use mmap() or read()? Memory-mapping the file vs reading it into a buffer has different implications for large files.

  5. How will you represent and display multi-byte values? Should you show e_machine as 0x3e or EM_X86_64 or AMD x86-64?

  6. What level of validation will you implement? Check magic bytes only, or validate every offset and size field?

  7. How will you handle stripped binaries? What if .symtab is missing but .dynsym exists?

  8. Should your parser be a library or a standalone tool? Consider reusability for future projects.
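
For the error-handling question (a file that claims 50 section headers but is too small), one possible building block is a single bounds check applied before touching any table. This is a sketch, not a complete validation layer, and `table_fits` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: check that a table of `count` entries of
 * `entsize` bytes starting at `offset` lies entirely inside a file of
 * `file_size` bytes. Division is used instead of multiplication so
 * hostile values cannot overflow. */
bool table_fits(uint64_t offset, uint64_t entsize, uint64_t count,
                uint64_t file_size)
{
    if (offset > file_size)
        return false;
    if (count == 0)
        return true;                /* an empty table always fits */
    if (entsize == 0)
        return false;               /* entries must have a size */
    /* Equivalent to count * entsize <= file_size - offset. */
    return count <= (file_size - offset) / entsize;
}
```

Calling it with (e_shoff, e_shentsize, e_shnum, file size) and (e_phoff, e_phentsize, e_phnum, file size) right after parsing the ELF header turns a potential crash into an explicit error message.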

Thinking Exercise

Before writing any code, perform these manual exercises:

Exercise 1: Hex Dump Analysis

xxd -l 128 /bin/ls

Using only the hex dump and the elf(5) manual page (man elf):

  1. Identify the magic number
  2. Determine if it’s 32-bit or 64-bit
  3. Find the entry point address
  4. Locate the program header table offset
  5. Count the number of program headers

Write down the byte offsets and values. This forces you to understand the exact layout.
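
After doing the exercise by hand, one way to check the offsets you wrote down is to ask the compiler where each field sits. The sketch assumes a Linux host with `<elf.h>`. `SHOW_FIELD` and `print_ehdr64_offsets` are hypothetical names.

```c
#include <elf.h>
#include <stddef.h>
#include <stdio.h>

/* Print the offset and size of one Elf64_Ehdr field. */
#define SHOW_FIELD(field)                                   \
    printf("%-8s offset %2zu  size %zu\n", #field,          \
           offsetof(Elf64_Ehdr, field),                     \
           sizeof(((Elf64_Ehdr *)0)->field))

/* Hypothetical helper: list the fields Exercise 1 asks about. */
void print_ehdr64_offsets(void)
{
    SHOW_FIELD(e_ident);    /* magic, class, data encoding */
    SHOW_FIELD(e_entry);    /* entry point address */
    SHOW_FIELD(e_phoff);    /* program header table offset */
    SHOW_FIELD(e_phnum);    /* number of program headers */
}
```

For a 64-bit ELF header the fields are e_ident at offset 0, e_entry at 24, e_phoff at 32, and e_phnum at 56, so those are the positions to read in the xxd output.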

Exercise 2: Compare readelf Output

readelf -h /bin/ls
readelf -l /bin/ls
readelf -S /bin/ls

Create a mapping:

  • Which bytes in the hex dump correspond to “Entry point address”?
  • How does readelf calculate the “Start of section headers”?
  • Why is “Number of section headers” sometimes wrong? (Hint: files with 0xff00 or more sections; see the e_shnum description in man elf)
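
The last question has a concrete answer in man elf: when a file has 0xff00 (SHN_LORESERVE) or more sections, e_shnum is stored as zero and the real count is placed in the sh_size field of section header 0. A sketch of how a parser might account for that, assuming Linux `<elf.h>` types and a hypothetical helper name:

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the real number of section headers.
 * `first` points to section header 0, or is NULL if that entry could
 * not be read. */
uint64_t section_count(const Elf64_Ehdr *eh, const Elf64_Shdr *first)
{
    if (eh->e_shnum != 0)
        return eh->e_shnum;         /* the common case */
    if (eh->e_shoff == 0 || first == NULL)
        return 0;                   /* no section header table at all */
    return first->sh_size;          /* extended count lives in entry 0 */
}
```

The returned value is still untrusted input, so it should go through the same file-size bounds check as any other table length.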

Exercise 3: Trace the String Table

Using readelf -x .strtab /bin/ls (or .dynstr with .dynsym if your /bin/ls is stripped and has no .symtab), manually:

  1. Find a symbol name in .symtab
  2. Extract its st_name offset
  3. Navigate to that offset in .strtab
  4. Verify the null-terminated string

This teaches you how indirection works in binary formats.

Exercise 4: Draw the Memory Map

Using readelf -l, draw a diagram showing:

  • Which segments get loaded where in virtual memory
  • How segments overlap or abut
  • Where the .text and .data sections end up

Virtual Memory:
0x0000000000400000  +------------------+
                    | LOAD (R+X)       |  <- .text, .rodata
0x0000000000600000  +------------------+
                    | LOAD (RW)        |  <- .data, .bss
0x0000000000601000  +------------------+
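
The permission labels in a map like this come from each segment's p_flags field. A small sketch of turning those bits into a readable label, assuming the PF_R, PF_W, and PF_X constants from a Linux `<elf.h>` (`format_pflags` is a hypothetical name):

```c
#include <elf.h>
#include <stdint.h>

/* Hypothetical helper: render p_flags as a three-character string such
 * as "R E" or "RW ", similar in spirit to readelf's Flg column.
 * `out` must have room for at least 4 bytes. */
void format_pflags(uint32_t p_flags, char out[4])
{
    out[0] = (p_flags & PF_R) ? 'R' : ' ';
    out[1] = (p_flags & PF_W) ? 'W' : ' ';
    out[2] = (p_flags & PF_X) ? 'E' : ' ';
    out[3] = '\0';
}
```

A segment that comes out as both writable and executable is worth flagging in your report, since most compiled binaries keep those permissions separate.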

The Interview Questions They’ll Ask

  1. “What’s the difference between a section and a segment in ELF?”
    • Sections are for linking (used by ld), segments are for loading (used by execve). One segment can contain multiple sections.
  2. “How does the dynamic linker know which libraries to load?”
    • The DT_NEEDED entries in the .dynamic section list required libraries. The dynamic linker searches DT_RPATH (only when DT_RUNPATH is absent), LD_LIBRARY_PATH, DT_RUNPATH, the ld.so cache, and then the default system paths.
  3. “Can you explain the GOT and PLT?”
    • Global Offset Table (GOT) stores addresses of external symbols. Procedure Linkage Table (PLT) provides lazy binding—only resolves functions when first called.
  4. “What happens when you execute a PIE binary?”
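    • The kernel chooses a random base address (ASLR) and maps all LOAD segments relative to that base. It then passes the resulting addresses to user space through the auxiliary vector: AT_PHDR and AT_ENTRY describe the program itself, and AT_BASE holds the dynamic linker’s base address.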
  5. “How do you find the main() function in a stripped binary?”
    • Even stripped, _start is the entry point. Disassemble it—it calls __libc_start_main with main as an argument. That argument is the address of main.
  6. “What’s the significance of the .interp section?”
    • It specifies the path to the dynamic linker (e.g., /lib64/ld-linux-x86-64.so.2). Without it, dynamically linked programs can’t run.
  7. “Explain how relocations work.”
    • Relocations are fixups applied by the linker/loader. They adjust addresses based on where code is actually loaded. For example, R_X86_64_RELATIVE stores the load base plus the entry’s addend at the target location.
  8. “Why do some binaries have two symbol tables (.symtab and .dynsym)?”
    • .dynsym contains only symbols needed for dynamic linking (kept in release builds). .symtab has all symbols (often stripped from release builds).
  9. “How can you detect if a binary is packed or encrypted?”
    • Look for high entropy in sections (should be code, but looks random), unusual section names, small .text sections with large writable sections, or UPX headers.
  10. “What’s the difference between ET_EXEC, ET_DYN, and ET_REL?”
    • ET_EXEC: executable linked at fixed addresses (non-PIE; it may be statically or dynamically linked). ET_DYN: shared object or PIE executable. ET_REL: relocatable object file (.o files).

Books That Will Help

| Topic | Book | Chapter/Section |
|---|---|---|
| ELF Format Overview | “Practical Binary Analysis” by Dennis Andriesse | Ch. 2: The ELF Format |
| ELF Loading Process | “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron | Ch. 7.9: Loading Executable Object Files |
| ELF Headers and Structures | “Low-Level Programming” by Igor Zhirkov | Ch. 12: System Calls, Ch. 13: Models of Computation |
| Symbol Tables | “Computer Systems: A Programmer’s Perspective” | Ch. 7.5: Symbols and Symbol Tables |
| Dynamic Linking | “Computer Systems: A Programmer’s Perspective” | Ch. 7.7-7.12: Dynamic Linking |
| Relocations | “Practical Binary Analysis” | Ch. 2.3.3: Relocations |
| Virtual Memory | “Computer Systems: A Programmer’s Perspective” | Ch. 9: Virtual Memory |
| File I/O in C | “Hacking: The Art of Exploitation” by Jon Erickson | Ch. 2: Programming (File Access section) |
| Binary Data Structures | “Low-Level Programming” by Igor Zhirkov | Ch. 3: Assembly Language, Ch. 4: Virtual Memory |
| GOT/PLT Internals | “Practical Binary Analysis” | Ch. 2.3.4: Dynamic Linking |
| Position-Independent Code | “Computer Systems: A Programmer’s Perspective” | Ch. 7.12: Position-Independent Code (PIC) |
| ASLR and Security | “Hacking: The Art of Exploitation” | Ch. 5: Shellcode (ASLR section) |
| Stripped Binary Analysis | “Practical Malware Analysis” by Sikorski & Honig | Ch. 6: Recognizing C Code Constructs |
| Reference: ELF Specification | man elf (Linux manual) | All sections |

ASCII Diagram: ELF File Structure

+---------------------------+
|      ELF Header           |  <-- Always at offset 0
|  e_ident[16]              |      Contains magic number, class, endianness
|  e_type, e_machine        |      File type and architecture
|  e_entry                  |      Entry point virtual address
|  e_phoff, e_phnum         |  --> Points to Program Header Table
|  e_shoff, e_shnum         |  --> Points to Section Header Table
+---------------------------+
|                           |
|   Program Header Table    |  <-- For loader (runtime)
|   [Elf64_Phdr entries]    |      Describes segments
|   - LOAD (code)           |      e.g., map file offset X to vaddr Y
|   - LOAD (data)           |           with permissions RWX
|   - DYNAMIC               |
|   - INTERP                |
+---------------------------+
|                           |
|   .text section           |  <-- Executable code
|   (machine code bytes)    |
+---------------------------+
|   .rodata section         |  <-- Read-only data (strings)
|   "Hello, world\0"        |
+---------------------------+
|   .data section           |  <-- Initialized writable data
|   global variables        |
+---------------------------+
|   .bss section            |  <-- Uninitialized data (zero-filled)
|   (no bytes on disk!)     |      Only occupies memory at runtime
+---------------------------+
|   .symtab section         |  <-- Symbol table (often stripped)
|   [Elf64_Sym entries]     |      Function/variable names & addresses
+---------------------------+
|   .strtab section         |  <-- String table for .symtab
|   "\0printf\0main\0..."   |      Null-separated strings
+---------------------------+
|   .dynsym section         |  <-- Dynamic symbols (not stripped)
+---------------------------+
|   .dynstr section         |  <-- String table for .dynsym
+---------------------------+
|                           |
|   Section Header Table    |  <-- For linker (link-time)
|   [Elf64_Shdr entries]    |      Describes sections
|   - sh_name, sh_type      |      Name offset, section type
|   - sh_addr, sh_offset    |      Virtual addr, file offset
|   - sh_size, sh_link      |      Size, link to related section
+---------------------------+

Key insight: Program headers (segments) are what matter at runtime. Section headers are metadata for tools like ld and gdb. A binary whose section header table has been removed still runs fine, so a robust parser cannot depend on sections being present.


Common Pitfalls and Debugging

Problem 1: “Your interpretation does not match runtime behavior”

  • Why: Static analysis can hide runtime-resolved addresses, lazy binding, and input-dependent branches.
  • Fix: Reproduce the path with debugger or tracer, then compare static assumptions against live register/memory state.
  • Quick test: Run the same sample through both your static workflow and a debugger transcript, and confirm control-flow decisions align.

Problem 2: “Tool output is inconsistent across machines”

  • Why: ASLR, tool version drift, and different binary build flags (PIE, RELRO, symbols stripped) change observed addresses and metadata.
  • Fix: Pin tool versions, capture checksec/metadata, and document environment assumptions in your report.
  • Quick test: Re-run analysis in a container or VM with pinned tools and compare hashes of generated outputs.

Problem 3: “Analysis accidentally executes unsafe code”

  • Why: Dynamic workflows run binaries in host context without sufficient isolation.
  • Fix: Use disposable snapshots, no-network execution, and non-privileged users for all unknown samples.
  • Quick test: Validate isolation controls first (network disabled, snapshot active, unprivileged user), then execute sample.

Definition of Done

  • Core functionality works on reference inputs
  • Edge cases are tested and documented
  • Results are reproducible (same binary, same tools, same report output)
  • Analysis notes clearly separate observations, assumptions, and conclusions
  • Lab safety controls were applied for any dynamic execution

4. Solution Architecture

Input Artifact -> Parse/Decode -> Analysis Engine -> Validation Layer -> Report

Design each stage so intermediate artifacts are inspectable (JSON/text/notes), which makes debugging and peer review much easier.

5. Implementation Phases

Phase 1: Foundation

  • Define input assumptions and format checks.
  • Produce a minimal golden output on one known sample.

Phase 2: Core Functionality

  • Implement full analysis pass for normal cases.
  • Add validation against an external ground-truth tool.

Phase 3: Hard Cases and Reporting

  • Add malformed/edge-case handling.
  • Finalize report template and reproducibility notes.

6. Testing Strategy

  • Unit-level checks for parser/decoder helpers.
  • Integration checks against known binaries/challenges.
  • Regression tests for previously failing cases.

7. Extensions & Challenges

  • Add automation for batch analysis and comparative reports.
  • Add confidence scoring for each major finding.
  • Add export formats suitable for CI/security pipelines.

8. Production Reflection

Map your project output to a production analogue: what reliability, observability, and security controls would be required to run this continuously in an engineering organization?