Project 2: PE File Parser

Expanded deep-dive guide for Project 2 from the Binary Analysis sprint.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Python, Rust |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 1. The “Resume Gold” |
| Knowledge Area | Binary Formats / Windows Executables |
| Software or Tool | PE files, Windows or Wine |
| Main Book | “Practical Malware Analysis” by Sikorski & Honig |

1. Learning Objectives

  1. Build a working implementation with reproducible outputs.
  2. Justify key design choices with binary-analysis principles.
  3. Produce an evidence-backed report of findings and limitations.
  4. Document hardening or next-step improvements.

2. All Theory Needed (Per-Concept Breakdown)

This project depends on concepts from the main sprint primer: loader semantics, control/data-flow recovery, runtime observation, and mitigation-aware vulnerability reasoning. Before implementation, restate the project’s core assumptions in your own words and define how you will validate them.

3. Project Specification

3.1 What You Will Build

A PE file parser that extracts headers, sections, imports, exports, and resources from Windows executables.

3.2 Functional Requirements

  1. Accept the target binary/input and validate format assumptions.
  2. Produce analyzable outputs (console report and/or artifacts).
  3. Handle malformed inputs safely with explicit errors.
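
One way to satisfy requirement 3 is to route every read through a single bounds-checked helper. The sketch below uses Python (one of the alternative languages listed for this project) and only the standard `struct` module; the names `PEFormatError` and `read_struct` are illustrative assumptions, not part of any library.

```python
import struct


class PEFormatError(ValueError):
    """Raised when the input violates a format assumption (hypothetical name)."""


def read_struct(data: bytes, fmt: str, offset: int, what: str) -> tuple:
    """Unpack `fmt` at `offset`, failing with an explicit error if out of range."""
    size = struct.calcsize(fmt)
    if offset < 0 or offset + size > len(data):
        raise PEFormatError(
            f"{what}: need {size} bytes at offset {offset:#x}, "
            f"but the file is only {len(data):#x} bytes long"
        )
    return struct.unpack_from(fmt, data, offset)
```

Naming the field in the error message (`what`) makes a truncated or deliberately malformed sample easy to diagnose from the report alone.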

3.3 Non-Functional Requirements

  • Reproducibility: same input should produce equivalent findings.
  • Safety: unknown samples run only in isolated lab contexts.
  • Clarity: separate facts, hypotheses, and inferred conclusions.

3.4 Expanded Project Brief

  • File: P02-pe-file-parser.md


What you’ll build: A PE file parser that extracts headers, sections, imports, exports, and resources from Windows executables.

Why it teaches binary analysis: Windows malware analysis requires understanding the PE format, and a large share of real-world analysis targets are Windows binaries.

Core challenges you’ll face:

  • DOS header and stub → maps to legacy compatibility
  • COFF and Optional headers → maps to PE32 vs PE32+
  • Import Address Table (IAT) → maps to dynamic linking, API calls
  • Export directory → maps to DLL functions

Resources for key challenges:

Key Concepts:

  • PE Structure: “Practical Malware Analysis” Ch. 1
  • Import Table: PE Format specification
  • Resources: CFF Explorer documentation

  • Difficulty: Intermediate
  • Time estimate: 1-2 weeks
  • Prerequisites: Project 1 (ELF Parser), understanding of Windows APIs

#### Real World Outcome

Deliverables:

  • Analysis output or tooling scripts
  • Report with control/data flow notes

Validation checklist:

  • Parses sample binaries correctly
  • Findings are reproducible in debugger
  • No unsafe execution outside lab

```bash
$ ./pe_parser suspicious.exe
DOS Header:
  Magic: MZ (0x5a4d)
  PE Offset: 0x100

PE Header:
  Signature: PE (0x4550)
  Machine: x64 (0x8664)
  Sections: 5
  Timestamp: 2024-01-15 14:32:01

Optional Header:
  Magic: PE32+ (0x20b)
  Entry Point: 0x1400012a0
  Image Base: 0x140000000

Sections:
  Name     VirtAddr  VirtSize  RawSize  Flags
  .text    0x1000    0x5a00    0x5c00   CODE,EXECUTE,READ
  .rdata   0x7000    0x1e00    0x2000   READ
  .data    0x9000    0x400     0x200    READ,WRITE

Imports:
  KERNEL32.dll:
    - CreateFileA
    - ReadFile
    - WriteFile
    - VirtualAlloc  ← Suspicious!
  WS2_32.dll:
    - socket  ← Network activity!
    - connect
    - send
    - recv
```


#### Hints in Layers
The PE format has a layered structure. Parse it step by step:
1. Read DOS header at offset 0
2. Follow `e_lfanew` to find PE signature
3. Parse COFF header immediately after signature
4. Parse Optional Header (size varies by PE32 vs PE32+)
5. Parse section headers after Optional Header
6. Use Data Directories to find imports, exports, resources
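
Steps 1 to 3 fit in a few lines. The following is a minimal sketch in Python using only the standard `struct` module, not a complete parser; the function name `parse_headers` and the dictionary keys are assumptions made for this example.

```python
import struct


def parse_headers(data: bytes) -> dict:
    """Parse the DOS header, PE signature, and COFF header from raw bytes."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew is a 4-byte little-endian value at offset 0x3C.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if e_lfanew + 24 > len(data):
        raise ValueError("e_lfanew points outside the file")
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 20-byte COFF header follows the 4-byte signature.
    (machine, n_sections, timestamp, _symtab, _nsyms,
     opt_size, characteristics) = struct.unpack_from("<HHIIIHH", data, e_lfanew + 4)
    return {
        "e_lfanew": e_lfanew,
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "size_of_optional_header": opt_size,
        "characteristics": characteristics,
        "optional_header_offset": e_lfanew + 24,
        # Section headers start right after the Optional Header.
        "section_table_offset": e_lfanew + 24 + opt_size,
    }
```

Every offset is checked before it is used, because `e_lfanew` comes from the file and cannot be trusted.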

Key questions:
- What does `IMAGE_DIRECTORY_ENTRY_IMPORT` point to?
- How are imported function names resolved (hint: thunks)?
- What's the difference between RVA and file offset?

**Learning milestones**:
1. **Parse headers correctly** → Understand PE structure
2. **Extract imports** → See what APIs the program uses
3. **Extract exports** → Understand DLLs
4. **Handle both PE32 and PE32+** → Support all Windows binaries

---

#### The Core Question You Are Answering

**How does Windows organize executable code, manage dynamic linking differently from Unix, and why has this format become the primary target for malware authors worldwide?**

The PE format reveals Windows' architectural philosophy: backward compatibility at all costs, rich metadata for tools, and a structure that has evolved from MS-DOS through to modern 64-bit Windows. Understanding PE is understanding the Windows ecosystem.

#### Concepts You Must Understand First

**1. The DOS Legacy and Stub Programs**

Every PE file begins with an MS-DOS executable. This seems bizarre until you understand Windows' commitment to backward compatibility.

*Guiding questions:*
- Why does a Windows 11 executable start with "MZ" from 1981?
- What happens if you run a PE file in pure DOS?
- How does the DOS stub hand off to the real PE code?

*Key reading:* "Practical Malware Analysis" Ch. 1.2 (Portable Executable File Format), "Practical Binary Analysis" Ch. 2.4 (The PE Format)

**2. Relative Virtual Addresses (RVAs) vs. File Offsets**

Unlike ELF which uses both, PE heavily relies on RVAs. Almost every pointer in PE is an RVA, not a raw file offset.

*Guiding questions:*
- What is an RVA relative to? (Hint: ImageBase)
- How do you convert an RVA to a file offset?
- Why does malware often modify ImageBase?

*Key reading:* "Practical Malware Analysis" Ch. 1.2 (The PE File Structure), Microsoft PE/COFF Specification Section 3 (COFF File Header)
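
The RVA-to-file-offset conversion can be sketched as a lookup over the section headers. This is a hedged Python illustration: the `Section` tuple and both function names are hypothetical, and RVAs that fall inside the headers (before the first section) are deliberately not handled.

```python
from typing import NamedTuple, Optional, Sequence


class Section(NamedTuple):
    name: str
    virtual_address: int   # RVA where the section is mapped
    virtual_size: int      # size in memory
    raw_pointer: int       # PointerToRawData (file offset)
    raw_size: int          # SizeOfRawData (bytes present in the file)


def va_to_rva(va: int, image_base: int) -> int:
    """An RVA is the virtual address minus ImageBase."""
    return va - image_base


def rva_to_offset(rva: int, sections: Sequence[Section]) -> Optional[int]:
    """Return the file offset backing `rva`, or None if no file bytes back it."""
    for sec in sections:
        delta = rva - sec.virtual_address
        if 0 <= delta < sec.raw_size:
            return sec.raw_pointer + delta
    return None
```

An RVA can be valid in memory yet have no file offset, for example in the zero-filled tail of `.data` where `VirtualSize` exceeds `SizeOfRawData`; returning `None` keeps that case explicit.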

**3. Import Address Table (IAT) and Dynamic Linking**

Windows programs discover API functions differently than Unix. The IAT is the gateway to understanding what a program can do.

*Guiding questions:*
- What's the difference between the Import Name Table and the Import Address Table?
- How does the Windows loader populate the IAT at load time?
- Why do malware analysts always check the IAT first?

*Key reading:* "Practical Malware Analysis" Ch. 1.2.5 (The .idata Section), "Practical Binary Analysis" Ch. 2.4.5 (Import Directory)
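
Both the Import Name Table and the on-disk Import Address Table are arrays of thunks terminated by a zero entry. The sketch below shows how one thunk array might be decoded in Python; `iter_thunks` is an illustrative name, and the caller is assumed to have already converted the array's RVA to a file offset.

```python
import struct
from typing import Iterator, Tuple

ORDINAL_FLAG32 = 0x80000000
ORDINAL_FLAG64 = 0x8000000000000000


def iter_thunks(data: bytes, offset: int, is_pe32_plus: bool) -> Iterator[Tuple[str, int]]:
    """Yield ("ordinal", n) or ("name_rva", rva) until the zero terminator."""
    fmt, flag = ("<Q", ORDINAL_FLAG64) if is_pe32_plus else ("<I", ORDINAL_FLAG32)
    step = struct.calcsize(fmt)
    while offset + step <= len(data):
        (value,) = struct.unpack_from(fmt, data, offset)
        if value == 0:            # an all-zero thunk ends the array
            return
        if value & flag:          # high bit set: import by ordinal (low 16 bits)
            yield ("ordinal", value & 0xFFFF)
        else:                     # otherwise: 31-bit RVA of a Hint/Name entry
            yield ("name_rva", value & 0x7FFFFFFF)
        offset += step
```

The thunk width (4 bytes for PE32, 8 bytes for PE32+) is the main place where import parsing differs between the two formats.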

**4. Sections vs. Segments (Windows Style)**

Windows doesn't call them segments—everything is sections. But sections have two alignments: on disk and in memory.

*Guiding questions:*
- What is `SectionAlignment` vs `FileAlignment`?
- Why is `.text` section often larger in memory than on disk?
- How does section padding affect packing detection?

*Key reading:* Microsoft PE/COFF Specification Section 4 (Section Table), "Practical Malware Analysis" Ch. 18.1 (Packers and Unpacking)
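
The two alignments come down to rounding up to a multiple. A small Python sketch follows; the helper names are illustrative, and the alignment values in the usage note are the common defaults rather than anything read from a real file.

```python
def align_up(value: int, alignment: int) -> int:
    """Round `value` up to the next multiple of `alignment`."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (value + alignment - 1) // alignment * alignment


def next_section_rva(virtual_address: int, virtual_size: int,
                     section_alignment: int) -> int:
    """RVA at which the following section is expected to start."""
    return virtual_address + align_up(virtual_size, section_alignment)
```

With a `SectionAlignment` of 0x1000, a `.text` section at RVA 0x1000 with `VirtualSize` 0x5a00 occupies 0x6000 bytes of address space, so the next section is expected at 0x7000. That matches the `.rdata` address in the sample output earlier in this guide.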

**5. PE32 vs. PE32+ (32-bit vs. 64-bit)**

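Unlike ELF, which defines separate Elf32 and Elf64 structure families, PE reuses most structures for both variants. The main differences are the Optional Header layout and the width of a few address-sized fields (import thunks, for example).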

*Guiding questions:*
- How do you detect PE32 vs. PE32+? (Hint: Magic field)
- What fields change between PE32 and PE32+?
- Can a 64-bit Windows process load 32-bit DLLs?

*Key reading:* Microsoft PE/COFF Specification Section 3.4 (Optional Header), "Practical Binary Analysis" Ch. 2.4.3 (PE Optional Header)
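
Detecting the variant and reading the fields that move can be sketched as follows. This Python example is a simplified illustration covering only `Magic`, `AddressOfEntryPoint`, and `ImageBase`; the function name and result keys are assumptions.

```python
import struct

PE32_MAGIC = 0x10B
PE32_PLUS_MAGIC = 0x20B


def parse_optional_basics(opt: bytes) -> dict:
    """Read Magic, AddressOfEntryPoint, and ImageBase from an Optional Header."""
    if len(opt) < 32:
        raise ValueError("optional header too short")
    (magic,) = struct.unpack_from("<H", opt, 0)
    (entry_rva,) = struct.unpack_from("<I", opt, 16)
    if magic == PE32_MAGIC:
        # PE32: BaseOfData sits at offset 24, then a 4-byte ImageBase at 28.
        (image_base,) = struct.unpack_from("<I", opt, 28)
        kind = "PE32"
    elif magic == PE32_PLUS_MAGIC:
        # PE32+: no BaseOfData; ImageBase is 8 bytes at offset 24.
        (image_base,) = struct.unpack_from("<Q", opt, 24)
        kind = "PE32+"
    else:
        raise ValueError(f"unknown optional header magic {magic:#x}")
    return {
        "kind": kind,
        "entry_rva": entry_rva,
        "image_base": image_base,
        "entry_va": image_base + entry_rva,
    }
```

Keeping both `entry_rva` and `entry_va` in the result avoids ambiguity in reports, since `AddressOfEntryPoint` is stored as an RVA.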

**6. Export Directory and DLL Internals**

DLLs are PE files that export functions. Understanding exports is key to understanding how Windows APIs work.

*Guiding questions:*
- How are exported functions named vs. numbered (ordinals)?
- What's the export forwarding chain?
- Why do some DLLs export thousands of functions?

*Key reading:* "Practical Malware Analysis" Ch. 1.2.6 (The .edata Section), Microsoft PE/COFF Specification Section 6.3 (Export Directory Table)
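
Export forwarding is detected by a range check: if a function's RVA falls inside the export directory itself, it points at a forwarder string rather than code. A minimal Python sketch, with hypothetical helper names:

```python
from typing import Tuple


def is_forwarder(func_rva: int, export_dir_rva: int, export_dir_size: int) -> bool:
    """True if the export's RVA lies inside the export directory range."""
    return export_dir_rva <= func_rva < export_dir_rva + export_dir_size


def split_forwarder(target: str) -> Tuple[str, str]:
    """Split a forwarder such as "NTDLL.RtlAllocateHeap" into (module, symbol)."""
    module, sep, symbol = target.rpartition(".")
    if not sep or not module or not symbol:
        raise ValueError(f"malformed forwarder string: {target!r}")
    return module, symbol
```

The symbol part may also be an ordinal reference written as `#27`, which this sketch returns unchanged for the caller to interpret.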

**7. PE Resources and the .rsrc Section**

Unlike ELF, PE files contain a rich resource tree: icons, dialogs, version info, and sometimes malware payloads.

*Guiding questions:*
- How is the resource tree structured?
- What is a resource ID vs. a resource name?
- Why do analysts check `.rsrc` for embedded executables?

*Key reading:* "Practical Malware Analysis" Ch. 1.2.7 (PE File Headers and Sections), Microsoft PE/COFF Specification Section 6.9 (Resource Format)
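
Each level of the resource tree is a list of 8-byte directory entries whose high bits say how to read them. The Python sketch below decodes a single entry; the function name, result keys, and the small type table are illustrative, and offsets are relative to the start of the resource data.

```python
import struct

# A few well-known resource type IDs (not exhaustive).
RESOURCE_TYPES = {3: "RT_ICON", 5: "RT_DIALOG", 10: "RT_RCDATA",
                  16: "RT_VERSION", 24: "RT_MANIFEST"}
HIGH_BIT = 0x80000000


def decode_resource_entry(rsrc: bytes, offset: int) -> dict:
    """Decode one IMAGE_RESOURCE_DIRECTORY_ENTRY at `offset` in the .rsrc data."""
    name_field, data_field = struct.unpack_from("<II", rsrc, offset)
    named = bool(name_field & HIGH_BIT)
    return {
        # High bit set: low 31 bits are the offset of a name string.
        # High bit clear: low 16 bits are a numeric ID.
        "named": named,
        "id_or_name_offset": (name_field & 0x7FFFFFFF) if named else (name_field & 0xFFFF),
        # High bit set: target is another directory (go one level deeper).
        # High bit clear: target is a data entry (a leaf).
        "is_directory": bool(data_field & HIGH_BIT),
        "target_offset": data_field & 0x7FFFFFFF,
    }
```

A recursive walker built on this should cap its depth and track visited offsets, since a malformed tree can point back at itself.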

**8. Data Directories and the Optional Header**

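The PE Optional Header ends with an array of data directory entries, each an RVA/size pair pointing to a critical structure. There are conventionally 16 entries, but the actual count is given by `NumberOfRvaAndSizes`.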

*Guiding questions:*
- What are the most important data directories for malware analysis?
- How does `IMAGE_DIRECTORY_ENTRY_IMPORT` relate to the IAT?
- What does `IMAGE_DIRECTORY_ENTRY_SECURITY` contain?

*Key reading:* "Practical Binary Analysis" Ch. 2.4.4 (Data Directories), Microsoft PE/COFF Specification Section 3.4.4 (Optional Header Data Directories)
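
Reading the data directory array might look like the sketch below. It is a Python illustration under the assumption that `opt` holds the raw Optional Header bytes; `parse_data_directories` and `DIRECTORY_NAMES` are names invented for this example.

```python
import struct
from typing import List, Tuple

DIRECTORY_NAMES = {0: "EXPORT", 1: "IMPORT", 2: "RESOURCE", 3: "EXCEPTION",
                   4: "SECURITY", 5: "BASERELOC", 9: "TLS", 12: "IAT",
                   14: "COM_DESCRIPTOR"}


def parse_data_directories(opt: bytes) -> List[Tuple[int, int]]:
    """Return (address, size) pairs from the Optional Header's data directories."""
    (magic,) = struct.unpack_from("<H", opt, 0)
    if magic == 0x10B:        # PE32: NumberOfRvaAndSizes at offset 92
        count_offset = 92
    elif magic == 0x20B:      # PE32+: NumberOfRvaAndSizes at offset 108
        count_offset = 108
    else:
        raise ValueError(f"unknown optional header magic {magic:#x}")
    if len(opt) < count_offset + 4:
        raise ValueError("optional header too short for NumberOfRvaAndSizes")
    (count,) = struct.unpack_from("<I", opt, count_offset)
    table = count_offset + 4
    # Never trust the declared count: clamp to 16 and to the bytes present.
    count = min(count, 16, (len(opt) - table) // 8)
    return [struct.unpack_from("<II", opt, table + 8 * i) for i in range(count)]
```

Note that the address in entry 4 (the security directory) is a file offset rather than an RVA, so it must not be passed through RVA conversion.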

#### Questions to Guide Your Design

1. **How will you handle RVA-to-file-offset conversion?** You'll need this constantly. Should you pre-build a lookup table from section headers?

2. **Will you parse imports by name or by ordinal?** Some DLLs export by ordinal only. Your parser needs to handle both.

3. **How deeply will you parse the resource tree?** Resources can be nested multiple levels. Will you recurse fully or just show top-level?

4. **What validation will you perform?** PE files from malware are often malformed intentionally to break tools.

5. **How will you display suspicious indicators?** Highlight imports like `VirtualAlloc`, `WriteProcessMemory`, or unusual section names?

6. **Will you support bound imports?** Bound imports pre-cache IAT addresses for performance. Most modern Windows versions ignore them.

7. **How will you handle exports in executables?** EXEs can export functions (rare but legal). Will you check for this?

8. **Should you calculate entropy per section?** High entropy suggests packing or encryption—a key malware indicator.
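
For question 8, per-section entropy is the Shannon entropy of the section's raw bytes, measured in bits per byte (0.0 to 8.0). A minimal Python sketch, where `shannon_entropy` is an illustrative name:

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of `data` in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    entropy = 0.0
    for count in Counter(data).values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy
```

Feed it the `SizeOfRawData` bytes starting at `PointerToRawData` for each section. Any cutoff used to call a section "packed" is a heuristic to tune against known samples, not a fixed rule.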

#### Thinking Exercise

**Before writing code, perform these manual exercises:**

**Exercise 1: Manual PE Parsing**
```bash
xxd -l 512 /path/to/some.exe  # or use a Windows PE file
```

Using a hex editor and the PE specification:

  1. Find the “MZ” signature at offset 0
  2. Navigate to offset 0x3C and read the 4-byte value (e_lfanew)
  3. Jump to that offset and verify “PE\0\0” signature
  4. Parse the COFF header: Machine type, Number of sections, Timestamp
  5. Calculate where the Section Table begins

Write down each calculation. This cements the layered structure.
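
To check the manual calculation from step 5, the arithmetic and the 40-byte section header layout can be written down as a short Python sketch. The names are illustrative; the example values in the usage note assume `e_lfanew` = 0xE0 and a `SizeOfOptionalHeader` of 0xF0, which is the usual size for PE32+.

```python
import struct
from typing import List

# Name[8], VirtualSize, VirtualAddress, SizeOfRawData, PointerToRawData,
# PointerToRelocations, PointerToLinenumbers, NumberOfRelocations,
# NumberOfLinenumbers, Characteristics  (40 bytes total)
SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")


def section_table_offset(e_lfanew: int, size_of_optional_header: int) -> int:
    """4-byte signature + 20-byte COFF header + Optional Header."""
    return e_lfanew + 4 + 20 + size_of_optional_header


def parse_section_headers(data: bytes, offset: int, count: int) -> List[dict]:
    sections = []
    for i in range(count):
        start = offset + i * SECTION_HEADER.size
        if start + SECTION_HEADER.size > len(data):
            raise ValueError(f"section header {i} is truncated")
        (name, vsize, vaddr, raw_size, raw_ptr,
         _relocs, _lines, _nrelocs, _nlines, flags) = SECTION_HEADER.unpack_from(data, start)
        sections.append({
            "name": name.rstrip(b"\0").decode("ascii", "replace"),
            "virtual_size": vsize,
            "virtual_address": vaddr,
            "raw_size": raw_size,
            "raw_pointer": raw_ptr,
            "characteristics": flags,
        })
    return sections
```

With those assumed values the section table starts at 0xE0 + 4 + 20 + 0xF0 = 0x1E8.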

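**Exercise 2: Trace an Import**

Using a tool like CFF Explorer or pefile (Python):

```python
import pefile  # third-party package: pip install pefile

pe = pefile.PE('suspicious.exe')
# DIRECTORY_ENTRY_IMPORT is absent when the file has no import table.
for entry in getattr(pe, 'DIRECTORY_ENTRY_IMPORT', []):
    print(entry.dll.decode())
    for imp in entry.imports:
        # Imports by ordinal have no name.
        print(f'  {imp.name.decode() if imp.name else f"Ordinal {imp.ordinal}"}')
```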

Pick one import (e.g., CreateFileA from KERNEL32.dll):

  1. Find its entry in the Import Directory
  2. Locate the Import Name Table entry
  3. Find the corresponding Import Address Table entry
  4. Understand how the loader will patch this at runtime

**Exercise 3: Section Alignment Analysis**

For a sample PE file:

  1. Note FileAlignment and SectionAlignment from Optional Header
  2. For each section, calculate:
    • VirtualAddress (where it loads in memory)
    • VirtualSize (size in memory)
    • PointerToRawData (offset in file)
    • SizeOfRawData (size in file)
  3. Identify any discrepancies—common in packed malware
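
Step 3 can be partly automated. The Python sketch below flags a few discrepancies that are often associated with packers; the specific checks are illustrative heuristics chosen for this example, not an authoritative list, and `section_anomalies` is a hypothetical name.

```python
from typing import List

IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_WRITE = 0x80000000


def section_anomalies(virtual_size: int, raw_size: int, characteristics: int,
                      file_alignment: int = 0x200) -> List[str]:
    """Return human-readable notes about unusual section properties."""
    findings = []
    if raw_size == 0 and virtual_size > 0:
        findings.append("no bytes on disk but occupies memory")
    if raw_size % file_alignment != 0:
        findings.append("SizeOfRawData is not a multiple of FileAlignment")
    if characteristics & IMAGE_SCN_MEM_EXECUTE and characteristics & IMAGE_SCN_MEM_WRITE:
        findings.append("section is both writable and executable")
    return findings
```

Treat each finding as an observation to investigate rather than a verdict. A `.data` section whose `VirtualSize` exceeds `SizeOfRawData` is normal, which is why that case is not flagged here.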

**Exercise 4: Resource Tree Exploration**

```bash
# On Linux, use wrestool from icoutils
wrestool -l sample.exe             # list all resources
wrestool -x --output=. sample.exe  # extract all resources into the current directory
```

Explore the .rsrc section:

  1. How many resource types exist? (RT_ICON, RT_DIALOG, etc.)
  2. Are there any unusual resource names?
  3. Check for embedded PEs or suspicious binary blobs

Draw the resource tree structure manually.

#### The Interview Questions They’ll Ask

  1. “What’s the difference between the DOS header and the PE header?”
    • The DOS header (MZ header) is at offset 0 for DOS compatibility. Its e_lfanew field points to the real PE header (PE\0\0 signature). The PE header contains the COFF header and Optional Header.
  2. “How do you convert an RVA to a file offset?”
    • Find which section contains the RVA by checking section VirtualAddress ranges. Then: FileOffset = RVA - SectionVirtualAddress + SectionPointerToRawData.
  3. “Explain how the Windows loader populates the IAT.”
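    • The loader walks the Import Directory and, for each listed DLL, maps the module (internally doing the equivalent of LoadLibrary), resolves each imported function by name or ordinal against that DLL’s export table (the equivalent of GetProcAddress), and writes the resolved addresses into the Import Address Table.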
  4. “What are some red flags in a PE file that suggest malware?”
    • Unusual section names (.aspack, UPX0/UPX1), high-entropy sections, mismatched timestamps, imports of process injection APIs (CreateRemoteThread, VirtualAllocEx), a tiny .text section with a huge .data section, resources larger than the code sections.
  5. “What’s the difference between IMAGE_FILE_EXECUTABLE_IMAGE and IMAGE_FILE_DLL?”
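    • Both are flags in the COFF Characteristics field. IMAGE_FILE_EXECUTABLE_IMAGE means the image is valid and can be run (it has no unresolved external references). IMAGE_FILE_DLL marks the file as a DLL, which cannot be run directly and must be loaded into another process. A DLL normally has both flags set.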
  6. “How does ASLR work in Windows PE files?”
    • If DllCharacteristics includes IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE, the OS can relocate the image to a random base address. The .reloc section contains fixup information for this.
  7. “What is ordinal importing and why is it used?”
    • Instead of importing by name (“CreateFileA”), you import by number (ordinal 1234). It’s smaller and slightly faster but breaks if DLL versions change. Often used in system DLLs.
  8. “What’s in the .reloc section?”
    • Base relocation entries. If the PE can’t load at its preferred ImageBase, the loader uses these entries to fix up all absolute addresses. Required for DLLs and ASLR-enabled EXEs.
  9. “How can you tell if a PE is packed?”
    • Check section names (UPX, ASPack, etc.), compare SizeOfImage to raw file size, calculate entropy (packed sections have high entropy ~7.5-8.0), look for abnormal entry point (EP in last section or writable section).
  10. “What’s the significance of the TLS (Thread Local Storage) directory?”
    • TLS callbacks execute before the main entry point—earlier than AddressOfEntryPoint. Malware uses TLS callbacks for anti-debugging and to run code before analysts expect.
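
Several of these answers depend on reading flag bits (questions 5 and 6 in particular). A small Python sketch of decoding them follows; the tables list only a subset of the defined flags, and `decode_flags` is an illustrative name.

```python
from typing import Dict, List

# Subset of COFF Characteristics flags.
FILE_CHARACTERISTICS = {
    0x0002: "EXECUTABLE_IMAGE",
    0x0020: "LARGE_ADDRESS_AWARE",
    0x2000: "DLL",
}

# Subset of Optional Header DllCharacteristics flags.
DLL_CHARACTERISTICS = {
    0x0020: "HIGH_ENTROPY_VA",
    0x0040: "DYNAMIC_BASE",   # image may be relocated (ASLR)
    0x0100: "NX_COMPAT",      # compatible with DEP
    0x4000: "GUARD_CF",
}


def decode_flags(value: int, table: Dict[int, str]) -> List[str]:
    """Return the names of all known flag bits set in `value`, lowest bit first."""
    return [name for bit, name in sorted(table.items()) if value & bit]
```

For example, a Characteristics value of 0x2022 decodes to EXECUTABLE_IMAGE, LARGE_ADDRESS_AWARE, and DLL, which shows that the first and last are not mutually exclusive.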

#### Books That Will Help

| Topic | Book | Chapter/Section |
|---|---|---|
| PE Format Overview | “Practical Malware Analysis” by Sikorski & Honig | Ch. 1: Basic Static Techniques (PE File Format) |
| PE File Structure | “Practical Binary Analysis” by Dennis Andriesse | Ch. 2.4: The PE Format - A Comparison with ELF |
| Import/Export Tables | “Practical Malware Analysis” | Ch. 1.2.5-1.2.6: Imports and Exports |
| PE Headers Deep Dive | “Practical Binary Analysis” | Ch. 2.4.3-2.4.5: PE Headers and Data Directories |
| Dynamic Linking on Windows | “Practical Malware Analysis” | Ch. 7: Analyzing Malicious Windows Programs |
| Resource Sections | “Practical Malware Analysis” | Ch. 1.2.7: PE Sections (Resources) |
| Packers and Obfuscation | “Practical Malware Analysis” | Ch. 18: Packers and Unpacking |
| RVA and Address Calculations | Microsoft PE/COFF Specification | Section 4: Section Table (online) |
| Windows Internals (Loading) | “Windows Internals” by Russinovich et al. | Part 1, Ch. 3: System Mechanisms (Image Loader) |
| Malware Analysis Techniques | “Practical Malware Analysis” | Ch. 3: Basic Dynamic Analysis |
| File Format Reversing | “Hacking: The Art of Exploitation” by Erickson | Ch. 4: Exploitation (Binary Formats section) |
| Section Characteristics | Microsoft PE/COFF Specification | Section 4.1: Section Flags (online) |
| TLS Callbacks | “Practical Malware Analysis” | Ch. 16: Anti-Debugging (TLS Callbacks) |
| Reference: PE/COFF Spec | Microsoft PE/COFF Specification | All sections (online documentation) |

#### ASCII Diagram: PE File Structure

+--------------------------------+
|        DOS Header (MZ)         |  <-- Offset 0x00
|  e_magic = "MZ" (0x5A4D)       |      DOS compatibility stub
|  ...                           |
|  e_lfanew = 0x000000E0         |  --> Points to PE Header
+--------------------------------+
|                                |
|        DOS Stub Program        |      "This program cannot be run in DOS mode"
|  (can be modified/enlarged)    |
+--------------------------------+
|                                |
|   PE Signature (PE\0\0)        |  <-- Offset e_lfanew (e.g., 0xE0)
|   0x00004550                   |
+--------------------------------+
|        COFF Header             |
|  Machine (0x8664 = x64)        |
|  NumberOfSections              |
|  TimeDateStamp                 |
|  SizeOfOptionalHeader          |
|  Characteristics               |
+--------------------------------+
|      Optional Header           |
|  Magic (0x10B=PE32/0x20B=PE32+)|
|  AddressOfEntryPoint           |  --> Where execution begins (RVA)
|  ImageBase                     |      Preferred load address
|  SectionAlignment              |      Alignment in memory (usually 0x1000)
|  FileAlignment                 |      Alignment on disk (usually 0x200)
|  SizeOfImage                   |      Total size when loaded
|  SizeOfHeaders                 |
|  Subsystem (GUI/Console)       |
|  DllCharacteristics            |      ASLR, DEP flags, etc.
|  NumberOfRvaAndSizes           |      Usually 16
|                                |
|    Data Directories [16]       |
|    [0] Export Table            |
|    [1] Import Table            |  --> Critical for analysis
|    [2] Resource Table          |
|    [3] Exception Table         |
|    [5] Base Relocation         |
|    [9] TLS Table               |      TLS callbacks
|    [12] IAT                    |      Import Address Table
|    [14] COM Descriptor         |      .NET assemblies
+--------------------------------+
|                                |
|      Section Table             |      NumberOfSections entries
|   [Section Header for .text]   |
|     Name = ".text"             |
|     VirtualSize                |
|     VirtualAddress (RVA)       |      Where it loads in memory
|     SizeOfRawData              |
|     PointerToRawData           |      Offset in file
|     Characteristics (RX)       |      Readable + Executable
|                                |
|   [Section Header for .rdata]  |
|   [Section Header for .data]   |
|   [Section Header for .rsrc]   |
|   [Section Header for .reloc]  |
+--------------------------------+
|                                |
|     .text Section              |  <-- Executable code
|   (machine code bytes)         |
|                                |
+--------------------------------+
|     .rdata Section             |  <-- Read-only data
|   Import Directory             |      Import Name Table (INT)
|   Import Address Table (IAT)   |      Function pointers (patched by loader)
|   String literals              |
+--------------------------------+
|     .data Section              |  <-- Initialized writable data
|   Global variables             |
+--------------------------------+
|     .rsrc Section              |  <-- Resources (icons, strings, dialogs)
|   Resource Directory Tree      |
|   Icons, Version Info          |
+--------------------------------+
|     .reloc Section             |  <-- Base relocations for ASLR
|   Relocation blocks            |
+--------------------------------+

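Key Insight: The Import Address Table (IAT) is one of the first things to analyze. It lists the API functions the program links against at load time, which gives a behavioral fingerprint. In malware, suspicious import combinations such as VirtualAllocEx + WriteProcessMemory + CreateRemoteThread suggest process injection. APIs resolved at runtime through LoadLibrary and GetProcAddress never appear in the import table, so an unusually sparse import list can itself be a warning sign.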
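
A parser can surface such combinations automatically. The Python sketch below is a deliberately small heuristic: the API sets are illustrative examples (the injection set uses VirtualAllocEx, the cross-process allocation call), and `flag_imports` is a hypothetical helper that takes the parsed imports as a mapping from DLL name to function names.

```python
from typing import Dict, Iterable, List

INJECTION_COMBO = {"VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread"}
NETWORK_APIS = {"socket", "connect", "send", "recv"}


def flag_imports(imports: Dict[str, Iterable[str]]) -> List[str]:
    """Return notes about import combinations worth a closer look."""
    names = {name for funcs in imports.values() for name in funcs}
    findings = []
    # Only flag injection when the whole combination is present.
    if INJECTION_COMBO <= names:
        findings.append("process injection combo: " + ", ".join(sorted(INJECTION_COMBO)))
    network = sorted(NETWORK_APIS & names)
    if network:
        findings.append("network APIs: " + ", ".join(network))
    return findings
```

Report these as hypotheses rather than conclusions: legitimate software imports the same functions, and the list says nothing about APIs resolved at runtime.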

#### Common Pitfalls and Debugging

Problem 1: “Your interpretation does not match runtime behavior”

  • Why: Static analysis can hide runtime-resolved addresses, lazy binding, and input-dependent branches.
  • Fix: Reproduce the path with debugger or tracer, then compare static assumptions against live register/memory state.
  • Quick test: Run the same sample through both your static workflow and a debugger transcript, and confirm control-flow decisions align.

Problem 2: “Tool output is inconsistent across machines”

  • Why: ASLR, tool version drift, and different build settings (for PE files: the DYNAMIC_BASE and NX_COMPAT bits in DllCharacteristics, stripped debug information) change observed addresses and metadata.
  • Fix: Pin tool versions, capture the relevant header metadata, and document environment assumptions in your report.
  • Quick test: Re-run analysis in a container or VM with pinned tools and compare hashes of generated outputs.

Problem 3: “Analysis accidentally executes unsafe code”

  • Why: Dynamic workflows run binaries in host context without sufficient isolation.
  • Fix: Use disposable snapshots, no-network execution, and non-privileged users for all unknown samples.
  • Quick test: Validate isolation controls first (network disabled, snapshot active, unprivileged user), then execute sample.

#### Definition of Done

  • Core functionality works on reference inputs
  • Edge cases are tested and documented
  • Results are reproducible (same binary, same tools, same report output)
  • Analysis notes clearly separate observations, assumptions, and conclusions
  • Lab safety controls were applied for any dynamic execution

4. Solution Architecture

Input Artifact -> Parse/Decode -> Analysis Engine -> Validation Layer -> Report

Design each stage so intermediate artifacts are inspectable (JSON/text/notes), which makes debugging and peer review much easier.

5. Implementation Phases

Phase 1: Foundation

  • Define input assumptions and format checks.
  • Produce a minimal golden output on one known sample.

Phase 2: Core Functionality

  • Implement full analysis pass for normal cases.
  • Add validation against an external ground-truth tool.

Phase 3: Hard Cases and Reporting

  • Add malformed/edge-case handling.
  • Finalize report template and reproducibility notes.

6. Testing Strategy

  • Unit-level checks for parser/decoder helpers.
  • Integration checks against known binaries/challenges.
  • Regression tests for previously failing cases.

7. Extensions & Challenges

  • Add automation for batch analysis and comparative reports.
  • Add confidence scoring for each major finding.
  • Add export formats suitable for CI/security pipelines.

8. Production Reflection

Map your project output to a production analogue: what reliability, observability, and security controls would be required to run this continuously in an engineering organization?