Project 7: The Minidump Parser

Build a parser for the minidump format used by Google’s Breakpad and Crashpad, a widely adopted format for cross-platform crash reporting.

Quick Reference

Attribute	Value
Difficulty	Advanced
Time Estimate	2-3 weeks
Language	Python
Prerequisites	Project 1, strong binary parsing skills
Key Topics	Minidump format, Breakpad/Crashpad, binary stream parsing

1. Learning Objectives

By completing this project, you will:

Understand the minidump file format used by major browsers and applications
Learn to parse complex binary structures with nested stream directories
Implement a practical tool that extracts crash information from minidumps
Understand why minidumps are preferred over full core dumps in production
Gain experience with Python’s struct module for binary parsing
Learn the Breakpad/Crashpad ecosystem used by Google, Mozilla, and Microsoft

2. Theoretical Foundation

2.1 Core Concepts

What is a Minidump?

A minidump is a smaller, more portable crash dump format originally developed by Microsoft for Windows. Google’s Breakpad project extended it for cross-platform use, making it the standard for:

Chrome, Firefox, Edge browsers
Electron applications
Sentry, Crashlytics, and other crash reporting services

┌─────────────────────────────────────────────────────────────┐
│              FULL CORE DUMP vs MINIDUMP                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Full Core Dump                    Minidump                  │
│  ──────────────────────           ──────────────────────     │
│  Size: 500MB - 10GB               Size: 50KB - 5MB           │
│  Contains ALL memory              Contains crash essentials  │
│  Platform-specific (ELF)          Cross-platform             │
│  Local analysis only              Upload-friendly            │
│  Complete but heavy               Optimized for triage       │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Minidump Structure

┌────────────────────────────────────────────────────────────────┐
│                     MINIDUMP FILE STRUCTURE                     │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  HEADER (32 bytes)                                      │   │
│  │  ─────────────────                                      │   │
│  │  Signature: "MDMP" (0x504D444D)                        │   │
│  │  Version: 0xa793                                        │   │
│  │  NumberOfStreams: N                                     │   │
│  │  StreamDirectoryRva: offset to directory                │   │
│  │  Checksum, TimeDateStamp, Flags                         │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  STREAM DIRECTORY (N × 12 bytes)                        │   │
│  │  ─────────────────                                      │   │
│  │  Entry 1: [StreamType, DataSize, Rva]                   │   │
│  │  Entry 2: [StreamType, DataSize, Rva]                   │   │
│  │  Entry 3: [StreamType, DataSize, Rva]                   │   │
│  │  ...                                                     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  STREAM DATA (variable size)                            │   │
│  │  ─────────────────                                      │   │
│  │                                                         │   │
│  │  ThreadListStream:                                      │   │
│  │    Thread 1: [id, stack, context]                       │   │
│  │    Thread 2: [id, stack, context]                       │   │
│  │                                                         │   │
│  │  ModuleListStream:                                      │   │
│  │    Module 1: [name, base_addr, size, version]          │   │
│  │    Module 2: [name, base_addr, size, version]          │   │
│  │                                                         │   │
│  │  SystemInfoStream:                                      │   │
│  │    [OS, CPU arch, version, etc.]                        │   │
│  │                                                         │   │
│  │  ExceptionStream:                                       │   │
│  │    [Exception code, address, thread]                    │   │
│  │                                                         │   │
│  │  MemoryListStream:                                      │   │
│  │    [Selected memory regions]                            │   │
│  │                                                         │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└────────────────────────────────────────────────────────────────┘

Stream Types

Common stream types you’ll encounter:

Stream Type	ID	Contents
ThreadListStream	3	All threads with stack and registers
ModuleListStream	4	Loaded modules/libraries
MemoryListStream	5	Selected memory regions
ExceptionStream	6	Exception/crash information
SystemInfoStream	7	OS and CPU information
ThreadExListStream	8	Extended thread info
Memory64ListStream	9	64-bit memory regions
LinuxCpuInfo	0x47670003	Linux /proc/cpuinfo
LinuxProcStatus	0x47670004	Linux /proc/<pid>/status
LinuxMaps	0x47670009	Linux /proc/<pid>/maps

Breakpad/Crashpad Ecosystem

┌───────────────────────────────────────────────────────────────┐
│                   CRASH REPORTING PIPELINE                     │
├───────────────────────────────────────────────────────────────┤
│                                                                │
│  Application                                                   │
│      │                                                         │
│      ▼                                                         │
│  ┌─────────────────┐                                          │
│  │ Crashpad Client │  Catches crash, writes minidump          │
│  └────────┬────────┘                                          │
│           │                                                    │
│           ▼                                                    │
│  ┌─────────────────┐                                          │
│  │  Minidump File  │  .dmp file with crash data               │
│  └────────┬────────┘                                          │
│           │                                                    │
│           ▼                                                    │
│  ┌─────────────────┐                                          │
│  │  Upload Handler │  Sends to crash reporting server         │
│  └────────┬────────┘                                          │
│           │                                                    │
│           ▼                                                    │
│  ┌─────────────────┐                                          │
│  │ Symbol Server   │  Provides debug symbols                  │
│  └────────┬────────┘                                          │
│           │                                                    │
│           ▼                                                    │
│  ┌─────────────────┐                                          │
│  │ minidump_       │  Processes dump + symbols = stack trace  │
│  │ stackwalk       │                                          │
│  └────────┬────────┘                                          │
│           │                                                    │
│           ▼                                                    │
│  ┌─────────────────┐                                          │
│  │   Bug Report    │  Human-readable crash report             │
│  └─────────────────┘                                          │
│                                                                │
└───────────────────────────────────────────────────────────────┘

2.2 Why This Matters

Understanding minidumps is essential because:

Industry Standard: Chrome, Firefox, Slack, Discord, VS Code all use minidumps
Cloud-Friendly: Small size enables crash upload at scale
Cross-Platform: Same format works on Windows, macOS, Linux
Privacy-Conscious: Contains only crash-relevant data, not full memory

2.3 Historical Context

1999: Microsoft introduces minidump format in Windows 2000
2006: Google creates Breakpad for cross-platform crash reporting
2015: Google develops Crashpad as Breakpad’s successor
Today: Minidump is the de facto standard for crash reporting

2.4 Common Misconceptions

Misconception 1: “Minidumps are just small core dumps”

Reality: They’re a different format with structured streams, not just truncated memory

Misconception 2: “You can’t get a full backtrace from a minidump”

Reality: With proper symbols, you get complete stack traces

Misconception 3: “Minidumps are Windows-only”

Reality: Breakpad/Crashpad work on all major platforms

3. Project Specification

3.1 What You Will Build

A Python tool that parses minidump files and extracts:

System information (OS, architecture)
Crash information (signal, address)
Thread list with basic stack information
Loaded module list with base addresses

3.2 Functional Requirements

Parse minidump header
- Validate “MDMP” signature
- Extract stream count and directory offset
Navigate stream directory
- Enumerate all streams
- Identify stream types by ID
Parse SystemInfoStream
- Extract OS type and version
- Extract CPU architecture
Parse ModuleListStream
- List all loaded modules
- Extract module name, base address, size
Parse ExceptionStream
- Extract exception code
- Identify crashing thread and address
Generate summary report
- Output human-readable crash summary

3.3 Non-Functional Requirements

Handle both 32-bit and 64-bit minidumps
Support Linux-specific streams (from Breakpad)
Graceful handling of missing/unknown streams
Clear error messages for corrupted files

3.4 Example Usage / Output

$ python3 minidump_parser.py crash_report.dmp

╔══════════════════════════════════════════════════════════════╗
║                    MINIDUMP ANALYSIS                          ║
╠══════════════════════════════════════════════════════════════╣
║  File:      crash_report.dmp                                  ║
║  Size:      245,760 bytes                                     ║
║  Timestamp: 2025-12-20 14:30:45 UTC                          ║
╚══════════════════════════════════════════════════════════════╝

═══════════════════════════════════════════════════════════════
                       SYSTEM INFO
═══════════════════════════════════════════════════════════════
  OS:           Linux (6.1.0-13-amd64)
  CPU:          AMD64 (x86_64)
  CPU Count:    8
  CPU Vendor:   GenuineIntel

═══════════════════════════════════════════════════════════════
                       CRASH INFO
═══════════════════════════════════════════════════════════════
  Exception:    SIGSEGV (Signal 11)
  Address:      0x0000000000000000
  Thread ID:    12345

═══════════════════════════════════════════════════════════════
                    LOADED MODULES (15)
═══════════════════════════════════════════════════════════════
  Base Address         Size        Name
  ─────────────────────────────────────────────────────────────
  0x000055a8b7c00000   102,400     /usr/bin/my_application
  0x00007f8a12400000   1,920,000   /lib/x86_64-linux-gnu/libc.so.6
  0x00007f8a12800000   151,552     /lib/x86_64-linux-gnu/libpthread.so.0
  0x00007f8a12c00000   167,936     /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
  ...

═══════════════════════════════════════════════════════════════
                      STREAM SUMMARY
═══════════════════════════════════════════════════════════════
  Stream                   Offset      Size
  ─────────────────────────────────────────────────────────────
  ThreadListStream         0x00001000  2,048
  ModuleListStream         0x00001800  4,096
  ExceptionStream          0x00002800  168
  SystemInfoStream         0x00002900  56
  MemoryListStream         0x00002940  32,768
  LinuxMaps                0x0000a940  8,192
  ...

3.5 Real World Outcome

After completing this project, you’ll be able to:

Understand crash reports from Sentry, Crashlytics, or Breakpad
Build custom crash analysis tools for your organization
Contribute to open-source crash reporting projects
Debug issues with crash uploads and processing pipelines

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                      MINIDUMP PARSER                             │
└─────────────────────────────────────────────────────────────────┘
                              │
         ┌────────────────────┴────────────────────┐
         ▼                                         ▼
┌─────────────────────┐                 ┌─────────────────────┐
│  File Reader        │                 │  Report Generator   │
│  ────────────────   │                 │  ────────────────   │
│  Binary file I/O    │                 │  Format output      │
│  Seek & read        │                 │  Text/JSON/HTML     │
└──────────┬──────────┘                 └──────────▲──────────┘
           │                                       │
           ▼                                       │
┌─────────────────────────────────────────────────────────────────┐
│                        PARSER CORE                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐            │
│  │   Header    │──▶│  Directory  │──▶│   Streams   │            │
│  │   Parser    │   │   Parser    │   │   Parser    │            │
│  └─────────────┘   └─────────────┘   └──────┬──────┘            │
│                                              │                   │
│                    ┌───────────────────────────────────────────┐│
│                    │                                           ││
│                    ▼         ▼         ▼         ▼            ││
│              ┌──────────┐ ┌────────┐ ┌────────┐ ┌────────┐    ││
│              │SystemInfo│ │ Module │ │Exception│ │ Thread │    ││
│              │ Parser   │ │ Parser │ │ Parser  │ │ Parser │    ││
│              └──────────┘ └────────┘ └────────┘ └────────┘    ││
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

4.2 Key Components

MinidumpHeader: Parses the 32-byte file header
StreamDirectory: Parses and indexes all streams
SystemInfoStream: Extracts OS and CPU information
ModuleListStream: Lists loaded modules
ExceptionStream: Crash details
ThreadListStream: Thread information
ReportGenerator: Formats output

4.3 Data Structures

import struct

# Header structure (32 bytes)
MINIDUMP_HEADER = struct.Struct('<4sIIIIIQ')
# Signature, Version, NumberOfStreams, StreamDirectoryRva,
# CheckSum, TimeDateStamp, Flags (64-bit)

# Directory entry (12 bytes)
MINIDUMP_DIRECTORY = struct.Struct('<III')
# StreamType, DataSize, Rva

# Location descriptor (8 bytes)
MINIDUMP_LOCATION_DESCRIPTOR = struct.Struct('<II')
# DataSize, Rva

# Module structure (108 bytes, same layout for 32-bit and 64-bit dumps)
MINIDUMP_MODULE = struct.Struct('<QIIII52s8s8sQQ')
# BaseOfImage, SizeOfImage, CheckSum, TimeDateStamp, ModuleNameRva,
# VersionInfo (VS_FIXEDFILEINFO, 52 bytes), CvRecord, MiscRecord,
# Reserved0, Reserved1

4.4 Algorithm Overview

def parse_minidump(file_path):
    with open(file_path, 'rb') as f:
        # 1. Read and validate header
        header = parse_header(f)
        validate_signature(header)

        # 2. Seek to stream directory
        f.seek(header.stream_directory_rva)

        # 3. Parse all directory entries
        streams = {}
        for _ in range(header.number_of_streams):
            entry = parse_directory_entry(f)
            streams[entry.stream_type] = entry

        # 4. Parse each stream of interest
        result = MinidumpInfo()

        if SYSTEM_INFO in streams:
            result.system_info = parse_system_info(f, streams[SYSTEM_INFO])

        if MODULE_LIST in streams:
            result.modules = parse_module_list(f, streams[MODULE_LIST])

        if EXCEPTION in streams:
            result.exception = parse_exception(f, streams[EXCEPTION])

        return result
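
The overview above leaves parse_header and parse_directory_entry undefined. The following is a minimal sketch of those two steps, assuming the 32-byte header layout with a 64-bit Flags field. The names Header, DirectoryEntry, read_exact, and read_directory are illustrative choices, not part of any library, and the sample bytes are synthetic.

```python
import io
import struct
from typing import BinaryIO, List, NamedTuple

# Signature, Version, NumberOfStreams, StreamDirectoryRva,
# CheckSum, TimeDateStamp, Flags (64-bit) -> 32 bytes
HEADER = struct.Struct('<4sIIIIIQ')
# StreamType, DataSize, Rva -> 12 bytes
DIRECTORY_ENTRY = struct.Struct('<III')


class Header(NamedTuple):
    signature: bytes
    version: int
    number_of_streams: int
    stream_directory_rva: int
    checksum: int
    time_date_stamp: int
    flags: int


class DirectoryEntry(NamedTuple):
    stream_type: int
    data_size: int
    rva: int


def read_exact(f: BinaryIO, size: int) -> bytes:
    """Read exactly `size` bytes or raise on a truncated file."""
    data = f.read(size)
    if len(data) != size:
        raise ValueError(f"truncated file: wanted {size} bytes, got {len(data)}")
    return data


def parse_header(f: BinaryIO) -> Header:
    f.seek(0)
    header = Header(*HEADER.unpack(read_exact(f, HEADER.size)))
    if header.signature != b'MDMP':
        raise ValueError(f"bad signature: {header.signature!r}")
    return header


def read_directory(f: BinaryIO, header: Header) -> List[DirectoryEntry]:
    f.seek(header.stream_directory_rva)
    return [
        DirectoryEntry(*DIRECTORY_ENTRY.unpack(read_exact(f, DIRECTORY_ENTRY.size)))
        for _ in range(header.number_of_streams)
    ]


# Synthetic two-stream dump: header, then the directory at offset 32.
blob = HEADER.pack(b'MDMP', 0xA793, 2, 32, 0, 0, 0)
blob += DIRECTORY_ENTRY.pack(7, 56, 56)    # SystemInfoStream
blob += DIRECTORY_ENTRY.pack(4, 112, 112)  # ModuleListStream
dump_file = io.BytesIO(blob)
entries = read_directory(dump_file, parse_header(dump_file))
print([(e.stream_type, e.rva) for e in entries])
```

Raising ValueError from read_exact keeps truncated files from surfacing as an opaque struct.error deeper in the parser.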

5. Implementation Guide

5.1 Development Environment Setup

# Create project directory
mkdir minidump_parser
cd minidump_parser

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# No external dependencies needed - using only stdlib
# (struct, mmap, io modules)

5.2 Project Structure

minidump_parser/
├── minidump/
│   ├── __init__.py
│   ├── header.py           # Header parsing
│   ├── directory.py        # Stream directory
│   ├── streams/
│   │   ├── __init__.py
│   │   ├── system_info.py  # SystemInfoStream
│   │   ├── modules.py      # ModuleListStream
│   │   ├── exception.py    # ExceptionStream
│   │   ├── threads.py      # ThreadListStream
│   │   └── memory.py       # MemoryListStream
│   └── parser.py           # Main parser
├── tests/
│   ├── test_header.py
│   ├── test_streams.py
│   └── fixtures/           # Sample minidumps
├── minidump_parser.py      # CLI entry point
└── README.md

5.3 The Core Question You’re Answering

“How do you extract crash information from a compact binary format designed for efficient crash reporting at scale?”

This requires:

Understanding binary file structure and navigation
Parsing variable-length records with offsets
Handling platform-specific variations
Converting binary data to human-readable output

5.4 Concepts You Must Understand First

Binary file I/O in Python
- struct module for unpacking binary data
- File seeking and reading
Little-endian byte ordering
- Intel/AMD use little-endian
- Format string < in struct
RVA (Relative Virtual Address)
- Offsets from start of file
- How stream data is located
Unicode string encoding
- Module names are UTF-16LE in minidumps
Platform-specific data
- Linux vs Windows stream types
- Architecture-specific register contexts

5.5 Questions to Guide Your Design

Architecture Questions:

How will you handle unknown stream types gracefully?
Should parsing be lazy (on-demand) or eager (all at once)?
How will you handle 32-bit vs 64-bit differences?

Implementation Questions:

Will you memory-map the file or use regular reads?
How will you represent parsed data (dataclasses? dicts?)?
How will you handle corrupt or truncated files?

Output Questions:

What output formats will you support (text, JSON, etc.)?
How verbose should the output be?
Should you support filtering (e.g., only show modules)?

5.6 Thinking Exercise

Before coding, manually parse this hex dump of a minidump header:

4D 44 4D 50  93 A7 00 00  08 00 00 00  20 00 00 00
00 00 00 00  A1 B2 C3 D4  02 00 00 00  00 00 00 00

Questions:

What is the signature? (First 4 bytes)
What is the version? (Bytes 4-7)
How many streams are there? (Bytes 8-11)
Where is the stream directory? (Bytes 12-15)
What is the timestamp? (Bytes 20-23)
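
Once you have worked through the questions by hand, a few lines of Python can check your answers. This sketch decodes only the first 24 bytes of the dump above, which cover every field the questions ask about; the variable names are illustrative.

```python
import struct

# First 24 bytes of the hex dump: Signature through TimeDateStamp.
raw = bytes.fromhex(
    "4D 44 4D 50  93 A7 00 00  08 00 00 00  20 00 00 00 "
    "00 00 00 00  A1 B2 C3 D4"
)

signature, version, streams, directory_rva, checksum, timestamp = struct.unpack(
    '<4sIIIII', raw
)

print(signature)            # b'MDMP'
print(hex(version))         # 0xa793
print(streams)              # 8
print(hex(directory_rva))   # 0x20
print(hex(timestamp))       # 0xd4c3b2a1
```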

5.7 Hints in Layers

Hint 1 - Start Simple: Begin by just parsing and printing the header. Verify the signature is “MDMP”.

Hint 2 - Stream Types:

STREAM_TYPES = {
    0: "UnusedStream",
    3: "ThreadListStream",
    4: "ModuleListStream",
    5: "MemoryListStream",
    6: "ExceptionStream",
    7: "SystemInfoStream",
    # Linux-specific (Breakpad)
    0x47670003: "LinuxCpuInfo",
    0x47670004: "LinuxProcStatus",
    0x47670009: "LinuxMaps",
}

Hint 3 - String Handling: Module names are stored as MINIDUMP_STRING structures:

def read_minidump_string(f, rva):
    f.seek(rva)
    length = struct.unpack('<I', f.read(4))[0]  # Length in bytes
    data = f.read(length)
    return data.decode('utf-16-le')

Hint 4 - Getting Test Files: Generate a minidump by crashing a program that has the Breakpad client linked in (Breakpad’s dump_syms produces symbol files, not minidumps), or download an existing sample:

# Sample from Sentry's test fixtures
wget https://github.com/getsentry/symbolic/raw/master/symbolic-testutils/fixtures/linux.dmp

5.8 The Interview Questions They’ll Ask

“How would you handle a minidump that’s 10GB?”
- Expected: Use memory mapping or streaming, don’t load entire file
“What’s the difference between RVA and VA?”
- Expected: RVA is file offset, VA is runtime memory address
“How do you symbolicate a stack trace from a minidump?”
- Expected: Need symbol files, use stack pointer from threads, walk frames
“How would you add support for a new stream type?”
- Expected: Describe modular design, adding new parser while handling unknown gracefully
“How does Breakpad capture a minidump without crashing the handler?”
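- Expected: Breakpad’s Linux handler runs inside the crashing process, so it limits itself to async-signal-safe operations and pre-allocated memory; Crashpad moves dump writing to a separate handler process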

5.9 Books That Will Help

Topic	Book	Chapter(s)
Binary Parsing	“Practical Binary Analysis” - Andriesse	Ch. 1-2
Python struct	Python documentation	struct module
File Formats	“File Format Design” - various	General principles
Crash Reporting	Breakpad/Crashpad docs	Getting Started guides

5.10 Implementation Phases

Phase 1: Header Parsing (Days 1-2)

Parse MINIDUMP_HEADER
Validate signature and version
Print basic file info

Phase 2: Stream Directory (Days 3-4)

Parse stream directory
Enumerate all streams
Map stream types to names

Phase 3: SystemInfoStream (Days 5-6)

Parse system information
Extract OS and CPU details
Handle different CPU architectures
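
As a starting point for this phase, here is a minimal sketch of decoding the 56-byte system info record. The field order follows the MINIDUMP_SYSTEM_INFO layout; the CPU_ARCH and PLATFORM tables are partial, and the non-Windows platform IDs are Breakpad values quoted from memory, so check them against Breakpad's minidump_format.h. The SystemInfo dataclass and the sample bytes are illustrative.

```python
import struct
from dataclasses import dataclass

# ProcessorArchitecture, ProcessorLevel, ProcessorRevision (3 x u16),
# NumberOfProcessors, ProductType (2 x u8), MajorVersion, MinorVersion,
# BuildNumber, PlatformId, CSDVersionRva (5 x u32), SuiteMask, Reserved2
# (2 x u16), then a 24-byte CPU information union kept as raw bytes.
SYSTEM_INFO = struct.Struct('<HHHBBIIIIIHH24s')

# Partial tables; extend as needed (assumed values, verify before relying on them).
CPU_ARCH = {0: "x86", 5: "ARM", 9: "AMD64", 12: "ARM64"}
PLATFORM = {2: "Windows NT", 0x8101: "macOS", 0x8201: "Linux", 0x8203: "Android"}


@dataclass
class SystemInfo:
    arch: str
    cpu_count: int
    os_name: str
    os_version: str


def parse_system_info(stream: bytes) -> SystemInfo:
    if len(stream) < SYSTEM_INFO.size:
        raise ValueError(f"SystemInfoStream too short: {len(stream)} bytes")
    (arch, _level, _revision, cpus, _product, major, minor, build,
     platform_id, _csd_rva, _suite, _reserved, _cpu) = SYSTEM_INFO.unpack_from(stream)
    return SystemInfo(
        arch=CPU_ARCH.get(arch, f"unknown (0x{arch:x})"),
        cpu_count=cpus,
        os_name=PLATFORM.get(platform_id, f"unknown (0x{platform_id:x})"),
        os_version=f"{major}.{minor}.{build}",
    )


# Synthetic record: AMD64, 8 CPUs, Linux, version fields 6.1.0.
sample = SYSTEM_INFO.pack(9, 6, 0, 8, 0, 6, 1, 0, 0x8201, 0, 0, 0, bytes(24))
info = parse_system_info(sample)
print(info)
```

Unknown architecture or platform values fall through to an "unknown" label instead of raising, which keeps the report usable for dumps from platforms the tables do not list.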

Phase 4: ModuleListStream (Days 7-9)

Parse module list
Extract module names (handle UTF-16)
Display base addresses and sizes
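
One possible shape for this phase is sketched below, assuming the stream starts with a 32-bit module count followed by 108-byte module records, and that each record's ModuleNameRva points at a length-prefixed UTF-16LE string. It works on the whole file as a bytes object; read_string, parse_module_list, and the sample data are illustrative.

```python
import struct

# BaseOfImage (u64), SizeOfImage, CheckSum, TimeDateStamp, ModuleNameRva
# (4 x u32), VS_FIXEDFILEINFO (52 bytes), CvRecord and MiscRecord location
# descriptors (8 bytes each), two reserved u64 fields -> 108 bytes.
MODULE = struct.Struct('<QIIII52s8s8sQQ')


def read_string(data: bytes, rva: int) -> str:
    """Decode a MINIDUMP_STRING: u32 byte length, then UTF-16LE text."""
    (length,) = struct.unpack_from('<I', data, rva)
    raw = data[rva + 4:rva + 4 + length]
    if len(raw) != length:
        raise ValueError(f"string at 0x{rva:x} runs past end of file")
    return raw.decode('utf-16-le', errors='replace')


def parse_module_list(data: bytes, stream_rva: int):
    """Return (name, base_address, size) for every module in the stream."""
    (count,) = struct.unpack_from('<I', data, stream_rva)
    modules = []
    for i in range(count):
        fields = MODULE.unpack_from(data, stream_rva + 4 + i * MODULE.size)
        base, size, name_rva = fields[0], fields[1], fields[4]
        modules.append((read_string(data, name_rva), base, size))
    return modules


# Synthetic file: one module record at offset 0, its name stored at offset 112.
name = "/usr/bin/my_application".encode('utf-16-le')
stream = struct.pack('<I', 1) + MODULE.pack(
    0x55A8B7C00000, 102400, 0, 0, 112, bytes(52), bytes(8), bytes(8), 0, 0)
data = stream + struct.pack('<I', len(name)) + name

for module_name, base, size in parse_module_list(data, 0):
    print(f"0x{base:016x}  {size:>10,}  {module_name}")
```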

Phase 5: ExceptionStream (Days 10-12)

Parse exception information
Map exception codes to signal names
Extract crash address
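
The steps in this phase might look like the sketch below. It assumes the 168-byte exception stream layout and that Linux dumps written by Breakpad store the signal number in ExceptionCode. LINUX_SIGNALS is a deliberately small, hand-written table using the x86/ARM Linux numbering, and parse_exception is an illustrative name.

```python
import struct

# ThreadId, alignment padding, then the exception record: ExceptionCode,
# ExceptionFlags (u32 each), ExceptionRecord, ExceptionAddress (u64 each),
# NumberParameters, padding (u32 each), ExceptionInformation[15] (u64 each),
# and finally the ThreadContext location descriptor (DataSize, Rva).
EXCEPTION_STREAM = struct.Struct('<II II QQ II 15Q II')

# Partial table, Linux x86/ARM numbering (assumption: Breakpad Linux dump).
LINUX_SIGNALS = {4: "SIGILL", 6: "SIGABRT", 7: "SIGBUS", 8: "SIGFPE", 11: "SIGSEGV"}


def parse_exception(stream: bytes) -> dict:
    fields = EXCEPTION_STREAM.unpack_from(stream)
    thread_id, code, address = fields[0], fields[2], fields[5]
    return {
        "thread_id": thread_id,
        "code": code,
        "name": LINUX_SIGNALS.get(code, f"unknown (0x{code:08x})"),
        "address": address,
    }


# Synthetic stream: thread 12345 hit SIGSEGV at address 0.
sample = EXCEPTION_STREAM.pack(12345, 0, 11, 0, 0, 0, 0, 0, *([0] * 15), 0, 0)
exc = parse_exception(sample)
print(f"{exc['name']} at 0x{exc['address']:016x} on thread {exc['thread_id']}")
```

Windows dumps use NTSTATUS-style codes such as 0xC0000005 instead, so a real parser would pick the lookup table based on the platform reported by SystemInfoStream.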

Phase 6: Polish & Testing (Days 13-15)

Add command-line interface
Test with various minidumps
Handle edge cases and errors

5.11 Key Implementation Decisions

Use dataclasses for parsed structures (Python 3.7+)
Lazy parsing - only parse streams when accessed
Memory mapping for large files (mmap module)
Error handling - wrap all parsing in try/except, never crash on bad input
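
To make the lazy-parsing and memory-mapping decisions concrete, here is a minimal sketch of a wrapper that parses the directory only on first access and hands out zero-copy views of stream data. The Minidump class and its method names are hypothetical; it accepts any bytes-like buffer so the same code serves both mmap-backed files and in-memory test data.

```python
import mmap
import struct
from functools import cached_property

HEADER = struct.Struct('<4sIIIIIQ')
DIRECTORY_ENTRY = struct.Struct('<III')


class Minidump:
    """Wraps a bytes-like buffer; nothing is parsed until first access."""

    def __init__(self, buffer):
        self._view = memoryview(buffer)

    @classmethod
    def open(cls, path):
        with open(path, 'rb') as f:
            # The mapping stays valid after the file object is closed.
            return cls(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))

    @cached_property
    def directory(self):
        """Map stream type -> (rva, size); parsed once, on first use."""
        signature, _version, count, rva, *_ = HEADER.unpack_from(self._view, 0)
        if signature != b'MDMP':
            raise ValueError("not a minidump")
        entries = {}
        for i in range(count):
            stream_type, size, data_rva = DIRECTORY_ENTRY.unpack_from(
                self._view, rva + i * DIRECTORY_ENTRY.size)
            entries[stream_type] = (data_rva, size)
        return entries

    def stream(self, stream_type):
        """Return a zero-copy view of one stream, or None if it is absent."""
        if stream_type not in self.directory:
            return None
        rva, size = self.directory[stream_type]
        if rva + size > len(self._view):
            raise ValueError(f"stream {stream_type:#x} extends past end of file")
        return self._view[rva:rva + size]


# Synthetic dump: one 4-byte stream of type 7 stored right after the directory.
blob = (HEADER.pack(b'MDMP', 0xA793, 1, 32, 0, 0, 0)
        + DIRECTORY_ENTRY.pack(7, 4, 44)
        + b'\x09\x00\x06\x00')
dump = Minidump(blob)
print(bytes(dump.stream(7)))
print(dump.stream(4))
```

Because stream() returns a memoryview slice, a multi-gigabyte Memory64ListStream is never copied unless a caller explicitly converts it to bytes.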

6. Testing Strategy

Unit Tests

import io
import struct

def test_header_parsing():
    # Create a minimal valid header (4-byte signature + 28 bytes = 32 bytes)
    header_bytes = b'MDMP' + struct.pack('<IIIIIQ',
        0xa793,  # version
        1,       # number of streams
        32,      # stream directory offset
        0,       # checksum
        0x12345678,  # timestamp
        0        # flags (64-bit)
    )
    assert len(header_bytes) == 32
    header = parse_header(io.BytesIO(header_bytes))
    assert header.signature == b'MDMP'
    assert header.number_of_streams == 1

Integration Tests

Parse real minidumps from Sentry’s test fixtures
Compare output with minidump_dump tool (from Breakpad)
Verify module list matches minidump_stackwalk output

Verification Checklist

Correctly identifies minidump signature
Parses all standard stream types
Handles Linux-specific streams
Correctly decodes UTF-16 module names
Handles 32-bit and 64-bit minidumps

7. Common Pitfalls & Debugging

Pitfall 1: Byte Order Confusion

Problem: Numbers parse as garbage

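Cause: Minidumps are little-endian; without the < prefix, struct falls back to the host’s native byte order and alignment, which can swap bytes or insert padding between fields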

Solution:

# Wrong
struct.unpack('I', data)

# Right
struct.unpack('<I', data)

Pitfall 2: String Encoding

Problem: Module names are garbled

Cause: Minidump strings are UTF-16-LE, not UTF-8

Solution:

# Wrong
name = data.decode('utf-8')

# Right
name = data.decode('utf-16-le').rstrip('\x00')

Pitfall 3: Off-by-One in Stream Sizes

Problem: Parsing reads past stream boundary

Cause: Size includes header or doesn’t include padding

Solution: Always verify position after parsing:

expected_end = stream.rva + stream.size
# ... parse...
actual_end = f.tell()
assert actual_end <= expected_end, f"Read past stream end: {actual_end} > {expected_end}"

Pitfall 4: Architecture Mismatch

Problem: 64-bit minidump parses incorrectly

Cause: Thread context (register) records differ in layout and size between CPU architectures, even though most other minidump structures are fixed-size

Solution: Check SystemInfoStream.ProcessorArchitecture first, then use appropriate struct sizes

8. Extensions & Challenges

Extension 1: Stack Walking

Implement basic stack unwinding:

Use thread context registers (RSP/RBP)
Walk stack frames using frame pointers
Without symbols, show raw addresses
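
A minimal sketch of frame-pointer walking on x86-64 follows. It assumes the crashing code was built with frame pointers, so each frame stores the caller's RBP followed by the return address, and that you have already extracted the thread's stack bytes and its RIP/RBP values from the dump. The function name and the sample stack are illustrative.

```python
import struct


def walk_frame_pointers(stack: bytes, stack_base: int, rip: int, rbp: int,
                        max_frames: int = 64):
    """Return raw frame addresses by following the saved-RBP chain.

    `stack` is the thread's captured stack memory and `stack_base` is the
    address of its first byte.
    """
    frames = [rip]
    while len(frames) < max_frames:
        offset = rbp - stack_base
        # Each frame stores [saved RBP][return address] at RBP.
        if offset < 0 or offset + 16 > len(stack):
            break
        saved_rbp, return_address = struct.unpack_from('<QQ', stack, offset)
        if return_address == 0:
            break
        frames.append(return_address)
        # The stack grows down, so a caller's frame must sit at a higher address.
        if saved_rbp <= rbp:
            break
        rbp = saved_rbp
    return frames


# Synthetic 64-byte stack holding two chained frames.
base = 0x7FFC0000
stack = bytearray(64)
struct.pack_into('<QQ', stack, 0x10, base + 0x30, 0x401200)
struct.pack_into('<QQ', stack, 0x30, 0, 0x401300)

for address in walk_frame_pointers(bytes(stack), base, rip=0x401100, rbp=base + 0x10):
    print(f"0x{address:016x}")
```

The bounds check and the "must move upward" check matter on real dumps, where a corrupted stack could otherwise send the walker into a loop.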

Extension 2: Symbol Integration

Add symbolication support:

Parse Breakpad .sym files
Map addresses to function names
Generate readable stack traces
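
As a rough sketch of the address-to-name step, the code below reads only the FUNC records of a Breakpad symbol file, assuming the form "FUNC [m] address size parameter_size name" with hexadecimal numbers and addresses relative to the module's load address. It ignores line records, PUBLIC records, and inlining; the function names and the sample symbol text are made up for illustration.

```python
import bisect


def parse_functions(sym_text: str):
    """Collect (address, size, name) from FUNC records, sorted by address."""
    functions = []
    for line in sym_text.splitlines():
        if not line.startswith('FUNC '):
            continue
        parts = line.split(' ')
        # An optional "m" marker may follow the FUNC keyword.
        start = 2 if parts[1] == 'm' else 1
        address = int(parts[start], 16)
        size = int(parts[start + 1], 16)
        # parts[start + 2] is the parameter size; the name is the rest of
        # the line and may itself contain spaces.
        name = ' '.join(parts[start + 3:])
        functions.append((address, size, name))
    functions.sort()
    return functions


def lookup(functions, module_base: int, address: int):
    """Map an absolute address to 'function + offset', or None."""
    offset = address - module_base
    starts = [f[0] for f in functions]
    i = bisect.bisect_right(starts, offset) - 1
    if i >= 0:
        start, size, name = functions[i]
        if offset < start + size:
            return f"{name} + 0x{offset - start:x}"
    return None


SYM = """MODULE Linux x86_64 000000000000000000000000000000000 my_application
FUNC 1130 2a 0 main
FUNC m 1160 18 0 crash_here(int*, unsigned long)
"""

functions = parse_functions(SYM)
print(lookup(functions, 0x55A8B7C00000, 0x55A8B7C00000 + 0x1164))
```

Sorting once and using bisect keeps each lookup logarithmic, which matters when a symbol file holds hundreds of thousands of functions.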

Extension 3: Memory Analysis

Parse MemoryListStream:

Extract memory regions
Search for patterns in memory
Display heap/stack contents

Extension 4: Crash Clustering

Compare multiple minidumps:

Generate crash signatures
Group similar crashes
Identify most common crash patterns
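
One simple way to approach this, shown as a sketch rather than how any particular crash service does it, is to hash the top few function names of the crashing thread and count dumps per hash. The crash_signature helper, the depth of three frames, and the sample stacks are all illustrative assumptions.

```python
import hashlib
from collections import Counter


def crash_signature(frames, depth=3):
    """Hash the top `depth` function names into a short, stable ID."""
    key = '|'.join(frames[:depth])
    return hashlib.sha256(key.encode('utf-8')).hexdigest()[:12]


# Symbolicated stacks from three hypothetical dumps, innermost frame first.
crashes = [
    ["crash_here", "process_request", "main"],
    ["crash_here", "process_request", "main"],
    ["parse_config", "main"],
]

groups = Counter(crash_signature(frames) for frames in crashes)
for signature, count in groups.most_common():
    print(signature, count)
```

Using function names rather than raw addresses keeps signatures stable across builds and address-space layout randomization.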

9. Real-World Connections

Chrome Crash Reporting

Chrome uses Crashpad to:

Catch crashes in a separate process
Generate minidumps with relevant streams
Upload to Google’s crash servers
Symbolicate and deduplicate

Sentry Integration

Sentry processes minidumps:

Receive uploaded minidump
Parse using symbolic-minidump
Fetch symbols from symbol servers
Generate readable stack traces
Group by crash signature

Mozilla Socorro

Firefox’s crash reporting:

Breakpad client captures crash
Uploads to Socorro server
Processes with minidump-analyzer
Displays in crash-stats dashboard

10. Resources

Official Documentation

Source Code References

Sample Minidumps

11. Self-Assessment Checklist

Before You Start

Comfortable with Python’s struct module
Understand binary file I/O
Know what little-endian means
Have access to sample minidump files

After Completion

Can explain minidump format structure
Can parse headers, directories, and streams
Can extract system info, modules, and crash details
Can handle different platforms and architectures
Understand the Breakpad/Crashpad ecosystem
Can extend parser to handle new stream types

12. Submission / Completion Criteria

Your project is complete when you can:

Parse Sample Minidumps
- Successfully parse Linux minidumps from Breakpad
- Successfully parse at least one Windows minidump
- Handle corrupt/truncated files gracefully
Generate Useful Output
- Show system information
- List all loaded modules
- Display crash/exception information
- Support at least text and JSON output
Demonstrate Understanding
- Explain the minidump structure
- Describe how stream navigation works
- Answer interview questions from section 5.8
Handle Edge Cases
- Unknown stream types (log, don’t crash)
- Missing streams (handle gracefully)
- Large files (don’t run out of memory)

Next: Project 8: Introduction to Kernel Panics - Enter the kernel crash domain