Project 7: The Minidump Parser

Build a parser for the minidump format used by Google Breakpad, the industry standard for cross-platform crash reporting.

Quick Reference

Attribute       Value
Difficulty      Advanced
Time Estimate   2-3 weeks
Language        Python
Prerequisites   Project 1, strong binary parsing skills
Key Topics      Minidump format, Breakpad/Crashpad, binary stream parsing

1. Learning Objectives

By completing this project, you will:

  • Understand the minidump file format used by major browsers and applications
  • Learn to parse complex binary structures with nested stream directories
  • Implement a practical tool that extracts crash information from minidumps
  • Understand why minidumps are preferred over full core dumps in production
  • Gain experience with Python’s struct module for binary parsing
  • Learn the Breakpad/Crashpad ecosystem used by Google, Mozilla, and Microsoft

2. Theoretical Foundation

2.1 Core Concepts

What is a Minidump?

A minidump is a smaller, more portable crash dump format originally developed by Microsoft for Windows. Google’s Breakpad project extended it for cross-platform use, making it the standard for:

  • Chrome, Firefox, Edge browsers
  • Electron applications
  • Sentry, Crashlytics, and other crash reporting services
┌─────────────────────────────────────────────────────────────┐
│              FULL CORE DUMP vs MINIDUMP                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Full Core Dump                    Minidump                  │
│  ──────────────────────           ──────────────────────     │
│  Size: 500MB - 10GB               Size: 50KB - 5MB           │
│  Contains ALL memory              Contains crash essentials  │
│  Platform-specific (ELF)          Cross-platform             │
│  Local analysis only              Upload-friendly            │
│  Complete but heavy               Optimized for triage       │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Minidump Structure

┌────────────────────────────────────────────────────────────────┐
│                     MINIDUMP FILE STRUCTURE                     │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  HEADER (32 bytes)                                      │   │
│  │  ─────────────────                                      │   │
│  │  Signature: "MDMP" (0x504D444D)                        │   │
│  │  Version: 0xa793                                        │   │
│  │  NumberOfStreams: N                                     │   │
│  │  StreamDirectoryRva: offset to directory                │   │
│  │  Checksum, TimeDateStamp, Flags                         │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  STREAM DIRECTORY (N × 12 bytes)                        │   │
│  │  ─────────────────                                      │   │
│  │  Entry 1: [StreamType, DataSize, Rva]                   │   │
│  │  Entry 2: [StreamType, DataSize, Rva]                   │   │
│  │  Entry 3: [StreamType, DataSize, Rva]                   │   │
│  │  ...                                                     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  STREAM DATA (variable size)                            │   │
│  │  ─────────────────                                      │   │
│  │                                                         │   │
│  │  ThreadListStream:                                      │   │
│  │    Thread 1: [id, stack, context]                       │   │
│  │    Thread 2: [id, stack, context]                       │   │
│  │                                                         │   │
│  │  ModuleListStream:                                      │   │
│  │    Module 1: [name, base_addr, size, version]          │   │
│  │    Module 2: [name, base_addr, size, version]          │   │
│  │                                                         │   │
│  │  SystemInfoStream:                                      │   │
│  │    [OS, CPU arch, version, etc.]                        │   │
│  │                                                         │   │
│  │  ExceptionStream:                                       │   │
│  │    [Exception code, address, thread]                    │   │
│  │                                                         │   │
│  │  MemoryListStream:                                      │   │
│  │    [Selected memory regions]                            │   │
│  │                                                         │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└────────────────────────────────────────────────────────────────┘

Stream Types

Common stream types you’ll encounter:

Stream Type          ID          Contents
ThreadListStream     3           All threads with stack and registers
ModuleListStream     4           Loaded modules/libraries
MemoryListStream     5           Selected memory regions
ExceptionStream      6           Exception/crash information
SystemInfoStream     7           OS and CPU information
ThreadExListStream   8           Extended thread info
Memory64ListStream   9           64-bit memory regions
LinuxCpuInfo         0x47670003  Linux /proc/cpuinfo
LinuxProcStatus      0x47670004  Linux /proc/<pid>/status
LinuxMaps            0x47670009  Linux /proc/<pid>/maps

Breakpad/Crashpad Ecosystem

┌───────────────────────────────────────────────────────────────┐
│                   CRASH REPORTING PIPELINE                     │
├───────────────────────────────────────────────────────────────┤
│                                                                │
│  Application                                                   │
│      │                                                         │
│      ▼                                                         │
│  ┌─────────────────┐                                          │
│  │ Crashpad Client │  Catches crash, writes minidump          │
│  └────────┬────────┘                                          │
│           │                                                    │
│           ▼                                                    │
│  ┌─────────────────┐                                          │
│  │  Minidump File  │  .dmp file with crash data               │
│  └────────┬────────┘                                          │
│           │                                                    │
│           ▼                                                    │
│  ┌─────────────────┐                                          │
│  │  Upload Handler │  Sends to crash reporting server         │
│  └────────┬────────┘                                          │
│           │                                                    │
│           ▼                                                    │
│  ┌─────────────────┐                                          │
│  │ Symbol Server   │  Provides debug symbols                  │
│  └────────┬────────┘                                          │
│           │                                                    │
│           ▼                                                    │
│  ┌─────────────────┐                                          │
│  │ minidump_       │  Processes dump + symbols = stack trace  │
│  │ stackwalk       │                                          │
│  └────────┬────────┘                                          │
│           │                                                    │
│           ▼                                                    │
│  ┌─────────────────┐                                          │
│  │   Bug Report    │  Human-readable crash report             │
│  └─────────────────┘                                          │
│                                                                │
└───────────────────────────────────────────────────────────────┘

2.2 Why This Matters

Understanding minidumps is essential because:

  1. Industry Standard: Chrome, Firefox, Slack, Discord, VS Code all use minidumps
  2. Cloud-Friendly: Small size enables crash upload at scale
  3. Cross-Platform: Same format works on Windows, macOS, Linux
  4. Privacy-Conscious: Contains only crash-relevant data, not full memory

2.3 Historical Context

  • 1999: Microsoft introduces minidump format in Windows 2000
  • 2006: Google creates Breakpad for cross-platform crash reporting
  • 2015: Google develops Crashpad as Breakpad’s successor
  • Today: Minidump is the de facto standard for crash reporting

2.4 Common Misconceptions

Misconception 1: “Minidumps are just small core dumps”

  • Reality: They’re a different format with structured streams, not just truncated memory

Misconception 2: “You can’t get a full backtrace from a minidump”

  • Reality: With proper symbols, you get complete stack traces

Misconception 3: “Minidumps are Windows-only”

  • Reality: Breakpad/Crashpad work on all major platforms

3. Project Specification

3.1 What You Will Build

A Python tool that parses minidump files and extracts:

  1. System information (OS, architecture)
  2. Crash information (signal, address)
  3. Thread list with basic stack information
  4. Loaded module list with base addresses

3.2 Functional Requirements

  1. Parse minidump header
    • Validate “MDMP” signature
    • Extract stream count and directory offset
  2. Navigate stream directory
    • Enumerate all streams
    • Identify stream types by ID
  3. Parse SystemInfoStream
    • Extract OS type and version
    • Extract CPU architecture
  4. Parse ModuleListStream
    • List all loaded modules
    • Extract module name, base address, size
  5. Parse ExceptionStream
    • Extract exception code
    • Identify crashing thread and address
  6. Generate summary report
    • Output human-readable crash summary

3.3 Non-Functional Requirements

  • Handle both 32-bit and 64-bit minidumps
  • Support Linux-specific streams (from Breakpad)
  • Graceful handling of missing/unknown streams
  • Clear error messages for corrupted files

3.4 Example Usage / Output

$ python3 minidump_parser.py crash_report.dmp

╔══════════════════════════════════════════════════════════════╗
║                    MINIDUMP ANALYSIS                          ║
╠══════════════════════════════════════════════════════════════╣
║  File:      crash_report.dmp                                  ║
║  Size:      245,760 bytes                                     ║
║  Timestamp: 2025-12-20 14:30:45 UTC                          ║
╚══════════════════════════════════════════════════════════════╝

═══════════════════════════════════════════════════════════════
                       SYSTEM INFO
═══════════════════════════════════════════════════════════════
  OS:           Linux (6.1.0-13-amd64)
  CPU:          AMD64 (x86_64)
  CPU Count:    8
  CPU Vendor:   GenuineIntel

═══════════════════════════════════════════════════════════════
                       CRASH INFO
═══════════════════════════════════════════════════════════════
  Exception:    SIGSEGV (Signal 11)
  Address:      0x0000000000000000
  Thread ID:    12345

═══════════════════════════════════════════════════════════════
                    LOADED MODULES (15)
═══════════════════════════════════════════════════════════════
  Base Address         Size        Name
  ─────────────────────────────────────────────────────────────
  0x000055a8b7c00000   102,400     /usr/bin/my_application
  0x00007f8a12400000   1,920,000   /lib/x86_64-linux-gnu/libc.so.6
  0x00007f8a12800000   151,552     /lib/x86_64-linux-gnu/libpthread.so.0
  0x00007f8a12c00000   167,936     /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
  ...

═══════════════════════════════════════════════════════════════
                      STREAM SUMMARY
═══════════════════════════════════════════════════════════════
  Stream                   Offset      Size
  ─────────────────────────────────────────────────────────────
  ThreadListStream         0x00001000  2,048
  ModuleListStream         0x00001800  4,096
  ExceptionStream          0x00002800  168
  SystemInfoStream         0x00002900  56
  MemoryListStream         0x00002940  32,768
  LinuxMaps                0x0000a940  8,192
  ...

3.5 Real World Outcome

After completing this project, you’ll be able to:

  • Understand crash reports from Sentry, Crashlytics, or Breakpad
  • Build custom crash analysis tools for your organization
  • Contribute to open-source crash reporting projects
  • Debug issues with crash uploads and processing pipelines

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                      MINIDUMP PARSER                             │
└─────────────────────────────────────────────────────────────────┘
                              │
         ┌────────────────────┴────────────────────┐
         ▼                                         ▼
┌─────────────────────┐                 ┌─────────────────────┐
│  File Reader        │                 │  Report Generator   │
│  ────────────────   │                 │  ────────────────   │
│  Binary file I/O    │                 │  Format output      │
│  Seek & read        │                 │  Text/JSON/HTML     │
└──────────┬──────────┘                 └──────────▲──────────┘
           │                                       │
           ▼                                       │
┌─────────────────────────────────────────────────────────────────┐
│                        PARSER CORE                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐            │
│  │   Header    │──▶│  Directory  │──▶│   Streams   │            │
│  │   Parser    │   │   Parser    │   │   Parser    │            │
│  └─────────────┘   └─────────────┘   └──────┬──────┘            │
│                                             │                   │
│                   ┌───────────┬───────────┬─┴────────┐          │
│                   ▼           ▼           ▼          ▼          │
│              ┌──────────┐ ┌────────┐ ┌─────────┐ ┌────────┐     │
│              │SystemInfo│ │ Module │ │Exception│ │ Thread │     │
│              │  Parser  │ │ Parser │ │ Parser  │ │ Parser │     │
│              └──────────┘ └────────┘ └─────────┘ └────────┘     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

4.2 Key Components

  1. MinidumpHeader: Parses the 32-byte file header
  2. StreamDirectory: Parses and indexes all streams
  3. SystemInfoStream: Extracts OS and CPU information
  4. ModuleListStream: Lists loaded modules
  5. ExceptionStream: Crash details
  6. ThreadListStream: Thread information
  7. ReportGenerator: Formats output

4.3 Data Structures

import struct

# Header structure (32 bytes)
MINIDUMP_HEADER = struct.Struct('<4sIIIIIQ')
# Signature, Version, NumberOfStreams, StreamDirectoryRva,
# Checksum, TimeDateStamp, Flags (64-bit)

# Directory entry (12 bytes)
MINIDUMP_DIRECTORY = struct.Struct('<III')
# StreamType, DataSize, Rva

# Location descriptor (8 bytes)
MINIDUMP_LOCATION_DESCRIPTOR = struct.Struct('<II')
# DataSize, Rva

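# Module structure (108 bytes, for both 32-bit and 64-bit dumps)
MINIDUMP_MODULE = struct.Struct('<QIIII13IIIIIQQ')
# BaseOfImage, SizeOfImage, CheckSum, TimeDateStamp, ModuleNameRva,
# VersionInfo (13 x uint32), CvRecord (DataSize, Rva),
# MiscRecord (DataSize, Rva), Reserved0, Reserved1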
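
Because struct.Struct exposes a .size attribute, layout assumptions can be checked before any file is read. The sketch below does that for the structures in this section. The format strings are my reading of the public MINIDUMP_* layouts (a 64-bit Flags field, a 13-field version block in the module record), and check_layouts is a hypothetical helper name; verify both against the headers you target.

```python
import struct

# Format strings are assumptions based on the public MINIDUMP_* layouts;
# verify them against the headers you target before relying on them.
HEADER = struct.Struct('<4sIIIIIQ')          # Flags is a 64-bit field
DIRECTORY = struct.Struct('<III')            # StreamType, DataSize, Rva
LOCATION = struct.Struct('<II')              # DataSize, Rva
MODULE = struct.Struct('<QIIII13IIIIIQQ')    # fixed-width, no native padding

EXPECTED_SIZES = {
    'header': (HEADER, 32),
    'directory': (DIRECTORY, 12),
    'location': (LOCATION, 8),
    'module': (MODULE, 108),
}

def check_layouts():
    """Return the names of any layouts whose packed size is unexpected."""
    return [name for name, (layout, size) in EXPECTED_SIZES.items()
            if layout.size != size]
```

Calling check_layouts() at import time or in a unit test turns a silent off-by-four mistake into an immediate failure.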

4.4 Algorithm Overview

def parse_minidump(file_path):
    with open(file_path, 'rb') as f:
        # 1. Read and validate header
        header = parse_header(f)
        validate_signature(header)

        # 2. Seek to stream directory
        f.seek(header.stream_directory_rva)

        # 3. Parse all directory entries
        streams = {}
        for _ in range(header.number_of_streams):
            entry = parse_directory_entry(f)
            streams[entry.stream_type] = entry

        # 4. Parse each stream of interest
        result = MinidumpInfo()

        if SYSTEM_INFO in streams:
            result.system_info = parse_system_info(f, streams[SYSTEM_INFO])

        if MODULE_LIST in streams:
            result.modules = parse_module_list(f, streams[MODULE_LIST])

        if EXCEPTION in streams:
            result.exception = parse_exception(f, streams[EXCEPTION])

        return result
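
Steps 1 to 3 of the overview can be made concrete with a small runnable sketch. It assumes a 32-byte header whose Flags field is 64 bits wide; parse_directory, header_timestamp, and the namedtuple field names are illustrative choices, not part of any library.

```python
import struct
from collections import namedtuple
from datetime import datetime, timezone

Header = namedtuple('Header', 'signature version number_of_streams '
                              'stream_directory_rva checksum time_date_stamp flags')
DirectoryEntry = namedtuple('DirectoryEntry', 'stream_type data_size rva')

HEADER = struct.Struct('<4sIIIIIQ')   # 32 bytes; Flags assumed to be 64-bit
DIRECTORY = struct.Struct('<III')     # 12 bytes per entry

def parse_header(f):
    """Read and validate the header from a binary file-like object."""
    raw = f.read(HEADER.size)
    if len(raw) != HEADER.size:
        raise ValueError('file too short for a minidump header')
    header = Header._make(HEADER.unpack(raw))
    if header.signature != b'MDMP':
        raise ValueError(f'bad signature: {header.signature!r}')
    return header

def parse_directory(f, header):
    """Seek to the stream directory and read every entry."""
    f.seek(header.stream_directory_rva)
    entries = []
    for _ in range(header.number_of_streams):
        raw = f.read(DIRECTORY.size)
        if len(raw) != DIRECTORY.size:
            raise ValueError('stream directory is truncated')
        entries.append(DirectoryEntry._make(DIRECTORY.unpack(raw)))
    return entries

def header_timestamp(header):
    """Interpret TimeDateStamp as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(header.time_date_stamp, tz=timezone.utc)
```

Returning a list rather than a dict keyed by stream type keeps duplicate stream types visible; whether that matters depends on the dumps you expect to see.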

5. Implementation Guide

5.1 Development Environment Setup

# Create project directory
mkdir minidump_parser
cd minidump_parser

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# No external dependencies needed - using only stdlib
# (struct, mmap, io modules)

5.2 Project Structure

minidump_parser/
├── minidump/
│   ├── __init__.py
│   ├── header.py           # Header parsing
│   ├── directory.py        # Stream directory
│   ├── streams/
│   │   ├── __init__.py
│   │   ├── system_info.py  # SystemInfoStream
│   │   ├── modules.py      # ModuleListStream
│   │   ├── exception.py    # ExceptionStream
│   │   ├── threads.py      # ThreadListStream
│   │   └── memory.py       # MemoryListStream
│   └── parser.py           # Main parser
├── tests/
│   ├── test_header.py
│   ├── test_streams.py
│   └── fixtures/           # Sample minidumps
├── minidump_parser.py      # CLI entry point
└── README.md

5.3 The Core Question You’re Answering

“How do you extract crash information from a compact binary format designed for efficient crash reporting at scale?”

This requires:

  1. Understanding binary file structure and navigation
  2. Parsing variable-length records with offsets
  3. Handling platform-specific variations
  4. Converting binary data to human-readable output

5.4 Concepts You Must Understand First

  1. Binary file I/O in Python
    • struct module for unpacking binary data
    • File seeking and reading
  2. Little-endian byte ordering
    • Intel/AMD use little-endian
    • Format string < in struct
  3. RVA (Relative Virtual Address)
    • Offsets from start of file
    • How stream data is located
  4. Unicode string encoding
    • Module names are UTF-16LE in minidumps
  5. Platform-specific data
    • Linux vs Windows stream types
    • Architecture-specific register contexts
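
Since an RVA is just an offset from the start of the file, following one amounts to a bounds-checked slice. A minimal sketch, assuming the whole file is available as a bytes-like object; MinidumpFormatError, read_at, and read_u32 are hypothetical names:

```python
import struct

class MinidumpFormatError(ValueError):
    """Raised when an RVA/size pair points outside the file (hypothetical name)."""

def read_at(data, rva, size):
    """Return data[rva:rva+size], refusing ranges that leave the buffer."""
    if rva < 0 or size < 0 or rva + size > len(data):
        raise MinidumpFormatError(
            f'range {rva:#x}+{size} exceeds file size {len(data)}')
    return data[rva:rva + size]

def read_u32(data, rva):
    """Read one little-endian 32-bit unsigned integer at the given offset."""
    return struct.unpack('<I', read_at(data, rva, 4))[0]
```

Routing every access through one checked helper means a truncated or corrupt file produces a clear error instead of a short read that fails somewhere later.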

5.5 Questions to Guide Your Design

Architecture Questions:

  • How will you handle unknown stream types gracefully?
  • Should parsing be lazy (on-demand) or eager (all at once)?
  • How will you handle 32-bit vs 64-bit differences?
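
One possible answer to the first question is a registry that maps stream type IDs to parser functions and logs anything it does not recognize. This is only a sketch of that pattern; the decorator name, the logger name, and the placeholder parser body are all illustrative.

```python
import logging

log = logging.getLogger('minidump')

STREAM_PARSERS = {}  # stream type ID -> parser callable

def stream_parser(stream_type):
    """Decorator that registers a parser for one stream type ID."""
    def register(func):
        STREAM_PARSERS[stream_type] = func
        return func
    return register

@stream_parser(7)  # SystemInfoStream
def parse_system_info(data):
    # Placeholder body: a real parser would unpack the stream here.
    return {'kind': 'system_info', 'size': len(data)}

def parse_stream(stream_type, data):
    """Dispatch to a registered parser; unknown types are logged, not fatal."""
    parser = STREAM_PARSERS.get(stream_type)
    if parser is None:
        log.warning('skipping unknown stream type %#x', stream_type)
        return None
    return parser(data)
```

Supporting a new stream type then means adding one decorated function, with no change to the dispatch code.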

Implementation Questions:

  • Will you memory-map the file or use regular reads?
  • How will you represent parsed data (dataclasses? dicts?)?
  • How will you handle corrupt or truncated files?

Output Questions:

  • What output formats will you support (text, JSON, etc.)?
  • How verbose should the output be?
  • Should you support filtering (e.g., only show modules)?

5.6 Thinking Exercise

Before coding, manually parse this hex dump of a minidump header:

4D 44 4D 50  93 A7 00 00  08 00 00 00  20 00 00 00
00 00 00 00  A1 B2 C3 D4  02 00 00 00  00 00 00 00

Questions:

  1. What is the signature? (First 4 bytes)
  2. What is the version? (Bytes 4-7)
  3. How many streams are there? (Bytes 8-11)
  4. Where is the stream directory? (Bytes 12-15)
  5. What is the timestamp? (Bytes 20-23)

5.7 Hints in Layers

Hint 1 - Start Simple: Begin by just parsing and printing the header. Verify the signature is “MDMP”.

Hint 2 - Stream Types:

STREAM_TYPES = {
    0: "UnusedStream",
    3: "ThreadListStream",
    4: "ModuleListStream",
    5: "MemoryListStream",
    6: "ExceptionStream",
    7: "SystemInfoStream",
    # Linux-specific (Breakpad)
    0x47670003: "LinuxCpuInfo",
    0x47670004: "LinuxProcStatus",
    0x47670009: "LinuxMaps",
}

Hint 3 - String Handling: Module names are stored as MINIDUMP_STRING structures:

def read_minidump_string(f, rva):
    f.seek(rva)
    length = struct.unpack('<I', f.read(4))[0]  # Length in bytes
    data = f.read(length)
    return data.decode('utf-16-le')

Hint 4 - Getting Test Files: Generate a minidump from a crashing program that uses a Breakpad or Crashpad client. Note that Breakpad’s dump_syms tool produces symbol files, not minidumps:

# Or use a sample from Sentry's test fixtures
wget https://github.com/getsentry/symbolic/raw/master/symbolic-testutils/fixtures/linux.dmp

5.8 The Interview Questions They’ll Ask

  1. “How would you handle a minidump that’s 10GB?”
    • Expected: Use memory mapping or streaming, don’t load entire file
  2. “What’s the difference between RVA and VA?”
    • Expected: RVA is file offset, VA is runtime memory address
  3. “How do you symbolicate a stack trace from a minidump?”
    • Expected: Need symbol files, use stack pointer from threads, walk frames
  4. “How would you add support for a new stream type?”
    • Expected: Describe modular design, adding new parser while handling unknown gracefully
  5. “How does Breakpad capture a minidump without crashing the handler?”
    • Expected: Uses out-of-process handler, signal-safe operations
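
For the first question, a memory-mapped read might look like the sketch below. It relies only on the standard mmap and struct modules; count_streams is a hypothetical helper that reads just the first three header fields.

```python
import mmap
import struct

def count_streams(path):
    """Read NumberOfStreams from a minidump without loading the whole file."""
    with open(path, 'rb') as f:
        # Length 0 maps the entire file; pages are loaded only when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            signature, _version, number_of_streams = struct.unpack_from(
                '<4sII', view, 0)
            if signature != b'MDMP':
                raise ValueError('not a minidump')
            return number_of_streams
```

Because struct.unpack_from accepts any buffer, the same parsing code can run against a bytes object in tests and a mapped file in production.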

5.9 Books That Will Help

Topic            Book                                      Chapter(s)
Binary Parsing   “Practical Binary Analysis” - Andriesse   Ch. 1-2
Python struct    Python documentation                      struct module
File Formats     “File Format Design” - various            General principles
Crash Reporting  Breakpad/Crashpad docs                    Getting Started guides

5.10 Implementation Phases

Phase 1: Header Parsing (Days 1-2)

  • Parse MINIDUMP_HEADER
  • Validate signature and version
  • Print basic file info

Phase 2: Stream Directory (Days 3-4)

  • Parse stream directory
  • Enumerate all streams
  • Map stream types to names

Phase 3: SystemInfoStream (Days 5-6)

  • Parse system information
  • Extract OS and CPU details
  • Handle different CPU architectures
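
As a starting point for this phase, the sketch below unpacks the fixed 32-byte prefix of the 56-byte system info record. The field order is my reading of the MINIDUMP_SYSTEM_INFO layout, and the lookup tables are deliberately partial; treat both as assumptions to confirm against the format headers.

```python
import struct

# Fixed-size prefix of MINIDUMP_SYSTEM_INFO (first 32 of 56 bytes), assumed layout:
# arch, level, revision, cpu count, product type, major, minor, build,
# platform id, CSD version RVA, suite mask, reserved.
SYSTEM_INFO_PREFIX = struct.Struct('<HHHBBIIIIIHH')

# Partial lookup tables; values outside them are reported numerically.
CPU_ARCHITECTURES = {0: 'x86', 5: 'ARM', 9: 'AMD64', 12: 'ARM64'}
PLATFORM_IDS = {2: 'Windows NT', 0x8101: 'macOS', 0x8201: 'Linux',
                0x8203: 'Android'}

def parse_system_info(data):
    """Decode OS and CPU basics from the start of a SystemInfoStream."""
    (arch, _level, _revision, cpu_count, _product_type,
     major, minor, build, platform_id,
     _csd_version_rva, _suite_mask, _reserved) = SYSTEM_INFO_PREFIX.unpack_from(data, 0)
    return {
        'cpu': CPU_ARCHITECTURES.get(arch, f'unknown ({arch:#x})'),
        'cpu_count': cpu_count,
        'os': PLATFORM_IDS.get(platform_id, f'unknown ({platform_id:#x})'),
        'os_version': f'{major}.{minor}.{build}',
    }
```

Falling back to a numeric description keeps the report useful when a dump comes from a platform the tables do not cover yet.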

Phase 4: ModuleListStream (Days 7-9)

  • Parse module list
  • Extract module names (handle UTF-16)
  • Display base addresses and sizes
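
A sketch of this phase: the module list is assumed to be a 32-bit count followed by fixed 108-byte records, each pointing to its name through an RVA. The record layout and the helper names are assumptions for illustration.

```python
import struct

MODULE = struct.Struct('<QIIII13IIIIIQQ')  # 108 bytes (assumed layout)

def read_string(data, rva):
    """Decode a MINIDUMP_STRING: 32-bit byte length, then UTF-16LE text."""
    (length,) = struct.unpack_from('<I', data, rva)
    return data[rva + 4:rva + 4 + length].decode('utf-16-le')

def parse_module_list(data, rva):
    """Yield (base, size, name) for each module in the stream at `rva`."""
    (count,) = struct.unpack_from('<I', data, rva)
    for index in range(count):
        fields = MODULE.unpack_from(data, rva + 4 + index * MODULE.size)
        # Fields 0, 1 and 4 are BaseOfImage, SizeOfImage and ModuleNameRva.
        base, size, name_rva = fields[0], fields[1], fields[4]
        yield base, size, read_string(data, name_rva)
```

The generator reads names on demand, so listing the first few modules of a large dump does not require decoding all of them.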

Phase 5: ExceptionStream (Days 10-12)

  • Parse exception information
  • Map exception codes to signal names
  • Extract crash address
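
For this phase, the sketch below reads the thread ID, exception code, and crash address from the start of the exception stream, then maps the code to a signal name. It assumes a Breakpad-generated Linux dump, where the exception code holds the signal number; the signal table is partial and uses x86/ARM Linux numbering.

```python
import struct

# ThreadId, alignment, ExceptionCode, ExceptionFlags,
# ExceptionRecord, ExceptionAddress (assumed layout of the first 32 bytes).
EXCEPTION_PREFIX = struct.Struct('<IIIIQQ')

# Linux signal numbers on x86 and ARM; extend as needed.
LINUX_SIGNALS = {4: 'SIGILL', 5: 'SIGTRAP', 6: 'SIGABRT', 7: 'SIGBUS',
                 8: 'SIGFPE', 11: 'SIGSEGV'}

def parse_exception(data):
    """Extract the crashing thread, exception code and faulting address."""
    thread_id, _align, code, _flags, _record, address = \
        EXCEPTION_PREFIX.unpack_from(data, 0)
    return {'thread_id': thread_id, 'code': code, 'address': address}

def describe_linux_exception(code):
    """Render a signal number the way the example report does."""
    name = LINUX_SIGNALS.get(code)
    return f'{name} (Signal {code})' if name else f'unknown (code {code:#x})'
```

Windows dumps use NTSTATUS-style codes instead, so the lookup should be chosen based on the platform reported by the SystemInfoStream.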

Phase 6: Polish & Testing (Days 13-15)

  • Add command-line interface
  • Test with various minidumps
  • Handle edge cases and errors

5.11 Key Implementation Decisions

  1. Use dataclasses for parsed structures (Python 3.7+)
  2. Lazy parsing - only parse streams when accessed
  3. Memory mapping for large files (mmap module)
  4. Error handling - wrap all parsing in try/except, never crash on bad input
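
To illustrate the first decision, here is a minimal sketch of dataclasses feeding both a JSON and a text report. The class and field names are hypothetical and cover only a fraction of what a full report would hold.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class Module:
    name: str
    base: int
    size: int

@dataclass
class CrashReport:
    os: Optional[str] = None
    exception: Optional[str] = None
    modules: List[Module] = field(default_factory=list)

def to_json(report):
    """Serialize the report; asdict() recurses into nested dataclasses."""
    return json.dumps(asdict(report), indent=2)

def to_text(report):
    """Render a short human-readable summary."""
    lines = [f'OS:        {report.os or "unknown"}',
             f'Exception: {report.exception or "unknown"}']
    lines += [f'  {m.base:#018x}  {m.size:>10,}  {m.name}'
              for m in report.modules]
    return '\n'.join(lines)
```

Optional fields with None defaults let a report be produced even when a stream is missing from the dump.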

6. Testing Strategy

Unit Tests

import io
import struct

def test_header_parsing():
    # Create a minimal valid 32-byte header
    header_bytes = b'MDMP' + struct.pack('<IIIIIQ',
        0xa793,  # version
        1,       # number of streams
        32,      # stream directory offset
        0,       # checksum
        0x12345678,  # timestamp
        0        # flags (64-bit)
    )
    assert len(header_bytes) == 32
    header = parse_header(io.BytesIO(header_bytes))
    assert header.signature == b'MDMP'
    assert header.number_of_streams == 1

Integration Tests

  • Parse real minidumps from Sentry’s test fixtures
  • Compare output with minidump_dump tool (from Breakpad)
  • Verify module list matches minidump_stackwalk output

Verification Checklist

  • Correctly identifies minidump signature
  • Parses all standard stream types
  • Handles Linux-specific streams
  • Correctly decodes UTF-16 module names
  • Handles 32-bit and 64-bit minidumps

7. Common Pitfalls & Debugging

Pitfall 1: Byte Order Confusion

Problem: Numbers parse as garbage

Cause: Minidumps are little-endian; without the < prefix, struct falls back to the host's native byte order, sizes, and alignment padding

Solution:

# Wrong
struct.unpack('I', data)

# Right
struct.unpack('<I', data)

Pitfall 2: String Encoding

Problem: Module names are garbled

Cause: Minidump strings are UTF-16-LE, not UTF-8

Solution:

# Wrong
name = data.decode('utf-8')

# Right
name = data.decode('utf-16-le').rstrip('\x00')

Pitfall 3: Off-by-One in Stream Sizes

Problem: Parsing reads past stream boundary

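Cause: Wrong assumptions about DataSize, for example forgetting that it covers the stream's leading count field, or not accounting for padding between records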

Solution: Always verify position after parsing:

expected_end = stream.rva + stream.size
# ... parse ...
actual_end = f.tell()
assert actual_end <= expected_end, f"Read past stream end: {actual_end} > {expected_end}"

Pitfall 4: Architecture Mismatch

Problem: 64-bit minidump parses incorrectly

Cause: Architecture-dependent data, mainly the thread context (register) records, differs in size and layout between 32-bit and 64-bit targets

Solution: Check SystemInfoStream.ProcessorArchitecture first, then use appropriate struct sizes


8. Extensions & Challenges

Extension 1: Stack Walking

Implement basic stack unwinding:

  • Use thread context registers (RSP/RBP)
  • Walk stack frames using frame pointers
  • Without symbols, show raw addresses
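
The frame-pointer walk can be sketched as follows for x86-64, assuming the crashed code was compiled with frame pointers, so that each frame stores the caller's RBP at [RBP] and the return address at [RBP+8]. The read_memory callback is hypothetical and stands in for a lookup into the dump's captured stack memory.

```python
import struct

def walk_frame_pointers(read_memory, rip, rbp, max_frames=64):
    """Return a list of instruction addresses, innermost frame first.

    `read_memory(address, size)` is a hypothetical callback that returns
    `size` bytes captured in the dump, or None if the range is absent.
    """
    frames = [rip]
    while rbp and len(frames) < max_frames:
        raw = read_memory(rbp, 16)
        if raw is None or len(raw) < 16:
            break  # ran off the captured stack memory
        saved_rbp, return_address = struct.unpack('<QQ', raw)
        if return_address == 0:
            break
        frames.append(return_address)
        if saved_rbp <= rbp:
            break  # caller frames must sit at higher addresses
        rbp = saved_rbp
    return frames
```

The monotonicity check and the frame limit guard against corrupted stacks that would otherwise loop forever. Code built without frame pointers needs unwind information instead, which is what the symbol files in the next extension provide.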

Extension 2: Symbol Integration

Add symbolication support:

  • Parse Breakpad .sym files
  • Map addresses to function names
  • Generate readable stack traces
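
A first step toward symbolication is reading FUNC records from a Breakpad .sym file and looking up module-relative offsets. The sketch assumes the documented record shape, FUNC [m] address size parameter_size name with hexadecimal numbers, and ignores line, PUBLIC, and STACK records; the function names are illustrative.

```python
import bisect

def load_functions(sym_text):
    """Collect (address, size, name) from FUNC records, sorted by address."""
    functions = []
    for line in sym_text.splitlines():
        parts = line.split()
        if not parts or parts[0] != 'FUNC':
            continue
        if len(parts) > 1 and parts[1] == 'm':   # optional "multiple" marker
            parts.pop(1)
        if len(parts) < 5:
            continue                             # malformed record, skip it
        # FUNC <address> <size> <parameter_size> <name...>, numbers in hex
        address, size = int(parts[1], 16), int(parts[2], 16)
        functions.append((address, size, ' '.join(parts[4:])))
    functions.sort()
    return functions

def lookup(functions, offset):
    """Return the name of the function covering a module-relative offset."""
    starts = [address for address, _size, _name in functions]
    index = bisect.bisect_right(starts, offset) - 1
    if index >= 0:
        address, size, name = functions[index]
        if offset < address + size:
            return name
    return None
```

To symbolicate a frame, subtract the owning module's base address from the instruction address and pass the result to lookup. Rebuilding the starts list on every call keeps the sketch short; a real tool would cache it.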

Extension 3: Memory Analysis

Parse MemoryListStream:

  • Extract memory regions
  • Search for patterns in memory
  • Display heap/stack contents

Extension 4: Crash Clustering

Compare multiple minidumps:

  • Generate crash signatures
  • Group similar crashes
  • Identify most common crash patterns
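
One simple way to generate a crash signature is to hash the top few stack frames, expressed as module name plus module-relative offset so that address randomization does not change the result. This is a sketch of that idea, not how any particular crash service does it; the depth of three frames is an arbitrary choice.

```python
import hashlib
from collections import Counter

def crash_signature(frames, depth=3):
    """Hash the top `depth` frames; each frame is (module_name, offset)."""
    top = '|'.join(f'{module}+{offset:#x}' for module, offset in frames[:depth])
    return hashlib.sha1(top.encode('utf-8')).hexdigest()[:12]

def cluster(crashes):
    """Count crashes per signature, most common first."""
    return Counter(crash_signature(frames) for frames in crashes).most_common()
```

Using function names instead of raw offsets, once symbols are available, makes signatures stable across builds of the same code.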

9. Real-World Connections

Chrome Crash Reporting

Chrome uses Crashpad to:

  1. Catch crashes in a separate process
  2. Generate minidumps with relevant streams
  3. Upload to Google’s crash servers
  4. Symbolicate and deduplicate

Sentry Integration

Sentry processes minidumps:

  1. Receive uploaded minidump
  2. Parse using symbolic-minidump
  3. Fetch symbols from symbol servers
  4. Generate readable stack traces
  5. Group by crash signature

Mozilla Socorro

Firefox’s crash reporting:

  1. Breakpad client captures crash
  2. Uploads to Socorro server
  3. Processes with minidump-analyzer
  4. Displays in crash-stats dashboard

10. Resources

Official Documentation

Source Code References

Sample Minidumps


11. Self-Assessment Checklist

Before You Start

  • Comfortable with Python’s struct module
  • Understand binary file I/O
  • Know what little-endian means
  • Have access to sample minidump files

After Completion

  • Can explain minidump format structure
  • Can parse headers, directories, and streams
  • Can extract system info, modules, and crash details
  • Can handle different platforms and architectures
  • Understand the Breakpad/Crashpad ecosystem
  • Can extend parser to handle new stream types

12. Submission / Completion Criteria

Your project is complete when you can:

  1. Parse Sample Minidumps
    • Successfully parse Linux minidumps from Breakpad
    • Successfully parse at least one Windows minidump
    • Handle corrupt/truncated files gracefully
  2. Generate Useful Output
    • Show system information
    • List all loaded modules
    • Display crash/exception information
    • Support at least text and JSON output
  3. Demonstrate Understanding
    • Explain the minidump structure
    • Describe how stream navigation works
    • Answer interview questions from section 5.8
  4. Handle Edge Cases
    • Unknown stream types (log, don’t crash)
    • Missing streams (handle gracefully)
    • Large files (don’t run out of memory)

Next: Project 8: Introduction to Kernel Panics - Enter the kernel crash domain