Project 7: The Minidump Parser
Build a parser for the minidump format used by Google Breakpad, the industry standard for cross-platform crash reporting.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2-3 weeks |
| Language | Python |
| Prerequisites | Project 1, strong binary parsing skills |
| Key Topics | Minidump format, Breakpad/Crashpad, binary stream parsing |
1. Learning Objectives
By completing this project, you will:
- Understand the minidump file format used by major browsers and applications
- Learn to parse complex binary structures with nested stream directories
- Implement a practical tool that extracts crash information from minidumps
- Understand why minidumps are preferred over full core dumps in production
- Gain experience with Python’s struct module for binary parsing
- Learn the Breakpad/Crashpad ecosystem used by Google, Mozilla, and Microsoft
2. Theoretical Foundation
2.1 Core Concepts
What is a Minidump?
A minidump is a smaller, more portable crash dump format originally developed by Microsoft for Windows. Google’s Breakpad project extended it for cross-platform use, making it the standard for:
- Chrome, Firefox, Edge browsers
- Electron applications
- Sentry, Crashlytics, and other crash reporting services
┌─────────────────────────────────────────────────────────────┐
│ FULL CORE DUMP vs MINIDUMP │
├─────────────────────────────────────────────────────────────┤
│ │
│ Full Core Dump Minidump │
│ ────────────────────── ────────────────────── │
│ Size: 500MB - 10GB Size: 50KB - 5MB │
│ Contains ALL memory Contains crash essentials │
│ Platform-specific (ELF) Cross-platform │
│ Local analysis only Upload-friendly │
│ Complete but heavy Optimized for triage │
│ │
└─────────────────────────────────────────────────────────────┘
Minidump Structure
┌────────────────────────────────────────────────────────────────┐
│ MINIDUMP FILE STRUCTURE │
├────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ HEADER (32 bytes) │ │
│ │ ───────────────── │ │
│ │ Signature: "MDMP" (0x504D444D) │ │
│ │ Version: 0xa793 │ │
│ │ NumberOfStreams: N │ │
│ │ StreamDirectoryRva: offset to directory │ │
│ │ Checksum, TimeDateStamp, Flags │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STREAM DIRECTORY (N × 12 bytes) │ │
│ │ ───────────────── │ │
│ │ Entry 1: [StreamType, DataSize, Rva] │ │
│ │ Entry 2: [StreamType, DataSize, Rva] │ │
│ │ Entry 3: [StreamType, DataSize, Rva] │ │
│ │ ... │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ STREAM DATA (variable size) │ │
│ │ ───────────────── │ │
│ │ │ │
│ │ ThreadListStream: │ │
│ │ Thread 1: [id, stack, context] │ │
│ │ Thread 2: [id, stack, context] │ │
│ │ │ │
│ │ ModuleListStream: │ │
│ │ Module 1: [name, base_addr, size, version] │ │
│ │ Module 2: [name, base_addr, size, version] │ │
│ │ │ │
│ │ SystemInfoStream: │ │
│ │ [OS, CPU arch, version, etc.] │ │
│ │ │ │
│ │ ExceptionStream: │ │
│ │ [Exception code, address, thread] │ │
│ │ │ │
│ │ MemoryListStream: │ │
│ │ [Selected memory regions] │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────┘
Stream Types
Common stream types you’ll encounter:
| Stream Type | ID | Contents |
|---|---|---|
| ThreadListStream | 3 | All threads with stack and registers |
| ModuleListStream | 4 | Loaded modules/libraries |
| MemoryListStream | 5 | Selected memory regions |
| ExceptionStream | 6 | Exception/crash information |
| SystemInfoStream | 7 | OS and CPU information |
| ThreadExListStream | 8 | Extended thread info |
| Memory64ListStream | 9 | 64-bit memory regions |
| LinuxCpuInfo | 0x47670003 | Linux /proc/cpuinfo |
| LinuxProcStatus | 0x47670004 | Linux /proc/PID/status |
| LinuxMaps | 0x47670009 | Linux /proc/PID/maps |
Breakpad/Crashpad Ecosystem
┌───────────────────────────────────────────────────────────────┐
│ CRASH REPORTING PIPELINE │
├───────────────────────────────────────────────────────────────┤
│ │
│ Application │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Crashpad Client │ Catches crash, writes minidump │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Minidump File │ .dmp file with crash data │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Upload Handler │ Sends to crash reporting server │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Symbol Server │ Provides debug symbols │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ minidump_ │ Processes dump + symbols = stack trace │
│ │ stackwalk │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Bug Report │ Human-readable crash report │
│ └─────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────┘
2.2 Why This Matters
Understanding minidumps is essential because:
- Industry Standard: Chrome, Firefox, Slack, Discord, VS Code all use minidumps
- Cloud-Friendly: Small size enables crash upload at scale
- Cross-Platform: Same format works on Windows, macOS, Linux
- Privacy-Conscious: Contains only crash-relevant data, not full memory
2.3 Historical Context
- 1999: Microsoft introduces minidump format in Windows 2000
- 2006: Google creates Breakpad for cross-platform crash reporting
- 2015: Google develops Crashpad as Breakpad’s successor
- Today: Minidump is the de facto standard for crash reporting
2.4 Common Misconceptions
Misconception 1: “Minidumps are just small core dumps”
- Reality: They’re a different format with structured streams, not just truncated memory
Misconception 2: “You can’t get a full backtrace from a minidump”
- Reality: With proper symbols, you get complete stack traces
Misconception 3: “Minidumps are Windows-only”
- Reality: Breakpad/Crashpad work on all major platforms
3. Project Specification
3.1 What You Will Build
A Python tool that parses minidump files and extracts:
- System information (OS, architecture)
- Crash information (signal, address)
- Thread list with basic stack information
- Loaded module list with base addresses
3.2 Functional Requirements
- Parse minidump header
  - Validate “MDMP” signature
  - Extract stream count and directory offset
- Navigate stream directory
  - Enumerate all streams
  - Identify stream types by ID
- Parse SystemInfoStream
  - Extract OS type and version
  - Extract CPU architecture
- Parse ModuleListStream
  - List all loaded modules
  - Extract module name, base address, size
- Parse ExceptionStream
  - Extract exception code
  - Identify crashing thread and address
- Generate summary report
  - Output human-readable crash summary
3.3 Non-Functional Requirements
- Handle both 32-bit and 64-bit minidumps
- Support Linux-specific streams (from Breakpad)
- Graceful handling of missing/unknown streams
- Clear error messages for corrupted files
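One way to approach the "clear error messages for corrupted files" requirement is to route every read through a single helper that checks how many bytes actually came back. The sketch below is only an illustration: `MinidumpFormatError` and `read_exact` are hypothetical names, not part of any library.

```python
import io


class MinidumpFormatError(Exception):
    """Raised when the input is truncated or malformed (hypothetical name)."""


def read_exact(f, offset, size, what="data"):
    """Seek to `offset` and return exactly `size` bytes, or raise a descriptive error."""
    f.seek(offset)
    data = f.read(size)
    if len(data) != size:
        raise MinidumpFormatError(
            f"truncated {what}: wanted {size} bytes at offset {offset:#x}, "
            f"got {len(data)}"
        )
    return data


if __name__ == "__main__":
    # Only 14 bytes are available, but a minidump header needs 32.
    buf = io.BytesIO(b"MDMP" + b"\x00" * 10)
    try:
        read_exact(buf, 0, 32, what="header")
    except MinidumpFormatError as exc:
        print(f"error: {exc}")
```

Catching one exception type at the CLI boundary keeps the parser code free of ad hoc length checks.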
3.4 Example Usage / Output
$ python3 minidump_parser.py crash_report.dmp
╔══════════════════════════════════════════════════════════════╗
║ MINIDUMP ANALYSIS ║
╠══════════════════════════════════════════════════════════════╣
║ File: crash_report.dmp ║
║ Size: 245,760 bytes ║
║ Timestamp: 2025-12-20 14:30:45 UTC ║
╚══════════════════════════════════════════════════════════════╝
═══════════════════════════════════════════════════════════════
SYSTEM INFO
═══════════════════════════════════════════════════════════════
OS: Linux (6.1.0-13-amd64)
CPU: AMD64 (x86_64)
CPU Count: 8
CPU Vendor: GenuineIntel
═══════════════════════════════════════════════════════════════
CRASH INFO
═══════════════════════════════════════════════════════════════
Exception: SIGSEGV (Signal 11)
Address: 0x0000000000000000
Thread ID: 12345
═══════════════════════════════════════════════════════════════
LOADED MODULES (15)
═══════════════════════════════════════════════════════════════
Base Address Size Name
─────────────────────────────────────────────────────────────
0x000055a8b7c00000 102,400 /usr/bin/my_application
0x00007f8a12400000 1,920,000 /lib/x86_64-linux-gnu/libc.so.6
0x00007f8a12800000 151,552 /lib/x86_64-linux-gnu/libpthread.so.0
0x00007f8a12c00000 167,936 /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
...
═══════════════════════════════════════════════════════════════
STREAM SUMMARY
═══════════════════════════════════════════════════════════════
Stream Offset Size
─────────────────────────────────────────────────────────────
ThreadListStream 0x00001000 2,048
ModuleListStream 0x00001800 4,096
ExceptionStream 0x00002800 168
SystemInfoStream 0x00002900 56
MemoryListStream 0x00002940 32,768
LinuxMaps 0x0000a940 8,192
...
3.5 Real World Outcome
After completing this project, you’ll be able to:
- Understand crash reports from Sentry, Crashlytics, or Breakpad
- Build custom crash analysis tools for your organization
- Contribute to open-source crash reporting projects
- Debug issues with crash uploads and processing pipelines
4. Solution Architecture
4.1 High-Level Design
┌─────────────────────────────────────────────────────────────────┐
│ MINIDUMP PARSER │
└─────────────────────────────────────────────────────────────────┘
│
┌────────────────────┴────────────────────┐
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ File Reader │ │ Report Generator │
│ ──────────────── │ │ ──────────────── │
│ Binary file I/O │ │ Format output │
│ Seek & read │ │ Text/JSON/HTML │
└──────────┬──────────┘ └──────────▲──────────┘
│ │
▼ │
┌─────────────────────────────────────────────────────────────────┐
│ PARSER CORE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Header │──▶│ Directory │──▶│ Streams │ │
│ │ Parser │ │ Parser │ │ Parser │ │
│ └─────────────┘ └─────────────┘ └──────┬──────┘ │
│ │ │
│ ┌───────────────────────────────────────────┐│
│ │ ││
│ ▼ ▼ ▼ ▼ ││
│ ┌──────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ││
│ │SystemInfo│ │ Module │ │Exception│ │ Thread │ ││
│ │ Parser │ │ Parser │ │ Parser │ │ Parser │ ││
│ └──────────┘ └────────┘ └────────┘ └────────┘ ││
│ │
└─────────────────────────────────────────────────────────────────┘
4.2 Key Components
- MinidumpHeader: Parses the 32-byte file header
- StreamDirectory: Parses and indexes all streams
- SystemInfoStream: Extracts OS and CPU information
- ModuleListStream: Lists loaded modules
- ExceptionStream: Crash details
- ThreadListStream: Thread information
- ReportGenerator: Formats output
4.3 Data Structures
import struct
# Header structure (32 bytes)
MINIDUMP_HEADER = struct.Struct('<4sIIIIIQ')
# Signature, Version, NumberOfStreams, StreamDirectoryRva,
# Checksum, TimeDateStamp, Flags (Flags is a 64-bit field)
# Directory entry (12 bytes)
MINIDUMP_DIRECTORY = struct.Struct('<III')
# StreamType, DataSize, Rva
# Location descriptor (8 bytes)
MINIDUMP_LOCATION_DESCRIPTOR = struct.Struct('<II')
# DataSize, Rva
# Module structure (108 bytes, same layout in 32-bit and 64-bit dumps)
MINIDUMP_MODULE = struct.Struct('<QIIII13I4I2Q')
# BaseOfImage, SizeOfImage, CheckSum, TimeDateStamp, ModuleNameRva,
# VersionInfo (13 x uint32), CvRecord and MiscRecord location descriptors,
# Reserved0, Reserved1
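A quick sanity check for any struct layout is to compare its computed size with the size the format documents. The sketch below assumes the field widths from the Windows SDK declarations of MINIDUMP_HEADER and MINIDUMP_MODULE, in particular a 64-bit Flags field in the header; the short constant names are chosen here for illustration.

```python
import struct

# '<' selects little-endian with standard sizes and no implicit padding.
HEADER = struct.Struct("<4sIIIIIQ")         # Flags is assumed to be 64-bit
DIRECTORY = struct.Struct("<III")           # StreamType, DataSize, Rva
LOCATION_DESCRIPTOR = struct.Struct("<II")  # DataSize, Rva
# BaseOfImage (Q); SizeOfImage, CheckSum, TimeDateStamp, ModuleNameRva (4 x I);
# VS_FIXEDFILEINFO (13 x I); CvRecord and MiscRecord descriptors (4 x I);
# Reserved0 and Reserved1 (2 x Q).
MODULE = struct.Struct("<QIIII13I4I2Q")

for name, layout in [
    ("header", HEADER),
    ("directory", DIRECTORY),
    ("location", LOCATION_DESCRIPTOR),
    ("module", MODULE),
]:
    print(f"{name:10s} {layout.size:4d} bytes")
```

If a printed size disagrees with the documented one, the format string is wrong before any real file is involved.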
4.4 Algorithm Overview
def parse_minidump(file_path):
    with open(file_path, 'rb') as f:
        # 1. Read and validate header
        header = parse_header(f)
        validate_signature(header)
        # 2. Seek to stream directory
        f.seek(header.stream_directory_rva)
        # 3. Parse all directory entries
        streams = {}
        for _ in range(header.number_of_streams):
            entry = parse_directory_entry(f)
            streams[entry.stream_type] = entry
        # 4. Parse each stream of interest
        result = MinidumpInfo()
        if SYSTEM_INFO in streams:
            result.system_info = parse_system_info(f, streams[SYSTEM_INFO])
        if MODULE_LIST in streams:
            result.modules = parse_module_list(f, streams[MODULE_LIST])
        if EXCEPTION in streams:
            result.exception = parse_exception(f, streams[EXCEPTION])
        return result
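The overview above leaves `parse_header`, `validate_signature`, and `parse_directory_entry` undefined. A minimal sketch of those three helpers might look like the following; the NamedTuple classes and the `build_sample` test helper are assumptions made for this example, and the header layout assumes a 64-bit Flags field.

```python
import io
import struct
from typing import NamedTuple

HEADER = struct.Struct("<4sIIIIIQ")
DIRECTORY = struct.Struct("<III")


class Header(NamedTuple):
    signature: bytes
    version: int
    number_of_streams: int
    stream_directory_rva: int
    checksum: int
    time_date_stamp: int
    flags: int


class DirectoryEntry(NamedTuple):
    stream_type: int
    data_size: int
    rva: int


def parse_header(f):
    raw = f.read(HEADER.size)
    if len(raw) != HEADER.size:
        raise ValueError("file too short for a minidump header")
    return Header(*HEADER.unpack(raw))


def validate_signature(header):
    if header.signature != b"MDMP":
        raise ValueError(f"bad signature {header.signature!r}, expected b'MDMP'")


def parse_directory_entry(f):
    raw = f.read(DIRECTORY.size)
    if len(raw) != DIRECTORY.size:
        raise ValueError("truncated stream directory")
    return DirectoryEntry(*DIRECTORY.unpack(raw))


def build_sample():
    """Build a synthetic dump (header plus a two-entry directory) for testing."""
    header = HEADER.pack(b"MDMP", 0xA793, 2, HEADER.size, 0, 0, 0)
    directory = DIRECTORY.pack(7, 56, 0x100) + DIRECTORY.pack(4, 112, 0x200)
    return header + directory


if __name__ == "__main__":
    f = io.BytesIO(build_sample())
    header = parse_header(f)
    validate_signature(header)
    f.seek(header.stream_directory_rva)
    for _ in range(header.number_of_streams):
        print(parse_directory_entry(f))
```

Because the helpers accept any file-like object, the same code runs against `io.BytesIO` in tests and a real file in the CLI.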
5. Implementation Guide
5.1 Development Environment Setup
# Create project directory
mkdir minidump_parser
cd minidump_parser
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# No external dependencies needed - using only stdlib
# (struct, mmap, io modules)
5.2 Project Structure
minidump_parser/
├── minidump/
│ ├── __init__.py
│ ├── header.py # Header parsing
│ ├── directory.py # Stream directory
│ ├── streams/
│ │ ├── __init__.py
│ │ ├── system_info.py # SystemInfoStream
│ │ ├── modules.py # ModuleListStream
│ │ ├── exception.py # ExceptionStream
│ │ ├── threads.py # ThreadListStream
│ │ └── memory.py # MemoryListStream
│ └── parser.py # Main parser
├── tests/
│ ├── test_header.py
│ ├── test_streams.py
│ └── fixtures/ # Sample minidumps
├── minidump_parser.py # CLI entry point
└── README.md
5.3 The Core Question You’re Answering
“How do you extract crash information from a compact binary format designed for efficient crash reporting at scale?”
This requires:
- Understanding binary file structure and navigation
- Parsing variable-length records with offsets
- Handling platform-specific variations
- Converting binary data to human-readable output
5.4 Concepts You Must Understand First
- Binary file I/O in Python
  - struct module for unpacking binary data
  - File seeking and reading
- Little-endian byte ordering
  - Intel/AMD use little-endian
  - Format string prefix < in struct
- RVA (Relative Virtual Address)
  - Offsets from start of file
  - How stream data is located
- Unicode string encoding
  - Module names are UTF-16LE in minidumps
- Platform-specific data
  - Linux vs Windows stream types
  - Architecture-specific register contexts
5.5 Questions to Guide Your Design
Architecture Questions:
- How will you handle unknown stream types gracefully?
- Should parsing be lazy (on-demand) or eager (all at once)?
- How will you handle 32-bit vs 64-bit differences?
Implementation Questions:
- Will you memory-map the file or use regular reads?
- How will you represent parsed data (dataclasses? dicts?)?
- How will you handle corrupt or truncated files?
Output Questions:
- What output formats will you support (text, JSON, etc.)?
- How verbose should the output be?
- Should you support filtering (e.g., only show modules)?
5.6 Thinking Exercise
Before coding, manually parse this hex dump of a minidump header:
4D 44 4D 50 93 A7 00 00 08 00 00 00 20 00 00 00
00 00 00 00 A1 B2 C3 D4 02 00 00 00 00 00 00 00
Questions:
- What is the signature? (First 4 bytes)
- What is the version? (Bytes 4-7)
- How many streams are there? (Bytes 8-11)
- Where is the stream directory? (Bytes 12-15)
- What is the timestamp? (Bytes 20-23)
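Once you have worked out the answers by hand, a few lines of Python can check them. This sketch decodes only the first 24 bytes of the dump above (every field through the timestamp), so it does not depend on how the trailing flags field is laid out; the variable names are chosen for this example.

```python
import struct

# First 24 bytes of the hex dump: Signature through TimeDateStamp.
raw = bytes.fromhex(
    "4D 44 4D 50 93 A7 00 00 08 00 00 00 20 00 00 00 "
    "00 00 00 00 A1 B2 C3 D4"
)

signature, version, n_streams, dir_rva, checksum, timestamp = struct.unpack(
    "<4sIIIII", raw
)

print("signature:", signature)
print("version:  ", hex(version))
print("streams:  ", n_streams)
print("directory:", hex(dir_rva))
print("timestamp:", hex(timestamp))
```

Run it only after attempting the exercise, then compare each printed value against your manual decoding.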
5.7 Hints in Layers
Hint 1 - Start Simple: Begin by just parsing and printing the header. Verify the signature is “MDMP”.
Hint 2 - Stream Types:
STREAM_TYPES = {
    0: "UnusedStream",
    3: "ThreadListStream",
    4: "ModuleListStream",
    5: "MemoryListStream",
    6: "ExceptionStream",
    7: "SystemInfoStream",
    # Linux-specific (Breakpad)
    0x47670003: "LinuxCpuInfo",
    0x47670004: "LinuxProcStatus",
    0x47670009: "LinuxMaps",
}
Hint 3 - String Handling: Module names are stored as MINIDUMP_STRING structures:
def read_minidump_string(f, rva):
    f.seek(rva)
    length = struct.unpack('<I', f.read(4))[0]  # Length in bytes
    data = f.read(length)
    return data.decode('utf-16-le')
Hint 4 - Getting Test Files:
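Generate a minidump by crashing a program that has a Breakpad or Crashpad client installed (Breakpad’s dump_syms tool produces symbol files, not minidumps), or download an existing sample: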
# Or use a sample from Sentry's test fixtures
wget https://github.com/getsentry/symbolic/raw/master/symbolic-testutils/fixtures/linux.dmp
5.8 The Interview Questions They’ll Ask
- “How would you handle a minidump that’s 10GB?”
- Expected: Use memory mapping or streaming, don’t load entire file
- “What’s the difference between RVA and VA?”
- Expected: RVA is file offset, VA is runtime memory address
- “How do you symbolicate a stack trace from a minidump?”
- Expected: Need symbol files, use stack pointer from threads, walk frames
- “How would you add support for a new stream type?”
- Expected: Describe modular design, adding new parser while handling unknown gracefully
- “How does Breakpad capture a minidump without crashing the handler?”
- Expected: Uses out-of-process handler, signal-safe operations
5.9 Books That Will Help
| Topic | Book | Chapter(s) |
|---|---|---|
| Binary Parsing | “Practical Binary Analysis” - Andriesse | Ch. 1-2 |
| Python struct | Python documentation | struct module |
| File Formats | “File Format Design” - various | General principles |
| Crash Reporting | Breakpad/Crashpad docs | Getting Started guides |
5.10 Implementation Phases
Phase 1: Header Parsing (Days 1-2)
- Parse MINIDUMP_HEADER
- Validate signature and version
- Print basic file info
Phase 2: Stream Directory (Days 3-4)
- Parse stream directory
- Enumerate all streams
- Map stream types to names
Phase 3: SystemInfoStream (Days 5-6)
- Parse system information
- Extract OS and CPU details
- Handle different CPU architectures
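For Phase 3, the fixed part of the SystemInfoStream can be decoded with a single struct layout. The sketch below assumes the 56-byte MINIDUMP_SYSTEM_INFO layout from the Windows SDK and a partial set of architecture and platform IDs as defined in Breakpad's headers; the function signature (taking raw stream bytes) and the dictionary keys are choices made for this example.

```python
import struct

# MINIDUMP_SYSTEM_INFO: 56 bytes. The trailing 24 bytes are a CPU-specific union.
SYSTEM_INFO = struct.Struct("<HHHBBIIIIIHH24s")

# Partial lookup tables; extend them as you meet new values.
ARCHITECTURES = {0: "x86", 5: "ARM", 9: "AMD64", 12: "ARM64"}
PLATFORMS = {2: "Windows NT", 0x8101: "macOS", 0x8201: "Linux", 0x8203: "Android"}


def parse_system_info(data):
    """Decode the raw bytes of a SystemInfoStream into a plain dict."""
    (arch, _level, _revision, cpu_count, _product_type,
     major, minor, build, platform_id, csd_version_rva,
     _suite_mask, _reserved, cpu_union) = SYSTEM_INFO.unpack_from(data)
    info = {
        "architecture": ARCHITECTURES.get(arch, f"unknown ({arch})"),
        "cpu_count": cpu_count,
        "os": PLATFORMS.get(platform_id, f"unknown ({platform_id:#x})"),
        "os_version": f"{major}.{minor}.{build}",
        # RVA of a MINIDUMP_STRING holding the service pack or kernel version text.
        "csd_version_rva": csd_version_rva,
    }
    if arch == 0:
        # For x86 the union begins with the 12-byte CPUID vendor string.
        info["cpu_vendor"] = cpu_union[:12].decode("ascii", errors="replace")
    return info


if __name__ == "__main__":
    sample = SYSTEM_INFO.pack(9, 6, 0, 8, 0, 6, 1, 0, 0x8201, 0, 0, 0, b"\x00" * 24)
    print(parse_system_info(sample))
```

Unknown IDs fall back to a labeled placeholder instead of raising, which matches the requirement to handle unfamiliar data gracefully.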
Phase 4: ModuleListStream (Days 7-9)
- Parse module list
- Extract module names (handle UTF-16)
- Display base addresses and sizes
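For Phase 4, the module list is a 32-bit count followed by fixed-size module records, each of which points to its name through an RVA. The sketch below works on the whole file held in a bytes-like object rather than a file handle, which is an assumption made to keep the example short; the 108-byte record layout follows the Windows SDK declaration of MINIDUMP_MODULE.

```python
import struct

MODULE = struct.Struct("<QIIII13I4I2Q")  # one MINIDUMP_MODULE record, 108 bytes


def read_minidump_string(data, rva):
    """Decode a MINIDUMP_STRING: 4-byte length in bytes, then UTF-16LE text."""
    (length,) = struct.unpack_from("<I", data, rva)
    raw = data[rva + 4 : rva + 4 + length]
    if len(raw) != length:
        raise ValueError(f"string at {rva:#x} runs past end of file")
    return raw.decode("utf-16-le")


def parse_module_list(data, stream_rva):
    """Return (name, base_address, size) tuples for the stream at `stream_rva`."""
    (count,) = struct.unpack_from("<I", data, stream_rva)
    modules = []
    for i in range(count):
        fields = MODULE.unpack_from(data, stream_rva + 4 + i * MODULE.size)
        base, size, name_rva = fields[0], fields[1], fields[4]
        modules.append((read_minidump_string(data, name_rva), base, size))
    return modules


if __name__ == "__main__":
    # Synthetic stream: count, one module record, then the name string.
    name = "/usr/bin/my_application".encode("utf-16-le")
    name_rva = 4 + MODULE.size
    record = MODULE.pack(0x55A8B7C00000, 102400, 0, 0, name_rva,
                         *([0] * 13), 0, 0, 0, 0, 0, 0)
    data = struct.pack("<I", 1) + record + struct.pack("<I", len(name)) + name
    for mod_name, base, size in parse_module_list(data, 0):
        print(f"{base:#018x} {size:>10,} {mod_name}")
```

A corrupt count or name RVA surfaces as `struct.error` or `ValueError` here; a production parser would likely translate both into one project-specific error.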
Phase 5: ExceptionStream (Days 10-12)
- Parse exception information
- Map exception codes to signal names
- Extract crash address
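For Phase 5, the exception stream is a single fixed-size record. The sketch below assumes the 168-byte MINIDUMP_EXCEPTION_STREAM layout from the Windows SDK and assumes a Linux dump written by Breakpad, where the exception code holds the signal number; the signal table is deliberately partial and the dictionary keys are illustrative.

```python
import struct

# ThreadId + alignment padding, MINIDUMP_EXCEPTION (152 bytes),
# then a location descriptor for the crashing thread's context.
EXCEPTION_STREAM = struct.Struct("<II" "IIQQII15Q" "II")

# Subset of Linux signal numbers (x86 and ARM numbering); extend as needed.
LINUX_SIGNALS = {4: "SIGILL", 6: "SIGABRT", 7: "SIGBUS", 8: "SIGFPE", 11: "SIGSEGV"}


def parse_exception(data, rva=0):
    """Decode the exception stream that starts at `rva` inside `data`."""
    fields = EXCEPTION_STREAM.unpack_from(data, rva)
    thread_id = fields[0]
    code = fields[2]      # ExceptionCode
    address = fields[5]   # ExceptionAddress
    context_size, context_rva = fields[23], fields[24]
    return {
        "thread_id": thread_id,
        "code": code,
        "name": LINUX_SIGNALS.get(code, f"unknown ({code:#x})"),
        "address": address,
        "context_location": (context_rva, context_size),
    }


if __name__ == "__main__":
    sample = EXCEPTION_STREAM.pack(12345, 0, 11, 0, 0, 0, 0, 0, *([0] * 15), 0, 0)
    crash = parse_exception(sample)
    print(f"Exception: {crash['name']} (Signal {crash['code']})")
    print(f"Address:   {crash['address']:#018x}")
    print(f"Thread ID: {crash['thread_id']}")
```

Windows dumps store NTSTATUS-style codes such as access violations in the same field, so a complete tool would choose the lookup table based on the platform reported by SystemInfoStream.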
Phase 6: Polish & Testing (Days 13-15)
- Add command-line interface
- Test with various minidumps
- Handle edge cases and errors
5.11 Key Implementation Decisions
- Use dataclasses for parsed structures (Python 3.7+)
- Lazy parsing - only parse streams when accessed
- Memory mapping for large files (mmap module)
- Error handling - wrap all parsing in try/except, never crash on bad input
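Lazy parsing and memory mapping combine naturally: index the small stream directory up front, then slice stream bytes out of the mapping only when a caller asks for them. The following is a rough sketch under those assumptions; `LazyMinidump` and its methods are hypothetical names, and the header layout assumes a 64-bit Flags field.

```python
import mmap
import struct

HEADER = struct.Struct("<4sIIIIIQ")
DIRECTORY = struct.Struct("<III")


class LazyMinidump:
    """Indexes the stream directory eagerly; stream bytes are read on demand."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = None
        try:
            # Length 0 maps the whole file; mmap raises ValueError for empty files.
            self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
            self.directory = self._index_directory()
        except Exception:
            self.close()
            raise

    def _index_directory(self):
        signature, _version, count, dir_rva, *_rest = HEADER.unpack_from(self._map, 0)
        if signature != b"MDMP":
            raise ValueError("not a minidump: bad signature")
        directory = {}
        for i in range(count):
            stream_type, size, rva = DIRECTORY.unpack_from(
                self._map, dir_rva + i * DIRECTORY.size
            )
            directory[stream_type] = (rva, size)
        return directory

    def stream_bytes(self, stream_type):
        """Return the raw bytes of one stream, or None if the dump lacks it."""
        if stream_type not in self.directory:
            return None
        rva, size = self.directory[stream_type]
        if rva + size > len(self._map):
            raise ValueError(f"stream {stream_type:#x} extends past end of file")
        return self._map[rva : rva + size]

    def close(self):
        if self._map is not None:
            self._map.close()
        self._file.close()
```

With this shape, a multi-gigabyte dump costs little more than the directory scan until a specific stream is requested, because the operating system pages data in only when it is sliced.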
6. Testing Strategy
Unit Tests
import io
import struct

def test_header_parsing():
    # Create a minimal valid header (4 + 5*4 + 8 = 32 bytes)
    header_bytes = b'MDMP' + struct.pack('<IIIIIQ',
        0xa793,      # version
        1,           # number of streams
        32,          # stream directory offset
        0,           # checksum
        0x12345678,  # timestamp
        0            # flags (64-bit field)
    )
    assert len(header_bytes) == 32
    header = parse_header(io.BytesIO(header_bytes))
    assert header.signature == b'MDMP'
    assert header.number_of_streams == 1
Integration Tests
- Parse real minidumps from Sentry’s test fixtures
- Compare output with the minidump_dump tool (from Breakpad)
- Verify module list matches minidump_stackwalk output
Verification Checklist
- Correctly identifies minidump signature
- Parses all standard stream types
- Handles Linux-specific streams
- Correctly decodes UTF-16 module names
- Handles 32-bit and 64-bit minidumps
7. Common Pitfalls & Debugging
Pitfall 1: Byte Order Confusion
Problem: Numbers parse as garbage
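Cause: Minidumps are little-endian. Without the < prefix, struct uses the host's native byte order and native alignment, so fields can be byte-swapped on big-endian hosts or shifted by inserted padding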
Solution:
# Wrong
struct.unpack('I', data)
# Right
struct.unpack('<I', data)
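On x86 hosts the native byte order is already little-endian, so the more common symptom of a missing prefix is alignment padding rather than swapped bytes. A small experiment, using an arbitrary two-field layout chosen for illustration, makes the difference visible:

```python
import struct

# The same two fields (a 32-bit value followed by a 64-bit value) in two modes.
native = struct.calcsize("IQ")   # native byte order AND native alignment
packed = struct.calcsize("<IQ")  # little-endian, standard sizes, no padding

print(f"native mode:            {native} bytes")
print(f"explicit little-endian: {packed} bytes")
```

The explicit layout is always 12 bytes. The native result depends on the host and is often larger because padding is inserted before the 64-bit field, which shifts every later field in a minidump record.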
Pitfall 2: String Encoding
Problem: Module names are garbled
Cause: Minidump strings are UTF-16-LE, not UTF-8
Solution:
# Wrong
name = data.decode('utf-8')
# Right
name = data.decode('utf-16-le').rstrip('\x00')
Pitfall 3: Off-by-One in Stream Sizes
Problem: Parsing reads past stream boundary
Cause: Size includes header or doesn’t include padding
Solution: Always verify position after parsing:
expected_end = stream.rva + stream.size
# ... parse ...
actual_end = f.tell()
assert actual_end <= expected_end, f"Read past stream end: {actual_end} > {expected_end}"
Pitfall 4: Architecture Mismatch
Problem: 64-bit minidump parses incorrectly
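Cause: Architecture-specific records differ in layout and size between 32-bit and 64-bit targets, most notably the thread context (register) block
Solution: Check SystemInfoStream.ProcessorArchitecture first, then select the matching layout for those records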
8. Extensions & Challenges
Extension 1: Stack Walking
Implement basic stack unwinding:
- Use thread context registers (RSP/RBP)
- Walk stack frames using frame pointers
- Without symbols, show raw addresses
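As a starting point for this extension, a frame-pointer walk can be sketched in a few lines. The example assumes an x86-64 target whose code was built with frame pointers enabled, and it assumes you have already extracted the thread's stack bytes, the stack's base address, and the RBP and RIP values from the thread context. The function name and parameters are hypothetical, and production stack walkers rely on call frame information rather than this simple chain.

```python
import struct


def walk_frame_pointers(stack, stack_base, rbp, rip, max_frames=64):
    """Follow the saved-RBP chain through a captured stack region.

    stack:      raw bytes of the thread's stack memory from the dump
    stack_base: virtual address corresponding to stack[0]
    Returns instruction addresses, innermost frame first.
    """
    frames = [rip]
    while len(frames) < max_frames:
        offset = rbp - stack_base
        if offset < 0 or offset + 16 > len(stack):
            break  # frame pointer left the captured region
        # Each frame stores the caller's RBP followed by the return address.
        saved_rbp, return_address = struct.unpack_from("<QQ", stack, offset)
        if return_address == 0:
            break
        frames.append(return_address)
        if saved_rbp <= rbp:
            break  # the stack grows down, so callers live at higher addresses
        rbp = saved_rbp
    return frames


if __name__ == "__main__":
    # Synthetic two-frame stack mapped at 0x7000.
    stack = bytearray(0x40)
    struct.pack_into("<QQ", stack, 0x10, 0x7030, 0x401000)
    struct.pack_into("<QQ", stack, 0x30, 0, 0x402000)
    for address in walk_frame_pointers(bytes(stack), 0x7000, 0x7010, 0x400500):
        print(f"{address:#018x}")
```

The monotonicity check and the frame limit guard against corrupted stacks that would otherwise loop forever.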
Extension 2: Symbol Integration
Add symbolication support:
- Parse Breakpad .sym files
- Map addresses to function names
- Generate readable stack traces
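A first step toward symbolication is reading FUNC records from a Breakpad symbol file and mapping a module-relative offset to a function name. The sketch below assumes the textual record form `FUNC [m] address size parameter_size name` with hexadecimal numbers; the helper names and the sample symbol text (including its module ID) are made up for illustration, and line records, PUBLIC records, and STACK records are ignored.

```python
import bisect

SAMPLE_SYM = """\
MODULE Linux x86_64 0123456789ABCDEF0123456789ABCDEF0 my_application
FILE 0 /src/main.cc
FUNC 1130 2a 0 main
1130 4 10 0
FUNC 1160 1c 0 crash_here(int*)
"""


def load_functions(sym_text):
    """Collect (address, size, name) tuples from FUNC records, sorted by address."""
    functions = []
    for line in sym_text.splitlines():
        parts = line.split(" ")
        if parts[0] != "FUNC":
            continue
        if parts[1] == "m":  # optional marker for functions with multiple names
            parts.pop(1)
        address, size = int(parts[1], 16), int(parts[2], 16)
        name = " ".join(parts[4:])  # demangled names may contain spaces
        functions.append((address, size, name))
    functions.sort()
    return functions


def lookup(functions, offset):
    """Return the name of the function covering `offset`, or None."""
    starts = [address for address, _size, _name in functions]
    i = bisect.bisect_right(starts, offset) - 1
    if i >= 0:
        address, size, name = functions[i]
        if offset < address + size:
            return name
    return None


if __name__ == "__main__":
    funcs = load_functions(SAMPLE_SYM)
    for offset in (0x1130, 0x1165, 0x2000):
        print(f"{offset:#x} -> {lookup(funcs, offset)}")
```

To symbolicate a real frame, subtract the owning module's base address from the instruction address before calling `lookup`.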
Extension 3: Memory Analysis
Parse MemoryListStream:
- Extract memory regions
- Search for patterns in memory
- Display heap/stack contents
Extension 4: Crash Clustering
Compare multiple minidumps:
- Generate crash signatures
- Group similar crashes
- Identify most common crash patterns
9. Real-World Connections
Chrome Crash Reporting
Chrome uses Crashpad to:
- Catch crashes in a separate process
- Generate minidumps with relevant streams
- Upload to Google’s crash servers
- Symbolicate and deduplicate
Sentry Integration
Sentry processes minidumps:
- Receive uploaded minidump
- Parse using symbolic-minidump
- Fetch symbols from symbol servers
- Generate readable stack traces
- Group by crash signature
Mozilla Socorro
Firefox’s crash reporting:
- Breakpad client captures crash
- Uploads to Socorro server
- Processes with minidump-analyzer
- Displays in crash-stats dashboard
10. Resources
Official Documentation
Source Code References
Sample Minidumps
11. Self-Assessment Checklist
Before You Start
- Comfortable with Python’s struct module
- Understand binary file I/O
- Know what little-endian means
- Have access to sample minidump files
After Completion
- Can explain minidump format structure
- Can parse headers, directories, and streams
- Can extract system info, modules, and crash details
- Can handle different platforms and architectures
- Understand the Breakpad/Crashpad ecosystem
- Can extend parser to handle new stream types
12. Submission / Completion Criteria
Your project is complete when you can:
- Parse Sample Minidumps
  - Successfully parse Linux minidumps from Breakpad
  - Successfully parse at least one Windows minidump
  - Handle corrupt/truncated files gracefully
- Generate Useful Output
  - Show system information
  - List all loaded modules
  - Display crash/exception information
  - Support at least text and JSON output
- Demonstrate Understanding
  - Explain the minidump structure
  - Describe how stream navigation works
  - Answer interview questions from section 5.8
- Handle Edge Cases
  - Unknown stream types (log, don’t crash)
  - Missing streams (handle gracefully)
  - Large files (don’t run out of memory)