Project 10: Building a Centralized Crash Reporter
Design and build a “mini-Sentry”—a complete crash reporting infrastructure for production systems.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Master |
| Time Estimate | 1 month+ |
| Language | Python (with Flask/FastAPI) |
| Prerequisites | Projects 4 and 7, web API development experience |
| Key Topics | System design, crash pipelines, deduplication, distributed systems |
1. Learning Objectives
By completing this project, you will:
- Design a production-grade crash reporting system
- Implement system-wide crash capture using core_pattern
- Build a server to receive, store, and analyze crash dumps
- Implement crash deduplication using stack trace fingerprinting
- Create a dashboard to visualize crash trends
- Understand the architecture behind services like Sentry, Crashlytics, and Raygun
2. Theoretical Foundation
2.1 Core Concepts
Crash Reporting System Architecture
A production crash reporting system has multiple components:
┌─────────────────────────────────────────────────────────────────────────────┐
│ CRASH REPORTING SYSTEM ARCHITECTURE │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ CLIENT SIDE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Application │ │ Application │ │ Application │ │
│ │ (crash) │ │ (crash) │ │ (crash) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Crash Capture Agent │ │
│ │ ───────────────────────────────────────────────────────────────── │ │
│ │ • Configured via core_pattern or signal handler │ │
│ │ • Generates minidump or uploads core dump │ │
│ │ • Collects metadata (hostname, version, env) │ │
│ │ • Handles upload with retry logic │ │
│ └──────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ │ HTTPS POST │
│ │ (multipart/form-data) │
│ │ │
└─────────────────────────────────┼───────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ SERVER SIDE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ API Gateway / Load Balancer │ │
│ └──────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Ingestion Service │ │
│ │ ───────────────────────────────────────────────────────────────── │ │
│ │ • Validates uploaded crash │ │
│ │ • Stores raw dump in blob storage │ │
│ │ • Queues processing job │ │
│ └──────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────┴───────────────────────┐ │
│ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Blob Storage │ │ Message Queue │ │
│ │ (S3/MinIO) │ │ (Redis/RabbitMQ)│ │
│ │ │ │ │ │
│ │ crash_001.dmp │ │ { job_id: 001, │ │
│ │ crash_002.dmp │ │ dump_path: ..}│ │
│ └──────────────────┘ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Processing Worker │ │
│ │ ───────────────────────────────────────────────────────────────── │ │
│ │ • Downloads dump from blob storage │ │
│ │ • Runs GDB/minidump_stackwalk │ │
│ │ • Symbolicates stack traces │ │
│ │ • Generates crash fingerprint │ │
│ │ • Stores analysis results in database │ │
│ └──────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Database │ │
│ │ ───────────────────────────────────────────────────────────────── │ │
│ │ crash_groups: │ │
│ │ id | fingerprint | count | first_seen | last_seen | title │ │
│ │ │ │
│ │ crash_events: │ │
│ │ id | group_id | timestamp | hostname | version | dump_path │ │
│ │ │ │
│ │ stack_traces: │ │
│ │ id | event_id | thread_id | frames (json) │ │
│ └──────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Dashboard / API │ │
│ │ ───────────────────────────────────────────────────────────────── │ │
│ │ • List crash groups with occurrence counts │ │
│ │ • Drill down into individual crash events │ │
│ │ • View symbolicated stack traces │ │
│ │ • Trend charts over time │ │
│ │ • Integration APIs (Slack, PagerDuty) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Crash Fingerprinting
The key to crash deduplication is generating a stable “fingerprint”:
┌─────────────────────────────────────────────────────────────────┐
│ CRASH FINGERPRINTING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Raw Stack Trace: │
│ ───────────────── │
│ #0 0x7f8a1234abcd in malloc+0x15 │
│ #1 0x55555555513d in process_data+0x45 at app.c:142 │
│ #2 0x555555555200 in main+0x80 at app.c:200 │
│ #3 0x7f8a12345678 in __libc_start_main+0x100 │
│ │
│ Fingerprint Input (normalized): │
│ ───────────────────────────────── │
│ • Remove memory addresses (they vary) │
│ • Keep function names │
│ • Keep file names (optional: line numbers) │
│ • Use top N frames (3-5 typically) │
│ │
│ Fingerprint String: │
│ ─────────────────── │
│ "malloc|process_data:app.c|main:app.c" │
│ │
│ Fingerprint Hash: │
│ ───────────────── │
│ SHA256("malloc|process_data:app.c|main:app.c") │
│ = "a1b2c3d4e5f6..." │
│ │
│ Crashes with same fingerprint hash are grouped together │
│ │
└─────────────────────────────────────────────────────────────────┘
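In code, the normalization step above amounts to stripping addresses and offsets from each frame and keeping only function and file names. A minimal sketch, assuming GDB-style backtrace lines like the ones in the diagram (the regex and frame depth are illustrative choices, not fixed requirements):
import hashlib
import re

# Matches lines of the form "#N 0xADDR in func+0xOFF at file:line"
FRAME_RE = re.compile(
    r'#\d+\s+0x[0-9a-f]+\s+in\s+(?P<func>[\w?]+)'
    r'(?:\+0x[0-9a-f]+)?(?:\s+at\s+(?P<file>[\w./]+):\d+)?'
)

def normalize(trace_lines, depth=3):
    parts = []
    for line in trace_lines[:depth]:
        m = FRAME_RE.search(line)
        if not m:
            continue
        # Drop addresses and offsets; keep function and (optionally) file name
        part = m.group('func')
        if m.group('file'):
            part += f":{m.group('file')}"
        parts.append(part)
    return "|".join(parts)

trace = [
    "#0 0x7f8a1234abcd in malloc+0x15",
    "#1 0x55555555513d in process_data+0x45 at app.c:142",
    "#2 0x555555555200 in main+0x80 at app.c:200",
]
fingerprint_input = normalize(trace)   # "malloc|process_data:app.c|main:app.c"
fingerprint_hash = hashlib.sha256(fingerprint_input.encode()).hexdigest()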
core_pattern Configuration
Linux allows piping core dumps to a program:
# Default: write core to current directory
# core_pattern = core
# Pipe to crash handler program (writing core_pattern requires root):
echo '|/usr/local/bin/crash_handler %p %e %t %s' | sudo tee /proc/sys/kernel/core_pattern
# %p = PID
# %e = executable name
# %t = timestamp
# %s = signal number
# %h = hostname
# %E = executable path (with / replaced by !)
# The crash_handler receives:
# - Core dump on stdin
# - Arguments from format specifiers
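A pipe handler should drain stdin quickly so the kernel is not left waiting, then hand the heavier work to another process. A minimal sketch of such a handler, assuming the core_pattern shown above and a hypothetical spool directory; uploading can then happen separately, as in the hints later in this project:
#!/usr/bin/env python3
# Sketch of a core_pattern pipe handler for the pattern shown above:
#   |/usr/local/bin/crash_handler %p %e %t %s
# It drains the core dump from stdin and spools it to disk; a separate
# uploader can pick it up from the spool directory.
import sys
from pathlib import Path

SPOOL_DIR = Path("/var/spool/crash_handler")   # assumed spool location

def main():
    pid, exe, timestamp, signum = sys.argv[1:5]   # from %p %e %t %s
    SPOOL_DIR.mkdir(parents=True, exist_ok=True)
    dump_path = SPOOL_DIR / f"{exe}.{pid}.{timestamp}.sig{signum}.core"
    # Read the core dump the kernel pipes to us on stdin, in chunks
    with open(dump_path, "wb") as out:
        while chunk := sys.stdin.buffer.read(1 << 20):   # 1 MiB at a time
            out.write(chunk)

if __name__ == "__main__":
    main()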
2.2 Why This Matters
Building a crash reporter teaches you:
- System design at scale - handling thousands of crashes
- Data pipeline architecture - ingestion, processing, storage
- Deduplication algorithms - grouping similar issues
- DevOps integration - alerting, dashboards, APIs
2.3 Historical Context
- 2004: GNOME Bugzilla’s crash reporter
- 2010: Sentry founded for error tracking
- 2012: Crashlytics founded (acquired by Google)
- Today: Every major platform has crash reporting (Apple, Google, Microsoft)
2.4 Common Misconceptions
Misconception 1: “Just store all the crashes”
- Reality: Without deduplication, you’ll drown in data
Misconception 2: “Stack traces are enough”
- Reality: Need metadata (version, OS, memory) for context
Misconception 3: “One monolithic service is simpler”
- Reality: Separation of concerns (ingestion vs processing) is essential
3. Project Specification
3.1 What You Will Build
A complete crash reporting system with:
- Crash capture agent - Pipes crashes to an uploader
- Ingestion API - Receives and stores crash uploads
- Processing worker - Analyzes crashes with GDB
- Database - Stores crash groups and events
- Dashboard - Web UI to view crash reports
3.2 Functional Requirements
Client Side:
- Configure core_pattern to pipe to crash handler
- Crash handler generates minidump or processes core dump
- Upload crash data to server with metadata
- Retry on network failures
Server Side:
- API endpoint to receive crash uploads
- Store raw dumps in blob storage
- Queue crashes for processing
- Worker to run GDB analysis
- Generate fingerprint and deduplicate
- Store in database with grouping
Dashboard:
- List crash groups with count
- Show crash details and stack trace
- Time-based filtering
- Search by crash content
3.3 Non-Functional Requirements
- Handle 100+ crashes per minute at peak
- Store crashes for at least 30 days
- Dashboard response time < 2 seconds
- Secure: authenticate uploads, protect crash data
3.4 Example Usage / Output
Client Side (on crashing server):
# Configure core_pattern
$ echo '|/usr/local/bin/crash_uploader %p %e %t %s' | sudo tee /proc/sys/kernel/core_pattern
# When a crash happens:
$ ./buggy_program
Segmentation fault
# Behind the scenes:
# 1. Kernel pipes core to crash_uploader
# 2. crash_uploader creates minidump
# 3. Uploads to crash server
# 4. Returns success
Dashboard Output:
╔══════════════════════════════════════════════════════════════════════════╗
║ CRASH DASHBOARD ║
╠══════════════════════════════════════════════════════════════════════════╣
║ ║
║ Last 24 Hours: 47 crashes across 8 unique issues ║
║ ║
║ ┌────────────────────────────────────────────────────────────────────┐ ║
║ │ Issue │ Count │ Last Seen │ Status │ ║
║ ├────────────────────────────────────┼───────┼───────────┼───────────┤ ║
║ │ SIGSEGV in process_data() app.c:142│ 23 │ 2 min ago │ New │ ║
║ │ SIGABRT in malloc() │ 15 │ 1 hr ago │ Ongoing │ ║
║ │ SIGSEGV in parse_json() parser.c:88│ 5 │ 3 hr ago │ Ongoing │ ║
║ │ SIGFPE in calculate() math.c:201 │ 2 │ 12 hr ago │ New │ ║
║ │ Stack overflow in recursive() │ 2 │ 18 hr ago │ Resolved │ ║
║ └────────────────────────────────────┴───────┴───────────┴───────────┘ ║
║ ║
║ [View Details] [Mark Resolved] [Create Bug] [Slack Alert] ║
║ ║
╚══════════════════════════════════════════════════════════════════════════╝
═══════════════════════════════════════════════════════════════════════════
CRASH DETAIL: SIGSEGV in process_data()
═══════════════════════════════════════════════════════════════════════════
Signal: SIGSEGV (11)
Address: 0x0000000000000000
First Seen: 2025-12-20 10:00:00
Last Seen: 2025-12-20 15:58:00
Occurrences: 23
Affected Versions:
• v2.1.0 (18 crashes)
• v2.0.9 (5 crashes)
Affected Hosts:
• prod-web-01 (12)
• prod-web-02 (8)
• prod-web-03 (3)
Stack Trace:
#0 0x000055555555513d in process_data (input=0x0) at app.c:142
#1 0x0000555555555200 in handle_request (req=0x7fff...) at app.c:200
#2 0x0000555555555300 in main () at app.c:250
#3 0x00007f8a12345678 in __libc_start_main
Root Cause Analysis:
The crash occurs when process_data receives a NULL pointer.
This happens when handle_request fails to validate input before calling.
[View Raw Dump] [Download Core] [Similar Issues]
3.5 Real World Outcome
After this project, you’ll have:
- A working crash reporting system
- Experience with production system design
- Understanding of how Sentry/Crashlytics work
- Portfolio project demonstrating infrastructure skills
4. Solution Architecture
4.1 High-Level Design
┌─────────────────────────────────────────────────────────────────┐
│ SYSTEM COMPONENTS │
└─────────────────────────────────────────────────────────────────┘
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Crash Handler │────▶│ Ingestion API │────▶│ Blob Storage │
│ (Client) │ │ (Server) │ │ (MinIO/S3) │
└───────────────┘ └───────┬───────┘ └───────────────┘
│
▼
┌───────────────┐
│ Message Queue │
│ (Redis) │
└───────┬───────┘
│
▼
┌───────────────┐ ┌───────────────┐
│ Processing │────▶│ Database │
│ Worker │ │ (PostgreSQL) │
└───────────────┘ └───────┬───────┘
│
▼
┌───────────────┐
│ Dashboard │
│ (Web UI) │
└───────────────┘
4.2 Key Components
- Crash Handler (crash_uploader.py)
- Receives core dump on stdin
- Creates minidump or processes directly
- Uploads to server
- Ingestion API (server/api.py)
- Flask/FastAPI application
- Receives multipart uploads
- Stores to blob storage
- Queues for processing
- Processing Worker (server/worker.py)
- Pulls jobs from queue
- Runs GDB analysis
- Generates fingerprint
- Updates database
- Dashboard (server/dashboard.py)
- Web UI for viewing crashes
- REST API for integrations
- Charts and statistics
4.3 Data Structures
Database Schema:
-- Crash groups (deduplicated issues)
CREATE TABLE crash_groups (
id SERIAL PRIMARY KEY,
fingerprint VARCHAR(64) UNIQUE NOT NULL,
title VARCHAR(255) NOT NULL,
first_seen TIMESTAMP NOT NULL,
last_seen TIMESTAMP NOT NULL,
occurrence_count INTEGER DEFAULT 1,
status VARCHAR(20) DEFAULT 'new', -- new, ongoing, resolved
created_at TIMESTAMP DEFAULT NOW()
);
-- Individual crash events
CREATE TABLE crash_events (
id SERIAL PRIMARY KEY,
group_id INTEGER REFERENCES crash_groups(id),
timestamp TIMESTAMP NOT NULL,
hostname VARCHAR(255),
app_version VARCHAR(50),
signal_number INTEGER,
crash_address VARCHAR(32),
dump_path VARCHAR(500),
metadata JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
-- Stack frames for each event
CREATE TABLE stack_frames (
id SERIAL PRIMARY KEY,
event_id INTEGER REFERENCES crash_events(id),
thread_id INTEGER,
frame_number INTEGER,
address VARCHAR(32),
function_name VARCHAR(255),
file_name VARCHAR(255),
line_number INTEGER,
module_name VARCHAR(255)
);
-- Indexes for common queries
CREATE INDEX idx_crash_groups_last_seen ON crash_groups(last_seen DESC);
CREATE INDEX idx_crash_events_group_id ON crash_events(group_id);
CREATE INDEX idx_crash_events_timestamp ON crash_events(timestamp DESC);
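If you use SQLAlchemy (as the project layout's server/models/models.py suggests), the first two tables might map to ORM models like the sketch below. Column names follow the schema above, but treat the details as a starting point rather than a finished model:
# Minimal SQLAlchemy sketch mirroring crash_groups and crash_events
# (a possible starting point for server/models/models.py).
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class CrashGroup(Base):
    __tablename__ = "crash_groups"
    id = Column(Integer, primary_key=True)
    fingerprint = Column(String(64), unique=True, nullable=False)
    title = Column(String(255), nullable=False)
    first_seen = Column(DateTime, nullable=False)
    last_seen = Column(DateTime, nullable=False)
    occurrence_count = Column(Integer, default=1)
    status = Column(String(20), default="new")    # new, ongoing, resolved
    events = relationship("CrashEvent", back_populates="group")

class CrashEvent(Base):
    __tablename__ = "crash_events"
    id = Column(Integer, primary_key=True)
    group_id = Column(Integer, ForeignKey("crash_groups.id"))
    timestamp = Column(DateTime, nullable=False)
    hostname = Column(String(255))
    app_version = Column(String(50))
    signal_number = Column(Integer)
    crash_address = Column(String(32))
    dump_path = Column(String(500))
    # "metadata" clashes with SQLAlchemy's Base.metadata, so map it explicitly
    metadata_ = Column("metadata", JSONB)
    group = relationship("CrashGroup", back_populates="events")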
4.4 Algorithm Overview
Fingerprint Generation:
import hashlib

def generate_fingerprint(stack_frames, depth=5):
    """Generate a stable fingerprint from a stack trace."""
    # Take the top N frames
    relevant_frames = stack_frames[:depth]
    # Extract key information
    frame_strings = []
    for frame in relevant_frames:
        if frame.function_name:
            # Use function and file (not line number, which is too specific)
            frame_str = frame.function_name
            if frame.file_name:
                frame_str += f":{frame.file_name}"
            frame_strings.append(frame_str)
        elif frame.address:
            # No symbol available: fall back to the module name
            if frame.module_name:
                frame_strings.append(f"??:{frame.module_name}")
    # Create the fingerprint string and hash it
    fingerprint_str = "|".join(frame_strings)
    return hashlib.sha256(fingerprint_str.encode()).hexdigest()[:16]
Processing Pipeline:
def process_crash(job):
# 1. Download dump from blob storage
dump_path = download_dump(job.dump_url)
# 2. Run GDB analysis
analysis = run_gdb_analysis(dump_path, job.executable_path)
# 3. Generate fingerprint
fingerprint = generate_fingerprint(analysis.stack_frames)
# 4. Find or create crash group
group = find_or_create_group(fingerprint, analysis)
# 5. Create crash event
event = create_event(group, job, analysis)
# 6. Store stack frames
store_frames(event, analysis.stack_frames)
# 7. Update group statistics
update_group_stats(group)
# 8. Trigger alerts if needed
check_and_send_alerts(group, event)
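The run_gdb_analysis step can be a thin wrapper that drives GDB in batch mode and captures backtraces for every thread. Parsing the text into structured frames is the harder part (Project 4 covers it) and is left out of this sketch:
# Sketch of the run_gdb_analysis() step: run GDB in batch mode against the
# dump and collect backtraces for all threads as raw text.
import subprocess

def run_gdb_analysis(dump_path, executable_path):
    result = subprocess.run(
        [
            "gdb", "--batch", "--nx",
            "-ex", "set pagination off",
            "-ex", "thread apply all bt",
            executable_path, dump_path,
        ],
        capture_output=True, text=True, timeout=120,
    )
    return result.stdout   # raw backtrace text, to be parsed into frames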
5. Implementation Guide
5.1 Development Environment Setup
# Create project directory
mkdir crash_reporter
cd crash_reporter
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install flask redis sqlalchemy psycopg2-binary minio boto3
# Start local services (using Docker)
docker run -d --name redis -p 6379:6379 redis
docker run -d --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=crashpass postgres
docker run -d --name minio -p 9000:9000 -e MINIO_ROOT_USER=minioadmin -e MINIO_ROOT_PASSWORD=minioadmin minio/minio server /data
5.2 Project Structure
crash_reporter/
├── client/
│ ├── crash_uploader.py # Core pattern handler
│ ├── config.py # Client configuration
│ └── install.sh # Installation script
│
├── server/
│ ├── api/
│ │ ├── __init__.py
│ │ ├── app.py # Flask application
│ │ ├── routes.py # API endpoints
│ │ └── auth.py # Authentication
│ │
│ ├── worker/
│ │ ├── __init__.py
│ │ ├── processor.py # Processing worker
│ │ ├── analyzer.py # GDB analysis
│ │ └── fingerprint.py # Fingerprinting logic
│ │
│ ├── models/
│ │ ├── __init__.py
│ │ ├── database.py # SQLAlchemy setup
│ │ └── models.py # ORM models
│ │
│ ├── dashboard/
│ │ ├── __init__.py
│ │ ├── views.py # Dashboard routes
│ │ ├── templates/ # Jinja2 templates
│ │ └── static/ # CSS, JS
│ │
│ └── config.py # Server configuration
│
├── tests/
│ ├── test_fingerprint.py
│ ├── test_api.py
│ └── test_worker.py
│
├── docker-compose.yml # Local development
├── requirements.txt
└── README.md
5.3 The Core Question You’re Answering
“How do you build infrastructure to capture, analyze, and manage thousands of crashes across a distributed system?”
This requires:
- Reliable crash capture at the OS level
- Scalable ingestion and storage
- Automated analysis pipeline
- Intelligent deduplication
- Actionable presentation
5.4 Concepts You Must Understand First
- Linux core_pattern mechanism
- Reference: man core, kernel documentation
- REST API design
- Reference: Flask/FastAPI documentation
- Message queue patterns
- Reference: Redis documentation, RabbitMQ concepts
- Database design
- Reference: PostgreSQL documentation
- GDB automation
- Reference: Project 4 (Automated Crash Detective)
5.5 Questions to Guide Your Design
Architecture Questions:
- How will you handle server failures during upload?
- How will you scale processing if crashes spike?
- How will you manage disk space for crash dumps?
Security Questions:
- How will you authenticate crash uploads?
- How will you protect sensitive data in dumps?
- Who should have access to crash data?
Operations Questions:
- How will you monitor the crash reporter itself?
- How will you handle the reporter crashing?
- How will you upgrade without losing data?
5.6 Thinking Exercise
Design on paper:
- Draw the data flow from crash to dashboard
- List all failure modes and how to handle them
- Design the fingerprinting algorithm - what makes two crashes “the same”?
- Plan the database schema - what queries will be common?
- Sketch the dashboard UI - what information matters most?
5.7 Hints in Layers
Hint 1 - Start Simple:
#!/usr/bin/env python3
# Minimal crash uploader (invoked by the kernel via core_pattern)
import sys
import requests

def main():
    # Read the core dump from stdin (the kernel pipes it to us)
    core_data = sys.stdin.buffer.read()
    # Upload to the server along with the core_pattern metadata arguments
    response = requests.post(
        'http://crash-server/api/upload',
        files={'dump': ('core', core_data)},
        data={
            'pid': sys.argv[1],
            'executable': sys.argv[2],
            'timestamp': sys.argv[3],
            'signal': sys.argv[4],
        },
        timeout=30,
    )
    print(f"Uploaded: {response.status_code}")

if __name__ == '__main__':
    main()
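The functional requirements also call for retrying on network failures, which this minimal uploader omits. One way to layer that on, sketched with simple exponential backoff (the attempt count and fallback behavior are up to you):
# Possible retry wrapper for the upload in Hint 1: simple exponential
# backoff; on persistent failure the caller can spool the dump to disk.
import time
import requests

def upload_with_retry(url, files, data, attempts=3):
    for attempt in range(attempts):
        try:
            resp = requests.post(url, files=files, data=data, timeout=30)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == attempts - 1:
                raise                      # let the caller spool and retry later
            time.sleep(2 ** attempt)       # 1s, 2s, 4s, ...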
Hint 2 - Minimal API:
from flask import Flask, request
import os
import uuid

app = Flask(__name__)

@app.route('/api/upload', methods=['POST'])
def upload_crash():
    dump_file = request.files.get('dump')
    if not dump_file:
        return {'error': 'No dump file'}, 400
    # Generate a unique ID
    crash_id = str(uuid.uuid4())
    # Save to blob storage (a local directory stands in for S3/MinIO here)
    os.makedirs('/var/crash_dumps', exist_ok=True)
    dump_path = f'/var/crash_dumps/{crash_id}.dmp'
    dump_file.save(dump_path)
    # Queue for processing (see the queue_job sketch below)
    queue_job(crash_id, dump_path, request.form)
    return {'crash_id': crash_id}, 202
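queue_job() is left undefined above. A minimal sketch that pushes a JSON job onto the Redis list the worker in Hint 3 blocks on:
# Minimal sketch of queue_job(): serialize the job and push it onto the
# 'crash_jobs' list consumed by the worker in Hint 3 (rpush + blpop = FIFO).
import json
import redis

r = redis.Redis()

def queue_job(crash_id, dump_path, form):
    job = {
        'crash_id': crash_id,
        'dump_path': dump_path,
        'pid': form.get('pid'),
        'executable': form.get('executable'),
        'timestamp': form.get('timestamp'),
        'signal': form.get('signal'),
    }
    r.rpush('crash_jobs', json.dumps(job))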
Hint 3 - Processing Worker:
import redis
import json
r = redis.Redis()
def worker_loop():
while True:
# Block waiting for job
_, job_data = r.blpop('crash_jobs')
job = json.loads(job_data)
try:
process_crash(job)
except Exception as e:
# Log error, maybe retry
print(f"Error processing {job['crash_id']}: {e}")
Hint 4 - Fingerprint Algorithm:
def generate_fingerprint(frames):
# Skip frames from system libraries
user_frames = [f for f in frames if not is_system_frame(f)]
# Take top 5 user frames
key_frames = user_frames[:5]
# Create stable string
parts = []
for f in key_frames:
if f.function:
parts.append(f"{f.function}@{f.module or 'unknown'}")
return hashlib.md5('|'.join(parts).encode()).hexdigest()[:12]
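is_system_frame() is the judgment call in this hint. One possible heuristic, treating frames that resolve to common system libraries or libc entry points as noise for grouping purposes (the module and function lists are illustrative, not exhaustive):
# One possible is_system_frame() heuristic: frames from shared system
# libraries or libc entry points are noise for grouping.
SYSTEM_MODULES = ('libc', 'libpthread', 'ld-linux', 'libstdc++')
SYSTEM_FUNCTIONS = ('__libc_start_main', '_start', 'clone')

def is_system_frame(frame):
    module = (frame.module or '').lower()
    function = frame.function or ''
    return (any(m in module for m in SYSTEM_MODULES)
            or function in SYSTEM_FUNCTIONS)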
5.8 The Interview Questions They’ll Ask
- “How would you scale this to handle 10,000 crashes per minute?”
- Expected: Load balancing, multiple workers, async processing, rate limiting
- “How do you handle duplicate crash reports?”
- Expected: Fingerprinting algorithm, hash-based grouping
- “What if the crash reporter itself crashes?”
- Expected: Separate process, watchdog, graceful degradation
- “How do you secure crash data?”
- Expected: Authentication, encryption, access control, data retention
- “How would you add symbolication support?”
- Expected: Symbol server, build ID matching, symbol upload API
- “How do you decide when to alert on a crash?”
- Expected: Threshold-based, rate of change, new issue detection
5.9 Books That Will Help
| Topic | Book | Chapter(s) |
|---|---|---|
| System Design | “Designing Data-Intensive Applications” - Kleppmann | Ch. 1-4 |
| API Design | “REST API Design Rulebook” - Masse | All |
| Queueing | “Enterprise Integration Patterns” - Hohpe | Ch. 6 |
| Databases | “SQL Antipatterns” - Karwin | Ch. 1-10 |
5.10 Implementation Phases
Phase 1: Core Infrastructure (Week 1)
- Set up development environment
- Create basic API with upload endpoint
- Implement blob storage
- Set up database schema
Phase 2: Client Side (Week 2)
- Create crash handler script
- Configure core_pattern
- Test upload flow
- Handle network errors
Phase 3: Processing Pipeline (Week 2-3)
- Implement message queue
- Create processing worker
- Integrate GDB analysis
- Implement fingerprinting
Phase 4: Dashboard (Week 3-4)
- Create basic web UI
- Implement crash list view
- Add detail view
- Add filtering/search
Phase 5: Polish (Week 4+)
- Add authentication
- Improve error handling
- Add monitoring
- Write documentation
5.11 Key Implementation Decisions
- Storage: Use MinIO for local dev, S3 for production
- Queue: Redis is simple and sufficient for this scale
- Database: PostgreSQL for reliability and JSON support
- Web Framework: Flask for simplicity, FastAPI for performance
- Analysis: Run GDB in batch mode for now; add minidump_stackwalk support later
6. Testing Strategy
Unit Tests
# Assumes Frame and generate_fingerprint come from server/worker/fingerprint.py
def test_fingerprint_stability():
    """The same stack should produce the same fingerprint."""
    frames = [
        Frame(function='main', file='app.c'),
        Frame(function='process', file='app.c'),
    ]
    fp1 = generate_fingerprint(frames)
    fp2 = generate_fingerprint(frames)
    assert fp1 == fp2

def test_fingerprint_different_addresses():
    """Different addresses should still produce the same fingerprint."""
    frames1 = [Frame(function='main', address='0x1000')]
    frames2 = [Frame(function='main', address='0x2000')]
    assert generate_fingerprint(frames1) == generate_fingerprint(frames2)
Integration Tests
- Test the full upload → process → dashboard flow (see the sketch after this list)
- Test with real crash dumps
- Test error handling (network failures, corrupt dumps)
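As a starting point for the upload leg of that flow, here is a sketch using Flask's built-in test client; the import path assumes the app from Hint 2 lives at server/api/app.py as in the project layout:
# Sketch of an integration test for the upload endpoint, using Flask's
# test client. Import path is hypothetical; adjust to your layout.
import io
from server.api.app import app

def test_upload_accepts_dump():
    client = app.test_client()
    resp = client.post(
        '/api/upload',
        data={
            'dump': (io.BytesIO(b'fake core dump bytes'), 'core'),
            'pid': '1234',
            'executable': 'buggy_program',
            'timestamp': '1700000000',
            'signal': '11',
        },
        content_type='multipart/form-data',
    )
    assert resp.status_code == 202
    assert 'crash_id' in resp.get_json()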
Load Tests
- Simulate 100 concurrent uploads
- Verify worker keeps up
- Check database performance
7. Common Pitfalls & Debugging
Pitfall 1: core_pattern Not Working
Problem: Crashes don’t trigger handler
Solution:
# Check current pattern
cat /proc/sys/kernel/core_pattern
# Make sure core dumps are enabled for the crashing process
ulimit -c unlimited
# Make the handler script executable
chmod +x /usr/local/bin/crash_uploader
# Test by crashing a throwaway user process (safer than sysrq-trigger,
# which panics the whole kernel and only exercises kdump, not core_pattern)
sleep 60 &
kill -SEGV $!
# Check handler logs
journalctl -f
Pitfall 2: Worker Falling Behind
Problem: Processing queue grows unboundedly
Solution:
- Add more workers
- Implement priority queues
- Add rate limiting on ingestion
- Monitor queue depth (see the sketch below)
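Queue depth is cheap to check from the Redis side; a small sketch (the threshold is arbitrary and should be tuned to your load):
# Quick queue-depth check against the Redis list used by the worker.
import redis

r = redis.Redis()
depth = r.llen('crash_jobs')
if depth > 1000:
    print(f"WARNING: crash queue backlog is {depth} jobs")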
Pitfall 3: Duplicate Fingerprints
Problem: Different crashes getting same fingerprint
Solution:
- Include more frames in fingerprint
- Consider crash address
- Review fingerprint algorithm
- Add manual grouping override
Pitfall 4: Disk Space Exhaustion
Problem: Too many crash dumps stored
Solution:
- Implement a retention policy (see the sketch below)
- Delete processed dumps after N days
- Compress stored dumps
- Monitor disk usage
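A retention sweep can be as simple as deleting raw dumps older than the retention window while keeping the analyzed rows in the database. A sketch, assuming the /var/crash_dumps path from Hint 2 and the 30-day window from the requirements:
# Sketch of a retention sweep: remove raw dumps older than N days.
import time
from pathlib import Path

DUMP_DIR = Path('/var/crash_dumps')
RETENTION_DAYS = 30

def purge_old_dumps():
    cutoff = time.time() - RETENTION_DAYS * 86400
    for dump in DUMP_DIR.glob('*.dmp'):
        if dump.stat().st_mtime < cutoff:
            dump.unlink()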
8. Extensions & Challenges
Extension 1: Symbolication Server
Build a symbol server:
- Upload symbols from builds
- Match by build ID
- Symbolicate on demand
Extension 2: Source Integration
Link crashes to code:
- Git integration
- Show source context
- Assign to code owners
Extension 3: Alerting System
Implement smart alerting:
- New crash type detection
- Spike detection
- On-call rotation integration
Extension 4: Mobile/Web SDKs
Create SDKs for:
- iOS/Android crash reporting
- JavaScript error tracking
- Native app integration
9. Real-World Connections
Sentry Architecture
Sentry handles millions of events:
- Relay for ingestion (Rust)
- Snuba for storage (ClickHouse)
- Symbolicator for stack traces
- Web UI in React
Crashlytics
Google’s Crashlytics:
- SDK embedded in apps
- Real-time crash reporting
- BigQuery integration
- Firebase integration
Mozilla Socorro
Firefox crash reporting:
- Breakpad client
- Collector service
- Processor with symbols
- Crash-stats dashboard
10. Resources
Similar Projects
Documentation
11. Self-Assessment Checklist
Before You Start
- Completed Projects 4 and 7
- Comfortable with web API development
- Basic database knowledge
- Understanding of message queues
After Completion
- Can capture crashes system-wide
- Can upload and store crash dumps
- Can process and analyze crashes automatically
- Can generate stable fingerprints
- Can deduplicate crashes into groups
- Can display crashes in a dashboard
- Understand how to scale the system
- Could extend with additional features
12. Submission / Completion Criteria
Your project is complete when you have:
- Working Client
- core_pattern configured
- Crashes successfully uploaded
- Handles network failures
- Working Server
- API accepts uploads
- Dumps stored in blob storage
- Processing queue operational
- Working Analysis
- GDB analysis runs automatically
- Fingerprints generated
- Crashes grouped correctly
- Working Dashboard
- Lists crash groups
- Shows crash details
- Displays stack traces
- Documentation
- Setup instructions
- Architecture overview
- API documentation
Congratulations!
You’ve completed the Linux Crash Dump Analysis learning path! You’ve gone from analyzing your first core dump to building production crash infrastructure. These skills are used by kernel developers, SREs, and platform engineers worldwide.
What you’ve learned:
- GDB post-mortem debugging
- Memory analysis techniques
- Multi-threaded crash debugging
- Stripped binary analysis
- Minidump file formats
- Kernel module development
- Kernel crash capture with kdump
- crash utility for kernel debugging
- Production crash reporting systems
Where to go next:
- Contribute to crash reporting open source projects
- Learn kernel development deeper
- Study advanced debugging techniques
- Build custom analysis tools for your organization
Congratulations on completing this comprehensive crash analysis journey!