Project 10: Building a Centralized Crash Reporter
Design and build a “mini-Sentry”—a complete crash reporting infrastructure for production systems.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Master |
| Time Estimate | 1 month+ |
| Language | Python (with Flask/FastAPI) |
| Prerequisites | Projects 4 and 7, web API development experience |
| Key Topics | System design, crash pipelines, deduplication, distributed systems |
1. Learning Objectives
By completing this project, you will:
- Design a production-grade crash reporting system
- Implement system-wide crash capture using core_pattern
- Build a server to receive, store, and analyze crash dumps
- Implement crash deduplication using stack trace fingerprinting
- Create a dashboard to visualize crash trends
- Understand the architecture behind services like Sentry, Crashlytics, and Raygun
2. Theoretical Foundation
2.1 Core Concepts
Crash Reporting System Architecture
A production crash reporting system has multiple components:
┌─────────────────────────────────────────────────────────────────────────────┐
│ CRASH REPORTING SYSTEM ARCHITECTURE │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ CLIENT SIDE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Application │ │ Application │ │ Application │ │
│ │ (crash) │ │ (crash) │ │ (crash) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Crash Capture Agent │ │
│ │ ───────────────────────────────────────────────────────────────── │ │
│ │ • Configured via core_pattern or signal handler │ │
│ │ • Generates minidump or uploads core dump │ │
│ │ • Collects metadata (hostname, version, env) │ │
│ │ • Handles upload with retry logic │ │
│ └──────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ │ HTTPS POST │
│ │ (multipart/form-data) │
│ │ │
└─────────────────────────────────┼───────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ SERVER SIDE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ API Gateway / Load Balancer │ │
│ └──────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Ingestion Service │ │
│ │ ───────────────────────────────────────────────────────────────── │ │
│ │ • Validates uploaded crash │ │
│ │ • Stores raw dump in blob storage │ │
│ │ • Queues processing job │ │
│ └──────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────┴───────────────────────┐ │
│ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Blob Storage │ │ Message Queue │ │
│ │ (S3/MinIO) │ │ (Redis/RabbitMQ)│ │
│ │ │ │ │ │
│ │ crash_001.dmp │ │ { job_id: 001, │ │
│ │ crash_002.dmp │ │ dump_path: ..}│ │
│ └──────────────────┘ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Processing Worker │ │
│ │ ───────────────────────────────────────────────────────────────── │ │
│ │ • Downloads dump from blob storage │ │
│ │ • Runs GDB/minidump_stackwalk │ │
│ │ • Symbolicates stack traces │ │
│ │ • Generates crash fingerprint │ │
│ │ • Stores analysis results in database │ │
│ └──────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Database │ │
│ │ ───────────────────────────────────────────────────────────────── │ │
│ │ crash_groups: │ │
│ │ id | fingerprint | count | first_seen | last_seen | title │ │
│ │ │ │
│ │ crash_events: │ │
│ │ id | group_id | timestamp | hostname | version | dump_path │ │
│ │ │ │
│ │ stack_traces: │ │
│ │ id | event_id | thread_id | frames (json) │ │
│ └──────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Dashboard / API │ │
│ │ ───────────────────────────────────────────────────────────────── │ │
│ │ • List crash groups with occurrence counts │ │
│ │ • Drill down into individual crash events │ │
│ │ • View symbolicated stack traces │ │
│ │ • Trend charts over time │ │
│ │ • Integration APIs (Slack, PagerDuty) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Crash Fingerprinting
The key to crash deduplication is generating a stable “fingerprint”:
┌─────────────────────────────────────────────────────────────────┐
│ CRASH FINGERPRINTING │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Raw Stack Trace: │
│ ───────────────── │
│ #0 0x7f8a1234abcd in malloc+0x15 │
│ #1 0x55555555513d in process_data+0x45 at app.c:142 │
│ #2 0x555555555200 in main+0x80 at app.c:200 │
│ #3 0x7f8a12345678 in __libc_start_main+0x100 │
│ │
│ Fingerprint Input (normalized): │
│ ───────────────────────────────── │
│ • Remove memory addresses (they vary) │
│ • Keep function names │
│ • Keep file names (optional: line numbers) │
│ • Use top N frames (3-5 typically) │
│ │
│ Fingerprint String: │
│ ─────────────────── │
│ "malloc|process_data:app.c|main:app.c" │
│ │
│ Fingerprint Hash: │
│ ───────────────── │
│ SHA256("malloc|process_data:app.c|main:app.c") │
│ = "a1b2c3d4e5f6..." │
│ │
│ Crashes with same fingerprint hash are grouped together │
│ │
└─────────────────────────────────────────────────────────────────┘
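In code, the normalization step above amounts to stripping addresses and offsets from each frame and keeping only function and file names. A minimal sketch, assuming GDB-style backtrace lines like the ones in the diagram (the regex and frame depth are illustrative choices, not fixed requirements):
import hashlib
import re

# Matches lines of the form "#N 0xADDR in func+0xOFF at file:line"
FRAME_RE = re.compile(
    r'#\d+\s+0x[0-9a-f]+\s+in\s+(?P<func>[\w?]+)'
    r'(?:\+0x[0-9a-f]+)?(?:\s+at\s+(?P<file>[\w./]+):\d+)?'
)

def normalize(trace_lines, depth=3):
    parts = []
    for line in trace_lines[:depth]:
        m = FRAME_RE.search(line)
        if not m:
            continue
        # Drop addresses and offsets; keep function and (optionally) file name
        part = m.group('func')
        if m.group('file'):
            part += f":{m.group('file')}"
        parts.append(part)
    return "|".join(parts)

trace = [
    "#0 0x7f8a1234abcd in malloc+0x15",
    "#1 0x55555555513d in process_data+0x45 at app.c:142",
    "#2 0x555555555200 in main+0x80 at app.c:200",
]
fingerprint_input = normalize(trace)   # "malloc|process_data:app.c|main:app.c"
fingerprint_hash = hashlib.sha256(fingerprint_input.encode()).hexdigest()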
core_pattern Configuration
Linux allows piping core dumps to a program:
# Default: write core to current directory
# core_pattern = core
# Pipe to crash handler program (writing core_pattern requires root):
echo '|/usr/local/bin/crash_handler %p %e %t %s' | sudo tee /proc/sys/kernel/core_pattern
# %p = PID
# %e = executable name
# %t = timestamp
# %s = signal number
# %h = hostname
# %E = executable path (with / replaced by !)
# The crash_handler receives:
# - Core dump on stdin
# - Arguments from format specifiers
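A pipe handler should drain stdin quickly so the kernel is not left waiting, then hand the heavier work to another process. A minimal sketch of such a handler, assuming the core_pattern shown above and a hypothetical spool directory; uploading can then happen separately, as in the hints later in this project:
#!/usr/bin/env python3
# Sketch of a core_pattern pipe handler for the pattern shown above:
#   |/usr/local/bin/crash_handler %p %e %t %s
# It drains the core dump from stdin and spools it to disk; a separate
# uploader can pick it up from the spool directory.
import sys
from pathlib import Path

SPOOL_DIR = Path("/var/spool/crash_handler")   # assumed spool location

def main():
    pid, exe, timestamp, signum = sys.argv[1:5]   # from %p %e %t %s
    SPOOL_DIR.mkdir(parents=True, exist_ok=True)
    dump_path = SPOOL_DIR / f"{exe}.{pid}.{timestamp}.sig{signum}.core"
    # Read the core dump the kernel pipes to us on stdin, in chunks
    with open(dump_path, "wb") as out:
        while chunk := sys.stdin.buffer.read(1 << 20):   # 1 MiB at a time
            out.write(chunk)

if __name__ == "__main__":
    main()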
2.2 Why This Matters
Building a crash reporter teaches you:
- System design at scale - handling thousands of crashes
- Data pipeline architecture - ingestion, processing, storage
- Deduplication algorithms - grouping similar issues
- DevOps integration - alerting, dashboards, APIs
2.3 Historical Context
- 2004: GNOME Bugzilla’s crash reporter
- 2010: Sentry founded for error tracking
- 2012: Crashlytics founded (acquired by Google)
- Today: Every major platform has crash reporting (Apple, Google, Microsoft)
2.4 Common Misconceptions
Misconception 1: “Just store all the crashes”
- Reality: Without deduplication, you’ll drown in data
Misconception 2: “Stack traces are enough”
- Reality: Need metadata (version, OS, memory) for context
Misconception 3: “One monolithic service is simpler”
- Reality: Separation of concerns (ingestion vs processing) is essential
3. Project Specification
3.1 What You Will Build
A complete crash reporting system with:
- Crash capture agent - Pipes crashes to an uploader
- Ingestion API - Receives and stores crash uploads
- Processing worker - Analyzes crashes with GDB
- Database - Stores crash groups and events
- Dashboard - Web UI to view crash reports
3.2 Functional Requirements
Client Side:
- Configure core_pattern to pipe to crash handler
- Crash handler generates minidump or processes core dump
- Upload crash data to server with metadata
- Retry on network failures
Server Side:
- API endpoint to receive crash uploads
- Store raw dumps in blob storage
- Queue crashes for processing
- Worker to run GDB analysis
- Generate fingerprint and deduplicate
- Store in database with grouping
Dashboard:
- List crash groups with count
- Show crash details and stack trace
- Time-based filtering
- Search by crash content
3.3 Non-Functional Requirements
- Handle 100+ crashes per minute at peak
- Store crashes for at least 30 days
- Dashboard response time < 2 seconds
- Secure: authenticate uploads, protect crash data
3.4 Example Usage / Output
Client Side (on crashing server):
# Configure core_pattern
$ echo '|/usr/local/bin/crash_uploader %p %e %t %s' | sudo tee /proc/sys/kernel/core_pattern
# When a crash happens:
$ ./buggy_program
Segmentation fault
# Behind the scenes:
# 1. Kernel pipes core to crash_uploader
# 2. crash_uploader creates minidump
# 3. Uploads to crash server
# 4. Returns success
Dashboard Output:
╔══════════════════════════════════════════════════════════════════════════╗
║ CRASH DASHBOARD ║
╠══════════════════════════════════════════════════════════════════════════╣
║ ║
║ Last 24 Hours: 47 crashes across 8 unique issues ║
║ ║
║ ┌────────────────────────────────────────────────────────────────────┐ ║
║ │ Issue │ Count │ Last Seen │ Status │ ║
║ ├────────────────────────────────────┼───────┼───────────┼───────────┤ ║
║ │ SIGSEGV in process_data() app.c:142│ 23 │ 2 min ago │ New │ ║
║ │ SIGABRT in malloc() │ 15 │ 1 hr ago │ Ongoing │ ║
║ │ SIGSEGV in parse_json() parser.c:88│ 5 │ 3 hr ago │ Ongoing │ ║
║ │ SIGFPE in calculate() math.c:201 │ 2 │ 12 hr ago │ New │ ║
║ │ Stack overflow in recursive() │ 2 │ 18 hr ago │ Resolved │ ║
║ └────────────────────────────────────┴───────┴───────────┴───────────┘ ║
║ ║
║ [View Details] [Mark Resolved] [Create Bug] [Slack Alert] ║
║ ║
╚══════════════════════════════════════════════════════════════════════════╝
═══════════════════════════════════════════════════════════════════════════
CRASH DETAIL: SIGSEGV in process_data()
═══════════════════════════════════════════════════════════════════════════
Signal: SIGSEGV (11)
Address: 0x0000000000000000
First Seen: 2025-12-20 10:00:00
Last Seen: 2025-12-20 15:58:00
Occurrences: 23
Affected Versions:
• v2.1.0 (18 crashes)
• v2.0.9 (5 crashes)
Affected Hosts:
• prod-web-01 (12)
• prod-web-02 (8)
• prod-web-03 (3)
Stack Trace:
#0 0x000055555555513d in process_data (input=0x0) at app.c:142
#1 0x0000555555555200 in handle_request (req=0x7fff...) at app.c:200
#2 0x0000555555555300 in main () at app.c:250
#3 0x00007f8a12345678 in __libc_start_main
Root Cause Analysis:
The crash occurs when process_data receives a NULL pointer.
This happens when handle_request fails to validate input before calling.
[View Raw Dump] [Download Core] [Similar Issues]
3.5 Real World Outcome
After this project, you’ll have:
- A working crash reporting system
- Experience with production system design
- Understanding of how Sentry/Crashlytics work
- Portfolio project demonstrating infrastructure skills
4. Solution Architecture
4.1 High-Level Design
┌─────────────────────────────────────────────────────────────────┐
│ SYSTEM COMPONENTS │
└─────────────────────────────────────────────────────────────────┘
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Crash Handler │────▶│ Ingestion API │────▶│ Blob Storage │
│ (Client) │ │ (Server) │ │ (MinIO/S3) │
└───────────────┘ └───────┬───────┘ └───────────────┘
│
▼
┌───────────────┐
│ Message Queue │
│ (Redis) │
└───────┬───────┘
│
▼
┌───────────────┐ ┌───────────────┐
│ Processing │────▶│ Database │
│ Worker │ │ (PostgreSQL) │
└───────────────┘ └───────┬───────┘
│
▼
┌───────────────┐
│ Dashboard │
│ (Web UI) │
└───────────────┘
4.2 Key Components
- Crash Handler (crash_uploader.py)
- Receives core dump on stdin
- Creates minidump or processes directly
- Uploads to server
- Ingestion API (server/api.py)
- Flask/FastAPI application
- Receives multipart uploads
- Stores to blob storage
- Queues for processing
- Processing Worker (server/worker.py)
- Pulls jobs from queue
- Runs GDB analysis
- Generates fingerprint
- Updates database
- Dashboard (server/dashboard.py)
- Web UI for viewing crashes
- REST API for integrations
- Charts and statistics
4.3 Data Structures
Database Schema:
-- Crash groups (deduplicated issues)
CREATE TABLE crash_groups (
id SERIAL PRIMARY KEY,
fingerprint VARCHAR(64) UNIQUE NOT NULL,
title VARCHAR(255) NOT NULL,
first_seen TIMESTAMP NOT NULL,
last_seen TIMESTAMP NOT NULL,
occurrence_count INTEGER DEFAULT 1,
status VARCHAR(20) DEFAULT 'new', -- new, ongoing, resolved
created_at TIMESTAMP DEFAULT NOW()
);
-- Individual crash events
CREATE TABLE crash_events (
id SERIAL PRIMARY KEY,
group_id INTEGER REFERENCES crash_groups(id),
timestamp TIMESTAMP NOT NULL,
hostname VARCHAR(255),
app_version VARCHAR(50),
signal_number INTEGER,
crash_address VARCHAR(32),
dump_path VARCHAR(500),
metadata JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
-- Stack frames for each event
CREATE TABLE stack_frames (
id SERIAL PRIMARY KEY,
event_id INTEGER REFERENCES crash_events(id),
thread_id INTEGER,
frame_number INTEGER,
address VARCHAR(32),
function_name VARCHAR(255),
file_name VARCHAR(255),
line_number INTEGER,
module_name VARCHAR(255)
);
-- Indexes for common queries
CREATE INDEX idx_crash_groups_last_seen ON crash_groups(last_seen DESC);
CREATE INDEX idx_crash_events_group_id ON crash_events(group_id);
CREATE INDEX idx_crash_events_timestamp ON crash_events(timestamp DESC);
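If you use SQLAlchemy (as the project layout's server/models/models.py suggests), the first two tables might map to ORM models like the sketch below. Column names follow the schema above, but treat the details as a starting point rather than a finished model:
# Minimal SQLAlchemy sketch mirroring crash_groups and crash_events
# (a possible starting point for server/models/models.py).
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class CrashGroup(Base):
    __tablename__ = "crash_groups"
    id = Column(Integer, primary_key=True)
    fingerprint = Column(String(64), unique=True, nullable=False)
    title = Column(String(255), nullable=False)
    first_seen = Column(DateTime, nullable=False)
    last_seen = Column(DateTime, nullable=False)
    occurrence_count = Column(Integer, default=1)
    status = Column(String(20), default="new")    # new, ongoing, resolved
    events = relationship("CrashEvent", back_populates="group")

class CrashEvent(Base):
    __tablename__ = "crash_events"
    id = Column(Integer, primary_key=True)
    group_id = Column(Integer, ForeignKey("crash_groups.id"))
    timestamp = Column(DateTime, nullable=False)
    hostname = Column(String(255))
    app_version = Column(String(50))
    signal_number = Column(Integer)
    crash_address = Column(String(32))
    dump_path = Column(String(500))
    # "metadata" clashes with SQLAlchemy's Base.metadata, so map it explicitly
    metadata_ = Column("metadata", JSONB)
    group = relationship("CrashGroup", back_populates="events")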
4.4 Algorithm Overview
Fingerprint Generation:
import hashlib

def generate_fingerprint(stack_frames, depth=5):
    """Generate a stable fingerprint from a stack trace."""
    # Take the top N frames
    relevant_frames = stack_frames[:depth]
    # Extract key information
    frame_strings = []
    for frame in relevant_frames:
        if frame.function_name:
            # Use function and file (not line number, which is too specific)
            frame_str = frame.function_name
            if frame.file_name:
                frame_str += f":{frame.file_name}"
            frame_strings.append(frame_str)
        elif frame.address:
            # No symbol available: fall back to the module name
            if frame.module_name:
                frame_strings.append(f"??:{frame.module_name}")
    # Create the fingerprint string and hash it
    fingerprint_str = "|".join(frame_strings)
    return hashlib.sha256(fingerprint_str.encode()).hexdigest()[:16]
Processing Pipeline:
def process_crash(job):
# 1. Download dump from blob storage
dump_path = download_dump(job.dump_url)
# 2. Run GDB analysis
analysis = run_gdb_analysis(dump_path, job.executable_path)
# 3. Generate fingerprint
fingerprint = generate_fingerprint(analysis.stack_frames)
# 4. Find or create crash group
group = find_or_create_group(fingerprint, analysis)
# 5. Create crash event
event = create_event(group, job, analysis)
# 6. Store stack frames
store_frames(event, analysis.stack_frames)
# 7. Update group statistics
update_group_stats(group)
# 8. Trigger alerts if needed
check_and_send_alerts(group, event)
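The run_gdb_analysis step can be a thin wrapper that drives GDB in batch mode and captures backtraces for every thread. Parsing the text into structured frames is the harder part (Project 4 covers it) and is left out of this sketch:
# Sketch of the run_gdb_analysis() step: run GDB in batch mode against the
# dump and collect backtraces for all threads as raw text.
import subprocess

def run_gdb_analysis(dump_path, executable_path):
    result = subprocess.run(
        [
            "gdb", "--batch", "--nx",
            "-ex", "set pagination off",
            "-ex", "thread apply all bt",
            executable_path, dump_path,
        ],
        capture_output=True, text=True, timeout=120,
    )
    return result.stdout   # raw backtrace text, to be parsed into frames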
5. Implementation Guide
5.1 Development Environment Setup
# Create project directory
mkdir crash_reporter
cd crash_reporter
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install flask redis sqlalchemy psycopg2-binary minio boto3
# Start local services (using Docker)
docker run -d --name redis -p 6379:6379 redis
docker run -d --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=crashpass postgres
docker run -d --name minio -p 9000:9000 -e MINIO_ROOT_USER=minioadmin -e MINIO_ROOT_PASSWORD=minioadmin minio/minio server /data
5.2 Project Structure
crash_reporter/
├── client/
│ ├── crash_uploader.py # Core pattern handler
│ ├── config.py # Client configuration
│ └── install.sh # Installation script
│
├── server/
│ ├── api/
│ │ ├── __init__.py
│ │ ├── app.py # Flask application
│ │ ├── routes.py # API endpoints
│ │ └── auth.py # Authentication
│ │
│ ├── worker/
│ │ ├── __init__.py
│ │ ├── processor.py # Processing worker
│ │ ├── analyzer.py # GDB analysis
│ │ └── fingerprint.py # Fingerprinting logic
│ │
│ ├── models/
│ │ ├── __init__.py
│ │ ├── database.py # SQLAlchemy setup
│ │ └── models.py # ORM models
│ │
│ ├── dashboard/
│ │ ├── __init__.py
│ │ ├── views.py # Dashboard routes
│ │ ├── templates/ # Jinja2 templates
│ │ └── static/ # CSS, JS
│ │
│ └── config.py # Server configuration
│
├── tests/
│ ├── test_fingerprint.py
│ ├── test_api.py
│ └── test_worker.py
│
├── docker-compose.yml # Local development
├── requirements.txt
└── README.md
5.3 The Core Question You’re Answering
“How do you build infrastructure to capture, analyze, and manage thousands of crashes across a distributed system?”
This requires:
- Reliable crash capture at the OS level
- Scalable ingestion and storage
- Automated analysis pipeline
- Intelligent deduplication
- Actionable presentation
5.4 Concepts You Must Understand First
- Linux core_pattern mechanism
- Reference: man core, kernel documentation
- REST API design
- Reference: Flask/FastAPI documentation
- Message queue patterns
- Reference: Redis documentation, RabbitMQ concepts
- Database design
- Reference: PostgreSQL documentation
- GDB automation
- Reference: Project 4 (Automated Crash Detective)
5.5 Questions to Guide Your Design
Architecture Questions:
- How will you handle server failures during upload?
- How will you scale processing if crashes spike?
- How will you manage disk space for crash dumps?
Security Questions:
- How will you authenticate crash uploads?
- How will you protect sensitive data in dumps?
- Who should have access to crash data?
Operations Questions:
- How will you monitor the crash reporter itself?
- How will you handle the reporter crashing?
- How will you upgrade without losing data?
5.6 Thinking Exercise
Design on paper:
- Draw the data flow from crash to dashboard
- List all failure modes and how to handle them
- Design the fingerprinting algorithm - what makes two crashes “the same”?
- Plan the database schema - what queries will be common?
- Sketch the dashboard UI - what information matters most?
5.7 Hints in Layers
Hint 1 - Start Simple:
#!/usr/bin/env python3
# Minimal crash uploader (invoked by the kernel via core_pattern)
import sys
import requests

def main():
    # Read the core dump from stdin (the kernel pipes it to us)
    core_data = sys.stdin.buffer.read()
    # Upload to the server along with the core_pattern metadata arguments
    response = requests.post(
        'http://crash-server/api/upload',
        files={'dump': ('core', core_data)},
        data={
            'pid': sys.argv[1],
            'executable': sys.argv[2],
            'timestamp': sys.argv[3],
            'signal': sys.argv[4],
        },
        timeout=30,
    )
    print(f"Uploaded: {response.status_code}")

if __name__ == '__main__':
    main()
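The functional requirements also call for retrying on network failures, which this minimal uploader omits. One way to layer that on, sketched with simple exponential backoff (the attempt count and fallback behavior are up to you):
# Possible retry wrapper for the upload in Hint 1: simple exponential
# backoff; on persistent failure the caller can spool the dump to disk.
import time
import requests

def upload_with_retry(url, files, data, attempts=3):
    for attempt in range(attempts):
        try:
            resp = requests.post(url, files=files, data=data, timeout=30)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == attempts - 1:
                raise                      # let the caller spool and retry later
            time.sleep(2 ** attempt)       # 1s, 2s, 4s, ...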
Hint 2 - Minimal API:
from flask import Flask, request
import os
import uuid

app = Flask(__name__)

@app.route('/api/upload', methods=['POST'])
def upload_crash():
    dump_file = request.files.get('dump')
    if not dump_file:
        return {'error': 'No dump file'}, 400
    # Generate a unique ID
    crash_id = str(uuid.uuid4())
    # Save to blob storage (a local directory stands in for S3/MinIO here)
    os.makedirs('/var/crash_dumps', exist_ok=True)
    dump_path = f'/var/crash_dumps/{crash_id}.dmp'
    dump_file.save(dump_path)
    # Queue for processing (see the queue_job sketch below)
    queue_job(crash_id, dump_path, request.form)
    return {'crash_id': crash_id}, 202
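queue_job() is left undefined above. A minimal sketch that pushes a JSON job onto the Redis list the worker in Hint 3 blocks on:
# Minimal sketch of queue_job(): serialize the job and push it onto the
# 'crash_jobs' list consumed by the worker in Hint 3 (rpush + blpop = FIFO).
import json
import redis

r = redis.Redis()

def queue_job(crash_id, dump_path, form):
    job = {
        'crash_id': crash_id,
        'dump_path': dump_path,
        'pid': form.get('pid'),
        'executable': form.get('executable'),
        'timestamp': form.get('timestamp'),
        'signal': form.get('signal'),
    }
    r.rpush('crash_jobs', json.dumps(job))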
Hint 3 - Processing Worker:
import redis
import json
r = redis.Redis()
def worker_loop():
while True:
# Block waiting for job
_, job_data = r.blpop('crash_jobs')
job = json.loads(job_data)
try:
process_crash(job)
except Exception as e:
# Log error, maybe retry
print(f"Error processing {job['crash_id']}: {e}")
Hint 4 - Fingerprint Algorithm:
def generate_fingerprint(frames):
# Skip frames from system libraries
user_frames = [f for f in frames if not is_system_frame(f)]
# Take top 5 user frames
key_frames = user_frames[:5]
# Create stable string
parts = []
for f in key_frames:
if f.function:
parts.append(f"{f.function}@{f.module or 'unknown'}")
return hashlib.md5('|'.join(parts).encode()).hexdigest()[:12]
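is_system_frame() is the judgment call in this hint. One possible heuristic, treating frames that resolve to common system libraries or libc entry points as noise for grouping purposes (the module and function lists are illustrative, not exhaustive):
# One possible is_system_frame() heuristic: frames from shared system
# libraries or libc entry points are noise for grouping.
SYSTEM_MODULES = ('libc', 'libpthread', 'ld-linux', 'libstdc++')
SYSTEM_FUNCTIONS = ('__libc_start_main', '_start', 'clone')

def is_system_frame(frame):
    module = (frame.module or '').lower()
    function = frame.function or ''
    return (any(m in module for m in SYSTEM_MODULES)
            or function in SYSTEM_FUNCTIONS)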
5.8 The Interview Questions They’ll Ask
- “How would you scale this to handle 10,000 crashes per minute?”
- Expected: Load balancing, multiple workers, async processing, rate limiting
- “How do you handle duplicate crash reports?”
- Expected: Fingerprinting algorithm, hash-based grouping
- “What if the crash reporter itself crashes?”
- Expected: Separate process, watchdog, graceful degradation
- “How do you secure crash data?”
- Expected: Authentication, encryption, access control, data retention
- “How would you add symbolication support?”
- Expected: Symbol server, build ID matching, symbol upload API
- “How do you decide when to alert on a crash?”
- Expected: Threshold-based, rate of change, new issue detection
5.9 Books That Will Help
| Topic | Book | Chapter(s) |
|---|---|---|
| System Design | “Designing Data-Intensive Applications” - Kleppmann | Ch. 1-4 |
| API Design | “REST API Design Rulebook” - Masse | All |
| Queueing | “Enterprise Integration Patterns” - Hohpe | Ch. 6 |
| Databases | “SQL Antipatterns” - Karwin | Ch. 1-10 |
5.10 Implementation Phases
Phase 1: Core Infrastructure (Week 1)
- Set up development environment
- Create basic API with upload endpoint
- Implement blob storage
- Set up database schema
Phase 2: Client Side (Week 2)
- Create crash handler script
- Configure core_pattern
- Test upload flow
- Handle network errors
Phase 3: Processing Pipeline (Week 2-3)
- Implement message queue
- Create processing worker
- Integrate GDB analysis
- Implement fingerprinting
Phase 4: Dashboard (Week 3-4)
- Create basic web UI
- Implement crash list view
- Add detail view
- Add filtering/search
Phase 5: Polish (Week 4+)
- Add authentication
- Improve error handling
- Add monitoring
- Write documentation
5.11 Key Implementation Decisions
- Storage: Use MinIO for local dev, S3 for production
- Queue: Redis is simple and sufficient for this scale
- Database: PostgreSQL for reliability and JSON support
- Web Framework: Flask for simplicity, FastAPI for performance
- Analysis: Run GDB in batch mode for now; add minidump_stackwalk support later
6. Testing Strategy
Unit Tests
# Assumes Frame and generate_fingerprint come from server/worker/fingerprint.py
def test_fingerprint_stability():
    """The same stack should produce the same fingerprint."""
    frames = [
        Frame(function='main', file='app.c'),
        Frame(function='process', file='app.c'),
    ]
    fp1 = generate_fingerprint(frames)
    fp2 = generate_fingerprint(frames)
    assert fp1 == fp2

def test_fingerprint_different_addresses():
    """Different addresses should still produce the same fingerprint."""
    frames1 = [Frame(function='main', address='0x1000')]
    frames2 = [Frame(function='main', address='0x2000')]
    assert generate_fingerprint(frames1) == generate_fingerprint(frames2)
Integration Tests
- Test the full upload → process → dashboard flow (see the sketch after this list)
- Test with real crash dumps
- Test error handling (network failures, corrupt dumps)
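As a starting point for the upload leg of that flow, here is a sketch using Flask's built-in test client; the import path assumes the app from Hint 2 lives at server/api/app.py as in the project layout:
# Sketch of an integration test for the upload endpoint, using Flask's
# test client. Import path is hypothetical; adjust to your layout.
import io
from server.api.app import app

def test_upload_accepts_dump():
    client = app.test_client()
    resp = client.post(
        '/api/upload',
        data={
            'dump': (io.BytesIO(b'fake core dump bytes'), 'core'),
            'pid': '1234',
            'executable': 'buggy_program',
            'timestamp': '1700000000',
            'signal': '11',
        },
        content_type='multipart/form-data',
    )
    assert resp.status_code == 202
    assert 'crash_id' in resp.get_json()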
Load Tests
- Simulate 100 concurrent uploads
- Verify worker keeps up
- Check database performance
7. Common Pitfalls & Debugging
Pitfall 1: core_pattern Not Working
Problem: Crashes don’t trigger handler
Solution:
# Check current pattern
cat /proc/sys/kernel/core_pattern
# Make sure core dumps are enabled for the crashing process
ulimit -c unlimited
# Make the handler script executable
chmod +x /usr/local/bin/crash_uploader
# Test by crashing a throwaway user process (safer than sysrq-trigger,
# which panics the whole kernel and only exercises kdump, not core_pattern)
sleep 60 &
kill -SEGV $!
# Check handler logs
journalctl -f
Pitfall 2: Worker Falling Behind
Problem: Processing queue grows unboundedly
Solution:
- Add more workers
- Implement priority queues
- Add rate limiting on ingestion
- Monitor queue depth (see the sketch below)
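Queue depth is cheap to check from the Redis side; a small sketch (the threshold is arbitrary and should be tuned to your load):
# Quick queue-depth check against the Redis list used by the worker.
import redis

r = redis.Redis()
depth = r.llen('crash_jobs')
if depth > 1000:
    print(f"WARNING: crash queue backlog is {depth} jobs")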
Pitfall 3: Duplicate Fingerprints
Problem: Different crashes getting same fingerprint
Solution:
- Include more frames in fingerprint
- Consider crash address
- Review fingerprint algorithm
- Add manual grouping override
Pitfall 4: Disk Space Exhaustion
Problem: Too many crash dumps stored
Solution:
- Implement a retention policy (see the sketch below)
- Delete processed dumps after N days
- Compress stored dumps
- Monitor disk usage
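A retention sweep can be as simple as deleting raw dumps older than the retention window while keeping the analyzed rows in the database. A sketch, assuming the /var/crash_dumps path from Hint 2 and the 30-day window from the requirements:
# Sketch of a retention sweep: remove raw dumps older than N days.
import time
from pathlib import Path

DUMP_DIR = Path('/var/crash_dumps')
RETENTION_DAYS = 30

def purge_old_dumps():
    cutoff = time.time() - RETENTION_DAYS * 86400
    for dump in DUMP_DIR.glob('*.dmp'):
        if dump.stat().st_mtime < cutoff:
            dump.unlink()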
8. Extensions & Challenges
Extension 1: Symbolication Server
Build a symbol server:
- Upload symbols from builds
- Match by build ID
- Symbolicate on demand
Extension 2: Source Integration
Link crashes to code:
- Git integration
- Show source context
- Assign to code owners
Extension 3: Alerting System
Implement smart alerting:
- New crash type detection
- Spike detection
- On-call rotation integration
Extension 4: Mobile/Web SDKs
Create SDKs for:
- iOS/Android crash reporting
- JavaScript error tracking
- Native app integration
9. Real-World Connections
Sentry Architecture
Sentry handles millions of events:
- Relay for ingestion (Rust)
- Snuba for storage (ClickHouse)
- Symbolicator for stack traces
- Web UI in React
Crashlytics
Google’s Crashlytics:
- SDK embedded in apps
- Real-time crash reporting
- BigQuery integration
- Firebase integration
Mozilla Socorro
Firefox crash reporting:
- Breakpad client
- Collector service
- Processor with symbols
- Crash-stats dashboard
10. Resources
Similar Projects
Documentation
11. Self-Assessment Checklist
Before You Start
- Completed Projects 4 and 7
- Comfortable with web API development
- Basic database knowledge
- Understanding of message queues
After Completion
- Can capture crashes system-wide
- Can upload and store crash dumps
- Can process and analyze crashes automatically
- Can generate stable fingerprints
- Can deduplicate crashes into groups
- Can display crashes in a dashboard
- Understand how to scale the system
- Could extend with additional features
12. Submission / Completion Criteria
Your project is complete when you have:
- Working Client
- core_pattern configured
- Crashes successfully uploaded
- Handles network failures
- Working Server
- API accepts uploads
- Dumps stored in blob storage
- Processing queue operational
- Working Analysis
- GDB analysis runs automatically
- Fingerprints generated
- Crashes grouped correctly
- Working Dashboard
- Lists crash groups
- Shows crash details
- Displays stack traces
- Documentation
- Setup instructions
- Architecture overview
- API documentation
Congratulations!
You’ve completed the Linux Crash Dump Analysis learning path! You’ve gone from analyzing your first core dump to building production crash infrastructure. These skills are used by kernel developers, SREs, and platform engineers worldwide.
What you’ve learned:
- GDB post-mortem debugging
- Memory analysis techniques
- Multi-threaded crash debugging
- Stripped binary analysis
- Minidump file formats
- Kernel module development
- Kernel crash capture with kdump
- crash utility for kernel debugging
- Production crash reporting systems
Where to go next:
- Contribute to crash reporting open source projects
- Learn kernel development deeper
- Study advanced debugging techniques
- Build custom analysis tools for your organization
Congratulations on completing this comprehensive crash analysis journey!