Sprint: Linux Crash Dump Analysis Mastery - From Core Dumps to Kernel Panics
Goal: Master the art of post-mortem debugging in Linux. You will learn to analyze user-space core dumps with GDB, understand the ELF format that stores crash data, diagnose multi-threaded crashes, automate triage workflows, and ultimately dissect kernel panics with the crash utility. By the end, you will confidently find the root cause of any crash—from a simple segmentation fault to a full kernel panic.
Introduction
When a program crashes, it often leaves behind a core dump—a snapshot of the process’s memory and CPU state at the moment of death. This file is your crime scene evidence. Learning to analyze it transforms you from a developer who guesses at bugs into one who knows the root cause.
This guide covers user-space crash analysis (segfaults, memory corruption, multi-threaded races) and introduces kernel crash analysis (panics, oops, kdump). You will build 10 projects that progressively deepen your understanding.
What You Will Build
- Crash-generating programs - Controlled bugs that produce core dumps
- GDB analysis workflows - Manual and scripted backtrace extraction
- Memory inspection tools - Raw memory examination and corruption detection
- Automated triage scripts - Python/GDB automation for production systems
- Multi-threaded crash scenarios - Data race and deadlock analysis
- Stripped binary debugging - Working without debug symbols
- Minidump parser - Understanding Breakpad/Crashpad formats
- Kernel panic triggers - Writing buggy kernel modules safely in VMs
- Kernel crash analysis - Using crash on vmcore files
- Centralized crash reporter - A mini-Sentry for your infrastructure
Big Picture: The Crash Analysis Pipeline
┌─────────────────────────────────────────────────────────────────────────────┐
│ CRASH ANALYSIS PIPELINE │
└─────────────────────────────────────────────────────────────────────────────┘
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ APPLICATION │ │ KERNEL │ │ CORE DUMP │ │ ANALYSIS │
│ CRASH │────►│ HANDLER │────►│ FILE │────►│ TOOLS │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
│ │ │ │
│ │ │ │
▼ ▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ SIGSEGV │ │core_pattern│ │ ELF Format │ │ GDB │
│ SIGABRT │ │ ulimit -c │ │ PT_NOTE │ │ crash │
│ SIGFPE │ │ systemd- │ │ PT_LOAD │ │ minidump │
│ SIGBUS │ │ coredump │ │ Registers │ │ stackwalk │
└───────────┘ └───────────┘ └───────────┘ └───────────┘
USER-SPACE CRASHES: KERNEL CRASHES:
┌─────────────────────────────┐ ┌─────────────────────────────────────┐
│ Process → Signal → Core │ │ Panic → kexec → vmcore → crash │
│ GDB loads: executable + │ │ crash loads: vmlinux + vmcore │
│ core file + debug symbols │ │ Debug kernel symbols required │
└─────────────────────────────┘ └─────────────────────────────────────┘
Scope
In Scope:
- Linux user-space core dump analysis (GDB)
- ELF core dump format internals
- Multi-threaded crash debugging
- GDB Python scripting for automation
- Kernel crash dumps (kdump/vmcore basics)
- The crash utility for kernel analysis
- Minidump format (Breakpad/Crashpad)
Out of Scope:
- Windows crash dumps (WinDbg, minidumps on Windows)
- macOS crash reports
- Hardware debugging (JTAG, logic analyzers)
- Live debugging techniques (covered in other guides)
How to Use This Guide
Reading Strategy
-
Read the Theory Primer first - The concepts explained before the projects give you the mental model needed to understand why techniques work.
-
Follow the project order - Projects 1-3 build foundational skills. Projects 4-6 add intermediate complexity. Projects 7-10 tackle advanced topics.
-
Don’t skip the “Thinking Exercise” - These pre-project exercises build the mental models that make debugging intuitive.
-
Use the “Definition of Done” - Each project has explicit completion criteria. Don’t move on until you’ve hit them.
Workflow for Each Project
1. Read the project overview and "Core Question"
2. Complete the "Thinking Exercise"
3. Implement the project (use hints only when stuck)
4. Verify against "Real World Outcome" examples
5. Review the "Common Pitfalls" section
6. Complete the "Definition of Done" checklist
7. Attempt the interview questions
Time Investment
| Project Type | Time Estimate | Examples |
|---|---|---|
| Beginner | 4-8 hours | Projects 1-2 |
| Intermediate | 10-20 hours | Projects 3-6 |
| Advanced | 20-40 hours | Projects 7-9 |
| Capstone | 40+ hours | Project 10 |
Total Sprint Duration: 8-12 weeks at 10-15 hours/week
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
C Programming Fundamentals
- Pointers and memory addresses
- Stack vs. heap allocation
- Compiling with GCC (gcc -g -o program program.c)
- Basic understanding of segmentation faults
- Recommended Reading: “The C Programming Language” by Kernighan & Ritchie - Ch. 5-6
Linux Command Line
- Basic shell navigation and commands
- Understanding of processes and signals
- File permissions and sudo usage
- Recommended Reading: “The Linux Command Line” by William Shotts - Ch. 1-10
Basic GDB Usage
- Starting GDB (gdb ./program)
- Setting breakpoints (break main)
- Stepping through code (next, step)
- Printing variables (print x)
- Recommended Reading: “The Art of Debugging with GDB” by Matloff & Salzman - Ch. 1-2
Helpful But Not Required
Assembly Language Basics (learned during Projects 5-6)
- x86-64 register names (RAX, RBX, RSP, RIP)
- Basic instruction formats
- Recommended Reading: “Low-Level Programming” by Igor Zhirkov - Ch. 1-3
Linux Kernel Concepts (learned during Projects 8-9)
- Kernel modules and insmod/rmmod
- Kernel vs. user space
- Recommended Reading: “Linux Device Drivers, 3rd Edition” by Corbet et al. - Ch. 1-2
Python Scripting (needed for Projects 4, 7, 10)
- Basic Python syntax and file I/O
- subprocess module for running external commands
- struct module for binary parsing
Self-Assessment Questions
Answer “yes” to at least 5 of these before starting:
- Can you explain what happens when you dereference a NULL pointer in C?
- Do you know the difference between stack and heap memory?
- Can you compile a C program with debug symbols using GCC?
- Have you used GDB to set a breakpoint and step through code?
- Do you understand what a signal (like SIGSEGV) is in Linux?
- Can you write a basic bash script that runs a command and checks its exit code?
- Do you know what a process memory map looks like (/proc/[pid]/maps)?
Development Environment Setup
Required Tools:
| Tool | Version | Purpose | Installation |
|---|---|---|---|
| GCC | 9.0+ | Compiling with debug symbols | sudo apt install build-essential |
| GDB | 10.0+ | Core dump analysis | sudo apt install gdb |
| Bash | 4.0+ | Scripting | Pre-installed on Linux |
| Python | 3.8+ | Automation scripts | sudo apt install python3 |
Recommended Tools:
| Tool | Purpose | Installation |
|---|---|---|
| Valgrind | Memory error detection | sudo apt install valgrind |
| strace | System call tracing | sudo apt install strace |
| objdump | Binary disassembly | Part of binutils |
| readelf | ELF file inspection | Part of binutils |
| coredumpctl | systemd core dump management | Part of systemd |
For Kernel Projects (8-9):
| Tool | Purpose | Installation |
|---|---|---|
| QEMU/KVM | Virtual machine for safe testing | sudo apt install qemu-kvm |
| crash | Kernel dump analysis | sudo apt install crash |
| kernel-debuginfo | Debug symbols for kernel | Varies by distro |
Testing Your Setup:
# Verify GCC with debug symbols
$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
# Verify GDB
$ gdb --version
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
# Test core dump generation
$ ulimit -c unlimited
$ cat /proc/sys/kernel/core_pattern
|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E
# Note: If using systemd-coredump, use coredumpctl instead
$ coredumpctl list
Important Reality Check
This is not easy material. Crash dump analysis requires understanding:
- How programs are represented in memory
- How the CPU executes instructions
- How the operating system manages processes
- How debugging tools reconstruct program state
Expect to feel confused initially. The projects are designed to make abstract concepts concrete through hands-on work. Trust the process.
Big Picture / Mental Model
The Layers of Crash Analysis
┌─────────────────────────────────────────────────────────────────────────────┐
│ CRASH ANALYSIS MENTAL MODEL │
└─────────────────────────────────────────────────────────────────────────────┘
LAYER 5: AUTOMATED SYSTEMS
┌─────────────────────────────────────────────────────────────────────────────┐
│ Crash Reporters (Sentry, Crashpad) │ CI/CD Integration │ Alerting │
│ Symbolication Servers │ Crash Deduplication │ Dashboards │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
LAYER 4: KERNEL CRASH ANALYSIS
┌─────────────────────────────────────────────────────────────────────────────┐
│ kdump (capture mechanism) │ vmcore (kernel memory image) │ crash tool │
│ kexec (boot into capture) │ vmlinux (debug symbols) │ dmesg/log │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
LAYER 3: ADVANCED USER-SPACE ANALYSIS
┌─────────────────────────────────────────────────────────────────────────────┐
│ Multi-threaded debugging │ Stripped binaries │ Memory corruption │
│ GDB Python scripting │ Disassembly │ Address-to-symbol │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
LAYER 2: BASIC USER-SPACE ANALYSIS
┌─────────────────────────────────────────────────────────────────────────────┐
│ GDB backtrace (bt) │ Variable inspection (print) │ Memory exam │
│ Stack frames (frame N) │ Register state (info reg) │ x/Nfx addr │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
LAYER 1: CORE DUMP FUNDAMENTALS
┌─────────────────────────────────────────────────────────────────────────────┐
│ ELF core format │ Signal handling │ ulimit/core_pattern │
│ PT_NOTE (metadata) │ Memory segments │ systemd-coredump │
│ PT_LOAD (memory snapshot) │ Register values │ Debug symbols (-g) │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────┐
│ FOUNDATION: C Memory Model │
│ Stack, Heap, Code, Data │
│ Pointers, Addresses, Segments │
└─────────────────────────────────┘
The Core Dump Data Flow
RUNNING PROCESS CRASH EVENT ANALYSIS
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ │ │ │ │ │
│ Code (.text) │ │ Signal raised │ │ GDB loads: │
│ Data (.data) │ ─────────► │ (SIGSEGV, │ ─────────► │ - executable │
│ BSS (.bss) │ │ SIGABRT...) │ │ - core file │
│ Heap │ │ │ │ - symbols │
│ Stack │ │ Kernel writes │ │ │
│ Registers │ │ core dump │ │ You inspect: │
│ │ │ │ │ - backtrace │
└────────────────┘ └────────────────┘ │ - variables │
│ - memory │
Core file contains: │ - registers │
┌────────────────┐ │ │
│ ELF Header │ └────────────────┘
│ PT_NOTE: │
│ - prstatus │
│ - prpsinfo │
│ - auxv │
│ - files │
│ PT_LOAD: │
│ - memory │
│ segments │
└────────────────┘
Theory Primer
This section provides the conceptual foundation for crash dump analysis. Read these chapters before starting the projects—they give you the mental models that make debugging intuitive.
Chapter 1: Core Dumps Fundamentals
Fundamentals
A core dump (or “core file”) is a snapshot of a process’s memory and CPU state at the moment it terminated abnormally. The name comes from early computing when memory was made of magnetic “cores.” Today, a core dump captures everything needed to reconstruct the state of a crashed program: its memory contents, register values, open file descriptors, and more.
When a process receives a signal that causes it to terminate (like SIGSEGV for segmentation faults), the Linux kernel can save this snapshot to a file. This file is your primary evidence for post-mortem debugging—analyzing what went wrong after the crash occurred.
Core dumps are essential because:
- Crashes may not be reproducible - Race conditions, timing issues, or specific input combinations may be difficult to recreate
- Production debugging is limited - You can’t attach a debugger to production systems easily
- The crash context is preserved - You see the exact state at the moment of failure
Deep Dive
The core dump mechanism in Linux is controlled by several system settings and kernel parameters. Understanding these is crucial for both generating and analyzing dumps.
Signal Handling and Core Generation
When a process performs an illegal operation (like dereferencing a NULL pointer), the CPU raises an exception. The kernel translates this into a signal delivered to the process. The default action for certain signals is to terminate the process and generate a core dump:
| Signal | Number | Description | Default Action |
|---|---|---|---|
| SIGQUIT | 3 | Quit from keyboard (Ctrl+\) | Core dump |
| SIGILL | 4 | Illegal instruction | Core dump |
| SIGABRT | 6 | Abort signal (from abort()) | Core dump |
| SIGFPE | 8 | Floating-point exception | Core dump |
| SIGSEGV | 11 | Segmentation fault | Core dump |
| SIGBUS | 7 | Bus error (bad memory access) | Core dump |
| SIGSYS | 31 | Bad system call | Core dump |
| SIGTRAP | 5 | Trace/breakpoint trap | Core dump |
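To exercise any of these without writing a separate bug for each, a tiny signal-raising program is enough. This is a minimal sketch (the file name signal_demo.c and the argument handling are arbitrary); raise() delivers the chosen signal to the calling process, so the default action listed above (terminate and dump core) applies:
// signal_demo.c - raise a chosen core-dumping signal to test your configuration
#include <signal.h>
#include <string.h>
int main(int argc, char **argv) {
    int sig = SIGABRT;                                  // default: the same signal abort() raises
    if (argc > 1 && strcmp(argv[1], "segv") == 0) sig = SIGSEGV;
    if (argc > 1 && strcmp(argv[1], "fpe") == 0)  sig = SIGFPE;
    if (argc > 1 && strcmp(argv[1], "ill") == 0)  sig = SIGILL;
    raise(sig);                                         // default disposition: terminate + core dump
    return 0;                                           // not reached for the signals above
}
Compile with gcc -g -o signal_demo signal_demo.c, run ulimit -c unlimited, and each invocation should leave a core dump behind.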
Core Dump Size Limits (ulimit)
The shell’s resource limits control whether core dumps are created. The ulimit -c command shows the maximum core file size in blocks (the block size is shell-dependent). A value of 0 (common default) means no core dumps are created.
# Check current limit
$ ulimit -c
0
# Enable unlimited core dumps for this session
$ ulimit -c unlimited
# Enable for all users permanently (in /etc/security/limits.conf)
* soft core unlimited
* hard core unlimited
Core Pattern Configuration
The /proc/sys/kernel/core_pattern file determines where core dumps are written and how they’re named. This can be:
- A file path pattern - Specifiers like %p (PID), %e (executable name), and %t (timestamp) are replaced
- A pipe to a program - A pattern starting with | pipes the dump to a handler (like systemd-coredump or apport)
# Traditional file-based pattern
$ echo "core.%e.%p.%t" > /proc/sys/kernel/core_pattern
# Creates: core.myprogram.1234.1703097600
# Modern systemd-based handling
$ cat /proc/sys/kernel/core_pattern
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
Modern Core Dump Management: systemd-coredump
Most modern Linux distributions use systemd-coredump to manage core dumps. It provides:
- Automatic compression and storage in /var/lib/systemd/coredump/
- Journal integration for metadata
- Automatic cleanup of old dumps
- The coredumpctl tool for listing and debugging
# List recent core dumps
$ coredumpctl list
TIME PID UID GID SIG COREFILE EXE
Fri 2024-12-20 10:30:15 EST 1234 1000 1000 SIGSEGV present /usr/bin/myapp
# Debug a specific crash
$ coredumpctl debug myapp
# Export the core file
$ coredumpctl dump myapp -o core.myapp
Core Dump Security Considerations
Core dumps can contain sensitive data:
- Environment variables (potentially including secrets)
- Memory contents (passwords, API keys, personal data)
- File contents that were being processed
This is why many systems disable core dumps by default and why GDPR compliance may require careful handling. Configure Storage=none in /etc/systemd/coredump.conf if you only want journal metadata without the actual memory dump.
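A minimal sketch of that setting (the key names come from coredump.conf; defaults vary by distribution):
# /etc/systemd/coredump.conf (excerpt)
[Coredump]
Storage=none        # log crash metadata to the journal, but do not keep the memory dump
ProcessSizeMax=0    # optionally skip processing the dump contents entirely
systemd-coredump reads its configuration when it handles a crash, so the change applies to the next dump without restarting anything.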
How This Fits in Projects
- Project 1: You’ll configure core dump generation and verify the system creates dump files
- Project 2: You’ll load core dumps into GDB and extract backtraces
- Project 10: You’ll build a system that automatically captures and processes core dumps
Mental Model Diagram
CORE DUMP GENERATION FLOW
┌──────────────────────────────────────────────────────────────────────────────┐
│ │
│ 1. ILLEGAL OPERATION 2. CPU EXCEPTION 3. KERNEL SIGNAL │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ int *p = NULL; │ │ #PF (Page │ │ Signal 11 │ │
│ │ *p = 42; │ ───► │ Fault) raised │──►│ (SIGSEGV) │ │
│ │ // BOOM! │ │ by CPU │ │ delivered │ │
│ └─────────────────┘ └─────────────────┘ └────────┬────────┘ │
│ │ │
│ 4. CORE PATTERN CHECK 5. DUMP CREATION 6. FILE WRITTEN │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /proc/sys/ │ │ For each │ │ core.myapp.1234 │ │
│ │ kernel/ │ ───► │ memory region: │──►│ (ELF format) │ │
│ │ core_pattern │ │ - Copy to file │ │ Ready for GDB │ │
│ │ │ │ - Add metadata │ │ │ │
│ │ ulimit -c > 0? │ │ - Save regs │ └─────────────────┘ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ OR (Modern systemd systems): │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ |/usr/lib/systemd/systemd-coredump │ │
│ │ └─► Writes to /var/lib/systemd/coredump/ │ │
│ │ └─► Logs metadata to journal │ │
│ │ └─► Use `coredumpctl debug` to analyze │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────┘
Minimal Concrete Example
// crash.c - A program that generates a core dump
#include <stdio.h>
int main(void) {
int *ptr = NULL; // ptr points to address 0
*ptr = 42; // Writing to address 0 triggers SIGSEGV
return 0;
}
# Compile with debug symbols
$ gcc -g -o crash crash.c
# Enable core dumps
$ ulimit -c unlimited
# Run and crash
$ ./crash
Segmentation fault (core dumped)
# Verify the core file exists (traditional systems)
$ file core*
core: ELF 64-bit LSB core file, x86-64
# Or on systemd systems
$ coredumpctl list
TIME PID UID GID SIG COREFILE EXE
... 1234 1000 1000 SIGSEGV present /path/to/crash
Common Misconceptions
- “Core dumps are always created” - No, they require ulimit -c to be non-zero and the file system must be writable.
- “The core file contains the executable” - No, it contains only memory contents and metadata. You need the original executable (ideally with debug symbols) to analyze it.
- “Core dumps are huge” - Not always. They can be compressed and typically only contain used memory pages, not the full virtual address space.
- “I can analyze a core dump from any machine” - The executable must match exactly (same build). Libraries must also match for accurate analysis (see the sketch below for analyzing a core away from the crash host).
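If you do need to analyze a core away from the machine that produced it, one workable approach is to copy the executable and the exact shared libraries from the crash host (the NT_FILE note or ldd tells you which ones) into a local directory tree and point GDB at that tree. A minimal sketch, with placeholder paths:
$ gdb
(gdb) set sysroot ./crashroot                          # resolve shared libraries from the copied tree
(gdb) set solib-search-path ./crashroot/lib/x86_64-linux-gnu
(gdb) file ./crashroot/usr/bin/myapp                   # the matching executable build
(gdb) core-file ./core
(gdb) bt
Setting the sysroot before loading the core keeps GDB from silently mixing in your local (mismatched) libraries.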
Check-Your-Understanding Questions
- What is the default value of ulimit -c on most systems, and what does it mean?
- If core_pattern is |/usr/lib/systemd/systemd-coredump, where do core dumps go?
- Why might you set Storage=none in /etc/systemd/coredump.conf?
- Which signal is sent when you dereference a NULL pointer?
- What’s the difference between SIGSEGV and SIGBUS?
Check-Your-Understanding Answers
- Default is 0, meaning core dumps are disabled. No core file will be created when a process crashes.
- They’re piped to the systemd-coredump service, which stores them in /var/lib/systemd/coredump/ (compressed) and logs metadata to the journal.
- For security/privacy compliance (like GDPR) where you want to log that crashes occurred without storing potentially sensitive memory contents.
- SIGSEGV (signal 11) - Segmentation fault. This indicates an invalid memory access.
- SIGSEGV typically means accessing unmapped memory. SIGBUS means the memory is mapped but the access is misaligned or otherwise illegal (e.g., accessing memory-mapped I/O incorrectly).
Real-World Applications
- Production debugging - When a server process crashes at 3 AM, the core dump lets you analyze it during business hours
- Crash reporting systems - Services like Sentry and Crashpad use core dumps (or minidumps) to report crashes
- QA and testing - Core dumps help developers understand why tests failed
- Security analysis - Examining crashes for potential vulnerabilities
Where You’ll Apply It
- Project 1: Generating your first intentional crash and verifying core dump creation
- Project 2: Loading core dumps into GDB
- Project 4: Automating core dump analysis
- Project 10: Building a crash collection system
References
- Linux man page: core(5)
- systemd-coredump documentation
- coredumpctl man page
- “The Linux Programming Interface” by Michael Kerrisk - Ch. 22 (Signals)
- Arch Wiki: Core dump
Key Insights
A core dump is a frozen snapshot of a process at the moment of death—it captures everything needed to perform a post-mortem investigation without needing to reproduce the crash.
Summary
Core dumps are ELF files containing a process’s memory and CPU state at crash time. Generation is controlled by ulimit -c (size limit) and /proc/sys/kernel/core_pattern (location/handler). Modern systems use systemd-coredump with coredumpctl for management. Core dumps can contain sensitive data, so security considerations apply.
Homework/Exercises
- Exercise 1: Write a C program that crashes with each of these signals: SIGSEGV, SIGFPE, SIGABRT. Generate and verify core dumps for each.
- Exercise 2: Configure your system to store core dumps with the pattern /tmp/cores/core.%e.%p.%t. Create the directory, set permissions, and test.
- Exercise 3: If your system uses systemd-coredump, practice using coredumpctl list, coredumpctl info, and coredumpctl debug.
- Exercise 4: Write a script that checks if core dumps are enabled and reports the current configuration.
Solutions to Homework/Exercises
Exercise 1 Solution:
// Three separate programs:
// sigsegv_crash.c
int main() { int *p = 0; *p = 1; return 0; }
// sigfpe_crash.c
int main() { volatile int x = 1, y = 0; return x / y; } /* volatile keeps the compiler from folding the division away */
// sigabrt_crash.c
#include <stdlib.h>
int main() { abort(); return 0; }
Exercise 2 Solution:
sudo mkdir -p /tmp/cores
sudo chmod 1777 /tmp/cores
echo "/tmp/cores/core.%e.%p.%t" | sudo tee /proc/sys/kernel/core_pattern
ulimit -c unlimited
./crash # Test with any crashing program
ls /tmp/cores/ # Verify core file created
Exercise 3 Solution:
coredumpctl list # List all dumps
coredumpctl info $(coredumpctl list --no-legend | tail -1 | awk '{print $5}') # Info on latest (PID is the 5th field)
coredumpctl debug # Debug most recent dump
Exercise 4 Solution:
#!/bin/bash
echo "=== Core Dump Configuration Check ==="
echo "ulimit -c: $(ulimit -c)"
echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
if [ "$(ulimit -c)" = "0" ]; then
echo "WARNING: Core dumps are DISABLED"
else
echo "Core dumps are ENABLED"
fi
Chapter 2: ELF Core Dump Format
Fundamentals
Core dumps are stored in the ELF (Executable and Linkable Format) format—the same format used for Linux executables and shared libraries. Understanding ELF structure is essential because it tells you where to find different pieces of crash data.
An ELF core dump is essentially a specially structured file that contains two main types of information: metadata (what process crashed, signal received, register values) stored in PT_NOTE segments, and memory contents (stack, heap, data sections) stored in PT_LOAD segments.
Unlike executable ELF files that have section headers for code and data, core dumps primarily use program headers to describe memory segments. The readelf and eu-readelf tools can parse these headers, revealing the structure of your crash data.
Deep Dive
ELF File Structure Overview
Every ELF file begins with a fixed-size header that identifies the file and points to important data structures:
┌────────────────────────────────────────────────────────────────────────────┐
│ ELF FILE STRUCTURE │
├────────────────────────────────────────────────────────────────────────────┤
│ ELF Header (52/64 bytes) │
│ - Magic number: 0x7F 'E' 'L' 'F' │
│ - Class (32/64-bit), Endianness, Version │
│ - Type: ET_CORE (4) for core dumps │
│ - Entry point, Program header offset, Section header offset │
├────────────────────────────────────────────────────────────────────────────┤
│ Program Headers (array) │
│ - PT_NOTE: Metadata (registers, process info, file mappings) │
│ - PT_LOAD: Memory segments (actual memory contents) │
├────────────────────────────────────────────────────────────────────────────┤
│ Segment Data │
│ - NOTE data: prstatus, prpsinfo, auxv, file mappings │
│ - LOAD data: Stack, heap, mapped files, anonymous memory │
└────────────────────────────────────────────────────────────────────────────┘
PT_NOTE Segment: The Metadata Treasure Trove
The PT_NOTE segment contains structured metadata about the crashed process. Each note has a name, type, and descriptor (data). Key note types include:
| Note Type | Name | Description |
|---|---|---|
| NT_PRSTATUS | CORE | Process status including registers, signal, PID |
| NT_PRPSINFO | CORE | Process info: state, command name, nice value |
| NT_AUXV | CORE | Auxiliary vector (dynamic linker info) |
| NT_FILE | CORE | File mappings (which files were mapped where) |
| NT_FPREGSET | CORE | Floating-point register state |
| NT_X86_XSTATE | LINUX | Extended CPU state (AVX, etc.) |
The NT_PRSTATUS note is particularly important—it contains:
- The signal that killed the process (e.g., SIGSEGV = 11)
- Current and pending signal masks
- All general-purpose register values (RIP, RSP, RAX, etc.)
- Process and thread IDs
For multi-threaded processes, there’s one NT_PRSTATUS note per thread, allowing you to see what each thread was doing at crash time.
PT_LOAD Segments: The Memory Snapshot
PT_LOAD segments contain the actual memory contents of the process. Each segment has:
- p_vaddr: Virtual address where this memory was mapped
- p_filesz: How many bytes are in the core file
- p_memsz: How many bytes this segment represented in memory
- p_flags: Permissions (read, write, execute)
If p_filesz is 0 but p_memsz is non-zero, the segment was all zeros (like uninitialized BSS) and wasn’t stored to save space.
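These fields are easy to read programmatically as well. The following is a minimal sketch for a 64-bit little-endian core file named core (offsets follow the standard ELF64 layout); it lists every PT_LOAD segment and flags the ones whose contents were omitted:
import struct

with open('core', 'rb') as f:
    header = f.read(64)                                   # ELF64 header
    assert header[:4] == b'\x7fELF', 'not an ELF file'
    e_phoff = struct.unpack_from('<Q', header, 0x20)[0]   # program header table offset
    e_phentsize = struct.unpack_from('<H', header, 0x36)[0]
    e_phnum = struct.unpack_from('<H', header, 0x38)[0]
    f.seek(e_phoff)
    for _ in range(e_phnum):
        ph = f.read(e_phentsize)
        p_type = struct.unpack_from('<I', ph, 0x00)[0]
        p_vaddr = struct.unpack_from('<Q', ph, 0x10)[0]
        p_filesz = struct.unpack_from('<Q', ph, 0x20)[0]
        p_memsz = struct.unpack_from('<Q', ph, 0x28)[0]
        if p_type == 1:                                   # PT_LOAD
            note = ' (all zeros, not stored)' if p_filesz == 0 else ''
            print(f'LOAD vaddr=0x{p_vaddr:x} filesz=0x{p_filesz:x} memsz=0x{p_memsz:x}{note}')
Its output should line up with readelf -l core and, for a matching live process, with /proc/[pid]/maps.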
Inspecting ELF Structure with readelf
# View the ELF header
$ readelf -h core
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 ...
Class: ELF64
Type: CORE (Core file)
...
# View program headers
$ readelf -l core
Program Headers:
Type Offset VirtAddr ...
NOTE 0x0000000000000350 0x0000000000000000 ...
LOAD 0x0000000000001000 0x0000555555554000 ...
LOAD 0x0000000000002000 0x00007ffff7dd5000 ...
# View notes
$ readelf -n core
Displaying notes found in: core
Owner Data size Description
CORE 0x00000150 NT_PRSTATUS (prstatus structure)
CORE 0x00000088 NT_PRPSINFO (prpsinfo structure)
CORE 0x00000130 NT_AUXV (auxiliary vector)
CORE 0x00000550 NT_FILE (mapped files)
The NT_FILE Note: Understanding Memory Mappings
The NT_FILE note is invaluable—it tells you which shared libraries and files were mapped into the process. This helps you understand:
- Which version of libc was running
- What shared libraries were loaded
- Where the executable was mapped
# Example NT_FILE output:
Page size: 4096
Start End Page Offset File
0x0000555555554000 0x0000555555555000 0x00000000 /path/to/program
0x00007ffff7dd5000 0x00007ffff7f6a000 0x00000000 /lib/x86_64-linux-gnu/libc.so.6
How This Fits in Projects
- Project 6: You’ll work with stripped binaries where understanding ELF structure helps locate functions
- Project 7: You’ll parse the ELF/minidump structure programmatically
- All projects: Understanding where data lives in the core file helps you navigate GDB output
Mental Model Diagram
ELF CORE DUMP ANATOMY
┌────────────────────────────────────────────────────────────────────────┐
│ ELF HEADER (64 bytes) │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Magic: 7F 45 4C 46 Type: CORE Machine: x86_64 Entry: 0x0 │ │
│ │ Program Header Offset: 64 Section Header Offset: 0 (none) │ │
│ └──────────────────────────────────────────────────────────────────┘ │
├────────────────────────────────────────────────────────────────────────┤
│ PROGRAM HEADERS │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ [0] PT_NOTE offset=0x350 vaddr=0x0 filesz=0x8a8 │ │
│ │ [1] PT_LOAD offset=0x1000 vaddr=0x555... filesz=0x1000 │ │
│ │ [2] PT_LOAD offset=0x2000 vaddr=0x7ff... filesz=0x195000 │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────────────────────┘ │
├────────────────────────────────────────────────────────────────────────┤
│ NOTE SEGMENT DATA │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ NT_PRSTATUS: signal=11, pid=1234, regs={rip=0x555..., rsp=...} │ │
│ │ NT_PRPSINFO: state='R', fname="program", args="./program" │ │
│ │ NT_AUXV: AT_PHDR=..., AT_ENTRY=..., AT_BASE=... │ │
│ │ NT_FILE: [0x555...–0x556...] /path/to/program │ │
│ │ [0x7ff...–0x7ff...] /lib/.../libc.so.6 │ │
│ │ NT_FPREGSET: floating point registers │ │
│ │ (For multi-threaded: NT_PRSTATUS for each thread) │ │
│ └──────────────────────────────────────────────────────────────────┘ │
├────────────────────────────────────────────────────────────────────────┤
│ LOAD SEGMENT DATA │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ [Segment 1: Code] .text section from executable │ │
│ │ [Segment 2: Data] .data, .bss, heap │ │
│ │ [Segment 3: Stack] Local variables, return addresses │ │
│ │ [Segment 4: libc] Memory-mapped shared library │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────┘
NOTE: No section headers! Core dumps use program headers only.
Minimal Concrete Example
# Generate a core dump
$ ulimit -c unlimited
$ echo "core" | sudo tee /proc/sys/kernel/core_pattern
$ ./crash
Segmentation fault (core dumped)
# Inspect the ELF structure
$ file core
core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style
$ readelf -h core | head -10
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
Type: CORE (Core file)
$ readelf -n core | grep -A2 "NT_PRSTATUS"
CORE 0x00000150 NT_PRSTATUS (prstatus structure)
# Contains signal 11 (SIGSEGV) and register values
Common Misconceptions
-
“Core dumps have section headers like executables” - No, core dumps typically have 0 section headers. They use program headers (PT_NOTE, PT_LOAD) exclusively.
-
“The entire virtual address space is saved” - No, only mapped pages with actual content are saved. Zero pages and unmapped regions are omitted.
-
“You can run the core dump” - No, it’s not an executable. It’s a memory snapshot that requires the original executable and GDB to interpret.
-
“All notes are in one PT_NOTE segment” - Usually yes, but the notes within that segment are individual records you must parse sequentially.
Check-Your-Understanding Questions
- What is the ELF type field value for core dumps?
- What information is stored in the NT_PRSTATUS note?
- How can you determine which shared libraries were loaded when a process crashed?
- Why do core dumps typically have 0 section headers?
- What does it mean when a PT_LOAD segment has p_filesz=0 but p_memsz > 0?
Check-Your-Understanding Answers
- ET_CORE (value 4). This distinguishes core dumps from executables (ET_EXEC), shared objects (ET_DYN), and relocatable files (ET_REL).
- NT_PRSTATUS contains: the signal that killed the process, PID, PPID, all general-purpose registers (including the instruction pointer RIP and stack pointer RSP), and signal masks.
- The NT_FILE note in the PT_NOTE segment lists all memory-mapped files with their virtual address ranges. Use readelf -n core to view it.
- Section headers are used by linkers and debuggers for executables, but core dumps only need to represent memory layout. Program headers (PT_LOAD) are sufficient for this, and omitting section headers saves space.
- This means the memory region was all zeros (like uninitialized BSS). The kernel doesn’t store zero pages in the core file to save space—GDB knows to treat this region as zeros.
Real-World Applications
- Crash reporting tools - Parse ELF structure to extract register values and stack data
- Forensic analysis - Understand exactly what memory a process had access to
- Custom debugging tools - Build specialized analysis tools by parsing core dumps directly
- Minidump generation - Convert full core dumps to smaller formats for upload
Where You’ll Apply It
- Project 6: Understanding stripped binary structure
- Project 7: Parsing minidumps (similar structure)
- All projects: Knowing where GDB gets its information
References
- Anatomy of an ELF core file
- LIEF ELF Coredump Tutorial
- “Practical Binary Analysis” by Dennis Andriesse - Ch. 2-3
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Ch. 7
Key Insights
A core dump is just another ELF file—but instead of code and data for execution, it contains memory snapshots (PT_LOAD) and metadata (PT_NOTE) for post-mortem analysis.
Summary
Core dumps use ELF format with PT_NOTE segments for metadata (registers, signal, file mappings) and PT_LOAD segments for memory contents. The NT_PRSTATUS note contains registers and signal info. The NT_FILE note lists memory-mapped files. Use readelf -h, readelf -l, and readelf -n to inspect structure.
Homework/Exercises
- Exercise 1: Use readelf -l on a core dump and count the PT_LOAD segments. Correlate them with the output of /proc/[pid]/maps from a running process of the same program.
- Exercise 2: Use readelf -n to extract the signal number from NT_PRSTATUS. Verify it matches the signal you expected.
- Exercise 3: Write a Python script using the struct module to parse the ELF header and count program headers in a core dump.
- Exercise 4: Compare the ELF structure of a core dump vs. the original executable using readelf -h on both.
Solutions to Homework/Exercises
Exercise 1 Solution:
# First, run a program and check its maps
$ ./myprogram &
$ cat /proc/$!/maps
# Note the memory regions
# Then generate a core dump (Ctrl+\ or kill -QUIT)
$ readelf -l core | grep LOAD
# Each PT_LOAD should correspond to a mapped region
Exercise 2 Solution:
$ readelf -n core | grep -A5 "NT_PRSTATUS"
# The signal is stored in the prstatus structure
# Look for "si_signo" or use a hex dump to find signal at known offset
# For SIGSEGV (11):
$ xxd -s 0x350 -l 64 core # Dump bytes near the prstatus note (the offset varies; check readelf -l)
Exercise 3 Solution (outline):
import struct
with open('core', 'rb') as f:
# ELF header is 64 bytes for ELF64
header = f.read(64)
# Parse key fields
magic = header[0:4] # Should be b'\x7fELF'
phoff = struct.unpack('<Q', header[32:40])[0] # Program header offset
phnum = struct.unpack('<H', header[56:58])[0] # Number of program headers
print(f"Program headers: {phnum} at offset {phoff}")
Exercise 4 Solution:
$ readelf -h ./myprogram | grep Type
Type: EXEC (Executable file)
$ readelf -h core | grep Type
Type: CORE (Core file)
# Key differences: Type field, entry point (0 for core), section headers
Chapter 3: GDB for Post-Mortem Debugging
Fundamentals
GDB (GNU Debugger) is the primary tool for analyzing core dumps on Linux. While GDB is commonly used for live debugging (setting breakpoints, stepping through code), post-mortem debugging—loading a core dump after a crash—is fundamentally different: you’re examining a frozen moment in time, not a running process.
In post-mortem mode, you cannot step forward, set breakpoints, or continue execution. Instead, you can inspect the state at crash time: the call stack (backtrace), variable values, memory contents, and register values. The key insight is that a core dump + the original executable + debug symbols together reconstruct the complete picture of what went wrong.
The basic workflow is: gdb <executable> <core-file>. GDB loads the executable to get symbol information and the core file to get the crash state.
Deep Dive
Loading a Core Dump
The fundamental GDB command for core dump analysis is straightforward:
$ gdb ./program core
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
...
Core was generated by `./program'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000555555555149 in main () at crash.c:5
5 *ptr = 42;
(gdb)
GDB immediately shows you:
- Which program generated the core
- Which signal terminated it
- The function and line where it stopped (if symbols are available)
The Backtrace: Your Map of the Crash
The backtrace (or bt) command shows the call stack at crash time:
(gdb) bt
#0 0x0000555555555149 in vulnerable_function (input=0x7fffffffe010 "AAAA") at crash.c:5
#1 0x0000555555555178 in process_data (data=0x7fffffffe010 "AAAA") at crash.c:12
#2 0x00005555555551a2 in main (argc=2, argv=0x7fffffffe108) at crash.c:18
Each frame represents a function call. Frame #0 is where the crash occurred. Higher numbers are callers going back to main (and beyond to _start if you look far enough).
Navigating Stack Frames
You can switch between frames to examine different contexts:
(gdb) frame 1 # Switch to process_data()
#1 0x0000555555555178 in process_data (data=0x7fffffffe010 "AAAA") at crash.c:12
12 vulnerable_function(data);
(gdb) info args # Show function arguments
data = 0x7fffffffe010 "AAAA"
(gdb) info locals # Show local variables
local_buffer = "..."
(gdb) up # Move up one frame (to caller)
(gdb) down # Move down one frame (to callee)
Inspecting Variables and Memory
The print command examines variable values:
(gdb) print ptr
$1 = (int *) 0x0 # NULL pointer - found the bug!
(gdb) print *ptr # Try to dereference
Cannot access memory at address 0x0
(gdb) print my_struct
$2 = {name = "test", value = 42, next = 0x555555558040}
(gdb) print my_struct.name
$3 = "test"
The x (examine) command inspects raw memory:
(gdb) x/16xw $rsp # 16 words in hex, starting at stack pointer
0x7fffffffe000: 0x41414141 0x41414141 0x41414141 0x41414141
(gdb) x/s 0x555555556000 # Examine as string
0x555555556000: "Hello, World!"
(gdb) x/10i $rip # Examine 10 instructions at instruction pointer
0x555555555149 <main+20>: mov DWORD PTR [rax],0x2a
0x55555555514f <main+26>: mov eax,0x0
Format specifiers for x:
- x - hexadecimal
- d - decimal
- s - string
- i - instruction (disassembly)
- c - character
- b/h/w/g - byte/halfword(2)/word(4)/giant(8) sizes
Register Inspection
Registers often reveal the immediate cause of a crash:
(gdb) info registers
rax 0x0 0 # Often holds bad address
rbx 0x0 0
rcx 0x7ffff7f9a6a0 140737353705120
rdx 0x7ffff7f9c4e0 140737353712864
rsi 0x0 0
rdi 0x7fffffffe010 140737488347152
rbp 0x7fffffffe030 0x7fffffffe030
rsp 0x7fffffffe000 0x7fffffffe000 # Stack pointer
rip 0x555555555149 0x555555555149 <main+20> # Instruction pointer
(gdb) print $rip # Access registers with $ prefix
$1 = (void (*)()) 0x555555555149 <main+20>
The Critical Importance of Debug Symbols
Without debug symbols (i.e., if the program was not compiled with -g), you lose:
- Function names (replaced by addresses)
- Variable names and types
- Source file and line information
# WITH symbols:
#0 0x0000555555555149 in main () at crash.c:5
(gdb) print ptr
$1 = (int *) 0x0
# WITHOUT symbols:
#0 0x0000555555555149 in ?? ()
(gdb) print ptr
No symbol "ptr" in current context.
Essential GDB Commands for Core Analysis
| Command | Shortcut | Description |
|---|---|---|
| backtrace | bt | Show call stack |
| backtrace full | bt full | Show stack with local variables |
| frame N | f N | Switch to frame N |
| up / down | | Move up/down the stack |
| info registers | i r | Show CPU registers |
| info args | | Show function arguments |
| info locals | | Show local variables |
| print EXPR | p EXPR | Evaluate and print expression |
| x/FMT ADDR | | Examine memory |
| list | l | Show source code at current location |
| disassemble | disas | Show assembly code |
| info threads | i threads | List all threads |
| thread N | t N | Switch to thread N |
Using GDB’s Python API (Preview)
GDB can be scripted with Python for automation:
# Save as analyze.py, run with: gdb -x analyze.py ./program core
import gdb
gdb.execute("set pagination off")
print("=== Crash Analysis ===")
print(gdb.execute("bt", to_string=True))
print("=== Registers ===")
rip = gdb.parse_and_eval("$rip")
print(f"RIP: {rip}")
This is covered in depth in Project 4.
How This Fits in Projects
- Project 2: Master the basic GDB workflow with backtraces
- Project 3: Use memory inspection to diagnose corruption
- Project 4: Automate GDB with Python scripting
- Project 5: Apply GDB to multi-threaded crashes
- Project 6: Use GDB with stripped binaries
Mental Model Diagram
GDB CORE DUMP WORKFLOW
┌─────────────────────────────────────────────────────────────────────────────┐
│ │
│ INPUT GDB OUTPUT │
│ ┌─────────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Executable │ │ │ │ Crash Location │ │
│ │ (with -g) │──────►│ Symbol │──────────►│ Function names │ │
│ │ │ │ Matching │ │ Line numbers │ │
│ └─────────────────┘ │ │ │ Variable types │ │
│ │ │ └──────────────────┘ │
│ ┌─────────────────┐ │ │ │
│ │ Core Dump │──────►│ State │ ┌──────────────────┐ │
│ │ (memory + │ │ Extraction │──────────►│ Backtrace │ │
│ │ registers) │ │ │ │ Variable values │ │
│ └─────────────────┘ │ │ │ Memory contents │ │
│ │ │ │ Register state │ │
│ └──────────────┘ └──────────────────┘ │
│ │
│ KEY COMMANDS: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ bt → Call stack (where did it crash?) │ │
│ │ frame N → Switch context (examine caller) │ │
│ │ info args → Function parameters (what was passed?) │ │
│ │ info locals → Local variables (what state?) │ │
│ │ print VAR → Variable value (specific data) │ │
│ │ x/FMT ADDR → Raw memory (bit-level view) │ │
│ │ info reg → CPU registers (hardware state) │ │
│ │ list → Source code (if available) │ │
│ │ disas → Assembly (always available) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Minimal Concrete Example
# Compile with debug symbols
$ gcc -g -o crash crash.c
# Generate core dump
$ ulimit -c unlimited
$ ./crash
Segmentation fault (core dumped)
# Analyze with GDB
$ gdb ./crash core
(gdb) bt
#0 0x0000555555555149 in main () at crash.c:5
(gdb) list
1 #include <stdio.h>
2
3 int main(void) {
4 int *ptr = NULL;
5 *ptr = 42; # <-- Crash here
6 return 0;
7 }
(gdb) print ptr
$1 = (int *) 0x0
(gdb) info registers rax rip
rax 0x0 0
rip 0x555555555149 0x555555555149 <main+20>
Common Misconceptions
- “I can step through code in a core dump” - No, the program isn’t running. You can only examine the frozen state. Commands like next, step, and continue don’t work.
- “I need the exact same binary” - Not exact, but the symbols must match. If you have a debug version of the same source, it will work. Different builds may have different addresses.
- “GDB shows me the bug” - GDB shows you the symptom (where it crashed). Finding the cause (often earlier in the program) requires detective work.
- “Missing symbols means I can’t debug” - You can still see addresses and disassembly. It’s harder, but not impossible (covered in Project 6).
Check-Your-Understanding Questions
- What are the two files you need to load a core dump in GDB?
- What does frame #0 represent in a backtrace?
- What command shows local variables in the current stack frame?
- How do you examine 8 bytes of memory at address 0x7fff0000 as hexadecimal?
- Why might print ptr fail with “No symbol” even though the core dump is valid?
Check-Your-Understanding Answers
- The executable (preferably compiled with -g for symbols) and the core dump file. Command: gdb ./program core
- Frame #0 is the innermost frame—the function that was executing when the crash occurred. It’s where the instruction pointer (RIP) was pointing.
- info locals shows local variables. info args shows function arguments.
- x/8xb 0x7fff0000 (8 bytes in hex) or x/xg 0x7fff0000 (one giant/8-byte word in hex)
- The executable was compiled without debug symbols (the -g flag). The core dump is valid and contains the value, but GDB can’t map addresses to variable names without symbols.
Real-World Applications
- Production crash triage - Quickly identify crash location in server applications
- Bug reports - Extract relevant information to share with developers
- Regression testing - Analyze crashes from automated test runs
- Security research - Examine crash state for vulnerability analysis
Where You’ll Apply It
- Project 2: Basic backtrace extraction
- Project 3: Memory inspection for corruption analysis
- Project 4: Scripting GDB for automation
- Project 5: Multi-threaded analysis with info threads
- Project 6: Working without symbols
References
- GDB Manual - Core File Generation
- “The Art of Debugging with GDB, DDD, and Eclipse” by Matloff & Salzman
- Stanford CS107 GDB Guide
- Brendan Gregg’s GDB Tutorial
Key Insights
GDB + core dump = a time machine to the moment of death. You can’t change the past, but you can thoroughly examine what went wrong.
Summary
GDB loads core dumps with gdb <executable> <core>. Essential commands: bt (backtrace), frame N (switch context), print (variables), x/FMT ADDR (raw memory), info registers (CPU state). Debug symbols (-g) are crucial for meaningful output. Post-mortem debugging examines frozen state—you cannot step or continue.
Homework/Exercises
- Exercise 1: Create a program with three nested function calls (main → func1 → func2) where func2 crashes. Use GDB to examine each frame’s local variables.
- Exercise 2: Practice memory examination: write a program that stores the string “DEADBEEF” in a buffer, then examine it with x/8xb, x/2xw, and x/s.
- Exercise 3: Compare the output of bt with and without debug symbols on the same crash. Document the differences.
- Exercise 4: Write down 10 GDB commands and their purposes from memory. Then verify against the documentation.
Solutions to Homework/Exercises
Exercise 1 Solution:
// nested.c
void func2(int *p) { *p = 42; }
void func1(int *p) { func2(p); }
int main() { int *p = 0; func1(p); return 0; }
(gdb) bt
#0 func2 (p=0x0) at nested.c:1
#1 func1 (p=0x0) at nested.c:2
#2 main () at nested.c:3
(gdb) frame 1
(gdb) info args
p = 0x0
Exercise 2 Solution:
int main() { char buf[16] = "DEADBEEF"; volatile int zero = 0; return buf[0] / zero; } /* divide by a runtime zero so the crash happens while buf is in scope */
(gdb) x/8xb buf
0x...: 0x44 0x45 0x41 0x44 0x42 0x45 0x45 0x46 # D E A D B E E F
(gdb) x/2xw buf
0x...: 0x44414544 0x46454542 # Note: little-endian
(gdb) x/s buf
0x...: "DEADBEEF"
Exercise 3 Solution:
With -g: Shows function names, file:line, variable names
Without -g: Shows ?? (), no file/line, “No symbol in context” for prints
Exercise 4 Solution: bt, frame, up, down, print, x, info registers, info locals, info args, list, disassemble, quit
Chapter 4: Multi-Threaded Crash Analysis
Fundamentals
Modern applications are often multi-threaded, which adds complexity to crash analysis. When a multi-threaded program crashes, the core dump captures the state of all threads, not just the one that triggered the crash. Understanding how to navigate between threads and correlate their states is essential for diagnosing concurrency bugs.
The challenge with multi-threaded crashes is that the crashing thread often reveals the symptom (e.g., a NULL pointer dereference), but the cause may be another thread that corrupted shared data or failed to properly synchronize. You must examine all threads to understand the full picture.
GDB provides commands like info threads, thread N, and thread apply all to navigate and inspect multiple threads in a core dump.
Deep Dive
How Threads Appear in Core Dumps
Each thread in a process has its own:
- Stack (with its own local variables and call chain)
- Registers (including its own instruction pointer)
- Thread ID (TID/LWP - Light Weight Process)
In the ELF core dump, each thread gets its own NT_PRSTATUS note containing that thread’s registers. GDB parses these to reconstruct the state of each thread.
Listing All Threads
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7ffff7fb4740 (LWP 12345) main () at main.c:30
2 Thread 0x7ffff7fb3700 (LWP 12346) worker_func () at worker.c:15
3 Thread 0x7ffff7fb2700 (LWP 12347) writer_func () at writer.c:22
The * indicates the “current” thread—the one GDB is focused on. In a crash, this is typically the thread that received the fatal signal.
Switching Between Threads
(gdb) thread 2 # Switch to thread 2
[Switching to thread 2 (Thread 0x7ffff7fb3700 (LWP 12346))]
#0 worker_func () at worker.c:15
(gdb) bt # Now shows thread 2's stack
#0 worker_func () at worker.c:15
#1 thread_entry () at main.c:20
...
Getting All Backtraces at Once
The most powerful command for multi-threaded crash analysis:
(gdb) thread apply all bt
Thread 3 (Thread 0x7ffff7fb2700 (LWP 12347)):
#0 writer_func () at writer.c:22
#1 thread_entry () at main.c:25
...
Thread 2 (Thread 0x7ffff7fb3700 (LWP 12346)):
#0 worker_func () at worker.c:15
#1 thread_entry () at main.c:20
...
Thread 1 (Thread 0x7ffff7fb4740 (LWP 12345)):
#0 0x0000555555555149 in main () at main.c:30
Common Multi-Threaded Bug Patterns
-
Data Race - Two threads access shared data without synchronization, one writes, one reads. The reader may see corrupted or inconsistent data.
-
Use-After-Free Race - Thread A frees memory, Thread B still has a pointer and uses it.
-
Double-Free Race - Two threads each try to free the same memory.
-
Deadlock-Induced Timeout - While not a crash per se, if a program is killed due to a deadlock, the core dump shows threads waiting on locks.
Detecting a Data Race from a Core Dump
Consider this scenario:
// Shared global
char *g_data = NULL;
// Thread 1: Writer
void writer_thread() {
g_data = malloc(100);
strcpy(g_data, "Hello");
}
// Thread 2: Reader (crashes!)
void reader_thread() {
printf("%s\n", g_data); // CRASH if g_data is still NULL
}
In the core dump:
(gdb) thread apply all bt
Thread 2:
#0 reader_thread () at race.c:12
...
Thread 1:
#0 writer_thread () at race.c:6
...
(gdb) thread 2
(gdb) print g_data
$1 = (char *) 0x0 # Still NULL when reader tried to use it!
(gdb) thread 1
(gdb) info locals
# Maybe see that malloc was about to be called
The race condition is revealed: Thread 2 accessed g_data before Thread 1 finished initializing it.
Examining Locks and Synchronization State
For programs using pthreads, you can examine mutex states:
(gdb) print my_mutex
$1 = {__data = {__lock = 1, __count = 0, __owner = 12346, ...}}
# ^^^^^^^^^^^
# This thread (LWP 12346) holds the lock
If a thread is waiting on a lock, its backtrace often shows pthread_mutex_lock or similar.
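A minimal sketch you can use to practice this kind of inspection (a deliberate lock-ordering deadlock; compile with gcc -g -pthread and, once it hangs, force a core with kill -ABRT <pid>):
// deadlock.c - two threads acquire the same mutexes in opposite order
#include <pthread.h>
#include <unistd.h>

pthread_mutex_t mutex_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t mutex_b = PTHREAD_MUTEX_INITIALIZER;

void *thread_one(void *arg) {
    pthread_mutex_lock(&mutex_a);
    sleep(1);                        // give the other thread time to take mutex_b
    pthread_mutex_lock(&mutex_b);    // blocks forever: thread_two holds mutex_b
    return NULL;
}

void *thread_two(void *arg) {
    pthread_mutex_lock(&mutex_b);
    sleep(1);
    pthread_mutex_lock(&mutex_a);    // blocks forever: thread_one holds mutex_a
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread_one, NULL);
    pthread_create(&t2, NULL, thread_two, NULL);
    pthread_join(t1, NULL);          // never returns once the deadlock forms
    pthread_join(t2, NULL);
    return 0;
}
In the resulting core, thread apply all bt shows both worker threads parked inside pthread_mutex_lock, and printing mutex_a and mutex_b reveals each mutex's __owner LWP, which tells you who is waiting on whom.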
How This Fits in Projects
- Project 5: Create and analyze a multi-threaded race condition crash
- Project 10: Handle multi-threaded crashes in your crash reporter
Mental Model Diagram
MULTI-THREADED CRASH ANALYSIS
┌─────────────────────────────────────────────────────────────────────────────┐
│ │
│ SINGLE PROCESS │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Thread 1 (Main) Thread 2 (Worker) Thread 3 (Writer) │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Stack │ │ Stack │ │ Stack │ │ │
│ │ │ Registers │ │ Registers │ │ Registers │ │ │
│ │ │ TID: 12345 │ │ TID: 12346 │ │ TID: 12347 │ │ │
│ │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │ │
│ │ │ │ │ │ │
│ │ └──────────────────────┼─────────────────────┘ │ │
│ │ │ │ │
│ │ ┌───────────▼───────────┐ │ │
│ │ │ SHARED MEMORY │ │ │
│ │ │ - Global variables │ │ │
│ │ │ - Heap │ │ │
│ │ │ - Mutexes │ │ │
│ │ └───────────────────────┘ │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ CRASH IN THREAD 1: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Core dump captures ALL threads' states │ │
│ │ - Each thread has its own NT_PRSTATUS in the core │ │
│ │ - The symptom is in Thread 1, but the cause may be Thread 2 or 3 │ │
│ │ - Use `thread apply all bt` to see all backtraces │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ GDB COMMANDS: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ info threads → List all threads │ │
│ │ thread N → Switch to thread N │ │
│ │ thread apply all bt → Backtrace ALL threads │ │
│ │ thread apply all info locals → All threads' local vars │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Minimal Concrete Example
// race_crash.c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
char *g_data = NULL;
void *writer_thread(void *arg) {
sleep(1); // Simulate slow initialization
g_data = malloc(100);
return NULL;
}
void *reader_thread(void *arg) {
// Race: may execute before writer finishes
printf("Data: %s\n", g_data); // CRASH if g_data is NULL
return NULL;
}
int main() {
pthread_t writer, reader;
pthread_create(&writer, NULL, writer_thread, NULL);
pthread_create(&reader, NULL, reader_thread, NULL);
pthread_join(writer, NULL);
pthread_join(reader, NULL);
return 0;
}
$ gdb ./race_crash core
(gdb) info threads
Id Target Id Frame
* 2 Thread ... (LWP ...) reader_thread () at race_crash.c:15
3 Thread ... (LWP ...) writer_thread () at race_crash.c:10
1 Thread ... (LWP ...) pthread_join () ...
(gdb) thread 2
(gdb) print g_data
$1 = (char *) 0x0 # NULL - writer hasn't finished!
Common Misconceptions
-
“The crashing thread is always the buggy one” - Often the crash is a symptom. Another thread corrupted data that caused this thread to crash.
-
“Core dumps only capture the crashing thread” - No, all threads are captured. Each has its own registers and stack in the dump.
-
“I can see lock contention history” - No, you only see the current state. You can’t see what happened before the crash.
-
“Thread IDs are stable across runs” - No, TIDs (LWP numbers) are assigned by the kernel and vary between runs.
Check-Your-Understanding Questions
- How do you list all threads in a GDB core dump session?
- What does the * mean in info threads output?
- How do you get a backtrace for every thread at once?
- Where in the core dump are individual thread states stored?
- If Thread 1 crashes due to a NULL pointer, how might Thread 2 be responsible?
Check-Your-Understanding Answers
- Use info threads to list all threads with their IDs, target IDs, and current frames.
- The * indicates the currently selected/focused thread—usually the one that received the fatal signal.
- Use thread apply all bt to get backtraces for all threads in one command.
- Each thread has its own NT_PRSTATUS note in the PT_NOTE segment of the core dump, containing that thread’s registers.
- Thread 2 might have been responsible for initializing the pointer but hadn’t done so yet (race condition), or Thread 2 might have freed or corrupted the memory Thread 1 was using.
Key Insights
In multi-threaded crashes, the crashing thread shows the symptom, but the cause may be in any thread. Always examine all threads.
Summary
Multi-threaded core dumps capture all threads’ states. Use info threads to list them, thread N to switch, and thread apply all bt for all backtraces. Data races, use-after-free, and synchronization issues require examining shared state across threads. The crashing thread often isn’t the root cause.
Chapter 5: Kernel Crash Analysis (kdump and crash)
Fundamentals
When the Linux kernel itself crashes (a “kernel panic”), the normal core dump mechanism can’t work—the kernel is the core dump generator. Instead, Linux uses kdump, which boots a secondary “capture kernel” to save the memory of the panicked kernel. The resulting file is called a vmcore.
Analyzing a vmcore requires the crash utility, which is essentially “GDB for the kernel.” It understands kernel data structures and can navigate process lists, examine kernel stacks, and inspect driver state—all from a frozen snapshot of the entire system.
This is advanced material, but understanding the basics gives you powerful debugging capabilities for system-level issues.
Deep Dive
How kdump Works
When the kernel panics, it can’t simply write a file—the file system might be corrupted, and the kernel’s own code might be broken. Instead, kdump uses kexec to immediately boot into a small, pre-loaded “capture kernel” that runs from reserved memory:
Normal Operation:
┌─────────────────────────────────────────────────────────────────┐
│ RUNNING KERNEL │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Normal kernel uses most of memory │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ┌──────────────┐ │
│ │ Reserved │ ← Capture kernel loaded here (crashkernel=) │
│ │ Memory │ │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
After Panic:
┌─────────────────────────────────────────────────────────────────┐
│ CAPTURE KERNEL RUNNING │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Panicked kernel's memory preserved, accessible via │ │
│ │ /proc/vmcore in capture kernel │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ┌──────────────┐ │
│ │ Capture │ ← This small kernel is now running │
│ │ Kernel │ ← Saves /proc/vmcore to disk │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Configuring kdump
- Reserve memory - Add crashkernel=256M (or more) to the kernel command line in GRUB (see the sketch below)
- Install packages - kexec-tools and crash
- Enable the service - systemctl enable kdump
- Get debug symbols - Install kernel-debuginfo or equivalent
# Check kdump status
$ systemctl status kdump
● kdump.service - Crash recovery kernel arming
Active: active (exited)
# Check crashkernel reservation
$ cat /proc/cmdline | grep crashkernel
crashkernel=256M
# List crash dumps (after a panic)
$ ls /var/crash/
127.0.0.1-2024-12-20-15:30:00/
The crash Utility
The crash utility is an interactive tool for analyzing vmcore files:
# Basic invocation
$ crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/.../vmcore
crash> help # List available commands
crash> bt # Backtrace of panicking task
crash> log # Kernel log buffer (dmesg)
crash> ps # Process list at crash time
crash> files <pid> # Open files for a process
crash> vm <pid> # Virtual memory info
crash> struct <name> # Examine kernel structure
crash> quit # Exit
Essential crash Commands
| Command | Description |
|---|---|
| bt | Backtrace of current (panicking) task |
| bt -a | Backtrace of all CPUs |
| log | Kernel log buffer (like dmesg) |
| ps | List all processes |
| files <pid> | Open files for a process |
| vm <pid> | Virtual memory for a process |
| struct <name> <addr> | Display kernel structure |
| mod | List loaded modules |
| kmem -i | Memory usage summary |
| foreach bt | Backtrace every process |
Analyzing a Kernel Panic
When you trigger a panic (e.g., via a buggy kernel module), the output in crash looks like:
crash> bt
PID: 1234 TASK: ffff88810a4d8000 CPU: 1 COMMAND: "insmod"
#0 [ffffc90000a77e30] machine_kexec at ffffffff8100259b
#1 [ffffc90000a77e80] __crash_kexec at ffffffff8110d9ab
#2 [ffffc90000a77f00] panic at ffffffff8106a3e8
#3 [ffffc90000a77f30] oops_end at ffffffff81c01b9a
#4 [ffffc90000a77f80] no_context at ffffffff8104d2ab
#5 [ffffc90000a77ff0] do_page_fault at ffffffff81c0605e
#6 [ffffc90000a77ff8] page_fault at ffffffff82000b9e
#7 [ffffc90000a78050] buggy_init at ffffffffc0670010 [buggy_module]
^^^^^^^^^^^^^^^ YOUR BUGGY CODE
crash> log | tail -20
[ 123.456] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 123.457] #PF: supervisor write access in kernel mode
...
[ 123.465] Kernel panic - not syncing: Fatal exception
How This Fits in Projects
- Project 8: Configure kdump and trigger a kernel panic with a buggy module
- Project 9: Use crash to analyze the resulting vmcore
Mental Model Diagram
KERNEL CRASH ANALYSIS PIPELINE
┌─────────────────────────────────────────────────────────────────────────────┐
│ │
│ 1. PANIC OCCURS 2. KEXEC TRIGGERS 3. CAPTURE & SAVE │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ BUG: NULL ptr │ │ kexec boots │ │ /proc/vmcore │ │
│ │ dereference │ ───► │ capture kernel │ ───► │ saved to │ │
│ │ in kernel code │ │ from reserved │ │ /var/crash/ │ │
│ │ │ │ memory │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │ │
│ ▼ │
│ 4. ANALYSIS 5. INVESTIGATION │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ $ crash vmlinux vmcore │ │
│ │ │ │
│ │ crash> bt # See panic stack trace │ │
│ │ crash> log # See kernel messages │ │
│ │ crash> ps # See running processes │ │
│ │ crash> mod # See loaded modules │ │
│ │ │ │
│ │ Result: Identify buggy code path in kernel/module │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ REQUIREMENTS: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ - crashkernel= parameter in boot command line │ │
│ │ - kdump service enabled and running │ │
│ │ - crash utility installed │ │
│ │ - Kernel debug symbols (vmlinux with debuginfo) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Minimal Concrete Example
// buggy_module.c - A kernel module that causes a panic
#include <linux/module.h>
#include <linux/kernel.h>
static int __init buggy_init(void) {
int *ptr = NULL;
printk(KERN_INFO "About to crash...\n");
*ptr = 42; // PANIC: NULL pointer dereference in kernel mode
return 0;
}
static void __exit buggy_exit(void) {
printk(KERN_INFO "Goodbye\n");
}
module_init(buggy_init);
module_exit(buggy_exit);
MODULE_LICENSE("GPL");
# Build the module
$ make -C /lib/modules/$(uname -r)/build M=$(pwd) modules
# Load it (IN A VM!)
$ sudo insmod buggy_module.ko
# System panics, kdump captures vmcore
# After reboot, analyze
$ sudo crash /usr/lib/debug/.../vmlinux /var/crash/.../vmcore
crash> bt
crash> log
Key Insights
Kernel crashes require a completely different capture mechanism (kdump) because the kernel can’t debug itself. The crash utility is GDB for the entire operating system.
Summary
Kernel panics are captured by kdump, which boots a capture kernel to save memory as a vmcore. The crash utility analyzes vmcore files using commands like bt, log, and ps. This requires crashkernel reservation, kdump service, and kernel debug symbols.
Chapter 6: Automation and Scripting
Fundamentals
Manual crash analysis doesn’t scale. Production systems may generate dozens of crashes daily, and each needs initial triage to determine severity and potential root cause. GDB’s Python API and batch mode enable automation, letting you build scripts that extract key information without human intervention.
Automation serves two purposes: efficiency (processing many dumps quickly) and consistency (every dump gets the same analysis, reducing human error). This chapter covers the techniques used in Project 4 (automated triage) and Project 10 (centralized crash reporter).
Deep Dive
GDB Batch Mode
The simplest automation is GDB’s batch mode, which runs commands from a file:
# commands.gdb
set pagination off
bt
info registers
quit
$ gdb -q --batch -x commands.gdb ./program core
#0 main () at crash.c:5
...
The -q (quiet) flag suppresses the welcome message. --batch exits after running commands.
GDB Python API
For more sophisticated automation, GDB embeds a Python interpreter. Your Python scripts run inside GDB and can access its internals:
# analyze.py - Run with: gdb -q --batch -x analyze.py ./program core
import gdb
# Disable paging
gdb.execute("set pagination off")
# Get backtrace as string
bt = gdb.execute("bt", to_string=True)
print("=== BACKTRACE ===")
print(bt)
# Access registers programmatically
rip = gdb.parse_and_eval("$rip")
rsp = gdb.parse_and_eval("$rsp")
print(f"RIP: {rip}")
print(f"RSP: {rsp}")
# Show the signal disposition table; the fatal signal itself appears in
# GDB's "Program terminated with signal ..." banner when the core loads
try:
    info = gdb.execute("info signals", to_string=True)
    print("=== SIGNALS ===")
    print(info[:500])  # First 500 chars
except gdb.error:
    pass
# Examine a specific address
try:
    mem = gdb.execute("x/16xb $rsp", to_string=True)
    print("=== STACK TOP ===")
    print(mem)
except gdb.error:
    pass
Key GDB Python Functions
| Function | Description |
|---|---|
| gdb.execute(cmd) | Run a GDB command |
| gdb.execute(cmd, to_string=True) | Run command, capture output as string |
| gdb.parse_and_eval(expr) | Evaluate expression, return GDB Value |
| gdb.selected_frame() | Get current stack frame |
| gdb.selected_thread() | Get current thread |
| gdb.inferiors() | List of debugged programs |
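These calls compose naturally. Below is a minimal sketch that combines them to print a compact per-thread backtrace; the file name walk_threads.py is just an example, and it is invoked the same way as analyze.py above (gdb -q --batch -x walk_threads.py ./program core).
# walk_threads.py - iterate threads and frames from inside GDB (illustrative sketch)
import gdb

gdb.execute("set pagination off")

for inferior in gdb.inferiors():          # a core dump has a single inferior
    for thread in inferior.threads():
        thread.switch()                   # make this thread the selected one
        print(f"--- Thread {thread.num} ---")
        frame = gdb.newest_frame()        # innermost frame of the selected thread
        depth = 0
        while frame is not None:
            name = frame.name() or "??"   # None when symbols are missing
            print(f"  #{depth} {name}")
            frame = frame.older()         # walk outward toward main()
            depth += 1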
Crash Fingerprinting
To deduplicate crashes (identify unique bugs vs. repeat occurrences), you need a stable “fingerprint.” A common approach:
- Extract the top 3-5 frames of the backtrace
- Normalize addresses to function names (or offset within function)
- Hash the result
import hashlib
def get_crash_fingerprint(bt_output):
"""Generate a fingerprint from a backtrace."""
lines = bt_output.strip().split('\n')
# Take first 5 frames
frames = []
for line in lines[:5]:
# Extract function name (simplified)
if ' in ' in line:
func = line.split(' in ')[1].split(' ')[0]
frames.append(func)
fingerprint = '|'.join(frames)
return hashlib.md5(fingerprint.encode()).hexdigest()[:16]
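A quick usage check, with a made-up backtrace string, shows that identical call chains always produce the same fingerprint:
sample_bt = """#0  0x00007f3a in __strcmp_avx2 () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x000055e1 in check_operation (priv=0x55e2 "") at memory-mysteries.c:89
#2  0x000055e3 in do_admin_thing () at memory-mysteries.c:134
#3  0x000055e4 in main (argc=2, argv=0x7ffd) at memory-mysteries.c:150"""

print(get_crash_fingerprint(sample_bt))
# Prints a 16-hex-character hash; the same top frames always hash to the
# same value, so repeat crashes of one bug collapse to one fingerprint.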
Building an Analysis Pipeline
A complete automation pipeline might look like:
┌──────────────────────────────────────────────────────────────────┐
│ CRASH ANALYSIS PIPELINE │
├──────────────────────────────────────────────────────────────────┤
│ 1. Core dump arrives (via core_pattern pipe or file watch) │
│ 2. Python script invokes GDB with analysis script │
│ 3. GDB script extracts: │
│ - Backtrace (all threads) │
│ - Registers │
│ - Signal info │
│ - Key variables (if symbols available) │
│ 4. Generate fingerprint from backtrace │
│ 5. Store results: │
│ - Raw core dump (compressed) │
│ - Analysis report (JSON/text) │
│ - Fingerprint for deduplication │
│ 6. Alert if new unique crash or high-severity │
└──────────────────────────────────────────────────────────────────┘
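A minimal sketch of steps 2-5 of this pipeline, assuming plain core files already on disk and a deliberately simple JSON layout (the script name triage.py and the report fields are illustrative, not a fixed format):
# triage.py - minimal pipeline sketch: run GDB in batch mode on a core,
# extract a backtrace, fingerprint it, and write a small JSON report.
import hashlib
import json
import subprocess
import sys

def analyze(executable, core_path):
    result = subprocess.run(
        ["gdb", "-q", "--batch",
         "-ex", "set pagination off",
         "-ex", "thread apply all bt",
         executable, core_path],
        capture_output=True, text=True, timeout=120)
    bt = result.stdout
    # Reuse the fingerprint idea from above: top function names, hashed
    frames = [line.split(" in ")[1].split(" ")[0]
              for line in bt.splitlines() if " in " in line][:5]
    fingerprint = hashlib.md5("|".join(frames).encode()).hexdigest()[:16]
    return {"executable": executable, "core": core_path,
            "fingerprint": fingerprint, "top_frames": frames, "backtrace": bt}

if __name__ == "__main__":
    report = analyze(sys.argv[1], sys.argv[2])
    with open(report["fingerprint"] + ".json", "w") as out:
        json.dump(report, out, indent=2)
    print(report["fingerprint"], report["top_frames"])
A real deployment would add the thread, register, and signal extraction shown earlier with the GDB Python API and feed the fingerprint into whatever deduplication store you use.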
How This Fits in Projects
- Project 4: Build an automated crash triage tool
- Project 10: Create a complete crash reporting system
Key Insights
Automation transforms crash analysis from a manual, ad-hoc process into a systematic pipeline that can handle production scale.
Summary
GDB batch mode runs commands from files. GDB Python API provides programmatic access via gdb.execute() and gdb.parse_and_eval(). Crash fingerprinting uses backtrace frames to identify unique bugs. Automation enables production-scale crash analysis.
Glossary
| Term | Definition |
|---|---|
| Core Dump | A file containing a snapshot of a process’s memory and CPU state at termination |
| Backtrace | The call stack showing the sequence of function calls leading to the current point |
| SIGSEGV | Signal 11, Segmentation Fault—raised when a process accesses invalid memory |
| ELF | Executable and Linkable Format—the binary format for Linux executables and core dumps |
| PT_NOTE | ELF program header type containing metadata (registers, signal, file mappings) |
| PT_LOAD | ELF program header type containing actual memory contents |
| NT_PRSTATUS | Note type containing process status and registers at crash time |
| Debug Symbols | Metadata linking binary addresses to source code (file/line/variable names) |
| ulimit | Shell command to set resource limits, including core dump size (ulimit -c) |
| core_pattern | Kernel parameter (/proc/sys/kernel/core_pattern) controlling dump location |
| systemd-coredump | Modern Linux daemon that captures and manages core dumps |
| coredumpctl | Command-line tool to list and debug systemd-managed core dumps |
| GDB | GNU Debugger—the primary tool for analyzing core dumps on Linux |
| Post-Mortem Debugging | Analyzing a crashed program from its core dump after the fact |
| Stack Frame | A section of the stack containing a function’s local variables and return address |
| kdump | Kernel crash dump mechanism using kexec to boot a capture kernel |
| vmcore | The memory dump file created by kdump after a kernel panic |
| crash | The utility for analyzing kernel crash dumps (vmcore files) |
| kexec | Mechanism to boot into a new kernel without going through firmware |
| Minidump | A compact crash dump format used by Breakpad/Crashpad |
| Symbolication | The process of resolving memory addresses to function/line names |
| Fingerprint | A hash identifying a unique crash type for deduplication |
| Data Race | A bug where two threads access shared data without synchronization |
| LWP | Light Weight Process—another term for a thread’s kernel identifier |
Why Linux Crash Dump Analysis Matters
Modern Relevance
Crash analysis is a fundamental skill for anyone working with systems software. Consider:
- Cloud infrastructure runs millions of processes. When one crashes, operators need fast root cause analysis
- IoT and embedded systems often can’t be debugged live; crash dumps are the only evidence
- Security researchers analyze crashes to find vulnerabilities
- SRE/DevOps teams need to correlate crashes with deployments and load patterns
According to Red Hat’s documentation, the crash utility and kernel debug symbols are essential tools for diagnosing kernel issues in enterprise Linux environments.
Real-World Statistics
- Large-scale services may see thousands of process crashes daily (Facebook’s analysis infrastructure processes millions of crash reports)
- Kernel panics, while rarer, can cause significant outages (each minute of downtime can cost enterprises $5,600 on average per Gartner estimates)
- The average time to diagnose a crash without proper tooling can be hours; with proper crash analysis, minutes
Context and Evolution
Core dumps have existed since the earliest Unix systems (1970s). The name “core” comes from magnetic core memory. While the underlying technology has evolved (from simple memory snapshots to ELF-formatted files with metadata), the concept remains: preserve the state for later analysis.
Modern developments include:
- systemd-coredump (2012+) for centralized management
- Breakpad/Crashpad (Google) for cross-platform minidumps
- Sentry, Bugsnag, Crashlytics for cloud-based crash aggregation
- eBPF for live system analysis (complementing post-mortem)
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Core Dump Fundamentals | Core dumps are ELF files capturing process memory + CPU state. Generation requires ulimit -c and core_pattern. Modern systems use systemd-coredump. |
| ELF Core Format | PT_NOTE segments hold metadata (registers in NT_PRSTATUS, files in NT_FILE). PT_LOAD segments hold memory. No section headers. |
| GDB Post-Mortem | Load with gdb <exe> <core>. Key commands: bt, frame, print, x, info registers. Debug symbols (-g) are essential. |
| Multi-Threaded Analysis | All threads captured in core. Use info threads, thread N, thread apply all bt. Crashing thread shows symptom, cause may be elsewhere. |
| Kernel Crash Analysis | kdump uses kexec to boot capture kernel. vmcore analyzed with crash utility. Requires crashkernel reservation and debug symbols. |
| Automation | GDB batch mode and Python API enable scripted analysis. Fingerprinting identifies unique crashes. Scales to production. |
Project-to-Concept Map
| Project | Concepts Applied |
|---|---|
| Project 1: The First Crash | Core Dump Fundamentals |
| Project 2: The GDB Backtrace | GDB Post-Mortem, Debug Symbols |
| Project 3: The Memory Inspector | GDB Post-Mortem, ELF Core Format |
| Project 4: Automated Crash Detective | Automation, GDB Python API |
| Project 5: Multi-threaded Mayhem | Multi-Threaded Analysis |
| Project 6: Stripped Binary Crash | ELF Core Format, GDB without symbols |
| Project 7: Minidump Parser | ELF Core Format (similar concepts) |
| Project 8: Kernel Panics | Kernel Crash Analysis, kdump |
| Project 9: Analyzing with crash | Kernel Crash Analysis, crash utility |
| Project 10: Centralized Reporter | Automation, Fingerprinting, All concepts |
Deep Dive Reading by Concept
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Core Dumps | “The Linux Programming Interface” Ch. 22 | Definitive coverage of signals and process termination |
| ELF Format | “Practical Binary Analysis” Ch. 2-3 | Deep dive into ELF structure |
| GDB Basics | “The Art of Debugging with GDB” Ch. 1-4 | Foundational GDB skills |
| Memory Layout | “Computer Systems: A Programmer’s Perspective” Ch. 7, 9 | Understanding virtual memory |
| Multi-threading | “The Linux Programming Interface” Ch. 29-30 | POSIX threads and synchronization |
| Kernel Internals | “Linux Kernel Development” by Robert Love | For kernel panic analysis |
| Kernel Modules | “Linux Device Drivers, 3rd Ed” Ch. 1-2 | Writing and debugging modules |
Quick Start: Your First 48 Hours
Day 1: Foundation (4-6 hours)
- Read Theory Primer Chapters 1-3 (2 hours)
- Core Dump Fundamentals
- ELF Core Format (overview)
- GDB Post-Mortem
- Complete Project 1 (2 hours)
- Configure your system for core dumps
- Verify dumps are generated
- Definition of Done: file core* shows ELF core file
- Start Project 2 (1-2 hours)
- Load your first core dump in GDB
- Get a backtrace with line numbers
Day 2: Practical Skills (4-6 hours)
- Finish Project 2 (1 hour)
- Compare with/without debug symbols
- Definition of Done: Can identify crash location by file:line
- Start Project 3 (3-4 hours)
- Create a memory corruption scenario
- Practice print and x commands
- Inspect variables across stack frames
- Review and Practice (1 hour)
- Re-do the GDB homework exercises
- Take notes on commands you find useful
Recommended Learning Paths
Path 1: The Developer (Focus on User-Space)
If you’re a software developer wanting to debug your own applications:
- Project 1 → Project 2 → Project 3 (Weeks 1-2)
- Project 4 (Automation) → Project 5 (Multi-threaded) (Weeks 3-4)
- Project 6 (Stripped binaries) if you work with releases (Week 5)
- Skip kernel projects unless needed
Path 2: The SRE/DevOps Engineer
If you’re managing infrastructure and need to triage crashes:
- Project 1 → Project 2 (quick start) (Week 1)
- Project 4 (Automation - your main tool) (Week 2)
- Project 8 → Project 9 (Kernel crashes) (Weeks 3-4)
- Project 10 (Build your own crash collection) (Weeks 5-6)
Path 3: The Security Researcher
If you’re analyzing crashes for vulnerabilities:
- Project 1 → Project 2 → Project 3 (Weeks 1-2)
- Project 6 (Stripped binaries - common in targets) (Week 3)
- Project 7 (Minidump parsing) (Week 4)
- Deep study of ELF format and memory corruption patterns
Success Metrics
You’ve mastered this material when you can:
- Configure any Linux system for core dump capture in under 5 minutes
- Get a backtrace from a core dump and identify the crash location
- Navigate stack frames and inspect variables/memory in GDB
- Write a Python script that automates basic crash triage
- Analyze a multi-threaded crash and identify cross-thread issues
- Work with stripped binaries using disassembly
- Configure kdump and trigger a test kernel panic (in a VM)
- Use the crash utility to analyze a vmcore
- Explain the ELF core format and what each section contains
- Design a crash reporting pipeline for a production system
Project Overview Table
| # | Project | Difficulty | Time | Key Skill |
|---|---|---|---|---|
| 1 | The First Crash | Beginner | 4-8h | System configuration |
| 2 | The GDB Backtrace | Beginner | 4-8h | Basic GDB workflow |
| 3 | The Memory Inspector | Intermediate | 10-15h | Memory examination |
| 4 | Automated Crash Detective | Intermediate | 15-20h | GDB scripting |
| 5 | Multi-threaded Mayhem | Advanced | 15-20h | Thread analysis |
| 6 | Stripped Binary Crash | Advanced | 15-20h | Disassembly |
| 7 | Minidump Parser | Advanced | 20-30h | Binary parsing |
| 8 | Kernel Panics | Expert | 20-30h | Kernel modules |
| 9 | Analyzing with crash | Expert | 15-20h | Kernel debugging |
| 10 | Centralized Reporter | Master | 40+h | System design |
Project List
The following 10 projects guide you from your first intentional crash to building production-grade crash analysis infrastructure.
Project 1: The First Crash — Understanding Core Dump Generation
- File: P01-first-crash-core-dump-generation.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust (for comparison)
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Systems Configuration, Process Signals
- Software or Tool: ulimit, systemd-coredump, coredumpctl
- Main Book: “The Linux Programming Interface” by Michael Kerrisk
What you will build: A controlled environment that generates, captures, and verifies core dumps from intentional crashes, along with a configuration script that sets up any Linux system for crash capture.
Why it teaches crash dump analysis: Before you can analyze a crash, you must reliably capture one. This project forces you to understand how the kernel decides whether to dump core, where it writes the dump, and how modern Linux (systemd) has changed the traditional core.PID pattern. You’ll learn by intentionally breaking things and verifying the evidence is preserved.
Core challenges you will face:
- Configuring ulimit correctly → Maps to Core Dump Fundamentals (soft vs hard limits, shell vs process)
- Understanding core_pattern → Maps to systemd-coredump integration
- Choosing storage location → Maps to real-world deployment considerations
- Triggering different crash types → Maps to understanding signals (SIGSEGV, SIGABRT, SIGFPE)
Real World Outcome
You will have a shell script that configures any Linux system for core dump capture and a test program that crashes in multiple ways. Running the test will produce visible, verifiable core dumps.
Example Output:
$ ./setup-coredumps.sh
[+] Checking current ulimit -c: 0
[+] Setting ulimit -c unlimited for current shell
[+] Current core_pattern: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
[+] systemd-coredump is active — using coredumpctl
[+] Configuration complete!
$ ./crash-test segfault
[*] About to trigger: SIGSEGV (Segmentation Fault)
[*] Dereferencing NULL pointer...
Segmentation fault (core dumped)
$ coredumpctl list
TIME PID UID GID SIG COREFILE EXE SIZE
Sat 2025-01-04 10:23:45 UTC 1234 1000 1000 SIGSEGV present /home/user/crash-test 245.2K
$ ./crash-test abort
[*] About to trigger: SIGABRT (Abort)
[*] Calling abort()...
Aborted (core dumped)
$ coredumpctl list | tail -2
Sat 2025-01-04 10:23:45 UTC 1234 1000 1000 SIGSEGV present /home/user/crash-test 245.2K
Sat 2025-01-04 10:24:01 UTC 1235 1000 1000 SIGABRT present /home/user/crash-test 246.1K
$ coredumpctl dump 1234 -o /tmp/core.test
$ file /tmp/core.test
/tmp/core.test: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from './crash-test segfault'
The Core Question You Are Answering
“Where does my crashed process’s memory go, and how do I make sure the kernel actually writes it?”
Before writing any code, understand that core dumps are not automatic. The kernel checks multiple conditions: Is RLIMIT_CORE non-zero? Is the executable setuid? Does the process have permission to write to the dump location? Is there enough disk space? Modern systems add another layer: systemd-coredump intercepts dumps before they hit the filesystem. You need to understand this pipeline before you can debug anything.
Concepts You Must Understand First
- Resource Limits (ulimit)
- What is the difference between soft and hard limits?
- Why does
ulimit -c unlimited in a script not affect programs started afterward? - How do you set permanent limits via
/etc/security/limits.conf? - Book Reference: “The Linux Programming Interface” by Kerrisk — Ch. 36
- Signals and Process Termination
- Which signals generate core dumps by default (SIGQUIT, SIGILL, SIGABRT, SIGFPE, SIGSEGV, SIGBUS, SIGSYS, SIGTRAP)?
- How can you catch a signal vs letting it dump core?
- What happens when a signal handler re-raises the signal?
- Book Reference: “The Linux Programming Interface” by Kerrisk — Ch. 20-22
- systemd-coredump
- How does the kernel pipe core dumps to an external program?
- Where does systemd-coredump store files (
/var/lib/systemd/coredump/)? - How does compression (LZ4) affect storage and retrieval?
- Book Reference: systemd-coredump(8) and coredumpctl(1) man pages
Questions to Guide Your Design
- Configuration Detection
- How will your script detect if systemd-coredump is active vs traditional core files?
- What should happen on non-systemd systems (Alpine, older RHEL)?
- Crash Triggering
- How will you trigger each signal type (NULL deref for SIGSEGV, abort() for SIGABRT, 1/0 for SIGFPE)?
- Should you compile with or without optimizations? Why?
- Verification
- How will you verify the dump was actually created (not a zero-size file)?
- What fields in
coredumpctl info prove the dump is usable?
- Cleanup
- How do you remove old test dumps without affecting real crashes?
- What is the coredumpctl retention policy?
Thinking Exercise
Trace the Kernel Path
Before coding, trace what happens when a process dereferences NULL:
1. Process executes: *(int *)0 = 42;
2. CPU raises page fault (address 0 is not mapped)
3. Kernel's page fault handler runs
4. Handler finds no valid mapping → sends SIGSEGV to process
5. Process has no handler for SIGSEGV → default action is "dump core + terminate"
6. Kernel checks RLIMIT_CORE:
- If 0 → no dump, just terminate
- If >0 → proceed
7. Kernel reads /proc/sys/kernel/core_pattern:
- If starts with "|" → pipe to that program (systemd-coredump)
- Otherwise → write to file with that pattern
8. Kernel writes ELF core file with process memory + registers
9. Process terminates, parent receives SIGCHLD
Questions while tracing:
- At which step can you lose the dump?
- What if the pipe to systemd-coredump fails?
- How does the kernel know which memory regions to include?
The Interview Questions They Will Ask
- “A production service crashed but there’s no core dump. Walk me through how you would debug why.”
- “What is the difference between ulimit in .bashrc and /etc/security/limits.conf?”
- “How does systemd-coredump differ from traditional core dumps, and what are the tradeoffs?”
- “Why might a setuid program not generate a core dump even with ulimit -c unlimited?”
- “How would you configure core dumps on a container running in Kubernetes?”
Hints in Layers
Hint 1: Start with Detection
First, detect the current configuration before changing anything. Read /proc/sys/kernel/core_pattern and compare ulimit -c (soft) vs ulimit -Hc (hard).
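If you prefer to prototype the detection step in Python rather than shell, the same two checks look like this (a sketch; the script name is made up):
# check_coredump_config.py - report the two settings that gate core dumps
import resource

def show(v):
    return "unlimited" if v == resource.RLIM_INFINITY else str(v)

soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print(f"RLIMIT_CORE  soft={show(soft)}  hard={show(hard)}")

with open("/proc/sys/kernel/core_pattern") as fh:
    pattern = fh.read().strip()
print(f"core_pattern: {pattern}")

if pattern.startswith("|"):
    print("Dumps are piped to a helper program (systemd-coredump on most distros)")
else:
    print("Dumps are written as files matching the pattern above")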
Hint 2: Handle Both Modes
Your script should work on both systemd and non-systemd systems. Use which coredumpctl or check for the |/usr/lib/systemd/systemd-coredump pattern to detect the mode.
Hint 3: Test Program Structure
main(argc, argv):
if argc < 2:
print usage: "./crash-test [segfault|abort|fpe|bus]"
exit
switch argv[1]:
case "segfault":
print "Triggering SIGSEGV..."
ptr = NULL
*ptr = 42 // crash here
case "abort":
print "Triggering SIGABRT..."
abort()
case "fpe":
print "Triggering SIGFPE..."
volatile x = 0
y = 1 / x // crash here
case "bus":
print "Triggering SIGBUS..."
// Requires unaligned access on strict architectures
Hint 4: Verification Commands Use these commands to verify your setup:
# Check if dump was created (systemd)
coredumpctl list | head -5
# Extract and verify
coredumpctl dump <PID> -o /tmp/test.core
file /tmp/test.core # Should show "ELF 64-bit LSB core file"
# Check size (should be non-zero)
ls -la /tmp/test.core
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Resource Limits | “The Linux Programming Interface” by Kerrisk | Ch. 36: Process Resources |
| Signals | “The Linux Programming Interface” by Kerrisk | Ch. 20-22: Signals |
| Core Dumps | “The Linux Programming Interface” by Kerrisk | Ch. 22.1: Core Dump Files |
| Practical GDB | “The Art of Debugging” by Matloff & Salzman | Ch. 1: Getting Started |
Common Pitfalls and Debugging
Problem 1: “ulimit -c unlimited had no effect”
- Why: ulimit only affects the current shell and its children. Running
sudo ulimit -c unlimited doesn't help: ulimit is a shell builtin, so it either fails or only changes a short-lived root shell. - Fix: Add limit to
/etc/security/limits.conf for persistent change, or run ulimit in the same shell that launches the process. - Quick test:
bash -c 'ulimit -c unlimited; ./crash-test segfault'
Problem 2: “Core dump was created but has 0 bytes”
- Why: The process may have died before the dump completed, or filesystem is full.
- Fix: Check
dmesg | grep -i core for kernel messages. Verify disk space with df -h. - Quick test:
dmesg | tail -20
Problem 3: “coredumpctl shows ‘missing’ in COREFILE column”
- Why: systemd-coredump may have retention policies that deleted old dumps, or the dump failed during capture.
- Fix: Check
/etc/systemd/coredump.conf for MaxUse= and KeepFree= settings. - Quick test:
coredumpctl info <PID> for detailed error messages
Problem 4: “Crash happens but no ‘core dumped’ message”
- Why: Shell might not report core dump status, or signal was caught.
- Fix: Check exit status:
./crash-test segfault; echo "Exit: $?"(139 = 128+11 = SIGSEGV with core) - Quick test: Exit code 139 (SIGSEGV) or 134 (SIGABRT) indicates dump should have occurred
Definition of Done
- Setup script detects and reports current core dump configuration
- Setup script configures ulimit and verifies core_pattern
- Test program triggers at least 3 different signal types (SIGSEGV, SIGABRT, SIGFPE)
- Each crash produces a verifiable core dump (non-zero size, correct ELF type)
coredumpctl listorls core.*shows all test crashes- Script works on at least 2 different Linux distributions (Ubuntu, Fedora, or similar)
- Documentation explains the differences between systemd and traditional core patterns
Project 2: The GDB Backtrace — Extracting Crash Context
- File: P02-gdb-backtrace-crash-context.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Debugging, Stack Traces
- Software or Tool: GDB, debug symbols (-g flag)
- Main Book: “The Art of Debugging with GDB, DDD, and Eclipse” by Matloff & Salzman
What you will build: A debugging workflow that extracts meaningful crash information from core dumps using GDB. You will create programs that crash in various ways and practice the essential GDB commands to understand exactly what went wrong.
Why it teaches crash dump analysis: The backtrace is your first tool when analyzing any crash. But a raw backtrace is often useless without understanding stack frames, argument values, and local variables. This project teaches you to navigate from “Segmentation fault” to “Line 47 of foo.c passed NULL to bar()”.
Core challenges you will face:
- Loading core dumps correctly → Maps to GDB’s
core-file command and executable matching - Reading backtraces with symbols → Maps to understanding compile flags (-g, -O)
- Navigating stack frames → Maps to frame selection and variable inspection
- Comparing with/without debug symbols → Maps to real-world debugging constraints
Real World Outcome
You will have a debugging toolkit that demonstrates GDB’s core dump analysis capabilities. Running your workflow will produce clear, actionable crash information.
Example Output:
$ ./demo-crashes linked-list-corruption
[*] Creating a linked list with 5 nodes
[*] Corrupting node 3's next pointer with garbage
[*] Traversing list (will crash)...
Segmentation fault (core dumped)
$ gdb ./demo-crashes core.12345
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
...
Reading symbols from ./demo-crashes...
[New LWP 12345]
Core was generated by `./demo-crashes linked-list-corruption'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000555555555249 in traverse_list (head=0x5555555592a0) at demo-crashes.c:47
47 printf("Node value: %d\n", current->value);
(gdb) bt
#0 0x0000555555555249 in traverse_list (head=0x5555555592a0) at demo-crashes.c:47
#1 0x0000555555555456 in test_linked_list_corruption () at demo-crashes.c:89
#2 0x00005555555556b2 in main (argc=2, argv=0x7fffffffde68) at demo-crashes.c:142
(gdb) frame 0
#0 0x0000555555555249 in traverse_list (head=0x5555555592a0) at demo-crashes.c:47
47 printf("Node value: %d\n", current->value);
(gdb) print current
$1 = (struct Node *) 0xdeadbeef
(gdb) print *current
Cannot access memory at address 0xdeadbeef
(gdb) info locals
head = 0x5555555592a0
current = 0xdeadbeef
count = 3
(gdb) # The bug: after 3 iterations, current became 0xdeadbeef (our corruption)
(gdb) # Root cause: someone wrote 0xdeadbeef to node[2]->next
The Core Question You Are Answering
“The program crashed—but WHERE exactly, and with WHAT state?”
A crash report saying “Segmentation fault” tells you almost nothing. You need: (1) the exact line of code, (2) the call chain that led there, (3) the values of relevant variables, and (4) the state of memory at the crash point. GDB gives you all of this—if you know the commands.
Concepts You Must Understand First
- Debug Symbols (-g flag)
- What information does
-gembed in the binary? - How does DWARF format map addresses to source lines?
- Why can you still get a backtrace without symbols, just without names?
- Book Reference: “Practical Binary Analysis” by Andriesse — Ch. 2
- What information does
- Stack Frames
- What is stored in each stack frame (return address, saved registers, locals)?
- How does GDB number frames (0 is current, higher is caller)?
- What is the frame pointer (RBP) and how does it help?
- Book Reference: “Computer Systems: A Programmer’s Perspective” by Bryant — Ch. 3.7
- GDB Core Commands
- bt (backtrace) — shows call stack
- frame N — select stack frame N
- info locals — show local variables in current frame
- info args — show function arguments
- print <expr> — evaluate and print expression
- Book Reference: “The Art of Debugging” by Matloff — Ch. 2-3
Questions to Guide Your Design
- Test Case Design
- What crash scenarios will you create (NULL pointer, buffer overflow, use-after-free)?
- How will you ensure the crash happens at a predictable location?
- Symbol Comparison
- How will you demonstrate the difference between
-gand no-g? - What about
-gwith optimization (-O2 -g)?
- How will you demonstrate the difference between
- Variable Inspection
- How will you show variables at different stack depths?
- What happens when you try to print an optimized-out variable?
- Documentation
- What GDB commands will you document for your reference?
- How will you create a “cheat sheet” for common scenarios?
Thinking Exercise
Trace the Stack
Given this call chain that crashes:
void foo(int *p) { *p = 42; } // Crash here if p is NULL
void bar(int x) { if (x < 0) foo(NULL); else { int y = x; foo(&y); } }
void baz() { bar(-1); }
int main() { baz(); return 0; }
Draw the stack at crash time:
High addresses
┌─────────────────┐
│ main's frame │ ← Frame 3
│ (return addr) │
├─────────────────┤
│ baz's frame │ ← Frame 2
│ (return addr) │
├─────────────────┤
│ bar's frame │ ← Frame 1
│ x = -1 │
│ (return addr) │
├─────────────────┤
│ foo's frame │ ← Frame 0 (current)
│ p = NULL │
│ crash point │
└─────────────────┘
Low addresses
Questions while tracing:
- If you only see frame 0, how do you know the real bug is in bar()?
- What would
info argsshow in frame 1? - How would this look different without debug symbols?
The Interview Questions They Will Ask
- “You have a core dump from a production crash. Walk me through your first 5 GDB commands.”
- “What is the difference between
btandbt full?” - “How can you tell if a crash is a NULL pointer dereference vs a use-after-free from the backtrace?”
- “The backtrace shows
??instead of function names. What does that mean and how do you fix it?” - “How do you inspect a variable that GDB says is ‘optimized out’?”
- “What is the frame pointer, and why do some binaries omit it?”
Hints in Layers
Hint 1: Create Multiple Crash Types Start with 3-4 distinct crash scenarios: NULL dereference, stack buffer overflow, heap corruption, double-free. Each teaches different debugging patterns.
Hint 2: Compile Two Versions
Always compile the same source twice: gcc -g -O0 (debug) and gcc -O2 (release). Compare the backtrace quality to understand why debug symbols matter.
Hint 3: Build a Command Reference
Essential GDB Commands for Core Dump Analysis:
- bt : Show backtrace (call stack)
- bt full : Backtrace with local variables
- frame N : Switch to frame N
- up / down : Move one frame up/down
- info locals : Show local variables
- info args : Show function arguments
- print VAR : Print variable value
- print *PTR : Dereference pointer
- print PTR@10 : Print array of 10 elements
- x/10x ADDR : Examine 10 hex words at address
- list : Show source around current line
- info registers : Show CPU registers
Hint 4: Automate with GDB Batch Mode
# Run GDB commands non-interactively
gdb -batch -ex "bt" -ex "info locals" ./program core.123
# Save to file
gdb -batch -ex "bt full" ./program core.123 > crash-report.txt
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| GDB Basics | “The Art of Debugging” by Matloff & Salzman | Ch. 1-4 |
| Stack Frames | “CSAPP” by Bryant & O’Hallaron | Ch. 3.7: Procedures |
| Debug Symbols | “Practical Binary Analysis” by Andriesse | Ch. 2: ELF Format |
| Memory Layout | “CSAPP” by Bryant & O’Hallaron | Ch. 9: Virtual Memory |
Common Pitfalls and Debugging
Problem 1: “GDB says ‘no debugging symbols found’“
- Why: The binary was compiled without
-g, or symbols were stripped. - Fix: Recompile with
gcc -g or locate a debug symbol package (e.g., -dbgsym packages on Ubuntu). - Quick test:
file ./program should say “with debug_info” if symbols present
Problem 2: “Backtrace shows ‘??’ for function names”
- Why: GDB can’t map addresses to symbols. Either wrong executable or missing symbols.
- Fix: Ensure the executable matches the core dump (same build). Use
info shared to check shared library symbols. - Quick test:
readelf -s ./program | head -20 should show symbol table
Problem 3: “Variable shows as ‘optimized out’“
- Why: Compiler optimized away the variable (stored in register, inlined, etc.)
- Fix: Recompile with
-O0 for debugging, or use info registers to find register values. - Quick test:
gcc -g -O0 should preserve all variables
Problem 4: “Core file doesn’t match executable”
- Why: The binary was rebuilt after the crash, changing addresses.
- Fix: Keep the exact binary that crashed alongside the core dump. Use version control or preserve binaries with each release.
- Quick test:
file core.123 shows the original executable path
Definition of Done
- Created at least 4 different crash scenarios (NULL, overflow, use-after-free, etc.)
- Can load core dump in GDB and get full backtrace with symbols
- Documented the difference in backtrace quality with and without
-g - Can navigate frames and inspect variables at each level
- Created a GDB command cheat sheet with examples
- Can use GDB batch mode to generate crash reports non-interactively
- Can explain what each line of a backtrace means
Project 3: The Memory Inspector — Deep State Examination
- File: P03-memory-inspector-deep-state.md
- Main Programming Language: C
- Alternative Programming Languages: C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Memory Layout, Debugging, Forensics
- Software or Tool: GDB, hexdump, /proc filesystem
- Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron
What you will build: A memory forensics toolkit that goes beyond backtraces to examine heap state, corrupted data structures, and memory patterns. You will create programs with subtle memory bugs and use GDB’s memory examination commands to find root causes that backtraces alone can’t reveal.
Why it teaches crash dump analysis: Many crashes don’t happen at the bug—they happen later when corrupted data is used. A backtrace shows you where it died, but memory inspection shows you what went wrong. This is the difference between “it crashed in strcmp()” and “someone wrote past the end of the username buffer 200 lines earlier.”
Core challenges you will face:
- Examining raw memory → Maps to GDB’s
x command and format specifiers - Finding corruption patterns → Maps to recognizing freed memory, stack canaries, guard bytes
- Tracing data structure state → Maps to following pointers and understanding layout
- Correlating addresses with regions → Maps to stack vs heap vs data vs code identification
Real World Outcome
You will have a collection of memory forensics scenarios and the skills to investigate corruption that causes delayed crashes.
Example Output:
$ ./memory-mysteries heap-overflow
[*] Allocating two adjacent buffers
[*] Buffer A: 32 bytes for user input
[*] Buffer B: 32 bytes for privilege level (should be "user")
[*] Overwriting Buffer A with 48 bytes (oops, 16 byte overflow)
[*] Checking privilege level...
[!] Unexpected privilege: 'admin' (Buffer B was corrupted!)
[*] Attempting privileged operation...
Segmentation fault (core dumped)
$ gdb ./memory-mysteries core.5678
(gdb) bt
#0 0x00007ffff7c9b152 in __strcmp_avx2 () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x000055555555543a in check_operation (priv=0x5555555596d0 "") at memory-mysteries.c:89
#2 0x0000555555555678 in do_admin_thing () at memory-mysteries.c:134
...
(gdb) frame 1
#1 0x000055555555543a in check_operation (priv=0x5555555596d0 "") at memory-mysteries.c:89
(gdb) print priv
$1 = 0x5555555596d0 ""
(gdb) x/32xb 0x5555555596d0
0x5555555596d0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5555555596d8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5555555596e0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5555555596e8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
(gdb) # Buffer B is all zeros — was overwritten with null bytes from overflow!
(gdb) # Let's look at Buffer A (32 bytes before)
(gdb) x/48xb 0x5555555596d0-32
0x5555555596b0: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41 <- Buffer A "AAAA..."
0x5555555596b8: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0x5555555596c0: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0x5555555596c8: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41 <- End of A, start of B
0x5555555596d0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 <- Buffer B (corrupted)
0x5555555596d8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
(gdb) # Aha! Buffer A was filled with 'A' (0x41) and the null terminator overflowed into B
(gdb) # The 48-byte string + null terminator overwrote B completely
The Core Question You Are Answering
“The crash site is just the symptom—where is the actual BUG?”
Most memory corruption bugs don’t crash immediately. A heap overflow corrupts an adjacent allocation that isn’t used until minutes later. A use-after-free might work 99 times because the memory hasn’t been reused. The backtrace tells you where the train derailed; memory inspection tells you where the track was sabotaged.
Concepts You Must Understand First
- Process Memory Layout
- Where are stack, heap, data, and code segments?
- How do you identify which region an address belongs to?
- What do typical stack addresses vs heap addresses look like?
- Book Reference: “CSAPP” by Bryant — Ch. 9.7-9.8
- GDB Memory Examination
x/NFU ADDR: N=count, F=format (x,d,c,s,i), U=unit (b,h,w,g)- Common patterns:
x/20xb (20 hex bytes), x/s (string), x/10i (10 instructions) - How to examine memory relative to variables:
x/16xb &buffer - Book Reference: “The Art of Debugging” by Matloff — Ch. 3
- Heap Internals Basics
- How does malloc track allocations (size, flags in chunk headers)?
- What patterns indicate freed memory (0xdeadbeef, 0xfeeefeee, etc.)?
- What is the “red zone” and how does Address Sanitizer use it?
- Book Reference: “Secure Coding in C and C++” by Seacord — Ch. 4
- Data Structure Layout
- How does the compiler arrange struct fields in memory?
- What is padding and alignment?
- How do you correlate
offsetof()with memory examination? - Book Reference: “CSAPP” by Bryant — Ch. 3.9
Questions to Guide Your Design
- Scenario Selection
- What memory bugs will you demonstrate (heap overflow, use-after-free, stack buffer overflow, uninitialized memory)?
- How will you make the corruption visible but not immediately fatal?
- Memory Patterns
- What recognizable patterns will you use (0x41 for ‘A’, 0xDEADBEEF, specific strings)?
- How will you demonstrate that the corruption pattern reveals the source?
- Region Identification
- How will you teach identifying stack vs heap addresses?
- Will you show how to use
/proc/PID/maps or info proc mappings?
- Forensic Workflow
- What systematic approach will you document for investigating corruption?
- How do you trace back from corrupted memory to the corrupting code?
Thinking Exercise
Map the Memory Regions
Given this GDB output, identify each address’s region:
(gdb) print &argc
$1 = (int *) 0x7fffffffde04
(gdb) print buffer
$2 = 0x5555555592a0 "Hello"
(gdb) print main
$3 = {int (int, char **)} 0x555555555169
(gdb) print &global_counter
$4 = (int *) 0x555555558010
Questions:
- Which address is on the stack? (Hint: 0x7fff… is high memory)
- Which address is heap? (Hint: dynamically allocated, similar to code addresses)
- Which is code? (Hint: same region as program counter)
- Which is data segment? (Hint: near code but different section)
Typical x86-64 Linux layout:
0x7fff_xxxx_xxxx Stack (grows down)
0x7f00_xxxx_xxxx Shared libraries
...
0x5555_5555_9xxx Heap (grows up via brk; large mallocs use mmap near the shared libraries)
0x5555_5555_8xxx Data segment (.data, .bss)
0x5555_5555_5xxx Code segment (.text)
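If you would rather not eyeball the layout, a small GDB Python helper can answer the question directly from the core's mapping table (a sketch; the whereis command name is made up, and info proc mappings must be available for the core):
# whereis.py - source this inside GDB; defines a tiny command that reports
# which memory mapping contains a given address.
import gdb

class WhereIs(gdb.Command):
    """whereis ADDR: print the memory mapping that contains ADDR."""
    def __init__(self):
        super().__init__("whereis", gdb.COMMAND_USER)

    def invoke(self, arg, from_tty):
        addr = int(gdb.parse_and_eval(arg))
        maps = gdb.execute("info proc mappings", to_string=True)
        for line in maps.splitlines():
            parts = line.split()
            # mapping lines start with "0x<start> 0x<end> ..."
            if len(parts) >= 2 and parts[0].startswith("0x"):
                start, end = int(parts[0], 16), int(parts[1], 16)
                if start <= addr < end:
                    print(line.strip())
                    return
        print(f"0x{addr:x} is not in any listed mapping")

WhereIs()
Load it with source whereis.py, then try whereis $rsp or whereis buffer.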
The Interview Questions They Will Ask
- “How do you determine if a crash is caused by heap corruption vs stack corruption?”
- “Explain the GDB command
x/20xb $rsp— what does each part mean?” - “You see address 0xdeadbeef in a pointer. What does this typically indicate?”
- “How can you tell if memory was freed before being used?”
- “Walk me through debugging a crash where the backtrace shows the bug is in libc’s malloc.”
- “What is the difference between examining memory with
print vs x in GDB?”
Hints in Layers
Hint 1: Start with Known Patterns
Fill buffers with recognizable patterns before corruption: memset(buf, 'A', size) or fill with sequential bytes. When you see these patterns where they shouldn’t be, you’ve found your corruption.
Hint 2: Compare Before and After Create scenarios where you can examine memory state before and after the corruption. Use GDB breakpoints before the bug to capture “good” state, then compare with the core dump.
Hint 3: GDB Memory Examination Cheat Sheet
x/Nx ADDR - N hex words (4 bytes each)
x/Nxb ADDR - N hex bytes
x/Nxg ADDR - N hex giant words (8 bytes)
x/Ns ADDR - N strings
x/Ni ADDR - N instructions
x/Nc ADDR - N characters
Examples:
x/32xb &buffer - 32 bytes of buffer
x/s $rdi - String at first argument
x/10i $pc - 10 instructions at program counter
x/8xg $rsp - 8 stack slots (64-bit)
Hint 4: Use info Files for Memory Map
(gdb) info files
# Shows all loaded segments and their address ranges
(gdb) info proc mappings
# Shows /proc/PID/maps style output (if available from core)
(gdb) maintenance info sections
# Shows ELF sections with addresses
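The same examination can be scripted when a region is too large to scan by eye. Here is a sketch that uses gdb.selected_inferior().read_memory to look for long runs of a fill byte such as 0x41 ('A'); the function name and the example addresses are illustrative:
# findpattern.py - source inside GDB; scan a memory range for a repeated byte
import gdb

def find_runs(start, length, byte=0x41, min_run=16):
    """Print offsets where `byte` repeats at least `min_run` times in a row."""
    mem = bytes(gdb.selected_inferior().read_memory(start, length))
    run = 0
    for i, b in enumerate(mem):
        run = run + 1 if b == byte else 0
        if run == min_run:
            print(f"run of 0x{byte:02x} starting at 0x{start + i - min_run + 1:x}")

# Example (hypothetical addresses): scan 4 KiB around a suspect heap buffer
# (gdb) python find_runs(0x5555555596b0, 4096)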
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Memory Layout | “CSAPP” by Bryant & O’Hallaron | Ch. 9: Virtual Memory |
| GDB Memory Commands | “The Art of Debugging” by Matloff | Ch. 3: Inspecting Variables |
| Heap Internals | “Secure Coding in C and C++” by Seacord | Ch. 4: Dynamic Memory |
| Stack Frame Layout | “CSAPP” by Bryant & O’Hallaron | Ch. 3.7: Procedures |
| Binary Inspection | “Practical Binary Analysis” by Andriesse | Ch. 5: Binary Analysis Basics |
Common Pitfalls and Debugging
Problem 1: “Memory shows all zeros but program used this data”
- Why: Core dump may not include all memory pages (sparse dump). Or memory was freed.
- Fix: Check if systemd-coredump limits dump size. Use
coredumpctl info to see ProcessState. - Quick test:
coredumpctl info <PID> | grep "Size"— compare with expected memory usage
Problem 2: “Can’t tell stack from heap addresses”
- Why: Without context, addresses are just numbers. Need memory map.
- Fix: Use
info proc mappings in GDB or examine /proc/PID/maps from a live process. - Quick test: Stack addresses typically start with 0x7fff on x86-64 Linux
Problem 3: “Heap metadata looks corrupted”
- Why: Heap overflow overwrote malloc’s bookkeeping, causing strange chunk sizes.
- Fix: This is often the symptom, not the cause. Look for the buffer that overflowed into metadata.
- Quick test: Look for ASCII patterns (like 0x41414141) in chunk headers
Problem 4: “Same crash, different memory contents each run”
- Why: Address Space Layout Randomization (ASLR) changes addresses each run.
- Fix: Disable ASLR for debugging:
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space - Quick test: Re-enable after debugging! (echo 2 for full ASLR)
Definition of Done
- Created at least 3 memory corruption scenarios (heap overflow, use-after-free, uninitialized)
- Can identify memory region (stack/heap/data/code) from any address
- Documented the
x command with multiple format examples - Can trace corruption from crash site back to the bug location
- Demonstrated finding a recognizable pattern in unexpected memory
- Created a memory forensics workflow/checklist
- Can explain heap chunk metadata and what corruption looks like
Project 4: Automated Crash Detective — GDB Scripting and Python API
- File: P04-automated-crash-detective-gdb-scripting.md
- Main Programming Language: Python
- Alternative Programming Languages: GDB Command Scripts, Bash
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Automation, Scripting, Tooling
- Software or Tool: GDB Python API, gdb.Command, batch mode
- Main Book: “The Art of Debugging with GDB, DDD, and Eclipse” by Matloff & Salzman
What you will build: An automated crash analysis tool that takes a core dump and executable as input and produces a structured report with backtrace, local variables, thread state, and memory analysis—all without human interaction. You will learn GDB’s Python API to build reusable debugging automation.
Why it teaches crash dump analysis: Manual GDB debugging doesn’t scale. When you have 100 crashes per day, you need automation. This project teaches you to codify your debugging knowledge into scripts that run consistently and generate reports for further analysis or integration with crash aggregation systems.
Core challenges you will face:
- Learning GDB’s Python API → Maps to gdb.Command, gdb.Frame, gdb.Value
- Extracting structured data → Maps to parsing GDB output vs using API objects
- Handling edge cases → Maps to corrupted frames, missing symbols, stripped binaries
- Generating useful reports → Maps to what information developers actually need
Real World Outcome
You will have a Python-based crash analysis tool that generates JSON or Markdown reports from core dumps automatically.
Example Output:
$ ./crash-analyzer --core ./core.5678 --exe ./myserver --format json
{
"timestamp": "2025-01-04T10:30:45Z",
"executable": "/home/user/myserver",
"signal": "SIGSEGV",
"signal_code": 11,
"crashing_thread": 1,
"total_threads": 4,
"backtrace": [
{
"frame": 0,
"function": "process_request",
"file": "server.c",
"line": 234,
"args": {"req": "0x5555555596d0", "len": "1024"},
"locals": {"buffer": "0x7fffffffdd00", "i": "127"}
},
{
"frame": 1,
"function": "handle_client",
"file": "server.c",
"line": 189,
"args": {"client_fd": "5"}
},
{
"frame": 2,
"function": "main",
"file": "server.c",
"line": 312
}
],
"registers": {
"rip": "0x555555555abc",
"rsp": "0x7fffffffdd00",
"rbp": "0x7fffffffde10"
},
"analysis": {
"crash_type": "null_pointer_dereference",
"likely_cause": "req pointer was NULL at frame 0",
"stack_corrupted": false
}
}
$ ./crash-analyzer --core ./core.5678 --exe ./myserver --format markdown
# Crash Report: myserver
**Signal:** SIGSEGV (Segmentation Fault)
**Time:** 2025-01-04 10:30:45 UTC
**Threads:** 4 (crashed in thread 1)
## Backtrace
| Frame | Function | Location | Arguments |
|-------|----------|----------|-----------|
| 0 | process_request | server.c:234 | req=0x5555555596d0, len=1024 |
| 1 | handle_client | server.c:189 | client_fd=5 |
| 2 | main | server.c:312 | argc=1, argv=... |
## Analysis
**Likely Cause:** NULL pointer dereference at `req` parameter
**Recommendation:** Add null check before line 234
The Core Question You Are Answering
“How do I turn my manual debugging workflow into a repeatable, scriptable process?”
Every time you debug a crash, you run the same commands: bt, info locals, info threads. Why type them manually? A script can do it faster, more consistently, and produce structured output for downstream processing. This is the foundation of crash aggregation systems used by Google, Microsoft, and every major software company.
Concepts You Must Understand First
- GDB Batch Mode
- How does
gdb -batch -x script.gdbwork? - What is the difference between
-x(command file) and-ex(inline command)? - How do you capture output to a file?
- Book Reference: “The Art of Debugging” by Matloff — Ch. 7
- How does
- GDB Python API Fundamentals
- How do you load a Python script in GDB (
source,python-interactive)? - What are gdb.Frame, gdb.Value, gdb.Type, gdb.Symbol?
- How do you iterate stack frames programmatically?
- Book Reference: GDB Manual, Python API chapter (online)
- How do you load a Python script in GDB (
- Creating Custom GDB Commands
- How does the gdb.Command class work?
- How do you handle arguments in custom commands?
- How do you output structured data (JSON) from GDB?
- Book Reference: GDB Documentation — “Extending GDB with Python”
- Error Handling in GDB Scripts
- What happens when gdb.parse_and_eval() fails?
- How do you detect and handle missing debug symbols?
- How do you skip corrupted frames?
- Book Reference: Experience and experimentation
Questions to Guide Your Design
- Input/Output
- What input formats will you support (core file + exe, coredumpctl)?
- What output formats will you produce (JSON, Markdown, plain text)?
- How will you handle command-line arguments?
- Information Extraction
- What information is essential in every report (backtrace, signal, registers)?
- What optional information adds value (locals, threads, memory)?
- How deep should you go into memory inspection?
- Error Handling
- What if the executable doesn’t match the core?
- What if debug symbols are missing?
- What if a stack frame is corrupted?
- Extensibility
- How will others add new analyses?
- Can you support plugins for different crash types?
Thinking Exercise
Design the API
Before coding, design the data model for crash information:
# What fields should a StackFrame have?
class StackFrame:
frame_number: int
function_name: str # or "??" if unknown
file_name: Optional[str]
line_number: Optional[int]
address: int
arguments: Dict[str, str] # name -> value as string
locals: Dict[str, str] # name -> value as string
# What fields should a CrashReport have?
class CrashReport:
executable_path: str
core_path: str
timestamp: datetime
signal: str # "SIGSEGV"
signal_number: int # 11
backtrace: List[StackFrame]
registers: Dict[str, int]
threads: List[ThreadInfo]
analysis: AnalysisResult
Questions:
- How do you get the signal name from the signal number?
- What if a local variable’s value is “optimized out”?
- How do you serialize gdb.Value to JSON?
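The first and third questions have short, self-contained answers worth sketching (the helper names are made up; the gdb.Value fallback simply relies on int()/str() conversions):
import signal

def signal_name(num):
    """Map a signal number to its name, e.g. 11 -> 'SIGSEGV'."""
    try:
        return signal.Signals(num).name
    except ValueError:
        return f"SIG{num}"

def value_to_jsonable(val):
    """Turn a gdb.Value into something json.dumps accepts."""
    try:
        return int(val)   # integers, pointers, enums
    except Exception:
        return str(val)   # structs, strings, '<optimized out>' markers

print(signal_name(11))    # prints: SIGSEGV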
The Interview Questions They Will Ask
- “How would you build a system to analyze 1000 crash dumps per day?”
- “What is the GDB Python API, and when would you use it over command scripts?”
- “How do you handle a crash dump where the executable was compiled without debug symbols?”
- “Design a crash fingerprinting algorithm—how would you identify if two crashes are the same bug?”
- “What information should a crash report contain for a developer to diagnose the issue?”
- “How would you integrate automated crash analysis with a CI/CD pipeline?”
Hints in Layers
Hint 1: Start with Batch Mode Before writing Python, get comfortable with GDB batch mode:
gdb -batch \
-ex "file ./myprogram" \
-ex "core-file ./core.123" \
-ex "bt" \
-ex "info threads"
Capture this output, then parse it (as a stepping stone to the Python API).
Hint 2: Basic Python Script Structure
import gdb
import json
class CrashAnalyzer(gdb.Command):
"""Analyze current core dump and output report."""
def __init__(self):
super().__init__("crash-analyze", gdb.COMMAND_USER)
def invoke(self, arg, from_tty):
report = {}
# Get backtrace
frame = gdb.newest_frame()
frames = []
while frame:
frames.append(self.extract_frame_info(frame))
frame = frame.older()
report["backtrace"] = frames
# Output as JSON
print(json.dumps(report, indent=2))
def extract_frame_info(self, frame):
# ... implement frame extraction
pass
CrashAnalyzer() # Register the command
Hint 3: Handle Missing Symbols Gracefully
def get_function_name(frame):
try:
name = frame.name()
return name if name else "??"
except gdb.error:
return "??"
def get_local_value(name, frame):
try:
val = frame.read_var(name)
return str(val)
except gdb.error as e:
return f"<unavailable: {e}>"
Hint 4: Wrapper Script for Easy Invocation
#!/bin/bash
# crash-analyzer.sh
GDB_SCRIPT="$(dirname $0)/crash_analyzer.py"
gdb -batch \
-ex "source $GDB_SCRIPT" \
-ex "file $2" \
-ex "core-file $1" \
-ex "crash-analyze $3" # Pass format as arg
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| GDB Scripting | “The Art of Debugging” by Matloff | Ch. 7: Scripting |
| Python APIs | GDB Manual (online) | Python API Reference |
| JSON Processing | Python Documentation | json module |
| CLI Design | “The Linux Command Line” by Shotts | Ch. 25: Scripts |
Common Pitfalls and Debugging
Problem 1: “gdb.error: No frame selected”
- Why: You’re trying to access frame data before loading the core file.
- Fix: Ensure the
core-file command runs before your analysis script. - Quick test: Run commands interactively first to verify order
Problem 2: “Cannot convert gdb.Value to JSON”
- Why: gdb.Value objects aren’t JSON-serializable; you need to convert to strings/ints.
- Fix: Use
str(value) or int(value) depending on type. Handle errors for complex types. - Quick test:
print(type(val), str(val)) in your script
Problem 3: “Script works interactively but fails in batch mode”
- Why: Batch mode may have different timing or output buffering.
- Fix: Ensure all gdb commands complete before Python code runs. Use `gdb.execute()` with `to_string=True` to capture output.
- Quick test: Add debug prints to track execution flow
Problem 4: “Missing symbols for system libraries”
- Why: Debug symbols for libc, etc. are in separate packages.
- Fix: Install debug symbol packages (`libc6-dbg` on Debian, `glibc-debuginfo` on Fedora).
- Quick test: `gdb -ex "info sharedlibrary"` to see which libs lack symbols
Definition of Done
- Script loads core dump and executable via command-line arguments
- Produces JSON output with backtrace, signal, and thread info
- Produces human-readable (Markdown) output as alternative
- Handles missing debug symbols gracefully (shows addresses, not errors)
- Handles multiple threads and identifies the crashing thread
- Includes basic analysis (null pointer detection, crash type classification)
- Works with coredumpctl integration (can extract core from systemd storage)
- Documented installation and usage instructions
Project 5: Multi-threaded Mayhem — Analyzing Concurrent Crashes
- File: P05-multi-threaded-mayhem-concurrent-crashes.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Concurrency, Thread Debugging, Race Conditions
- Software or Tool: GDB thread commands, pthreads, helgrind
- Main Book: “The Linux Programming Interface” by Michael Kerrisk
What you will build: A multi-threaded test suite that demonstrates various concurrency bugs (data races, deadlocks, thread-unsafe crashes) and the GDB techniques to diagnose them from core dumps. You’ll learn to navigate thread state in crashes where the symptom is in one thread but the cause is in another.
Why it teaches crash dump analysis: Single-threaded debugging is straightforward—follow the backtrace to the bug. Multi-threaded crashes are puzzles: Thread A crashes because Thread B corrupted shared data. Thread C is deadlocked waiting for a mutex Thread D holds. This project teaches you to think across threads and correlate state.
Core challenges you will face:
- Navigating thread state in GDB → Maps to `info threads`, `thread N`, `thread apply all`
- Understanding thread-specific crash context → Maps to which thread received the signal
- Correlating data across threads → Maps to finding the “other thread” that caused corruption
- Recognizing concurrency bug patterns → Maps to data races, deadlocks, use-after-free across threads
Real World Outcome
You will have a collection of multi-threaded crash scenarios and the skills to diagnose which thread caused the problem, not just which thread crashed.
Example Output:
$ ./thread-chaos race-condition
[*] Spawning 4 threads to increment shared counter
[*] Thread 0 starting...
[*] Thread 1 starting...
[*] Thread 2 starting...
[*] Thread 3 starting...
[*] Thread 1 corrupted shared data structure (intentionally)
[*] Thread 3 accessing corrupted data...
Segmentation fault (core dumped)
$ gdb ./thread-chaos core.9876
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7ffff7fb1000 (LWP 9876) 0x00007ffff7c9b152 in __strcmp_avx2 ()
2 Thread 0x7ffff6fb0700 (LWP 9877) 0x00007ffff7ce2a8d in __lll_lock_wait ()
3 Thread 0x7ffff67af700 (LWP 9878) 0x0000555555555420 in worker_thread ()
4 Thread 0x7ffff5fae700 (LWP 9879) 0x00007ffff7d0b3bf in __GI___nanosleep ()
(gdb) # Thread 1 (LWP 9876) is the crashed thread - marked with *
(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7fb1000 (LWP 9876))]
#0 0x00007ffff7c9b152 in __strcmp_avx2 () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007ffff7c9b152 in __strcmp_avx2 ()
#1 0x00005555555554a3 in process_item (item=0xdeadbeef) at thread-chaos.c:78
#2 0x0000555555555523 in worker_thread (arg=0x0) at thread-chaos.c:95
#3 0x00007ffff7d8b6db in start_thread () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) # item=0xdeadbeef is suspicious - looks like corruption pattern
(gdb) # Let's check all threads' backtraces
(gdb) thread apply all bt
Thread 4 (Thread 0x7ffff5fae700 (LWP 9879)):
#0 0x00007ffff7d0b3bf in __GI___nanosleep ()
#1 0x0000555555555389 in sleepy_thread (arg=0x3) at thread-chaos.c:67
Thread 3 (Thread 0x7ffff67af700 (LWP 9878)):
#0 0x0000555555555420 in worker_thread (arg=0x2) at thread-chaos.c:92
#1 0x00007ffff7d8b6db in start_thread ()
Thread 2 (Thread 0x7ffff6fb0700 (LWP 9877)):
#0 0x00007ffff7ce2a8d in __lll_lock_wait ()
#1 0x00007ffff7ce5393 in pthread_mutex_lock ()
#2 0x0000555555555501 in corrupt_shared_data (arg=0x1) at thread-chaos.c:88
#3 0x00007ffff7d8b6db in start_thread ()
(gdb) # Thread 2 is in corrupt_shared_data! That's likely the culprit
(gdb) # Thread 1 crashed, but Thread 2's function name suggests it corrupted the data
The Core Question You Are Answering
“In a multi-threaded crash, which thread actually CAUSED the problem?”
The thread that receives SIGSEGV is often the victim, not the perpetrator. Thread A writes garbage to a shared pointer; Thread B later dereferences it and crashes. The backtrace for Thread B shows the crash in innocent code. You need to examine ALL threads to find the real bug.
Concepts You Must Understand First
- Thread Representation in Core Dumps
- How are threads captured (all threads, all registers)?
- What is LWP (Light Weight Process)?
- Which thread is marked as the “crashing thread”?
- Book Reference: “The Linux Programming Interface” by Kerrisk — Ch. 29
- GDB Thread Commands
- `info threads` — list all threads with current frame
- `thread N` — switch to thread N
- `thread apply all CMD` — run CMD on each thread
- `thread apply all bt` — backtrace for all threads (most useful!)
- Book Reference: “The Art of Debugging” by Matloff — Ch. 5
- Concurrency Bug Patterns
- Data race: Two threads access same data, at least one writes, no synchronization
- Deadlock: Circular wait on locks (hard to see in crash, easier in live debug)
- Use-after-free across threads: Thread A frees, Thread B uses
- Race on refcount: Thread A decrements to 0 and frees while Thread B still using
- Book Reference: “The Linux Programming Interface” by Kerrisk — Ch. 30
- Mutex and Synchronization State
- How do you examine mutex state in a core dump?
- What does a “waiting on lock” frame look like?
- How do you identify which thread holds a lock?
- Book Reference: glibc/pthread internals documentation
Questions to Guide Your Design
- Scenario Selection
- What concurrency bugs will you demonstrate (race condition, deadlock, cross-thread use-after-free)?
- How will you make the crashes reproducible (or intentionally non-deterministic)?
- Thread Identification
- How will you name or tag threads so they’re identifiable in GDB?
- Will you use pthread_setname_np()?
- Evidence Planting
- How will you make it clear which thread caused the problem (comments, patterns)?
- What corruption patterns will help identify the source?
- Analysis Workflow
- What systematic approach will you document for multi-threaded crash analysis?
- How do you correlate state across threads?
Thinking Exercise
Map Thread Interactions
Given 4 threads accessing a shared linked list:
- Thread 1: Reads nodes
- Thread 2: Adds nodes
- Thread 3: Removes nodes
- Thread 4: Reads nodes
Questions:
- If Thread 1 crashes dereferencing a freed node, which thread might have freed it?
- If Thread 2 crashes while adding a node, could another thread be involved?
- How would you use
thread apply all btto investigate?
Draw the interaction diagram:
Thread 1 (Read) Thread 2 (Add) Thread 3 (Remove) Thread 4 (Read)
| | | |
| ←── shared_list (mutex protected?) ──→ |
| | | |
read(node) add(new) remove(node) read(node)
| | | |
↓ ↓ ↓ ↓
If node freed Race with Frees node If node freed
→ SIGSEGV remove? | → SIGSEGV
↓
If reader still
using → crash
The Interview Questions They Will Ask
- “How do you determine which thread caused a crash in a multi-threaded program?”
- “What is a data race, and how would you detect one from a core dump?”
- “The backtrace shows the crash in a standard library function (strcmp). How do you find your bug?”
- “How do you examine mutex state in a core dump?”
- “Describe a scenario where the crashing thread is NOT where the bug is.”
- “What tools besides GDB help find concurrency bugs? (helgrind, tsan)”
Hints in Layers
Hint 1: Name Your Threads
Use pthread_setname_np() to give threads meaningful names. GDB will show these in info threads:
pthread_setname_np(pthread_self(), "worker-1");
Hint 2: Use Recognizable Corruption Patterns When one thread intentionally corrupts data, use patterns that stand out:
// Corrupting thread:
shared_ptr = (void*)0xDEADBEEF; // Obvious bad pointer
// Or fill with pattern:
memset(shared_buffer, 0x41, size); // All 'A's
Hint 3: Thread Apply All Is Your Friend
(gdb) thread apply all bt # All backtraces
(gdb) thread apply all bt full # All backtraces with locals
(gdb) thread apply all print shared_var # Check shared var in each thread
Hint 4: Look for Mutex Wait Patterns A thread blocked on a mutex will have a frame like:
#0 __lll_lock_wait () at lowlevellock.S:49
#1 pthread_mutex_lock () at pthread_mutex_lock.c:80
This tells you which thread is waiting. Use info threads to find who might be holding the lock.
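This scan is easy to automate once you have Project 4's GDB scripting skills. A rough sketch using the GDB Python API (the helper name and the exact frame-name matching are assumptions; source it inside GDB after loading the core):
import gdb

def threads_waiting_on_locks():
    """Return (thread number, frame name) for threads parked in a lock wait."""
    waiters = []
    for thread in gdb.selected_inferior().threads():
        thread.switch()                  # make this thread GDB's current thread
        frame = gdb.newest_frame()
        while frame is not None:
            name = frame.name() or ""
            if "lll_lock_wait" in name or name == "pthread_mutex_lock":
                waiters.append((thread.num, name))
                break
            frame = frame.older()
    return waiters

for num, fn in threads_waiting_on_locks():
    print(f"Thread {num} is blocked in {fn}")
Every thread not flagged by this scan is a candidate for actually holding the lock; inspect those first.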
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| POSIX Threads | “The Linux Programming Interface” by Kerrisk | Ch. 29-30 |
| Thread Debugging | “The Art of Debugging” by Matloff | Ch. 5: Debugging in Multi-threaded Environments |
| Race Conditions | “C Programming: A Modern Approach” by King | Ch. 19: Program Design |
| Lock-free Patterns | “C++ Concurrency in Action” by Williams | Ch. 5-6 |
Common Pitfalls and Debugging
Problem 1: “All threads show the same backtrace”
- Why: Copy/paste error, or you’re looking at the wrong column in `info threads`.
- Fix: Use `thread apply all bt`, which clearly labels each thread’s backtrace.
- Quick test: Count unique LWP numbers in `info threads`
Problem 2: “Can’t tell which thread holds the mutex”
- Why: Mutex internals are opaque in most debuggers.
- Fix: Look for the thread that’s NOT waiting on the mutex and has the lock in scope. Or examine `pthread_mutex_t` internals (glibc-specific).
- Quick test: `thread apply all print my_mutex` to see state in each context
Problem 3: “Race condition doesn’t reproduce”
- Why: Races are timing-dependent by nature.
- Fix: Add sleeps, use multiple runs, or use `sched_yield()` to increase collision probability.
- Quick test: Run in a loop: `for i in {1..100}; do ./thread-chaos; done`
Problem 4: “Thread that caused corruption has already exited”
- Why: Threads can exit before the crash occurs.
- Fix: Look for evidence in remaining threads or memory. The corruption pattern itself may indicate the source.
- Quick test: Check thread count at crash time vs expected
Definition of Done
- Created at least 3 multi-threaded crash scenarios (race, deadlock attempt, cross-thread UAF)
- Can use `info threads` and `thread apply all bt` effectively
- Can identify the crashing thread vs the culprit thread
- Documented the thread analysis workflow
- Threads have meaningful names visible in GDB
- Demonstrated finding the “other thread” that caused a crash
- Can explain how mutex wait patterns appear in backtraces
Project 6: The Stripped Binary Challenge — Debugging Without Symbols
- File: P06-stripped-binary-debugging-without-symbols.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Assembly
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Reverse Engineering, Assembly, Binary Analysis
- Software or Tool: GDB, objdump, readelf, IDA Free or Ghidra
- Main Book: “Practical Binary Analysis” by Dennis Andriesse
What you will build: A debugging workflow for analyzing crashes in stripped binaries (no debug symbols, no function names). You’ll learn to use disassembly, register analysis, and memory patterns to understand crashes when the backtrace shows only hex addresses.
Why it teaches crash dump analysis: Production binaries are often stripped to reduce size and protect intellectual property. Third-party libraries and closed-source software never have your debug symbols. Security researchers analyze malware that’s intentionally obfuscated. This project teaches you to debug when you have nothing but the binary and the crash.
Core challenges you will face:
- Reading disassembly → Maps to understanding x86-64 instructions
- Correlating addresses to code → Maps to using objdump and readelf
- Understanding calling conventions → Maps to finding function arguments in registers
- Reconstructing context without symbols → Maps to pattern recognition in memory
Real World Outcome
You will be able to extract useful crash information from binaries with no debug symbols—a skill that distinguishes expert debuggers.
Example Output:
$ file ./mystery-server
./mystery-server: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked,
interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, stripped
$ # No debug info! Let's crash it and analyze
$ ./mystery-server &
[1] 4567
$ curl http://localhost:8080/crash
curl: (52) Empty reply from server
$ # Server crashed
$ gdb ./mystery-server core.4567
(gdb) bt
#0 0x0000000000401a3c in ?? ()
#1 0x0000000000401b89 in ?? ()
#2 0x0000000000401df2 in ?? ()
#3 0x00000000004015a1 in ?? ()
#4 0x00007ffff7db3d90 in __libc_start_call_main () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) # All "??" - no symbols. But we can still analyze!
(gdb) x/10i 0x0000000000401a3c
0x401a3c: mov (%rax),%edx # Crashed here - dereferencing RAX
0x401a3e: add %edx,%r12d
0x401a41: add $0x8,%rax
0x401a45: cmp %rax,%rbx
...
(gdb) info registers rax
rax 0x0 0 # RAX is 0! NULL pointer dereference
(gdb) # Let's find what function we're in by looking at frame 1's call site
(gdb) x/5i 0x0000000000401b89-5
0x401b84: call 0x401a20 # Calling our crashing function
0x401b89: mov %eax,%r13d # Return point (frame 1's address)
(gdb) # So 0x401a20 is the start of the crashing function
(gdb) # Let's see what it does
(gdb) x/30i 0x401a20
0x401a20: push %rbp
0x401a21: push %rbx
0x401a22: push %r12
0x401a24: mov %rdi,%rbp # First arg (pointer)
0x401a27: mov %rsi,%rbx # Second arg (count)
...
0x401a3c: mov (%rax),%edx # CRASH: rax derived from arg
The Core Question You Are Answering
“How do I debug a crash when the binary tells me NOTHING?”
Stripped binaries are the norm in production. When something crashes, you won’t have source lines, function names, or variable names. But you still have the CPU state, memory contents, and the binary itself. With assembly knowledge and systematic analysis, you can still find the bug.
Concepts You Must Understand First
- x86-64 Calling Convention
- Arguments 1-6 are in: RDI, RSI, RDX, RCX, R8, R9
- Return value is in RAX
- Callee-saved: RBP, RBX, R12-R15
- Stack grows down, RSP is stack pointer
- Book Reference: “CSAPP” by Bryant — Ch. 3.7
- Basic x86-64 Instructions
- `mov (%rax), %edx` — Load memory at address in RAX into EDX
- `call ADDR` — Call function at ADDR
- `push`/`pop` — Stack operations
- `cmp`/`je`/`jne` — Comparison and conditional jumps
- Book Reference: “CSAPP” by Bryant — Ch. 3
- ELF Structure Without Symbols
- How to find function boundaries (look for `push %rbp; mov %rsp,%rbp`)
- Using .plt and .got for library function identification
- String references can reveal function purpose
- Book Reference: “Practical Binary Analysis” by Andriesse — Ch. 2-3
- GDB Disassembly Commands
- `x/Ni ADDR` — Disassemble N instructions at ADDR
- `disassemble ADDR,+LEN` — Disassemble range
- `info registers` — Show all registers
- `x/s ADDR` — Try to interpret memory as string
- Book Reference: GDB Manual, Examining Memory
Questions to Guide Your Design
- Scenario Creation
- How will you create a stripped binary that crashes in interesting ways?
- Will you keep a non-stripped version for verification?
- Analysis Workflow
- What systematic steps will you follow for stripped binary analysis?
- How will you document the workflow for future reference?
- Tool Integration
- Will you use objdump, readelf, or both?
- Will you introduce a disassembler (Ghidra, IDA Free)?
- Pattern Recognition
- What common patterns will you learn to recognize (function prologues, loops, string operations)?
- How will you identify library calls?
Thinking Exercise
Decode the Crash
Given this GDB output from a stripped binary:
(gdb) bt
#0 0x0000000000401234 in ?? ()
#1 0x00007ffff7ca5678 in strlen () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x0000000000401567 in ?? ()
(gdb) info registers
rax 0x0 0
rdi 0x0 0
Questions:
- Frame 1 is `strlen` from libc. What does this tell you about the crash?
- RDI is 0, and `strlen` takes its argument in RDI. What happened?
- The bug is likely in frame 2 (0x401567). Why?
- What would you examine next?
Analysis:
1. strlen crashed because it received NULL (RDI=0)
2. Frame 2 called strlen with a NULL pointer
3. Next step: disassemble around 0x401567 to see why NULL was passed
4. Look for: mov $0x0,%rdi or a conditional that should have checked for NULL
The Interview Questions They Will Ask
- “You have a core dump from a stripped binary. Walk me through your analysis approach.”
- “What x86-64 registers contain function arguments 1, 2, and 3?”
- “How can you find function boundaries in a stripped binary?”
- “The crash is in libc’s malloc. How do you find the bug in your code?”
- “How do you identify what function a code block implements without symbols?”
- “What tools besides GDB help with stripped binary analysis?”
Hints in Layers
Hint 1: Find Function Starts Look for the standard function prologue:
push %rbp
mov %rsp,%rbp
sub $0xNN,%rsp ; Allocate stack space
Every push %rbp is likely a function entry.
Hint 2: Use Cross-References Find who calls the crashing function:
(gdb) x/5i RETURN_ADDRESS-5
# Look for "call CRASHING_FUNC_ADDR"
Hint 3: Library Calls Are Labeled Even in stripped binaries, calls to libc functions go through PLT:
call 0x401030 <strlen@plt>
GDB shows these names because they’re in the dynamic symbol table.
Hint 4: String References Look for strings to understand purpose:
$ strings -t x ./mystery-server | grep -i error
4a20 Error: invalid input
4b30 Connection error
Then in GDB:
(gdb) x/s 0x404a20
0x404a20: "Error: invalid input"
Find code that references this address.
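Scanning the disassembly for those references can be scripted. A small helper sketch (hypothetical, and assuming objdump's usual habit of annotating rip-relative operands with the resolved address as a trailing "# 404a20" comment):
import subprocess
import sys

def find_references(binary, addr):
    """Print every disassembled instruction that mentions the given address."""
    hex_addr = f"{addr:x}"
    disasm = subprocess.run(["objdump", "-d", binary],
                            capture_output=True, text=True, check=True).stdout
    for line in disasm.splitlines():
        # Catch absolute immediates (0x404a20) and objdump's "# 404a20" comments.
        if f"0x{hex_addr}" in line or f"# {hex_addr}" in line:
            print(line.rstrip())

if __name__ == "__main__":
    # e.g.: python3 find_refs.py ./mystery-server 404a20
    find_references(sys.argv[1], int(sys.argv[2], 16))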
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| x86-64 Assembly | “CSAPP” by Bryant & O’Hallaron | Ch. 3: Machine-Level Representation |
| ELF Format | “Practical Binary Analysis” by Andriesse | Ch. 2: ELF Binary Format |
| Reverse Engineering | “Practical Reverse Engineering” by Dang | Ch. 1-2 |
| GDB Advanced | “The Art of Debugging” by Matloff | Ch. 7: Advanced Topics |
Common Pitfalls and Debugging
Problem 1: “Can’t tell where functions start/end”
- Why: Without symbols, there are no boundaries marked.
- Fix: Look for `push %rbp` (function start) and `ret` (function end). Also look for alignment padding (nop instructions).
- Quick test: `objdump -d ./binary | grep -E "(push.*%rbp|retq)"`
Problem 2: “Register values don’t make sense”
- Why: You’re looking at registers after the crash, not before. Some may be corrupted.
- Fix: Focus on callee-saved registers (RBP, RBX, R12-R15) as they preserve values across calls.
- Quick test: Verify stack integrity first
Problem 3: “Can’t find the string that caused the crash”
- Why: String might be on the heap (not in binary) or already freed.
- Fix: Examine memory at the pointer address. Check if it’s a valid address range.
- Quick test: `info proc mappings` to see valid address ranges
Problem 4: “Disassembly looks like garbage”
- Why: You might be disassembling data, not code. Or wrong address.
- Fix: Use the addresses from the backtrace. They’re valid instruction pointers.
- Quick test: `x/1i $pc` should always show a valid instruction
Definition of Done
- Created a stripped binary that crashes in an interesting way
- Can analyze the crash using only GDB and the binary (no source)
- Documented the x86-64 calling convention and key registers
- Can find function boundaries without symbols
- Can identify what libc functions are being called
- Can trace from crash site to the actual bug
- Created a stripped binary analysis workflow/checklist
- Verified findings against the (hidden) source code
Project 7: Minidump Parser — Understanding Compact Crash Formats
- File: P07-minidump-parser-compact-crash-formats.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Binary Parsing, File Formats, Cross-Platform
- Software or Tool: Breakpad, minidump_dump, custom parser
- Main Book: “Practical Binary Analysis” by Dennis Andriesse
What you will build: A parser for Google Breakpad’s minidump format—a compact, cross-platform crash dump format used by Chrome, Firefox, and many other applications. You’ll learn to read the minidump specification and extract crash information from real-world minidump files.
Why it teaches crash dump analysis: Not all crash dumps are ELF core files. Breakpad minidumps are ubiquitous in commercial software because they’re small (kilobytes vs megabytes), cross-platform, and contain just enough information for diagnosis. Understanding this format exposes you to how the industry handles crash reporting at scale.
Core challenges you will face:
- Reading a binary file format specification → Maps to understanding headers, streams, and directories
- Parsing variable-length structures → Maps to handling strings, lists, and nested data
- Extracting platform-specific data → Maps to handling x86-64 vs ARM vs Windows contexts
- Correlating with symbol files → Maps to understanding .sym files and address resolution
Real World Outcome
You will have a working minidump parser that extracts crash information comparable to Google’s minidump_dump tool.
Example Output:
$ ./minidump-parser ./crash.dmp
=== Minidump Analysis: crash.dmp ===
Header:
Signature: MDMP
Version: 0xa793
Stream Count: 12
Timestamp: 2025-01-04 15:30:45 UTC
Crash Info:
Exception Code: EXCEPTION_ACCESS_VIOLATION (0xc0000005)
Crash Address: 0x00007ff6123456ab
Crashing Thread: 0
Threads (4 total):
Thread 0 (Crashed):
Stack: 0x000000c8f9b00000 - 0x000000c8f9c00000
Context:
RIP: 0x00007ff6123456ab
RSP: 0x000000c8f9bff8a0
RBP: 0x000000c8f9bff8f0
Thread 1:
Stack: 0x000000c8f9400000 - 0x000000c8f9500000
Context: [suspended]
Modules (15 loaded):
0x00007ff612340000 - 0x00007ff612380000 myapp.exe
0x00007ffc12340000 - 0x00007ffc12560000 ntdll.dll
0x00007ffc10000000 - 0x00007ffc10200000 kernel32.dll
...
Memory Regions (5 captured):
0x000000c8f9bff800 - 0x000000c8f9c00000 (2048 bytes) - Stack near crash
...
Stack Trace (unsymbolicated):
#0 0x00007ff6123456ab myapp.exe + 0x156ab
#1 0x00007ff612345123 myapp.exe + 0x15123
#2 0x00007ffc12345678 ntdll.dll + 0x5678
The Core Question You Are Answering
“How do crash reporting systems represent crashes in a portable, compact format?”
ELF core dumps are Linux-specific and huge. Minidumps solve this: they’re small, cross-platform, and contain just the information needed for diagnosis. Understanding this format teaches you how production crash reporting actually works at companies like Google, Mozilla, and Microsoft.
Concepts You Must Understand First
- Minidump Format Structure
- Header: Signature, version, stream directory location
- Stream directory: Array of (type, size, offset) entries
- Streams: Thread list, module list, exception, memory, system info
- Book Reference: MSDN Minidump File Format documentation
- Stream Types You’ll Parse
- MINIDUMP_STREAM_TYPE: ThreadListStream, ModuleListStream, ExceptionStream
- Memory streams: MemoryListStream, Memory64ListStream
- Context stream: Thread context (registers) per architecture
- Book Reference: Breakpad source code (src/google_breakpad/common/minidump_format.h)
- CPU Context Structures
- Different for x86, x86-64, ARM, ARM64
- Contains all registers at crash time
- Must know architecture to interpret correctly
- Book Reference: Processor-specific ABI documents
- Symbol Files (.sym)
- Breakpad’s text-based symbol format
- Maps addresses to function names and source lines
- PUBLIC records for function starts, FUNC + line records for details
- Book Reference: Breakpad symbol file format documentation
Questions to Guide Your Design
- Parsing Strategy
- Will you read the entire file into memory or seek to each stream?
- How will you handle endianness (minidumps from Windows are little-endian)?
- Platform Support
- Will you support x86-64 contexts only, or multiple architectures?
- How will you detect the architecture from the minidump?
- Output Format
- What human-readable format will you produce?
- Will you support JSON output for programmatic use?
- Symbol Integration
- Will you support loading .sym files for symbolication?
- How will you match modules to their symbol files?
Thinking Exercise
Map the Minidump Structure
Given a hex dump of a minidump header:
00000000: 4d44 4d50 93a7 0000 0c00 0000 2001 0000 MDMP........
00000010: 0000 0000 xxxx xxxx xxxx xxxx xxxx xxxx ................
Questions:
- What is the signature (first 4 bytes)?
- What is the version (next 4 bytes)?
- How many streams are in the directory?
- Where is the stream directory located?
Decode:
Signature: "MDMP" (0x504d444d in little-endian)
Version: 0x0000a793
Stream Count: 12 (0x0000000c)
Stream Directory RVA: 0x00000120 (288 bytes into file)
The Interview Questions They Will Ask
- “What is a minidump, and how does it differ from a core dump?”
- “Walk me through the structure of a minidump file.”
- “How does Breakpad capture crash information without stopping the process?”
- “What information would you need to symbolicate a minidump stack trace?”
- “How would you design a crash reporting system for a mobile app?”
- “What are the privacy implications of crash dumps, and how do minidumps address them?”
Hints in Layers
Hint 1: Start with the Header The header is fixed-size and tells you where everything else is:
typedef struct {
    uint32_t signature;            // "MDMP"
    uint32_t version;
    uint32_t stream_count;
    uint32_t stream_directory_rva; // Offset to directory
    uint32_t checksum;
    uint32_t timestamp;
    uint64_t flags;
} MINIDUMP_HEADER;
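Since Python is listed as an alternative language for this project, here is a hedged sketch of the same header read with the struct module (field order taken from the C layout above: six uint32 fields followed by one uint64, 32 bytes total):
import struct
import sys

HEADER_FMT = "<IIIIIIQ"                    # little-endian: 6 x uint32 + 1 x uint64
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 32 bytes

def read_header(path):
    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    (signature, version, stream_count,
     stream_dir_rva, checksum, timestamp, flags) = struct.unpack(HEADER_FMT, raw)
    if signature != 0x504D444D:            # "MDMP" read as a little-endian uint32
        raise ValueError("not a minidump")
    return {"version": hex(version), "streams": stream_count,
            "directory_rva": stream_dir_rva, "timestamp": timestamp}

if __name__ == "__main__":
    print(read_header(sys.argv[1]))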
Hint 2: Parse the Stream Directory Each entry tells you the type, size, and location of a stream:
typedef struct {
    uint32_t stream_type; // e.g., ThreadListStream = 3
    uint32_t size;
    uint32_t rva;         // Offset in file
} MINIDUMP_DIRECTORY;
Hint 3: Use Existing Tools for Verification
Breakpad’s minidump_dump tool shows you what to expect:
$ minidump_dump crash.dmp
# Compare your output with this
Hint 4: Handle Variable-Length Strings Minidumps use MINIDUMP_STRING for module names:
typedef struct {
    uint32_t length;      // In bytes, not including terminator
    uint16_t buffer[1];   // UTF-16LE string data
} MINIDUMP_STRING;
Read length bytes, convert from UTF-16LE to your preferred encoding.
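In Python terms (a companion sketch to the header parser above, assuming RVAs are absolute file offsets as described earlier):
import struct

def read_minidump_string(f, rva):
    """Read a MINIDUMP_STRING at the given RVA from an open binary file."""
    f.seek(rva)                                  # RVA = offset from file start
    (length,) = struct.unpack("<I", f.read(4))   # byte length, excluding terminator
    return f.read(length).decode("utf-16-le")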
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Binary Parsing | “Practical Binary Analysis” by Andriesse | Ch. 2-3: Binary Formats |
| File I/O in C | “The C Programming Language” by K&R | Ch. 7: Input and Output |
| Endianness | “CSAPP” by Bryant | Ch. 2.1: Information Storage |
| Windows Internals | “Windows Internals” by Russinovich | Ch. 3: System Mechanisms |
Common Pitfalls and Debugging
Problem 1: “Signature doesn’t match ‘MDMP’”
- Why: You might be reading big-endian or the file isn’t a minidump.
- Fix: Check byte order. Minidumps are little-endian. Verify with the `file` command.
- Quick test: `xxd -l 16 crash.dmp` should show "MDMP" as ASCII
Problem 2: “Stream RVA points to wrong location”
- Why: RVA is relative to file start, not current position. Off-by-one error.
- Fix: Seek to absolute position in file, not relative.
- Quick test: Print RVAs and manually verify with hex editor
Problem 3: “Module names are garbled”
- Why: Minidumps store strings as UTF-16LE, not ASCII.
- Fix: Convert UTF-16LE to UTF-8. The `length` field is in bytes.
- Quick test: Check if bytes alternate with 0x00 (ASCII in UTF-16)
Problem 4: “Context structure size doesn’t match”
- Why: Context structure varies by CPU architecture.
- Fix: Check SystemInfo stream for processor architecture, use correct struct.
- Quick test: Print context size vs expected for the architecture
Definition of Done
- Can read and validate the minidump header
- Can enumerate all streams in the stream directory
- Can parse ThreadListStream and show thread information
- Can parse ModuleListStream and show loaded modules
- Can parse ExceptionStream and show crash details
- Can extract and display CPU context (registers) for crashing thread
- Output matches the `minidump_dump` tool for test files
- Documented the minidump format as learned
Project 8: Kernel Panic Anatomy — Triggering and Capturing with kdump
- File: P08-kernel-panic-kdump-capture.md
- Main Programming Language: C (kernel module)
- Alternative Programming Languages: Shell scripting for setup
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Kernel Development, System Recovery, Enterprise Linux
- Software or Tool: kdump, kexec, crash, kernel module development
- Main Book: “Linux Kernel Development” by Robert Love
What you will build: A controlled kernel panic environment using kdump—the industry-standard mechanism for capturing kernel crash dumps. You’ll write a simple kernel module that triggers a panic on demand and configure kdump to capture the vmcore for analysis.
Why it teaches crash dump analysis: User-space crashes are tame compared to kernel panics. When the kernel crashes, there’s no OS to save you—kdump uses a clever trick (kexec) to boot a minimal “capture kernel” that saves memory before rebooting. This is how enterprise Linux (RHEL, SUSE) handles kernel crashes, and understanding it is essential for systems engineers.
Core challenges you will face:
- Configuring kdump correctly → Maps to kernel parameters, crashkernel reservation
- Writing a kernel module → Maps to basic kernel development
- Understanding kexec → Maps to the boot-into-capture-kernel mechanism
- Handling the vmcore → Maps to where and how the dump is saved
Real World Outcome
You will have a working kdump setup in a VM that reliably captures kernel panics, along with a kernel module that triggers panics on demand for testing.
Example Output:
$ # On a VM configured with kdump
$ sudo kdumpctl status
kdump: Kdump is operational
$ cat /proc/cmdline
... crashkernel=256M ...
$ # Our panic-trigger module
$ ls /sys/kernel/panic_trigger/
trigger
$ # Trigger the panic (VM will crash and reboot)
$ echo 1 | sudo tee /sys/kernel/panic_trigger/trigger
[ 123.456789] panic_trigger: Triggering kernel panic!
[ 123.456790] Kernel panic - not syncing: Manually triggered panic
...
$ # After reboot, kdump captured the vmcore
$ ls /var/crash/127.0.0.1-2025-01-04-15:30:45/
vmcore vmcore-dmesg.txt
$ # Check that it's valid
$ sudo crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
/var/crash/127.0.0.1-2025-01-04-15:30:45/vmcore
crash> bt
PID: 1234 TASK: ffff8881234abcd0 CPU: 0 COMMAND: "tee"
#0 [ffffc90001234e40] machine_kexec at ffffffff81060abc
#1 [ffffc90001234e90] __crash_kexec at ffffffff811234de
#2 [ffffc90001234f00] panic at ffffffff81890123
#3 [ffffc90001234f80] panic_trigger_write at ffffffffc0001234 [panic_trigger]
...
The Core Question You Are Answering
“When the kernel itself crashes, how do you capture the state when there’s no OS to help?”
The kernel is the foundation—when it panics, everything stops. The ingenious solution is kdump: pre-load a second “capture kernel” into reserved memory, and when panic occurs, kexec immediately boots into it. This minimal kernel’s only job is to save the crashed kernel’s memory to disk before rebooting. It’s one of the most elegant debugging mechanisms in systems software.
Concepts You Must Understand First
- Kernel Panic Basics
- What triggers a kernel panic (BUG(), NULL deref in kernel, deadlock)?
- What information is printed to console?
- Why can’t you just write to disk from the panicking kernel?
- Book Reference: “Linux Kernel Development” by Love — Ch. 18: Debugging
- kexec and kdump Architecture
- kexec: Boot a new kernel without going through BIOS
- kdump: Use kexec to boot into a pre-loaded capture kernel
- crashkernel parameter: Reserve memory for the capture kernel
- Book Reference: kernel.org documentation on kdump
- Kernel Module Basics
- Module loading with insmod/modprobe
- init and exit functions
- Sysfs interface for triggering actions
- Book Reference: “Linux Device Drivers” by Corbet — Ch. 1-2
- vmcore Format
- The memory dump created by kdump
- ELF format with special notes
- Contains all kernel memory at panic time
- Book Reference: crash utility documentation
Questions to Guide Your Design
- VM Setup
- What virtualization platform will you use (QEMU, VirtualBox, VMware)?
- How much RAM should you reserve for crashkernel?
- kdump Configuration
- Where should vmcores be saved (local disk, NFS, SSH)?
- What kernel debug symbols do you need?
- Panic Module Design
- What trigger mechanism (sysfs file, procfs, ioctl)?
- Should it support different panic types (BUG, NULL, deadlock)?
- Safety Considerations
- How will you ensure this only runs in a VM?
- What warnings should the module print?
Thinking Exercise
Trace the kdump Boot Sequence
Map out what happens when you trigger a panic with kdump configured:
1. User writes to /sys/kernel/panic_trigger/trigger
2. Kernel module calls panic("...")
3. Kernel enters panic() function:
- Disables interrupts, stops other CPUs
- Prints panic message to console
- Calls __crash_kexec() if kdump is configured
4. __crash_kexec():
- Copies registers and memory info to pre-defined location
- Jumps to the capture kernel (loaded at boot time)
5. Capture kernel boots:
- Minimal initramfs with makedumpfile
- Reads original kernel's memory from /proc/vmcore
- Writes vmcore to /var/crash/
- Reboots into normal kernel
6. After reboot:
- vmcore is available for analysis with crash utility
Questions:
- Why must the capture kernel be pre-loaded (not loaded at panic time)?
- Why does crashkernel memory need to be reserved at boot?
- What if the panic happens in the capture kernel?
The Interview Questions They Will Ask
- “A production server kernel panicked overnight. Walk me through your investigation process.”
- “What is kdump, and how does it differ from regular core dumps?”
- “How does kexec boot a new kernel without going through BIOS?”
- “What is the crashkernel boot parameter, and how do you determine the right value?”
- “What information do you need to analyze a vmcore?”
- “How would you configure kdump to send crash dumps to a remote server?”
Hints in Layers
Hint 1: Start with kdump Configuration Before writing modules, get kdump working with manual triggers:
# Fedora/RHEL
sudo dnf install kexec-tools crash kernel-debuginfo
sudo systemctl enable kdump
# Add crashkernel=256M to kernel command line
sudo reboot
# Test kdump is operational
sudo kdumpctl status
Hint 2: Simple Panic Module Skeleton
// panic_trigger.c
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/sysfs.h>
#include <linux/kobject.h>

static struct kobject *panic_kobj;

static ssize_t trigger_store(struct kobject *kobj,
                             struct kobj_attribute *attr, const char *buf, size_t count)
{
    pr_alert("panic_trigger: Triggering kernel panic!\n");
    panic("Manually triggered panic from panic_trigger module");
    return count; // Never reached
}

static struct kobj_attribute trigger_attr = __ATTR_WO(trigger);

static int __init panic_trigger_init(void) { /* ... */ }
static void __exit panic_trigger_exit(void) { /* ... */ }
Hint 3: Verify Debug Symbols The crash utility needs matching kernel debug symbols:
# Check kernel version
uname -r
# Install debug symbols
# Fedora: sudo dnf debuginfo-install kernel
# Ubuntu: sudo apt install linux-image-$(uname -r)-dbgsym
# Verify
ls /usr/lib/debug/lib/modules/$(uname -r)/vmlinux
Hint 4: Test in VM Only! Add a safety check to your module:
static int __init panic_trigger_init(void)
{
    /* x86-specific: the HYPERVISOR CPU feature bit is set when running under a VMM */
    if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
        pr_err("panic_trigger: Refusing to load on bare metal!\n");
        return -EPERM;
    }
    // ...
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Kernel Modules | “Linux Device Drivers, 3rd Ed” by Corbet | Ch. 1-2: Building Modules |
| Kernel Debugging | “Linux Kernel Development” by Love | Ch. 18: Debugging |
| Kernel Internals | “Understanding the Linux Kernel” by Bovet | Ch. 4: Interrupts |
| System Boot | “How Linux Works” by Ward | Ch. 5: How Linux Boots |
Common Pitfalls and Debugging
Problem 1: “kdump service won’t start”
- Why: crashkernel parameter missing or insufficient memory reserved.
- Fix: Add `crashkernel=256M` (or more) to the kernel command line in GRUB.
- Quick test: `cat /proc/cmdline | grep crashkernel`
Problem 2: “Panic happens but no vmcore created”
- Why: Capture kernel failed to boot or write. Check crash directory.
- Fix: Check `/var/crash/` for partial dumps or errors. Check serial console output.
- Quick test: `journalctl -b -1 | grep -i kdump` (previous boot logs)
Problem 3: “crash says ‘vmcore and vmlinux do not match’”
- Why: Debug symbols are for a different kernel version.
- Fix: Install debug symbols for the exact kernel that crashed.
- Quick test: `crash` prints the kernel release at startup and warns when vmlinux and vmcore do not match
Problem 4: “Module won’t load: ‘Unknown symbol’”
- Why: Missing kernel headers or mismatched versions.
- Fix: Install kernel-devel package for your running kernel.
- Quick test: `ls /lib/modules/$(uname -r)/build`
Definition of Done
- VM configured with kdump operational (`kdumpctl status` shows ready)
- Kernel module created that triggers panic via sysfs
- Successfully triggered panic and vmcore was captured
- vmcore can be opened with crash utility
- Can extract backtrace showing the panic call chain
- Documented the complete kdump setup process
- Module includes safety check to prevent loading on bare metal
Project 9: Analyzing Kernel Crashes with the crash Utility
- File: P09-analyzing-kernel-crashes-crash-utility.md
- Main Programming Language: crash commands (interactive)
- Alternative Programming Languages: crash extensions in C
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Kernel Internals, Debugging, System Analysis
- Software or Tool: crash utility, kernel debug symbols
- Main Book: “Understanding the Linux Kernel” by Bovet & Cesati
What you will build: Expertise in using the crash utility to analyze kernel vmcores. You’ll investigate real kernel panics (from Project 8 and downloaded examples) and learn to navigate kernel data structures, examine process state, and identify root causes of kernel crashes.
Why it teaches crash dump analysis: The crash utility is to kernel debugging what GDB is to user-space. It’s the standard tool used by Red Hat, SUSE, and kernel developers worldwide. Mastering it opens doors to kernel debugging, enterprise support, and deep systems understanding.
Core challenges you will face:
- Navigating crash’s command set → Maps to bt, task, vm, kmem, and dozens more
- Understanding kernel data structures → Maps to task_struct, mm_struct, etc.
- Correlating kernel state → Maps to finding what led to the panic
- Reading kernel source alongside crash → Maps to effective debugging workflow
Real World Outcome
You will have a systematic workflow for analyzing kernel crashes using crash, with a command reference and documented investigation of real vmcores.
Example Output:
$ sudo crash /usr/lib/debug/lib/modules/5.15.0/vmlinux /var/crash/vmcore
crash 8.0.0
...
KERNEL: /usr/lib/debug/lib/modules/5.15.0/vmlinux
DUMPFILE: /var/crash/vmcore
DATE: Sat Jan 4 15:30:45 UTC 2025
UPTIME: 01:23:45
LOAD AVERAGE: 0.50, 0.35, 0.20
TASKS: 256
NODENAME: myserver
RELEASE: 5.15.0-generic
VERSION: #1 SMP
MACHINE: x86_64
MEMORY: 8 GB
PANIC: "Kernel panic - not syncing: Manually triggered panic"
crash> bt
PID: 1234 TASK: ffff888123456780 CPU: 2 COMMAND: "tee"
#0 [ffffc90001234000] machine_kexec at ffffffff81060abc
#1 [ffffc90001234050] __crash_kexec at ffffffff81123def
#2 [ffffc90001234100] panic at ffffffff818901ab
#3 [ffffc90001234180] panic_trigger_write at ffffffffc0123456 [panic_trigger]
#4 [ffffc900012341d0] kernfs_fop_write_iter at ffffffff8132abcd
#5 [ffffc90001234250] vfs_write at ffffffff8128def0
#6 [ffffc90001234290] ksys_write at ffffffff8128f123
#7 [ffffc900012342d0] do_syscall_64 at ffffffff81890456
#8 [ffffc90001234300] entry_SYSCALL_64_after_hwframe at ffffffff82000089
crash> task
PID: 1234 TASK: ffff888123456780 CPU: 2 COMMAND: "tee"
struct task_struct {
state = 0x0,
stack = 0xffffc90001230000,
pid = 1234,
tgid = 1234,
comm = "tee",
...
}
crash> vm
PID: 1234 TASK: ffff888123456780 CPU: 2 COMMAND: "tee"
MM PGD RSS TOTAL_VM
ffff888112233440 ffff88810abcd000 2048k 12288k
VMA START END FLAGS FILE
ffff888100001000 7f0000000000 7f0000021000 8000875 /usr/bin/tee
...
crash> log | tail -20
[ 123.456789] panic_trigger: Triggering kernel panic!
[ 123.456790] Kernel panic - not syncing: Manually triggered panic
The Core Question You Are Answering
“How do I investigate a kernel crash when I have gigabytes of kernel memory and thousands of data structures?”
A vmcore contains everything: all processes, all memory, all kernel state. The challenge is navigation. The crash utility provides commands to examine specific data structures, follow pointers, and correlate state across subsystems. It’s like GDB but for the entire operating system state.
Concepts You Must Understand First
- crash Command Categories
- Process commands: bt, task, ps, foreach
- Memory commands: vm, kmem, rd, wr
- Kernel state: log, timer, irq, runq
- System info: sys, mach, net
- Book Reference: crash whitepaper by Dave Anderson (Red Hat)
- Key Kernel Data Structures
- task_struct: Process descriptor
- mm_struct: Memory management
- inode, dentry: Filesystem
- sk_buff, socket: Networking
- Book Reference: “Understanding the Linux Kernel” by Bovet — throughout
- Kernel Stack Traces
- How to read kernel backtraces
- Identifying the panic function
- Finding the root cause vs symptom
- Book Reference: “Linux Kernel Development” by Love — Ch. 18
- Using Source Code
- Correlating crash output with kernel source
- Using LXR or elixir.bootlin.com
- Understanding inline functions and macros
- Book Reference: Kernel source code itself
Questions to Guide Your Design
- Learning Path
- What commands will you focus on first?
- How will you practice each command?
- Sample Vmcores
- Where will you get vmcores to practice with?
- Will you create different crash types in Project 8?
- Documentation
- What format will your crash command reference take?
- How will you document your analysis workflow?
- Advanced Features
- Will you explore crash extensions?
- Will you write custom crash macros?
Thinking Exercise
Trace a NULL Pointer Panic
Given this crash backtrace:
crash> bt
#0 page_fault at ffffffff81789abc
#1 do_page_fault at ffffffff8101def0
#2 async_page_fault at ffffffff82000123
#3 my_driver_read at ffffffffc0123456 [my_driver]
#4 vfs_read at ffffffff8128abc0
#5 ksys_read at ffffffff8128bcd0
#6 do_syscall_64 at ffffffff81890456
Questions:
- What is the immediate cause of the panic?
- Which frame contains your (or the third-party) code?
- What commands would you use to investigate my_driver_read?
Investigation steps:
crash> dis my_driver_read # Disassemble the function
crash> bt -f # Show frame addresses
crash> x/20x <frame_address> # Examine stack at crash point
crash> task # What process triggered this?
crash> files # What files did it have open?
The Interview Questions They Will Ask
- “Walk me through analyzing a kernel panic using the crash utility.”
- “What does the ‘bt’ command show, and how do you interpret kernel stack traces?”
- “How do you find what process was running when the kernel panicked?”
- “A customer reports a production kernel panic. What information do you need?”
- “How do you examine the contents of a kernel data structure in crash?”
- “What is the difference between ‘rd’ and ‘struct’ commands in crash?”
Hints in Layers
Hint 1: Essential Commands to Learn First
bt # Backtrace of current/specified task
log # Kernel ring buffer (dmesg)
task # Current task's task_struct
ps # Process list
vm # Virtual memory info
kmem -i # Memory usage summary
sys # System information
Hint 2: Investigating a Specific Task
crash> ps | grep suspicious_process
1234 1 0 ffff888123456780 RU 0.5 myprocess
crash> set 1234 # Set context to PID 1234
crash> bt # Backtrace of that process
crash> files # Open files
crash> vm # Memory maps
crash> task -R # Reveal task_struct contents
Hint 3: Examining Memory
crash> kmem -s # Slab cache info
crash> kmem <address> # What does this address point to?
crash> rd <address> 64 # Read 64 bytes at address
crash> struct task_struct <addr> # Interpret as task_struct
crash> list task_struct.tasks -H <head> # Walk a linked list
Hint 4: Using foreach
crash> foreach bt # Backtrace of ALL tasks
crash> foreach RU bt # Backtrace of running tasks only
crash> foreach files # Open files for all processes
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| crash Basics | crash whitepaper by Dave Anderson | Entire document |
| Kernel Internals | “Understanding the Linux Kernel” by Bovet | Ch. 3: Processes |
| Memory Management | “Understanding the Linux Kernel” by Bovet | Ch. 8: Memory |
| Kernel Debugging | “Linux Kernel Development” by Love | Ch. 18: Debugging |
Common Pitfalls and Debugging
Problem 1: “crash says ‘cannot find vmlinux’”
- Why: Debug symbols not installed or wrong path.
- Fix: Install kernel debuginfo package. Use full path to vmlinux.
- Quick test: `find /usr/lib/debug -name "vmlinux*"`
Problem 2: “Backtrace shows only addresses, no function names”
- Why: Missing debug symbols or wrong vmlinux.
- Fix: Ensure vmlinux matches the kernel that crashed exactly.
- Quick test: `crash -s vmlinux vmcore` shows version mismatch warnings
Problem 3: “‘struct’ command shows garbage”
- Why: Wrong address or data structure mismatch.
- Fix: Verify the address with `kmem <addr>` first. Check kernel version.
- Quick test: `kmem -s` to find valid slab addresses to practice with
Problem 4: “Can’t find the process that crashed”
- Why: The crashing context might be interrupt or kernel thread.
- Fix: Check `bt` output for the COMMAND field. Use `ps -k` for kernel threads.
- Quick test: `bt` shows PID and COMMAND at the top
Definition of Done
- Can load a vmcore into crash and get basic info
- Mastered bt, log, task, ps, vm commands
- Can investigate which process/thread triggered a panic
- Can examine kernel data structures with struct command
- Can navigate memory with rd, kmem, and address interpretation
- Created a crash command cheat sheet with examples
- Analyzed at least 3 different types of kernel crashes
- Documented a complete investigation workflow
Project 10: Centralized Crash Reporter — Production-Grade Infrastructure
- File: P10-centralized-crash-reporter-infrastructure.md
- Main Programming Language: Python (backend), C (collector)
- Alternative Programming Languages: Go, Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 5: Master
- Knowledge Area: System Design, Distributed Systems, DevOps
- Software or Tool: systemd-coredump, REST API, PostgreSQL, object storage
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you will build: A complete crash reporting infrastructure that collects crashes from multiple hosts, stores them centrally, generates analysis reports, fingerprints for deduplication, and provides a web interface for browsing crashes. This is a mini-version of Sentry, Crashlytics, or Mozilla’s crash-stats.
Why it teaches crash dump analysis: Everything you’ve learned comes together: core dump capture, GDB analysis, automation, fingerprinting. This project teaches you to think at scale: How do you handle 10,000 crashes per day? How do you identify the top 10 bugs? How do you integrate with alerting systems?
Core challenges you will face:
- Reliable crash collection → Maps to handling partial dumps, network failures
- Scalable storage → Maps to compression, retention, object storage
- Fingerprinting and deduplication → Maps to grouping same bugs together
- Analysis pipeline → Maps to automated report generation at scale
- User interface → Maps to presenting crash data usefully
Real World Outcome
You will have a working crash reporting system that you can deploy for real services, demonstrating end-to-end understanding of production crash analysis.
Example Output:
Web Interface:
┌─────────────────────────────────────────────────────────────────┐
│ Crash Reporter Dashboard Last 24h ▼ │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Total Crashes: 847 Unique Bugs: 23 Hosts Affected: 12 │
│ │
│ Top Crashes (by occurrence) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ #1 NULL_DEREF_process_request_server.c:234 (423 crashes)│ │
│ │ First: Jan 3, 14:22 Last: Jan 4, 08:45 │ │
│ │ Hosts: web-01, web-02, web-03 │ │
│ │ [View Stack] [View Crashes] [Mark Fixed] │ │
│ ├─────────────────────────────────────────────────────────────┤ │
│ │ #2 SEGV_handle_upload_api.c:892 (198 crashes) │ │
│ │ First: Jan 4, 02:15 Last: Jan 4, 08:30 │ │
│ │ Hosts: api-01, api-02 │ │
│ │ [View Stack] [View Crashes] [Mark Fixed] │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
API Output:
$ curl https://crash-reporter.internal/api/v1/crashes/latest
{
"crashes": [
{
"id": "crash-2025-01-04-abc123",
"fingerprint": "NULL_DEREF_process_request_server.c:234",
"timestamp": "2025-01-04T08:45:23Z",
"host": "web-01",
"executable": "/opt/myapp/server",
"signal": "SIGSEGV",
"backtrace": [
{"frame": 0, "function": "process_request", "file": "server.c", "line": 234},
{"frame": 1, "function": "handle_client", "file": "server.c", "line": 189}
],
"analysis": {
"crash_type": "null_pointer_dereference",
"variable": "req",
"recommendation": "Add null check for req parameter"
}
}
],
"total": 847,
"page": 1,
"per_page": 50
}
Collection Agent:
$ sudo crash-agent status
Crash Collection Agent v1.0.0
Status: Running
Server: https://crash-reporter.internal
Collected today: 12 crashes
Queue: 0 pending
$ # When a crash happens:
[2025-01-04 08:45:23] Detected new core dump: /var/lib/systemd/coredump/core.server.1234.xxx
[2025-01-04 08:45:24] Analyzing with GDB...
[2025-01-04 08:45:25] Generated report (fingerprint: NULL_DEREF_process_request_server.c:234)
[2025-01-04 08:45:26] Uploaded to server (crash-2025-01-04-abc123)
[2025-01-04 08:45:26] Core dump retained locally for 7 days
The Core Question You Are Answering
“How do you build a system that turns thousands of raw crashes into actionable insights?”
A crash dump is useless if it sits in /var/crash on one server. You need: collection (get dumps from everywhere), analysis (extract useful information), aggregation (group similar crashes), storage (keep them queryable), and presentation (show developers what matters). This is crash analysis at production scale.
Concepts You Must Understand First
- System Architecture
- Collection agents on each host
- Central server with API and storage
- Analysis workers (possibly distributed)
- Web frontend for humans
- Book Reference: “Designing Data-Intensive Applications” by Kleppmann — Ch. 1-2
- Crash Fingerprinting
- What identifies a unique bug? (crash location, call stack, signal)
- Hash functions for fingerprints
- Handling variability (ASLR, different call paths to same bug)
- Book Reference: Mozilla crash-stats documentation (online)
- Data Pipeline
- Collection: Watch for new core dumps, extract metadata
- Analysis: Run GDB, generate report, compute fingerprint
- Storage: Metadata in database, cores in object storage
- Book Reference: “Data Pipelines with Apache Airflow” patterns
- Scalability Considerations
- Rate limiting (prevent DoS from crash loops)
- Retention policies (can’t keep everything forever)
- Sampling (for very high volume)
- Book Reference: “Site Reliability Engineering” by Google
Questions to Guide Your Design
- Collection
- How will agents discover new crashes (inotify, polling, coredumpctl)?
- What happens if the network is down?
- How do you handle crash loops (same bug crashing repeatedly)?
- Analysis
- Will you analyze on the host or send raw cores to the server?
- How will you handle missing debug symbols?
- What timeout for analysis?
- Storage
- What metadata goes in the database?
- Where do you store actual core dumps (if at all)?
- What’s your retention policy?
- Fingerprinting
- What parts of the stack trace go into the fingerprint?
- How do you handle different code paths to the same bug?
- How do you detect when a bug is “fixed”?
- Interface
- What views do users need (top bugs, recent crashes, specific host)?
- How do you integrate with alerting (PagerDuty, Slack)?
- Can developers mark bugs as “known” or “fixed”?
Thinking Exercise
Design the Data Flow
Trace a crash from occurrence to dashboard:
Host (web-01) Central Server User
│ │ │
├──[1. Crash occurs] │ │
│ core.server.1234.xxx │ │
│ │ │
├──[2. Agent detects crash] │ │
│ inotify on /var/crash │ │
│ │ │
├──[3. Agent analyzes locally] │ │
│ GDB batch → JSON report │ │
│ │ │
├──[4. Agent POSTs to server]────►├──[5. Server receives] │
│ POST /api/v1/crashes │ Validates, stores │
│ {report, fingerprint} │ │
│ │ │
│ ├──[6. Fingerprint lookup] │
│ │ Existing bug or new? │
│ │ │
│ ├──[7. Update aggregates] │
│ │ Increment counter │
│ │ Update last_seen │
│ │ │
│ ├──[8. Trigger alerts] │
│ │ If new bug or spike │
│ │ │ │
│ │ └───────────────────►│
│ │ Slack/PagerDuty │
│ │ │
│ │ ◄──────────┤
│ │ [9. User views │
│ │ dashboard] │
Questions:
- What if step 4 fails (network issue)?
- What if the same crash happens 1000 times in 1 minute?
- How do you handle crashes from different versions of the same software?
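The first question is usually answered with a local spool directory: if the POST in step 4 fails, the agent writes the report to disk and retries later. A minimal sketch (the paths and helper names are illustrative, not a finished design):
import json
import os
import uuid

SPOOL_DIR = "/var/lib/crash-agent/queue"

def enqueue(report):
    """Persist a report locally when the upload fails."""
    os.makedirs(SPOOL_DIR, exist_ok=True)
    path = os.path.join(SPOOL_DIR, f"{uuid.uuid4()}.json")
    with open(path, "w") as f:
        json.dump(report, f)

def flush(upload):
    """Retry queued reports; `upload` returns True on a 2xx response."""
    if not os.path.isdir(SPOOL_DIR):
        return
    for name in os.listdir(SPOOL_DIR):
        path = os.path.join(SPOOL_DIR, name)
        with open(path) as f:
            report = json.load(f)
        if upload(report):
            os.remove(path)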
The Interview Questions They Will Ask
- “Design a crash reporting system for a company with 1000 servers.”
- “How would you implement crash fingerprinting to group similar crashes?”
- “What’s your retention strategy for crash dumps at scale?”
- “How do you handle a ‘crash storm’ where a bug causes thousands of crashes per minute?”
- “How would you integrate crash reporting with your deployment pipeline?”
- “What privacy concerns exist with crash dumps, and how do you address them?”
Hints in Layers
Hint 1: Start with the Agent Build the collection agent first—it’s the foundation:
# crash_agent.py - skeleton
import subprocess
import requests
import json

def watch_for_crashes():
    # Use coredumpctl or inotify to detect new crashes
    pass

def analyze_crash(core_path, exe_path):
    # Run GDB batch mode, return JSON report
    pass

def compute_fingerprint(report):
    # Hash crash location + top N frames
    pass

def upload_crash(report, fingerprint):
    # POST to central server
    pass
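One possible way to fill in the first stub is a polling watcher over systemd-coredump's spool directory (the directory path and the 5-second interval are assumptions; inotify or coredumpctl would serve the same purpose):
import os
import time

COREDUMP_DIR = "/var/lib/systemd/coredump"   # default systemd-coredump location

def watch_for_crashes(handle, interval=5):
    """Call handle(path) for every core file that appears after startup."""
    seen = set(os.listdir(COREDUMP_DIR))
    while True:
        current = set(os.listdir(COREDUMP_DIR))
        for name in sorted(current - seen):
            handle(os.path.join(COREDUMP_DIR, name))   # analyze + upload
        seen = current
        time.sleep(interval)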
Hint 2: Fingerprint Algorithm A simple but effective fingerprint:
import hashlib

def compute_fingerprint(report):
    components = [
        report['signal'],
        report['backtrace'][0]['function'],           # Crash function
        report['backtrace'][0].get('file', 'unknown'),
        report['backtrace'][0].get('line', 0),
    ]
    # Add a few more frames for uniqueness
    for frame in report['backtrace'][1:4]:
        components.append(frame['function'])
    fingerprint_string = '|'.join(str(c) for c in components)
    return hashlib.sha256(fingerprint_string.encode()).hexdigest()[:16]
Hint 3: Database Schema
-- Create bugs first so crashes can reference it
CREATE TABLE bugs (
    fingerprint VARCHAR(64) PRIMARY KEY,
    first_seen TIMESTAMP,
    last_seen TIMESTAMP,
    total_count INTEGER,
    status VARCHAR(32),        -- new, known, fixed
    title VARCHAR(255)         -- Human-readable summary
);

CREATE TABLE crashes (
    id UUID PRIMARY KEY,
    fingerprint VARCHAR(64),
    timestamp TIMESTAMP,
    host VARCHAR(255),
    executable VARCHAR(512),
    signal VARCHAR(32),
    report JSONB,              -- Full analysis report
    FOREIGN KEY (fingerprint) REFERENCES bugs(fingerprint)
);
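Steps 6 and 7 of the data-flow exercise above (fingerprint lookup, aggregate update) then reduce to an upsert against this schema. A sketch assuming psycopg2 and the column names above, where `crash` is a dict with the same keys as the API output:
import psycopg2
from psycopg2.extras import Json

def record_crash(conn, crash):
    """Insert one crash row and bump (or create) its bug aggregate."""
    with conn, conn.cursor() as cur:
        cur.execute("""
            INSERT INTO bugs (fingerprint, first_seen, last_seen, total_count, status)
            VALUES (%(fingerprint)s, %(timestamp)s, %(timestamp)s, 1, 'new')
            ON CONFLICT (fingerprint) DO UPDATE
               SET last_seen   = EXCLUDED.last_seen,
                   total_count = bugs.total_count + 1
            """, crash)
        cur.execute("""
            INSERT INTO crashes (id, fingerprint, timestamp, host, executable, signal, report)
            VALUES (%(id)s, %(fingerprint)s, %(timestamp)s, %(host)s,
                    %(executable)s, %(signal)s, %(report)s)
            """, {**crash, "report": Json(crash["report"])})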
Hint 4: Rate Limiting Crash Loops
import time
from collections import defaultdict

class CrashRateLimiter:
    def __init__(self, max_per_minute=10):
        self.max_per_minute = max_per_minute
        self.fingerprint_counts = defaultdict(list)

    def should_report(self, fingerprint):
        now = time.time()
        # Clean old entries (older than 60 seconds)
        self.fingerprint_counts[fingerprint] = [
            t for t in self.fingerprint_counts[fingerprint]
            if now - t < 60
        ]
        # Check limit
        if len(self.fingerprint_counts[fingerprint]) >= self.max_per_minute:
            return False
        self.fingerprint_counts[fingerprint].append(now)
        return True
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| System Design | “Designing Data-Intensive Applications” by Kleppmann | Ch. 1-3, 10-11 |
| API Design | “REST API Design Rulebook” by Masse | Throughout |
| Database Design | “Database Internals” by Petrov | Ch. 1-2 |
| Operations | “Site Reliability Engineering” by Google | Ch. 4, 12 |
Common Pitfalls and Debugging
Problem 1: “Agent can’t keep up with crash rate”
- Why: Analysis is slow, crashes queue up.
- Fix: Implement rate limiting, skip duplicates within time window, async analysis.
- Quick test: Monitor queue depth over time
Problem 2: “Fingerprints are too specific (every crash is ‘unique’)”
- Why: Including too much variable data (addresses, PIDs).
- Fix: Only use stable components (function names, file/line, signal). Strip addresses (see the normalization sketch below).
- Quick test: Same bug should produce same fingerprint across runs
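A small normalization pass along these lines helps; the regexes are illustrative, not exhaustive:

import re

def normalize_frame(function_name):
    """Strip the variable parts of a frame name so the same bug hashes the same way."""
    name = re.sub(r"\+0x[0-9a-fA-F]+$", "", function_name)  # drop offsets: func+0x1a -> func
    name = re.sub(r"0x[0-9a-fA-F]+", "ADDR", name)          # mask any remaining raw addresses
    name = re.sub(r"\d{4,}", "N", name)                     # mask long numbers (PIDs, request ids)
    return name

Feed normalize_frame(frame['function']) into compute_fingerprint instead of the raw frame name.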
Problem 3: “Database grows too fast”
- Why: Storing too much per crash, no retention.
- Fix: Store summary in DB, full report in object storage. Implement TTL.
- Quick test: Track DB size growth over time
Problem 4: “Can’t analyze crashes without debug symbols”
- Why: Agents might not have symbols for all software.
- Fix: Run a symbol server that agents can query (see the debuginfod sketch below). Fall back to an address-only fingerprint.
- Quick test: Test with stripped binary, verify graceful degradation
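If you do run a debuginfod-compatible symbol server, recent GDB releases can fetch missing symbols on demand; the agent only needs to point the batch run at it. A sketch; the server URL is hypothetical:

import os
import subprocess

def analyze_with_symbol_server(core_path, exe_path):
    """Let GDB pull missing debug info from the symbol server before backtracing."""
    env = dict(os.environ, DEBUGINFOD_URLS="https://debuginfod.example.internal")  # hypothetical URL
    cmd = [
        "gdb", "--batch", "--nx",
        "-ex", "set debuginfod enabled on",
        "-ex", "bt",
        exe_path, core_path,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, env=env, timeout=120).stdout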
Definition of Done
- Collection agent watches for and reports new crashes
- Central server receives and stores crash reports
- Fingerprinting correctly groups same bugs together
- Web interface shows crash list, bug list, and statistics
- API provides programmatic access to crash data
- Rate limiting prevents crash loop DoS
- Retention policy automatically removes old data
- Alerting integration (Slack webhook or similar) works
- Documentation covers deployment and configuration
- System tested with simulated crash load
Project Comparison Table
| # | Project | Difficulty | Time | Depth of Understanding | Fun Factor | Real-World Value |
|---|---|---|---|---|---|---|
| 1 | The First Crash — Core Dump Generation | Level 1: Beginner | 4-6 hours | ★★☆☆☆ Foundation | ★★★☆☆ | Essential baseline |
| 2 | The GDB Backtrace — Extracting Crash Context | Level 1: Beginner | 6-8 hours | ★★★☆☆ Practical | ★★★★☆ | Daily debugging |
| 3 | The Memory Inspector — Deep State Examination | Level 2: Intermediate | 10-15 hours | ★★★★☆ Deep | ★★★★☆ | Real investigation |
| 4 | Automated Crash Detective — GDB Scripting | Level 2: Intermediate | 15-20 hours | ★★★★☆ Automation | ★★★★★ | CI/CD integration |
| 5 | Multi-threaded Mayhem — Concurrent Crashes | Level 3: Advanced | 20-30 hours | ★★★★★ Expert | ★★★☆☆ | Production systems |
| 6 | The Stripped Binary Challenge | Level 3: Advanced | 15-25 hours | ★★★★★ Expert | ★★★★★ | Security/Forensics |
| 7 | Minidump Parser — Compact Crash Formats | Level 2: Intermediate | 15-20 hours | ★★★★☆ Format Mastery | ★★★★☆ | Cross-platform |
| 8 | Kernel Panic Anatomy — kdump Configuration | Level 3: Advanced | 20-30 hours | ★★★★★ Kernel-Level | ★★★☆☆ | System reliability |
| 9 | Analyzing Kernel Crashes with crash | Level 3: Advanced | 20-30 hours | ★★★★★ Kernel-Level | ★★★★☆ | Kernel debugging |
| 10 | Centralized Crash Reporter | Level 3: Advanced | 40-60 hours | ★★★★★ Architecture | ★★★★★ | Production essential |
Time Investment Summary
| Learning Path | Total Time | Projects |
|---|---|---|
| Minimum Viable | 10-14 hours | Projects 1, 2 |
| Working Professional | 40-60 hours | Projects 1-4 |
| Expert Track | 100-150 hours | Projects 1-6 |
| Full Mastery | 160-240 hours | All 10 projects |
Recommendation
If You Are New to Crash Analysis
Start with Project 1: The First Crash
This is non-negotiable. You need to understand:
- How core dumps are generated
- Where they’re stored
- What triggers their creation
Many developers have never seen a core dump because the core file size limit (ulimit -c) defaults to 0. Fix that first.
Then immediately do Project 2: The GDB Backtrace
This teaches you the 80% of crash debugging that covers 95% of real-world cases. Once you can read a backtrace and examine variables, you’re dangerous.
If You Already Debug with GDB
Start with Project 3: The Memory Inspector
Go deeper. Learn to examine arbitrary memory, understand heap corruption, and trace data structures. This separates the competent from the experts.
Then do Project 4: Automated Crash Detective
Automation pays dividends. A GDB Python script that runs on every crash in CI catches bugs before they hit production.
If You Work on Production Systems
Prioritize Projects 5 and 10
- Project 5 (Multi-threaded Mayhem) teaches you to debug the crashes that only happen under load.
- Project 10 (Centralized Crash Reporter) gives you visibility into crash patterns across your fleet.
If You’re a Kernel Developer or SRE
Projects 8 and 9 are essential
Kernel crashes are different. You need kdump configured, you need to know the crash utility, and you need to understand kernel data structures.
If You Work in Security or Forensics
Project 6: The Stripped Binary Challenge
Real-world malware and proprietary software rarely come with symbols. Learning to debug without them is a superpower.
Recommended Learning Order
For Most Developers: For Kernel/Systems:
┌──────────────────────┐ ┌──────────────────────┐
│ Project 1 (Foundation)│ │ Project 1 (Foundation)│
└───────────┬──────────┘ └───────────┬──────────┘
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ Project 2 (GDB Basics)│ │ Project 2 (GDB Basics)│
└───────────┬──────────┘ └───────────┬──────────┘
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ Project 3 (Deep Mem) │ │ Project 8 (kdump) │
└───────────┬──────────┘ └───────────┬──────────┘
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ Project 4 (Automation)│ │ Project 9 (crash) │
└───────────┬──────────┘ └───────────┬──────────┘
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ Project 5 (Threading)│ │ Project 5 (Threading)│
└───────────┬──────────┘ └───────────┬──────────┘
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ Project 10 (Reporter)│ │ Project 10 (Reporter)│
└──────────────────────┘ └──────────────────────┘
Final Overall Project: Production Crash Forensics Platform
The Goal
Combine everything you’ve learned into a complete crash forensics platform that could be deployed in a real production environment.
System Name: CrashLens
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ CrashLens Platform │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Collection │ │ Collection │ │ Collection │ │
│ │ Agent (Host 1) │ │ Agent (Host 2) │ │ Agent (Host N) │ │
│ │ │ │ │ │ │ │
│ │ • systemd hook │ │ • systemd hook │ │ • systemd hook │ │
│ │ • minidump gen │ │ • minidump gen │ │ • minidump gen │ │
│ │ • symbol fetch │ │ • symbol fetch │ │ • symbol fetch │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ └──────────────────────┼──────────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Message Queue (Redis/RabbitMQ) │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────┴───────────────────────┐ │
│ ▼ ▼ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ Analysis Workers │ │ Symbol Server │ │
│ │ │◄───────────────│ │ │
│ │ • GDB Python automation │ symbols │ • debuginfod-compatible│ │
│ │ • Fingerprint generation│ │ • Build ID indexed │ │
│ │ • Stack unwinding │ │ • S3 backend │ │
│ │ • Minidump parsing │ │ │ │
│ └───────────┬─────────────┘ └─────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ PostgreSQL Database │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ crashes │ │ bugs │ │ binaries │ │ users │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ • metadata │ │ • fingerp. │ │ • build_id │ │ • API keys │ │ │
│ │ │ • frames │ │ • status │ │ • symbols │ │ • teams │ │ │
│ │ │ • memory │ │ • issue_id │ │ • version │ │ • alerts │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────┴───────────────────┐ │
│ ▼ ▼ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ REST API │ │ Web Dashboard │ │
│ │ │ │ │ │
│ │ • Crash submission │ │ • Bug list view │ │
│ │ • Query/search │ │ • Crash timeline │ │
│ │ • Bug management │ │ • Stack trace viewer │ │
│ │ • Integration webhooks │ │ • Trend analysis │ │
│ └─────────────────────────┘ │ • Issue tracker link │ │
│ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Implementation Phases
Phase 1: Foundation (Projects 1, 2, 10)
- Implement basic collection agent using systemd-coredump hooks
- Create central database schema for crashes and bugs
- Build a simple REST API for crash submission (a minimal endpoint sketch follows this list)
- Implement basic fingerprinting from stack traces
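For the Phase 1 API, a single submission endpoint is enough to get agents reporting. A minimal sketch using Flask (one possible choice, not a requirement); the in-memory dict stands in for the real database layer:

import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)
CRASHES = {}  # in-memory stand-in for the PostgreSQL layer

@app.route("/api/v1/crashes", methods=["POST"])
def submit_crash():
    payload = request.get_json(silent=True)
    if not payload or "fingerprint" not in payload or "report" not in payload:
        return jsonify({"error": "fingerprint and report are required"}), 400
    crash_id = str(uuid.uuid4())
    CRASHES[crash_id] = payload  # swap for the real database insert
    return jsonify({"id": crash_id}), 201

if __name__ == "__main__":
    app.run(port=8080)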
Phase 2: Analysis Engine (Projects 3, 4, 6)
- Create GDB Python scripts for automated analysis (a frame-walking sketch follows this list)
- Implement minidump generation for reduced storage
- Handle stripped binaries with graceful degradation
- Add memory region extraction for heap analysis
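For the GDB Python scripts in Phase 2, the core of an analysis worker can be a short frame-walking script run as gdb --batch -x frames.py <exe> <core> (frames.py is a hypothetical file name; the gdb module only exists inside GDB):

# frames.py - runs inside GDB, which provides the gdb module
import json

import gdb

frames = []
frame = gdb.newest_frame()
while frame is not None and len(frames) < 32:
    sal = frame.find_sal()
    frames.append({
        "function": frame.name() or "??",
        "file": sal.symtab.filename if sal and sal.symtab else "unknown",
        "line": sal.line if sal else 0,
    })
    frame = frame.older()

print(json.dumps({"backtrace": frames}, indent=2))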
Phase 3: Multi-threading and Advanced (Project 5)
- Extend fingerprinting to handle concurrent crashes
- Detect and flag potential race conditions
- Implement deadlock detection in crash analysis
- Add thread state comparison tools
Phase 4: Kernel Support (Projects 8, 9)
- Add kdump collection support for kernel panics
- Implement crash utility integration for kernel analysis (a batch-mode sketch follows this list)
- Create kernel-specific fingerprinting
- Handle vmcore files in analysis pipeline
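For the Phase 4 pipeline, the crash utility can be scripted much like GDB: feed it commands non-interactively and parse the text it prints. A sketch; piping commands over stdin is an assumption here, and vmlinux must be the debug build matching the crashed kernel:

import subprocess

def analyze_vmcore(vmlinux_path, vmcore_path):
    """Run a few basic triage commands against a vmcore and return the raw output."""
    commands = "sys\nbt\nlog\nquit\n"  # system summary, panic backtrace, kernel log
    result = subprocess.run(
        ["crash", vmlinux_path, vmcore_path],
        input=commands, capture_output=True, text=True, timeout=300,
    )
    return result.stdout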
Phase 5: Production Polish (Project 7)
- Implement Breakpad minidump format support
- Add cross-platform crash ingestion (Windows, macOS)
- Create symbol server with debuginfod compatibility
- Build trend analysis and anomaly detection
Success Criteria
Your CrashLens platform is complete when:
- Collection: Agents automatically capture crashes from 10+ hosts
- Analysis: Crashes are analyzed within 60 seconds of occurrence
- Fingerprinting: Same bugs are correctly grouped (verify with synthetic crashes)
- Symbols: Debug symbols are automatically fetched for known binaries
- Dashboard: Web UI shows crash list, bug trends, and drill-down views
- API: Integration with issue trackers (GitHub/Jira) works
- Kernel: Kernel panics are captured and analyzed via kdump
- Scale: System handles 1000+ crashes/day without degradation
- Retention: Old data is automatically aged out per policy
- Documentation: Deployment guide covers all components
Stretch Goals
- Machine Learning: Classify crashes by root cause
- Automated Bisect: Integrate with git to find regressing commits
- Live Debugging: Connect to running process from dashboard
- Distributed Tracing: Correlate crashes with request traces
- Cost Analysis: Estimate business impact of each bug
From Learning to Production: What Is Next
After completing these projects, you understand crash dump analysis deeply. Here’s how your skills map to production tools and what gaps remain:
| Your Project | Production Equivalent | Gap to Fill |
|---|---|---|
| Project 1: Core Dump Generation | systemd-coredump, apport | Multi-distro configuration, storage policies |
| Project 2: GDB Backtrace | GDB, LLDB | IDE integration, remote debugging |
| Project 3: Memory Inspector | Valgrind, ASan, MSan | Runtime instrumentation, sanitizers |
| Project 4: GDB Scripting | Mozilla rr, Pernosco | Record/replay debugging |
| Project 5: Multi-threaded | Helgrind, ThreadSanitizer | Data race detection |
| Project 6: Stripped Binaries | Ghidra, IDA Pro, Binary Ninja | Full reverse engineering |
| Project 7: Minidump Parser | Breakpad, Crashpad | Client library integration |
| Project 8: kdump | Red Hat Crash, Oracle kdump | Enterprise kernel support |
| Project 9: crash Utility | crash + extensions | Custom crash plugins |
| Project 10: Crash Reporter | Sentry, Backtrace.io, Raygun | SaaS scale, ML classification |
Career Applications
Site Reliability Engineering (SRE)
- Your Project 10 skills directly apply to building observability infrastructure
- Project 8/9 kernel skills are essential for infrastructure debugging
- Fingerprinting knowledge helps reduce alert fatigue
Security Engineering
- Project 6 stripped binary skills are core to malware analysis
- Memory examination skills from Project 3 help in exploit development
- Core dump analysis is essential for incident response
Systems Programming
- Every project contributes to building robust, debuggable systems
- Understanding crash formats helps design better error handling
- Automation skills from Project 4 integrate into CI/CD pipelines
Kernel Development
- Projects 8/9 are essential prerequisites
- Understanding of ELF format and memory layout is foundational
- Crash analysis is a daily activity for kernel developers
Summary
This learning path covers Linux Crash Dump Analysis through 10 hands-on projects, taking you from basic core dump generation to building a production-grade crash reporting platform.
| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | The First Crash — Core Dump Generation | C | Level 1: Beginner | 4-6 hours |
| 2 | The GDB Backtrace — Extracting Crash Context | C | Level 1: Beginner | 6-8 hours |
| 3 | The Memory Inspector — Deep State Examination | C | Level 2: Intermediate | 10-15 hours |
| 4 | Automated Crash Detective — GDB Scripting | Python | Level 2: Intermediate | 15-20 hours |
| 5 | Multi-threaded Mayhem — Analyzing Concurrent Crashes | C | Level 3: Advanced | 20-30 hours |
| 6 | The Stripped Binary Challenge — Debugging Without Symbols | C/Assembly | Level 3: Advanced | 15-25 hours |
| 7 | Minidump Parser — Understanding Compact Crash Formats | C/Python | Level 2: Intermediate | 15-20 hours |
| 8 | Kernel Panic Anatomy — Triggering and Capturing with kdump | C | Level 3: Advanced | 20-30 hours |
| 9 | Analyzing Kernel Crashes with the crash Utility | C | Level 3: Advanced | 20-30 hours |
| 10 | Centralized Crash Reporter — Production-Grade Infrastructure | Python | Level 3: Advanced | 40-60 hours |
Recommended Learning Paths
For Beginners: Start with Projects 1 → 2 → 3 → 4
For Intermediate Developers: Projects 1 → 2 → 4 → 5 → 10
For Kernel/Systems Engineers: Projects 1 → 2 → 8 → 9 → 5
For Security Professionals: Projects 1 → 2 → 3 → 6 → 7
Expected Outcomes
After completing these projects, you will:
- Generate and configure core dumps on any Linux system with confidence
- Navigate GDB fluently using commands like bt, frame, print, x, and info
- Examine memory in depth including heap structures, stack frames, and data corruption
- Automate crash analysis using GDB’s Python API for CI/CD integration
- Debug multi-threaded crashes including race conditions and deadlocks
- Analyze stripped binaries using assembly-level debugging techniques
- Parse minidump formats for compact, portable crash representation
- Configure and use kdump to capture kernel panics
- Navigate kernel crashes using the crash utility and kernel data structures
- Design and build production crash reporting infrastructure
The Deeper Understanding
Beyond the technical skills, you’ll understand:
- Why crashes happen: Not just “null pointer,” but the architectural reasons software fails
- What memory really is: A flat array of bytes with conventions layered on top
- How debuggers work: They’re not magic—they read the same files you can read
- Why symbols matter: And how to work without them when necessary
- How to think forensically: Reconstructing what happened from incomplete evidence
You’ll have built 10+ working projects that demonstrate deep understanding of crash dump analysis from first principles.
Additional Resources and References
Standards and Specifications
- ELF Format: System V ABI and Linux Extensions
- DWARF Debugging Format: DWARF 5 Standard
- Breakpad Minidump Format: Google Breakpad Documentation
- x86-64 ABI: System V AMD64 ABI
Official Documentation
- GDB Manual: https://sourceware.org/gdb/current/onlinedocs/gdb/
- systemd-coredump: https://www.freedesktop.org/software/systemd/man/systemd-coredump.html
- Linux Kernel kdump: https://www.kernel.org/doc/html/latest/admin-guide/kdump/kdump.html
- crash Utility: https://crash-utility.github.io/
Books (Essential Reading)
| Book | Author | Why It Matters |
|---|---|---|
| Computer Systems: A Programmer’s Perspective | Bryant & O’Hallaron | Foundation for understanding memory, processes, and linking |
| The Linux Programming Interface | Michael Kerrisk | Comprehensive coverage of signals, process memory, and core dumps |
| Linux Kernel Development | Robert Love | Essential for kernel crash analysis |
| Understanding the Linux Kernel | Bovet & Cesati | Deep dive into kernel internals |
| Debugging with GDB | Richard Stallman et al. | Official GDB documentation in book form |
| Practical Binary Analysis | Dennis Andriesse | Reverse engineering and binary formats |
| Expert C Programming | Peter van der Linden | Deep C knowledge for understanding crashes |
Online Resources
- Julia Evans’ Debugging Zines: https://wizardzines.com/ — Approachable visual guides
- Brendan Gregg’s Blog: https://www.brendangregg.com/ — Performance and debugging
- LWN.net Kernel Articles: https://lwn.net/ — In-depth kernel coverage
- GDB Dashboard: https://github.com/cyrus-and/gdb-dashboard — Enhanced GDB interface
Tools Referenced
| Tool | Purpose | Installation |
|---|---|---|
| GDB | GNU Debugger | apt install gdb |
| LLDB | LLVM Debugger | apt install lldb |
| coredumpctl | systemd core dump management | Part of systemd |
| crash | Kernel crash analysis | apt install crash |
| Breakpad | Client crash reporting | Build from source |
| debuginfod | Symbol server | apt install debuginfod |
| objdump | Binary examination | Part of binutils |
| readelf | ELF file analysis | Part of binutils |
| addr2line | Address to source mapping | Part of binutils |
Community and Help
- GDB Mailing List: https://sourceware.org/gdb/mailing-lists/
- Stack Overflow Tags: [gdb], [core-dump], [crash-dump]
- Reddit: r/linux, r/linuxadmin, r/ReverseEngineering
- Linux Kernel Mailing List (LKML): For kernel crash questions
This learning path was designed to take you from zero knowledge to expert-level crash dump analysis skills. The projects are ordered to build on each other, with each one adding new concepts and reinforcing what came before. Complete them all, and you’ll have skills that set you apart in systems programming, site reliability engineering, security research, or kernel development.