Sprint: Linux Crash Dump Analysis Mastery - From Core Dumps to Kernel Panics

Goal: Master the art of post-mortem debugging in Linux. You will learn to analyze user-space core dumps with GDB, understand the ELF format that stores crash data, diagnose multi-threaded crashes, automate triage workflows, and ultimately dissect kernel panics with the crash utility. By the end, you will confidently find the root cause of any crash—from a simple segmentation fault to a full kernel panic.


Introduction

When a program crashes, it often leaves behind a core dump—a snapshot of the process’s memory and CPU state at the moment of death. This file is your crime scene evidence. Learning to analyze it transforms you from a developer who guesses at bugs into one who knows the root cause.

This guide covers user-space crash analysis (segfaults, memory corruption, multi-threaded races) and introduces kernel crash analysis (panics, oops, kdump). You will build 10 projects that progressively deepen your understanding.

What You Will Build

  1. Crash-generating programs - Controlled bugs that produce core dumps
  2. GDB analysis workflows - Manual and scripted backtrace extraction
  3. Memory inspection tools - Raw memory examination and corruption detection
  4. Automated triage scripts - Python/GDB automation for production systems
  5. Multi-threaded crash scenarios - Data race and deadlock analysis
  6. Stripped binary debugging - Working without debug symbols
  7. Minidump parser - Understanding Breakpad/Crashpad formats
  8. Kernel panic triggers - Writing buggy kernel modules safely in VMs
  9. Kernel crash analysis - Using crash on vmcore files
  10. Centralized crash reporter - A mini-Sentry for your infrastructure

Big Picture: The Crash Analysis Pipeline

┌─────────────────────────────────────────────────────────────────────────────┐
│                         CRASH ANALYSIS PIPELINE                              │
└─────────────────────────────────────────────────────────────────────────────┘

  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
  │  APPLICATION │     │    KERNEL    │     │  CORE DUMP   │     │   ANALYSIS   │
  │    CRASH     │────►│   HANDLER    │────►│    FILE      │────►│    TOOLS     │
  └──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
         │                    │                    │                    │
         │                    │                    │                    │
         ▼                    ▼                    ▼                    ▼
   ┌───────────┐        ┌───────────┐        ┌───────────┐        ┌───────────┐
   │ SIGSEGV   │        │core_pattern│       │ ELF Format │       │   GDB     │
   │ SIGABRT   │        │ ulimit -c  │       │ PT_NOTE    │       │  crash    │
   │ SIGFPE    │        │ systemd-   │       │ PT_LOAD    │       │ minidump  │
   │ SIGBUS    │        │ coredump   │       │ Registers  │       │ stackwalk │
   └───────────┘        └───────────┘        └───────────┘        └───────────┘

  USER-SPACE CRASHES:                    KERNEL CRASHES:
  ┌─────────────────────────────┐       ┌─────────────────────────────────────┐
  │  Process → Signal → Core    │       │  Panic → kexec → vmcore → crash    │
  │  GDB loads: executable +    │       │  crash loads: vmlinux + vmcore     │
  │  core file + debug symbols  │       │  Debug kernel symbols required     │
  └─────────────────────────────┘       └─────────────────────────────────────┘

Scope

In Scope:

  • Linux user-space core dump analysis (GDB)
  • ELF core dump format internals
  • Multi-threaded crash debugging
  • GDB Python scripting for automation
  • Kernel crash dumps (kdump/vmcore basics)
  • The crash utility for kernel analysis
  • Minidump format (Breakpad/Crashpad)

Out of Scope:

  • Windows crash dumps (WinDbg, minidumps on Windows)
  • macOS crash reports
  • Hardware debugging (JTAG, logic analyzers)
  • Live debugging techniques (covered in other guides)

How to Use This Guide

Reading Strategy

  1. Read the Theory Primer first - The concepts explained before the projects give you the mental model needed to understand why techniques work.

  2. Follow the project order - Projects 1-3 build foundational skills. Projects 4-6 add intermediate complexity. Projects 7-10 tackle advanced topics.

  3. Don’t skip the “Thinking Exercise” - These pre-project exercises build the mental models that make debugging intuitive.

  4. Use the “Definition of Done” - Each project has explicit completion criteria. Don’t move on until you’ve hit them.

Workflow for Each Project

1. Read the project overview and "Core Question"
2. Complete the "Thinking Exercise"
3. Implement the project (use hints only when stuck)
4. Verify against "Real World Outcome" examples
5. Review the "Common Pitfalls" section
6. Complete the "Definition of Done" checklist
7. Attempt the interview questions

Time Investment

Project Type   Time Estimate   Examples
Beginner       4-8 hours       Projects 1-2
Intermediate   10-20 hours     Projects 3-6
Advanced       20-40 hours     Projects 7-9
Capstone       40+ hours       Project 10

Total Sprint Duration: 8-12 weeks at 10-15 hours/week


Prerequisites & Background Knowledge

Essential Prerequisites (Must Have)

C Programming Fundamentals

  • Pointers and memory addresses
  • Stack vs. heap allocation
  • Compiling with GCC (gcc -g -o program program.c)
  • Basic understanding of segmentation faults
  • Recommended Reading: “The C Programming Language” by Kernighan & Ritchie - Ch. 5-6

Linux Command Line

  • Basic shell navigation and commands
  • Understanding of processes and signals
  • File permissions and sudo usage
  • Recommended Reading: “The Linux Command Line” by William Shotts - Ch. 1-10

Basic GDB Usage

  • Starting GDB (gdb ./program)
  • Setting breakpoints (break main)
  • Stepping through code (next, step)
  • Printing variables (print x)
  • Recommended Reading: “The Art of Debugging with GDB” by Matloff & Salzman - Ch. 1-2

Helpful But Not Required

Assembly Language Basics (learned during Projects 5-6)

  • x86-64 register names (RAX, RBX, RSP, RIP)
  • Basic instruction formats
  • Recommended Reading: “Low-Level Programming” by Igor Zhirkov - Ch. 1-3

Linux Kernel Concepts (learned during Projects 8-9)

  • Kernel modules and insmod/rmmod
  • Kernel vs. user space
  • Recommended Reading: “Linux Device Drivers, 3rd Edition” by Corbet et al. - Ch. 1-2

Python Scripting (needed for Projects 4, 7, 10)

  • Basic Python syntax and file I/O
  • subprocess module for running external commands
  • struct module for binary parsing

Self-Assessment Questions

Answer “yes” to at least 5 of these before starting:

  1. Can you explain what happens when you dereference a NULL pointer in C?
  2. Do you know the difference between stack and heap memory?
  3. Can you compile a C program with debug symbols using GCC?
  4. Have you used GDB to set a breakpoint and step through code?
  5. Do you understand what a signal (like SIGSEGV) is in Linux?
  6. Can you write a basic bash script that runs a command and checks its exit code?
  7. Do you know what a process memory map looks like (/proc/[pid]/maps)?

Development Environment Setup

Required Tools:

Tool     Version  Purpose                       Installation
GCC      9.0+     Compiling with debug symbols  sudo apt install build-essential
GDB      10.0+    Core dump analysis            sudo apt install gdb
Bash     4.0+     Scripting                     Pre-installed on Linux
Python   3.8+     Automation scripts            sudo apt install python3

Recommended Tools:

Tool         Purpose                       Installation
Valgrind     Memory error detection        sudo apt install valgrind
strace       System call tracing           sudo apt install strace
objdump      Binary disassembly            Part of binutils
readelf      ELF file inspection           Part of binutils
coredumpctl  systemd core dump management  Part of systemd

For Kernel Projects (8-9):

Tool              Purpose                           Installation
QEMU/KVM          Virtual machine for safe testing  sudo apt install qemu-kvm
crash             Kernel dump analysis              sudo apt install crash
kernel-debuginfo  Debug symbols for kernel          Varies by distro

Testing Your Setup:

# Verify GCC with debug symbols
$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

# Verify GDB
$ gdb --version
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1

# Test core dump generation
$ ulimit -c unlimited
$ cat /proc/sys/kernel/core_pattern
|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E

# Note: If using systemd-coredump, use coredumpctl instead
$ coredumpctl list

Important Reality Check

This is not easy material. Crash dump analysis requires understanding:

  • How programs are represented in memory
  • How the CPU executes instructions
  • How the operating system manages processes
  • How debugging tools reconstruct program state

Expect to feel confused initially. The projects are designed to make abstract concepts concrete through hands-on work. Trust the process.


Big Picture / Mental Model

The Layers of Crash Analysis

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CRASH ANALYSIS MENTAL MODEL                               │
└─────────────────────────────────────────────────────────────────────────────┘

LAYER 5: AUTOMATED SYSTEMS
┌─────────────────────────────────────────────────────────────────────────────┐
│  Crash Reporters (Sentry, Crashpad)  │  CI/CD Integration  │  Alerting     │
│  Symbolication Servers               │  Crash Deduplication │  Dashboards   │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
LAYER 4: KERNEL CRASH ANALYSIS
┌─────────────────────────────────────────────────────────────────────────────┐
│  kdump (capture mechanism)  │  vmcore (kernel memory image)  │  crash tool │
│  kexec (boot into capture)  │  vmlinux (debug symbols)       │  dmesg/log  │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
LAYER 3: ADVANCED USER-SPACE ANALYSIS
┌─────────────────────────────────────────────────────────────────────────────┐
│  Multi-threaded debugging   │  Stripped binaries  │  Memory corruption      │
│  GDB Python scripting       │  Disassembly        │  Address-to-symbol      │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
LAYER 2: BASIC USER-SPACE ANALYSIS
┌─────────────────────────────────────────────────────────────────────────────┐
│  GDB backtrace (bt)         │  Variable inspection (print)  │  Memory exam │
│  Stack frames (frame N)     │  Register state (info reg)    │  x/Nfx addr  │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
LAYER 1: CORE DUMP FUNDAMENTALS
┌─────────────────────────────────────────────────────────────────────────────┐
│  ELF core format            │  Signal handling      │  ulimit/core_pattern  │
│  PT_NOTE (metadata)         │  Memory segments      │  systemd-coredump     │
│  PT_LOAD (memory snapshot)  │  Register values      │  Debug symbols (-g)   │
└─────────────────────────────────────────────────────────────────────────────┘

                    ┌─────────────────────────────────┐
                    │  FOUNDATION: C Memory Model     │
                    │  Stack, Heap, Code, Data        │
                    │  Pointers, Addresses, Segments  │
                    └─────────────────────────────────┘

The Core Dump Data Flow

 RUNNING PROCESS                    CRASH EVENT                     ANALYSIS
 ┌────────────────┐                ┌────────────────┐              ┌────────────────┐
 │                │                │                │              │                │
 │  Code (.text)  │                │  Signal raised │              │  GDB loads:    │
 │  Data (.data)  │  ─────────►    │  (SIGSEGV,     │  ─────────►  │  - executable  │
 │  BSS  (.bss)   │                │   SIGABRT...)  │              │  - core file   │
 │  Heap          │                │                │              │  - symbols     │
 │  Stack         │                │  Kernel writes │              │                │
 │  Registers     │                │  core dump     │              │  You inspect:  │
 │                │                │                │              │  - backtrace   │
 └────────────────┘                └────────────────┘              │  - variables   │
                                                                   │  - memory      │
                                   Core file contains:             │  - registers   │
                                   ┌────────────────┐              │                │
                                   │ ELF Header     │              └────────────────┘
                                   │ PT_NOTE:       │
                                   │  - prstatus    │
                                   │  - prpsinfo    │
                                   │  - auxv        │
                                   │  - files       │
                                   │ PT_LOAD:       │
                                   │  - memory      │
                                   │    segments    │
                                   └────────────────┘

Theory Primer

This section provides the conceptual foundation for crash dump analysis. Read these chapters before starting the projects—they give you the mental models that make debugging intuitive.

Chapter 1: Core Dumps Fundamentals

Fundamentals

A core dump (or “core file”) is a snapshot of a process’s memory and CPU state at the moment it terminated abnormally. The name comes from early computing when memory was made of magnetic “cores.” Today, a core dump captures everything needed to reconstruct the state of a crashed program: its memory contents, register values, open file descriptors, and more.

When a process receives a signal that causes it to terminate (like SIGSEGV for segmentation faults), the Linux kernel can save this snapshot to a file. This file is your primary evidence for post-mortem debugging—analyzing what went wrong after the crash occurred.

Core dumps are essential because:

  1. Crashes may not be reproducible - Race conditions, timing issues, or specific input combinations may be difficult to recreate
  2. Production debugging is limited - You can’t attach a debugger to production systems easily
  3. The crash context is preserved - You see the exact state at the moment of failure

Deep Dive

The core dump mechanism in Linux is controlled by several system settings and kernel parameters. Understanding these is crucial for both generating and analyzing dumps.

Signal Handling and Core Generation

When a process performs an illegal operation (like dereferencing a NULL pointer), the CPU raises an exception. The kernel translates this into a signal delivered to the process. The default action for certain signals is to terminate the process and generate a core dump:

Signal   Number  Description                     Default Action
SIGQUIT  3       Quit from keyboard (Ctrl+\)     Core dump
SIGILL   4       Illegal instruction             Core dump
SIGTRAP  5       Trace/breakpoint trap           Core dump
SIGABRT  6       Abort signal (from abort())     Core dump
SIGBUS   7       Bus error (bad memory access)   Core dump
SIGFPE   8       Floating-point exception        Core dump
SIGSEGV  11      Segmentation fault              Core dump
SIGSYS   31      Bad system call                 Core dump
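
Signal numbers are not identical on every architecture (the table above shows x86-64 Linux values). If you want to double-check them on your own machine, a small Python sketch using the standard signal module will print them:

# print_core_signals.py - print this platform's numbers for the core-dumping signals
import signal

for name in ["SIGQUIT", "SIGILL", "SIGTRAP", "SIGABRT",
             "SIGBUS", "SIGFPE", "SIGSEGV", "SIGSYS"]:
    print(f"{name:8s} = {int(getattr(signal, name))}")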

Core Dump Size Limits (ulimit)

The shell’s resource limits control whether core dumps are created. The ulimit -c command shows the maximum core file size in blocks (the exact block size depends on the shell). A value of 0 (the common default) means no core dumps are created.

# Check current limit
$ ulimit -c
0

# Enable unlimited core dumps for this session
$ ulimit -c unlimited

# Enable for all users permanently (in /etc/security/limits.conf)
*    soft    core    unlimited
*    hard    core    unlimited

Core Pattern Configuration

The /proc/sys/kernel/core_pattern file determines where core dumps are written and how they’re named. This can be:

  1. A file path pattern - Specifiers like %p (PID), %e (executable name), %t (timestamp) are replaced
  2. A pipe to a program - Starting with | pipes the dump to a handler (like systemd-coredump or apport)
# Traditional file-based pattern (writing core_pattern requires root, so use sudo tee)
$ echo "core.%e.%p.%t" | sudo tee /proc/sys/kernel/core_pattern
# Creates: core.myprogram.1234.1703097600

# Modern systemd-based handling
$ cat /proc/sys/kernel/core_pattern
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

Modern Core Dump Management: systemd-coredump

Most modern Linux distributions use systemd-coredump to manage core dumps. It provides:

  • Automatic compression and storage in /var/lib/systemd/coredump/
  • Journal integration for metadata
  • Automatic cleanup of old dumps
  • The coredumpctl tool for listing and debugging
# List recent core dumps
$ coredumpctl list
TIME                        PID  UID  GID SIG     COREFILE EXE
Fri 2024-12-20 10:30:15 EST 1234 1000 1000 SIGSEGV present  /usr/bin/myapp

# Debug a specific crash
$ coredumpctl debug myapp

# Export the core file
$ coredumpctl dump myapp -o core.myapp

Core Dump Security Considerations

Core dumps can contain sensitive data:

  • Environment variables (potentially including secrets)
  • Memory contents (passwords, API keys, personal data)
  • File contents that were being processed

This is why many systems disable core dumps by default and why GDPR compliance may require careful handling. Configure Storage=none in /etc/systemd/coredump.conf if you only want journal metadata without the actual memory dump.

How This Fits in Projects

  • Project 1: You’ll configure core dump generation and verify the system creates dump files
  • Project 2: You’ll load core dumps into GDB and extract backtraces
  • Project 10: You’ll build a system that automatically captures and processes core dumps

Mental Model Diagram

                         CORE DUMP GENERATION FLOW
┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│  1. ILLEGAL OPERATION      2. CPU EXCEPTION      3. KERNEL SIGNAL            │
│  ┌─────────────────┐      ┌─────────────────┐   ┌─────────────────┐          │
│  │  int *p = NULL; │      │  #PF (Page      │   │  Signal 11      │          │
│  │  *p = 42;       │ ───► │  Fault) raised  │──►│  (SIGSEGV)      │          │
│  │  // BOOM!       │      │  by CPU         │   │  delivered      │          │
│  └─────────────────┘      └─────────────────┘   └────────┬────────┘          │
│                                                          │                   │
│  4. CORE PATTERN CHECK     5. DUMP CREATION      6. FILE WRITTEN            │
│  ┌─────────────────┐      ┌─────────────────┐   ┌─────────────────┐          │
│  │ /proc/sys/      │      │  For each       │   │ core.myapp.1234 │          │
│  │ kernel/         │ ───► │  memory region: │──►│ (ELF format)    │          │
│  │ core_pattern    │      │  - Copy to file │   │ Ready for GDB   │          │
│  │                 │      │  - Add metadata │   │                 │          │
│  │ ulimit -c > 0?  │      │  - Save regs    │   └─────────────────┘          │
│  └─────────────────┘      └─────────────────┘                                │
│                                                                              │
│  OR (Modern systemd systems):                                                │
│  ┌─────────────────────────────────────────────────────────────────────────┐ │
│  │  |/usr/lib/systemd/systemd-coredump                                    │ │
│  │  └─► Writes to /var/lib/systemd/coredump/                              │ │
│  │  └─► Logs metadata to journal                                          │ │
│  │  └─► Use `coredumpctl debug` to analyze                                │ │
│  └─────────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────┘

Minimal Concrete Example

// crash.c - A program that generates a core dump
#include <stdio.h>

int main(void) {
    int *ptr = NULL;  // ptr points to address 0
    *ptr = 42;        // Writing to address 0 triggers SIGSEGV
    return 0;
}
# Compile with debug symbols
$ gcc -g -o crash crash.c

# Enable core dumps
$ ulimit -c unlimited

# Run and crash
$ ./crash
Segmentation fault (core dumped)

# Verify the core file exists (traditional systems)
$ file core*
core: ELF 64-bit LSB core file, x86-64

# Or on systemd systems
$ coredumpctl list
TIME                        PID  UID  GID SIG     COREFILE EXE
...                        1234 1000 1000 SIGSEGV present  /path/to/crash

Common Misconceptions

  1. “Core dumps are always created” - No, they require ulimit -c to be non-zero and the file system must be writable.

  2. “The core file contains the executable” - No, it contains only memory contents and metadata. You need the original executable (ideally with debug symbols) to analyze it.

  3. “Core dumps are huge” - Not always. They can be compressed and typically only contain used memory pages, not the full virtual address space.

  4. “I can analyze a core dump from any machine” - The executable must match exactly (same build). Libraries must also match for accurate analysis.

Check-Your-Understanding Questions

  1. What is the default value of ulimit -c on most systems, and what does it mean?
  2. If core_pattern is |/usr/lib/systemd/systemd-coredump, where do core dumps go?
  3. Why might you set Storage=none in /etc/systemd/coredump.conf?
  4. Which signal is sent when you dereference a NULL pointer?
  5. What’s the difference between SIGSEGV and SIGBUS?

Check-Your-Understanding Answers

  1. Default is 0, meaning core dumps are disabled. No core file will be created when a process crashes.

  2. They’re piped to the systemd-coredump service, which stores them in /var/lib/systemd/coredump/ (compressed) and logs metadata to the journal.

  3. For security/privacy compliance (like GDPR) where you want to log that crashes occurred without storing potentially sensitive memory contents.

  4. SIGSEGV (signal 11) - Segmentation fault. This indicates an invalid memory access.

  5. SIGSEGV typically means accessing unmapped memory or memory without the required permission. SIGBUS means the access itself was invalid even though the address may be mapped: for example, a misaligned access on architectures that require alignment, or touching an mmap'd region of a file beyond the file's end.

Real-World Applications

  • Production debugging - When a server process crashes at 3 AM, the core dump lets you analyze it during business hours
  • Crash reporting systems - Services like Sentry and Crashpad use core dumps (or minidumps) to report crashes
  • QA and testing - Core dumps help developers understand why tests failed
  • Security analysis - Examining crashes for potential vulnerabilities

Where You’ll Apply It

  • Project 1: Generating your first intentional crash and verifying core dump creation
  • Project 2: Loading core dumps into GDB
  • Project 4: Automating core dump analysis
  • Project 10: Building a crash collection system

References

Key Insights

A core dump is a frozen snapshot of a process at the moment of death—it captures everything needed to perform a post-mortem investigation without needing to reproduce the crash.

Summary

Core dumps are ELF files containing a process’s memory and CPU state at crash time. Generation is controlled by ulimit -c (size limit) and /proc/sys/kernel/core_pattern (location/handler). Modern systems use systemd-coredump with coredumpctl for management. Core dumps can contain sensitive data, so security considerations apply.

Homework/Exercises

  1. Exercise 1: Write a C program that crashes with each of these signals: SIGSEGV, SIGFPE, SIGABRT. Generate and verify core dumps for each.

  2. Exercise 2: Configure your system to store core dumps with the pattern /tmp/cores/core.%e.%p.%t. Create the directory, set permissions, and test.

  3. Exercise 3: If your system uses systemd-coredump, practice using coredumpctl list, coredumpctl info, and coredumpctl debug.

  4. Exercise 4: Write a script that checks if core dumps are enabled and reports the current configuration.

Solutions to Homework/Exercises

Exercise 1 Solution:

// Three separate programs:

// sigsegv_crash.c
int main() { int *p = 0; *p = 1; return 0; }

// sigfpe_crash.c
int main() { int x = 1, y = 0; return x / y; }

// sigabrt_crash.c
#include <stdlib.h>
int main() { abort(); return 0; }

Exercise 2 Solution:

sudo mkdir -p /tmp/cores
sudo chmod 1777 /tmp/cores
echo "/tmp/cores/core.%e.%p.%t" | sudo tee /proc/sys/kernel/core_pattern
ulimit -c unlimited
./crash  # Test with any crashing program
ls /tmp/cores/  # Verify core file created

Exercise 3 Solution:

coredumpctl list                    # List all dumps
coredumpctl info "$(coredumpctl list --no-legend | tail -n1 | awk '{print $5}')"  # Info on latest (PID is column 5)
coredumpctl debug                   # Debug most recent dump

Exercise 4 Solution:

#!/bin/bash
echo "=== Core Dump Configuration Check ==="
echo "ulimit -c: $(ulimit -c)"
echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
if [ "$(ulimit -c)" = "0" ]; then
    echo "WARNING: Core dumps are DISABLED"
else
    echo "Core dumps are ENABLED"
fi

Chapter 2: ELF Core Dump Format

Fundamentals

Core dumps are stored in the ELF (Executable and Linkable Format) format—the same format used for Linux executables and shared libraries. Understanding ELF structure is essential because it tells you where to find different pieces of crash data.

An ELF core dump is essentially a specially structured file that contains two main types of information: metadata (what process crashed, signal received, register values) stored in PT_NOTE segments, and memory contents (stack, heap, data sections) stored in PT_LOAD segments.

Unlike executable ELF files that have section headers for code and data, core dumps primarily use program headers to describe memory segments. The readelf and eu-readelf tools can parse these headers, revealing the structure of your crash data.

Deep Dive

ELF File Structure Overview

Every ELF file begins with a fixed-size header that identifies the file and points to important data structures:

┌────────────────────────────────────────────────────────────────────────────┐
│                           ELF FILE STRUCTURE                                │
├────────────────────────────────────────────────────────────────────────────┤
│ ELF Header (52/64 bytes)                                                   │
│   - Magic number: 0x7F 'E' 'L' 'F'                                        │
│   - Class (32/64-bit), Endianness, Version                                │
│   - Type: ET_CORE (4) for core dumps                                      │
│   - Entry point, Program header offset, Section header offset             │
├────────────────────────────────────────────────────────────────────────────┤
│ Program Headers (array)                                                    │
│   - PT_NOTE: Metadata (registers, process info, file mappings)            │
│   - PT_LOAD: Memory segments (actual memory contents)                     │
├────────────────────────────────────────────────────────────────────────────┤
│ Segment Data                                                               │
│   - NOTE data: prstatus, prpsinfo, auxv, file mappings                    │
│   - LOAD data: Stack, heap, mapped files, anonymous memory               │
└────────────────────────────────────────────────────────────────────────────┘

PT_NOTE Segment: The Metadata Treasure Trove

The PT_NOTE segment contains structured metadata about the crashed process. Each note has a name, type, and descriptor (data). Key note types include:

Note Type      Name   Description
NT_PRSTATUS    CORE   Process status including registers, signal, PID
NT_PRPSINFO    CORE   Process info: state, command name, nice value
NT_AUXV        CORE   Auxiliary vector (dynamic linker info)
NT_FILE        CORE   File mappings (which files were mapped where)
NT_FPREGSET    CORE   Floating-point register state
NT_X86_XSTATE  LINUX  Extended CPU state (AVX, etc.)

The NT_PRSTATUS note is particularly important—it contains:

  • The signal that killed the process (e.g., SIGSEGV = 11)
  • Current and pending signal masks
  • All general-purpose register values (RIP, RSP, RAX, etc.)
  • Process and thread IDs

For multi-threaded processes, there’s one NT_PRSTATUS note per thread, allowing you to see what each thread was doing at crash time.

PT_LOAD Segments: The Memory Snapshot

PT_LOAD segments contain the actual memory contents of the process. Each segment has:

  • p_vaddr: Virtual address where this memory was mapped
  • p_filesz: How many bytes are in the core file
  • p_memsz: How many bytes this segment represented in memory
  • p_flags: Permissions (read, write, execute)

If p_filesz is 0 but p_memsz is non-zero, the segment was all zeros (like uninitialized BSS) and wasn’t stored to save space.
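
To make these fields concrete, here is a minimal Python sketch (assuming a little-endian ELF64 core file named core in the current directory; the script name ph_dump.py is made up) that prints the type, virtual address, sizes, and flags of every program header, essentially a stripped-down readelf -l:

# ph_dump.py - list program headers of an ELF64 little-endian core dump.
# A sketch only: real tools should check EI_CLASS/EI_DATA first, and very
# large cores store the true header count outside e_phnum (ignored here).
import struct

PT_NAMES = {1: "PT_LOAD", 4: "PT_NOTE"}

with open("core", "rb") as f:
    ehdr = f.read(64)                                     # ELF64 header is 64 bytes
    assert ehdr[:4] == b"\x7fELF", "not an ELF file"
    e_phoff     = struct.unpack_from("<Q", ehdr, 32)[0]   # program header table offset
    e_phentsize = struct.unpack_from("<H", ehdr, 54)[0]   # size of one entry (56 on ELF64)
    e_phnum     = struct.unpack_from("<H", ehdr, 56)[0]   # number of entries

    f.seek(e_phoff)
    for i in range(e_phnum):
        ph = f.read(e_phentsize)
        p_type, p_flags   = struct.unpack_from("<II", ph, 0)
        p_offset, p_vaddr = struct.unpack_from("<QQ", ph, 8)
        p_filesz, p_memsz = struct.unpack_from("<QQ", ph, 32)
        name = PT_NAMES.get(p_type, hex(p_type))
        print(f"[{i:2d}] {name:8s} vaddr=0x{p_vaddr:012x} "
              f"filesz=0x{p_filesz:x} memsz=0x{p_memsz:x} flags={p_flags:#x}")

A segment printed with filesz=0x0 but a non-zero memsz is exactly the all-zero case described above.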

Inspecting ELF Structure with readelf

# View the ELF header
$ readelf -h core
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 ...
  Class:                             ELF64
  Type:                              CORE (Core file)
  ...

# View program headers
$ readelf -l core
Program Headers:
  Type           Offset             VirtAddr           ...
  NOTE           0x0000000000000350 0x0000000000000000 ...
  LOAD           0x0000000000001000 0x0000555555554000 ...
  LOAD           0x0000000000002000 0x00007ffff7dd5000 ...

# View notes
$ readelf -n core
Displaying notes found in: core
  Owner     Data size    Description
  CORE      0x00000150   NT_PRSTATUS (prstatus structure)
  CORE      0x00000088   NT_PRPSINFO (prpsinfo structure)
  CORE      0x00000130   NT_AUXV (auxiliary vector)
  CORE      0x00000550   NT_FILE (mapped files)

The NT_FILE Note: Understanding Memory Mappings

The NT_FILE note is invaluable—it tells you which shared libraries and files were mapped into the process. This helps you understand:

  • Which version of libc was running
  • What shared libraries were loaded
  • Where the executable was mapped
# Example NT_FILE output:
  Page size: 4096
  Start                End                 Page Offset File
  0x0000555555554000   0x0000555555555000  0x00000000  /path/to/program
  0x00007ffff7dd5000   0x00007ffff7f6a000  0x00000000  /lib/x86_64-linux-gnu/libc.so.6
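
Each record inside the PT_NOTE segment is stored sequentially as a (namesz, descsz, type) header followed by the 4-byte-aligned owner name and descriptor bytes. A minimal sketch (same assumptions as the earlier parser: little-endian ELF64 core named core; the script name note_walk.py is made up) that walks those records and prints each note's owner and type:

# note_walk.py - walk the note records in a core dump's PT_NOTE segment.
# Descriptors are skipped, not decoded; this only shows the record layout.
import struct

NT_NAMES = {1: "NT_PRSTATUS", 2: "NT_FPREGSET", 3: "NT_PRPSINFO",
            6: "NT_AUXV", 0x46494c45: "NT_FILE"}

def align4(n):                       # name/descriptor fields are padded to 4 bytes
    return (n + 3) & ~3

with open("core", "rb") as f:
    ehdr = f.read(64)
    e_phoff     = struct.unpack_from("<Q", ehdr, 32)[0]
    e_phentsize = struct.unpack_from("<H", ehdr, 54)[0]
    e_phnum     = struct.unpack_from("<H", ehdr, 56)[0]

    # Find the first PT_NOTE segment (p_type == 4).
    f.seek(e_phoff)
    note_off = note_size = None
    for _ in range(e_phnum):
        ph = f.read(e_phentsize)
        if struct.unpack_from("<I", ph, 0)[0] == 4:
            note_off  = struct.unpack_from("<Q", ph, 8)[0]   # p_offset
            note_size = struct.unpack_from("<Q", ph, 32)[0]  # p_filesz
            break
    assert note_off is not None, "no PT_NOTE segment found"

    f.seek(note_off)
    data, pos = f.read(note_size), 0
    while pos + 12 <= len(data):
        namesz, descsz, n_type = struct.unpack_from("<III", data, pos)
        pos += 12
        owner = data[pos:pos + namesz].rstrip(b"\0").decode(errors="replace")
        pos += align4(namesz) + align4(descsz)               # skip name + descriptor
        print(f"{owner:8s} {NT_NAMES.get(n_type, hex(n_type))}  descsz={descsz}")

On a multi-threaded core you should see one NT_PRSTATUS line per thread.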

How This Fits in Projects

  • Project 6: You’ll work with stripped binaries where understanding ELF structure helps locate functions
  • Project 7: You’ll parse the ELF/minidump structure programmatically
  • All projects: Understanding where data lives in the core file helps you navigate GDB output

Mental Model Diagram

                    ELF CORE DUMP ANATOMY
┌────────────────────────────────────────────────────────────────────────┐
│                        ELF HEADER (64 bytes)                           │
│  ┌──────────────────────────────────────────────────────────────────┐ │
│  │ Magic: 7F 45 4C 46  Type: CORE  Machine: x86_64  Entry: 0x0     │ │
│  │ Program Header Offset: 64  Section Header Offset: 0 (none)      │ │
│  └──────────────────────────────────────────────────────────────────┘ │
├────────────────────────────────────────────────────────────────────────┤
│                      PROGRAM HEADERS                                   │
│  ┌──────────────────────────────────────────────────────────────────┐ │
│  │ [0] PT_NOTE   offset=0x350   vaddr=0x0        filesz=0x8a8      │ │
│  │ [1] PT_LOAD   offset=0x1000  vaddr=0x555...   filesz=0x1000     │ │
│  │ [2] PT_LOAD   offset=0x2000  vaddr=0x7ff...   filesz=0x195000   │ │
│  │ ...                                                              │ │
│  └──────────────────────────────────────────────────────────────────┘ │
├────────────────────────────────────────────────────────────────────────┤
│                      NOTE SEGMENT DATA                                 │
│  ┌──────────────────────────────────────────────────────────────────┐ │
│  │ NT_PRSTATUS: signal=11, pid=1234, regs={rip=0x555..., rsp=...}  │ │
│  │ NT_PRPSINFO: state='R', fname="program", args="./program"       │ │
│  │ NT_AUXV: AT_PHDR=..., AT_ENTRY=..., AT_BASE=...                 │ │
│  │ NT_FILE: [0x555...–0x556...] /path/to/program                   │ │
│  │          [0x7ff...–0x7ff...] /lib/.../libc.so.6                 │ │
│  │ NT_FPREGSET: floating point registers                           │ │
│  │ (For multi-threaded: NT_PRSTATUS for each thread)               │ │
│  └──────────────────────────────────────────────────────────────────┘ │
├────────────────────────────────────────────────────────────────────────┤
│                      LOAD SEGMENT DATA                                 │
│  ┌──────────────────────────────────────────────────────────────────┐ │
│  │ [Segment 1: Code] .text section from executable                 │ │
│  │ [Segment 2: Data] .data, .bss, heap                             │ │
│  │ [Segment 3: Stack] Local variables, return addresses            │ │
│  │ [Segment 4: libc] Memory-mapped shared library                  │ │
│  │ ...                                                              │ │
│  └──────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────┘

     NOTE: No section headers! Core dumps use program headers only.

Minimal Concrete Example

# Generate a core dump
$ ulimit -c unlimited
$ echo "core" | sudo tee /proc/sys/kernel/core_pattern
$ ./crash
Segmentation fault (core dumped)

# Inspect the ELF structure
$ file core
core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style

$ readelf -h core | head -10
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  Type:                              CORE (Core file)

$ readelf -n core | grep -A2 "NT_PRSTATUS"
    CORE                 0x00000150       NT_PRSTATUS (prstatus structure)
# Contains signal 11 (SIGSEGV) and register values

Common Misconceptions

  1. “Core dumps have section headers like executables” - No, core dumps typically have 0 section headers. They use program headers (PT_NOTE, PT_LOAD) exclusively.

  2. “The entire virtual address space is saved” - No, only mapped pages with actual content are saved. Zero pages and unmapped regions are omitted.

  3. “You can run the core dump” - No, it’s not an executable. It’s a memory snapshot that requires the original executable and GDB to interpret.

  4. “All notes are in one PT_NOTE segment” - Usually yes, but the notes within that segment are individual records you must parse sequentially.

Check-Your-Understanding Questions

  1. What is the ELF type field value for core dumps?
  2. What information is stored in the NT_PRSTATUS note?
  3. How can you determine which shared libraries were loaded when a process crashed?
  4. Why do core dumps typically have 0 section headers?
  5. What does it mean when a PT_LOAD segment has p_filesz=0 but p_memsz > 0?

Check-Your-Understanding Answers

  1. ET_CORE (value 4). This distinguishes core dumps from executables (ET_EXEC), shared objects (ET_DYN), and relocatable files (ET_REL).

  2. NT_PRSTATUS contains: the signal that killed the process, PID, PPID, all general-purpose registers (including the instruction pointer RIP and stack pointer RSP), and signal masks.

  3. The NT_FILE note in the PT_NOTE segment lists all memory-mapped files with their virtual address ranges. Use readelf -n core to view it.

  4. Section headers are used by linkers and debuggers for executables, but core dumps only need to represent memory layout. Program headers (PT_LOAD) are sufficient for this, and omitting section headers saves space.

  5. This means the memory region was all zeros (like uninitialized BSS). The kernel doesn’t store zero pages in the core file to save space—GDB knows to treat this region as zeros.

Real-World Applications

  • Crash reporting tools - Parse ELF structure to extract register values and stack data
  • Forensic analysis - Understand exactly what memory a process had access to
  • Custom debugging tools - Build specialized analysis tools by parsing core dumps directly
  • Minidump generation - Convert full core dumps to smaller formats for upload

Where You’ll Apply It

  • Project 6: Understanding stripped binary structure
  • Project 7: Parsing minidumps (similar structure)
  • All projects: Knowing where GDB gets its information

References

Key Insights

A core dump is just another ELF file—but instead of code and data for execution, it contains memory snapshots (PT_LOAD) and metadata (PT_NOTE) for post-mortem analysis.

Summary

Core dumps use ELF format with PT_NOTE segments for metadata (registers, signal, file mappings) and PT_LOAD segments for memory contents. The NT_PRSTATUS note contains registers and signal info. The NT_FILE note lists memory-mapped files. Use readelf -h, readelf -l, and readelf -n to inspect structure.

Homework/Exercises

  1. Exercise 1: Use readelf -l on a core dump and count the PT_LOAD segments. Correlate them with the output of /proc/[pid]/maps from a running process of the same program.

  2. Exercise 2: Use readelf -n to extract the signal number from NT_PRSTATUS. Verify it matches the signal you expected.

  3. Exercise 3: Write a Python script using the struct module to parse the ELF header and count program headers in a core dump.

  4. Exercise 4: Compare the ELF structure of a core dump vs. the original executable using readelf -h on both.

Solutions to Homework/Exercises

Exercise 1 Solution:

# First, run a program and check its maps
$ ./myprogram &
$ cat /proc/$!/maps
# Note the memory regions

# Then generate a core dump (Ctrl+\ or kill -QUIT)
$ readelf -l core | grep LOAD
# Each PT_LOAD should correspond to a mapped region

Exercise 2 Solution:

$ readelf -n core | grep -A5 "NT_PRSTATUS"
# The signal is stored in the prstatus structure (si_signo is its first field).
# eu-readelf -n (from elfutils) decodes it directly; with plain binutils,
# hex-dump the note data at the PT_NOTE offset reported by readelf -l:

# For SIGSEGV (11):
$ xxd -s 0x350 -l 64 core  # 0x350 = approximate offset of prstatus; adjust to your core

Exercise 3 Solution (outline):

import struct

with open('core', 'rb') as f:
    # ELF header is 64 bytes for ELF64
    header = f.read(64)

    # Parse key fields
    magic = header[0:4]  # Should be b'\x7fELF'
    phoff = struct.unpack('<Q', header[32:40])[0]  # Program header offset
    phnum = struct.unpack('<H', header[56:58])[0]  # Number of program headers

    print(f"Program headers: {phnum} at offset {phoff}")

Exercise 4 Solution:

$ readelf -h ./myprogram | grep Type
  Type:                              EXEC (Executable file)

$ readelf -h core | grep Type
  Type:                              CORE (Core file)

# Key differences: Type field, entry point (0 for core), section headers

Chapter 3: GDB for Post-Mortem Debugging

Fundamentals

GDB (GNU Debugger) is the primary tool for analyzing core dumps on Linux. While GDB is commonly used for live debugging (setting breakpoints, stepping through code), post-mortem debugging—loading a core dump after a crash—is fundamentally different: you’re examining a frozen moment in time, not a running process.

In post-mortem mode, you cannot step forward, set breakpoints, or continue execution. Instead, you can inspect the state at crash time: the call stack (backtrace), variable values, memory contents, and register values. The key insight is that a core dump + the original executable + debug symbols together reconstruct the complete picture of what went wrong.

The basic workflow is: gdb <executable> <core-file>. GDB loads the executable to get symbol information and the core file to get the crash state.

Deep Dive

Loading a Core Dump

The fundamental GDB command for core dump analysis is straightforward:

$ gdb ./program core
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
...
Core was generated by `./program'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000555555555149 in main () at crash.c:5
5           *ptr = 42;
(gdb)

GDB immediately shows you:

  1. Which program generated the core
  2. Which signal terminated it
  3. The function and line where it stopped (if symbols are available)

The Backtrace: Your Map of the Crash

The backtrace (or bt) command shows the call stack at crash time:

(gdb) bt
#0  0x0000555555555149 in vulnerable_function (input=0x7fffffffe010 "AAAA") at crash.c:5
#1  0x0000555555555178 in process_data (data=0x7fffffffe010 "AAAA") at crash.c:12
#2  0x00005555555551a2 in main (argc=2, argv=0x7fffffffe108) at crash.c:18

Each frame represents a function call. Frame #0 is where the crash occurred. Higher numbers are callers going back to main (and beyond to _start if you look far enough).

Navigating Stack Frames

You can switch between frames to examine different contexts:

(gdb) frame 1                    # Switch to process_data()
#1  0x0000555555555178 in process_data (data=0x7fffffffe010 "AAAA") at crash.c:12
12          vulnerable_function(data);

(gdb) info args                  # Show function arguments
data = 0x7fffffffe010 "AAAA"

(gdb) info locals                # Show local variables
local_buffer = "..."

(gdb) up                         # Move up one frame (to caller)
(gdb) down                       # Move down one frame (to callee)

Inspecting Variables and Memory

The print command examines variable values:

(gdb) print ptr
$1 = (int *) 0x0                 # NULL pointer - found the bug!

(gdb) print *ptr                 # Try to dereference
Cannot access memory at address 0x0

(gdb) print my_struct
$2 = {name = "test", value = 42, next = 0x555555558040}

(gdb) print my_struct.name
$3 = "test"

The x (examine) command inspects raw memory:

(gdb) x/16xw $rsp               # 16 words in hex, starting at stack pointer
0x7fffffffe000: 0x41414141 0x41414141 0x41414141 0x41414141

(gdb) x/s 0x555555556000        # Examine as string
0x555555556000: "Hello, World!"

(gdb) x/10i $rip                # Examine 10 instructions at instruction pointer
   0x555555555149 <main+20>:    mov    DWORD PTR [rax],0x2a
   0x55555555514f <main+26>:    mov    eax,0x0

Format specifiers for x:

  • x - hexadecimal
  • d - decimal
  • s - string
  • i - instruction (disassembly)
  • c - character
  • b/h/w/g - byte/halfword(2)/word(4)/giant(8) sizes

Register Inspection

Registers often reveal the immediate cause of a crash:

(gdb) info registers
rax            0x0                 0        # Often holds bad address
rbx            0x0                 0
rcx            0x7ffff7f9a6a0      140737353705120
rdx            0x7ffff7f9c4e0      140737353712864
rsi            0x0                 0
rdi            0x7fffffffe010      140737488347152
rbp            0x7fffffffe030      0x7fffffffe030
rsp            0x7fffffffe000      0x7fffffffe000    # Stack pointer
rip            0x555555555149      0x555555555149 <main+20>  # Instruction pointer

(gdb) print $rip                 # Access registers with $ prefix
$1 = (void (*)()) 0x555555555149 <main+20>

The Critical Importance of Debug Symbols

Without debug symbols (compiled with -g), you lose:

  • Function names (replaced by addresses)
  • Variable names and types
  • Source file and line information
# WITH symbols:
#0  0x0000555555555149 in main () at crash.c:5
(gdb) print ptr
$1 = (int *) 0x0

# WITHOUT symbols:
#0  0x0000555555555149 in ?? ()
(gdb) print ptr
No symbol "ptr" in current context.

Essential GDB Commands for Core Analysis

Command          Shortcut    Description
backtrace        bt          Show call stack
backtrace full   bt full     Show stack with local variables
frame N          f N         Switch to frame N
up / down                    Move up/down the stack
info registers   i r         Show CPU registers
info args                    Show function arguments
info locals                  Show local variables
print EXPR       p EXPR      Evaluate and print expression
x/FMT ADDR                   Examine memory
list             l           Show source code at current location
disassemble      disas       Show assembly code
info threads     i threads   List all threads
thread N         t N         Switch to thread N

Using GDB’s Python API (Preview)

GDB can be scripted with Python for automation:

# Save as analyze.py, run with: gdb -x analyze.py ./program core
import gdb

gdb.execute("set pagination off")
print("=== Crash Analysis ===")
print(gdb.execute("bt", to_string=True))
print("=== Registers ===")
rip = gdb.parse_and_eval("$rip")
print(f"RIP: {rip}")

This is covered in depth in Project 4.
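
You can also drive GDB entirely from the outside, which is often more convenient for batch triage than an interactive session. A sketch using Python's subprocess module and GDB's batch mode (assumes ./program and core exist and that gdb is on PATH; the script name triage.py is made up):

# triage.py - run GDB non-interactively against a core dump and capture the output.
import subprocess

result = subprocess.run(
    ["gdb", "--batch", "--nx",            # --nx skips .gdbinit for reproducible output
     "-ex", "set pagination off",
     "-ex", "bt full",
     "-ex", "info registers",
     "./program", "core"],
    capture_output=True, text=True, timeout=60,
)
print(result.stdout)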

How This Fits in Projects

  • Project 2: Master the basic GDB workflow with backtraces
  • Project 3: Use memory inspection to diagnose corruption
  • Project 4: Automate GDB with Python scripting
  • Project 5: Apply GDB to multi-threaded crashes
  • Project 6: Use GDB with stripped binaries

Mental Model Diagram

                        GDB CORE DUMP WORKFLOW
┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│  INPUT                          GDB                          OUTPUT         │
│  ┌─────────────────┐       ┌──────────────┐           ┌──────────────────┐ │
│  │  Executable     │       │              │           │  Crash Location  │ │
│  │  (with -g)      │──────►│   Symbol     │──────────►│  Function names  │ │
│  │                 │       │   Matching   │           │  Line numbers    │ │
│  └─────────────────┘       │              │           │  Variable types  │ │
│                            │              │           └──────────────────┘ │
│  ┌─────────────────┐       │              │                                │
│  │  Core Dump      │──────►│   State      │           ┌──────────────────┐ │
│  │  (memory +      │       │   Extraction │──────────►│  Backtrace       │ │
│  │   registers)    │       │              │           │  Variable values │ │
│  └─────────────────┘       │              │           │  Memory contents │ │
│                            │              │           │  Register state  │ │
│                            └──────────────┘           └──────────────────┘ │
│                                                                             │
│  KEY COMMANDS:                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  bt           → Call stack (where did it crash?)                    │   │
│  │  frame N      → Switch context (examine caller)                     │   │
│  │  info args    → Function parameters (what was passed?)              │   │
│  │  info locals  → Local variables (what state?)                       │   │
│  │  print VAR    → Variable value (specific data)                      │   │
│  │  x/FMT ADDR   → Raw memory (bit-level view)                         │   │
│  │  info reg     → CPU registers (hardware state)                      │   │
│  │  list         → Source code (if available)                          │   │
│  │  disas        → Assembly (always available)                         │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘

Minimal Concrete Example

# Compile with debug symbols
$ gcc -g -o crash crash.c

# Generate core dump
$ ulimit -c unlimited
$ ./crash
Segmentation fault (core dumped)

# Analyze with GDB
$ gdb ./crash core
(gdb) bt
#0  0x0000555555555149 in main () at crash.c:5
(gdb) list
1       #include <stdio.h>
2
3       int main(void) {
4           int *ptr = NULL;
5           *ptr = 42;      # <-- Crash here
6           return 0;
7       }
(gdb) print ptr
$1 = (int *) 0x0
(gdb) info registers rax rip
rax            0x0                 0
rip            0x555555555149      0x555555555149 <main+20>

Common Misconceptions

  1. “I can step through code in a core dump” - No, the program isn’t running. You can only examine the frozen state. Commands like next, step, continue don’t work.

  2. “I need the exact same binary” - Not exact, but the symbols must match. If you have a debug version of the same source, it will work. Different builds may have different addresses.

  3. “GDB shows me the bug” - GDB shows you the symptom (where it crashed). Finding the cause (often earlier in the program) requires detective work.

  4. “Missing symbols means I can’t debug” - You can still see addresses and disassembly. It’s harder, but not impossible (covered in Project 6).

Check-Your-Understanding Questions

  1. What are the two files you need to load a core dump in GDB?
  2. What does frame #0 represent in a backtrace?
  3. What command shows local variables in the current stack frame?
  4. How do you examine 8 bytes of memory at address 0x7fff0000 as hexadecimal?
  5. Why might print ptr fail with “No symbol” even though the core dump is valid?

Check-Your-Understanding Answers

  1. The executable (preferably compiled with -g for symbols) and the core dump file. Command: gdb ./program core

  2. Frame #0 is the innermost frame—the function that was executing when the crash occurred. It’s where the instruction pointer (RIP) was pointing.

  3. info locals shows local variables. info args shows function arguments.

  4. x/8xb 0x7fff0000 (8 individual bytes in hex), x/2xw 0x7fff0000 (two 4-byte words), or x/1xg 0x7fff0000 (one 8-byte giant word)

  5. The executable was compiled without debug symbols (-g flag). The core dump is valid and contains the value, but GDB can’t map addresses to variable names without symbols.

Real-World Applications

  • Production crash triage - Quickly identify crash location in server applications
  • Bug reports - Extract relevant information to share with developers
  • Regression testing - Analyze crashes from automated test runs
  • Security research - Examine crash state for vulnerability analysis

Where You’ll Apply It

  • Project 2: Basic backtrace extraction
  • Project 3: Memory inspection for corruption analysis
  • Project 4: Scripting GDB for automation
  • Project 5: Multi-threaded analysis with info threads
  • Project 6: Working without symbols

References

Key Insights

GDB + core dump = a time machine to the moment of death. You can’t change the past, but you can thoroughly examine what went wrong.

Summary

GDB loads core dumps with gdb <executable> <core>. Essential commands: bt (backtrace), frame N (switch context), print (variables), x/FMT ADDR (raw memory), info registers (CPU state). Debug symbols (-g) are crucial for meaningful output. Post-mortem debugging examines frozen state—you cannot step or continue.

Homework/Exercises

  1. Exercise 1: Create a program with three nested function calls (main → func1 → func2) where func2 crashes. Use GDB to examine each frame’s local variables.

  2. Exercise 2: Practice memory examination: write a program that stores the string “DEADBEEF” in a buffer, then examine it with x/8xb, x/2xw, and x/s.

  3. Exercise 3: Compare the output of bt with and without debug symbols on the same crash. Document the differences.

  4. Exercise 4: Write down 10 GDB commands and their purposes from memory. Then verify against the documentation.

Solutions to Homework/Exercises

Exercise 1 Solution:

// nested.c
void func2(int *p) { *p = 42; }
void func1(int *p) { func2(p); }
int main() { int *p = 0; func1(p); return 0; }
(gdb) bt
#0  func2 (p=0x0) at nested.c:1
#1  func1 (p=0x0) at nested.c:2
#2  main () at nested.c:3
(gdb) frame 1
(gdb) info args
p = 0x0

Exercise 2 Solution:

// Divide by a volatile zero so the compiler can't fold the division away; SIGFPE fires in main's frame
int main(void) { char buf[16] = "DEADBEEF"; volatile int zero = 0; volatile int x = 1 / zero; (void)x; (void)buf; return 0; }
(gdb) x/8xb buf
0x...: 0x44 0x45 0x41 0x44 0x42 0x45 0x45 0x46  # D E A D B E E F
(gdb) x/2xw buf
0x...: 0x44414544 0x46454542  # Note: little-endian
(gdb) x/s buf
0x...: "DEADBEEF"

Exercise 3 Solution:

  With -g: the backtrace shows function names, file:line locations, and variable names.
  Without -g: frames appear as ?? (), there is no file/line information, and print reports “No symbol in current context”.

Exercise 4 Solution: bt, frame, up, down, print, x, info registers, info locals, info args, list, disassemble, quit


Chapter 4: Multi-Threaded Crash Analysis

Fundamentals

Modern applications are often multi-threaded, which adds complexity to crash analysis. When a multi-threaded program crashes, the core dump captures the state of all threads, not just the one that triggered the crash. Understanding how to navigate between threads and correlate their states is essential for diagnosing concurrency bugs.

The challenge with multi-threaded crashes is that the crashing thread often reveals the symptom (e.g., a NULL pointer dereference), but the cause may be another thread that corrupted shared data or failed to properly synchronize. You must examine all threads to understand the full picture.

GDB provides commands like info threads, thread N, and thread apply all to navigate and inspect multiple threads in a core dump.

Deep Dive

How Threads Appear in Core Dumps

Each thread in a process has its own:

  • Stack (with its own local variables and call chain)
  • Registers (including its own instruction pointer)
  • Thread ID (TID/LWP - Light Weight Process)

In the ELF core dump, each thread gets its own NT_PRSTATUS note containing that thread’s registers. GDB parses these to reconstruct the state of each thread.
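
You can verify this without even starting GDB: counting NT_PRSTATUS notes tells you how many threads the core contains. A small sketch that shells out to readelf (assumes binutils is installed and the core file is named core):

# thread_count.py - count threads in a core dump by counting NT_PRSTATUS notes.
import subprocess

notes = subprocess.run(["readelf", "-n", "core"],
                       capture_output=True, text=True, check=True).stdout
print("Threads in core:", notes.count("NT_PRSTATUS"))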

Listing All Threads

(gdb) info threads
  Id   Target Id                    Frame
* 1    Thread 0x7ffff7fb4740 (LWP 12345) main () at main.c:30
  2    Thread 0x7ffff7fb3700 (LWP 12346) worker_func () at worker.c:15
  3    Thread 0x7ffff7fb2700 (LWP 12347) writer_func () at writer.c:22

The * indicates the “current” thread—the one GDB is focused on. In a crash, this is typically the thread that received the fatal signal.

Switching Between Threads

(gdb) thread 2                    # Switch to thread 2
[Switching to thread 2 (Thread 0x7ffff7fb3700 (LWP 12346))]
#0  worker_func () at worker.c:15

(gdb) bt                          # Now shows thread 2's stack
#0  worker_func () at worker.c:15
#1  thread_entry () at main.c:20
...

Getting All Backtraces at Once

The most powerful command for multi-threaded crash analysis:

(gdb) thread apply all bt

Thread 3 (Thread 0x7ffff7fb2700 (LWP 12347)):
#0  writer_func () at writer.c:22
#1  thread_entry () at main.c:25
...

Thread 2 (Thread 0x7ffff7fb3700 (LWP 12346)):
#0  worker_func () at worker.c:15
#1  thread_entry () at main.c:20
...

Thread 1 (Thread 0x7ffff7fb4740 (LWP 12345)):
#0  0x0000555555555149 in main () at main.c:30
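
The same result can be produced with GDB's Python API, which is useful when you want to post-process the per-thread backtraces instead of reading them by eye. A sketch (run inside GDB, for example gdb -x all_threads.py ./program core; the script name is made up):

# all_threads.py - print a short backtrace for every thread in the core,
# roughly equivalent to `thread apply all bt 5`.
import gdb

gdb.execute("set pagination off")
for inferior in gdb.inferiors():
    for thread in inferior.threads():
        thread.switch()                              # make this the current thread
        pid, lwp, _ = thread.ptid                    # ptid = (pid, lwp, tid)
        print(f"--- Thread {thread.num} (LWP {lwp}) ---")
        print(gdb.execute("bt 5", to_string=True))   # innermost 5 frames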

Common Multi-Threaded Bug Patterns

  1. Data Race - Two threads access shared data without synchronization, one writes, one reads. The reader may see corrupted or inconsistent data.

  2. Use-After-Free Race - Thread A frees memory, Thread B still has a pointer and uses it.

  3. Double-Free Race - Two threads each try to free the same memory.

  4. Deadlock-Induced Timeout - While not a crash per se, if a program is killed due to a deadlock, the core dump shows threads waiting on locks.

Detecting a Data Race from a Core Dump

Consider this scenario:

// Shared global
char *g_data = NULL;

// Thread 1: Writer
void writer_thread() {
    g_data = malloc(100);
    strcpy(g_data, "Hello");
}

// Thread 2: Reader (crashes!)
void reader_thread() {
    printf("%s\n", g_data);  // CRASH if g_data is still NULL
}

In the core dump:

(gdb) thread apply all bt
Thread 2:
#0  reader_thread () at race.c:12
...

Thread 1:
#0  writer_thread () at race.c:6
...

(gdb) thread 2
(gdb) print g_data
$1 = (char *) 0x0            # Still NULL when reader tried to use it!

(gdb) thread 1
(gdb) info locals
# Maybe see that malloc was about to be called

The race condition is revealed: Thread 2 accessed g_data before Thread 1 finished initializing it.

Examining Locks and Synchronization State

For programs using pthreads, you can examine mutex states:

(gdb) print my_mutex
$1 = {__data = {__lock = 1, __count = 0, __owner = 12346, ...}}
#                                         ^^^^^^^^^^^
#                   This thread (LWP 12346) holds the lock

If a thread is waiting on a lock, its backtrace often shows pthread_mutex_lock or similar.
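
If you automate this kind of check, the same glibc-internal fields can be read through GDB's Python API. A sketch that assumes a global pthread_mutex_t named my_mutex (as in the snippet above) and glibc's internal __data layout, which is not a stable interface:

# mutex_owner.py - run inside GDB: gdb -x mutex_owner.py ./program core
import gdb

mutex  = gdb.parse_and_eval("my_mutex")
locked = int(mutex["__data"]["__lock"]) != 0
owner  = int(mutex["__data"]["__owner"])             # LWP of the holder, 0 if unowned
print(f"my_mutex locked={locked}, owner LWP={owner if owner else 'none'}")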

How This Fits in Projects

  • Project 5: Create and analyze a multi-threaded race condition crash
  • Project 10: Handle multi-threaded crashes in your crash reporter

Mental Model Diagram

                    MULTI-THREADED CRASH ANALYSIS
┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│                            SINGLE PROCESS                                   │
│  ┌───────────────────────────────────────────────────────────────────────┐ │
│  │                                                                       │ │
│  │   Thread 1 (Main)        Thread 2 (Worker)     Thread 3 (Writer)     │ │
│  │   ┌─────────────┐        ┌─────────────┐       ┌─────────────┐       │ │
│  │   │ Stack       │        │ Stack       │       │ Stack       │       │ │
│  │   │ Registers   │        │ Registers   │       │ Registers   │       │ │
│  │   │ TID: 12345  │        │ TID: 12346  │       │ TID: 12347  │       │ │
│  │   └──────┬──────┘        └──────┬──────┘       └──────┬──────┘       │ │
│  │          │                      │                     │               │ │
│  │          └──────────────────────┼─────────────────────┘               │ │
│  │                                 │                                     │ │
│  │                     ┌───────────▼───────────┐                         │ │
│  │                     │   SHARED MEMORY       │                         │ │
│  │                     │   - Global variables  │                         │ │
│  │                     │   - Heap              │                         │ │
│  │                     │   - Mutexes           │                         │ │
│  │                     └───────────────────────┘                         │ │
│  │                                                                       │ │
│  └───────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
│   CRASH IN THREAD 1:                                                        │
│   ┌─────────────────────────────────────────────────────────────────────┐  │
│   │  Core dump captures ALL threads' states                             │  │
│   │  - Each thread has its own NT_PRSTATUS in the core                  │  │
│   │  - The symptom is in Thread 1, but the cause may be Thread 2 or 3   │  │
│   │  - Use `thread apply all bt` to see all backtraces                  │  │
│   └─────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│   GDB COMMANDS:                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐  │
│   │  info threads          → List all threads                           │  │
│   │  thread N              → Switch to thread N                         │  │
│   │  thread apply all bt   → Backtrace ALL threads                      │  │
│   │  thread apply all info locals → All threads' local vars            │  │
│   └─────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘

Minimal Concrete Example

// race_crash.c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

char *g_data = NULL;

void *writer_thread(void *arg) {
    sleep(1);  // Simulate slow initialization
    g_data = malloc(100);
    return NULL;
}

void *reader_thread(void *arg) {
    // Race: may execute before the writer finishes
    char first = g_data[0];            // CRASH: NULL dereference when g_data is still NULL
    printf("Data starts with: %c\n", first);
    return NULL;
}

int main() {
    pthread_t writer, reader;
    pthread_create(&writer, NULL, writer_thread, NULL);
    pthread_create(&reader, NULL, reader_thread, NULL);
    pthread_join(writer, NULL);
    pthread_join(reader, NULL);
    return 0;
}
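
A minimal build-and-run sequence (assuming core dumps are already enabled as described in Chapter 1; the sleep() in the writer makes this particular race nearly deterministic):

$ gcc -g -O0 -pthread -o race_crash race_crash.c
$ ./race_crash
Segmentation fault (core dumped)
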
$ gdb ./race_crash core
(gdb) info threads
  Id   Target Id                    Frame
* 2    Thread ... (LWP ...) reader_thread () at race_crash.c:15
  3    Thread ... (LWP ...) writer_thread () at race_crash.c:10
  1    Thread ... (LWP ...) pthread_join () ...

(gdb) thread 2
(gdb) print g_data
$1 = (char *) 0x0    # NULL - writer hasn't finished!

Common Misconceptions

  1. “The crashing thread is always the buggy one” - Often the crash is a symptom. Another thread corrupted data that caused this thread to crash.

  2. “Core dumps only capture the crashing thread” - No, all threads are captured. Each has its own registers and stack in the dump.

  3. “I can see lock contention history” - No, you only see the current state. You can’t see what happened before the crash.

  4. “Thread IDs are stable across runs” - No, TIDs (LWP numbers) are assigned by the kernel and vary between runs.

Check-Your-Understanding Questions

  1. How do you list all threads in a GDB core dump session?
  2. What does the * mean in info threads output?
  3. How do you get a backtrace for every thread at once?
  4. Where in the core dump are individual thread states stored?
  5. If Thread 1 crashes due to a NULL pointer, how might Thread 2 be responsible?

Check-Your-Understanding Answers

  1. Use info threads to list all threads with their IDs, target IDs, and current frames.

  2. The * indicates the currently selected/focused thread—usually the one that received the fatal signal.

  3. Use thread apply all bt to get backtraces for all threads in one command.

  4. Each thread has its own NT_PRSTATUS note in the PT_NOTE segment of the core dump, containing that thread’s registers.

  5. Thread 2 might have been responsible for initializing the pointer but hadn’t done so yet (race condition), or Thread 2 might have freed or corrupted the memory Thread 1 was using.

Key Insights

In multi-threaded crashes, the crashing thread shows the symptom, but the cause may be in any thread. Always examine all threads.

Summary

Multi-threaded core dumps capture all threads’ states. Use info threads to list them, thread N to switch, and thread apply all bt for all backtraces. Data races, use-after-free, and synchronization issues require examining shared state across threads. The crashing thread often isn’t the root cause.


Chapter 5: Kernel Crash Analysis (kdump and crash)

Fundamentals

When the Linux kernel itself crashes (a “kernel panic”), the normal core dump mechanism can’t work—the kernel is the core dump generator. Instead, Linux uses kdump, which boots a secondary “capture kernel” to save the memory of the panicked kernel. The resulting file is called a vmcore.

Analyzing a vmcore requires the crash utility, which is essentially “GDB for the kernel.” It understands kernel data structures and can navigate process lists, examine kernel stacks, and inspect driver state—all from a frozen snapshot of the entire system.

This is advanced material, but understanding the basics gives you powerful debugging capabilities for system-level issues.

Deep Dive

How kdump Works

When the kernel panics, it can’t simply write a file—the file system might be corrupted, and the kernel’s own code might be broken. Instead, kdump uses kexec to immediately boot into a small, pre-loaded “capture kernel” that runs from reserved memory:

Normal Operation:
┌─────────────────────────────────────────────────────────────────┐
│                      RUNNING KERNEL                              │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Normal kernel uses most of memory                       │    │
│  └─────────────────────────────────────────────────────────┘    │
│  ┌──────────────┐                                               │
│  │  Reserved    │ ← Capture kernel loaded here (crashkernel=)  │
│  │  Memory      │                                               │
│  └──────────────┘                                               │
└─────────────────────────────────────────────────────────────────┘

After Panic:
┌─────────────────────────────────────────────────────────────────┐
│                    CAPTURE KERNEL RUNNING                        │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  Panicked kernel's memory preserved, accessible via     │    │
│  │  /proc/vmcore in capture kernel                          │    │
│  └─────────────────────────────────────────────────────────┘    │
│  ┌──────────────┐                                               │
│  │  Capture     │ ← This small kernel is now running           │
│  │  Kernel      │ ← Saves /proc/vmcore to disk                  │
│  └──────────────┘                                               │
└─────────────────────────────────────────────────────────────────┘

Configuring kdump

  1. Reserve memory - Add crashkernel=256M (or more) to the kernel command line in GRUB
  2. Install packages - kexec-tools and crash
  3. Enable the service - systemctl enable kdump
  4. Get debug symbols - Install kernel-debuginfo or equivalent

# Check kdump status
$ systemctl status kdump
● kdump.service - Crash recovery kernel arming
   Active: active (exited)

# Check crashkernel reservation
$ cat /proc/cmdline | grep crashkernel
crashkernel=256M

# List crash dumps (after a panic)
$ ls /var/crash/
127.0.0.1-2024-12-20-15:30:00/
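
You can also confirm that a capture kernel is actually loaded into the reserved region (1 means armed, 0 means not loaded):

$ cat /sys/kernel/kexec_crash_loaded
1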

The crash Utility

The crash utility is an interactive tool for analyzing vmcore files:

# Basic invocation
$ crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/.../vmcore

crash> help              # List available commands
crash> bt                # Backtrace of panicking task
crash> log               # Kernel log buffer (dmesg)
crash> ps                # Process list at crash time
crash> files <pid>       # Open files for a process
crash> vm <pid>          # Virtual memory info
crash> struct <name>     # Examine kernel structure
crash> quit              # Exit

Essential crash Commands

Command Description
bt Backtrace of current (panicking) task
bt -a Backtrace of all CPUs
log Kernel log buffer (like dmesg)
ps List all processes
files <pid> Open files for a process
vm <pid> Virtual memory for a process
struct <name> <addr> Display kernel structure
mod List loaded modules
kmem -i Memory usage summary
foreach bt Backtrace every process

Analyzing a Kernel Panic

When you trigger a panic (e.g., via a buggy kernel module), the output in crash looks like:

crash> bt
PID: 1234    TASK: ffff88810a4d8000  CPU: 1   COMMAND: "insmod"
 #0 [ffffc90000a77e30] machine_kexec at ffffffff8100259b
 #1 [ffffc90000a77e80] __crash_kexec at ffffffff8110d9ab
 #2 [ffffc90000a77f00] panic at ffffffff8106a3e8
 #3 [ffffc90000a77f30] oops_end at ffffffff81c01b9a
 #4 [ffffc90000a77f80] no_context at ffffffff8104d2ab
 #5 [ffffc90000a77ff0] do_page_fault at ffffffff81c0605e
 #6 [ffffc90000a77ff8] page_fault at ffffffff82000b9e
 #7 [ffffc90000a78050] buggy_init at ffffffffc0670010 [buggy_module]
                                   ^^^^^^^^^^^^^^^ YOUR BUGGY CODE

crash> log | tail -20
[  123.456] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  123.457] #PF: supervisor write access in kernel mode
...
[  123.465] Kernel panic - not syncing: Fatal exception
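
From here you would typically pull in the module's debug info and disassemble the faulting function to pin down the exact instruction. A sketch (output omitted; dis -l needs the module built with debug info):

crash> mod -s buggy_module        # load the module's symbols/debuginfo
crash> dis -l buggy_init          # disassemble with source line annotations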

How This Fits in Projects

  • Project 8: Configure kdump and trigger a kernel panic with a buggy module
  • Project 9: Use crash to analyze the resulting vmcore

Mental Model Diagram

                    KERNEL CRASH ANALYSIS PIPELINE
┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│  1. PANIC OCCURS           2. KEXEC TRIGGERS        3. CAPTURE & SAVE      │
│  ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐     │
│  │  BUG: NULL ptr  │      │  kexec boots    │      │  /proc/vmcore   │     │
│  │  dereference    │ ───► │  capture kernel │ ───► │  saved to       │     │
│  │  in kernel code │      │  from reserved  │      │  /var/crash/    │     │
│  │                 │      │  memory         │      │                 │     │
│  └─────────────────┘      └─────────────────┘      └─────────────────┘     │
│                                                            │                │
│                                                            ▼                │
│  4. ANALYSIS              5. INVESTIGATION                                  │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  $ crash vmlinux vmcore                                              │   │
│  │                                                                      │   │
│  │  crash> bt          # See panic stack trace                          │   │
│  │  crash> log         # See kernel messages                            │   │
│  │  crash> ps          # See running processes                          │   │
│  │  crash> mod         # See loaded modules                             │   │
│  │                                                                      │   │
│  │  Result: Identify buggy code path in kernel/module                   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  REQUIREMENTS:                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  - crashkernel= parameter in boot command line                       │   │
│  │  - kdump service enabled and running                                 │   │
│  │  - crash utility installed                                           │   │
│  │  - Kernel debug symbols (vmlinux with debuginfo)                     │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘

Minimal Concrete Example

// buggy_module.c - A kernel module that causes a panic
#include <linux/module.h>
#include <linux/kernel.h>

static int __init buggy_init(void) {
    int *ptr = NULL;
    printk(KERN_INFO "About to crash...\n");
    *ptr = 42;  // PANIC: NULL pointer dereference in kernel mode
    return 0;
}

static void __exit buggy_exit(void) {
    printk(KERN_INFO "Goodbye\n");
}

module_init(buggy_init);
module_exit(buggy_exit);
MODULE_LICENSE("GPL");
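
The build command below assumes a one-line kbuild Makefile sitting next to buggy_module.c:

# Makefile
obj-m += buggy_module.o
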
# Build the module
$ make -C /lib/modules/$(uname -r)/build M=$(pwd) modules

# Load it (IN A VM!)
$ sudo insmod buggy_module.ko
# System panics, kdump captures vmcore

# After reboot, analyze
$ sudo crash /usr/lib/debug/.../vmlinux /var/crash/.../vmcore
crash> bt
crash> log

Key Insights

Kernel crashes require a completely different capture mechanism (kdump) because the kernel can’t debug itself. The crash utility is GDB for the entire operating system.

Summary

Kernel panics are captured by kdump, which boots a capture kernel to save memory as a vmcore. The crash utility analyzes vmcore files using commands like bt, log, and ps. This requires crashkernel reservation, kdump service, and kernel debug symbols.


Chapter 6: Automation and Scripting

Fundamentals

Manual crash analysis doesn’t scale. Production systems may generate dozens of crashes daily, and each needs initial triage to determine severity and potential root cause. GDB’s Python API and batch mode enable automation, letting you build scripts that extract key information without human intervention.

Automation serves two purposes: efficiency (processing many dumps quickly) and consistency (every dump gets the same analysis, reducing human error). This chapter covers the techniques used in Project 4 (automated triage) and Project 10 (centralized crash reporter).

Deep Dive

GDB Batch Mode

The simplest automation is GDB’s batch mode, which runs commands from a file:

# commands.gdb
set pagination off
bt
info registers
quit

$ gdb -q --batch -x commands.gdb ./program core
#0  main () at crash.c:5
...

The -q (quiet) flag suppresses the welcome message. --batch exits after running commands.

GDB Python API

For more sophisticated automation, GDB embeds a Python interpreter. Your Python scripts run inside GDB and can access its internals:

# analyze.py - Run with: gdb -q --batch -x analyze.py ./program core
import gdb

# Disable paging
gdb.execute("set pagination off")

# Get backtrace as string
bt = gdb.execute("bt", to_string=True)
print("=== BACKTRACE ===")
print(bt)

# Access registers programmatically
rip = gdb.parse_and_eval("$rip")
rsp = gdb.parse_and_eval("$rsp")
print(f"RIP: {rip}")
print(f"RSP: {rsp}")

# Dump GDB's signal handling table (the fatal signal itself appears in the
# banner GDB prints when it loads the core, e.g. "terminated with signal SIGSEGV")
try:
    info = gdb.execute("info signals", to_string=True)
    print("=== SIGNALS ===")
    print(info[:500])  # First 500 chars
except gdb.error:
    pass

# Examine memory at the top of the stack
try:
    mem = gdb.execute("x/16xb $rsp", to_string=True)
    print("=== STACK TOP ===")
    print(mem)
except gdb.error:
    pass

Key GDB Python Functions

Function Description
gdb.execute(cmd) Run a GDB command
gdb.execute(cmd, to_string=True) Run command, capture output as string
gdb.parse_and_eval(expr) Evaluate expression, return GDB Value
gdb.selected_frame() Get current stack frame
gdb.selected_thread() Get current thread
gdb.inferiors() List of debugged programs
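
A short sketch that walks the crashed thread's frames with these objects (illustrative only, not a complete triage script; run it inside GDB):

# frames.py - run with: gdb -q --batch -x frames.py ./program core
import gdb

frame = gdb.selected_frame()              # innermost frame of the crashed thread
depth = 0
while frame is not None and depth < 16:   # cap the walk in case the stack is corrupted
    name = frame.name() or "??"           # None when no symbol is available
    sal = frame.find_sal()                # symtab-and-line: maps the frame to file:line
    location = f" at {sal.symtab.filename}:{sal.line}" if sal.symtab else ""
    print(f"#{depth}  {name}{location}")
    frame = frame.older()                 # step outward toward main()
    depth += 1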

Crash Fingerprinting

To deduplicate crashes (identify unique bugs vs. repeat occurrences), you need a stable “fingerprint.” A common approach:

  1. Extract the top 3-5 frames of the backtrace
  2. Normalize addresses to function names (or offset within function)
  3. Hash the result

import hashlib

def get_crash_fingerprint(bt_output):
    """Generate a fingerprint from a backtrace."""
    lines = bt_output.strip().split('\n')
    # Take first 5 frames
    frames = []
    for line in lines[:5]:
        # Extract function name (simplified)
        if ' in ' in line:
            func = line.split(' in ')[1].split(' ')[0]
            frames.append(func)

    fingerprint = '|'.join(frames)
    return hashlib.md5(fingerprint.encode()).hexdigest()[:16]
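
Inside the same GDB session you could combine it with the earlier extraction (the fingerprint value shown is made up):

bt_text = gdb.execute("bt", to_string=True)
print("fingerprint:", get_crash_fingerprint(bt_text))
# fingerprint: 9f2c41d07a3b58e6   -> identical call chains hash to the same value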

Building an Analysis Pipeline

A complete automation pipeline might look like:

┌──────────────────────────────────────────────────────────────────┐
│                    CRASH ANALYSIS PIPELINE                        │
├──────────────────────────────────────────────────────────────────┤
│  1. Core dump arrives (via core_pattern pipe or file watch)      │
│  2. Python script invokes GDB with analysis script               │
│  3. GDB script extracts:                                         │
│     - Backtrace (all threads)                                    │
│     - Registers                                                  │
│     - Signal info                                                │
│     - Key variables (if symbols available)                       │
│  4. Generate fingerprint from backtrace                          │
│  5. Store results:                                                │
│     - Raw core dump (compressed)                                 │
│     - Analysis report (JSON/text)                                │
│     - Fingerprint for deduplication                              │
│  6. Alert if new unique crash or high-severity                   │
└──────────────────────────────────────────────────────────────────┘
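
Step 1 of this pipeline can be a core_pattern pipe handler. The sketch below is illustrative (the script name and storage path are hypothetical); the kernel streams the core dump to the handler's stdin and substitutes %P, %s, and %t with the PID, signal number, and timestamp. On systemd systems you would normally leave systemd-coredump in place rather than replace it:

#!/usr/bin/env python3
# /usr/local/bin/core-intake.py  (hypothetical path)
# Registered with: echo '|/usr/local/bin/core-intake.py %P %s %t' > /proc/sys/kernel/core_pattern
import pathlib
import sys

def main():
    pid, sig, ts = sys.argv[1:4]
    out = pathlib.Path("/var/crashes") / f"core.{pid}.{sig}.{ts}"
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "wb") as f:                          # the kernel pipes the dump to stdin
        while chunk := sys.stdin.buffer.read(1 << 20):  # copy in 1 MiB chunks
            f.write(chunk)
    # A real pipeline would now enqueue `out` for the GDB-based triage script

if __name__ == "__main__":
    main()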

How This Fits in Projects

  • Project 4: Build an automated crash triage tool
  • Project 10: Create a complete crash reporting system

Key Insights

Automation transforms crash analysis from a manual, ad-hoc process into a systematic pipeline that can handle production scale.

Summary

GDB batch mode runs commands from files. GDB Python API provides programmatic access via gdb.execute() and gdb.parse_and_eval(). Crash fingerprinting uses backtrace frames to identify unique bugs. Automation enables production-scale crash analysis.


Glossary

Term Definition
Core Dump A file containing a snapshot of a process’s memory and CPU state at termination
Backtrace The call stack showing the sequence of function calls leading to the current point
SIGSEGV Signal 11, Segmentation Fault—raised when a process accesses invalid memory
ELF Executable and Linkable Format—the binary format for Linux executables and core dumps
PT_NOTE ELF program header type containing metadata (registers, signal, file mappings)
PT_LOAD ELF program header type containing actual memory contents
NT_PRSTATUS Note type containing process status and registers at crash time
Debug Symbols Metadata linking binary addresses to source code (file/line/variable names)
ulimit Shell command to set resource limits, including core dump size (ulimit -c)
core_pattern Kernel parameter (/proc/sys/kernel/core_pattern) controlling dump location
systemd-coredump Modern Linux daemon that captures and manages core dumps
coredumpctl Command-line tool to list and debug systemd-managed core dumps
GDB GNU Debugger—the primary tool for analyzing core dumps on Linux
Post-Mortem Debugging Analyzing a crashed program from its core dump after the fact
Stack Frame A section of the stack containing a function’s local variables and return address
kdump Kernel crash dump mechanism using kexec to boot a capture kernel
vmcore The memory dump file created by kdump after a kernel panic
crash The utility for analyzing kernel crash dumps (vmcore files)
kexec Mechanism to boot into a new kernel without going through firmware
Minidump A compact crash dump format used by Breakpad/Crashpad
Symbolication The process of resolving memory addresses to function/line names
Fingerprint A hash identifying a unique crash type for deduplication
Data Race A bug where two threads access shared data without synchronization
LWP Light Weight Process—another term for a thread’s kernel identifier

Why Linux Crash Dump Analysis Matters

Modern Relevance

Crash analysis is a fundamental skill for anyone working with systems software. Consider:

  • Cloud infrastructure runs millions of processes. When one crashes, operators need fast root cause analysis
  • IoT and embedded systems often can’t be debugged live; crash dumps are the only evidence
  • Security researchers analyze crashes to find vulnerabilities
  • SRE/DevOps teams need to correlate crashes with deployments and load patterns

According to Red Hat’s documentation, the crash utility and kernel debug symbols are essential tools for diagnosing kernel issues in enterprise Linux environments.

Real-World Statistics

  • Large-scale services may see thousands of process crashes daily (Facebook’s analysis infrastructure processes millions of crash reports)
  • Kernel panics, while rarer, can cause significant outages (each minute of downtime can cost enterprises $5,600 on average per Gartner estimates)
  • The average time to diagnose a crash without proper tooling can be hours; with proper crash analysis, minutes

Context and Evolution

Core dumps have existed since the earliest Unix systems (1970s). The name “core” comes from magnetic core memory. While the underlying technology has evolved (from simple memory snapshots to ELF-formatted files with metadata), the concept remains: preserve the state for later analysis.

Modern developments include:

  • systemd-coredump (2012+) for centralized management
  • Breakpad/Crashpad (Google) for cross-platform minidumps
  • Sentry, Bugsnag, Crashlytics for cloud-based crash aggregation
  • eBPF for live system analysis (complementing post-mortem)

Concept Summary Table

Concept Cluster What You Need to Internalize
Core Dump Fundamentals Core dumps are ELF files capturing process memory + CPU state. Generation requires ulimit -c and core_pattern. Modern systems use systemd-coredump.
ELF Core Format PT_NOTE segments hold metadata (registers in NT_PRSTATUS, files in NT_FILE). PT_LOAD segments hold memory. No section headers.
GDB Post-Mortem Load with gdb <exe> <core>. Key commands: bt, frame, print, x, info registers. Debug symbols (-g) are essential.
Multi-Threaded Analysis All threads captured in core. Use info threads, thread N, thread apply all bt. Crashing thread shows symptom, cause may be elsewhere.
Kernel Crash Analysis kdump uses kexec to boot capture kernel. vmcore analyzed with crash utility. Requires crashkernel reservation and debug symbols.
Automation GDB batch mode and Python API enable scripted analysis. Fingerprinting identifies unique crashes. Scales to production.

Project-to-Concept Map

Project Concepts Applied
Project 1: The First Crash Core Dump Fundamentals
Project 2: The GDB Backtrace GDB Post-Mortem, Debug Symbols
Project 3: The Memory Inspector GDB Post-Mortem, ELF Core Format
Project 4: Automated Crash Detective Automation, GDB Python API
Project 5: Multi-threaded Mayhem Multi-Threaded Analysis
Project 6: Stripped Binary Crash ELF Core Format, GDB without symbols
Project 7: Minidump Parser ELF Core Format (similar concepts)
Project 8: Kernel Panics Kernel Crash Analysis, kdump
Project 9: Analyzing with crash Kernel Crash Analysis, crash utility
Project 10: Centralized Reporter Automation, Fingerprinting, All concepts

Deep Dive Reading by Concept

Concept Book & Chapter Why This Matters
Core Dumps “The Linux Programming Interface” Ch. 22 Definitive coverage of signals and process termination
ELF Format “Practical Binary Analysis” Ch. 2-3 Deep dive into ELF structure
GDB Basics “The Art of Debugging with GDB” Ch. 1-4 Foundational GDB skills
Memory Layout “Computer Systems: A Programmer’s Perspective” Ch. 7, 9 Understanding virtual memory
Multi-threading “The Linux Programming Interface” Ch. 29-30 POSIX threads and synchronization
Kernel Internals “Linux Kernel Development” by Robert Love For kernel panic analysis
Kernel Modules “Linux Device Drivers, 3rd Ed” Ch. 1-2 Writing and debugging modules

Quick Start: Your First 48 Hours

Day 1: Foundation (4-6 hours)

  1. Read Theory Primer Chapters 1-3 (2 hours)
    • Core Dump Fundamentals
    • ELF Core Format (overview)
    • GDB Post-Mortem
  2. Complete Project 1 (2 hours)
    • Configure your system for core dumps
    • Verify dumps are generated
    • Definition of Done: file core* shows ELF core file
  3. Start Project 2 (1-2 hours)
    • Load your first core dump in GDB
    • Get a backtrace with line numbers

Day 2: Practical Skills (4-6 hours)

  1. Finish Project 2 (1 hour)
    • Compare with/without debug symbols
    • Definition of Done: Can identify crash location by file:line
  2. Start Project 3 (3-4 hours)
    • Create a memory corruption scenario
    • Practice print and x commands
    • Inspect variables across stack frames
  3. Review and Practice (1 hour)
    • Re-do the GDB homework exercises
    • Take notes on commands you find useful

Path 1: The Developer (Focus on User-Space)

If you’re a software developer wanting to debug your own applications:

  1. Project 1 → Project 2 → Project 3 (Weeks 1-2)
  2. Project 4 (Automation) → Project 5 (Multi-threaded) (Weeks 3-4)
  3. Project 6 (Stripped binaries) if you work with releases (Week 5)
  4. Skip kernel projects unless needed

Path 2: The SRE/DevOps Engineer

If you’re managing infrastructure and need to triage crashes:

  1. Project 1 → Project 2 (quick start) (Week 1)
  2. Project 4 (Automation - your main tool) (Week 2)
  3. Project 8 → Project 9 (Kernel crashes) (Weeks 3-4)
  4. Project 10 (Build your own crash collection) (Weeks 5-6)

Path 3: The Security Researcher

If you’re analyzing crashes for vulnerabilities:

  1. Project 1 → Project 2 → Project 3 (Weeks 1-2)
  2. Project 6 (Stripped binaries - common in targets) (Week 3)
  3. Project 7 (Minidump parsing) (Week 4)
  4. Deep study of ELF format and memory corruption patterns

Success Metrics

You’ve mastered this material when you can:

  1. Configure any Linux system for core dump capture in under 5 minutes
  2. Get a backtrace from a core dump and identify the crash location
  3. Navigate stack frames and inspect variables/memory in GDB
  4. Write a Python script that automates basic crash triage
  5. Analyze a multi-threaded crash and identify cross-thread issues
  6. Work with stripped binaries using disassembly
  7. Configure kdump and trigger a test kernel panic (in a VM)
  8. Use the crash utility to analyze a vmcore
  9. Explain the ELF core format and what each section contains
  10. Design a crash reporting pipeline for a production system

Project Overview Table

# Project Difficulty Time Key Skill
1 The First Crash Beginner 4-8h System configuration
2 The GDB Backtrace Beginner 4-8h Basic GDB workflow
3 The Memory Inspector Intermediate 10-15h Memory examination
4 Automated Crash Detective Intermediate 15-20h GDB scripting
5 Multi-threaded Mayhem Advanced 15-20h Thread analysis
6 Stripped Binary Crash Advanced 15-20h Disassembly
7 Minidump Parser Advanced 20-30h Binary parsing
8 Kernel Panics Expert 20-30h Kernel modules
9 Analyzing with crash Expert 15-20h Kernel debugging
10 Centralized Reporter Master 40+h System design

Project List

The following 10 projects guide you from your first intentional crash to building production-grade crash analysis infrastructure.


Project 1: The First Crash — Understanding Core Dump Generation

  • File: P01-first-crash-core-dump-generation.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust (for comparison)
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Systems Configuration, Process Signals
  • Software or Tool: ulimit, systemd-coredump, coredumpctl
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you will build: A controlled environment that generates, captures, and verifies core dumps from intentional crashes, along with a configuration script that sets up any Linux system for crash capture.

Why it teaches crash dump analysis: Before you can analyze a crash, you must reliably capture one. This project forces you to understand how the kernel decides whether to dump core, where it writes the dump, and how modern Linux (systemd) has changed the traditional core.PID pattern. You’ll learn by intentionally breaking things and verifying the evidence is preserved.

Core challenges you will face:

  • Configuring ulimit correctly → Maps to Core Dump Fundamentals (soft vs hard limits, shell vs process)
  • Understanding core_pattern → Maps to systemd-coredump integration
  • Choosing storage location → Maps to real-world deployment considerations
  • Triggering different crash types → Maps to understanding signals (SIGSEGV, SIGABRT, SIGFPE)

Real World Outcome

You will have a shell script that configures any Linux system for core dump capture and a test program that crashes in multiple ways. Running the test will produce visible, verifiable core dumps.

Example Output:

$ ./setup-coredumps.sh
[+] Checking current ulimit -c: 0
[+] Setting ulimit -c unlimited for current shell
[+] Current core_pattern: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
[+] systemd-coredump is active — using coredumpctl
[+] Configuration complete!

$ ./crash-test segfault
[*] About to trigger: SIGSEGV (Segmentation Fault)
[*] Dereferencing NULL pointer...
Segmentation fault (core dumped)

$ coredumpctl list
TIME                          PID  UID  GID SIG     COREFILE EXE                      SIZE
Sat 2025-01-04 10:23:45 UTC  1234 1000 1000 SIGSEGV present  /home/user/crash-test   245.2K

$ ./crash-test abort
[*] About to trigger: SIGABRT (Abort)
[*] Calling abort()...
Aborted (core dumped)

$ coredumpctl list | tail -2
Sat 2025-01-04 10:23:45 UTC  1234 1000 1000 SIGSEGV present  /home/user/crash-test   245.2K
Sat 2025-01-04 10:24:01 UTC  1235 1000 1000 SIGABRT present  /home/user/crash-test   246.1K

$ coredumpctl dump 1234 -o /tmp/core.test
$ file /tmp/core.test
/tmp/core.test: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from './crash-test segfault'

The Core Question You Are Answering

“Where does my crashed process’s memory go, and how do I make sure the kernel actually writes it?”

Before writing any code, understand that core dumps are not automatic. The kernel checks multiple conditions: Is RLIMIT_CORE non-zero? Is the executable setuid? Does the process have permission to write to the dump location? Is there enough disk space? Modern systems add another layer: systemd-coredump intercepts dumps before they hit the filesystem. You need to understand this pipeline before you can debug anything.

Concepts You Must Understand First

  1. Resource Limits (ulimit)
    • What is the difference between soft and hard limits?
    • Why does ulimit -c unlimited in a script not affect programs started afterward?
    • How do you set permanent limits via /etc/security/limits.conf?
    • Book Reference: “The Linux Programming Interface” by Kerrisk — Ch. 36
  2. Signals and Process Termination
    • Which signals generate core dumps by default (SIGQUIT, SIGILL, SIGABRT, SIGFPE, SIGSEGV, SIGBUS, SIGSYS, SIGTRAP)?
    • How can you catch a signal vs letting it dump core?
    • What happens when a signal handler re-raises the signal?
    • Book Reference: “The Linux Programming Interface” by Kerrisk — Ch. 20-22
  3. systemd-coredump
    • How does the kernel pipe core dumps to an external program?
    • Where does systemd-coredump store files (/var/lib/systemd/coredump/)?
    • How does compression (LZ4) affect storage and retrieval?
    • Book Reference: systemd-coredump(8) and coredumpctl(1) man pages

Questions to Guide Your Design

  1. Configuration Detection
    • How will your script detect if systemd-coredump is active vs traditional core files?
    • What should happen on non-systemd systems (Alpine, older RHEL)?
  2. Crash Triggering
    • How will you trigger each signal type (NULL deref for SIGSEGV, abort() for SIGABRT, 1/0 for SIGFPE)?
    • Should you compile with or without optimizations? Why?
  3. Verification
    • How will you verify the dump was actually created (not a zero-size file)?
    • What fields in coredumpctl info prove the dump is usable?
  4. Cleanup
    • How do you remove old test dumps without affecting real crashes?
    • What is the coredumpctl retention policy?

Thinking Exercise

Trace the Kernel Path

Before coding, trace what happens when a process dereferences NULL:

1. Process executes: *(int *)0 = 42;
2. CPU raises page fault (address 0 is not mapped)
3. Kernel's page fault handler runs
4. Handler finds no valid mapping → sends SIGSEGV to process
5. Process has no handler for SIGSEGV → default action is "dump core + terminate"
6. Kernel checks RLIMIT_CORE:
   - If 0 → no dump, just terminate
   - If >0 → proceed
7. Kernel reads /proc/sys/kernel/core_pattern:
   - If starts with "|" → pipe to that program (systemd-coredump)
   - Otherwise → write to file with that pattern
8. Kernel writes ELF core file with process memory + registers
9. Process terminates, parent receives SIGCHLD

Questions while tracing:

  • At which step can you lose the dump?
  • What if the pipe to systemd-coredump fails?
  • How does the kernel know which memory regions to include?
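
You can also check step 6's limit from inside a running process; for example, with Python's standard resource module:

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("RLIMIT_CORE soft:", soft, "hard:", hard)   # soft == 0 means the kernel skips the dump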

The Interview Questions They Will Ask

  1. “A production service crashed but there’s no core dump. Walk me through how you would debug why.”
  2. “What is the difference between ulimit in .bashrc and /etc/security/limits.conf?”
  3. “How does systemd-coredump differ from traditional core dumps, and what are the tradeoffs?”
  4. “Why might a setuid program not generate a core dump even with ulimit -c unlimited?”
  5. “How would you configure core dumps on a container running in Kubernetes?”

Hints in Layers

Hint 1: Start with Detection

First, detect the current configuration before changing anything. Read /proc/sys/kernel/core_pattern and compare ulimit -c (soft) vs ulimit -Hc (hard).

Hint 2: Handle Both Modes

Your script should work on both systemd and non-systemd systems. Use which coredumpctl or check for the |/usr/lib/systemd/systemd-coredump pattern to detect the mode.

Hint 3: Test Program Structure

main(argc, argv):
    if argc < 2:
        print usage: "./crash-test [segfault|abort|fpe|bus]"
        exit

    switch argv[1]:
        case "segfault":
            print "Triggering SIGSEGV..."
            ptr = NULL
            *ptr = 42  // crash here
        case "abort":
            print "Triggering SIGABRT..."
            abort()
        case "fpe":
            print "Triggering SIGFPE..."
            volatile x = 0
            y = 1 / x  // crash here
        case "bus":
            print "Triggering SIGBUS..."
            // Requires unaligned access on strict architectures

Hint 4: Verification Commands

Use these commands to verify your setup:

# Check if dump was created (systemd)
coredumpctl list | head -5

# Extract and verify
coredumpctl dump <PID> -o /tmp/test.core
file /tmp/test.core  # Should show "ELF 64-bit LSB core file"

# Check size (should be non-zero)
ls -la /tmp/test.core

Books That Will Help

Topic Book Chapter
Resource Limits “The Linux Programming Interface” by Kerrisk Ch. 36: Process Resources
Signals “The Linux Programming Interface” by Kerrisk Ch. 20-22: Signals
Core Dumps “The Linux Programming Interface” by Kerrisk Ch. 22.1: Core Dump Files
Practical GDB “The Art of Debugging” by Matloff & Salzman Ch. 1: Getting Started

Common Pitfalls and Debugging

Problem 1: “ulimit -c unlimited had no effect”

  • Why: ulimit is a shell builtin and only affects the current shell and its children. sudo ulimit -c unlimited therefore either fails outright or changes the limit in a throwaway shell that exits immediately—it never reaches your service's process.
  • Fix: Add limit to /etc/security/limits.conf for persistent change, or run ulimit in the same shell that launches the process.
  • Quick test: bash -c 'ulimit -c unlimited; ./crash-test segfault'

Problem 2: “Core dump was created but has 0 bytes”

  • Why: The process may have died before the dump completed, or the filesystem is full.
  • Fix: Check dmesg | grep -i core for kernel messages. Verify disk space with df -h.
  • Quick test: dmesg | tail -20

Problem 3: “coredumpctl shows ‘missing’ in COREFILE column”

  • Why: systemd-coredump may have retention policies that deleted old dumps, or the dump failed during capture.
  • Fix: Check /etc/systemd/coredump.conf for MaxUse= and KeepFree= settings.
  • Quick test: coredumpctl info <PID> for detailed error messages

Problem 4: “Crash happens but no ‘core dumped’ message”

  • Why: Shell might not report core dump status, or signal was caught.
  • Fix: Check exit status: ./crash-test segfault; echo "Exit: $?" (139 = 128+11 = SIGSEGV with core)
  • Quick test: Exit code 139 (SIGSEGV) or 134 (SIGABRT) indicates dump should have occurred

Definition of Done

  • Setup script detects and reports current core dump configuration
  • Setup script configures ulimit and verifies core_pattern
  • Test program triggers at least 3 different signal types (SIGSEGV, SIGABRT, SIGFPE)
  • Each crash produces a verifiable core dump (non-zero size, correct ELF type)
  • coredumpctl list or ls core.* shows all test crashes
  • Script works on at least 2 different Linux distributions (Ubuntu, Fedora, or similar)
  • Documentation explains the differences between systemd and traditional core patterns

Project 2: The GDB Backtrace — Extracting Crash Context

  • File: P02-gdb-backtrace-crash-context.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Debugging, Stack Traces
  • Software or Tool: GDB, debug symbols (-g flag)
  • Main Book: “The Art of Debugging with GDB, DDD, and Eclipse” by Matloff & Salzman

What you will build: A debugging workflow that extracts meaningful crash information from core dumps using GDB. You will create programs that crash in various ways and practice the essential GDB commands to understand exactly what went wrong.

Why it teaches crash dump analysis: The backtrace is your first tool when analyzing any crash. But a raw backtrace is often useless without understanding stack frames, argument values, and local variables. This project teaches you to navigate from “Segmentation fault” to “Line 47 of foo.c passed NULL to bar()”.

Core challenges you will face:

  • Loading core dumps correctly → Maps to GDB’s core-file command and executable matching
  • Reading backtraces with symbols → Maps to understanding compile flags (-g, -O)
  • Navigating stack frames → Maps to frame selection and variable inspection
  • Comparing with/without debug symbols → Maps to real-world debugging constraints

Real World Outcome

You will have a debugging toolkit that demonstrates GDB’s core dump analysis capabilities. Running your workflow will produce clear, actionable crash information.

Example Output:

$ ./demo-crashes linked-list-corruption
[*] Creating a linked list with 5 nodes
[*] Corrupting node 3's next pointer with garbage
[*] Traversing list (will crash)...
Segmentation fault (core dumped)

$ gdb ./demo-crashes core.12345
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
...
Reading symbols from ./demo-crashes...
[New LWP 12345]
Core was generated by `./demo-crashes linked-list-corruption'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000555555555249 in traverse_list (head=0x5555555592a0) at demo-crashes.c:47
47          printf("Node value: %d\n", current->value);

(gdb) bt
#0  0x0000555555555249 in traverse_list (head=0x5555555592a0) at demo-crashes.c:47
#1  0x0000555555555456 in test_linked_list_corruption () at demo-crashes.c:89
#2  0x00005555555556b2 in main (argc=2, argv=0x7fffffffde68) at demo-crashes.c:142

(gdb) frame 0
#0  0x0000555555555249 in traverse_list (head=0x5555555592a0) at demo-crashes.c:47
47          printf("Node value: %d\n", current->value);

(gdb) print current
$1 = (struct Node *) 0xdeadbeef

(gdb) print *current
Cannot access memory at address 0xdeadbeef

(gdb) info locals
head = 0x5555555592a0
current = 0xdeadbeef
count = 3

(gdb) # The bug: after 3 iterations, current became 0xdeadbeef (our corruption)
(gdb) # Root cause: someone wrote 0xdeadbeef to node[2]->next

The Core Question You Are Answering

“The program crashed—but WHERE exactly, and with WHAT state?”

A crash report saying “Segmentation fault” tells you almost nothing. You need: (1) the exact line of code, (2) the call chain that led there, (3) the values of relevant variables, and (4) the state of memory at the crash point. GDB gives you all of this—if you know the commands.

Concepts You Must Understand First

  1. Debug Symbols (-g flag)
    • What information does -g embed in the binary?
    • How does DWARF format map addresses to source lines?
    • Why can you still get a backtrace without symbols, just without names?
    • Book Reference: “Practical Binary Analysis” by Andriesse — Ch. 2
  2. Stack Frames
    • What is stored in each stack frame (return address, saved registers, locals)?
    • How does GDB number frames (0 is current, higher is caller)?
    • What is the frame pointer (RBP) and how does it help?
    • Book Reference: “Computer Systems: A Programmer’s Perspective” by Bryant — Ch. 3.7
  3. GDB Core Commands
    • bt (backtrace) — shows call stack
    • frame N — select stack frame N
    • info locals — show local variables in current frame
    • info args — show function arguments
    • print <expr> — evaluate and print expression
    • Book Reference: “The Art of Debugging” by Matloff — Ch. 2-3

Questions to Guide Your Design

  1. Test Case Design
    • What crash scenarios will you create (NULL pointer, buffer overflow, use-after-free)?
    • How will you ensure the crash happens at a predictable location?
  2. Symbol Comparison
    • How will you demonstrate the difference between -g and no -g?
    • What about -g with optimization (-O2 -g)?
  3. Variable Inspection
    • How will you show variables at different stack depths?
    • What happens when you try to print an optimized-out variable?
  4. Documentation
    • What GDB commands will you document for your reference?
    • How will you create a “cheat sheet” for common scenarios?

Thinking Exercise

Trace the Stack

Given this call chain that crashes:

void foo(int *p) { *p = 42; }  // Crash here if p is NULL
void bar(int x) { if (x < 0) foo(NULL); else { int y = x; foo(&y); } }
void baz() { bar(-1); }
int main() { baz(); return 0; }

Draw the stack at crash time:

High addresses
┌─────────────────┐
│ main's frame    │ ← Frame 3
│  (return addr)  │
├─────────────────┤
│ baz's frame     │ ← Frame 2
│  (return addr)  │
├─────────────────┤
│ bar's frame     │ ← Frame 1
│  x = -1         │
│  (return addr)  │
├─────────────────┤
│ foo's frame     │ ← Frame 0 (current)
│  p = NULL       │
│  crash point    │
└─────────────────┘
Low addresses

Questions while tracing:

  • If you only see frame 0, how do you know the real bug is in bar()?
  • What would info args show in frame 1?
  • How would this look different without debug symbols?
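
A session for this example might look like the following transcript (the file name and addresses are hypothetical):

$ gcc -g -O0 -o nullcrash nullcrash.c && ./nullcrash
Segmentation fault (core dumped)

$ gdb -q ./nullcrash core
Program terminated with signal SIGSEGV, Segmentation fault.
#0  foo (p=0x0) at nullcrash.c:1
(gdb) bt
#0  foo (p=0x0) at nullcrash.c:1
#1  0x0000555555555160 in bar (x=-1) at nullcrash.c:2
#2  0x0000555555555178 in baz () at nullcrash.c:3
#3  0x0000555555555189 in main () at nullcrash.c:4
(gdb) frame 1
(gdb) info args
x = -1                     # the negative argument that steered bar() into foo(NULL)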

The Interview Questions They Will Ask

  1. “You have a core dump from a production crash. Walk me through your first 5 GDB commands.”
  2. “What is the difference between bt and bt full?”
  3. “How can you tell if a crash is a NULL pointer dereference vs a use-after-free from the backtrace?”
  4. “The backtrace shows ?? instead of function names. What does that mean and how do you fix it?”
  5. “How do you inspect a variable that GDB says is ‘optimized out’?”
  6. “What is the frame pointer, and why do some binaries omit it?”

Hints in Layers

Hint 1: Create Multiple Crash Types

Start with 3-4 distinct crash scenarios: NULL dereference, stack buffer overflow, heap corruption, double-free. Each teaches different debugging patterns.

Hint 2: Compile Two Versions

Always compile the same source twice: gcc -g -O0 (debug) and gcc -O2 (release). Compare the backtrace quality to understand why debug symbols matter.

Hint 3: Build a Command Reference

Essential GDB Commands for Core Dump Analysis:
- bt              : Show backtrace (call stack)
- bt full         : Backtrace with local variables
- frame N         : Switch to frame N
- up / down       : Move one frame up/down
- info locals     : Show local variables
- info args       : Show function arguments
- print VAR       : Print variable value
- print *PTR      : Dereference pointer
- print PTR@10    : Print array of 10 elements
- x/10x ADDR      : Examine 10 hex words at address
- list            : Show source around current line
- info registers  : Show CPU registers

Hint 4: Automate with GDB Batch Mode

# Run GDB commands non-interactively
gdb -batch -ex "bt" -ex "info locals" ./program core.123

# Save to file
gdb -batch -ex "bt full" ./program core.123 > crash-report.txt

Books That Will Help

Topic Book Chapter
GDB Basics “The Art of Debugging” by Matloff & Salzman Ch. 1-4
Stack Frames “CSAPP” by Bryant & O’Hallaron Ch. 3.7: Procedures
Debug Symbols “Practical Binary Analysis” by Andriesse Ch. 2: ELF Format
Memory Layout “CSAPP” by Bryant & O’Hallaron Ch. 9: Virtual Memory

Common Pitfalls and Debugging

Problem 1: “GDB says ‘no debugging symbols found’”

  • Why: The binary was compiled without -g, or symbols were stripped.
  • Fix: Recompile with gcc -g or locate a debug symbol package (e.g., -dbgsym packages on Ubuntu).
  • Quick test: file ./program should say “with debug_info” if symbols present

Problem 2: “Backtrace shows ‘??’ for function names”

  • Why: GDB can’t map addresses to symbols. Either wrong executable or missing symbols.
  • Fix: Ensure the executable matches the core dump (same build). Use info shared to check shared library symbols.
  • Quick test: readelf -s ./program | head -20 should show symbol table

Problem 3: “Variable shows as ‘optimized out’”

  • Why: Compiler optimized away the variable (stored in register, inlined, etc.)
  • Fix: Recompile with -O0 for debugging, or use info registers to find register values.
  • Quick test: gcc -g -O0 should preserve all variables

Problem 4: “Core file doesn’t match executable”

  • Why: The binary was rebuilt after the crash, changing addresses.
  • Fix: Keep the exact binary that crashed alongside the core dump. Use version control or preserve binaries with each release.
  • Quick test: file core.123 shows the original executable path

Definition of Done

  • Created at least 4 different crash scenarios (NULL, overflow, use-after-free, etc.)
  • Can load core dump in GDB and get full backtrace with symbols
  • Documented the difference in backtrace quality with and without -g
  • Can navigate frames and inspect variables at each level
  • Created a GDB command cheat sheet with examples
  • Can use GDB batch mode to generate crash reports non-interactively
  • Can explain what each line of a backtrace means

Project 3: The Memory Inspector — Deep State Examination

  • File: P03-memory-inspector-deep-state.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Memory Layout, Debugging, Forensics
  • Software or Tool: GDB, hexdump, /proc filesystem
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you will build: A memory forensics toolkit that goes beyond backtraces to examine heap state, corrupted data structures, and memory patterns. You will create programs with subtle memory bugs and use GDB’s memory examination commands to find root causes that backtraces alone can’t reveal.

Why it teaches crash dump analysis: Many crashes don’t happen at the bug—they happen later when corrupted data is used. A backtrace shows you where it died, but memory inspection shows you what went wrong. This is the difference between “it crashed in strcmp()” and “someone wrote past the end of the username buffer 200 lines earlier.”

Core challenges you will face:

  • Examining raw memory → Maps to GDB’s x command and format specifiers
  • Finding corruption patterns → Maps to recognizing freed memory, stack canaries, guard bytes
  • Tracing data structure state → Maps to following pointers and understanding layout
  • Correlating addresses with regions → Maps to stack vs heap vs data vs code identification

Real World Outcome

You will have a collection of memory forensics scenarios and the skills to investigate corruption that causes delayed crashes.

Example Output:

$ ./memory-mysteries heap-overflow
[*] Allocating two adjacent buffers
[*] Buffer A: 32 bytes for user input
[*] Buffer B: 32 bytes for privilege level (should be "user")
[*] Overwriting Buffer A with 48 bytes (oops, 16 byte overflow)
[*] Checking privilege level...
[!] Unexpected privilege: 'admin' (Buffer B was corrupted!)
[*] Attempting privileged operation...
Segmentation fault (core dumped)

$ gdb ./memory-mysteries core.5678
(gdb) bt
#0  0x00007ffff7c9b152 in __strcmp_avx2 () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x000055555555543a in check_operation (priv=0x5555555596d0 "") at memory-mysteries.c:89
#2  0x0000555555555678 in do_admin_thing () at memory-mysteries.c:134
...

(gdb) frame 1
#1  0x000055555555543a in check_operation (priv=0x5555555596d0 "") at memory-mysteries.c:89

(gdb) print priv
$1 = 0x5555555596d0 ""

(gdb) x/32xb 0x5555555596d0
0x5555555596d0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5555555596d8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5555555596e0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5555555596e8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

(gdb) # Buffer B is all zeros — was overwritten with null bytes from overflow!
(gdb) # Let's look at Buffer A (32 bytes before)
(gdb) x/48xb 0x5555555596d0-32
0x5555555596b0: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41  <- Buffer A "AAAA..."
0x5555555596b8: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0x5555555596c0: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0x5555555596c8: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41  <- End of A, start of B
0x5555555596d0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00  <- Buffer B (corrupted)
0x5555555596d8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

(gdb) # Aha! Buffer A is filled with 'A' (0x41), and the write ran past its end into Buffer B
(gdb) # The corrupted privilege string in B is what eventually reached strcmp() and crashed

The Core Question You Are Answering

“The crash site is just the symptom—where is the actual BUG?”

Most memory corruption bugs don’t crash immediately. A heap overflow corrupts an adjacent allocation that isn’t used until minutes later. A use-after-free might work 99 times because the memory hasn’t been reused. The backtrace tells you where the train derailed; memory inspection tells you where the track was sabotaged.

Concepts You Must Understand First

  1. Process Memory Layout
    • Where are stack, heap, data, and code segments?
    • How do you identify which region an address belongs to?
    • What do typical stack addresses vs heap addresses look like?
    • Book Reference: “CSAPP” by Bryant — Ch. 9.7-9.8
  2. GDB Memory Examination
    • x/NFU ADDR: N=count, F=format (x,d,c,s,i), U=unit (b,h,w,g)
    • Common patterns: x/20xb (20 hex bytes), x/s (string), x/10i (10 instructions)
    • How to examine memory relative to variables: x/16xb &buffer
    • Book Reference: “The Art of Debugging” by Matloff — Ch. 3
  3. Heap Internals Basics
    • How does malloc track allocations (size, flags in chunk headers)?
    • What patterns indicate freed memory (0xdeadbeef, 0xfeeefeee, etc.)?
    • What is the “red zone” and how does Address Sanitizer use it?
    • Book Reference: “Secure Coding in C and C++” by Seacord — Ch. 4
  4. Data Structure Layout
    • How does the compiler arrange struct fields in memory?
    • What is padding and alignment?
    • How do you correlate offsetof() with memory examination?
    • Book Reference: “CSAPP” by Bryant — Ch. 3.9

Questions to Guide Your Design

  1. Scenario Selection
    • What memory bugs will you demonstrate (heap overflow, use-after-free, stack buffer overflow, uninitialized memory)?
    • How will you make the corruption visible but not immediately fatal?
  2. Memory Patterns
    • What recognizable patterns will you use (0x41 for ‘A’, 0xDEADBEEF, specific strings)?
    • How will you demonstrate that the corruption pattern reveals the source?
  3. Region Identification
    • How will you teach identifying stack vs heap addresses?
    • Will you show how to use /proc/PID/maps or info proc mappings?
  4. Forensic Workflow
    • What systematic approach will you document for investigating corruption?
    • How do you trace back from corrupted memory to the corrupting code?

Thinking Exercise

Map the Memory Regions

Given this GDB output, identify each address’s region:

(gdb) print &argc
$1 = (int *) 0x7fffffffde04
(gdb) print buffer
$2 = 0x5555555592a0 "Hello"
(gdb) print main
$3 = {int (int, char **)} 0x555555555169
(gdb) print &global_counter
$4 = (int *) 0x555555558010

Questions:

  • Which address is on the stack? (Hint: 0x7fff… is high memory)
  • Which address is heap? (Hint: dynamically allocated, similar to code addresses)
  • Which is code? (Hint: same region as program counter)
  • Which is data segment? (Hint: near code but different section)

Typical x86-64 Linux layout:

0x7fff_xxxx_xxxx  Stack (grows down)
0x7f00_xxxx_xxxx  Shared libraries
...
0x5555_5555_9xxx  Heap (grows up from the program break, just above .bss)
0x5555_5555_8xxx  Data segment (.data, .bss)
0x5555_5555_5xxx  Code segment (.text)

The Interview Questions They Will Ask

  1. “How do you determine if a crash is caused by heap corruption vs stack corruption?”
  2. “Explain the GDB command x/20xb $rsp — what does each part mean?”
  3. “You see address 0xdeadbeef in a pointer. What does this typically indicate?”
  4. “How can you tell if memory was freed before being used?”
  5. “Walk me through debugging a crash where the backtrace shows the bug is in libc’s malloc.”
  6. “What is the difference between examining memory with print vs x in GDB?”

Hints in Layers

Hint 1: Start with Known Patterns

Fill buffers with recognizable patterns before corruption: memset(buf, 'A', size) or fill with sequential bytes. When you see these patterns where they shouldn’t be, you’ve found your corruption.

Hint 2: Compare Before and After

Create scenarios where you can examine memory state before and after the corruption. Use GDB breakpoints before the bug to capture “good” state, then compare with the core dump.

Hint 3: GDB Memory Examination Cheat Sheet

x/Nx ADDR    - N hex words (4 bytes each)
x/Nxb ADDR   - N hex bytes
x/Nxg ADDR   - N hex giant words (8 bytes)
x/Ns ADDR    - N strings
x/Ni ADDR    - N instructions
x/Nc ADDR    - N characters

Examples:
x/32xb &buffer     - 32 bytes of buffer
x/s $rdi           - String at first argument
x/10i $pc          - 10 instructions at program counter
x/8xg $rsp         - 8 stack slots (64-bit)

Hint 4: Use info Files for Memory Map

(gdb) info files
# Shows all loaded segments and their address ranges

(gdb) info proc mappings
# Shows /proc/PID/maps style output (if available from core)

(gdb) maintenance info sections
# Shows ELF sections with addresses

Books That Will Help

Topic Book Chapter
Memory Layout “CSAPP” by Bryant & O’Hallaron Ch. 9: Virtual Memory
GDB Memory Commands “The Art of Debugging” by Matloff Ch. 3: Inspecting Variables
Heap Internals “Secure Coding in C and C++” by Seacord Ch. 4: Dynamic Memory
Stack Frame Layout “CSAPP” by Bryant & O’Hallaron Ch. 3.7: Procedures
Binary Inspection “Practical Binary Analysis” by Andriesse Ch. 5: Binary Analysis Basics

Common Pitfalls and Debugging

Problem 1: “Memory shows all zeros but program used this data”

  • Why: Core dump may not include all memory pages (sparse dump). Or memory was freed.
  • Fix: Check whether systemd-coredump is truncating the dump (ProcessSizeMax and ExternalSizeMax in /etc/systemd/coredump.conf). Use coredumpctl info to check where the dump is stored and its size on disk.
  • Quick test: coredumpctl info <PID> | grep -i size, then compare with the process’s expected memory usage

Problem 2: “Can’t tell stack from heap addresses”

  • Why: Without context, addresses are just numbers. Need memory map.
  • Fix: Use info proc mappings in GDB or examine /proc/PID/maps from a live process.
  • Quick test: Stack addresses typically start with 0x7fff on x86-64 Linux

Problem 3: “Heap metadata looks corrupted”

  • Why: Heap overflow overwrote malloc’s bookkeeping, causing strange chunk sizes.
  • Fix: This is often the symptom, not the cause. Look for the buffer that overflowed into metadata.
  • Quick test: Look for ASCII patterns (like 0x41414141) in chunk headers

Problem 4: “Same crash, different memory contents each run”

  • Why: Address Space Layout Randomization (ASLR) changes addresses each run.
  • Fix: Disable ASLR for debugging: echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
  • Quick test: Re-enable after debugging! (echo 2 for full ASLR)

Definition of Done

  • Created at least 3 memory corruption scenarios (heap overflow, use-after-free, uninitialized)
  • Can identify memory region (stack/heap/data/code) from any address
  • Documented the x command with multiple format examples
  • Can trace corruption from crash site back to the bug location
  • Demonstrated finding a recognizable pattern in unexpected memory
  • Created a memory forensics workflow/checklist
  • Can explain heap chunk metadata and what corruption looks like

Project 4: Automated Crash Detective — GDB Scripting and Python API

  • File: P04-automated-crash-detective-gdb-scripting.md
  • Main Programming Language: Python
  • Alternative Programming Languages: GDB Command Scripts, Bash
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Automation, Scripting, Tooling
  • Software or Tool: GDB Python API, gdb.Command, batch mode
  • Main Book: “The Art of Debugging with GDB, DDD, and Eclipse” by Matloff & Salzman

What you will build: An automated crash analysis tool that takes a core dump and executable as input and produces a structured report with backtrace, local variables, thread state, and memory analysis—all without human interaction. You will learn GDB’s Python API to build reusable debugging automation.

Why it teaches crash dump analysis: Manual GDB debugging doesn’t scale. When you have 100 crashes per day, you need automation. This project teaches you to codify your debugging knowledge into scripts that run consistently and generate reports for further analysis or integration with crash aggregation systems.

Core challenges you will face:

  • Learning GDB’s Python API → Maps to gdb.Command, gdb.Frame, gdb.Value
  • Extracting structured data → Maps to parsing GDB output vs using API objects
  • Handling edge cases → Maps to corrupted frames, missing symbols, stripped binaries
  • Generating useful reports → Maps to what information developers actually need

Real World Outcome

You will have a Python-based crash analysis tool that generates JSON or Markdown reports from core dumps automatically.

Example Output:

$ ./crash-analyzer --core ./core.5678 --exe ./myserver --format json
{
  "timestamp": "2025-01-04T10:30:45Z",
  "executable": "/home/user/myserver",
  "signal": "SIGSEGV",
  "signal_code": 11,
  "crashing_thread": 1,
  "total_threads": 4,
  "backtrace": [
    {
      "frame": 0,
      "function": "process_request",
      "file": "server.c",
      "line": 234,
      "args": {"req": "0x5555555596d0", "len": "1024"},
      "locals": {"buffer": "0x7fffffffdd00", "i": "127"}
    },
    {
      "frame": 1,
      "function": "handle_client",
      "file": "server.c",
      "line": 189,
      "args": {"client_fd": "5"}
    },
    {
      "frame": 2,
      "function": "main",
      "file": "server.c",
      "line": 312
    }
  ],
  "registers": {
    "rip": "0x555555555abc",
    "rsp": "0x7fffffffdd00",
    "rbp": "0x7fffffffde10"
  },
  "analysis": {
    "crash_type": "null_pointer_dereference",
    "likely_cause": "req pointer was NULL at frame 0",
    "stack_corrupted": false
  }
}

$ ./crash-analyzer --core ./core.5678 --exe ./myserver --format markdown
# Crash Report: myserver

**Signal:** SIGSEGV (Segmentation Fault)
**Time:** 2025-01-04 10:30:45 UTC
**Threads:** 4 (crashed in thread 1)

## Backtrace

| Frame | Function | Location | Arguments |
|-------|----------|----------|-----------|
| 0 | process_request | server.c:234 | req=0x5555555596d0, len=1024 |
| 1 | handle_client | server.c:189 | client_fd=5 |
| 2 | main | server.c:312 | argc=1, argv=... |

## Analysis

**Likely Cause:** NULL pointer dereference at `req` parameter
**Recommendation:** Add null check before line 234

The Core Question You Are Answering

“How do I turn my manual debugging workflow into a repeatable, scriptable process?”

Every time you debug a crash, you run the same commands: bt, info locals, info threads. Why type them manually? A script can do it faster, more consistently, and produce structured output for downstream processing. This is the foundation of crash aggregation systems used by Google, Microsoft, and every major software company.

Concepts You Must Understand First

  1. GDB Batch Mode
    • How does gdb -batch -x script.gdb work?
    • What is the difference between -x (command file) and -ex (inline command)?
    • How do you capture output to a file?
    • Book Reference: “The Art of Debugging” by Matloff — Ch. 7
  2. GDB Python API Fundamentals
    • How do you load a Python script in GDB (source, python-interactive)?
    • What are gdb.Frame, gdb.Value, gdb.Type, gdb.Symbol?
    • How do you iterate stack frames programmatically?
    • Book Reference: GDB Manual, Python API chapter (online)
  3. Creating Custom GDB Commands
    • How does the gdb.Command class work?
    • How do you handle arguments in custom commands?
    • How do you output structured data (JSON) from GDB?
    • Book Reference: GDB Documentation — “Extending GDB with Python”
  4. Error Handling in GDB Scripts
    • What happens when gdb.parse_and_eval() fails?
    • How do you detect and handle missing debug symbols?
    • How do you skip corrupted frames?
    • Book Reference: Experience and experimentation

Questions to Guide Your Design

  1. Input/Output
    • What input formats will you support (core file + exe, coredumpctl)?
    • What output formats will you produce (JSON, Markdown, plain text)?
    • How will you handle command-line arguments?
  2. Information Extraction
    • What information is essential in every report (backtrace, signal, registers)?
    • What optional information adds value (locals, threads, memory)?
    • How deep should you go into memory inspection?
  3. Error Handling
    • What if the executable doesn’t match the core?
    • What if debug symbols are missing?
    • What if a stack frame is corrupted?
  4. Extensibility
    • How will others add new analyses?
    • Can you support plugins for different crash types?

Thinking Exercise

Design the API

Before coding, design the data model for crash information:

# What fields should a StackFrame have?
class StackFrame:
    frame_number: int
    function_name: str      # or "??" if unknown
    file_name: Optional[str]
    line_number: Optional[int]
    address: int
    arguments: Dict[str, str]   # name -> value as string
    locals: Dict[str, str]      # name -> value as string

# What fields should a CrashReport have?
class CrashReport:
    executable_path: str
    core_path: str
    timestamp: datetime
    signal: str              # "SIGSEGV"
    signal_number: int       # 11
    backtrace: List[StackFrame]
    registers: Dict[str, int]
    threads: List[ThreadInfo]
    analysis: AnalysisResult

Questions:

  • How do you get the signal name from the signal number?
  • What if a local variable’s value is “optimized out”?
  • How do you serialize gdb.Value to JSON?

The Interview Questions They Will Ask

  1. “How would you build a system to analyze 1000 crash dumps per day?”
  2. “What is the GDB Python API, and when would you use it over command scripts?”
  3. “How do you handle a crash dump where the executable was compiled without debug symbols?”
  4. “Design a crash fingerprinting algorithm—how would you identify if two crashes are the same bug?”
  5. “What information should a crash report contain for a developer to diagnose the issue?”
  6. “How would you integrate automated crash analysis with a CI/CD pipeline?”

Hints in Layers

Hint 1: Start with Batch Mode Before writing Python, get comfortable with GDB batch mode:

gdb -batch \
  -ex "file ./myprogram" \
  -ex "core-file ./core.123" \
  -ex "bt" \
  -ex "info threads"

Capture this output, then parse it (as a stepping stone to the Python API).

Hint 2: Basic Python Script Structure

import gdb
import json

class CrashAnalyzer(gdb.Command):
    """Analyze current core dump and output report."""

    def __init__(self):
        super().__init__("crash-analyze", gdb.COMMAND_USER)

    def invoke(self, arg, from_tty):
        report = {}

        # Get backtrace
        frame = gdb.newest_frame()
        frames = []
        while frame:
            frames.append(self.extract_frame_info(frame))
            frame = frame.older()
        report["backtrace"] = frames

        # Output as JSON
        print(json.dumps(report, indent=2))

    def extract_frame_info(self, frame):
        # ... implement frame extraction
        pass

CrashAnalyzer()  # Register the command

Hint 3: Handle Missing Symbols Gracefully

def get_function_name(frame):
    try:
        name = frame.name()
        return name if name else "??"
    except gdb.error:
        return "??"

def get_local_value(name, frame):
    try:
        val = frame.read_var(name)
        return str(val)
    except gdb.error as e:
        return f"<unavailable: {e}>"

Hint 4: Wrapper Script for Easy Invocation

#!/bin/bash
# Usage: crash-analyzer.sh CORE_FILE EXECUTABLE [FORMAT]
GDB_SCRIPT="$(dirname "$0")/crash_analyzer.py"
gdb -batch \
    -ex "source $GDB_SCRIPT" \
    -ex "file $2" \
    -ex "core-file $1" \
    -ex "crash-analyze $3"  # Pass format as arg

Books That Will Help

Topic Book Chapter
GDB Scripting “The Art of Debugging” by Matloff Ch. 7: Scripting
Python APIs GDB Manual (online) Python API Reference
JSON Processing Python Documentation json module
CLI Design “The Linux Command Line” by Shotts Ch. 25: Scripts

Common Pitfalls and Debugging

Problem 1: “gdb.error: No frame selected”

  • Why: You’re trying to access frame data before loading the core file.
  • Fix: Ensure the core-file command runs before your analysis script.
  • Quick test: Run commands interactively first to verify order

Problem 2: “Cannot convert gdb.Value to JSON”

  • Why: gdb.Value objects aren’t JSON-serializable; you need to convert to strings/ints.
  • Fix: Use str(value) or int(value) depending on type. Handle errors for complex types.
  • Quick test: print(type(val), str(val)) in your script

Problem 3: “Script works interactively but fails in batch mode”

  • Why: Batch mode may have different timing or output buffering.
  • Fix: Ensure all gdb commands complete before Python code runs. Use gdb.execute() with to_string=True to capture output.
  • Quick test: Add debug prints to track execution flow

Problem 4: “Missing symbols for system libraries”

  • Why: Debug symbols for libc, etc. are in separate packages.
  • Fix: Install debug symbol packages (libc6-dbg on Debian, glibc-debuginfo on Fedora).
  • Quick test: gdb -ex "info sharedlibrary" to see which libs lack symbols

Definition of Done

  • Script loads core dump and executable via command-line arguments
  • Produces JSON output with backtrace, signal, and thread info
  • Produces human-readable (Markdown) output as alternative
  • Handles missing debug symbols gracefully (shows addresses, not errors)
  • Handles multiple threads and identifies the crashing thread
  • Includes basic analysis (null pointer detection, crash type classification)
  • Works with coredumpctl integration (can extract core from systemd storage)
  • Documented installation and usage instructions

Project 5: Multi-threaded Mayhem — Analyzing Concurrent Crashes

  • File: P05-multi-threaded-mayhem-concurrent-crashes.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Concurrency, Thread Debugging, Race Conditions
  • Software or Tool: GDB thread commands, pthreads, helgrind
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you will build: A multi-threaded test suite that demonstrates various concurrency bugs (data races, deadlocks, thread-unsafe crashes) and the GDB techniques to diagnose them from core dumps. You’ll learn to navigate thread state in crashes where the symptom is in one thread but the cause is in another.

Why it teaches crash dump analysis: Single-threaded debugging is straightforward—follow the backtrace to the bug. Multi-threaded crashes are puzzles: Thread A crashes because Thread B corrupted shared data. Thread C is deadlocked waiting for a mutex Thread D holds. This project teaches you to think across threads and correlate state.

Core challenges you will face:

  • Navigating thread state in GDB → Maps to info threads, thread N, thread apply all
  • Understanding thread-specific crash context → Maps to which thread received the signal
  • Correlating data across threads → Maps to finding the “other thread” that caused corruption
  • Recognizing concurrency bug patterns → Maps to data races, deadlocks, use-after-free across threads

Real World Outcome

You will have a collection of multi-threaded crash scenarios and the skills to diagnose which thread caused the problem, not just which thread crashed.

Example Output:

$ ./thread-chaos race-condition
[*] Spawning 4 threads to increment shared counter
[*] Thread 0 starting...
[*] Thread 1 starting...
[*] Thread 2 starting...
[*] Thread 3 starting...
[*] Thread 1 corrupted shared data structure (intentionally)
[*] Thread 3 accessing corrupted data...
Segmentation fault (core dumped)

$ gdb ./thread-chaos core.9876
(gdb) info threads
  Id   Target Id                    Frame
* 1    Thread 0x7ffff7fb1000 (LWP 9876) 0x00007ffff7c9b152 in __strcmp_avx2 ()
  2    Thread 0x7ffff6fb0700 (LWP 9877) 0x00007ffff7ce2a8d in __lll_lock_wait ()
  3    Thread 0x7ffff67af700 (LWP 9878) 0x0000555555555420 in worker_thread ()
  4    Thread 0x7ffff5fae700 (LWP 9879) 0x00007ffff7d0b3bf in __GI___nanosleep ()

(gdb) # Thread 1 (LWP 9876) is the crashed thread - marked with *
(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7fb1000 (LWP 9876))]
#0  0x00007ffff7c9b152 in __strcmp_avx2 () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) bt
#0  0x00007ffff7c9b152 in __strcmp_avx2 ()
#1  0x00005555555554a3 in process_item (item=0xdeadbeef) at thread-chaos.c:78
#2  0x0000555555555523 in worker_thread (arg=0x0) at thread-chaos.c:95
#3  0x00007ffff7d8b6db in start_thread () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) # item=0xdeadbeef is suspicious - looks like corruption pattern
(gdb) # Let's check all threads' backtraces

(gdb) thread apply all bt

Thread 4 (Thread 0x7ffff5fae700 (LWP 9879)):
#0  0x00007ffff7d0b3bf in __GI___nanosleep ()
#1  0x0000555555555389 in sleepy_thread (arg=0x3) at thread-chaos.c:67

Thread 3 (Thread 0x7ffff67af700 (LWP 9878)):
#0  0x0000555555555420 in worker_thread (arg=0x2) at thread-chaos.c:92
#1  0x00007ffff7d8b6db in start_thread ()

Thread 2 (Thread 0x7ffff6fb0700 (LWP 9877)):
#0  0x00007ffff7ce2a8d in __lll_lock_wait ()
#1  0x00007ffff7ce5393 in pthread_mutex_lock ()
#2  0x0000555555555501 in corrupt_shared_data (arg=0x1) at thread-chaos.c:88
#3  0x00007ffff7d8b6db in start_thread ()

(gdb) # Thread 2 is in corrupt_shared_data! That's likely the culprit
(gdb) # Thread 1 crashed, but Thread 2's function name suggests it corrupted the data

The Core Question You Are Answering

“In a multi-threaded crash, which thread actually CAUSED the problem?”

The thread that receives SIGSEGV is often the victim, not the perpetrator. Thread A writes garbage to a shared pointer; Thread B later dereferences it and crashes. The backtrace for Thread B shows the crash in innocent code. You need to examine ALL threads to find the real bug.
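
A stripped-down version of that situation (a sketch; the thread names and the 0xDEADBEEF marker are illustrative) looks like this:

/* cross_thread_crash.c - sketch: the crashing thread is not the buggy thread
 * Build: gcc -g -pthread -o cross_thread_crash cross_thread_crash.c
 */
#define _GNU_SOURCE                      /* for pthread_setname_np */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static char *shared_ptr = "initially valid";

static void *corruptor(void *arg)
{
    pthread_setname_np(pthread_self(), "corruptor");
    sleep(1);
    shared_ptr = (char *)(uintptr_t)0xDEADBEEF;   /* BUG: plants a garbage pointer */
    return NULL;
}

static void *victim(void *arg)
{
    pthread_setname_np(pthread_self(), "victim");
    for (;;) {
        printf("%c\n", shared_ptr[0]);   /* crashes here once the pointer is garbage */
        usleep(100000);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, corruptor, NULL);
    pthread_create(&t2, NULL, victim, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);              /* never returns; the victim thread dies first */
    return 0;
}

In the core dump, the victim thread’s backtrace ends at the dereference, while thread apply all bt and the planted 0xDEADBEEF value point back at the corruptor.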

Concepts You Must Understand First

  1. Thread Representation in Core Dumps
    • How are threads captured (all threads, all registers)?
    • What is LWP (Light Weight Process)?
    • Which thread is marked as the “crashing thread”?
    • Book Reference: “The Linux Programming Interface” by Kerrisk — Ch. 29
  2. GDB Thread Commands
    • info threads — list all threads with current frame
    • thread N — switch to thread N
    • thread apply all CMD — run CMD on each thread
    • thread apply all bt — backtrace for all threads (most useful!)
    • Book Reference: “The Art of Debugging” by Matloff — Ch. 5
  3. Concurrency Bug Patterns
    • Data race: Two threads access same data, at least one writes, no synchronization
    • Deadlock: Circular wait on locks (hard to see in crash, easier in live debug)
    • Use-after-free across threads: Thread A frees, Thread B uses
    • Race on refcount: Thread A decrements to 0 and frees while Thread B still using
    • Book Reference: “The Linux Programming Interface” by Kerrisk — Ch. 30
  4. Mutex and Synchronization State
    • How do you examine mutex state in a core dump?
    • What does a “waiting on lock” frame look like?
    • How do you identify which thread holds a lock?
    • Book Reference: glibc/pthread internals documentation

Questions to Guide Your Design

  1. Scenario Selection
    • What concurrency bugs will you demonstrate (race condition, deadlock, cross-thread use-after-free)?
    • How will you make the crashes reproducible (or intentionally non-deterministic)?
  2. Thread Identification
    • How will you name or tag threads so they’re identifiable in GDB?
    • Will you use pthread_setname_np()?
  3. Evidence Planting
    • How will you make it clear which thread caused the problem (comments, patterns)?
    • What corruption patterns will help identify the source?
  4. Analysis Workflow
    • What systematic approach will you document for multi-threaded crash analysis?
    • How do you correlate state across threads?

Thinking Exercise

Map Thread Interactions

Given 4 threads accessing a shared linked list:

  • Thread 1: Reads nodes
  • Thread 2: Adds nodes
  • Thread 3: Removes nodes
  • Thread 4: Reads nodes

Questions:

  • If Thread 1 crashes dereferencing a freed node, which thread might have freed it?
  • If Thread 2 crashes while adding a node, could another thread be involved?
  • How would you use thread apply all bt to investigate?

Draw the interaction diagram:

Thread 1 (Read)     Thread 2 (Add)      Thread 3 (Remove)    Thread 4 (Read)
     |                   |                   |                    |
     |   ←── shared_list (mutex protected?) ──→                   |
     |                   |                   |                    |
   read(node)        add(new)           remove(node)          read(node)
     |                   |                   |                    |
     ↓                   ↓                   ↓                    ↓
 If node freed      Race with             Frees node        If node freed
 → SIGSEGV          remove?               |                 → SIGSEGV
                                          ↓
                                    If reader still
                                    using → crash

The Interview Questions They Will Ask

  1. “How do you determine which thread caused a crash in a multi-threaded program?”
  2. “What is a data race, and how would you detect one from a core dump?”
  3. “The backtrace shows the crash in a standard library function (strcmp). How do you find your bug?”
  4. “How do you examine mutex state in a core dump?”
  5. “Describe a scenario where the crashing thread is NOT where the bug is.”
  6. “What tools besides GDB help find concurrency bugs? (helgrind, tsan)”

Hints in Layers

Hint 1: Name Your Threads Use pthread_setname_np() to give threads meaningful names. GDB will show these in info threads:

pthread_setname_np(pthread_self(), "worker-1");

Hint 2: Use Recognizable Corruption Patterns When one thread intentionally corrupts data, use patterns that stand out:

// Corrupting thread:
shared_ptr = (void*)0xDEADBEEF;  // Obvious bad pointer

// Or fill with pattern:
memset(shared_buffer, 0x41, size);  // All 'A's

Hint 3: Thread Apply All Is Your Friend

(gdb) thread apply all bt         # All backtraces
(gdb) thread apply all bt full    # All backtraces with locals
(gdb) thread apply all print shared_var  # Check shared var in each thread

Hint 4: Look for Mutex Wait Patterns A thread blocked on a mutex will have a frame like:

#0  __lll_lock_wait () at lowlevellock.S:49
#1  pthread_mutex_lock () at pthread_mutex_lock.c:80

This tells you which thread is waiting. Use info threads to find who might be holding the lock.
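
If you want to produce that wait pattern on demand, the classic two-lock deadlock works (a sketch; lock and thread names are illustrative). Once both threads hang, capture a dump with gcore and run thread apply all bt:

/* deadlock_demo.c - sketch: two threads take two mutexes in opposite order
 * Build: gcc -g -pthread -o deadlock_demo deadlock_demo.c
 * Once it hangs, grab a dump: gcore $(pidof deadlock_demo)
 */
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *thread_one(void *arg)
{
    pthread_mutex_lock(&lock_a);
    sleep(1);                      /* give thread_two time to grab lock_b */
    pthread_mutex_lock(&lock_b);   /* blocks forever: __lll_lock_wait in bt */
    return NULL;
}

static void *thread_two(void *arg)
{
    pthread_mutex_lock(&lock_b);
    sleep(1);
    pthread_mutex_lock(&lock_a);   /* blocks forever: circular wait */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread_one, NULL);
    pthread_create(&t2, NULL, thread_two, NULL);
    pthread_join(t1, NULL);        /* never completes */
    pthread_join(t2, NULL);
    return 0;
}

Both worker threads show up parked in pthread_mutex_lock, and on glibc, print lock_a in each thread context reveals the __data.__owner field, which records the kernel TID of the holder.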

Books That Will Help

Topic Book Chapter
POSIX Threads “The Linux Programming Interface” by Kerrisk Ch. 29-30
Thread Debugging “The Art of Debugging” by Matloff Ch. 5: Debugging in Multi-threaded Environments
Race Conditions “C Programming: A Modern Approach” by King Ch. 19: Program Design
Lock-free Patterns “C++ Concurrency in Action” by Williams Ch. 5-6

Common Pitfalls and Debugging

Problem 1: “All threads show the same backtrace”

  • Why: Copy/paste error, or you’re looking at the wrong column in info threads.
  • Fix: Use thread apply all bt which clearly labels each thread’s backtrace.
  • Quick test: Count unique LWP numbers in info threads

Problem 2: “Can’t tell which thread holds the mutex”

  • Why: Mutex internals are opaque in most debuggers.
  • Fix: Look for the thread that’s NOT waiting on the mutex and has the lock in scope. Or examine pthread_mutex_t internals (glibc-specific).
  • Quick test: thread apply all print my_mutex to see state in each context

Problem 3: “Race condition doesn’t reproduce”

  • Why: Races are timing-dependent by nature.
  • Fix: Add sleeps, use multiple runs, or use sched_yield() to increase collision probability.
  • Quick test: Run in a loop: for i in {1..100}; do ./thread-chaos; done

Problem 4: “Thread that caused corruption has already exited”

  • Why: Threads can exit before the crash occurs.
  • Fix: Look for evidence in remaining threads or memory. The corruption pattern itself may indicate the source.
  • Quick test: Check thread count at crash time vs expected

Definition of Done

  • Created at least 3 multi-threaded crash scenarios (race, deadlock attempt, cross-thread UAF)
  • Can use info threads and thread apply all bt effectively
  • Can identify the crashing thread vs the culprit thread
  • Documented the thread analysis workflow
  • Threads have meaningful names visible in GDB
  • Demonstrated finding the “other thread” that caused a crash
  • Can explain how mutex wait patterns appear in backtraces

Project 6: The Stripped Binary Challenge — Debugging Without Symbols

  • File: P06-stripped-binary-debugging-without-symbols.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Assembly
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Reverse Engineering, Assembly, Binary Analysis
  • Software or Tool: GDB, objdump, readelf, IDA Free or Ghidra
  • Main Book: “Practical Binary Analysis” by Dennis Andriesse

What you will build: A debugging workflow for analyzing crashes in stripped binaries (no debug symbols, no function names). You’ll learn to use disassembly, register analysis, and memory patterns to understand crashes when the backtrace shows only hex addresses.

Why it teaches crash dump analysis: Production binaries are often stripped to reduce size and protect intellectual property. Third-party libraries and closed-source software never have your debug symbols. Security researchers analyze malware that’s intentionally obfuscated. This project teaches you to debug when you have nothing but the binary and the crash.

Core challenges you will face:

  • Reading disassembly → Maps to understanding x86-64 instructions
  • Correlating addresses to code → Maps to using objdump and readelf
  • Understanding calling conventions → Maps to finding function arguments in registers
  • Reconstructing context without symbols → Maps to pattern recognition in memory

Real World Outcome

You will be able to extract useful crash information from binaries with no debug symbols—a skill that distinguishes expert debuggers.

Example Output:

$ file ./mystery-server
./mystery-server: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked,
    interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, stripped

$ # No debug info! Let's crash it and analyze
$ ./mystery-server &
[1] 4567
$ curl http://localhost:8080/crash
curl: (52) Empty reply from server
$ # Server crashed

$ gdb ./mystery-server core.4567
(gdb) bt
#0  0x0000000000401a3c in ?? ()
#1  0x0000000000401b89 in ?? ()
#2  0x0000000000401df2 in ?? ()
#3  0x00000000004015a1 in ?? ()
#4  0x00007ffff7db3d90 in __libc_start_call_main () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) # All "??" - no symbols. But we can still analyze!

(gdb) x/10i 0x0000000000401a3c
   0x401a3c:    mov    (%rax),%edx         # Crashed here - dereferencing RAX
   0x401a3e:    add    %edx,%r12d
   0x401a41:    add    $0x8,%rax
   0x401a45:    cmp    %rax,%rbx
   ...

(gdb) info registers rax
rax            0x0                 0     # RAX is 0! NULL pointer dereference

(gdb) # Let's find what function we're in by looking at frame 1's call site
(gdb) x/5i 0x0000000000401b89-5
   0x401b84:    call   0x401a20           # Calling our crashing function
   0x401b89:    mov    %eax,%r13d         # Return point (frame 1's address)

(gdb) # So 0x401a20 is the start of the crashing function
(gdb) # Let's see what it does

(gdb) x/30i 0x401a20
   0x401a20:    push   %rbp
   0x401a21:    push   %rbx
   0x401a22:    push   %r12
   0x401a24:    mov    %rdi,%rbp         # First arg (pointer)
   0x401a27:    mov    %rsi,%rbx         # Second arg (count)
   ...
   0x401a3c:    mov    (%rax),%edx       # CRASH: rax derived from arg

The Core Question You Are Answering

“How do I debug a crash when the binary tells me NOTHING?”

Stripped binaries are the norm in production. When something crashes, you won’t have source lines, function names, or variable names. But you still have the CPU state, memory contents, and the binary itself. With assembly knowledge and systematic analysis, you can still find the bug.
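
For practice material, a small program like this (hypothetical; build it, strip it, and keep an unstripped copy to check your conclusions) crashes inside libc when handed a NULL argument, much like the exercise further down:

/* mystery.c - sketch of a practice crash target to build and strip yourself.
 * Build stripped:  gcc -O1 -o mystery mystery.c && strip mystery
 * Keep a debug copy: gcc -O1 -g -o mystery.debug mystery.c
 */
#include <stdio.h>
#include <string.h>

/* Per the SysV x86-64 calling convention, name arrives in RDI and len in RSI. */
__attribute__((noinline))
static size_t describe(const char *name, size_t len)
{
    return strlen(name) + len;      /* crashes inside libc strlen when name is NULL */
}

int main(int argc, char **argv)
{
    const char *name = (argc > 1) ? argv[1] : NULL;   /* BUG: NULL is never rejected */
    printf("score: %zu\n", describe(name, 10));
    return 0;
}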

Concepts You Must Understand First

  1. x86-64 Calling Convention
    • Arguments 1-6 are in: RDI, RSI, RDX, RCX, R8, R9
    • Return value is in RAX
    • Callee-saved: RBP, RBX, R12-R15
    • Stack grows down, RSP is stack pointer
    • Book Reference: “CSAPP” by Bryant — Ch. 3.7
  2. Basic x86-64 Instructions
    • mov (%rax), %edx — Load memory at address in RAX into EDX
    • call ADDR — Call function at ADDR
    • push/pop — Stack operations
    • cmp/je/jne — Comparison and conditional jumps
    • Book Reference: “CSAPP” by Bryant — Ch. 3
  3. ELF Structure Without Symbols
    • How to find function boundaries (look for push rbp; mov rsp, rbp)
    • Using .plt and .got for library function identification
    • String references can reveal function purpose
    • Book Reference: “Practical Binary Analysis” by Andriesse — Ch. 2-3
  4. GDB Disassembly Commands
    • x/Ni ADDR — Disassemble N instructions at ADDR
    • disassemble ADDR,+LEN — Disassemble range
    • info registers — Show all registers
    • x/s ADDR — Try to interpret memory as string
    • Book Reference: GDB Manual, Examining Memory

Questions to Guide Your Design

  1. Scenario Creation
    • How will you create a stripped binary that crashes in interesting ways?
    • Will you keep a non-stripped version for verification?
  2. Analysis Workflow
    • What systematic steps will you follow for stripped binary analysis?
    • How will you document the workflow for future reference?
  3. Tool Integration
    • Will you use objdump, readelf, or both?
    • Will you introduce a disassembler (Ghidra, IDA Free)?
  4. Pattern Recognition
    • What common patterns will you learn to recognize (function prologues, loops, string operations)?
    • How will you identify library calls?

Thinking Exercise

Decode the Crash

Given this GDB output from a stripped binary:

(gdb) bt
#0  0x0000000000401234 in ?? ()
#1  0x00007ffff7ca5678 in strlen () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x0000000000401567 in ?? ()

(gdb) info registers
rax            0x0                 0
rdi            0x0                 0

Questions:

  • Frame 1 is strlen from libc. What does this tell you about the crash?
  • RDI is 0, and strlen takes its argument in RDI. What happened?
  • The bug is likely in frame 2 (0x401567). Why?
  • What would you examine next?

Analysis:

1. strlen crashed because it received NULL (RDI=0)
2. Frame 2 called strlen with a NULL pointer
3. Next step: disassemble around 0x401567 to see why NULL was passed
4. Look for: mov $0x0,%rdi or a conditional that should have checked for NULL

The Interview Questions They Will Ask

  1. “You have a core dump from a stripped binary. Walk me through your analysis approach.”
  2. “What x86-64 registers contain function arguments 1, 2, and 3?”
  3. “How can you find function boundaries in a stripped binary?”
  4. “The crash is in libc’s malloc. How do you find the bug in your code?”
  5. “How do you identify what function a code block implements without symbols?”
  6. “What tools besides GDB help with stripped binary analysis?”

Hints in Layers

Hint 1: Find Function Starts Look for the standard function prologue:

push   %rbp
mov    %rsp,%rbp
sub    $0xNN,%rsp      # Allocate stack space

Every push %rbp is likely a function entry.

Hint 2: Use Cross-References Find who calls the crashing function:

(gdb) x/5i RETURN_ADDRESS-5
# Look for "call CRASHING_FUNC_ADDR"

Hint 3: Library Calls Are Labeled Even in stripped binaries, calls to libc functions go through PLT:

call   0x401030 <strlen@plt>

GDB shows these names because they’re in the dynamic symbol table.

Hint 4: String References Look for strings to understand purpose:

$ strings -t x ./mystery-server | grep -i error
  4a20 Error: invalid input
  4b30 Connection error

Then in GDB:

(gdb) x/s 0x404a20
0x404a20: "Error: invalid input"

Find code that references this address.

Books That Will Help

Topic Book Chapter
x86-64 Assembly “CSAPP” by Bryant & O’Hallaron Ch. 3: Machine-Level Representation
ELF Format “Practical Binary Analysis” by Andriesse Ch. 2: ELF Binary Format
Reverse Engineering “Practical Reverse Engineering” by Dang Ch. 1-2
GDB Advanced “The Art of Debugging” by Matloff Ch. 7: Advanced Topics

Common Pitfalls and Debugging

Problem 1: “Can’t tell where functions start/end”

  • Why: Without symbols, there are no boundaries marked.
  • Fix: Look for push %rbp (function start) and ret (function end). Also look for alignment padding (nop instructions).
  • Quick test: objdump -d ./binary | grep -E "(push.*%rbp|retq)"

Problem 2: “Register values don’t make sense”

  • Why: You’re looking at registers after the crash, not before. Some may be corrupted.
  • Fix: Focus on callee-saved registers (RBP, RBX, R12-R15) as they preserve values across calls.
  • Quick test: Verify stack integrity first

Problem 3: “Can’t find the string that caused the crash”

  • Why: String might be on the heap (not in binary) or already freed.
  • Fix: Examine memory at the pointer address. Check if it’s a valid address range.
  • Quick test: info proc mappings to see valid address ranges

Problem 4: “Disassembly looks like garbage”

  • Why: You might be disassembling data, not code. Or wrong address.
  • Fix: Use the addresses from the backtrace. They’re valid instruction pointers.
  • Quick test: x/1i $pc should always show a valid instruction

Definition of Done

  • Created a stripped binary that crashes in an interesting way
  • Can analyze the crash using only GDB and the binary (no source)
  • Documented the x86-64 calling convention and key registers
  • Can find function boundaries without symbols
  • Can identify what libc functions are being called
  • Can trace from crash site to the actual bug
  • Created a stripped binary analysis workflow/checklist
  • Verified findings against the (hidden) source code

Project 7: Minidump Parser — Understanding Compact Crash Formats

  • File: P07-minidump-parser-compact-crash-formats.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Binary Parsing, File Formats, Cross-Platform
  • Software or Tool: Breakpad, minidump_dump, custom parser
  • Main Book: “Practical Binary Analysis” by Dennis Andriesse

What you will build: A parser for Google Breakpad’s minidump format—a compact, cross-platform crash dump format used by Chrome, Firefox, and many other applications. You’ll learn to read the minidump specification and extract crash information from real-world minidump files.

Why it teaches crash dump analysis: Not all crash dumps are ELF core files. Breakpad minidumps are ubiquitous in commercial software because they’re small (kilobytes vs megabytes), cross-platform, and contain just enough information for diagnosis. Understanding this format exposes you to how the industry handles crash reporting at scale.

Core challenges you will face:

  • Reading a binary file format specification → Maps to understanding headers, streams, and directories
  • Parsing variable-length structures → Maps to handling strings, lists, and nested data
  • Extracting platform-specific data → Maps to handling x86-64 vs ARM vs Windows contexts
  • Correlating with symbol files → Maps to understanding .sym files and address resolution

Real World Outcome

You will have a working minidump parser that extracts crash information comparable to Google’s minidump_dump tool.

Example Output:

$ ./minidump-parser ./crash.dmp
=== Minidump Analysis: crash.dmp ===

Header:
  Signature: MDMP
  Version: 0xa793
  Stream Count: 12
  Timestamp: 2025-01-04 15:30:45 UTC

Crash Info:
  Exception Code: EXCEPTION_ACCESS_VIOLATION (0xc0000005)
  Crash Address: 0x00007ff6123456ab
  Crashing Thread: 0

Threads (4 total):
  Thread 0 (Crashed):
    Stack: 0x000000c8f9b00000 - 0x000000c8f9c00000
    Context:
      RIP: 0x00007ff6123456ab
      RSP: 0x000000c8f9bff8a0
      RBP: 0x000000c8f9bff8f0

  Thread 1:
    Stack: 0x000000c8f9400000 - 0x000000c8f9500000
    Context: [suspended]

Modules (15 loaded):
  0x00007ff612340000 - 0x00007ff612380000  myapp.exe
  0x00007ffc12340000 - 0x00007ffc12560000  ntdll.dll
  0x00007ffc10000000 - 0x00007ffc10200000  kernel32.dll
  ...

Memory Regions (5 captured):
  0x000000c8f9bff800 - 0x000000c8f9c00000 (2048 bytes) - Stack near crash
  ...

Stack Trace (unsymbolicated):
  #0  0x00007ff6123456ab  myapp.exe + 0x156ab
  #1  0x00007ff612345123  myapp.exe + 0x15123
  #2  0x00007ffc12345678  ntdll.dll + 0x5678

The Core Question You Are Answering

“How do crash reporting systems represent crashes in a portable, compact format?”

ELF core dumps are Linux-specific and huge. Minidumps solve this: they’re small, cross-platform, and contain just the information needed for diagnosis. Understanding this format teaches you how production crash reporting actually works at companies like Google, Mozilla, and Microsoft.

Concepts You Must Understand First

  1. Minidump Format Structure
    • Header: Signature, version, stream directory location
    • Stream directory: Array of (type, size, offset) entries
    • Streams: Thread list, module list, exception, memory, system info
    • Book Reference: MSDN Minidump File Format documentation
  2. Stream Types You’ll Parse
    • MINIDUMP_STREAM_TYPE: ThreadListStream, ModuleListStream, ExceptionStream
    • Memory streams: MemoryListStream, Memory64ListStream
    • Context stream: Thread context (registers) per architecture
    • Book Reference: Breakpad source code (src/google_breakpad/common/minidump_format.h)
  3. CPU Context Structures
    • Different for x86, x86-64, ARM, ARM64
    • Contains all registers at crash time
    • Must know architecture to interpret correctly
    • Book Reference: Processor-specific ABI documents
  4. Symbol Files (.sym)
    • Breakpad’s text-based symbol format
    • Maps addresses to function names and source lines
    • PUBLIC records for function starts, FUNC + line records for details
    • Book Reference: Breakpad symbol file format documentation

Questions to Guide Your Design

  1. Parsing Strategy
    • Will you read the entire file into memory or seek to each stream?
    • How will you handle endianness (minidumps from Windows are little-endian)?
  2. Platform Support
    • Will you support x86-64 contexts only, or multiple architectures?
    • How will you detect the architecture from the minidump?
  3. Output Format
    • What human-readable format will you produce?
    • Will you support JSON output for programmatic use?
  4. Symbol Integration
    • Will you support loading .sym files for symbolication?
    • How will you match modules to their symbol files?

Thinking Exercise

Map the Minidump Structure

Given a hex dump of a minidump header:

00000000: 4d44 4d50 93a7 0000 0c00 0000 2001 0000  MDMP........
00000010: 0000 0000 xxxx xxxx xxxx xxxx xxxx xxxx  ................

Questions:

  • What is the signature (first 4 bytes)?
  • What is the version (next 4 bytes)?
  • How many streams are in the directory?
  • Where is the stream directory located?

Decode:

Signature: "MDMP" (0x504d444d in little-endian)
Version: 0x0000a793
Stream Count: 12 (0x0000000c)
Stream Directory RVA: 0x00000120 (288 bytes into file)

The Interview Questions They Will Ask

  1. “What is a minidump, and how does it differ from a core dump?”
  2. “Walk me through the structure of a minidump file.”
  3. “How does Breakpad capture crash information without stopping the process?”
  4. “What information would you need to symbolicate a minidump stack trace?”
  5. “How would you design a crash reporting system for a mobile app?”
  6. “What are the privacy implications of crash dumps, and how do minidumps address them?”

Hints in Layers

Hint 1: Start with the Header The header is fixed-size and tells you where everything else is:

typedef struct {
    uint32_t signature;    // "MDMP"
    uint32_t version;
    uint32_t stream_count;
    uint32_t stream_directory_rva;  // Offset to directory
    uint32_t checksum;
    uint32_t timestamp;
    uint64_t flags;
} MINIDUMP_HEADER;

Hint 2: Parse the Stream Directory Each entry tells you the type, size, and location of a stream:

typedef struct {
    uint32_t stream_type;  // e.g., ThreadListStream = 3
    uint32_t size;
    uint32_t rva;          // Offset in file
} MINIDUMP_DIRECTORY;
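
Putting the two structures together, a parser skeleton might look like this (a sketch that assumes a little-endian host and that the structs map the on-disk layout byte for byte; real code should bounds-check every RVA):

/* minidump_walk.c - sketch: validate the header and list every stream */
#include <stdint.h>
#include <stdio.h>

typedef struct {                          /* MINIDUMP_HEADER, as in Hint 1 */
    uint32_t signature, version, stream_count, stream_directory_rva;
    uint32_t checksum, timestamp;
    uint64_t flags;
} MINIDUMP_HEADER;

typedef struct {                          /* MINIDUMP_DIRECTORY, as in Hint 2 */
    uint32_t stream_type, size, rva;
} MINIDUMP_DIRECTORY;

#define MDMP_SIGNATURE 0x504d444du        /* "MDMP" read as a little-endian uint32 */

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s crash.dmp\n", argv[0]); return 1; }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    MINIDUMP_HEADER hdr;
    if (fread(&hdr, sizeof hdr, 1, f) != 1 || hdr.signature != MDMP_SIGNATURE) {
        fprintf(stderr, "not a minidump\n");
        return 1;
    }
    printf("streams: %u, directory at 0x%x\n",
           hdr.stream_count, hdr.stream_directory_rva);

    fseek(f, hdr.stream_directory_rva, SEEK_SET);     /* RVAs count from file start */
    for (uint32_t i = 0; i < hdr.stream_count; i++) {
        MINIDUMP_DIRECTORY dir;
        if (fread(&dir, sizeof dir, 1, f) != 1)
            break;
        printf("  stream %2u: type=%u size=%u rva=0x%x\n",
               i, dir.stream_type, dir.size, dir.rva);
    }
    fclose(f);
    return 0;
}

Its output should line up with what minidump_dump prints for the same file (Hint 3).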

Hint 3: Use Existing Tools for Verification Breakpad’s minidump_dump tool shows you what to expect:

$ minidump_dump crash.dmp
# Compare your output with this

Hint 4: Handle Variable-Length Strings Minidumps use MINIDUMP_STRING for module names:

typedef struct {
    uint32_t length;      // In bytes, not including terminator
    uint16_t buffer[1];   // UTF-16LE string data
} MINIDUMP_STRING;

Read length bytes, convert from UTF-16LE to your preferred encoding.
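
For the common case where module paths are plain ASCII, dropping the high byte of each code unit is enough; a sketch (a real parser should handle non-ASCII text properly):

/* Sketch: copy a MINIDUMP_STRING's UTF-16LE buffer into a C string,
 * assuming the text is plain ASCII (high byte of every code unit is 0). */
#include <stddef.h>
#include <stdint.h>

static void utf16le_to_ascii(const uint16_t *in, uint32_t length_bytes,
                             char *out, size_t out_size)
{
    size_t n = length_bytes / 2;          /* number of UTF-16 code units */

    if (out_size == 0)
        return;
    if (n >= out_size)
        n = out_size - 1;                 /* truncate rather than overflow */
    for (size_t i = 0; i < n; i++)
        out[i] = (char)(in[i] & 0xff);    /* keep only the low byte */
    out[n] = '\0';
}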

Books That Will Help

Topic Book Chapter
Binary Parsing “Practical Binary Analysis” by Andriesse Ch. 2-3: Binary Formats
File I/O in C “The C Programming Language” by K&R Ch. 7: Input and Output
Endianness “CSAPP” by Bryant Ch. 2.1: Information Storage
Windows Internals “Windows Internals” by Russinovich Ch. 3: System Mechanisms

Common Pitfalls and Debugging

Problem 1: “Signature doesn’t match ‘MDMP’”

  • Why: You might be reading big-endian or the file isn’t a minidump.
  • Fix: Check byte order. Minidumps are little-endian. Verify with file command.
  • Quick test: xxd -l 16 crash.dmp should show “MDMP” as ASCII

Problem 2: “Stream RVA points to wrong location”

  • Why: RVA is relative to file start, not current position. Off-by-one error.
  • Fix: Seek to absolute position in file, not relative.
  • Quick test: Print RVAs and manually verify with hex editor

Problem 3: “Module names are garbled”

  • Why: Minidumps store strings as UTF-16LE, not ASCII.
  • Fix: Convert UTF-16LE to UTF-8. The length field is in bytes.
  • Quick test: Check if bytes alternate with 0x00 (ASCII in UTF-16)

Problem 4: “Context structure size doesn’t match”

  • Why: Context structure varies by CPU architecture.
  • Fix: Check SystemInfo stream for processor architecture, use correct struct.
  • Quick test: Print context size vs expected for the architecture

Definition of Done

  • Can read and validate the minidump header
  • Can enumerate all streams in the stream directory
  • Can parse ThreadListStream and show thread information
  • Can parse ModuleListStream and show loaded modules
  • Can parse ExceptionStream and show crash details
  • Can extract and display CPU context (registers) for crashing thread
  • Output matches minidump_dump tool for test files
  • Documented the minidump format as learned

Project 8: Kernel Panic Anatomy — Triggering and Capturing with kdump

  • File: P08-kernel-panic-kdump-capture.md
  • Main Programming Language: C (kernel module)
  • Alternative Programming Languages: Shell scripting for setup
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Kernel Development, System Recovery, Enterprise Linux
  • Software or Tool: kdump, kexec, crash, kernel module development
  • Main Book: “Linux Kernel Development” by Robert Love

What you will build: A controlled kernel panic environment using kdump—the industry-standard mechanism for capturing kernel crash dumps. You’ll write a simple kernel module that triggers a panic on demand and configure kdump to capture the vmcore for analysis.

Why it teaches crash dump analysis: User-space crashes are tame compared to kernel panics. When the kernel crashes, there’s no OS to save you—kdump uses a clever trick (kexec) to boot a minimal “capture kernel” that saves memory before rebooting. This is how enterprise Linux (RHEL, SUSE) handles kernel crashes, and understanding it is essential for systems engineers.

Core challenges you will face:

  • Configuring kdump correctly → Maps to kernel parameters, crashkernel reservation
  • Writing a kernel module → Maps to basic kernel development
  • Understanding kexec → Maps to the boot-into-capture-kernel mechanism
  • Handling the vmcore → Maps to where and how the dump is saved

Real World Outcome

You will have a working kdump setup in a VM that reliably captures kernel panics, along with a kernel module that triggers panics on demand for testing.

Example Output:

$ # On a VM configured with kdump
$ sudo kdumpctl status
kdump: Kdump is operational

$ cat /proc/cmdline
... crashkernel=256M ...

$ # Our panic-trigger module
$ ls /sys/kernel/panic_trigger/
trigger

$ # Trigger the panic (VM will crash and reboot)
$ echo 1 | sudo tee /sys/kernel/panic_trigger/trigger
[  123.456789] panic_trigger: Triggering kernel panic!
[  123.456790] Kernel panic - not syncing: Manually triggered panic
...

$ # After reboot, kdump captured the vmcore
$ ls /var/crash/127.0.0.1-2025-01-04-15:30:45/
vmcore  vmcore-dmesg.txt

$ # Check that it's valid
$ sudo crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
    /var/crash/127.0.0.1-2025-01-04-15:30:45/vmcore
crash> bt
PID: 1234   TASK: ffff8881234abcd0  CPU: 0   COMMAND: "tee"
 #0 [ffffc90001234e40] machine_kexec at ffffffff81060abc
 #1 [ffffc90001234e90] __crash_kexec at ffffffff811234de
 #2 [ffffc90001234f00] panic at ffffffff81890123
 #3 [ffffc90001234f80] panic_trigger_write at ffffffffc0001234 [panic_trigger]
 ...

The Core Question You Are Answering

“When the kernel itself crashes, how do you capture the state when there’s no OS to help?”

The kernel is the foundation—when it panics, everything stops. The ingenious solution is kdump: pre-load a second “capture kernel” into reserved memory, and when panic occurs, kexec immediately boots into it. This minimal kernel’s only job is to save the crashed kernel’s memory to disk before rebooting. It’s one of the most elegant debugging mechanisms in systems software.

Concepts You Must Understand First

  1. Kernel Panic Basics
    • What triggers a kernel panic (BUG(), NULL deref in kernel, deadlock)?
    • What information is printed to console?
    • Why can’t you just write to disk from the panicking kernel?
    • Book Reference: “Linux Kernel Development” by Love — Ch. 18: Debugging
  2. kexec and kdump Architecture
    • kexec: Boot a new kernel without going through BIOS
    • kdump: Use kexec to boot into a pre-loaded capture kernel
    • crashkernel parameter: Reserve memory for the capture kernel
    • Book Reference: kernel.org documentation on kdump
  3. Kernel Module Basics
    • Module loading with insmod/modprobe
    • init and exit functions
    • Sysfs interface for triggering actions
    • Book Reference: “Linux Device Drivers” by Corbet — Ch. 1-2
  4. vmcore Format
    • The memory dump created by kdump
    • ELF format with special notes
    • Contains all kernel memory at panic time
    • Book Reference: crash utility documentation

Questions to Guide Your Design

  1. VM Setup
    • What virtualization platform will you use (QEMU, VirtualBox, VMware)?
    • How much RAM should you reserve for crashkernel?
  2. kdump Configuration
    • Where should vmcores be saved (local disk, NFS, SSH)?
    • What kernel debug symbols do you need?
  3. Panic Module Design
    • What trigger mechanism (sysfs file, procfs, ioctl)?
    • Should it support different panic types (BUG, NULL, deadlock)?
  4. Safety Considerations
    • How will you ensure this only runs in a VM?
    • What warnings should the module print?

Thinking Exercise

Trace the kdump Boot Sequence

Map out what happens when you trigger a panic with kdump configured:

1. User writes to /sys/kernel/panic_trigger/trigger
2. Kernel module calls panic("...")
3. Kernel enters panic() function:
   - Disables interrupts, stops other CPUs
   - Prints panic message to console
   - Calls __crash_kexec() if kdump is configured
4. __crash_kexec():
   - Copies registers and memory info to pre-defined location
   - Jumps to the capture kernel (loaded at boot time)
5. Capture kernel boots:
   - Minimal initramfs with makedumpfile
   - Reads original kernel's memory from /proc/vmcore
   - Writes vmcore to /var/crash/
   - Reboots into normal kernel
6. After reboot:
   - vmcore is available for analysis with crash utility

Questions:

  • Why must the capture kernel be pre-loaded (not loaded at panic time)?
  • Why does crashkernel memory need to be reserved at boot?
  • What if the panic happens in the capture kernel?

The Interview Questions They Will Ask

  1. “A production server kernel panicked overnight. Walk me through your investigation process.”
  2. “What is kdump, and how does it differ from regular core dumps?”
  3. “How does kexec boot a new kernel without going through BIOS?”
  4. “What is the crashkernel boot parameter, and how do you determine the right value?”
  5. “What information do you need to analyze a vmcore?”
  6. “How would you configure kdump to send crash dumps to a remote server?”

Hints in Layers

Hint 1: Start with kdump Configuration Before writing modules, get kdump working with manual triggers:

# Fedora/RHEL
sudo dnf install kexec-tools crash kernel-debuginfo
sudo systemctl enable kdump
# Add crashkernel=256M to kernel command line
sudo reboot

# Test kdump is operational
sudo kdumpctl status

Hint 2: Simple Panic Module Skeleton

// panic_trigger.c
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/sysfs.h>
#include <linux/kobject.h>

static struct kobject *panic_kobj;

static ssize_t trigger_store(struct kobject *kobj,
    struct kobj_attribute *attr, const char *buf, size_t count)
{
    pr_alert("panic_trigger: Triggering kernel panic!\n");
    panic("Manually triggered panic from panic_trigger module");
    return count;  // Never reached
}

static struct kobj_attribute trigger_attr = __ATTR_WO(trigger);

static int __init panic_trigger_init(void) { /* ... */ }
static void __exit panic_trigger_exit(void) { /* ... */ }
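
The elided init and exit bodies only need to create and remove the sysfs node. A sketch, assuming the /sys/kernel/panic_trigger/trigger interface shown in the example output:

static int __init panic_trigger_init(void)
{
    int ret;

    /* Creates /sys/kernel/panic_trigger/ with a single write-only "trigger" file */
    panic_kobj = kobject_create_and_add("panic_trigger", kernel_kobj);
    if (!panic_kobj)
        return -ENOMEM;

    ret = sysfs_create_file(panic_kobj, &trigger_attr.attr);
    if (ret)
        kobject_put(panic_kobj);
    return ret;
}

static void __exit panic_trigger_exit(void)
{
    kobject_put(panic_kobj);
}

module_init(panic_trigger_init);
module_exit(panic_trigger_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Trigger a kernel panic on demand (VM testing only)");

Hint 4 further down adds a safety check at the top of this init function.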

Hint 3: Verify Debug Symbols The crash utility needs matching kernel debug symbols:

# Check kernel version
uname -r

# Install debug symbols
# Fedora: sudo dnf debuginfo-install kernel
# Ubuntu: sudo apt install linux-image-$(uname -r)-dbgsym  (needs the ddebs.ubuntu.com debug-symbol repository enabled)

# Verify
ls /usr/lib/debug/lib/modules/$(uname -r)/vmlinux

Hint 4: Test in VM Only! Add a safety check to your module:

/* x86-specific check: boot_cpu_has(X86_FEATURE_HYPERVISOR) needs <asm/cpufeature.h> */
static int __init panic_trigger_init(void)
{
    if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
        pr_err("panic_trigger: Refusing to load on bare metal!\n");
        return -EPERM;
    }
    // ...
}

Books That Will Help

Topic Book Chapter
Kernel Modules “Linux Device Drivers, 3rd Ed” by Corbet Ch. 1-2: Building Modules
Kernel Debugging “Linux Kernel Development” by Love Ch. 18: Debugging
Kernel Internals “Understanding the Linux Kernel” by Bovet Ch. 4: Interrupts
System Boot “How Linux Works” by Ward Ch. 5: How Linux Boots

Common Pitfalls and Debugging

Problem 1: “kdump service won’t start”

  • Why: crashkernel parameter missing or insufficient memory reserved.
  • Fix: Add crashkernel=256M (or more) to kernel command line in GRUB.
  • Quick test: cat /proc/cmdline | grep crashkernel

Problem 2: “Panic happens but no vmcore created”

  • Why: Capture kernel failed to boot or write. Check crash directory.
  • Fix: Check /var/crash/ for partial dumps or errors. Check serial console output.
  • Quick test: journalctl -b -1 | grep -i kdump (previous boot logs)

Problem 3: “crash says ‘vmcore and vmlinux do not match’”

  • Why: Debug symbols are for a different kernel version.
  • Fix: Install debug symbols for the exact kernel that crashed.
  • Quick test: grep "Linux version" vmcore-dmesg.txt and compare it with the kernel release your debuginfo package provides

Problem 4: “Module won’t load: ‘Unknown symbol’”

  • Why: Missing kernel headers or mismatched versions.
  • Fix: Install kernel-devel package for your running kernel.
  • Quick test: ls /lib/modules/$(uname -r)/build

Definition of Done

  • VM configured with kdump operational (kdumpctl status shows ready)
  • Kernel module created that triggers panic via sysfs
  • Successfully triggered panic and vmcore was captured
  • vmcore can be opened with crash utility
  • Can extract backtrace showing the panic call chain
  • Documented the complete kdump setup process
  • Module includes safety check to prevent loading on bare metal

Project 9: Analyzing Kernel Crashes with the crash Utility

  • File: P09-analyzing-kernel-crashes-crash-utility.md
  • Main Programming Language: crash commands (interactive)
  • Alternative Programming Languages: crash extensions in C
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Kernel Internals, Debugging, System Analysis
  • Software or Tool: crash utility, kernel debug symbols
  • Main Book: “Understanding the Linux Kernel” by Bovet & Cesati

What you will build: Expertise in using the crash utility to analyze kernel vmcores. You’ll investigate real kernel panics (from Project 8 and downloaded examples) and learn to navigate kernel data structures, examine process state, and identify root causes of kernel crashes.

Why it teaches crash dump analysis: The crash utility is to kernel debugging what GDB is to user-space. It’s the standard tool used by Red Hat, SUSE, and kernel developers worldwide. Mastering it opens doors to kernel debugging, enterprise support, and deep systems understanding.

Core challenges you will face:

  • Navigating crash’s command set → Maps to bt, task, vm, kmem, and dozens more
  • Understanding kernel data structures → Maps to task_struct, mm_struct, etc.
  • Correlating kernel state → Maps to finding what led to the panic
  • Reading kernel source alongside crash → Maps to effective debugging workflow

Real World Outcome

You will have a systematic workflow for analyzing kernel crashes using crash, with a command reference and documented investigation of real vmcores.

Example Output:

$ sudo crash /usr/lib/debug/lib/modules/5.15.0/vmlinux /var/crash/vmcore

crash 8.0.0
...
      KERNEL: /usr/lib/debug/lib/modules/5.15.0/vmlinux
    DUMPFILE: /var/crash/vmcore
        DATE: Sat Jan  4 15:30:45 UTC 2025
      UPTIME: 01:23:45
LOAD AVERAGE: 0.50, 0.35, 0.20
       TASKS: 256
    NODENAME: myserver
     RELEASE: 5.15.0-generic
     VERSION: #1 SMP
     MACHINE: x86_64
      MEMORY: 8 GB
       PANIC: "Kernel panic - not syncing: Manually triggered panic"

crash> bt
PID: 1234   TASK: ffff888123456780  CPU: 2   COMMAND: "tee"
 #0 [ffffc90001234000] machine_kexec at ffffffff81060abc
 #1 [ffffc90001234050] __crash_kexec at ffffffff81123def
 #2 [ffffc90001234100] panic at ffffffff818901ab
 #3 [ffffc90001234180] panic_trigger_write at ffffffffc0123456 [panic_trigger]
 #4 [ffffc900012341d0] kernfs_fop_write_iter at ffffffff8132abcd
 #5 [ffffc90001234250] vfs_write at ffffffff8128def0
 #6 [ffffc90001234290] ksys_write at ffffffff8128f123
 #7 [ffffc900012342d0] do_syscall_64 at ffffffff81890456
 #8 [ffffc90001234300] entry_SYSCALL_64_after_hwframe at ffffffff82000089

crash> task
PID: 1234   TASK: ffff888123456780  CPU: 2   COMMAND: "tee"
struct task_struct {
  state = 0x0,
  stack = 0xffffc90001230000,
  pid = 1234,
  tgid = 1234,
  comm = "tee",
  ...
}

crash> vm
PID: 1234   TASK: ffff888123456780  CPU: 2   COMMAND: "tee"
       MM               PGD          RSS    TOTAL_VM
ffff888112233440  ffff88810abcd000  2048k    12288k
      VMA           START       END     FLAGS FILE
ffff888100001000 7f0000000000 7f0000021000 8000875 /usr/bin/tee
...

crash> log | tail -20
[  123.456789] panic_trigger: Triggering kernel panic!
[  123.456790] Kernel panic - not syncing: Manually triggered panic

The Core Question You Are Answering

“How do I investigate a kernel crash when I have gigabytes of kernel memory and thousands of data structures?”

A vmcore contains everything: all processes, all memory, all kernel state. The challenge is navigation. The crash utility provides commands to examine specific data structures, follow pointers, and correlate state across subsystems. It’s like GDB but for the entire operating system state.

Concepts You Must Understand First

  1. crash Command Categories
    • Process commands: bt, task, ps, foreach
    • Memory commands: vm, kmem, rd, wr
    • Kernel state: log, timer, irq, runq
    • System info: sys, mach, net
    • Book Reference: crash whitepaper by Dave Anderson (Red Hat)
  2. Key Kernel Data Structures
    • task_struct: Process descriptor
    • mm_struct: Memory management
    • inode, dentry: Filesystem
    • sk_buff, socket: Networking
    • Book Reference: “Understanding the Linux Kernel” by Bovet — throughout
  3. Kernel Stack Traces
    • How to read kernel backtraces
    • Identifying the panic function
    • Finding the root cause vs symptom
    • Book Reference: “Linux Kernel Development” by Love — Ch. 18
  4. Using Source Code
    • Correlating crash output with kernel source
    • Using LXR or elixir.bootlin.com
    • Understanding inline functions and macros
    • Book Reference: Kernel source code itself

Questions to Guide Your Design

  1. Learning Path
    • What commands will you focus on first?
    • How will you practice each command?
  2. Sample Vmcores
    • Where will you get vmcores to practice with?
    • Will you create different crash types in Project 8?
  3. Documentation
    • What format will your crash command reference take?
    • How will you document your analysis workflow?
  4. Advanced Features
    • Will you explore crash extensions?
    • Will you write custom crash macros?

Thinking Exercise

Trace a NULL Pointer Panic

Given this crash backtrace:

crash> bt
 #0 page_fault at ffffffff81789abc
 #1 do_page_fault at ffffffff8101def0
 #2 async_page_fault at ffffffff82000123
 #3 my_driver_read at ffffffffc0123456 [my_driver]
 #4 vfs_read at ffffffff8128abc0
 #5 ksys_read at ffffffff8128bcd0
 #6 do_syscall_64 at ffffffff81890456

Questions:

  • What is the immediate cause of the panic?
  • Which frame contains your (or the third-party) code?
  • What commands would you use to investigate my_driver_read?

Investigation steps:

crash> dis my_driver_read            # Disassemble the function
crash> bt -f                         # Show frame addresses
crash> x/20x <frame_address>         # Examine stack at crash point
crash> task                          # What process triggered this?
crash> files                         # What files did it have open?

The Interview Questions They Will Ask

  1. “Walk me through analyzing a kernel panic using the crash utility.”
  2. “What does the ‘bt’ command show, and how do you interpret kernel stack traces?”
  3. “How do you find what process was running when the kernel panicked?”
  4. “A customer reports a production kernel panic. What information do you need?”
  5. “How do you examine the contents of a kernel data structure in crash?”
  6. “What is the difference between ‘rd’ and ‘struct’ commands in crash?”

Hints in Layers

Hint 1: Essential Commands to Learn First

bt          # Backtrace of current/specified task
log         # Kernel ring buffer (dmesg)
task        # Current task's task_struct
ps          # Process list
vm          # Virtual memory info
kmem -i     # Memory usage summary
sys         # System information

Hint 2: Investigating a Specific Task

crash> ps | grep suspicious_process
   1234      1   0  ffff888123456780  RU   0.5   myprocess

crash> set 1234                    # Set context to PID 1234
crash> bt                          # Backtrace of that process
crash> files                       # Open files
crash> vm                          # Memory maps
crash> task -R files,mm            # Dump selected task_struct members (-R takes a member list)

Hint 3: Examining Memory

crash> kmem -s                     # Slab cache info
crash> kmem <address>              # What does this address point to?
crash> rd <address> 64             # Read 64 bytes at address
crash> struct task_struct <addr>  # Interpret as task_struct
crash> list task_struct.tasks -H <head> # Walk a linked list

Hint 4: Using foreach

crash> foreach bt                  # Backtrace of ALL tasks
crash> foreach RU bt               # Backtrace of running tasks only
crash> foreach files               # Open files for all processes

Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| crash Basics | crash whitepaper by Dave Anderson | Entire document |
| Kernel Internals | “Understanding the Linux Kernel” by Bovet | Ch. 3: Processes |
| Memory Management | “Understanding the Linux Kernel” by Bovet | Ch. 8: Memory |
| Kernel Debugging | “Linux Kernel Development” by Love | Ch. 18: Debugging |

Common Pitfalls and Debugging

Problem 1: “crash says ‘cannot find vmlinux’”

  • Why: Debug symbols not installed or wrong path.
  • Fix: Install kernel debuginfo package. Use full path to vmlinux.
  • Quick test: find /usr/lib/debug -name "vmlinux*"

Problem 2: “Backtrace shows only addresses, no function names”

  • Why: Missing debug symbols or wrong vmlinux.
  • Fix: Ensure vmlinux matches the kernel that crashed exactly.
  • Quick test: crash -s vmlinux vmcore shows version mismatch warnings

Problem 3: “‘struct’ command shows garbage”

  • Why: Wrong address or data structure mismatch.
  • Fix: Verify the address with kmem <addr> first. Check kernel version.
  • Quick test: kmem -s to find valid slab addresses to practice with

Problem 4: “Can’t find the process that crashed”

  • Why: The crashing context might be interrupt or kernel thread.
  • Fix: Check bt output for COMMAND field. Use ps -k for kernel threads.
  • Quick test: bt shows PID and COMMAND at the top

Definition of Done

  • Can load a vmcore into crash and get basic info
  • Mastered bt, log, task, ps, vm commands
  • Can investigate which process/thread triggered a panic
  • Can examine kernel data structures with struct command
  • Can navigate memory with rd, kmem, and address interpretation
  • Created a crash command cheat sheet with examples
  • Analyzed at least 3 different types of kernel crashes
  • Documented a complete investigation workflow

Project 10: Centralized Crash Reporter — Production-Grade Infrastructure

  • File: P10-centralized-crash-reporter-infrastructure.md
  • Main Programming Language: Python (backend), C (collector)
  • Alternative Programming Languages: Go, Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: System Design, Distributed Systems, DevOps
  • Software or Tool: systemd-coredump, REST API, PostgreSQL, object storage
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you will build: A complete crash reporting infrastructure that collects crashes from multiple hosts, stores them centrally, generates analysis reports, fingerprints for deduplication, and provides a web interface for browsing crashes. This is a mini-version of Sentry, Crashlytics, or Mozilla’s crash-stats.

Why it teaches crash dump analysis: Everything you’ve learned comes together: core dump capture, GDB analysis, automation, fingerprinting. This project teaches you to think at scale: How do you handle 10,000 crashes per day? How do you identify the top 10 bugs? How do you integrate with alerting systems?

Core challenges you will face:

  • Reliable crash collection → Maps to handling partial dumps, network failures
  • Scalable storage → Maps to compression, retention, object storage
  • Fingerprinting and deduplication → Maps to grouping same bugs together
  • Analysis pipeline → Maps to automated report generation at scale
  • User interface → Maps to presenting crash data usefully

Real World Outcome

You will have a working crash reporting system that you can deploy for real services, demonstrating end-to-end understanding of production crash analysis.

Example Output:

Web Interface:

┌─────────────────────────────────────────────────────────────────┐
│ Crash Reporter Dashboard                           Last 24h ▼  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ Total Crashes: 847    Unique Bugs: 23    Hosts Affected: 12    │
│                                                                 │
│ Top Crashes (by occurrence)                                     │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ #1  NULL_DEREF_process_request_server.c:234   (423 crashes)│ │
│ │     First: Jan 3, 14:22  Last: Jan 4, 08:45                │ │
│ │     Hosts: web-01, web-02, web-03                          │ │
│ │     [View Stack] [View Crashes] [Mark Fixed]               │ │
│ ├─────────────────────────────────────────────────────────────┤ │
│ │ #2  SEGV_handle_upload_api.c:892             (198 crashes) │ │
│ │     First: Jan 4, 02:15  Last: Jan 4, 08:30                │ │
│ │     Hosts: api-01, api-02                                  │ │
│ │     [View Stack] [View Crashes] [Mark Fixed]               │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

API Output:

$ curl https://crash-reporter.internal/api/v1/crashes/latest
{
  "crashes": [
    {
      "id": "crash-2025-01-04-abc123",
      "fingerprint": "NULL_DEREF_process_request_server.c:234",
      "timestamp": "2025-01-04T08:45:23Z",
      "host": "web-01",
      "executable": "/opt/myapp/server",
      "signal": "SIGSEGV",
      "backtrace": [
        {"frame": 0, "function": "process_request", "file": "server.c", "line": 234},
        {"frame": 1, "function": "handle_client", "file": "server.c", "line": 189}
      ],
      "analysis": {
        "crash_type": "null_pointer_dereference",
        "variable": "req",
        "recommendation": "Add null check for req parameter"
      }
    }
  ],
  "total": 847,
  "page": 1,
  "per_page": 50
}

Collection Agent:

$ sudo crash-agent status
Crash Collection Agent v1.0.0
  Status: Running
  Server: https://crash-reporter.internal
  Collected today: 12 crashes
  Queue: 0 pending

$ # When a crash happens:
[2025-01-04 08:45:23] Detected new core dump: /var/lib/systemd/coredump/core.server.1234.xxx
[2025-01-04 08:45:24] Analyzing with GDB...
[2025-01-04 08:45:25] Generated report (fingerprint: NULL_DEREF_process_request_server.c:234)
[2025-01-04 08:45:26] Uploaded to server (crash-2025-01-04-abc123)
[2025-01-04 08:45:26] Core dump retained locally for 7 days

The Core Question You Are Answering

“How do you build a system that turns thousands of raw crashes into actionable insights?”

A crash dump is useless if it sits in /var/crash on one server. You need: collection (get dumps from everywhere), analysis (extract useful information), aggregation (group similar crashes), storage (keep them queryable), and presentation (show developers what matters). This is crash analysis at production scale.

Concepts You Must Understand First

  1. System Architecture
    • Collection agents on each host
    • Central server with API and storage (a minimal endpoint sketch follows this list)
    • Analysis workers (possibly distributed)
    • Web frontend for humans
    • Book Reference: “Designing Data-Intensive Applications” by Kleppmann — Ch. 1-2
  2. Crash Fingerprinting
    • What identifies a unique bug? (crash location, call stack, signal)
    • Hash functions for fingerprints
    • Handling variability (ASLR, different call paths to same bug)
    • Book Reference: Mozilla crash-stats documentation (online)
  3. Data Pipeline
    • Collection: Watch for new core dumps, extract metadata
    • Analysis: Run GDB, generate report, compute fingerprint
    • Storage: Metadata in database, cores in object storage
    • Book Reference: “Data Pipelines with Apache Airflow” patterns
  4. Scalability Considerations
    • Rate limiting (prevent DoS from crash loops)
    • Retention policies (can’t keep everything forever)
    • Sampling (for very high volume)
    • Book Reference: “Site Reliability Engineering” by Google
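
To make “central server with API and storage” concrete, here is a minimal sketch of a submission endpoint. It assumes Flask and the JSON report shape from the API example earlier; the route names, in-memory stores, and response fields are illustrative, not a prescribed design:

# minimal_server.py - illustrative sketch of the submission endpoint (Flask assumed;
# an in-memory store stands in for PostgreSQL).
from collections import defaultdict

from flask import Flask, jsonify, request

app = Flask(__name__)
crashes = []                    # would be the crashes table
bug_counts = defaultdict(int)   # would be the bugs table aggregates

@app.route("/api/v1/crashes", methods=["POST"])
def submit_crash():
    report = request.get_json(force=True, silent=True)
    # Minimal validation: a report needs a fingerprint and a backtrace.
    if not report or "fingerprint" not in report or "backtrace" not in report:
        return jsonify({"error": "missing fingerprint or backtrace"}), 400
    crashes.append(report)
    bug_counts[report["fingerprint"]] += 1
    return jsonify({"id": len(crashes),
                    "occurrences": bug_counts[report["fingerprint"]]}), 201

@app.route("/api/v1/bugs/top", methods=["GET"])
def top_bugs():
    # Rank fingerprints by occurrence count, as in the dashboard's "Top Crashes" view.
    ranked = sorted(bug_counts.items(), key=lambda kv: kv[1], reverse=True)
    return jsonify([{"fingerprint": f, "count": c} for f, c in ranked[:10]])

if __name__ == "__main__":
    app.run(port=8080)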

Questions to Guide Your Design

  1. Collection
    • How will agents discover new crashes (inotify, polling, coredumpctl)?
    • What happens if the network is down?
    • How do you handle crash loops (same bug crashing repeatedly)?
  2. Analysis
    • Will you analyze on the host or send raw cores to the server?
    • How will you handle missing debug symbols?
    • What timeout for analysis?
  3. Storage
    • What metadata goes in the database?
    • Where do you store actual core dumps (if at all)?
    • What’s your retention policy?
  4. Fingerprinting
    • What parts of the stack trace go into the fingerprint?
    • How do you handle different code paths to the same bug?
    • How do you detect when a bug is “fixed”?
  5. Interface
    • What views do users need (top bugs, recent crashes, specific host)?
    • How do you integrate with alerting (PagerDuty, Slack)?
    • Can developers mark bugs as “known” or “fixed”?

Thinking Exercise

Design the Data Flow

Trace a crash from occurrence to dashboard:

Host (web-01)                    Central Server                   User
    │                                  │                            │
    ├──[1. Crash occurs]              │                            │
    │   core.server.1234.xxx          │                            │
    │                                  │                            │
    ├──[2. Agent detects crash]       │                            │
    │   inotify on /var/crash         │                            │
    │                                  │                            │
    ├──[3. Agent analyzes locally]    │                            │
    │   GDB batch → JSON report       │                            │
    │                                  │                            │
    ├──[4. Agent POSTs to server]────►├──[5. Server receives]      │
    │   POST /api/v1/crashes          │   Validates, stores        │
    │   {report, fingerprint}         │                            │
    │                                  │                            │
    │                                  ├──[6. Fingerprint lookup]   │
    │                                  │   Existing bug or new?     │
    │                                  │                            │
    │                                  ├──[7. Update aggregates]    │
    │                                  │   Increment counter        │
    │                                  │   Update last_seen         │
    │                                  │                            │
    │                                  ├──[8. Trigger alerts]       │
    │                                  │   If new bug or spike      │
    │                                  │       │                    │
    │                                  │       └───────────────────►│
    │                                  │         Slack/PagerDuty    │
    │                                  │                            │
    │                                  │                 ◄──────────┤
    │                                  │            [9. User views  │
    │                                  │                dashboard]  │

Questions:

  • What if step 4 fails (network issue)? (a simple retry-queue sketch follows these questions)
  • What if the same crash happens 1000 times in 1 minute?
  • How do you handle crashes from different versions of the same software?
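
One common answer to the first question is a disk-backed queue on the agent: write the report to local disk first, upload opportunistically, and retry on the next pass. A minimal sketch, assuming the JSON report format above (the queue directory and function names are illustrative):

# retry_queue.py - sketch of a disk-backed upload queue for the agent
import json
import os
import uuid

import requests

QUEUE_DIR = "/var/lib/crash-agent/queue"   # illustrative path

def enqueue(report):
    """Persist the report to disk before attempting any network I/O."""
    os.makedirs(QUEUE_DIR, exist_ok=True)
    path = os.path.join(QUEUE_DIR, f"{uuid.uuid4()}.json")
    with open(path, "w") as f:
        json.dump(report, f)
    return path

def flush(server_url):
    """Upload every queued report; keep failures on disk for the next pass."""
    for name in sorted(os.listdir(QUEUE_DIR)):
        path = os.path.join(QUEUE_DIR, name)
        with open(path) as f:
            report = json.load(f)
        try:
            resp = requests.post(f"{server_url}/api/v1/crashes", json=report, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            break                          # network still down; retry on the next flush
        os.remove(path)                    # uploaded successfully; drop the local copy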

The Interview Questions They Will Ask

  1. “Design a crash reporting system for a company with 1000 servers.”
  2. “How would you implement crash fingerprinting to group similar crashes?”
  3. “What’s your retention strategy for crash dumps at scale?”
  4. “How do you handle a ‘crash storm’ where a bug causes thousands of crashes per minute?”
  5. “How would you integrate crash reporting with your deployment pipeline?”
  6. “What privacy concerns exist with crash dumps, and how do you address them?”

Hints in Layers

Hint 1: Start with the Agent

Build the collection agent first—it’s the foundation:

# crash_agent.py - skeleton
import subprocess
import requests
import json

def watch_for_crashes():
    # Use coredumpctl or inotify to detect new crashes
    pass

def analyze_crash(core_path, exe_path):
    # Run GDB batch mode, return JSON report
    pass

def compute_fingerprint(report):
    # Hash crash location + top N frames
    pass

def upload_crash(report, fingerprint):
    # POST to central server
    pass
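
One way to fill in the detection and analysis stubs is to query systemd-coredump and drive GDB in batch mode. The sketch below assumes systemd-coredump is in use and that matching debug symbols are installed locally; flags and output handling are illustrative:

# Possible stub implementations for the skeleton above.
import subprocess

def list_recent_crashes():
    # Ask systemd-coredump for recent entries; --no-legend drops the header row.
    out = subprocess.run(["coredumpctl", "list", "--no-legend"],
                         capture_output=True, text=True)
    return out.stdout.splitlines()

def analyze_crash(core_path, exe_path):
    # Drive GDB in batch mode; a real agent would parse this output into
    # structured frames (see Project 4) before fingerprinting.
    result = subprocess.run(
        ["gdb", "--batch", "-ex", "bt", "-ex", "info registers", exe_path, core_path],
        capture_output=True, text=True, timeout=60)
    return {"backtrace_text": result.stdout}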

Hint 2: Fingerprint Algorithm

A simple but effective fingerprint:

import hashlib

def compute_fingerprint(report):
    components = [
        report['signal'],
        report['backtrace'][0]['function'],  # Crash function
        report['backtrace'][0].get('file', 'unknown'),
        report['backtrace'][0].get('line', 0),
    ]
    # Add a few more frames for uniqueness
    for frame in report['backtrace'][1:4]:
        components.append(frame['function'])

    fingerprint_string = '|'.join(str(c) for c in components)
    return hashlib.sha256(fingerprint_string.encode()).hexdigest()[:16]
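
A quick sanity check, assuming the report shape used above: two reports that differ only in volatile details (host, timestamp) should hash to the same fingerprint.

report_a = {"signal": "SIGSEGV",
            "backtrace": [{"function": "process_request", "file": "server.c", "line": 234},
                          {"function": "handle_client"},
                          {"function": "main"}]}
report_b = dict(report_a, host="web-02", timestamp="2025-01-04T08:45:23Z")
assert compute_fingerprint(report_a) == compute_fingerprint(report_b)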

Hint 3: Database Schema

-- Create bugs first: crashes references it via a foreign key.
CREATE TABLE bugs (
    fingerprint VARCHAR(64) PRIMARY KEY,
    first_seen TIMESTAMP,
    last_seen TIMESTAMP,
    total_count INTEGER,
    status VARCHAR(32),  -- new, known, fixed
    title VARCHAR(255)   -- Human-readable summary
);

CREATE TABLE crashes (
    id UUID PRIMARY KEY,
    fingerprint VARCHAR(64),
    timestamp TIMESTAMP,
    host VARCHAR(255),
    executable VARCHAR(512),
    signal VARCHAR(32),
    report JSONB,  -- Full analysis report
    FOREIGN KEY (fingerprint) REFERENCES bugs(fingerprint)
);
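
With this schema, the dashboard’s “Top Crashes” view is a single aggregate query (a sketch, assuming PostgreSQL):

SELECT b.fingerprint, b.title, COUNT(c.id) AS crashes_24h,
       MIN(c.timestamp) AS first_seen, MAX(c.timestamp) AS last_seen
FROM crashes c
JOIN bugs b ON b.fingerprint = c.fingerprint
WHERE c.timestamp > NOW() - INTERVAL '24 hours'
GROUP BY b.fingerprint, b.title
ORDER BY crashes_24h DESC
LIMIT 10;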

Hint 4: Rate Limiting Crash Loops

import time
from collections import defaultdict

class CrashRateLimiter:
    def __init__(self, max_per_minute=10):
        self.max_per_minute = max_per_minute
        self.fingerprint_counts = defaultdict(list)

    def should_report(self, fingerprint):
        now = time.time()
        # Clean old entries
        self.fingerprint_counts[fingerprint] = [
            t for t in self.fingerprint_counts[fingerprint]
            if now - t < 60
        ]
        # Check limit
        if len(self.fingerprint_counts[fingerprint]) >= self.max_per_minute:
            return False
        self.fingerprint_counts[fingerprint].append(now)
        return True

Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| System Design | “Designing Data-Intensive Applications” by Kleppmann | Ch. 1-3, 10-11 |
| API Design | “REST API Design Rulebook” by Masse | Throughout |
| Database Design | “Database Internals” by Petrov | Ch. 1-2 |
| Operations | “Site Reliability Engineering” by Google | Ch. 4, 12 |

Common Pitfalls and Debugging

Problem 1: “Agent can’t keep up with crash rate”

  • Why: Analysis is slow, crashes queue up.
  • Fix: Implement rate limiting, skip duplicates within time window, async analysis.
  • Quick test: Monitor queue depth over time

Problem 2: “Fingerprints are too specific (every crash is ‘unique’)”

  • Why: Including too much variable data (addresses, PIDs).
  • Fix: Only use stable components (function names, file/line, signal). Strip addresses.
  • Quick test: Same bug should produce same fingerprint across runs

Problem 3: “Database grows too fast”

  • Why: Storing too much per crash, no retention.
  • Fix: Store summary in DB, full report in object storage. Implement TTL.
  • Quick test: Track DB size growth over time

Problem 4: “Can’t analyze crashes without debug symbols”

  • Why: Agents might not have symbols for all software.
  • Fix: Symbol server that agents can query. Fall back to address-only fingerprint.
  • Quick test: Test with stripped binary, verify graceful degradation

Definition of Done

  • Collection agent watches for and reports new crashes
  • Central server receives and stores crash reports
  • Fingerprinting correctly groups same bugs together
  • Web interface shows crash list, bug list, and statistics
  • API provides programmatic access to crash data
  • Rate limiting prevents crash loop DoS
  • Retention policy automatically removes old data
  • Alerting integration (Slack webhook or similar) works
  • Documentation covers deployment and configuration
  • System tested with simulated crash load

Project Comparison Table

| # | Project | Difficulty | Time | Depth of Understanding | Fun Factor | Real-World Value |
|---|---|---|---|---|---|---|
| 1 | The First Crash — Core Dump Generation | Level 1: Beginner | 4-6 hours | ★★☆☆☆ Foundation | ★★★☆☆ | Essential baseline |
| 2 | The GDB Backtrace — Extracting Crash Context | Level 1: Beginner | 6-8 hours | ★★★☆☆ Practical | ★★★★☆ | Daily debugging |
| 3 | The Memory Inspector — Deep State Examination | Level 2: Intermediate | 10-15 hours | ★★★★☆ Deep | ★★★★☆ | Real investigation |
| 4 | Automated Crash Detective — GDB Scripting | Level 2: Intermediate | 15-20 hours | ★★★★☆ Automation | ★★★★★ | CI/CD integration |
| 5 | Multi-threaded Mayhem — Concurrent Crashes | Level 3: Advanced | 20-30 hours | ★★★★★ Expert | ★★★☆☆ | Production systems |
| 6 | The Stripped Binary Challenge | Level 3: Advanced | 15-25 hours | ★★★★★ Expert | ★★★★★ | Security/Forensics |
| 7 | Minidump Parser — Compact Crash Formats | Level 2: Intermediate | 15-20 hours | ★★★★☆ Format Mastery | ★★★★☆ | Cross-platform |
| 8 | Kernel Panic Anatomy — kdump Configuration | Level 3: Advanced | 20-30 hours | ★★★★★ Kernel-Level | ★★★☆☆ | System reliability |
| 9 | Analyzing Kernel Crashes with crash | Level 3: Advanced | 20-30 hours | ★★★★★ Kernel-Level | ★★★★☆ | Kernel debugging |
| 10 | Centralized Crash Reporter | Level 3: Advanced | 40-60 hours | ★★★★★ Architecture | ★★★★★ | Production essential |

Time Investment Summary

| Learning Path | Total Time | Projects |
|---|---|---|
| Minimum Viable | 10-14 hours | Projects 1, 2 |
| Working Professional | 40-60 hours | Projects 1-4 |
| Expert Track | 100-150 hours | Projects 1-6 |
| Full Mastery | 160-240 hours | All 10 projects |

Recommendation

If You Are New to Crash Analysis

Start with Project 1: The First Crash

This is non-negotiable. You need to understand:

  • How core dumps are generated
  • Where they’re stored
  • What triggers their creation

Many developers have never seen a core dump because the core file size limit (ulimit -c) defaults to 0. Fix that first.

Then immediately do Project 2: The GDB Backtrace

This teaches you the 80% of crash debugging that covers 95% of real-world cases. Once you can read a backtrace and examine variables, you’re dangerous.

If You Already Debug with GDB

Start with Project 3: The Memory Inspector

Go deeper. Learn to examine arbitrary memory, understand heap corruption, and trace data structures. This separates the competent from the experts.

Then do Project 4: Automated Crash Detective

Automation pays dividends. A GDB Python script that runs on every crash in CI catches bugs before they hit production.

If You Work on Production Systems

Prioritize Projects 5 and 10

  • Project 5 (Multi-threaded Mayhem) teaches you to debug the crashes that only happen under load.
  • Project 10 (Centralized Crash Reporter) gives you visibility into crash patterns across your fleet.

If You’re a Kernel Developer or SRE

Projects 8 and 9 are essential

Kernel crashes are different. You need kdump configured, you need to know the crash utility, and you need to understand kernel data structures.

If You Work in Security or Forensics

Project 6: The Stripped Binary Challenge

Real-world malware and proprietary software rarely come with symbols. Learning to debug without them is a superpower.

For Most Developers:                For Kernel/Systems:
┌──────────────────────┐           ┌──────────────────────┐
│ Project 1 (Foundation)│           │ Project 1 (Foundation)│
└───────────┬──────────┘           └───────────┬──────────┘
            ▼                                   ▼
┌──────────────────────┐           ┌──────────────────────┐
│ Project 2 (GDB Basics)│          │ Project 2 (GDB Basics)│
└───────────┬──────────┘           └───────────┬──────────┘
            ▼                                   ▼
┌──────────────────────┐           ┌──────────────────────┐
│ Project 3 (Deep Mem) │           │ Project 8 (kdump)    │
└───────────┬──────────┘           └───────────┬──────────┘
            ▼                                   ▼
┌──────────────────────┐           ┌──────────────────────┐
│ Project 4 (Automation)│          │ Project 9 (crash)    │
└───────────┬──────────┘           └───────────┬──────────┘
            ▼                                   ▼
┌──────────────────────┐           ┌──────────────────────┐
│ Project 5 (Threading)│           │ Project 5 (Threading)│
└───────────┬──────────┘           └───────────┬──────────┘
            ▼                                   ▼
┌──────────────────────┐           ┌──────────────────────┐
│ Project 10 (Reporter)│           │ Project 10 (Reporter)│
└──────────────────────┘           └──────────────────────┘

Final Overall Project: Production Crash Forensics Platform

The Goal

Combine everything you’ve learned into a complete crash forensics platform that could be deployed in a real production environment.

System Name: CrashLens

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                           CrashLens Platform                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐         │
│  │ Collection      │    │ Collection      │    │ Collection      │         │
│  │ Agent (Host 1)  │    │ Agent (Host 2)  │    │ Agent (Host N)  │         │
│  │                 │    │                 │    │                 │         │
│  │ • systemd hook  │    │ • systemd hook  │    │ • systemd hook  │         │
│  │ • minidump gen  │    │ • minidump gen  │    │ • minidump gen  │         │
│  │ • symbol fetch  │    │ • symbol fetch  │    │ • symbol fetch  │         │
│  └────────┬────────┘    └────────┬────────┘    └────────┬────────┘         │
│           │                      │                      │                   │
│           └──────────────────────┼──────────────────────┘                   │
│                                  ▼                                          │
│  ┌───────────────────────────────────────────────────────────────────────┐ │
│  │                         Message Queue (Redis/RabbitMQ)                │ │
│  └───────────────────────────────────────────────────────────────────────┘ │
│                                  │                                          │
│           ┌──────────────────────┴───────────────────────┐                  │
│           ▼                                              ▼                  │
│  ┌─────────────────────────┐                ┌─────────────────────────┐    │
│  │    Analysis Workers     │                │    Symbol Server        │    │
│  │                         │◄───────────────│                         │    │
│  │ • GDB Python automation │    symbols     │ • debuginfod-compatible│    │
│  │ • Fingerprint generation│                │ • Build ID indexed     │    │
│  │ • Stack unwinding       │                │ • S3 backend           │    │
│  │ • Minidump parsing      │                │                         │    │
│  └───────────┬─────────────┘                └─────────────────────────┘    │
│              │                                                              │
│              ▼                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐ │
│  │                         PostgreSQL Database                           │ │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐      │ │
│  │  │ crashes    │  │ bugs       │  │ binaries   │  │ users      │      │ │
│  │  │            │  │            │  │            │  │            │      │ │
│  │  │ • metadata │  │ • fingerp. │  │ • build_id │  │ • API keys │      │ │
│  │  │ • frames   │  │ • status   │  │ • symbols  │  │ • teams    │      │ │
│  │  │ • memory   │  │ • issue_id │  │ • version  │  │ • alerts   │      │ │
│  │  └────────────┘  └────────────┘  └────────────┘  └────────────┘      │ │
│  └───────────────────────────────────────────────────────────────────────┘ │
│                                  │                                          │
│              ┌───────────────────┴───────────────────┐                      │
│              ▼                                       ▼                      │
│  ┌─────────────────────────┐            ┌─────────────────────────┐        │
│  │       REST API          │            │     Web Dashboard       │        │
│  │                         │            │                         │        │
│  │ • Crash submission      │            │ • Bug list view         │        │
│  │ • Query/search          │            │ • Crash timeline        │        │
│  │ • Bug management        │            │ • Stack trace viewer    │        │
│  │ • Integration webhooks  │            │ • Trend analysis        │        │
│  └─────────────────────────┘            │ • Issue tracker link    │        │
│                                         └─────────────────────────┘        │
└─────────────────────────────────────────────────────────────────────────────┘

Implementation Phases

Phase 1: Foundation (Projects 1, 2, 10)

  1. Implement basic collection agent using systemd-coredump hooks
  2. Create central database schema for crashes and bugs
  3. Build simple REST API for crash submission
  4. Implement basic fingerprinting from stack traces

Phase 2: Analysis Engine (Projects 3, 4, 6)

  1. Create GDB Python scripts for automated analysis
  2. Implement minidump generation for reduced storage
  3. Handle stripped binaries with graceful degradation
  4. Add memory region extraction for heap analysis

Phase 3: Multi-threading and Advanced (Project 5)

  1. Extend fingerprinting to handle concurrent crashes
  2. Detect and flag potential race conditions
  3. Implement deadlock detection in crash analysis
  4. Add thread state comparison tools

Phase 4: Kernel Support (Projects 8, 9)

  1. Add kdump collection support for kernel panics
  2. Implement crash utility integration for kernel analysis
  3. Create kernel-specific fingerprinting
  4. Handle vmcore files in analysis pipeline

Phase 5: Production Polish (Project 7)

  1. Implement Breakpad minidump format support
  2. Add cross-platform crash ingestion (Windows, macOS)
  3. Create symbol server with debuginfod compatibility
  4. Build trend analysis and anomaly detection

Success Criteria

Your CrashLens platform is complete when:

  • Collection: Agents automatically capture crashes from 10+ hosts
  • Analysis: Crashes are analyzed within 60 seconds of occurrence
  • Fingerprinting: Same bugs are correctly grouped (verify with synthetic crashes)
  • Symbols: Debug symbols are automatically fetched for known binaries
  • Dashboard: Web UI shows crash list, bug trends, and drill-down views
  • API: Integration with issue trackers (GitHub/Jira) works
  • Kernel: Kernel panics are captured and analyzed via kdump
  • Scale: System handles 1000+ crashes/day without degradation
  • Retention: Old data is automatically aged out per policy
  • Documentation: Deployment guide covers all components

Stretch Goals

  • Machine Learning: Classify crashes by root cause
  • Automated Bisect: Integrate with git to find regressing commits
  • Live Debugging: Connect to running process from dashboard
  • Distributed Tracing: Correlate crashes with request traces
  • Cost Analysis: Estimate business impact of each bug

From Learning to Production: What Is Next

After completing these projects, you understand crash dump analysis deeply. Here’s how your skills map to production tools and what gaps remain:

| Your Project | Production Equivalent | Gap to Fill |
|---|---|---|
| Project 1: Core Dump Generation | systemd-coredump, apport | Multi-distro configuration, storage policies |
| Project 2: GDB Backtrace | GDB, LLDB | IDE integration, remote debugging |
| Project 3: Memory Inspector | Valgrind, ASan, MSan | Runtime instrumentation, sanitizers |
| Project 4: GDB Scripting | Mozilla rr, Pernosco | Record/replay debugging |
| Project 5: Multi-threaded | Helgrind, ThreadSanitizer | Data race detection |
| Project 6: Stripped Binaries | Ghidra, IDA Pro, Binary Ninja | Full reverse engineering |
| Project 7: Minidump Parser | Breakpad, Crashpad | Client library integration |
| Project 8: kdump | Red Hat Crash, Oracle kdump | Enterprise kernel support |
| Project 9: crash Utility | crash + extensions | Custom crash plugins |
| Project 10: Crash Reporter | Sentry, Backtrace.io, Raygun | SaaS scale, ML classification |

Career Applications

Site Reliability Engineering (SRE)

  • Your Project 10 skills directly apply to building observability infrastructure
  • Project 8/9 kernel skills are essential for infrastructure debugging
  • Fingerprinting knowledge helps reduce alert fatigue

Security Engineering

  • Project 6 stripped binary skills are core to malware analysis
  • Memory examination skills from Project 3 help in exploit development
  • Core dump analysis is essential for incident response

Systems Programming

  • Every project contributes to building robust, debuggable systems
  • Understanding crash formats helps design better error handling
  • Automation skills from Project 4 integrate into CI/CD pipelines

Kernel Development

  • Projects 8/9 are essential prerequisites
  • Understanding of ELF format and memory layout is foundational
  • Crash analysis is a daily activity for kernel developers

Summary

This learning path covers Linux Crash Dump Analysis through 10 hands-on projects, taking you from basic core dump generation to building a production-grade crash reporting platform.

| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | The First Crash — Core Dump Generation | C | Level 1: Beginner | 4-6 hours |
| 2 | The GDB Backtrace — Extracting Crash Context | C | Level 1: Beginner | 6-8 hours |
| 3 | The Memory Inspector — Deep State Examination | C | Level 2: Intermediate | 10-15 hours |
| 4 | Automated Crash Detective — GDB Scripting | Python | Level 2: Intermediate | 15-20 hours |
| 5 | Multi-threaded Mayhem — Analyzing Concurrent Crashes | C | Level 3: Advanced | 20-30 hours |
| 6 | The Stripped Binary Challenge — Debugging Without Symbols | C/Assembly | Level 3: Advanced | 15-25 hours |
| 7 | Minidump Parser — Understanding Compact Crash Formats | C/Python | Level 2: Intermediate | 15-20 hours |
| 8 | Kernel Panic Anatomy — Triggering and Capturing with kdump | C | Level 3: Advanced | 20-30 hours |
| 9 | Analyzing Kernel Crashes with the crash Utility | C | Level 3: Advanced | 20-30 hours |
| 10 | Centralized Crash Reporter — Production-Grade Infrastructure | Python | Level 3: Advanced | 40-60 hours |

For Beginners: Start with Projects 1 → 2 → 3 → 4

For Intermediate Developers: Projects 1 → 2 → 4 → 5 → 10

For Kernel/Systems Engineers: Projects 1 → 2 → 8 → 9 → 5

For Security Professionals: Projects 1 → 2 → 3 → 6 → 7

Expected Outcomes

After completing these projects, you will:

  1. Generate and configure core dumps on any Linux system with confidence
  2. Navigate GDB fluently using commands like bt, frame, print, x, and info
  3. Examine memory in depth including heap structures, stack frames, and data corruption
  4. Automate crash analysis using GDB’s Python API for CI/CD integration
  5. Debug multi-threaded crashes including race conditions and deadlocks
  6. Analyze stripped binaries using assembly-level debugging techniques
  7. Parse minidump formats for compact, portable crash representation
  8. Configure and use kdump to capture kernel panics
  9. Navigate kernel crashes using the crash utility and kernel data structures
  10. Design and build production crash reporting infrastructure

The Deeper Understanding

Beyond the technical skills, you’ll understand:

  • Why crashes happen: Not just “null pointer,” but the architectural reasons software fails
  • What memory really is: A flat array of bytes with conventions layered on top
  • How debuggers work: They’re not magic—they read the same files you can read
  • Why symbols matter: And how to work without them when necessary
  • How to think forensically: Reconstructing what happened from incomplete evidence

You’ll have built 10+ working projects that demonstrate deep understanding of crash dump analysis from first principles.


Additional Resources and References

Standards and Specifications

Official Documentation

  • GDB Manual: https://sourceware.org/gdb/current/onlinedocs/gdb/
  • systemd-coredump: https://www.freedesktop.org/software/systemd/man/systemd-coredump.html
  • Linux Kernel kdump: https://www.kernel.org/doc/html/latest/admin-guide/kdump/kdump.html
  • crash Utility: https://crash-utility.github.io/

Books (Essential Reading)

| Book | Author | Why It Matters |
|---|---|---|
| Computer Systems: A Programmer’s Perspective | Bryant & O’Hallaron | Foundation for understanding memory, processes, and linking |
| The Linux Programming Interface | Michael Kerrisk | Comprehensive coverage of signals, process memory, and core dumps |
| Linux Kernel Development | Robert Love | Essential for kernel crash analysis |
| Understanding the Linux Kernel | Bovet & Cesati | Deep dive into kernel internals |
| Debugging with GDB | Richard Stallman et al. | Official GDB documentation in book form |
| Practical Binary Analysis | Dennis Andriesse | Reverse engineering and binary formats |
| Expert C Programming | Peter van der Linden | Deep C knowledge for understanding crashes |

Online Resources

  • Julia Evans’ Debugging Zines: https://wizardzines.com/ — Approachable visual guides
  • Brendan Gregg’s Blog: https://www.brendangregg.com/ — Performance and debugging
  • LWN.net Kernel Articles: https://lwn.net/ — In-depth kernel coverage
  • GDB Dashboard: https://github.com/cyrus-and/gdb-dashboard — Enhanced GDB interface

Tools Referenced

| Tool | Purpose | Installation |
|---|---|---|
| GDB | GNU Debugger | apt install gdb |
| LLDB | LLVM Debugger | apt install lldb |
| coredumpctl | systemd core dump management | Part of systemd |
| crash | Kernel crash analysis | apt install crash |
| Breakpad | Client crash reporting | Build from source |
| debuginfod | Symbol server | apt install debuginfod |
| objdump | Binary examination | Part of binutils |
| readelf | ELF file analysis | Part of binutils |
| addr2line | Address to source mapping | Part of binutils |

Community and Help

  • GDB Mailing List: https://sourceware.org/gdb/mailing-lists/
  • Stack Overflow Tags: [gdb], [core-dump], [crash-dump]
  • Reddit: r/linux, r/linuxadmin, r/ReverseEngineering
  • Linux Kernel Mailing List (LKML): For kernel crash questions

This learning path was designed to take you from zero knowledge to expert-level crash dump analysis skills. The projects are ordered to build on each other, with each one adding new concepts and reinforcing what came before. Complete them all, and you’ll have skills that set you apart in systems programming, site reliability engineering, security research, or kernel development.