Sprint: Linux Crash Dump Analysis Mastery - From Core Dumps to Kernel Panics
Goal: Master the art of post-mortem debugging in Linux. You will learn to analyze user-space core dumps with GDB, understand the ELF format that stores crash data, diagnose multi-threaded crashes, automate triage workflows, and ultimately dissect kernel panics with the crash utility. By the end, you will confidently find the root cause of any crash—from a simple segmentation fault to a full kernel panic.
Introduction
When a program crashes, it often leaves behind a core dump—a snapshot of the process’s memory and CPU state at the moment of death. This file is your crime scene evidence. Learning to analyze it transforms you from a developer who guesses at bugs into one who knows the root cause.
This guide covers user-space crash analysis (segfaults, memory corruption, multi-threaded races) and introduces kernel crash analysis (panics, oops, kdump). You will build 10 projects that progressively deepen your understanding.
What You Will Build
- Crash-generating programs - Controlled bugs that produce core dumps
- GDB analysis workflows - Manual and scripted backtrace extraction
- Memory inspection tools - Raw memory examination and corruption detection
- Automated triage scripts - Python/GDB automation for production systems
- Multi-threaded crash scenarios - Data race and deadlock analysis
- Stripped binary debugging - Working without debug symbols
- Minidump parser - Understanding Breakpad/Crashpad formats
- Kernel panic triggers - Writing buggy kernel modules safely in VMs
- Kernel crash analysis - Using crash on vmcore files
- Centralized crash reporter - A mini-Sentry for your infrastructure
Big Picture: The Crash Analysis Pipeline
┌─────────────────────────────────────────────────────────────────────────────┐
│ CRASH ANALYSIS PIPELINE │
└─────────────────────────────────────────────────────────────────────────────┘
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ APPLICATION │ │ KERNEL │ │ CORE DUMP │ │ ANALYSIS │
│ CRASH │────►│ HANDLER │────►│ FILE │────►│ TOOLS │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
│ │ │ │
│ │ │ │
▼ ▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ SIGSEGV │ │core_pattern│ │ ELF Format │ │ GDB │
│ SIGABRT │ │ ulimit -c │ │ PT_NOTE │ │ crash │
│ SIGFPE │ │ systemd- │ │ PT_LOAD │ │ minidump │
│ SIGBUS │ │ coredump │ │ Registers │ │ stackwalk │
└───────────┘ └───────────┘ └───────────┘ └───────────┘
USER-SPACE CRASHES: KERNEL CRASHES:
┌─────────────────────────────┐ ┌─────────────────────────────────────┐
│ Process → Signal → Core │ │ Panic → kexec → vmcore → crash │
│ GDB loads: executable + │ │ crash loads: vmlinux + vmcore │
│ core file + debug symbols │ │ Debug kernel symbols required │
└─────────────────────────────┘ └─────────────────────────────────────┘
Scope
In Scope:
- Linux user-space core dump analysis (GDB)
- ELF core dump format internals
- Multi-threaded crash debugging
- GDB Python scripting for automation
- Kernel crash dumps (kdump/vmcore basics)
- The crash utility for kernel analysis
- Minidump format (Breakpad/Crashpad)
Out of Scope:
- Windows crash dumps (WinDbg, minidumps on Windows)
- macOS crash reports
- Hardware debugging (JTAG, logic analyzers)
- Live debugging techniques (covered in other guides)
How to Use This Guide
Reading Strategy
-
Read the Theory Primer first - The concepts explained before the projects give you the mental model needed to understand why techniques work.
-
Follow the project order - Projects 1-3 build foundational skills. Projects 4-6 add intermediate complexity. Projects 7-10 tackle advanced topics.
-
Don’t skip the “Thinking Exercise” - These pre-project exercises build the mental models that make debugging intuitive.
-
Use the “Definition of Done” - Each project has explicit completion criteria. Don’t move on until you’ve hit them.
Workflow for Each Project
1. Read the project overview and "Core Question"
2. Complete the "Thinking Exercise"
3. Implement the project (use hints only when stuck)
4. Verify against "Real World Outcome" examples
5. Review the "Common Pitfalls" section
6. Complete the "Definition of Done" checklist
7. Attempt the interview questions
Time Investment
| Project Type | Time Estimate | Examples |
|---|---|---|
| Beginner | 4-8 hours | Projects 1-2 |
| Intermediate | 10-20 hours | Projects 3-6 |
| Advanced | 20-40 hours | Projects 7-9 |
| Capstone | 40+ hours | Project 10 |
Total Sprint Duration: 8-12 weeks at 10-15 hours/week
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
C Programming Fundamentals
- Pointers and memory addresses
- Stack vs. heap allocation
- Compiling with GCC (gcc -g -o program program.c)
- Basic understanding of segmentation faults
- Recommended Reading: “The C Programming Language” by Kernighan & Ritchie - Ch. 5-6
Linux Command Line
- Basic shell navigation and commands
- Understanding of processes and signals
- File permissions and sudo usage
- Recommended Reading: “The Linux Command Line” by William Shotts - Ch. 1-10
Basic GDB Usage
- Starting GDB (gdb ./program)
- Setting breakpoints (break main)
- Stepping through code (next, step)
- Printing variables (print x)
- Recommended Reading: “The Art of Debugging with GDB” by Matloff & Salzman - Ch. 1-2
Helpful But Not Required
Assembly Language Basics (learned during Projects 5-6)
- x86-64 register names (RAX, RBX, RSP, RIP)
- Basic instruction formats
- Recommended Reading: “Low-Level Programming” by Igor Zhirkov - Ch. 1-3
Linux Kernel Concepts (learned during Projects 8-9)
- Kernel modules and insmod/rmmod
- Kernel vs. user space
- Recommended Reading: “Linux Device Drivers, 3rd Edition” by Corbet et al. - Ch. 1-2
Python Scripting (needed for Projects 4, 7, 10)
- Basic Python syntax and file I/O
- subprocess module for running external commands
- struct module for binary parsing
Self-Assessment Questions
Answer “yes” to at least 5 of these before starting:
- Can you explain what happens when you dereference a NULL pointer in C?
- Do you know the difference between stack and heap memory?
- Can you compile a C program with debug symbols using GCC?
- Have you used GDB to set a breakpoint and step through code?
- Do you understand what a signal (like SIGSEGV) is in Linux?
- Can you write a basic bash script that runs a command and checks its exit code?
- Do you know what a process memory map looks like (/proc/[pid]/maps)?
Development Environment Setup
Required Tools:
| Tool | Version | Purpose | Installation |
|---|---|---|---|
| GCC | 9.0+ | Compiling with debug symbols | sudo apt install build-essential |
| GDB | 10.0+ | Core dump analysis | sudo apt install gdb |
| Bash | 4.0+ | Scripting | Pre-installed on Linux |
| Python | 3.8+ | Automation scripts | sudo apt install python3 |
Recommended Tools:
| Tool | Purpose | Installation |
|---|---|---|
| Valgrind | Memory error detection | sudo apt install valgrind |
| strace | System call tracing | sudo apt install strace |
| objdump | Binary disassembly | Part of binutils |
| readelf | ELF file inspection | Part of binutils |
| coredumpctl | systemd core dump management | Part of systemd |
For Kernel Projects (8-9):
| Tool | Purpose | Installation |
|---|---|---|
| QEMU/KVM | Virtual machine for safe testing | sudo apt install qemu-kvm |
| crash | Kernel dump analysis | sudo apt install crash |
| kernel-debuginfo | Debug symbols for kernel | Varies by distro |
Testing Your Setup:
# Verify GCC with debug symbols
$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
# Verify GDB
$ gdb --version
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
# Test core dump generation
$ ulimit -c unlimited
$ cat /proc/sys/kernel/core_pattern
|/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E
# Note: If using systemd-coredump, use coredumpctl instead
$ coredumpctl list
Important Reality Check
This is not easy material. Crash dump analysis requires understanding:
- How programs are represented in memory
- How the CPU executes instructions
- How the operating system manages processes
- How debugging tools reconstruct program state
Expect to feel confused initially. The projects are designed to make abstract concepts concrete through hands-on work. Trust the process.
Big Picture / Mental Model
The Layers of Crash Analysis
┌─────────────────────────────────────────────────────────────────────────────┐
│ CRASH ANALYSIS MENTAL MODEL │
└─────────────────────────────────────────────────────────────────────────────┘
LAYER 5: AUTOMATED SYSTEMS
┌─────────────────────────────────────────────────────────────────────────────┐
│ Crash Reporters (Sentry, Crashpad) │ CI/CD Integration │ Alerting │
│ Symbolication Servers │ Crash Deduplication │ Dashboards │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
LAYER 4: KERNEL CRASH ANALYSIS
┌─────────────────────────────────────────────────────────────────────────────┐
│ kdump (capture mechanism) │ vmcore (kernel memory image) │ crash tool │
│ kexec (boot into capture) │ vmlinux (debug symbols) │ dmesg/log │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
LAYER 3: ADVANCED USER-SPACE ANALYSIS
┌─────────────────────────────────────────────────────────────────────────────┐
│ Multi-threaded debugging │ Stripped binaries │ Memory corruption │
│ GDB Python scripting │ Disassembly │ Address-to-symbol │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
LAYER 2: BASIC USER-SPACE ANALYSIS
┌─────────────────────────────────────────────────────────────────────────────┐
│ GDB backtrace (bt) │ Variable inspection (print) │ Memory exam │
│ Stack frames (frame N) │ Register state (info reg) │ x/Nfx addr │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
LAYER 1: CORE DUMP FUNDAMENTALS
┌─────────────────────────────────────────────────────────────────────────────┐
│ ELF core format │ Signal handling │ ulimit/core_pattern │
│ PT_NOTE (metadata) │ Memory segments │ systemd-coredump │
│ PT_LOAD (memory snapshot) │ Register values │ Debug symbols (-g) │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────┐
│ FOUNDATION: C Memory Model │
│ Stack, Heap, Code, Data │
│ Pointers, Addresses, Segments │
└─────────────────────────────────┘
The Core Dump Data Flow
RUNNING PROCESS CRASH EVENT ANALYSIS
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ │ │ │ │ │
│ Code (.text) │ │ Signal raised │ │ GDB loads: │
│ Data (.data) │ ─────────► │ (SIGSEGV, │ ─────────► │ - executable │
│ BSS (.bss) │ │ SIGABRT...) │ │ - core file │
│ Heap │ │ │ │ - symbols │
│ Stack │ │ Kernel writes │ │ │
│ Registers │ │ core dump │ │ You inspect: │
│ │ │ │ │ - backtrace │
└────────────────┘ └────────────────┘ │ - variables │
│ - memory │
Core file contains: │ - registers │
┌────────────────┐ │ │
│ ELF Header │ └────────────────┘
│ PT_NOTE: │
│ - prstatus │
│ - prpsinfo │
│ - auxv │
│ - files │
│ PT_LOAD: │
│ - memory │
│ segments │
└────────────────┘
Theory Primer
This section provides the conceptual foundation for crash dump analysis. Read these chapters before starting the projects—they give you the mental models that make debugging intuitive.
Chapter 1: Core Dumps Fundamentals
Fundamentals
A core dump (or “core file”) is a snapshot of a process’s memory and CPU state at the moment it terminated abnormally. The name comes from early computing when memory was made of magnetic “cores.” Today, a core dump captures everything needed to reconstruct the state of a crashed program: its memory contents, register values, open file descriptors, and more.
When a process receives a signal that causes it to terminate (like SIGSEGV for segmentation faults), the Linux kernel can save this snapshot to a file. This file is your primary evidence for post-mortem debugging—analyzing what went wrong after the crash occurred.
Core dumps are essential because:
- Crashes may not be reproducible - Race conditions, timing issues, or specific input combinations may be difficult to recreate
- Production debugging is limited - You can’t attach a debugger to production systems easily
- The crash context is preserved - You see the exact state at the moment of failure
Deep Dive
The core dump mechanism in Linux is controlled by several system settings and kernel parameters. Understanding these is crucial for both generating and analyzing dumps.
Signal Handling and Core Generation
When a process performs an illegal operation (like dereferencing a NULL pointer), the CPU raises an exception. The kernel translates this into a signal delivered to the process. The default action for certain signals is to terminate the process and generate a core dump:
| Signal | Number | Description | Default Action |
|---|---|---|---|
| SIGQUIT | 3 | Quit from keyboard (Ctrl+\) | Core dump |
| SIGILL | 4 | Illegal instruction | Core dump |
| SIGABRT | 6 | Abort signal (from abort()) | Core dump |
| SIGFPE | 8 | Floating-point exception | Core dump |
| SIGSEGV | 11 | Segmentation fault | Core dump |
| SIGBUS | 7 | Bus error (bad memory access) | Core dump |
| SIGSYS | 31 | Bad system call | Core dump |
| SIGTRAP | 5 | Trace/breakpoint trap | Core dump |
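To exercise any of these without writing a separate bug for each, a tiny signal-raising program is enough. This is a minimal sketch (the file name signal_demo.c and the argument handling are arbitrary); raise() delivers the chosen signal to the calling process, so the default action listed above (terminate and dump core) applies:
// signal_demo.c - raise a chosen core-dumping signal to test your configuration
#include <signal.h>
#include <string.h>
int main(int argc, char **argv) {
    int sig = SIGABRT;                                  // default: the same signal abort() raises
    if (argc > 1 && strcmp(argv[1], "segv") == 0) sig = SIGSEGV;
    if (argc > 1 && strcmp(argv[1], "fpe") == 0)  sig = SIGFPE;
    if (argc > 1 && strcmp(argv[1], "ill") == 0)  sig = SIGILL;
    raise(sig);                                         // default disposition: terminate + core dump
    return 0;                                           // not reached for the signals above
}
Compile with gcc -g -o signal_demo signal_demo.c, run ulimit -c unlimited, and each invocation should leave a core dump behind.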
Core Dump Size Limits (ulimit)
The shell’s resource limits control whether core dumps are created. The ulimit -c command shows the maximum core file size in blocks (the block size is shell-dependent). A value of 0 (common default) means no core dumps are created.
# Check current limit
$ ulimit -c
0
# Enable unlimited core dumps for this session
$ ulimit -c unlimited
# Enable for all users permanently (in /etc/security/limits.conf)
* soft core unlimited
* hard core unlimited
Core Pattern Configuration
The /proc/sys/kernel/core_pattern file determines where core dumps are written and how they’re named. This can be:
- A file path pattern - Specifiers like %p (PID), %e (executable name), and %t (timestamp) are replaced
- A pipe to a program - A pattern starting with | pipes the dump to a handler (like systemd-coredump or apport)
# Traditional file-based pattern
$ echo "core.%e.%p.%t" > /proc/sys/kernel/core_pattern
# Creates: core.myprogram.1234.1703097600
# Modern systemd-based handling
$ cat /proc/sys/kernel/core_pattern
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
Modern Core Dump Management: systemd-coredump
Most modern Linux distributions use systemd-coredump to manage core dumps. It provides:
- Automatic compression and storage in /var/lib/systemd/coredump/
- Journal integration for metadata
- Automatic cleanup of old dumps
- The coredumpctl tool for listing and debugging
# List recent core dumps
$ coredumpctl list
TIME PID UID GID SIG COREFILE EXE
Fri 2024-12-20 10:30:15 EST 1234 1000 1000 SIGSEGV present /usr/bin/myapp
# Debug a specific crash
$ coredumpctl debug myapp
# Export the core file
$ coredumpctl dump myapp -o core.myapp
Core Dump Security Considerations
Core dumps can contain sensitive data:
- Environment variables (potentially including secrets)
- Memory contents (passwords, API keys, personal data)
- File contents that were being processed
This is why many systems disable core dumps by default and why GDPR compliance may require careful handling. Configure Storage=none in /etc/systemd/coredump.conf if you only want journal metadata without the actual memory dump.
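A minimal sketch of that setting (the key names come from coredump.conf; defaults vary by distribution):
# /etc/systemd/coredump.conf (excerpt)
[Coredump]
Storage=none        # log crash metadata to the journal, but do not keep the memory dump
ProcessSizeMax=0    # optionally skip processing the dump contents entirely
systemd-coredump reads its configuration when it handles a crash, so the change applies to the next dump without restarting anything.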
How This Fits in Projects
- Project 1: You’ll configure core dump generation and verify the system creates dump files
- Project 2: You’ll load core dumps into GDB and extract backtraces
- Project 10: You’ll build a system that automatically captures and processes core dumps
Mental Model Diagram
CORE DUMP GENERATION FLOW
┌──────────────────────────────────────────────────────────────────────────────┐
│ │
│ 1. ILLEGAL OPERATION 2. CPU EXCEPTION 3. KERNEL SIGNAL │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ int *p = NULL; │ │ #PF (Page │ │ Signal 11 │ │
│ │ *p = 42; │ ───► │ Fault) raised │──►│ (SIGSEGV) │ │
│ │ // BOOM! │ │ by CPU │ │ delivered │ │
│ └─────────────────┘ └─────────────────┘ └────────┬────────┘ │
│ │ │
│ 4. CORE PATTERN CHECK 5. DUMP CREATION 6. FILE WRITTEN │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /proc/sys/ │ │ For each │ │ core.myapp.1234 │ │
│ │ kernel/ │ ───► │ memory region: │──►│ (ELF format) │ │
│ │ core_pattern │ │ - Copy to file │ │ Ready for GDB │ │
│ │ │ │ - Add metadata │ │ │ │
│ │ ulimit -c > 0? │ │ - Save regs │ └─────────────────┘ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ OR (Modern systemd systems): │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ |/usr/lib/systemd/systemd-coredump │ │
│ │ └─► Writes to /var/lib/systemd/coredump/ │ │
│ │ └─► Logs metadata to journal │ │
│ │ └─► Use `coredumpctl debug` to analyze │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────┘
Minimal Concrete Example
// crash.c - A program that generates a core dump
#include <stdio.h>
int main(void) {
int *ptr = NULL; // ptr points to address 0
*ptr = 42; // Writing to address 0 triggers SIGSEGV
return 0;
}
# Compile with debug symbols
$ gcc -g -o crash crash.c
# Enable core dumps
$ ulimit -c unlimited
# Run and crash
$ ./crash
Segmentation fault (core dumped)
# Verify the core file exists (traditional systems)
$ file core*
core: ELF 64-bit LSB core file, x86-64
# Or on systemd systems
$ coredumpctl list
TIME PID UID GID SIG COREFILE EXE
... 1234 1000 1000 SIGSEGV present /path/to/crash
Common Misconceptions
- “Core dumps are always created” - No, they require ulimit -c to be non-zero and the file system must be writable.
- “The core file contains the executable” - No, it contains only memory contents and metadata. You need the original executable (ideally with debug symbols) to analyze it.
- “Core dumps are huge” - Not always. They can be compressed and typically only contain used memory pages, not the full virtual address space.
- “I can analyze a core dump from any machine” - The executable must match exactly (same build). Libraries must also match for accurate analysis (see the sketch below for analyzing a core away from the crash host).
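If you do need to analyze a core away from the machine that produced it, one workable approach is to copy the executable and the exact shared libraries from the crash host (the NT_FILE note or ldd tells you which ones) into a local directory tree and point GDB at that tree. A minimal sketch, with placeholder paths:
$ gdb
(gdb) set sysroot ./crashroot                          # resolve shared libraries from the copied tree
(gdb) set solib-search-path ./crashroot/lib/x86_64-linux-gnu
(gdb) file ./crashroot/usr/bin/myapp                   # the matching executable build
(gdb) core-file ./core
(gdb) bt
Setting the sysroot before loading the core keeps GDB from silently mixing in your local (mismatched) libraries.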
Check-Your-Understanding Questions
- What is the default value of ulimit -c on most systems, and what does it mean?
- If core_pattern is |/usr/lib/systemd/systemd-coredump, where do core dumps go?
- Why might you set Storage=none in /etc/systemd/coredump.conf?
- Which signal is sent when you dereference a NULL pointer?
- What’s the difference between SIGSEGV and SIGBUS?
Check-Your-Understanding Answers
- Default is 0, meaning core dumps are disabled. No core file will be created when a process crashes.
- They’re piped to the systemd-coredump service, which stores them in /var/lib/systemd/coredump/ (compressed) and logs metadata to the journal.
- For security/privacy compliance (like GDPR) where you want to log that crashes occurred without storing potentially sensitive memory contents.
- SIGSEGV (signal 11) - Segmentation fault. This indicates an invalid memory access.
- SIGSEGV typically means accessing unmapped memory. SIGBUS means the memory is mapped but the access is misaligned or otherwise illegal (e.g., accessing memory-mapped I/O incorrectly).
Real-World Applications
- Production debugging - When a server process crashes at 3 AM, the core dump lets you analyze it during business hours
- Crash reporting systems - Services like Sentry and Crashpad use core dumps (or minidumps) to report crashes
- QA and testing - Core dumps help developers understand why tests failed
- Security analysis - Examining crashes for potential vulnerabilities
Where You’ll Apply It
- Project 1: Generating your first intentional crash and verifying core dump creation
- Project 2: Loading core dumps into GDB
- Project 4: Automating core dump analysis
- Project 10: Building a crash collection system
References
- Linux man page: core(5)
- systemd-coredump documentation
- coredumpctl man page
- “The Linux Programming Interface” by Michael Kerrisk - Ch. 22 (Signals)
- Arch Wiki: Core dump
Key Insights
A core dump is a frozen snapshot of a process at the moment of death—it captures everything needed to perform a post-mortem investigation without needing to reproduce the crash.
Summary
Core dumps are ELF files containing a process’s memory and CPU state at crash time. Generation is controlled by ulimit -c (size limit) and /proc/sys/kernel/core_pattern (location/handler). Modern systems use systemd-coredump with coredumpctl for management. Core dumps can contain sensitive data, so security considerations apply.
Homework/Exercises
- Exercise 1: Write a C program that crashes with each of these signals: SIGSEGV, SIGFPE, SIGABRT. Generate and verify core dumps for each.
- Exercise 2: Configure your system to store core dumps with the pattern /tmp/cores/core.%e.%p.%t. Create the directory, set permissions, and test.
- Exercise 3: If your system uses systemd-coredump, practice using coredumpctl list, coredumpctl info, and coredumpctl debug.
- Exercise 4: Write a script that checks if core dumps are enabled and reports the current configuration.
Solutions to Homework/Exercises
Exercise 1 Solution:
// Three separate programs:
// sigsegv_crash.c
int main() { int *p = 0; *p = 1; return 0; }
// sigfpe_crash.c
int main() { volatile int x = 1, y = 0; return x / y; } /* volatile keeps the compiler from folding the division away */
// sigabrt_crash.c
#include <stdlib.h>
int main() { abort(); return 0; }
Exercise 2 Solution:
sudo mkdir -p /tmp/cores
sudo chmod 1777 /tmp/cores
echo "/tmp/cores/core.%e.%p.%t" | sudo tee /proc/sys/kernel/core_pattern
ulimit -c unlimited
./crash # Test with any crashing program
ls /tmp/cores/ # Verify core file created
Exercise 3 Solution:
coredumpctl list # List all dumps
coredumpctl info $(coredumpctl list --no-legend | tail -1 | awk '{print $5}') # Info on latest (PID is the 5th field)
coredumpctl debug # Debug most recent dump
Exercise 4 Solution:
#!/bin/bash
echo "=== Core Dump Configuration Check ==="
echo "ulimit -c: $(ulimit -c)"
echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
if [ "$(ulimit -c)" = "0" ]; then
echo "WARNING: Core dumps are DISABLED"
else
echo "Core dumps are ENABLED"
fi
Chapter 2: ELF Core Dump Format
Fundamentals
Core dumps are stored in the ELF (Executable and Linkable Format) format—the same format used for Linux executables and shared libraries. Understanding ELF structure is essential because it tells you where to find different pieces of crash data.
An ELF core dump is essentially a specially structured file that contains two main types of information: metadata (what process crashed, signal received, register values) stored in PT_NOTE segments, and memory contents (stack, heap, data sections) stored in PT_LOAD segments.
Unlike executable ELF files that have section headers for code and data, core dumps primarily use program headers to describe memory segments. The readelf and eu-readelf tools can parse these headers, revealing the structure of your crash data.
Deep Dive
ELF File Structure Overview
Every ELF file begins with a fixed-size header that identifies the file and points to important data structures:
┌────────────────────────────────────────────────────────────────────────────┐
│ ELF FILE STRUCTURE │
├────────────────────────────────────────────────────────────────────────────┤
│ ELF Header (52/64 bytes) │
│ - Magic number: 0x7F 'E' 'L' 'F' │
│ - Class (32/64-bit), Endianness, Version │
│ - Type: ET_CORE (4) for core dumps │
│ - Entry point, Program header offset, Section header offset │
├────────────────────────────────────────────────────────────────────────────┤
│ Program Headers (array) │
│ - PT_NOTE: Metadata (registers, process info, file mappings) │
│ - PT_LOAD: Memory segments (actual memory contents) │
├────────────────────────────────────────────────────────────────────────────┤
│ Segment Data │
│ - NOTE data: prstatus, prpsinfo, auxv, file mappings │
│ - LOAD data: Stack, heap, mapped files, anonymous memory │
└────────────────────────────────────────────────────────────────────────────┘
PT_NOTE Segment: The Metadata Treasure Trove
The PT_NOTE segment contains structured metadata about the crashed process. Each note has a name, type, and descriptor (data). Key note types include:
| Note Type | Name | Description |
|---|---|---|
| NT_PRSTATUS | CORE | Process status including registers, signal, PID |
| NT_PRPSINFO | CORE | Process info: state, command name, nice value |
| NT_AUXV | CORE | Auxiliary vector (dynamic linker info) |
| NT_FILE | CORE | File mappings (which files were mapped where) |
| NT_FPREGSET | CORE | Floating-point register state |
| NT_X86_XSTATE | LINUX | Extended CPU state (AVX, etc.) |
The NT_PRSTATUS note is particularly important—it contains:
- The signal that killed the process (e.g., SIGSEGV = 11)
- Current and pending signal masks
- All general-purpose register values (RIP, RSP, RAX, etc.)
- Process and thread IDs
For multi-threaded processes, there’s one NT_PRSTATUS note per thread, allowing you to see what each thread was doing at crash time.
PT_LOAD Segments: The Memory Snapshot
PT_LOAD segments contain the actual memory contents of the process. Each segment has:
- p_vaddr: Virtual address where this memory was mapped
- p_filesz: How many bytes are in the core file
- p_memsz: How many bytes this segment represented in memory
- p_flags: Permissions (read, write, execute)
If p_filesz is 0 but p_memsz is non-zero, the segment was all zeros (like uninitialized BSS) and wasn’t stored to save space.
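These fields are easy to read programmatically as well. The following is a minimal sketch for a 64-bit little-endian core file named core (offsets follow the standard ELF64 layout); it lists every PT_LOAD segment and flags the ones whose contents were omitted:
import struct

with open('core', 'rb') as f:
    header = f.read(64)                                   # ELF64 header
    assert header[:4] == b'\x7fELF', 'not an ELF file'
    e_phoff = struct.unpack_from('<Q', header, 0x20)[0]   # program header table offset
    e_phentsize = struct.unpack_from('<H', header, 0x36)[0]
    e_phnum = struct.unpack_from('<H', header, 0x38)[0]
    f.seek(e_phoff)
    for _ in range(e_phnum):
        ph = f.read(e_phentsize)
        p_type = struct.unpack_from('<I', ph, 0x00)[0]
        p_vaddr = struct.unpack_from('<Q', ph, 0x10)[0]
        p_filesz = struct.unpack_from('<Q', ph, 0x20)[0]
        p_memsz = struct.unpack_from('<Q', ph, 0x28)[0]
        if p_type == 1:                                   # PT_LOAD
            note = ' (all zeros, not stored)' if p_filesz == 0 else ''
            print(f'LOAD vaddr=0x{p_vaddr:x} filesz=0x{p_filesz:x} memsz=0x{p_memsz:x}{note}')
Its output should line up with readelf -l core and, for a matching live process, with /proc/[pid]/maps.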
Inspecting ELF Structure with readelf
# View the ELF header
$ readelf -h core
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 ...
Class: ELF64
Type: CORE (Core file)
...
# View program headers
$ readelf -l core
Program Headers:
Type Offset VirtAddr ...
NOTE 0x0000000000000350 0x0000000000000000 ...
LOAD 0x0000000000001000 0x0000555555554000 ...
LOAD 0x0000000000002000 0x00007ffff7dd5000 ...
# View notes
$ readelf -n core
Displaying notes found in: core
Owner Data size Description
CORE 0x00000150 NT_PRSTATUS (prstatus structure)
CORE 0x00000088 NT_PRPSINFO (prpsinfo structure)
CORE 0x00000130 NT_AUXV (auxiliary vector)
CORE 0x00000550 NT_FILE (mapped files)
The NT_FILE Note: Understanding Memory Mappings
The NT_FILE note is invaluable—it tells you which shared libraries and files were mapped into the process. This helps you understand:
- Which version of libc was running
- What shared libraries were loaded
- Where the executable was mapped
# Example NT_FILE output:
Page size: 4096
Start End Page Offset File
0x0000555555554000 0x0000555555555000 0x00000000 /path/to/program
0x00007ffff7dd5000 0x00007ffff7f6a000 0x00000000 /lib/x86_64-linux-gnu/libc.so.6
How This Fits in Projects
- Project 6: You’ll work with stripped binaries where understanding ELF structure helps locate functions
- Project 7: You’ll parse the ELF/minidump structure programmatically
- All projects: Understanding where data lives in the core file helps you navigate GDB output
Mental Model Diagram
ELF CORE DUMP ANATOMY
┌────────────────────────────────────────────────────────────────────────┐
│ ELF HEADER (64 bytes) │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Magic: 7F 45 4C 46 Type: CORE Machine: x86_64 Entry: 0x0 │ │
│ │ Program Header Offset: 64 Section Header Offset: 0 (none) │ │
│ └──────────────────────────────────────────────────────────────────┘ │
├────────────────────────────────────────────────────────────────────────┤
│ PROGRAM HEADERS │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ [0] PT_NOTE offset=0x350 vaddr=0x0 filesz=0x8a8 │ │
│ │ [1] PT_LOAD offset=0x1000 vaddr=0x555... filesz=0x1000 │ │
│ │ [2] PT_LOAD offset=0x2000 vaddr=0x7ff... filesz=0x195000 │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────────────────────┘ │
├────────────────────────────────────────────────────────────────────────┤
│ NOTE SEGMENT DATA │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ NT_PRSTATUS: signal=11, pid=1234, regs={rip=0x555..., rsp=...} │ │
│ │ NT_PRPSINFO: state='R', fname="program", args="./program" │ │
│ │ NT_AUXV: AT_PHDR=..., AT_ENTRY=..., AT_BASE=... │ │
│ │ NT_FILE: [0x555...–0x556...] /path/to/program │ │
│ │ [0x7ff...–0x7ff...] /lib/.../libc.so.6 │ │
│ │ NT_FPREGSET: floating point registers │ │
│ │ (For multi-threaded: NT_PRSTATUS for each thread) │ │
│ └──────────────────────────────────────────────────────────────────┘ │
├────────────────────────────────────────────────────────────────────────┤
│ LOAD SEGMENT DATA │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ [Segment 1: Code] .text section from executable │ │
│ │ [Segment 2: Data] .data, .bss, heap │ │
│ │ [Segment 3: Stack] Local variables, return addresses │ │
│ │ [Segment 4: libc] Memory-mapped shared library │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────┘
NOTE: No section headers! Core dumps use program headers only.
Minimal Concrete Example
# Generate a core dump
$ ulimit -c unlimited
$ echo "core" | sudo tee /proc/sys/kernel/core_pattern
$ ./crash
Segmentation fault (core dumped)
# Inspect the ELF structure
$ file core
core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style
$ readelf -h core | head -10
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
Type: CORE (Core file)
$ readelf -n core | grep -A2 "NT_PRSTATUS"
CORE 0x00000150 NT_PRSTATUS (prstatus structure)
# Contains signal 11 (SIGSEGV) and register values
Common Misconceptions
-
“Core dumps have section headers like executables” - No, core dumps typically have 0 section headers. They use program headers (PT_NOTE, PT_LOAD) exclusively.
-
“The entire virtual address space is saved” - No, only mapped pages with actual content are saved. Zero pages and unmapped regions are omitted.
-
“You can run the core dump” - No, it’s not an executable. It’s a memory snapshot that requires the original executable and GDB to interpret.
-
“All notes are in one PT_NOTE segment” - Usually yes, but the notes within that segment are individual records you must parse sequentially.
Check-Your-Understanding Questions
- What is the ELF type field value for core dumps?
- What information is stored in the NT_PRSTATUS note?
- How can you determine which shared libraries were loaded when a process crashed?
- Why do core dumps typically have 0 section headers?
- What does it mean when a PT_LOAD segment has p_filesz=0 but p_memsz > 0?
Check-Your-Understanding Answers
- ET_CORE (value 4). This distinguishes core dumps from executables (ET_EXEC), shared objects (ET_DYN), and relocatable files (ET_REL).
- NT_PRSTATUS contains: the signal that killed the process, PID, PPID, all general-purpose registers (including the instruction pointer RIP and stack pointer RSP), and signal masks.
- The NT_FILE note in the PT_NOTE segment lists all memory-mapped files with their virtual address ranges. Use readelf -n core to view it.
- Section headers are used by linkers and debuggers for executables, but core dumps only need to represent memory layout. Program headers (PT_LOAD) are sufficient for this, and omitting section headers saves space.
- This means the memory region was all zeros (like uninitialized BSS). The kernel doesn’t store zero pages in the core file to save space—GDB knows to treat this region as zeros.
Real-World Applications
- Crash reporting tools - Parse ELF structure to extract register values and stack data
- Forensic analysis - Understand exactly what memory a process had access to
- Custom debugging tools - Build specialized analysis tools by parsing core dumps directly
- Minidump generation - Convert full core dumps to smaller formats for upload
Where You’ll Apply It
- Project 6: Understanding stripped binary structure
- Project 7: Parsing minidumps (similar structure)
- All projects: Knowing where GDB gets its information
References
- Anatomy of an ELF core file
- LIEF ELF Coredump Tutorial
- “Practical Binary Analysis” by Dennis Andriesse - Ch. 2-3
- “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Ch. 7
Key Insights
A core dump is just another ELF file—but instead of code and data for execution, it contains memory snapshots (PT_LOAD) and metadata (PT_NOTE) for post-mortem analysis.
Summary
Core dumps use ELF format with PT_NOTE segments for metadata (registers, signal, file mappings) and PT_LOAD segments for memory contents. The NT_PRSTATUS note contains registers and signal info. The NT_FILE note lists memory-mapped files. Use readelf -h, readelf -l, and readelf -n to inspect structure.
Homework/Exercises
- Exercise 1: Use readelf -l on a core dump and count the PT_LOAD segments. Correlate them with the output of /proc/[pid]/maps from a running process of the same program.
- Exercise 2: Use readelf -n to extract the signal number from NT_PRSTATUS. Verify it matches the signal you expected.
- Exercise 3: Write a Python script using the struct module to parse the ELF header and count program headers in a core dump.
- Exercise 4: Compare the ELF structure of a core dump vs. the original executable using readelf -h on both.
Solutions to Homework/Exercises
Exercise 1 Solution:
# First, run a program and check its maps
$ ./myprogram &
$ cat /proc/$!/maps
# Note the memory regions
# Then generate a core dump (Ctrl+\ or kill -QUIT)
$ readelf -l core | grep LOAD
# Each PT_LOAD should correspond to a mapped region
Exercise 2 Solution:
$ readelf -n core | grep -A5 "NT_PRSTATUS"
# The signal is stored in the prstatus structure
# Look for "si_signo" or use a hex dump to find signal at known offset
# For SIGSEGV (11):
$ xxd -s 0x350 -l 64 core # Dump bytes near the prstatus note (the offset varies; check readelf -l)
Exercise 3 Solution (outline):
import struct
with open('core', 'rb') as f:
# ELF header is 64 bytes for ELF64
header = f.read(64)
# Parse key fields
magic = header[0:4] # Should be b'\x7fELF'
phoff = struct.unpack('<Q', header[32:40])[0] # Program header offset
phnum = struct.unpack('<H', header[56:58])[0] # Number of program headers
print(f"Program headers: {phnum} at offset {phoff}")
Exercise 4 Solution:
$ readelf -h ./myprogram | grep Type
Type: EXEC (Executable file)
$ readelf -h core | grep Type
Type: CORE (Core file)
# Key differences: Type field, entry point (0 for core), section headers
Chapter 3: GDB for Post-Mortem Debugging
Fundamentals
GDB (GNU Debugger) is the primary tool for analyzing core dumps on Linux. While GDB is commonly used for live debugging (setting breakpoints, stepping through code), post-mortem debugging—loading a core dump after a crash—is fundamentally different: you’re examining a frozen moment in time, not a running process.
In post-mortem mode, you cannot step forward, set breakpoints, or continue execution. Instead, you can inspect the state at crash time: the call stack (backtrace), variable values, memory contents, and register values. The key insight is that a core dump + the original executable + debug symbols together reconstruct the complete picture of what went wrong.
The basic workflow is: gdb <executable> <core-file>. GDB loads the executable to get symbol information and the core file to get the crash state.
Deep Dive
Loading a Core Dump
The fundamental GDB command for core dump analysis is straightforward:
$ gdb ./program core
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
...
Core was generated by `./program'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000555555555149 in main () at crash.c:5
5 *ptr = 42;
(gdb)
GDB immediately shows you:
- Which program generated the core
- Which signal terminated it
- The function and line where it stopped (if symbols are available)
The Backtrace: Your Map of the Crash
The backtrace (or bt) command shows the call stack at crash time:
(gdb) bt
#0 0x0000555555555149 in vulnerable_function (input=0x7fffffffe010 "AAAA") at crash.c:5
#1 0x0000555555555178 in process_data (data=0x7fffffffe010 "AAAA") at crash.c:12
#2 0x00005555555551a2 in main (argc=2, argv=0x7fffffffe108) at crash.c:18
Each frame represents a function call. Frame #0 is where the crash occurred. Higher numbers are callers going back to main (and beyond to _start if you look far enough).
Navigating Stack Frames
You can switch between frames to examine different contexts:
(gdb) frame 1 # Switch to process_data()
#1 0x0000555555555178 in process_data (data=0x7fffffffe010 "AAAA") at crash.c:12
12 vulnerable_function(data);
(gdb) info args # Show function arguments
data = 0x7fffffffe010 "AAAA"
(gdb) info locals # Show local variables
local_buffer = "..."
(gdb) up # Move up one frame (to caller)
(gdb) down # Move down one frame (to callee)
Inspecting Variables and Memory
The print command examines variable values:
(gdb) print ptr
$1 = (int *) 0x0 # NULL pointer - found the bug!
(gdb) print *ptr # Try to dereference
Cannot access memory at address 0x0
(gdb) print my_struct
$2 = {name = "test", value = 42, next = 0x555555558040}
(gdb) print my_struct.name
$3 = "test"
The x (examine) command inspects raw memory:
(gdb) x/16xw $rsp # 16 words in hex, starting at stack pointer
0x7fffffffe000: 0x41414141 0x41414141 0x41414141 0x41414141
(gdb) x/s 0x555555556000 # Examine as string
0x555555556000: "Hello, World!"
(gdb) x/10i $rip # Examine 10 instructions at instruction pointer
0x555555555149 <main+20>: mov DWORD PTR [rax],0x2a
0x55555555514f <main+26>: mov eax,0x0
Format specifiers for x:
- x - hexadecimal
- d - decimal
- s - string
- i - instruction (disassembly)
- c - character
- b/h/w/g - byte/halfword(2)/word(4)/giant(8) sizes
Register Inspection
Registers often reveal the immediate cause of a crash:
(gdb) info registers
rax 0x0 0 # Often holds bad address
rbx 0x0 0
rcx 0x7ffff7f9a6a0 140737353705120
rdx 0x7ffff7f9c4e0 140737353712864
rsi 0x0 0
rdi 0x7fffffffe010 140737488347152
rbp 0x7fffffffe030 0x7fffffffe030
rsp 0x7fffffffe000 0x7fffffffe000 # Stack pointer
rip 0x555555555149 0x555555555149 <main+20> # Instruction pointer
(gdb) print $rip # Access registers with $ prefix
$1 = (void (*)()) 0x555555555149 <main+20>
The Critical Importance of Debug Symbols
Without debug symbols (i.e., if the program was not compiled with -g), you lose:
- Function names (replaced by addresses)
- Variable names and types
- Source file and line information
# WITH symbols:
#0 0x0000555555555149 in main () at crash.c:5
(gdb) print ptr
$1 = (int *) 0x0
# WITHOUT symbols:
#0 0x0000555555555149 in ?? ()
(gdb) print ptr
No symbol "ptr" in current context.
Essential GDB Commands for Core Analysis
| Command | Shortcut | Description |
|---|---|---|
| backtrace | bt | Show call stack |
| backtrace full | bt full | Show stack with local variables |
| frame N | f N | Switch to frame N |
| up / down | | Move up/down the stack |
| info registers | i r | Show CPU registers |
| info args | | Show function arguments |
| info locals | | Show local variables |
| print EXPR | p EXPR | Evaluate and print expression |
| x/FMT ADDR | | Examine memory |
| list | l | Show source code at current location |
| disassemble | disas | Show assembly code |
| info threads | i threads | List all threads |
| thread N | t N | Switch to thread N |
Using GDB’s Python API (Preview)
GDB can be scripted with Python for automation:
# Save as analyze.py, run with: gdb -x analyze.py ./program core
import gdb
gdb.execute("set pagination off")
print("=== Crash Analysis ===")
print(gdb.execute("bt", to_string=True))
print("=== Registers ===")
rip = gdb.parse_and_eval("$rip")
print(f"RIP: {rip}")
This is covered in depth in Project 4.
How This Fits in Projects
- Project 2: Master the basic GDB workflow with backtraces
- Project 3: Use memory inspection to diagnose corruption
- Project 4: Automate GDB with Python scripting
- Project 5: Apply GDB to multi-threaded crashes
- Project 6: Use GDB with stripped binaries
Mental Model Diagram
GDB CORE DUMP WORKFLOW
┌─────────────────────────────────────────────────────────────────────────────┐
│ │
│ INPUT GDB OUTPUT │
│ ┌─────────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Executable │ │ │ │ Crash Location │ │
│ │ (with -g) │──────►│ Symbol │──────────►│ Function names │ │
│ │ │ │ Matching │ │ Line numbers │ │
│ └─────────────────┘ │ │ │ Variable types │ │
│ │ │ └──────────────────┘ │
│ ┌─────────────────┐ │ │ │
│ │ Core Dump │──────►│ State │ ┌──────────────────┐ │
│ │ (memory + │ │ Extraction │──────────►│ Backtrace │ │
│ │ registers) │ │ │ │ Variable values │ │
│ └─────────────────┘ │ │ │ Memory contents │ │
│ │ │ │ Register state │ │
│ └──────────────┘ └──────────────────┘ │
│ │
│ KEY COMMANDS: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ bt → Call stack (where did it crash?) │ │
│ │ frame N → Switch context (examine caller) │ │
│ │ info args → Function parameters (what was passed?) │ │
│ │ info locals → Local variables (what state?) │ │
│ │ print VAR → Variable value (specific data) │ │
│ │ x/FMT ADDR → Raw memory (bit-level view) │ │
│ │ info reg → CPU registers (hardware state) │ │
│ │ list → Source code (if available) │ │
│ │ disas → Assembly (always available) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Minimal Concrete Example
# Compile with debug symbols
$ gcc -g -o crash crash.c
# Generate core dump
$ ulimit -c unlimited
$ ./crash
Segmentation fault (core dumped)
# Analyze with GDB
$ gdb ./crash core
(gdb) bt
#0 0x0000555555555149 in main () at crash.c:5
(gdb) list
1 #include <stdio.h>
2
3 int main(void) {
4 int *ptr = NULL;
5 *ptr = 42; # <-- Crash here
6 return 0;
7 }
(gdb) print ptr
$1 = (int *) 0x0
(gdb) info registers rax rip
rax 0x0 0
rip 0x555555555149 0x555555555149 <main+20>
Common Misconceptions
- “I can step through code in a core dump” - No, the program isn’t running. You can only examine the frozen state. Commands like next, step, and continue don’t work.
- “I need the exact same binary” - Not exact, but the symbols must match. If you have a debug version of the same source, it will work. Different builds may have different addresses.
- “GDB shows me the bug” - GDB shows you the symptom (where it crashed). Finding the cause (often earlier in the program) requires detective work.
- “Missing symbols means I can’t debug” - You can still see addresses and disassembly. It’s harder, but not impossible (covered in Project 6).
Check-Your-Understanding Questions
- What are the two files you need to load a core dump in GDB?
- What does frame #0 represent in a backtrace?
- What command shows local variables in the current stack frame?
- How do you examine 8 bytes of memory at address 0x7fff0000 as hexadecimal?
- Why might print ptr fail with “No symbol” even though the core dump is valid?
Check-Your-Understanding Answers
- The executable (preferably compiled with -g for symbols) and the core dump file. Command: gdb ./program core
- Frame #0 is the innermost frame—the function that was executing when the crash occurred. It’s where the instruction pointer (RIP) was pointing.
- info locals shows local variables. info args shows function arguments.
- x/8xb 0x7fff0000 (8 bytes in hex) or x/xg 0x7fff0000 (one giant/8-byte word in hex)
- The executable was compiled without debug symbols (the -g flag). The core dump is valid and contains the value, but GDB can’t map addresses to variable names without symbols.
Real-World Applications
- Production crash triage - Quickly identify crash location in server applications
- Bug reports - Extract relevant information to share with developers
- Regression testing - Analyze crashes from automated test runs
- Security research - Examine crash state for vulnerability analysis
Where You’ll Apply It
- Project 2: Basic backtrace extraction
- Project 3: Memory inspection for corruption analysis
- Project 4: Scripting GDB for automation
- Project 5: Multi-threaded analysis with info threads
- Project 6: Working without symbols
References
- GDB Manual - Core File Generation
- “The Art of Debugging with GDB, DDD, and Eclipse” by Matloff & Salzman
- Stanford CS107 GDB Guide
- Brendan Gregg’s GDB Tutorial
Key Insights
GDB + core dump = a time machine to the moment of death. You can’t change the past, but you can thoroughly examine what went wrong.
Summary
GDB loads core dumps with gdb <executable> <core>. Essential commands: bt (backtrace), frame N (switch context), print (variables), x/FMT ADDR (raw memory), info registers (CPU state). Debug symbols (-g) are crucial for meaningful output. Post-mortem debugging examines frozen state—you cannot step or continue.
Homework/Exercises
- Exercise 1: Create a program with three nested function calls (main → func1 → func2) where func2 crashes. Use GDB to examine each frame’s local variables.
- Exercise 2: Practice memory examination: write a program that stores the string “DEADBEEF” in a buffer, then examine it with x/8xb, x/2xw, and x/s.
- Exercise 3: Compare the output of bt with and without debug symbols on the same crash. Document the differences.
- Exercise 4: Write down 10 GDB commands and their purposes from memory. Then verify against the documentation.
Solutions to Homework/Exercises
Exercise 1 Solution:
// nested.c
void func2(int *p) { *p = 42; }
void func1(int *p) { func2(p); }
int main() { int *p = 0; func1(p); return 0; }
(gdb) bt
#0 func2 (p=0x0) at nested.c:1
#1 func1 (p=0x0) at nested.c:2
#2 main () at nested.c:3
(gdb) frame 1
(gdb) info args
p = 0x0
Exercise 2 Solution:
int main() { char buf[16] = "DEADBEEF"; volatile int zero = 0; return buf[0] / zero; } /* divide by a runtime zero so the crash happens while buf is in scope */
(gdb) x/8xb buf
0x...: 0x44 0x45 0x41 0x44 0x42 0x45 0x45 0x46 # D E A D B E E F
(gdb) x/2xw buf
0x...: 0x44414544 0x46454542 # Note: little-endian
(gdb) x/s buf
0x...: "DEADBEEF"
Exercise 3 Solution:
With -g: Shows function names, file:line, variable names
Without -g: Shows ?? (), no file/line, “No symbol in context” for prints
Exercise 4 Solution: bt, frame, up, down, print, x, info registers, info locals, info args, list, disassemble, quit
Chapter 4: Multi-Threaded Crash Analysis
Fundamentals
Modern applications are often multi-threaded, which adds complexity to crash analysis. When a multi-threaded program crashes, the core dump captures the state of all threads, not just the one that triggered the crash. Understanding how to navigate between threads and correlate their states is essential for diagnosing concurrency bugs.
The challenge with multi-threaded crashes is that the crashing thread often reveals the symptom (e.g., a NULL pointer dereference), but the cause may be another thread that corrupted shared data or failed to properly synchronize. You must examine all threads to understand the full picture.
GDB provides commands like info threads, thread N, and thread apply all to navigate and inspect multiple threads in a core dump.
Deep Dive
How Threads Appear in Core Dumps
Each thread in a process has its own:
- Stack (with its own local variables and call chain)
- Registers (including its own instruction pointer)
- Thread ID (TID/LWP - Light Weight Process)
In the ELF core dump, each thread gets its own NT_PRSTATUS note containing that thread’s registers. GDB parses these to reconstruct the state of each thread.
Listing All Threads
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7ffff7fb4740 (LWP 12345) main () at main.c:30
2 Thread 0x7ffff7fb3700 (LWP 12346) worker_func () at worker.c:15
3 Thread 0x7ffff7fb2700 (LWP 12347) writer_func () at writer.c:22
The * indicates the “current” thread—the one GDB is focused on. In a crash, this is typically the thread that received the fatal signal.
Switching Between Threads
(gdb) thread 2 # Switch to thread 2
[Switching to thread 2 (Thread 0x7ffff7fb3700 (LWP 12346))]
#0 worker_func () at worker.c:15
(gdb) bt # Now shows thread 2's stack
#0 worker_func () at worker.c:15
#1 thread_entry () at main.c:20
...
Getting All Backtraces at Once
The most powerful command for multi-threaded crash analysis:
(gdb) thread apply all bt
Thread 3 (Thread 0x7ffff7fb2700 (LWP 12347)):
#0 writer_func () at writer.c:22
#1 thread_entry () at main.c:25
...
Thread 2 (Thread 0x7ffff7fb3700 (LWP 12346)):
#0 worker_func () at worker.c:15
#1 thread_entry () at main.c:20
...
Thread 1 (Thread 0x7ffff7fb4740 (LWP 12345)):
#0 0x0000555555555149 in main () at main.c:30
Common Multi-Threaded Bug Patterns
-
Data Race - Two threads access shared data without synchronization, one writes, one reads. The reader may see corrupted or inconsistent data.
-
Use-After-Free Race - Thread A frees memory, Thread B still has a pointer and uses it.
-
Double-Free Race - Two threads each try to free the same memory.
-
Deadlock-Induced Timeout - While not a crash per se, if a program is killed due to a deadlock, the core dump shows threads waiting on locks.
Detecting a Data Race from a Core Dump
Consider this scenario:
// Shared global
char *g_data = NULL;
// Thread 1: Writer
void writer_thread() {
g_data = malloc(100);
strcpy(g_data, "Hello");
}
// Thread 2: Reader (crashes!)
void reader_thread() {
printf("%s\n", g_data); // CRASH if g_data is still NULL
}
In the core dump:
(gdb) thread apply all bt
Thread 2:
#0 reader_thread () at race.c:12
...
Thread 1:
#0 writer_thread () at race.c:6
...
(gdb) thread 2
(gdb) print g_data
$1 = (char *) 0x0 # Still NULL when reader tried to use it!
(gdb) thread 1
(gdb) info locals
# Maybe see that malloc was about to be called
The race condition is revealed: Thread 2 accessed g_data before Thread 1 finished initializing it.
Examining Locks and Synchronization State
For programs using pthreads, you can examine mutex states:
(gdb) print my_mutex
$1 = {__data = {__lock = 1, __count = 0, __owner = 12346, ...}}
# ^^^^^^^^^^^
# This thread (LWP 12346) holds the lock
If a thread is waiting on a lock, its backtrace often shows pthread_mutex_lock or similar.
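A minimal sketch you can use to practice this kind of inspection (a deliberate lock-ordering deadlock; compile with gcc -g -pthread and, once it hangs, force a core with kill -ABRT <pid>):
// deadlock.c - two threads acquire the same mutexes in opposite order
#include <pthread.h>
#include <unistd.h>

pthread_mutex_t mutex_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t mutex_b = PTHREAD_MUTEX_INITIALIZER;

void *thread_one(void *arg) {
    pthread_mutex_lock(&mutex_a);
    sleep(1);                        // give the other thread time to take mutex_b
    pthread_mutex_lock(&mutex_b);    // blocks forever: thread_two holds mutex_b
    return NULL;
}

void *thread_two(void *arg) {
    pthread_mutex_lock(&mutex_b);
    sleep(1);
    pthread_mutex_lock(&mutex_a);    // blocks forever: thread_one holds mutex_a
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread_one, NULL);
    pthread_create(&t2, NULL, thread_two, NULL);
    pthread_join(t1, NULL);          // never returns once the deadlock forms
    pthread_join(t2, NULL);
    return 0;
}
In the resulting core, thread apply all bt shows both worker threads parked inside pthread_mutex_lock, and printing mutex_a and mutex_b reveals each mutex's __owner LWP, which tells you who is waiting on whom.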
How This Fits in Projects
- Project 5: Create and analyze a multi-threaded race condition crash
- Project 10: Handle multi-threaded crashes in your crash reporter
Mental Model Diagram
MULTI-THREADED CRASH ANALYSIS
┌─────────────────────────────────────────────────────────────────────────────┐
│ │
│ SINGLE PROCESS │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Thread 1 (Main) Thread 2 (Worker) Thread 3 (Writer) │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Stack │ │ Stack │ │ Stack │ │ │
│ │ │ Registers │ │ Registers │ │ Registers │ │ │
│ │ │ TID: 12345 │ │ TID: 12346 │ │ TID: 12347 │ │ │
│ │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │ │
│ │ │ │ │ │ │
│ │ └──────────────────────┼─────────────────────┘ │ │
│ │ │ │ │
│ │ ┌───────────▼───────────┐ │ │
│ │ │ SHARED MEMORY │ │ │
│ │ │ - Global variables │ │ │
│ │ │ - Heap │ │ │
│ │ │ - Mutexes │ │ │
│ │ └───────────────────────┘ │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
│ CRASH IN THREAD 1: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Core dump captures ALL threads' states │ │
│ │ - Each thread has its own NT_PRSTATUS in the core │ │
│ │ - The symptom is in Thread 1, but the cause may be Thread 2 or 3 │ │
│ │ - Use `thread apply all bt` to see all backtraces │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ GDB COMMANDS: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ info threads → List all threads │ │
│ │ thread N → Switch to thread N │ │
│ │ thread apply all bt → Backtrace ALL threads │ │
│ │ thread apply all info locals → All threads' local vars │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Minimal Concrete Example
// race_crash.c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
char *g_data = NULL;
void *writer_thread(void *arg) {
sleep(1); // Simulate slow initialization
g_data = malloc(100);
return NULL;
}
void *reader_thread(void *arg) {
// Race: may execute before writer finishes
printf("Data: %s\n", g_data); // CRASH if g_data is NULL
return NULL;
}
int main() {
pthread_t writer, reader;
pthread_create(&writer, NULL, writer_thread, NULL);
pthread_create(&reader, NULL, reader_thread, NULL);
pthread_join(writer, NULL);
pthread_join(reader, NULL);
return 0;
}
$ gdb ./race_crash core
(gdb) info threads
Id Target Id Frame
* 2 Thread ... (LWP ...) reader_thread () at race_crash.c:15
3 Thread ... (LWP ...) writer_thread () at race_crash.c:10
1 Thread ... (LWP ...) pthread_join () ...
(gdb) thread 2
(gdb) print g_data
$1 = (char *) 0x0 # NULL - writer hasn't finished!
Common Misconceptions
-
“The crashing thread is always the buggy one” - Often the crash is a symptom. Another thread corrupted data that caused this thread to crash.
-
“Core dumps only capture the crashing thread” - No, all threads are captured. Each has its own registers and stack in the dump.
-
“I can see lock contention history” - No, you only see the current state. You can’t see what happened before the crash.
-
“Thread IDs are stable across runs” - No, TIDs (LWP numbers) are assigned by the kernel and vary between runs.
Check-Your-Understanding Questions
- How do you list all threads in a GDB core dump session?
- What does the * mean in info threads output?
- How do you get a backtrace for every thread at once?
- Where in the core dump are individual thread states stored?
- If Thread 1 crashes due to a NULL pointer, how might Thread 2 be responsible?
Check-Your-Understanding Answers
- Use info threads to list all threads with their IDs, target IDs, and current frames.
- The * indicates the currently selected/focused thread—usually the one that received the fatal signal.
- Use thread apply all bt to get backtraces for all threads in one command.
- Each thread has its own NT_PRSTATUS note in the PT_NOTE segment of the core dump, containing that thread’s registers.
- Thread 2 might have been responsible for initializing the pointer but hadn’t done so yet (race condition), or Thread 2 might have freed or corrupted the memory Thread 1 was using.
Key Insights
In multi-threaded crashes, the crashing thread shows the symptom, but the cause may be in any thread. Always examine all threads.
Summary
Multi-threaded core dumps capture all threads’ states. Use info threads to list them, thread N to switch, and thread apply all bt for all backtraces. Data races, use-after-free, and synchronization issues require examining shared state across threads. The crashing thread often isn’t the root cause.
Chapter 5: Kernel Crash Analysis (kdump and crash)
Fundamentals
When the Linux kernel itself crashes (a “kernel panic”), the normal core dump mechanism can’t work—the kernel is the core dump generator. Instead, Linux uses kdump, which boots a secondary “capture kernel” to save the memory of the panicked kernel. The resulting file is called a vmcore.
Analyzing a vmcore requires the crash utility, which is essentially “GDB for the kernel.” It understands kernel data structures and can navigate process lists, examine kernel stacks, and inspect driver state—all from a frozen snapshot of the entire system.
This is advanced material, but understanding the basics gives you powerful debugging capabilities for system-level issues.
Deep Dive
How kdump Works
When the kernel panics, it can’t simply write a file—the file system might be corrupted, and the kernel’s own code might be broken. Instead, kdump uses kexec to immediately boot into a small, pre-loaded “capture kernel” that runs from reserved memory:
Normal Operation:
┌─────────────────────────────────────────────────────────────────┐
│ RUNNING KERNEL │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Normal kernel uses most of memory │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ┌──────────────┐ │
│ │ Reserved │ ← Capture kernel loaded here (crashkernel=) │
│ │ Memory │ │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
After Panic:
┌─────────────────────────────────────────────────────────────────┐
│ CAPTURE KERNEL RUNNING │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Panicked kernel's memory preserved, accessible via │ │
│ │ /proc/vmcore in capture kernel │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ┌──────────────┐ │
│ │ Capture │ ← This small kernel is now running │
│ │ Kernel │ ← Saves /proc/vmcore to disk │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Configuring kdump
- Reserve memory - Add crashkernel=256M (or more) to the kernel command line in GRUB (see the sketch below)
- Install packages - kexec-tools and crash
- Enable the service - systemctl enable kdump
- Get debug symbols - Install kernel-debuginfo or equivalent
# Check kdump status
$ systemctl status kdump
● kdump.service - Crash recovery kernel arming
Active: active (exited)
# Check crashkernel reservation
$ cat /proc/cmdline | grep crashkernel
crashkernel=256M
# List crash dumps (after a panic)
$ ls /var/crash/
127.0.0.1-2024-12-20-15:30:00/
The crash Utility
The crash utility is an interactive tool for analyzing vmcore files:
# Basic invocation
$ crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/.../vmcore
crash> help # List available commands
crash> bt # Backtrace of panicking task
crash> log # Kernel log buffer (dmesg)
crash> ps # Process list at crash time
crash> files <pid> # Open files for a process
crash> vm <pid> # Virtual memory info
crash> struct <name> # Examine kernel structure
crash> quit # Exit
Essential crash Commands
| Command | Description |
|---|---|
| bt | Backtrace of current (panicking) task |
| bt -a | Backtrace of all CPUs |
| log | Kernel log buffer (like dmesg) |
| ps | List all processes |
| files <pid> | Open files for a process |
| vm <pid> | Virtual memory for a process |
| struct <name> <addr> | Display kernel structure |
| mod | List loaded modules |
| kmem -i | Memory usage summary |
| foreach bt | Backtrace every process |
Analyzing a Kernel Panic
When you trigger a panic (e.g., via a buggy kernel module), the output in crash looks like:
crash> bt
PID: 1234 TASK: ffff88810a4d8000 CPU: 1 COMMAND: "insmod"
#0 [ffffc90000a77e30] machine_kexec at ffffffff8100259b
#1 [ffffc90000a77e80] __crash_kexec at ffffffff8110d9ab
#2 [ffffc90000a77f00] panic at ffffffff8106a3e8
#3 [ffffc90000a77f30] oops_end at ffffffff81c01b9a
#4 [ffffc90000a77f80] no_context at ffffffff8104d2ab
#5 [ffffc90000a77ff0] do_page_fault at ffffffff81c0605e
#6 [ffffc90000a77ff8] page_fault at ffffffff82000b9e
#7 [ffffc90000a78050] buggy_init at ffffffffc0670010 [buggy_module]
^^^^^^^^^^^^^^^ YOUR BUGGY CODE
crash> log | tail -20
[ 123.456] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 123.457] #PF: supervisor write access in kernel mode
...
[ 123.465] Kernel panic - not syncing: Fatal exception
How This Fits in Projects
- Project 8: Configure kdump and trigger a kernel panic with a buggy module
- Project 9: Use crash to analyze the resulting vmcore
Mental Model Diagram
KERNEL CRASH ANALYSIS PIPELINE
┌─────────────────────────────────────────────────────────────────────────────┐
│ │
│ 1. PANIC OCCURS 2. KEXEC TRIGGERS 3. CAPTURE & SAVE │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ BUG: NULL ptr │ │ kexec boots │ │ /proc/vmcore │ │
│ │ dereference │ ───► │ capture kernel │ ───► │ saved to │ │
│ │ in kernel code │ │ from reserved │ │ /var/crash/ │ │
│ │ │ │ memory │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │ │
│ ▼ │
│ 4. ANALYSIS 5. INVESTIGATION │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ $ crash vmlinux vmcore │ │
│ │ │ │
│ │ crash> bt # See panic stack trace │ │
│ │ crash> log # See kernel messages │ │
│ │ crash> ps # See running processes │ │
│ │ crash> mod # See loaded modules │ │
│ │ │ │
│ │ Result: Identify buggy code path in kernel/module │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ REQUIREMENTS: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ - crashkernel= parameter in boot command line │ │
│ │ - kdump service enabled and running │ │
│ │ - crash utility installed │ │
│ │ - Kernel debug symbols (vmlinux with debuginfo) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Minimal Concrete Example
// buggy_module.c - A kernel module that causes a panic
#include <linux/module.h>
#include <linux/kernel.h>
static int __init buggy_init(void) {
int *ptr = NULL;
printk(KERN_INFO "About to crash...\n");
*ptr = 42; // PANIC: NULL pointer dereference in kernel mode
return 0;
}
static void __exit buggy_exit(void) {
printk(KERN_INFO "Goodbye\n");
}
module_init(buggy_init);
module_exit(buggy_exit);
MODULE_LICENSE("GPL");
# Build the module
$ make -C /lib/modules/$(uname -r)/build M=$(pwd) modules
# Load it (IN A VM!)
$ sudo insmod buggy_module.ko
# System panics, kdump captures vmcore
# After reboot, analyze
$ sudo crash /usr/lib/debug/.../vmlinux /var/crash/.../vmcore
crash> bt
crash> log
Key Insights
Kernel crashes require a completely different capture mechanism (kdump) because the kernel can’t debug itself. The crash utility is GDB for the entire operating system.
Summary
Kernel panics are captured by kdump, which boots a capture kernel to save memory as a vmcore. The crash utility analyzes vmcore files using commands like bt, log, and ps. This requires crashkernel reservation, kdump service, and kernel debug symbols.
Chapter 6: Automation and Scripting
Fundamentals
Manual crash analysis doesn’t scale. Production systems may generate dozens of crashes daily, and each needs initial triage to determine severity and potential root cause. GDB’s Python API and batch mode enable automation, letting you build scripts that extract key information without human intervention.
Automation serves two purposes: efficiency (processing many dumps quickly) and consistency (every dump gets the same analysis, reducing human error). This chapter covers the techniques used in Project 4 (automated triage) and Project 10 (centralized crash reporter).
Deep Dive
GDB Batch Mode
The simplest automation is GDB’s batch mode, which runs commands from a file:
# commands.gdb
set pagination off
bt
info registers
quit
$ gdb -q --batch -x commands.gdb ./program core
#0 main () at crash.c:5
...
The -q (quiet) flag suppresses the welcome message. --batch exits after running commands.
GDB Python API
For more sophisticated automation, GDB embeds a Python interpreter. Your Python scripts run inside GDB and can access its internals:
# analyze.py - Run with: gdb -q --batch -x analyze.py ./program core
import gdb
# Disable paging
gdb.execute("set pagination off")
# Get backtrace as string
bt = gdb.execute("bt", to_string=True)
print("=== BACKTRACE ===")
print(bt)
# Access registers programmatically
rip = gdb.parse_and_eval("$rip")
rsp = gdb.parse_and_eval("$rsp")
print(f"RIP: {rip}")
print(f"RSP: {rsp}")
# Show the signal disposition table; the fatal signal itself appears in
# GDB's "Program terminated with signal ..." banner when the core loads
try:
    info = gdb.execute("info signals", to_string=True)
    print("=== SIGNALS ===")
    print(info[:500])  # First 500 chars
except gdb.error:
    pass
# Examine a specific address
try:
    mem = gdb.execute("x/16xb $rsp", to_string=True)
    print("=== STACK TOP ===")
    print(mem)
except gdb.error:
    pass
Key GDB Python Functions
| Function | Description |
|---|---|
| gdb.execute(cmd) | Run a GDB command |
| gdb.execute(cmd, to_string=True) | Run command, capture output as string |
| gdb.parse_and_eval(expr) | Evaluate expression, return GDB Value |
| gdb.selected_frame() | Get current stack frame |
| gdb.selected_thread() | Get current thread |
| gdb.inferiors() | List of debugged programs |
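These calls compose naturally. Below is a minimal sketch that combines them to print a compact per-thread backtrace; the file name walk_threads.py is just an example, and it is invoked the same way as analyze.py above (gdb -q --batch -x walk_threads.py ./program core).
# walk_threads.py - iterate threads and frames from inside GDB (illustrative sketch)
import gdb

gdb.execute("set pagination off")

for inferior in gdb.inferiors():          # a core dump has a single inferior
    for thread in inferior.threads():
        thread.switch()                   # make this thread the selected one
        print(f"--- Thread {thread.num} ---")
        frame = gdb.newest_frame()        # innermost frame of the selected thread
        depth = 0
        while frame is not None:
            name = frame.name() or "??"   # None when symbols are missing
            print(f"  #{depth} {name}")
            frame = frame.older()         # walk outward toward main()
            depth += 1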
Crash Fingerprinting
To deduplicate crashes (identify unique bugs vs. repeat occurrences), you need a stable “fingerprint.” A common approach:
- Extract the top 3-5 frames of the backtrace
- Normalize addresses to function names (or offset within function)
- Hash the result
import hashlib
def get_crash_fingerprint(bt_output):
"""Generate a fingerprint from a backtrace."""
lines = bt_output.strip().split('\n')
# Take first 5 frames
frames = []
for line in lines[:5]:
# Extract function name (simplified)
if ' in ' in line:
func = line.split(' in ')[1].split(' ')[0]
frames.append(func)
fingerprint = '|'.join(frames)
return hashlib.md5(fingerprint.encode()).hexdigest()[:16]
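A quick usage check, with a made-up backtrace string, shows that identical call chains always produce the same fingerprint:
sample_bt = """#0  0x00007f3a in __strcmp_avx2 () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x000055e1 in check_operation (priv=0x55e2 "") at memory-mysteries.c:89
#2  0x000055e3 in do_admin_thing () at memory-mysteries.c:134
#3  0x000055e4 in main (argc=2, argv=0x7ffd) at memory-mysteries.c:150"""

print(get_crash_fingerprint(sample_bt))
# Prints a 16-hex-character hash; the same top frames always hash to the
# same value, so repeat crashes of one bug collapse to one fingerprint.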
Building an Analysis Pipeline
A complete automation pipeline might look like:
┌──────────────────────────────────────────────────────────────────┐
│ CRASH ANALYSIS PIPELINE │
├──────────────────────────────────────────────────────────────────┤
│ 1. Core dump arrives (via core_pattern pipe or file watch) │
│ 2. Python script invokes GDB with analysis script │
│ 3. GDB script extracts: │
│ - Backtrace (all threads) │
│ - Registers │
│ - Signal info │
│ - Key variables (if symbols available) │
│ 4. Generate fingerprint from backtrace │
│ 5. Store results: │
│ - Raw core dump (compressed) │
│ - Analysis report (JSON/text) │
│ - Fingerprint for deduplication │
│ 6. Alert if new unique crash or high-severity │
└──────────────────────────────────────────────────────────────────┘
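A minimal sketch of steps 2-5 of this pipeline, assuming plain core files already on disk and a deliberately simple JSON layout (the script name triage.py and the report fields are illustrative, not a fixed format):
# triage.py - minimal pipeline sketch: run GDB in batch mode on a core,
# extract a backtrace, fingerprint it, and write a small JSON report.
import hashlib
import json
import subprocess
import sys

def analyze(executable, core_path):
    result = subprocess.run(
        ["gdb", "-q", "--batch",
         "-ex", "set pagination off",
         "-ex", "thread apply all bt",
         executable, core_path],
        capture_output=True, text=True, timeout=120)
    bt = result.stdout
    # Reuse the fingerprint idea from above: top function names, hashed
    frames = [line.split(" in ")[1].split(" ")[0]
              for line in bt.splitlines() if " in " in line][:5]
    fingerprint = hashlib.md5("|".join(frames).encode()).hexdigest()[:16]
    return {"executable": executable, "core": core_path,
            "fingerprint": fingerprint, "top_frames": frames, "backtrace": bt}

if __name__ == "__main__":
    report = analyze(sys.argv[1], sys.argv[2])
    with open(report["fingerprint"] + ".json", "w") as out:
        json.dump(report, out, indent=2)
    print(report["fingerprint"], report["top_frames"])
A real deployment would add the thread, register, and signal extraction shown earlier with the GDB Python API and feed the fingerprint into whatever deduplication store you use.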
How This Fits in Projects
- Project 4: Build an automated crash triage tool
- Project 10: Create a complete crash reporting system
Key Insights
Automation transforms crash analysis from a manual, ad-hoc process into a systematic pipeline that can handle production scale.
Summary
GDB batch mode runs commands from files. GDB Python API provides programmatic access via gdb.execute() and gdb.parse_and_eval(). Crash fingerprinting uses backtrace frames to identify unique bugs. Automation enables production-scale crash analysis.
Glossary
| Term | Definition |
|---|---|
| Core Dump | A file containing a snapshot of a process’s memory and CPU state at termination |
| Backtrace | The call stack showing the sequence of function calls leading to the current point |
| SIGSEGV | Signal 11, Segmentation Fault—raised when a process accesses invalid memory |
| ELF | Executable and Linkable Format—the binary format for Linux executables and core dumps |
| PT_NOTE | ELF program header type containing metadata (registers, signal, file mappings) |
| PT_LOAD | ELF program header type containing actual memory contents |
| NT_PRSTATUS | Note type containing process status and registers at crash time |
| Debug Symbols | Metadata linking binary addresses to source code (file/line/variable names) |
| ulimit | Shell command to set resource limits, including core dump size (ulimit -c) |
| core_pattern | Kernel parameter (/proc/sys/kernel/core_pattern) controlling dump location |
| systemd-coredump | Modern Linux daemon that captures and manages core dumps |
| coredumpctl | Command-line tool to list and debug systemd-managed core dumps |
| GDB | GNU Debugger—the primary tool for analyzing core dumps on Linux |
| Post-Mortem Debugging | Analyzing a crashed program from its core dump after the fact |
| Stack Frame | A section of the stack containing a function’s local variables and return address |
| kdump | Kernel crash dump mechanism using kexec to boot a capture kernel |
| vmcore | The memory dump file created by kdump after a kernel panic |
| crash | The utility for analyzing kernel crash dumps (vmcore files) |
| kexec | Mechanism to boot into a new kernel without going through firmware |
| Minidump | A compact crash dump format used by Breakpad/Crashpad |
| Symbolication | The process of resolving memory addresses to function/line names |
| Fingerprint | A hash identifying a unique crash type for deduplication |
| Data Race | A bug where two threads access shared data without synchronization |
| LWP | Light Weight Process—another term for a thread’s kernel identifier |
Why Linux Crash Dump Analysis Matters
Modern Relevance
Crash analysis is a fundamental skill for anyone working with systems software. Consider:
- Cloud infrastructure runs millions of processes. When one crashes, operators need fast root cause analysis
- IoT and embedded systems often can’t be debugged live; crash dumps are the only evidence
- Security researchers analyze crashes to find vulnerabilities
- SRE/DevOps teams need to correlate crashes with deployments and load patterns
According to Red Hat’s documentation, the crash utility and kernel debug symbols are essential tools for diagnosing kernel issues in enterprise Linux environments.
Real-World Statistics
- Large-scale services may see thousands of process crashes daily (Facebook’s analysis infrastructure processes millions of crash reports)
- Kernel panics, while rarer, can cause significant outages (each minute of downtime can cost enterprises $5,600 on average per Gartner estimates)
- The average time to diagnose a crash without proper tooling can be hours; with proper crash analysis, minutes
Context and Evolution
Core dumps have existed since the earliest Unix systems (1970s). The name “core” comes from magnetic core memory. While the underlying technology has evolved (from simple memory snapshots to ELF-formatted files with metadata), the concept remains: preserve the state for later analysis.
Modern developments include:
- systemd-coredump (2012+) for centralized management
- Breakpad/Crashpad (Google) for cross-platform minidumps
- Sentry, Bugsnag, Crashlytics for cloud-based crash aggregation
- eBPF for live system analysis (complementing post-mortem)
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Core Dump Fundamentals | Core dumps are ELF files capturing process memory + CPU state. Generation requires ulimit -c and core_pattern. Modern systems use systemd-coredump. |
| ELF Core Format | PT_NOTE segments hold metadata (registers in NT_PRSTATUS, files in NT_FILE). PT_LOAD segments hold memory. No section headers. |
| GDB Post-Mortem | Load with gdb <exe> <core>. Key commands: bt, frame, print, x, info registers. Debug symbols (-g) are essential. |
| Multi-Threaded Analysis | All threads captured in core. Use info threads, thread N, thread apply all bt. Crashing thread shows symptom, cause may be elsewhere. |
| Kernel Crash Analysis | kdump uses kexec to boot capture kernel. vmcore analyzed with crash utility. Requires crashkernel reservation and debug symbols. |
| Automation | GDB batch mode and Python API enable scripted analysis. Fingerprinting identifies unique crashes. Scales to production. |
Project-to-Concept Map
| Project | Concepts Applied |
|---|---|
| Project 1: The First Crash | Core Dump Fundamentals |
| Project 2: The GDB Backtrace | GDB Post-Mortem, Debug Symbols |
| Project 3: The Memory Inspector | GDB Post-Mortem, ELF Core Format |
| Project 4: Automated Crash Detective | Automation, GDB Python API |
| Project 5: Multi-threaded Mayhem | Multi-Threaded Analysis |
| Project 6: Stripped Binary Crash | ELF Core Format, GDB without symbols |
| Project 7: Minidump Parser | ELF Core Format (similar concepts) |
| Project 8: Kernel Panics | Kernel Crash Analysis, kdump |
| Project 9: Analyzing with crash | Kernel Crash Analysis, crash utility |
| Project 10: Centralized Reporter | Automation, Fingerprinting, All concepts |
Deep Dive Reading by Concept
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Core Dumps | “The Linux Programming Interface” Ch. 22 | Definitive coverage of signals and process termination |
| ELF Format | “Practical Binary Analysis” Ch. 2-3 | Deep dive into ELF structure |
| GDB Basics | “The Art of Debugging with GDB” Ch. 1-4 | Foundational GDB skills |
| Memory Layout | “Computer Systems: A Programmer’s Perspective” Ch. 7, 9 | Understanding virtual memory |
| Multi-threading | “The Linux Programming Interface” Ch. 29-30 | POSIX threads and synchronization |
| Kernel Internals | “Linux Kernel Development” by Robert Love | For kernel panic analysis |
| Kernel Modules | “Linux Device Drivers, 3rd Ed” Ch. 1-2 | Writing and debugging modules |
Quick Start: Your First 48 Hours
Day 1: Foundation (4-6 hours)
- Read Theory Primer Chapters 1-3 (2 hours)
- Core Dump Fundamentals
- ELF Core Format (overview)
- GDB Post-Mortem
- Complete Project 1 (2 hours)
- Configure your system for core dumps
- Verify dumps are generated
- Definition of Done: file core* shows ELF core file
- Start Project 2 (1-2 hours)
- Load your first core dump in GDB
- Get a backtrace with line numbers
Day 2: Practical Skills (4-6 hours)
- Finish Project 2 (1 hour)
- Compare with/without debug symbols
- Definition of Done: Can identify crash location by file:line
- Start Project 3 (3-4 hours)
- Create a memory corruption scenario
- Practice print and x commands
- Inspect variables across stack frames
- Review and Practice (1 hour)
- Re-do the GDB homework exercises
- Take notes on commands you find useful
Recommended Learning Paths
Path 1: The Developer (Focus on User-Space)
If you’re a software developer wanting to debug your own applications:
- Project 1 → Project 2 → Project 3 (Weeks 1-2)
- Project 4 (Automation) → Project 5 (Multi-threaded) (Weeks 3-4)
- Project 6 (Stripped binaries) if you work with releases (Week 5)
- Skip kernel projects unless needed
Path 2: The SRE/DevOps Engineer
If you’re managing infrastructure and need to triage crashes:
- Project 1 → Project 2 (quick start) (Week 1)
- Project 4 (Automation - your main tool) (Week 2)
- Project 8 → Project 9 (Kernel crashes) (Weeks 3-4)
- Project 10 (Build your own crash collection) (Weeks 5-6)
Path 3: The Security Researcher
If you’re analyzing crashes for vulnerabilities:
- Project 1 → Project 2 → Project 3 (Weeks 1-2)
- Project 6 (Stripped binaries - common in targets) (Week 3)
- Project 7 (Minidump parsing) (Week 4)
- Deep study of ELF format and memory corruption patterns
Success Metrics
You’ve mastered this material when you can:
- Configure any Linux system for core dump capture in under 5 minutes
- Get a backtrace from a core dump and identify the crash location
- Navigate stack frames and inspect variables/memory in GDB
- Write a Python script that automates basic crash triage
- Analyze a multi-threaded crash and identify cross-thread issues
- Work with stripped binaries using disassembly
- Configure kdump and trigger a test kernel panic (in a VM)
- Use the crash utility to analyze a vmcore
- Explain the ELF core format and what each section contains
- Design a crash reporting pipeline for a production system
Project Overview Table
| # | Project | Difficulty | Time | Key Skill |
|---|---|---|---|---|
| 1 | The First Crash | Beginner | 4-8h | System configuration |
| 2 | The GDB Backtrace | Beginner | 4-8h | Basic GDB workflow |
| 3 | The Memory Inspector | Intermediate | 10-15h | Memory examination |
| 4 | Automated Crash Detective | Intermediate | 15-20h | GDB scripting |
| 5 | Multi-threaded Mayhem | Advanced | 15-20h | Thread analysis |
| 6 | Stripped Binary Crash | Advanced | 15-20h | Disassembly |
| 7 | Minidump Parser | Advanced | 20-30h | Binary parsing |
| 8 | Kernel Panics | Expert | 20-30h | Kernel modules |
| 9 | Analyzing with crash | Expert | 15-20h | Kernel debugging |
| 10 | Centralized Reporter | Master | 40+h | System design |
Project List
The following 10 projects guide you from your first intentional crash to building production-grade crash analysis infrastructure.
Project 1: The First Crash — Understanding Core Dump Generation
- File: P01-first-crash-core-dump-generation.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust (for comparison)
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Systems Configuration, Process Signals
- Software or Tool: ulimit, systemd-coredump, coredumpctl
- Main Book: “The Linux Programming Interface” by Michael Kerrisk
What you will build: A controlled environment that generates, captures, and verifies core dumps from intentional crashes, along with a configuration script that sets up any Linux system for crash capture.
Why it teaches crash dump analysis: Before you can analyze a crash, you must reliably capture one. This project forces you to understand how the kernel decides whether to dump core, where it writes the dump, and how modern Linux (systemd) has changed the traditional core.PID pattern. You’ll learn by intentionally breaking things and verifying the evidence is preserved.
Core challenges you will face:
- Configuring ulimit correctly → Maps to Core Dump Fundamentals (soft vs hard limits, shell vs process)
- Understanding core_pattern → Maps to systemd-coredump integration
- Choosing storage location → Maps to real-world deployment considerations
- Triggering different crash types → Maps to understanding signals (SIGSEGV, SIGABRT, SIGFPE)
Real World Outcome
You will have a shell script that configures any Linux system for core dump capture and a test program that crashes in multiple ways. Running the test will produce visible, verifiable core dumps.
Example Output:
$ ./setup-coredumps.sh
[+] Checking current ulimit -c: 0
[+] Setting ulimit -c unlimited for current shell
[+] Current core_pattern: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
[+] systemd-coredump is active — using coredumpctl
[+] Configuration complete!
$ ./crash-test segfault
[*] About to trigger: SIGSEGV (Segmentation Fault)
[*] Dereferencing NULL pointer...
Segmentation fault (core dumped)
$ coredumpctl list
TIME PID UID GID SIG COREFILE EXE SIZE
Sat 2025-01-04 10:23:45 UTC 1234 1000 1000 SIGSEGV present /home/user/crash-test 245.2K
$ ./crash-test abort
[*] About to trigger: SIGABRT (Abort)
[*] Calling abort()...
Aborted (core dumped)
$ coredumpctl list | tail -2
Sat 2025-01-04 10:23:45 UTC 1234 1000 1000 SIGSEGV present /home/user/crash-test 245.2K
Sat 2025-01-04 10:24:01 UTC 1235 1000 1000 SIGABRT present /home/user/crash-test 246.1K
$ coredumpctl dump 1234 -o /tmp/core.test
$ file /tmp/core.test
/tmp/core.test: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from './crash-test segfault'
The Core Question You Are Answering
“Where does my crashed process’s memory go, and how do I make sure the kernel actually writes it?”
Before writing any code, understand that core dumps are not automatic. The kernel checks multiple conditions: Is RLIMIT_CORE non-zero? Is the executable setuid? Does the process have permission to write to the dump location? Is there enough disk space? Modern systems add another layer: systemd-coredump intercepts dumps before they hit the filesystem. You need to understand this pipeline before you can debug anything.
Concepts You Must Understand First
- Resource Limits (ulimit)
- What is the difference between soft and hard limits?
- Why does
ulimit -c unlimited in a script not affect programs started afterward? - How do you set permanent limits via
/etc/security/limits.conf? - Book Reference: “The Linux Programming Interface” by Kerrisk — Ch. 36
- Signals and Process Termination
- Which signals generate core dumps by default (SIGQUIT, SIGILL, SIGABRT, SIGFPE, SIGSEGV, SIGBUS, SIGSYS, SIGTRAP)?
- How can you catch a signal vs letting it dump core?
- What happens when a signal handler re-raises the signal?
- Book Reference: “The Linux Programming Interface” by Kerrisk — Ch. 20-22
- systemd-coredump
- How does the kernel pipe core dumps to an external program?
- Where does systemd-coredump store files (
/var/lib/systemd/coredump/)? - How does compression (LZ4) affect storage and retrieval?
- Book Reference: systemd-coredump(8) and coredumpctl(1) man pages
Questions to Guide Your Design
- Configuration Detection
- How will your script detect if systemd-coredump is active vs traditional core files?
- What should happen on non-systemd systems (Alpine, older RHEL)?
- Crash Triggering
- How will you trigger each signal type (NULL deref for SIGSEGV, abort() for SIGABRT, 1/0 for SIGFPE)?
- Should you compile with or without optimizations? Why?
- Verification
- How will you verify the dump was actually created (not a zero-size file)?
- What fields in
coredumpctl info prove the dump is usable?
- Cleanup
- How do you remove old test dumps without affecting real crashes?
- What is the coredumpctl retention policy?
Thinking Exercise
Trace the Kernel Path
Before coding, trace what happens when a process dereferences NULL:
1. Process executes: *(int *)0 = 42;
2. CPU raises page fault (address 0 is not mapped)
3. Kernel's page fault handler runs
4. Handler finds no valid mapping → sends SIGSEGV to process
5. Process has no handler for SIGSEGV → default action is "dump core + terminate"
6. Kernel checks RLIMIT_CORE:
- If 0 → no dump, just terminate
- If >0 → proceed
7. Kernel reads /proc/sys/kernel/core_pattern:
- If starts with "|" → pipe to that program (systemd-coredump)
- Otherwise → write to file with that pattern
8. Kernel writes ELF core file with process memory + registers
9. Process terminates, parent receives SIGCHLD
Questions while tracing:
- At which step can you lose the dump?
- What if the pipe to systemd-coredump fails?
- How does the kernel know which memory regions to include?
The Interview Questions They Will Ask
- “A production service crashed but there’s no core dump. Walk me through how you would debug why.”
- “What is the difference between ulimit in .bashrc and /etc/security/limits.conf?”
- “How does systemd-coredump differ from traditional core dumps, and what are the tradeoffs?”
- “Why might a setuid program not generate a core dump even with ulimit -c unlimited?”
- “How would you configure core dumps on a container running in Kubernetes?”
Hints in Layers
Hint 1: Start with Detection
First, detect the current configuration before changing anything. Read /proc/sys/kernel/core_pattern and compare ulimit -c (soft) vs ulimit -Hc (hard).
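If you prefer to prototype the detection step in Python rather than shell, the same two checks look like this (a sketch; the script name is made up):
# check_coredump_config.py - report the two settings that gate core dumps
import resource

def show(v):
    return "unlimited" if v == resource.RLIM_INFINITY else str(v)

soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print(f"RLIMIT_CORE  soft={show(soft)}  hard={show(hard)}")

with open("/proc/sys/kernel/core_pattern") as fh:
    pattern = fh.read().strip()
print(f"core_pattern: {pattern}")

if pattern.startswith("|"):
    print("Dumps are piped to a helper program (systemd-coredump on most distros)")
else:
    print("Dumps are written as files matching the pattern above")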
Hint 2: Handle Both Modes
Your script should work on both systemd and non-systemd systems. Use which coredumpctl or check for the |/usr/lib/systemd/systemd-coredump pattern to detect the mode.
Hint 3: Test Program Structure
main(argc, argv):
if argc < 2:
print usage: "./crash-test [segfault|abort|fpe|bus]"
exit
switch argv[1]:
case "segfault":
print "Triggering SIGSEGV..."
ptr = NULL
*ptr = 42 // crash here
case "abort":
print "Triggering SIGABRT..."
abort()
case "fpe":
print "Triggering SIGFPE..."
volatile x = 0
y = 1 / x // crash here
case "bus":
print "Triggering SIGBUS..."
// Requires unaligned access on strict architectures
Hint 4: Verification Commands Use these commands to verify your setup:
# Check if dump was created (systemd)
coredumpctl list | head -5
# Extract and verify
coredumpctl dump <PID> -o /tmp/test.core
file /tmp/test.core # Should show "ELF 64-bit LSB core file"
# Check size (should be non-zero)
ls -la /tmp/test.core
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Resource Limits | “The Linux Programming Interface” by Kerrisk | Ch. 36: Process Resources |
| Signals | “The Linux Programming Interface” by Kerrisk | Ch. 20-22: Signals |
| Core Dumps | “The Linux Programming Interface” by Kerrisk | Ch. 22.1: Core Dump Files |
| Practical GDB | “The Art of Debugging” by Matloff & Salzman | Ch. 1: Getting Started |
Common Pitfalls and Debugging
Problem 1: “ulimit -c unlimited had no effect”
- Why: ulimit only affects the current shell and its children. Running
sudo ulimit -c unlimited doesn't help: ulimit is a shell builtin, so it either fails or only changes a short-lived root shell. - Fix: Add limit to
/etc/security/limits.conf for persistent change, or run ulimit in the same shell that launches the process. - Quick test:
bash -c 'ulimit -c unlimited; ./crash-test segfault'
Problem 2: “Core dump was created but has 0 bytes”
- Why: The process may have died before the dump completed, or filesystem is full.
- Fix: Check
dmesg | grep -i core for kernel messages. Verify disk space with df -h. - Quick test:
dmesg | tail -20
Problem 3: “coredumpctl shows ‘missing’ in COREFILE column”
- Why: systemd-coredump may have retention policies that deleted old dumps, or the dump failed during capture.
- Fix: Check
/etc/systemd/coredump.conf for MaxUse= and KeepFree= settings. - Quick test:
coredumpctl info <PID> for detailed error messages
Problem 4: “Crash happens but no ‘core dumped’ message”
- Why: Shell might not report core dump status, or signal was caught.
- Fix: Check exit status:
./crash-test segfault; echo "Exit: $?"(139 = 128+11 = SIGSEGV with core) - Quick test: Exit code 139 (SIGSEGV) or 134 (SIGABRT) indicates dump should have occurred
Definition of Done
- Setup script detects and reports current core dump configuration
- Setup script configures ulimit and verifies core_pattern
- Test program triggers at least 3 different signal types (SIGSEGV, SIGABRT, SIGFPE)
- Each crash produces a verifiable core dump (non-zero size, correct ELF type)
coredumpctl listorls core.*shows all test crashes- Script works on at least 2 different Linux distributions (Ubuntu, Fedora, or similar)
- Documentation explains the differences between systemd and traditional core patterns
Project 2: The GDB Backtrace — Extracting Crash Context
- File: P02-gdb-backtrace-crash-context.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Debugging, Stack Traces
- Software or Tool: GDB, debug symbols (-g flag)
- Main Book: “The Art of Debugging with GDB, DDD, and Eclipse” by Matloff & Salzman
What you will build: A debugging workflow that extracts meaningful crash information from core dumps using GDB. You will create programs that crash in various ways and practice the essential GDB commands to understand exactly what went wrong.
Why it teaches crash dump analysis: The backtrace is your first tool when analyzing any crash. But a raw backtrace is often useless without understanding stack frames, argument values, and local variables. This project teaches you to navigate from “Segmentation fault” to “Line 47 of foo.c passed NULL to bar()”.
Core challenges you will face:
- Loading core dumps correctly → Maps to GDB’s
core-file command and executable matching - Reading backtraces with symbols → Maps to understanding compile flags (-g, -O)
- Navigating stack frames → Maps to frame selection and variable inspection
- Comparing with/without debug symbols → Maps to real-world debugging constraints
Real World Outcome
You will have a debugging toolkit that demonstrates GDB’s core dump analysis capabilities. Running your workflow will produce clear, actionable crash information.
Example Output:
$ ./demo-crashes linked-list-corruption
[*] Creating a linked list with 5 nodes
[*] Corrupting node 3's next pointer with garbage
[*] Traversing list (will crash)...
Segmentation fault (core dumped)
$ gdb ./demo-crashes core.12345
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
...
Reading symbols from ./demo-crashes...
[New LWP 12345]
Core was generated by `./demo-crashes linked-list-corruption'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000555555555249 in traverse_list (head=0x5555555592a0) at demo-crashes.c:47
47 printf("Node value: %d\n", current->value);
(gdb) bt
#0 0x0000555555555249 in traverse_list (head=0x5555555592a0) at demo-crashes.c:47
#1 0x0000555555555456 in test_linked_list_corruption () at demo-crashes.c:89
#2 0x00005555555556b2 in main (argc=2, argv=0x7fffffffde68) at demo-crashes.c:142
(gdb) frame 0
#0 0x0000555555555249 in traverse_list (head=0x5555555592a0) at demo-crashes.c:47
47 printf("Node value: %d\n", current->value);
(gdb) print current
$1 = (struct Node *) 0xdeadbeef
(gdb) print *current
Cannot access memory at address 0xdeadbeef
(gdb) info locals
head = 0x5555555592a0
current = 0xdeadbeef
count = 3
(gdb) # The bug: after 3 iterations, current became 0xdeadbeef (our corruption)
(gdb) # Root cause: someone wrote 0xdeadbeef to node[2]->next
The Core Question You Are Answering
“The program crashed—but WHERE exactly, and with WHAT state?”
A crash report saying “Segmentation fault” tells you almost nothing. You need: (1) the exact line of code, (2) the call chain that led there, (3) the values of relevant variables, and (4) the state of memory at the crash point. GDB gives you all of this—if you know the commands.
Concepts You Must Understand First
- Debug Symbols (-g flag)
- What information does
-gembed in the binary? - How does DWARF format map addresses to source lines?
- Why can you still get a backtrace without symbols, just without names?
- Book Reference: “Practical Binary Analysis” by Andriesse — Ch. 2
- What information does
- Stack Frames
- What is stored in each stack frame (return address, saved registers, locals)?
- How does GDB number frames (0 is current, higher is caller)?
- What is the frame pointer (RBP) and how does it help?
- Book Reference: “Computer Systems: A Programmer’s Perspective” by Bryant — Ch. 3.7
- GDB Core Commands
- bt (backtrace) — shows call stack
- frame N — select stack frame N
- info locals — show local variables in current frame
- info args — show function arguments
- print <expr> — evaluate and print expression
- Book Reference: “The Art of Debugging” by Matloff — Ch. 2-3
Questions to Guide Your Design
- Test Case Design
- What crash scenarios will you create (NULL pointer, buffer overflow, use-after-free)?
- How will you ensure the crash happens at a predictable location?
- Symbol Comparison
- How will you demonstrate the difference between
-gand no-g? - What about
-gwith optimization (-O2 -g)?
- How will you demonstrate the difference between
- Variable Inspection
- How will you show variables at different stack depths?
- What happens when you try to print an optimized-out variable?
- Documentation
- What GDB commands will you document for your reference?
- How will you create a “cheat sheet” for common scenarios?
Thinking Exercise
Trace the Stack
Given this call chain that crashes:
void foo(int *p) { *p = 42; } // Crash here if p is NULL
void bar(int x) { if (x < 0) foo(NULL); else { int y = x; foo(&y); } }
void baz() { bar(-1); }
int main() { baz(); return 0; }
Draw the stack at crash time:
High addresses
┌─────────────────┐
│ main's frame │ ← Frame 3
│ (return addr) │
├─────────────────┤
│ baz's frame │ ← Frame 2
│ (return addr) │
├─────────────────┤
│ bar's frame │ ← Frame 1
│ x = -1 │
│ (return addr) │
├─────────────────┤
│ foo's frame │ ← Frame 0 (current)
│ p = NULL │
│ crash point │
└─────────────────┘
Low addresses
Questions while tracing:
- If you only see frame 0, how do you know the real bug is in bar()?
- What would
info argsshow in frame 1? - How would this look different without debug symbols?
The Interview Questions They Will Ask
- “You have a core dump from a production crash. Walk me through your first 5 GDB commands.”
- “What is the difference between
btandbt full?” - “How can you tell if a crash is a NULL pointer dereference vs a use-after-free from the backtrace?”
- “The backtrace shows
??instead of function names. What does that mean and how do you fix it?” - “How do you inspect a variable that GDB says is ‘optimized out’?”
- “What is the frame pointer, and why do some binaries omit it?”
Hints in Layers
Hint 1: Create Multiple Crash Types Start with 3-4 distinct crash scenarios: NULL dereference, stack buffer overflow, heap corruption, double-free. Each teaches different debugging patterns.
Hint 2: Compile Two Versions
Always compile the same source twice: gcc -g -O0 (debug) and gcc -O2 (release). Compare the backtrace quality to understand why debug symbols matter.
Hint 3: Build a Command Reference
Essential GDB Commands for Core Dump Analysis:
- bt : Show backtrace (call stack)
- bt full : Backtrace with local variables
- frame N : Switch to frame N
- up / down : Move one frame up/down
- info locals : Show local variables
- info args : Show function arguments
- print VAR : Print variable value
- print *PTR : Dereference pointer
- print PTR@10 : Print array of 10 elements
- x/10x ADDR : Examine 10 hex words at address
- list : Show source around current line
- info registers : Show CPU registers
Hint 4: Automate with GDB Batch Mode
# Run GDB commands non-interactively
gdb -batch -ex "bt" -ex "info locals" ./program core.123
# Save to file
gdb -batch -ex "bt full" ./program core.123 > crash-report.txt
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| GDB Basics | “The Art of Debugging” by Matloff & Salzman | Ch. 1-4 |
| Stack Frames | “CSAPP” by Bryant & O’Hallaron | Ch. 3.7: Procedures |
| Debug Symbols | “Practical Binary Analysis” by Andriesse | Ch. 2: ELF Format |
| Memory Layout | “CSAPP” by Bryant & O’Hallaron | Ch. 9: Virtual Memory |
Common Pitfalls and Debugging
Problem 1: “GDB says ‘no debugging symbols found’“
- Why: The binary was compiled without
-g, or symbols were stripped. - Fix: Recompile with
gcc -g or locate a debug symbol package (e.g., -dbgsym packages on Ubuntu). - Quick test:
file ./program should say “with debug_info” if symbols present
Problem 2: “Backtrace shows ‘??’ for function names”
- Why: GDB can’t map addresses to symbols. Either wrong executable or missing symbols.
- Fix: Ensure the executable matches the core dump (same build). Use
info shared to check shared library symbols. - Quick test:
readelf -s ./program | head -20 should show symbol table
Problem 3: “Variable shows as ‘optimized out’“
- Why: Compiler optimized away the variable (stored in register, inlined, etc.)
- Fix: Recompile with
-O0 for debugging, or use info registers to find register values. - Quick test:
gcc -g -O0 should preserve all variables
Problem 4: “Core file doesn’t match executable”
- Why: The binary was rebuilt after the crash, changing addresses.
- Fix: Keep the exact binary that crashed alongside the core dump. Use version control or preserve binaries with each release.
- Quick test:
file core.123 shows the original executable path
Definition of Done
- Created at least 4 different crash scenarios (NULL, overflow, use-after-free, etc.)
- Can load core dump in GDB and get full backtrace with symbols
- Documented the difference in backtrace quality with and without
-g - Can navigate frames and inspect variables at each level
- Created a GDB command cheat sheet with examples
- Can use GDB batch mode to generate crash reports non-interactively
- Can explain what each line of a backtrace means
Project 3: The Memory Inspector — Deep State Examination
- File: P03-memory-inspector-deep-state.md
- Main Programming Language: C
- Alternative Programming Languages: C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Memory Layout, Debugging, Forensics
- Software or Tool: GDB, hexdump, /proc filesystem
- Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron
What you will build: A memory forensics toolkit that goes beyond backtraces to examine heap state, corrupted data structures, and memory patterns. You will create programs with subtle memory bugs and use GDB’s memory examination commands to find root causes that backtraces alone can’t reveal.
Why it teaches crash dump analysis: Many crashes don’t happen at the bug—they happen later when corrupted data is used. A backtrace shows you where it died, but memory inspection shows you what went wrong. This is the difference between “it crashed in strcmp()” and “someone wrote past the end of the username buffer 200 lines earlier.”
Core challenges you will face:
- Examining raw memory → Maps to GDB’s
x command and format specifiers - Finding corruption patterns → Maps to recognizing freed memory, stack canaries, guard bytes
- Tracing data structure state → Maps to following pointers and understanding layout
- Correlating addresses with regions → Maps to stack vs heap vs data vs code identification
Real World Outcome
You will have a collection of memory forensics scenarios and the skills to investigate corruption that causes delayed crashes.
Example Output:
$ ./memory-mysteries heap-overflow
[*] Allocating two adjacent buffers
[*] Buffer A: 32 bytes for user input
[*] Buffer B: 32 bytes for privilege level (should be "user")
[*] Overwriting Buffer A with 48 bytes (oops, 16 byte overflow)
[*] Checking privilege level...
[!] Unexpected privilege: 'admin' (Buffer B was corrupted!)
[*] Attempting privileged operation...
Segmentation fault (core dumped)
$ gdb ./memory-mysteries core.5678
(gdb) bt
#0 0x00007ffff7c9b152 in __strcmp_avx2 () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x000055555555543a in check_operation (priv=0x5555555596d0 "") at memory-mysteries.c:89
#2 0x0000555555555678 in do_admin_thing () at memory-mysteries.c:134
...
(gdb) frame 1
#1 0x000055555555543a in check_operation (priv=0x5555555596d0 "") at memory-mysteries.c:89
(gdb) print priv
$1 = 0x5555555596d0 ""
(gdb) x/32xb 0x5555555596d0
0x5555555596d0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5555555596d8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5555555596e0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x5555555596e8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
(gdb) # Buffer B is all zeros — was overwritten with null bytes from overflow!
(gdb) # Let's look at Buffer A (32 bytes before)
(gdb) x/48xb 0x5555555596d0-32
0x5555555596b0: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41 <- Buffer A "AAAA..."
0x5555555596b8: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0x5555555596c0: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41
0x5555555596c8: 0x41 0x41 0x41 0x41 0x41 0x41 0x41 0x41 <- End of A, start of B
0x5555555596d0: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 <- Buffer B (corrupted)
0x5555555596d8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
(gdb) # Aha! Buffer A was filled with 'A' (0x41) and the null terminator overflowed into B
(gdb) # The 48-byte string + null terminator overwrote B completely
The Core Question You Are Answering
“The crash site is just the symptom—where is the actual BUG?”
Most memory corruption bugs don’t crash immediately. A heap overflow corrupts an adjacent allocation that isn’t used until minutes later. A use-after-free might work 99 times because the memory hasn’t been reused. The backtrace tells you where the train derailed; memory inspection tells you where the track was sabotaged.
Concepts You Must Understand First
- Process Memory Layout
- Where are stack, heap, data, and code segments?
- How do you identify which region an address belongs to?
- What do typical stack addresses vs heap addresses look like?
- Book Reference: “CSAPP” by Bryant — Ch. 9.7-9.8
- GDB Memory Examination
x/NFU ADDR: N=count, F=format (x,d,c,s,i), U=unit (b,h,w,g)- Common patterns:
x/20xb (20 hex bytes), x/s (string), x/10i (10 instructions) - How to examine memory relative to variables:
x/16xb &buffer - Book Reference: “The Art of Debugging” by Matloff — Ch. 3
- Heap Internals Basics
- How does malloc track allocations (size, flags in chunk headers)?
- What patterns indicate freed memory (0xdeadbeef, 0xfeeefeee, etc.)?
- What is the “red zone” and how does Address Sanitizer use it?
- Book Reference: “Secure Coding in C and C++” by Seacord — Ch. 4
- Data Structure Layout
- How does the compiler arrange struct fields in memory?
- What is padding and alignment?
- How do you correlate
offsetof()with memory examination? - Book Reference: “CSAPP” by Bryant — Ch. 3.9
Questions to Guide Your Design
- Scenario Selection
- What memory bugs will you demonstrate (heap overflow, use-after-free, stack buffer overflow, uninitialized memory)?
- How will you make the corruption visible but not immediately fatal?
- Memory Patterns
- What recognizable patterns will you use (0x41 for ‘A’, 0xDEADBEEF, specific strings)?
- How will you demonstrate that the corruption pattern reveals the source?
- Region Identification
- How will you teach identifying stack vs heap addresses?
- Will you show how to use
/proc/PID/maps or info proc mappings?
- Forensic Workflow
- What systematic approach will you document for investigating corruption?
- How do you trace back from corrupted memory to the corrupting code?
Thinking Exercise
Map the Memory Regions
Given this GDB output, identify each address’s region:
(gdb) print &argc
$1 = (int *) 0x7fffffffde04
(gdb) print buffer
$2 = 0x5555555592a0 "Hello"
(gdb) print main
$3 = {int (int, char **)} 0x555555555169
(gdb) print &global_counter
$4 = (int *) 0x555555558010
Questions:
- Which address is on the stack? (Hint: 0x7fff… is high memory)
- Which address is heap? (Hint: dynamically allocated, similar to code addresses)
- Which is code? (Hint: same region as program counter)
- Which is data segment? (Hint: near code but different section)
Typical x86-64 Linux layout:
0x7fff_xxxx_xxxx Stack (grows down)
0x7f00_xxxx_xxxx Shared libraries
...
0x5555_5555_9xxx Heap (grows up via brk; large mallocs use mmap near the shared libraries)
0x5555_5555_8xxx Data segment (.data, .bss)
0x5555_5555_5xxx Code segment (.text)
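If you would rather not eyeball the layout, a small GDB Python helper can answer the question directly from the core's mapping table (a sketch; the whereis command name is made up, and info proc mappings must be available for the core):
# whereis.py - source this inside GDB; defines a tiny command that reports
# which memory mapping contains a given address.
import gdb

class WhereIs(gdb.Command):
    """whereis ADDR: print the memory mapping that contains ADDR."""
    def __init__(self):
        super().__init__("whereis", gdb.COMMAND_USER)

    def invoke(self, arg, from_tty):
        addr = int(gdb.parse_and_eval(arg))
        maps = gdb.execute("info proc mappings", to_string=True)
        for line in maps.splitlines():
            parts = line.split()
            # mapping lines start with "0x<start> 0x<end> ..."
            if len(parts) >= 2 and parts[0].startswith("0x"):
                start, end = int(parts[0], 16), int(parts[1], 16)
                if start <= addr < end:
                    print(line.strip())
                    return
        print(f"0x{addr:x} is not in any listed mapping")

WhereIs()
Load it with source whereis.py, then try whereis $rsp or whereis buffer.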
The Interview Questions They Will Ask
- “How do you determine if a crash is caused by heap corruption vs stack corruption?”
- “Explain the GDB command
x/20xb $rsp— what does each part mean?” - “You see address 0xdeadbeef in a pointer. What does this typically indicate?”
- “How can you tell if memory was freed before being used?”
- “Walk me through debugging a crash where the backtrace shows the bug is in libc’s malloc.”
- “What is the difference between examining memory with
print vs x in GDB?”
Hints in Layers
Hint 1: Start with Known Patterns
Fill buffers with recognizable patterns before corruption: memset(buf, 'A', size) or fill with sequential bytes. When you see these patterns where they shouldn’t be, you’ve found your corruption.
Hint 2: Compare Before and After Create scenarios where you can examine memory state before and after the corruption. Use GDB breakpoints before the bug to capture “good” state, then compare with the core dump.
Hint 3: GDB Memory Examination Cheat Sheet
x/Nx ADDR - N hex words (4 bytes each)
x/Nxb ADDR - N hex bytes
x/Nxg ADDR - N hex giant words (8 bytes)
x/Ns ADDR - N strings
x/Ni ADDR - N instructions
x/Nc ADDR - N characters
Examples:
x/32xb &buffer - 32 bytes of buffer
x/s $rdi - String at first argument
x/10i $pc - 10 instructions at program counter
x/8xg $rsp - 8 stack slots (64-bit)
Hint 4: Use info Files for Memory Map
(gdb) info files
# Shows all loaded segments and their address ranges
(gdb) info proc mappings
# Shows /proc/PID/maps style output (if available from core)
(gdb) maintenance info sections
# Shows ELF sections with addresses
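The same examination can be scripted when a region is too large to scan by eye. Here is a sketch that uses gdb.selected_inferior().read_memory to look for long runs of a fill byte such as 0x41 ('A'); the function name and the example addresses are illustrative:
# findpattern.py - source inside GDB; scan a memory range for a repeated byte
import gdb

def find_runs(start, length, byte=0x41, min_run=16):
    """Print offsets where `byte` repeats at least `min_run` times in a row."""
    mem = bytes(gdb.selected_inferior().read_memory(start, length))
    run = 0
    for i, b in enumerate(mem):
        run = run + 1 if b == byte else 0
        if run == min_run:
            print(f"run of 0x{byte:02x} starting at 0x{start + i - min_run + 1:x}")

# Example (hypothetical addresses): scan 4 KiB around a suspect heap buffer
# (gdb) python find_runs(0x5555555596b0, 4096)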
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Memory Layout | “CSAPP” by Bryant & O’Hallaron | Ch. 9: Virtual Memory |
| GDB Memory Commands | “The Art of Debugging” by Matloff | Ch. 3: Inspecting Variables |
| Heap Internals | “Secure Coding in C and C++” by Seacord | Ch. 4: Dynamic Memory |
| Stack Frame Layout | “CSAPP” by Bryant & O’Hallaron | Ch. 3.7: Procedures |
| Binary Inspection | “Practical Binary Analysis” by Andriesse | Ch. 5: Binary Analysis Basics |
Common Pitfalls and Debugging
Problem 1: “Memory shows all zeros but program used this data”
- Why: Core dump may not include all memory pages (sparse dump). Or memory was freed.
- Fix: Check if systemd-coredump limits dump size. Use
coredumpctl info to see ProcessState. - Quick test:
coredumpctl info <PID> | grep "Size"— compare with expected memory usage
Problem 2: “Can’t tell stack from heap addresses”
- Why: Without context, addresses are just numbers. Need memory map.
- Fix: Use
info proc mappings in GDB or examine /proc/PID/maps from a live process. - Quick test: Stack addresses typically start with 0x7fff on x86-64 Linux
Problem 3: “Heap metadata looks corrupted”
- Why: Heap overflow overwrote malloc’s bookkeeping, causing strange chunk sizes.
- Fix: This is often the symptom, not the cause. Look for the buffer that overflowed into metadata.
- Quick test: Look for ASCII patterns (like 0x41414141) in chunk headers
Problem 4: “Same crash, different memory contents each run”
- Why: Address Space Layout Randomization (ASLR) changes addresses each run.
- Fix: Disable ASLR for debugging:
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space - Quick test: Re-enable after debugging! (echo 2 for full ASLR)
Definition of Done
- Created at least 3 memory corruption scenarios (heap overflow, use-after-free, uninitialized)
- Can identify memory region (stack/heap/data/code) from any address
- Documented the
x command with multiple format examples - Can trace corruption from crash site back to the bug location
- Demonstrated finding a recognizable pattern in unexpected memory
- Created a memory forensics workflow/checklist
- Can explain heap chunk metadata and what corruption looks like
Project 4: Automated Crash Detective — GDB Scripting and Python API
- File: P04-automated-crash-detective-gdb-scripting.md
- Main Programming Language: Python
- Alternative Programming Languages: GDB Command Scripts, Bash
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Automation, Scripting, Tooling
- Software or Tool: GDB Python API, gdb.Command, batch mode
- Main Book: “The Art of Debugging with GDB, DDD, and Eclipse” by Matloff & Salzman
What you will build: An automated crash analysis tool that takes a core dump and executable as input and produces a structured report with backtrace, local variables, thread state, and memory analysis—all without human interaction. You will learn GDB’s Python API to build reusable debugging automation.
Why it teaches crash dump analysis: Manual GDB debugging doesn’t scale. When you have 100 crashes per day, you need automation. This project teaches you to codify your debugging knowledge into scripts that run consistently and generate reports for further analysis or integration with crash aggregation systems.
Core challenges you will face:
- Learning GDB’s Python API → Maps to gdb.Command, gdb.Frame, gdb.Value
- Extracting structured data → Maps to parsing GDB output vs using API objects
- Handling edge cases → Maps to corrupted frames, missing symbols, stripped binaries
- Generating useful reports → Maps to what information developers actually need
Real World Outcome
You will have a Python-based crash analysis tool that generates JSON or Markdown reports from core dumps automatically.
Example Output:
$ ./crash-analyzer --core ./core.5678 --exe ./myserver --format json
{
"timestamp": "2025-01-04T10:30:45Z",
"executable": "/home/user/myserver",
"signal": "SIGSEGV",
"signal_code": 11,
"crashing_thread": 1,
"total_threads": 4,
"backtrace": [
{
"frame": 0,
"function": "process_request",
"file": "server.c",
"line": 234,
"args": {"req": "0x5555555596d0", "len": "1024"},
"locals": {"buffer": "0x7fffffffdd00", "i": "127"}
},
{
"frame": 1,
"function": "handle_client",
"file": "server.c",
"line": 189,
"args": {"client_fd": "5"}
},
{
"frame": 2,
"function": "main",
"file": "server.c",
"line": 312
}
],
"registers": {
"rip": "0x555555555abc",
"rsp": "0x7fffffffdd00",
"rbp": "0x7fffffffde10"
},
"analysis": {
"crash_type": "null_pointer_dereference",
"likely_cause": "req pointer was NULL at frame 0",
"stack_corrupted": false
}
}
$ ./crash-analyzer --core ./core.5678 --exe ./myserver --format markdown
# Crash Report: myserver
**Signal:** SIGSEGV (Segmentation Fault)
**Time:** 2025-01-04 10:30:45 UTC
**Threads:** 4 (crashed in thread 1)
## Backtrace
| Frame | Function | Location | Arguments |
|-------|----------|----------|-----------|
| 0 | process_request | server.c:234 | req=0x5555555596d0, len=1024 |
| 1 | handle_client | server.c:189 | client_fd=5 |
| 2 | main | server.c:312 | argc=1, argv=... |
## Analysis
**Likely Cause:** NULL pointer dereference at `req` parameter
**Recommendation:** Add null check before line 234
The Core Question You Are Answering
“How do I turn my manual debugging workflow into a repeatable, scriptable process?”
Every time you debug a crash, you run the same commands: bt, info locals, info threads. Why type them manually? A script can do it faster, more consistently, and produce structured output for downstream processing. This is the foundation of crash aggregation systems used by Google, Microsoft, and every major software company.
Concepts You Must Understand First
- GDB Batch Mode
- How does
gdb -batch -x script.gdbwork? - What is the difference between
-x(command file) and-ex(inline command)? - How do you capture output to a file?
- Book Reference: “The Art of Debugging” by Matloff — Ch. 7
- How does
- GDB Python API Fundamentals
- How do you load a Python script in GDB (
source,python-interactive)? - What are gdb.Frame, gdb.Value, gdb.Type, gdb.Symbol?
- How do you iterate stack frames programmatically?
- Book Reference: GDB Manual, Python API chapter (online)
- How do you load a Python script in GDB (
- Creating Custom GDB Commands
- How does the gdb.Command class work?
- How do you handle arguments in custom commands?
- How do you output structured data (JSON) from GDB?
- Book Reference: GDB Documentation — “Extending GDB with Python”
- Error Handling in GDB Scripts
- What happens when gdb.parse_and_eval() fails?
- How do you detect and handle missing debug symbols?
- How do you skip corrupted frames?
- Book Reference: Experience and experimentation
Questions to Guide Your Design
- Input/Output
- What input formats will you support (core file + exe, coredumpctl)?
- What output formats will you produce (JSON, Markdown, plain text)?
- How will you handle command-line arguments?
- Information Extraction
- What information is essential in every report (backtrace, signal, registers)?
- What optional information adds value (locals, threads, memory)?
- How deep should you go into memory inspection?
- Error Handling
- What if the executable doesn’t match the core?
- What if debug symbols are missing?
- What if a stack frame is corrupted?
- Extensibility
- How will others add new analyses?
- Can you support plugins for different crash types?
Thinking Exercise
Design the API
Before coding, design the data model for crash information:
# What fields should a StackFrame have?
class StackFrame:
frame_number: int
function_name: str # or "??" if unknown
file_name: Optional[str]
line_number: Optional[int]
address: int
arguments: Dict[str, str] # name -> value as string
locals: Dict[str, str] # name -> value as string
# What fields should a CrashReport have?
class CrashReport:
executable_path: str
core_path: str
timestamp: datetime
signal: str # "SIGSEGV"
signal_number: int # 11
backtrace: List[StackFrame]
registers: Dict[str, int]
threads: List[ThreadInfo]
analysis: AnalysisResult
Questions:
- How do you get the signal name from the signal number?
- What if a local variable’s value is “optimized out”?
- How do you serialize gdb.Value to JSON?
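The first and third questions have short, self-contained answers worth sketching (the helper names are made up; the gdb.Value fallback simply relies on int()/str() conversions):
import signal

def signal_name(num):
    """Map a signal number to its name, e.g. 11 -> 'SIGSEGV'."""
    try:
        return signal.Signals(num).name
    except ValueError:
        return f"SIG{num}"

def value_to_jsonable(val):
    """Turn a gdb.Value into something json.dumps accepts."""
    try:
        return int(val)   # integers, pointers, enums
    except Exception:
        return str(val)   # structs, strings, '<optimized out>' markers

print(signal_name(11))    # prints: SIGSEGV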
The Interview Questions They Will Ask
- “How would you build a system to analyze 1000 crash dumps per day?”
- “What is the GDB Python API, and when would you use it over command scripts?”
- “How do you handle a crash dump where the executable was compiled without debug symbols?”
- “Design a crash fingerprinting algorithm—how would you identify if two crashes are the same bug?”
- “What information should a crash report contain for a developer to diagnose the issue?”
- “How would you integrate automated crash analysis with a CI/CD pipeline?”
Hints in Layers
Hint 1: Start with Batch Mode Before writing Python, get comfortable with GDB batch mode:
gdb -batch \
-ex "file ./myprogram" \
-ex "core-file ./core.123" \
-ex "bt" \
-ex "info threads"
Capture this output, then parse it (as a stepping stone to the Python API).
Hint 2: Basic Python Script Structure
import gdb
import json
class CrashAnalyzer(gdb.Command):
"""Analyze current core dump and output report."""
def __init__(self):
super().__init__("crash-analyze", gdb.COMMAND_USER)
def invoke(self, arg, from_tty):
report = {}
# Get backtrace
frame = gdb.newest_frame()
frames = []
while frame:
frames.append(self.extract_frame_info(frame))
frame = frame.older()
report["backtrace"] = frames
# Output as JSON
print(json.dumps(report, indent=2))
def extract_frame_info(self, frame):
# ... implement frame extraction
pass
CrashAnalyzer() # Register the command
Hint 3: Handle Missing Symbols Gracefully
def get_function_name(frame):
try:
name = frame.name()
return name if name else "??"
except gdb.error:
return "??"
def get_local_value(name, frame):
try:
val = frame.read_var(name)
return str(val)
except gdb.error as e:
return f"<unavailable: {e}>"
Hint 4: Wrapper Script for Easy Invocation
#!/bin/bash
# crash-analyzer.sh
GDB_SCRIPT="$(dirname $0)/crash_analyzer.py"
gdb -batch \
-ex "source $GDB_SCRIPT" \
-ex "file $2" \
-ex "core-file $1" \
-ex "crash-analyze $3" # Pass format as arg
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| GDB Scripting | “The Art of Debugging” by Matloff | Ch. 7: Scripting |
| Python APIs | GDB Manual (online) | Python API Reference |
| JSON Processing | Python Documentation | json module |
| CLI Design | “The Linux Command Line” by Shotts | Ch. 25: Scripts |
Common Pitfalls and Debugging
Problem 1: “gdb.error: No frame selected”
- Why: You’re trying to access frame data before loading the core file.
- Fix: Ensure the
core-file command runs before your analysis script. - Quick test: Run commands interactively first to verify order
Problem 2: “Cannot convert gdb.Value to JSON”
- Why: gdb.Value objects aren’t JSON-serializable; you need to convert to strings/ints.
- Fix: Use
str(value) or int(value) depending on type. Handle errors for complex types. - Quick test:
print(type(val), str(val)) in your script
Problem 3: “Script works interactively but fails in batch mode”
- Why: Batch mode may have different timing or output buffering.
- Fix: Ensure all gdb commands complete before Python code runs. Use `gdb.execute()` with `to_string=True` to capture output.
- Quick test: Add debug prints to track execution flow
Problem 4: “Missing symbols for system libraries”
- Why: Debug symbols for libc, etc. are in separate packages.
- Fix: Install debug symbol packages (`libc6-dbg` on Debian, `glibc-debuginfo` on Fedora).
- Quick test: `gdb -ex "info sharedlibrary"` to see which libs lack symbols
Definition of Done
- Script loads core dump and executable via command-line arguments
- Produces JSON output with backtrace, signal, and thread info
- Produces human-readable (Markdown) output as alternative
- Handles missing debug symbols gracefully (shows addresses, not errors)
- Handles multiple threads and identifies the crashing thread
- Includes basic analysis (null pointer detection, crash type classification)
- Works with coredumpctl integration (can extract core from systemd storage)
- Documented installation and usage instructions
Project 5: Multi-threaded Mayhem — Analyzing Concurrent Crashes
- File: P05-multi-threaded-mayhem-concurrent-crashes.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Concurrency, Thread Debugging, Race Conditions
- Software or Tool: GDB thread commands, pthreads, helgrind
- Main Book: “The Linux Programming Interface” by Michael Kerrisk
What you will build: A multi-threaded test suite that demonstrates various concurrency bugs (data races, deadlocks, thread-unsafe crashes) and the GDB techniques to diagnose them from core dumps. You’ll learn to navigate thread state in crashes where the symptom is in one thread but the cause is in another.
Why it teaches crash dump analysis: Single-threaded debugging is straightforward—follow the backtrace to the bug. Multi-threaded crashes are puzzles: Thread A crashes because Thread B corrupted shared data. Thread C is deadlocked waiting for a mutex Thread D holds. This project teaches you to think across threads and correlate state.
Core challenges you will face:
- Navigating thread state in GDB → Maps to `info threads`, `thread N`, `thread apply all`
- Understanding thread-specific crash context → Maps to which thread received the signal
- Correlating data across threads → Maps to finding the “other thread” that caused corruption
- Recognizing concurrency bug patterns → Maps to data races, deadlocks, use-after-free across threads
Real World Outcome
You will have a collection of multi-threaded crash scenarios and the skills to diagnose which thread caused the problem, not just which thread crashed.
Example Output:
$ ./thread-chaos race-condition
[*] Spawning 4 threads to increment shared counter
[*] Thread 0 starting...
[*] Thread 1 starting...
[*] Thread 2 starting...
[*] Thread 3 starting...
[*] Thread 1 corrupted shared data structure (intentionally)
[*] Thread 3 accessing corrupted data...
Segmentation fault (core dumped)
$ gdb ./thread-chaos core.9876
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7ffff7fb1000 (LWP 9876) 0x00007ffff7c9b152 in __strcmp_avx2 ()
2 Thread 0x7ffff6fb0700 (LWP 9877) 0x00007ffff7ce2a8d in __lll_lock_wait ()
3 Thread 0x7ffff67af700 (LWP 9878) 0x0000555555555420 in worker_thread ()
4 Thread 0x7ffff5fae700 (LWP 9879) 0x00007ffff7d0b3bf in __GI___nanosleep ()
(gdb) # Thread 1 (LWP 9876) is the crashed thread - marked with *
(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7fb1000 (LWP 9876))]
#0 0x00007ffff7c9b152 in __strcmp_avx2 () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007ffff7c9b152 in __strcmp_avx2 ()
#1 0x00005555555554a3 in process_item (item=0xdeadbeef) at thread-chaos.c:78
#2 0x0000555555555523 in worker_thread (arg=0x0) at thread-chaos.c:95
#3 0x00007ffff7d8b6db in start_thread () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) # item=0xdeadbeef is suspicious - looks like corruption pattern
(gdb) # Let's check all threads' backtraces
(gdb) thread apply all bt
Thread 4 (Thread 0x7ffff5fae700 (LWP 9879)):
#0 0x00007ffff7d0b3bf in __GI___nanosleep ()
#1 0x0000555555555389 in sleepy_thread (arg=0x3) at thread-chaos.c:67
Thread 3 (Thread 0x7ffff67af700 (LWP 9878)):
#0 0x0000555555555420 in worker_thread (arg=0x2) at thread-chaos.c:92
#1 0x00007ffff7d8b6db in start_thread ()
Thread 2 (Thread 0x7ffff6fb0700 (LWP 9877)):
#0 0x00007ffff7ce2a8d in __lll_lock_wait ()
#1 0x00007ffff7ce5393 in pthread_mutex_lock ()
#2 0x0000555555555501 in corrupt_shared_data (arg=0x1) at thread-chaos.c:88
#3 0x00007ffff7d8b6db in start_thread ()
(gdb) # Thread 2 is in corrupt_shared_data! That's likely the culprit
(gdb) # Thread 1 crashed, but Thread 2's function name suggests it corrupted the data
The Core Question You Are Answering
“In a multi-threaded crash, which thread actually CAUSED the problem?”
The thread that receives SIGSEGV is often the victim, not the perpetrator. Thread A writes garbage to a shared pointer; Thread B later dereferences it and crashes. The backtrace for Thread B shows the crash in innocent code. You need to examine ALL threads to find the real bug.
Concepts You Must Understand First
- Thread Representation in Core Dumps
- How are threads captured (all threads, all registers)?
- What is LWP (Light Weight Process)?
- Which thread is marked as the “crashing thread”?
- Book Reference: “The Linux Programming Interface” by Kerrisk — Ch. 29
- GDB Thread Commands
- `info threads` — list all threads with current frame
- `thread N` — switch to thread N
- `thread apply all CMD` — run CMD on each thread
- `thread apply all bt` — backtrace for all threads (most useful!)
- Book Reference: “The Art of Debugging” by Matloff — Ch. 5
- Concurrency Bug Patterns
- Data race: Two threads access same data, at least one writes, no synchronization
- Deadlock: Circular wait on locks (hard to see in crash, easier in live debug)
- Use-after-free across threads: Thread A frees, Thread B uses
- Race on refcount: Thread A decrements to 0 and frees while Thread B still using
- Book Reference: “The Linux Programming Interface” by Kerrisk — Ch. 30
- Mutex and Synchronization State
- How do you examine mutex state in a core dump?
- What does a “waiting on lock” frame look like?
- How do you identify which thread holds a lock?
- Book Reference: glibc/pthread internals documentation
Questions to Guide Your Design
- Scenario Selection
- What concurrency bugs will you demonstrate (race condition, deadlock, cross-thread use-after-free)?
- How will you make the crashes reproducible (or intentionally non-deterministic)?
- Thread Identification
- How will you name or tag threads so they’re identifiable in GDB?
- Will you use pthread_setname_np()?
- Evidence Planting
- How will you make it clear which thread caused the problem (comments, patterns)?
- What corruption patterns will help identify the source?
- Analysis Workflow
- What systematic approach will you document for multi-threaded crash analysis?
- How do you correlate state across threads?
Thinking Exercise
Map Thread Interactions
Given 4 threads accessing a shared linked list:
- Thread 1: Reads nodes
- Thread 2: Adds nodes
- Thread 3: Removes nodes
- Thread 4: Reads nodes
Questions:
- If Thread 1 crashes dereferencing a freed node, which thread might have freed it?
- If Thread 2 crashes while adding a node, could another thread be involved?
- How would you use
thread apply all btto investigate?
Draw the interaction diagram:
Thread 1 (Read) Thread 2 (Add) Thread 3 (Remove) Thread 4 (Read)
| | | |
| ←── shared_list (mutex protected?) ──→ |
| | | |
read(node) add(new) remove(node) read(node)
| | | |
↓ ↓ ↓ ↓
If node freed Race with Frees node If node freed
→ SIGSEGV remove? | → SIGSEGV
↓
If reader still
using → crash
The Interview Questions They Will Ask
- “How do you determine which thread caused a crash in a multi-threaded program?”
- “What is a data race, and how would you detect one from a core dump?”
- “The backtrace shows the crash in a standard library function (strcmp). How do you find your bug?”
- “How do you examine mutex state in a core dump?”
- “Describe a scenario where the crashing thread is NOT where the bug is.”
- “What tools besides GDB help find concurrency bugs? (helgrind, tsan)”
Hints in Layers
Hint 1: Name Your Threads
Use pthread_setname_np() to give threads meaningful names. GDB will show these in info threads:
pthread_setname_np(pthread_self(), "worker-1");
Hint 2: Use Recognizable Corruption Patterns When one thread intentionally corrupts data, use patterns that stand out:
// Corrupting thread:
shared_ptr = (void*)0xDEADBEEF; // Obvious bad pointer
// Or fill with pattern:
memset(shared_buffer, 0x41, size); // All 'A's
Hint 3: Thread Apply All Is Your Friend
(gdb) thread apply all bt # All backtraces
(gdb) thread apply all bt full # All backtraces with locals
(gdb) thread apply all print shared_var # Check shared var in each thread
Hint 4: Look for Mutex Wait Patterns A thread blocked on a mutex will have a frame like:
#0 __lll_lock_wait () at lowlevellock.S:49
#1 pthread_mutex_lock () at pthread_mutex_lock.c:80
This tells you which thread is waiting. Use info threads to find who might be holding the lock.
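This scan is easy to automate once you have Project 4's GDB scripting skills. A rough sketch using the GDB Python API (the helper name and the exact frame-name matching are assumptions; source it inside GDB after loading the core):
import gdb

def threads_waiting_on_locks():
    """Return (thread number, frame name) for threads parked in a lock wait."""
    waiters = []
    for thread in gdb.selected_inferior().threads():
        thread.switch()                  # make this thread GDB's current thread
        frame = gdb.newest_frame()
        while frame is not None:
            name = frame.name() or ""
            if "lll_lock_wait" in name or name == "pthread_mutex_lock":
                waiters.append((thread.num, name))
                break
            frame = frame.older()
    return waiters

for num, fn in threads_waiting_on_locks():
    print(f"Thread {num} is blocked in {fn}")
Every thread not flagged by this scan is a candidate for actually holding the lock; inspect those first.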
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| POSIX Threads | “The Linux Programming Interface” by Kerrisk | Ch. 29-30 |
| Thread Debugging | “The Art of Debugging” by Matloff | Ch. 5: Debugging in Multi-threaded Environments |
| Race Conditions | “C Programming: A Modern Approach” by King | Ch. 19: Program Design |
| Lock-free Patterns | “C++ Concurrency in Action” by Williams | Ch. 5-6 |
Common Pitfalls and Debugging
Problem 1: “All threads show the same backtrace”
- Why: Copy/paste error, or you’re looking at the wrong column in `info threads`.
- Fix: Use `thread apply all bt`, which clearly labels each thread’s backtrace.
- Quick test: Count unique LWP numbers in `info threads`
Problem 2: “Can’t tell which thread holds the mutex”
- Why: Mutex internals are opaque in most debuggers.
- Fix: Look for the thread that’s NOT waiting on the mutex and has the lock in scope. Or examine `pthread_mutex_t` internals (glibc-specific).
- Quick test: `thread apply all print my_mutex` to see state in each context
Problem 3: “Race condition doesn’t reproduce”
- Why: Races are timing-dependent by nature.
- Fix: Add sleeps, use multiple runs, or use `sched_yield()` to increase collision probability.
- Quick test: Run in a loop: `for i in {1..100}; do ./thread-chaos; done`
Problem 4: “Thread that caused corruption has already exited”
- Why: Threads can exit before the crash occurs.
- Fix: Look for evidence in remaining threads or memory. The corruption pattern itself may indicate the source.
- Quick test: Check thread count at crash time vs expected
Definition of Done
- Created at least 3 multi-threaded crash scenarios (race, deadlock attempt, cross-thread UAF)
- Can use `info threads` and `thread apply all bt` effectively
- Can identify the crashing thread vs the culprit thread
- Documented the thread analysis workflow
- Threads have meaningful names visible in GDB
- Demonstrated finding the “other thread” that caused a crash
- Can explain how mutex wait patterns appear in backtraces
Project 6: The Stripped Binary Challenge — Debugging Without Symbols
- File: P06-stripped-binary-debugging-without-symbols.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Assembly
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Reverse Engineering, Assembly, Binary Analysis
- Software or Tool: GDB, objdump, readelf, IDA Free or Ghidra
- Main Book: “Practical Binary Analysis” by Dennis Andriesse
What you will build: A debugging workflow for analyzing crashes in stripped binaries (no debug symbols, no function names). You’ll learn to use disassembly, register analysis, and memory patterns to understand crashes when the backtrace shows only hex addresses.
Why it teaches crash dump analysis: Production binaries are often stripped to reduce size and protect intellectual property. Third-party libraries and closed-source software never have your debug symbols. Security researchers analyze malware that’s intentionally obfuscated. This project teaches you to debug when you have nothing but the binary and the crash.
Core challenges you will face:
- Reading disassembly → Maps to understanding x86-64 instructions
- Correlating addresses to code → Maps to using objdump and readelf
- Understanding calling conventions → Maps to finding function arguments in registers
- Reconstructing context without symbols → Maps to pattern recognition in memory
Real World Outcome
You will be able to extract useful crash information from binaries with no debug symbols—a skill that distinguishes expert debuggers.
Example Output:
$ file ./mystery-server
./mystery-server: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked,
interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, stripped
$ # No debug info! Let's crash it and analyze
$ ./mystery-server &
[1] 4567
$ curl http://localhost:8080/crash
curl: (52) Empty reply from server
$ # Server crashed
$ gdb ./mystery-server core.4567
(gdb) bt
#0 0x0000000000401a3c in ?? ()
#1 0x0000000000401b89 in ?? ()
#2 0x0000000000401df2 in ?? ()
#3 0x00000000004015a1 in ?? ()
#4 0x00007ffff7db3d90 in __libc_start_call_main () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) # All "??" - no symbols. But we can still analyze!
(gdb) x/10i 0x0000000000401a3c
0x401a3c: mov (%rax),%edx # Crashed here - dereferencing RAX
0x401a3e: add %edx,%r12d
0x401a41: add $0x8,%rax
0x401a45: cmp %rax,%rbx
...
(gdb) info registers rax
rax 0x0 0 # RAX is 0! NULL pointer dereference
(gdb) # Let's find what function we're in by looking at frame 1's call site
(gdb) x/5i 0x0000000000401b89-5
0x401b84: call 0x401a20 # Calling our crashing function
0x401b89: mov %eax,%r13d # Return point (frame 1's address)
(gdb) # So 0x401a20 is the start of the crashing function
(gdb) # Let's see what it does
(gdb) x/30i 0x401a20
0x401a20: push %rbp
0x401a21: push %rbx
0x401a22: push %r12
0x401a24: mov %rdi,%rbp # First arg (pointer)
0x401a27: mov %rsi,%rbx # Second arg (count)
...
0x401a3c: mov (%rax),%edx # CRASH: rax derived from arg
The Core Question You Are Answering
“How do I debug a crash when the binary tells me NOTHING?”
Stripped binaries are the norm in production. When something crashes, you won’t have source lines, function names, or variable names. But you still have the CPU state, memory contents, and the binary itself. With assembly knowledge and systematic analysis, you can still find the bug.
Concepts You Must Understand First
- x86-64 Calling Convention
- Arguments 1-6 are in: RDI, RSI, RDX, RCX, R8, R9
- Return value is in RAX
- Callee-saved: RBP, RBX, R12-R15
- Stack grows down, RSP is stack pointer
- Book Reference: “CSAPP” by Bryant — Ch. 3.7
- Basic x86-64 Instructions
- `mov (%rax), %edx` — Load memory at address in RAX into EDX
- `call ADDR` — Call function at ADDR
- `push`/`pop` — Stack operations
- `cmp`/`je`/`jne` — Comparison and conditional jumps
- Book Reference: “CSAPP” by Bryant — Ch. 3
- ELF Structure Without Symbols
- How to find function boundaries (look for `push %rbp; mov %rsp,%rbp`)
- Using .plt and .got for library function identification
- String references can reveal function purpose
- Book Reference: “Practical Binary Analysis” by Andriesse — Ch. 2-3
- GDB Disassembly Commands
- `x/Ni ADDR` — Disassemble N instructions at ADDR
- `disassemble ADDR,+LEN` — Disassemble range
- `info registers` — Show all registers
- `x/s ADDR` — Try to interpret memory as string
- Book Reference: GDB Manual, Examining Memory
Questions to Guide Your Design
- Scenario Creation
- How will you create a stripped binary that crashes in interesting ways?
- Will you keep a non-stripped version for verification?
- Analysis Workflow
- What systematic steps will you follow for stripped binary analysis?
- How will you document the workflow for future reference?
- Tool Integration
- Will you use objdump, readelf, or both?
- Will you introduce a disassembler (Ghidra, IDA Free)?
- Pattern Recognition
- What common patterns will you learn to recognize (function prologues, loops, string operations)?
- How will you identify library calls?
Thinking Exercise
Decode the Crash
Given this GDB output from a stripped binary:
(gdb) bt
#0 0x0000000000401234 in ?? ()
#1 0x00007ffff7ca5678 in strlen () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x0000000000401567 in ?? ()
(gdb) info registers
rax 0x0 0
rdi 0x0 0
Questions:
- Frame 1 is `strlen` from libc. What does this tell you about the crash?
- RDI is 0, and `strlen` takes its argument in RDI. What happened?
- The bug is likely in frame 2 (0x401567). Why?
- What would you examine next?
Analysis:
1. strlen crashed because it received NULL (RDI=0)
2. Frame 2 called strlen with a NULL pointer
3. Next step: disassemble around 0x401567 to see why NULL was passed
4. Look for: mov $0x0,%rdi or a conditional that should have checked for NULL
The Interview Questions They Will Ask
- “You have a core dump from a stripped binary. Walk me through your analysis approach.”
- “What x86-64 registers contain function arguments 1, 2, and 3?”
- “How can you find function boundaries in a stripped binary?”
- “The crash is in libc’s malloc. How do you find the bug in your code?”
- “How do you identify what function a code block implements without symbols?”
- “What tools besides GDB help with stripped binary analysis?”
Hints in Layers
Hint 1: Find Function Starts Look for the standard function prologue:
push %rbp
mov %rsp,%rbp
sub $0xNN,%rsp ; Allocate stack space
Every push %rbp is likely a function entry.
Hint 2: Use Cross-References Find who calls the crashing function:
(gdb) x/5i RETURN_ADDRESS-5
# Look for "call CRASHING_FUNC_ADDR"
Hint 3: Library Calls Are Labeled Even in stripped binaries, calls to libc functions go through PLT:
call 0x401030 <strlen@plt>
GDB shows these names because they’re in the dynamic symbol table.
Hint 4: String References Look for strings to understand purpose:
$ strings -t x ./mystery-server | grep -i error
4a20 Error: invalid input
4b30 Connection error
Then in GDB:
(gdb) x/s 0x404a20
0x404a20: "Error: invalid input"
Find code that references this address.
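Scanning the disassembly for those references can be scripted. A small helper sketch (hypothetical, and assuming objdump's usual habit of annotating rip-relative operands with the resolved address as a trailing "# 404a20" comment):
import subprocess
import sys

def find_references(binary, addr):
    """Print every disassembled instruction that mentions the given address."""
    hex_addr = f"{addr:x}"
    disasm = subprocess.run(["objdump", "-d", binary],
                            capture_output=True, text=True, check=True).stdout
    for line in disasm.splitlines():
        # Catch absolute immediates (0x404a20) and objdump's "# 404a20" comments.
        if f"0x{hex_addr}" in line or f"# {hex_addr}" in line:
            print(line.rstrip())

if __name__ == "__main__":
    # e.g.: python3 find_refs.py ./mystery-server 404a20
    find_references(sys.argv[1], int(sys.argv[2], 16))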
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| x86-64 Assembly | “CSAPP” by Bryant & O’Hallaron | Ch. 3: Machine-Level Representation |
| ELF Format | “Practical Binary Analysis” by Andriesse | Ch. 2: ELF Binary Format |
| Reverse Engineering | “Practical Reverse Engineering” by Dang | Ch. 1-2 |
| GDB Advanced | “The Art of Debugging” by Matloff | Ch. 7: Advanced Topics |
Common Pitfalls and Debugging
Problem 1: “Can’t tell where functions start/end”
- Why: Without symbols, there are no boundaries marked.
- Fix: Look for `push %rbp` (function start) and `ret` (function end). Also look for alignment padding (nop instructions).
- Quick test: `objdump -d ./binary | grep -E "(push.*%rbp|retq)"`
Problem 2: “Register values don’t make sense”
- Why: You’re looking at registers after the crash, not before. Some may be corrupted.
- Fix: Focus on callee-saved registers (RBP, RBX, R12-R15) as they preserve values across calls.
- Quick test: Verify stack integrity first
Problem 3: “Can’t find the string that caused the crash”
- Why: String might be on the heap (not in binary) or already freed.
- Fix: Examine memory at the pointer address. Check if it’s a valid address range.
- Quick test: `info proc mappings` to see valid address ranges
Problem 4: “Disassembly looks like garbage”
- Why: You might be disassembling data, not code. Or wrong address.
- Fix: Use the addresses from the backtrace. They’re valid instruction pointers.
- Quick test: `x/1i $pc` should always show a valid instruction
Definition of Done
- Created a stripped binary that crashes in an interesting way
- Can analyze the crash using only GDB and the binary (no source)
- Documented the x86-64 calling convention and key registers
- Can find function boundaries without symbols
- Can identify what libc functions are being called
- Can trace from crash site to the actual bug
- Created a stripped binary analysis workflow/checklist
- Verified findings against the (hidden) source code
Project 7: Minidump Parser — Understanding Compact Crash Formats
- File: P07-minidump-parser-compact-crash-formats.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Binary Parsing, File Formats, Cross-Platform
- Software or Tool: Breakpad, minidump_dump, custom parser
- Main Book: “Practical Binary Analysis” by Dennis Andriesse
What you will build: A parser for Google Breakpad’s minidump format—a compact, cross-platform crash dump format used by Chrome, Firefox, and many other applications. You’ll learn to read the minidump specification and extract crash information from real-world minidump files.
Why it teaches crash dump analysis: Not all crash dumps are ELF core files. Breakpad minidumps are ubiquitous in commercial software because they’re small (kilobytes vs megabytes), cross-platform, and contain just enough information for diagnosis. Understanding this format exposes you to how the industry handles crash reporting at scale.
Core challenges you will face:
- Reading a binary file format specification → Maps to understanding headers, streams, and directories
- Parsing variable-length structures → Maps to handling strings, lists, and nested data
- Extracting platform-specific data → Maps to handling x86-64 vs ARM vs Windows contexts
- Correlating with symbol files → Maps to understanding .sym files and address resolution
Real World Outcome
You will have a working minidump parser that extracts crash information comparable to Google’s minidump_dump tool.
Example Output:
$ ./minidump-parser ./crash.dmp
=== Minidump Analysis: crash.dmp ===
Header:
Signature: MDMP
Version: 0xa793
Stream Count: 12
Timestamp: 2025-01-04 15:30:45 UTC
Crash Info:
Exception Code: EXCEPTION_ACCESS_VIOLATION (0xc0000005)
Crash Address: 0x00007ff6123456ab
Crashing Thread: 0
Threads (4 total):
Thread 0 (Crashed):
Stack: 0x000000c8f9b00000 - 0x000000c8f9c00000
Context:
RIP: 0x00007ff6123456ab
RSP: 0x000000c8f9bff8a0
RBP: 0x000000c8f9bff8f0
Thread 1:
Stack: 0x000000c8f9400000 - 0x000000c8f9500000
Context: [suspended]
Modules (15 loaded):
0x00007ff612340000 - 0x00007ff612380000 myapp.exe
0x00007ffc12340000 - 0x00007ffc12560000 ntdll.dll
0x00007ffc10000000 - 0x00007ffc10200000 kernel32.dll
...
Memory Regions (5 captured):
0x000000c8f9bff800 - 0x000000c8f9c00000 (2048 bytes) - Stack near crash
...
Stack Trace (unsymbolicated):
#0 0x00007ff6123456ab myapp.exe + 0x156ab
#1 0x00007ff612345123 myapp.exe + 0x15123
#2 0x00007ffc12345678 ntdll.dll + 0x5678
The Core Question You Are Answering
“How do crash reporting systems represent crashes in a portable, compact format?”
ELF core dumps are Linux-specific and huge. Minidumps solve this: they’re small, cross-platform, and contain just the information needed for diagnosis. Understanding this format teaches you how production crash reporting actually works at companies like Google, Mozilla, and Microsoft.
Concepts You Must Understand First
- Minidump Format Structure
- Header: Signature, version, stream directory location
- Stream directory: Array of (type, size, offset) entries
- Streams: Thread list, module list, exception, memory, system info
- Book Reference: MSDN Minidump File Format documentation
- Stream Types You’ll Parse
- MINIDUMP_STREAM_TYPE: ThreadListStream, ModuleListStream, ExceptionStream
- Memory streams: MemoryListStream, Memory64ListStream
- Context stream: Thread context (registers) per architecture
- Book Reference: Breakpad source code (src/google_breakpad/common/minidump_format.h)
- CPU Context Structures
- Different for x86, x86-64, ARM, ARM64
- Contains all registers at crash time
- Must know architecture to interpret correctly
- Book Reference: Processor-specific ABI documents
- Symbol Files (.sym)
- Breakpad’s text-based symbol format
- Maps addresses to function names and source lines
- PUBLIC records for function starts, FUNC + line records for details
- Book Reference: Breakpad symbol file format documentation
Questions to Guide Your Design
- Parsing Strategy
- Will you read the entire file into memory or seek to each stream?
- How will you handle endianness (minidumps from Windows are little-endian)?
- Platform Support
- Will you support x86-64 contexts only, or multiple architectures?
- How will you detect the architecture from the minidump?
- Output Format
- What human-readable format will you produce?
- Will you support JSON output for programmatic use?
- Symbol Integration
- Will you support loading .sym files for symbolication?
- How will you match modules to their symbol files?
Thinking Exercise
Map the Minidump Structure
Given a hex dump of a minidump header:
00000000: 4d44 4d50 93a7 0000 0c00 0000 2001 0000 MDMP........
00000010: 0000 0000 xxxx xxxx xxxx xxxx xxxx xxxx ................
Questions:
- What is the signature (first 4 bytes)?
- What is the version (next 4 bytes)?
- How many streams are in the directory?
- Where is the stream directory located?
Decode:
Signature: "MDMP" (0x504d444d in little-endian)
Version: 0x0000a793
Stream Count: 12 (0x0000000c)
Stream Directory RVA: 0x00000120 (288 bytes into file)
The Interview Questions They Will Ask
- “What is a minidump, and how does it differ from a core dump?”
- “Walk me through the structure of a minidump file.”
- “How does Breakpad capture crash information without stopping the process?”
- “What information would you need to symbolicate a minidump stack trace?”
- “How would you design a crash reporting system for a mobile app?”
- “What are the privacy implications of crash dumps, and how do minidumps address them?”
Hints in Layers
Hint 1: Start with the Header The header is fixed-size and tells you where everything else is:
typedef struct {
    uint32_t signature;            // "MDMP"
    uint32_t version;
    uint32_t stream_count;
    uint32_t stream_directory_rva; // Offset to directory
    uint32_t checksum;
    uint32_t timestamp;
    uint64_t flags;
} MINIDUMP_HEADER;
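Since Python is listed as an alternative language for this project, here is a hedged sketch of the same header read with the struct module (field order taken from the C layout above: six uint32 fields followed by one uint64, 32 bytes total):
import struct
import sys

HEADER_FMT = "<IIIIIIQ"                    # little-endian: 6 x uint32 + 1 x uint64
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 32 bytes

def read_header(path):
    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    (signature, version, stream_count,
     stream_dir_rva, checksum, timestamp, flags) = struct.unpack(HEADER_FMT, raw)
    if signature != 0x504D444D:            # "MDMP" read as a little-endian uint32
        raise ValueError("not a minidump")
    return {"version": hex(version), "streams": stream_count,
            "directory_rva": stream_dir_rva, "timestamp": timestamp}

if __name__ == "__main__":
    print(read_header(sys.argv[1]))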
Hint 2: Parse the Stream Directory Each entry tells you the type, size, and location of a stream:
typedef struct {
    uint32_t stream_type; // e.g., ThreadListStream = 3
    uint32_t size;
    uint32_t rva;         // Offset in file
} MINIDUMP_DIRECTORY;
Hint 3: Use Existing Tools for Verification
Breakpad’s minidump_dump tool shows you what to expect:
$ minidump_dump crash.dmp
# Compare your output with this
Hint 4: Handle Variable-Length Strings Minidumps use MINIDUMP_STRING for module names:
typedef struct {
    uint32_t length;      // In bytes, not including terminator
    uint16_t buffer[1];   // UTF-16LE string data
} MINIDUMP_STRING;
Read length bytes, convert from UTF-16LE to your preferred encoding.
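In Python terms (a companion sketch to the header parser above, assuming RVAs are absolute file offsets as described earlier):
import struct

def read_minidump_string(f, rva):
    """Read a MINIDUMP_STRING at the given RVA from an open binary file."""
    f.seek(rva)                                  # RVA = offset from file start
    (length,) = struct.unpack("<I", f.read(4))   # byte length, excluding terminator
    return f.read(length).decode("utf-16-le")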
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Binary Parsing | “Practical Binary Analysis” by Andriesse | Ch. 2-3: Binary Formats |
| File I/O in C | “The C Programming Language” by K&R | Ch. 7: Input and Output |
| Endianness | “CSAPP” by Bryant | Ch. 2.1: Information Storage |
| Windows Internals | “Windows Internals” by Russinovich | Ch. 3: System Mechanisms |
Common Pitfalls and Debugging
Problem 1: “Signature doesn’t match ‘MDMP’”
- Why: You might be reading big-endian or the file isn’t a minidump.
- Fix: Check byte order. Minidumps are little-endian. Verify with the `file` command.
- Quick test: `xxd -l 16 crash.dmp` should show "MDMP" as ASCII
Problem 2: “Stream RVA points to wrong location”
- Why: RVA is relative to file start, not current position. Off-by-one error.
- Fix: Seek to absolute position in file, not relative.
- Quick test: Print RVAs and manually verify with hex editor
Problem 3: “Module names are garbled”
- Why: Minidumps store strings as UTF-16LE, not ASCII.
- Fix: Convert UTF-16LE to UTF-8. The `length` field is in bytes.
- Quick test: Check if bytes alternate with 0x00 (ASCII in UTF-16)
Problem 4: “Context structure size doesn’t match”
- Why: Context structure varies by CPU architecture.
- Fix: Check SystemInfo stream for processor architecture, use correct struct.
- Quick test: Print context size vs expected for the architecture
Definition of Done
- Can read and validate the minidump header
- Can enumerate all streams in the stream directory
- Can parse ThreadListStream and show thread information
- Can parse ModuleListStream and show loaded modules
- Can parse ExceptionStream and show crash details
- Can extract and display CPU context (registers) for crashing thread
- Output matches the `minidump_dump` tool for test files
- Documented the minidump format as learned
Project 8: Kernel Panic Anatomy — Triggering and Capturing with kdump
- File: P08-kernel-panic-kdump-capture.md
- Main Programming Language: C (kernel module)
- Alternative Programming Languages: Shell scripting for setup
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Kernel Development, System Recovery, Enterprise Linux
- Software or Tool: kdump, kexec, crash, kernel module development
- Main Book: “Linux Kernel Development” by Robert Love
What you will build: A controlled kernel panic environment using kdump—the industry-standard mechanism for capturing kernel crash dumps. You’ll write a simple kernel module that triggers a panic on demand and configure kdump to capture the vmcore for analysis.
Why it teaches crash dump analysis: User-space crashes are tame compared to kernel panics. When the kernel crashes, there’s no OS to save you—kdump uses a clever trick (kexec) to boot a minimal “capture kernel” that saves memory before rebooting. This is how enterprise Linux (RHEL, SUSE) handles kernel crashes, and understanding it is essential for systems engineers.
Core challenges you will face:
- Configuring kdump correctly → Maps to kernel parameters, crashkernel reservation
- Writing a kernel module → Maps to basic kernel development
- Understanding kexec → Maps to the boot-into-capture-kernel mechanism
- Handling the vmcore → Maps to where and how the dump is saved
Real World Outcome
You will have a working kdump setup in a VM that reliably captures kernel panics, along with a kernel module that triggers panics on demand for testing.
Example Output:
$ # On a VM configured with kdump
$ sudo kdumpctl status
kdump: Kdump is operational
$ cat /proc/cmdline
... crashkernel=256M ...
$ # Our panic-trigger module
$ ls /sys/kernel/panic_trigger/
trigger
$ # Trigger the panic (VM will crash and reboot)
$ echo 1 | sudo tee /sys/kernel/panic_trigger/trigger
[ 123.456789] panic_trigger: Triggering kernel panic!
[ 123.456790] Kernel panic - not syncing: Manually triggered panic
...
$ # After reboot, kdump captured the vmcore
$ ls /var/crash/127.0.0.1-2025-01-04-15:30:45/
vmcore vmcore-dmesg.txt
$ # Check that it's valid
$ sudo crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
/var/crash/127.0.0.1-2025-01-04-15:30:45/vmcore
crash> bt
PID: 1234 TASK: ffff8881234abcd0 CPU: 0 COMMAND: "tee"
#0 [ffffc90001234e40] machine_kexec at ffffffff81060abc
#1 [ffffc90001234e90] __crash_kexec at ffffffff811234de
#2 [ffffc90001234f00] panic at ffffffff81890123
#3 [ffffc90001234f80] panic_trigger_write at ffffffffc0001234 [panic_trigger]
...
The Core Question You Are Answering
“When the kernel itself crashes, how do you capture the state when there’s no OS to help?”
The kernel is the foundation—when it panics, everything stops. The ingenious solution is kdump: pre-load a second “capture kernel” into reserved memory, and when panic occurs, kexec immediately boots into it. This minimal kernel’s only job is to save the crashed kernel’s memory to disk before rebooting. It’s one of the most elegant debugging mechanisms in systems software.
Concepts You Must Understand First
- Kernel Panic Basics
- What triggers a kernel panic (BUG(), NULL deref in kernel, deadlock)?
- What information is printed to console?
- Why can’t you just write to disk from the panicking kernel?
- Book Reference: “Linux Kernel Development” by Love — Ch. 18: Debugging
- kexec and kdump Architecture
- kexec: Boot a new kernel without going through BIOS
- kdump: Use kexec to boot into a pre-loaded capture kernel
- crashkernel parameter: Reserve memory for the capture kernel
- Book Reference: kernel.org documentation on kdump
- Kernel Module Basics
- Module loading with insmod/modprobe
- init and exit functions
- Sysfs interface for triggering actions
- Book Reference: “Linux Device Drivers” by Corbet — Ch. 1-2
- vmcore Format
- The memory dump created by kdump
- ELF format with special notes
- Contains all kernel memory at panic time
- Book Reference: crash utility documentation
Questions to Guide Your Design
- VM Setup
- What virtualization platform will you use (QEMU, VirtualBox, VMware)?
- How much RAM should you reserve for crashkernel?
- kdump Configuration
- Where should vmcores be saved (local disk, NFS, SSH)?
- What kernel debug symbols do you need?
- Panic Module Design
- What trigger mechanism (sysfs file, procfs, ioctl)?
- Should it support different panic types (BUG, NULL, deadlock)?
- Safety Considerations
- How will you ensure this only runs in a VM?
- What warnings should the module print?
Thinking Exercise
Trace the kdump Boot Sequence
Map out what happens when you trigger a panic with kdump configured:
1. User writes to /sys/kernel/panic_trigger/trigger
2. Kernel module calls panic("...")
3. Kernel enters panic() function:
- Disables interrupts, stops other CPUs
- Prints panic message to console
- Calls __crash_kexec() if kdump is configured
4. __crash_kexec():
- Copies registers and memory info to pre-defined location
- Jumps to the capture kernel (loaded at boot time)
5. Capture kernel boots:
- Minimal initramfs with makedumpfile
- Reads original kernel's memory from /proc/vmcore
- Writes vmcore to /var/crash/
- Reboots into normal kernel
6. After reboot:
- vmcore is available for analysis with crash utility
Questions:
- Why must the capture kernel be pre-loaded (not loaded at panic time)?
- Why does crashkernel memory need to be reserved at boot?
- What if the panic happens in the capture kernel?
The Interview Questions They Will Ask
- “A production server kernel panicked overnight. Walk me through your investigation process.”
- “What is kdump, and how does it differ from regular core dumps?”
- “How does kexec boot a new kernel without going through BIOS?”
- “What is the crashkernel boot parameter, and how do you determine the right value?”
- “What information do you need to analyze a vmcore?”
- “How would you configure kdump to send crash dumps to a remote server?”
Hints in Layers
Hint 1: Start with kdump Configuration Before writing modules, get kdump working with manual triggers:
# Fedora/RHEL
sudo dnf install kexec-tools crash kernel-debuginfo
sudo systemctl enable kdump
# Add crashkernel=256M to kernel command line
sudo reboot
# Test kdump is operational
sudo kdumpctl status
Hint 2: Simple Panic Module Skeleton
// panic_trigger.c
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/sysfs.h>
#include <linux/kobject.h>

static struct kobject *panic_kobj;

static ssize_t trigger_store(struct kobject *kobj,
                             struct kobj_attribute *attr, const char *buf, size_t count)
{
    pr_alert("panic_trigger: Triggering kernel panic!\n");
    panic("Manually triggered panic from panic_trigger module");
    return count; // Never reached
}

static struct kobj_attribute trigger_attr = __ATTR_WO(trigger);

static int __init panic_trigger_init(void) { /* ... */ }
static void __exit panic_trigger_exit(void) { /* ... */ }
Hint 3: Verify Debug Symbols The crash utility needs matching kernel debug symbols:
# Check kernel version
uname -r
# Install debug symbols
# Fedora: sudo dnf debuginfo-install kernel
# Ubuntu: sudo apt install linux-image-$(uname -r)-dbgsym
# Verify
ls /usr/lib/debug/lib/modules/$(uname -r)/vmlinux
Hint 4: Test in VM Only! Add a safety check to your module:
static int __init panic_trigger_init(void)
{
    /* x86-specific: the HYPERVISOR CPU feature bit is set when running under a VMM */
    if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
        pr_err("panic_trigger: Refusing to load on bare metal!\n");
        return -EPERM;
    }
    // ...
}
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Kernel Modules | “Linux Device Drivers, 3rd Ed” by Corbet | Ch. 1-2: Building Modules |
| Kernel Debugging | “Linux Kernel Development” by Love | Ch. 18: Debugging |
| Kernel Internals | “Understanding the Linux Kernel” by Bovet | Ch. 4: Interrupts |
| System Boot | “How Linux Works” by Ward | Ch. 5: How Linux Boots |
Common Pitfalls and Debugging
Problem 1: “kdump service won’t start”
- Why: crashkernel parameter missing or insufficient memory reserved.
- Fix: Add `crashkernel=256M` (or more) to the kernel command line in GRUB.
- Quick test: `cat /proc/cmdline | grep crashkernel`
Problem 2: “Panic happens but no vmcore created”
- Why: Capture kernel failed to boot or write. Check crash directory.
- Fix: Check `/var/crash/` for partial dumps or errors. Check serial console output.
- Quick test: `journalctl -b -1 | grep -i kdump` (previous boot logs)
Problem 3: “crash says ‘vmcore and vmlinux do not match’”
- Why: Debug symbols are for a different kernel version.
- Fix: Install debug symbols for the exact kernel that crashed.
- Quick test: `crash` prints the kernel release at startup and warns when vmlinux and vmcore do not match
Problem 4: “Module won’t load: ‘Unknown symbol’”
- Why: Missing kernel headers or mismatched versions.
- Fix: Install kernel-devel package for your running kernel.
- Quick test: `ls /lib/modules/$(uname -r)/build`
Definition of Done
- VM configured with kdump operational (`kdumpctl status` shows ready)
- Kernel module created that triggers panic via sysfs
- Successfully triggered panic and vmcore was captured
- vmcore can be opened with crash utility
- Can extract backtrace showing the panic call chain
- Documented the complete kdump setup process
- Module includes safety check to prevent loading on bare metal
Project 9: Analyzing Kernel Crashes with the crash Utility
- File: P09-analyzing-kernel-crashes-crash-utility.md
- Main Programming Language: crash commands (interactive)
- Alternative Programming Languages: crash extensions in C
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Kernel Internals, Debugging, System Analysis
- Software or Tool: crash utility, kernel debug symbols
- Main Book: “Understanding the Linux Kernel” by Bovet & Cesati
What you will build: Expertise in using the crash utility to analyze kernel vmcores. You’ll investigate real kernel panics (from Project 8 and downloaded examples) and learn to navigate kernel data structures, examine process state, and identify root causes of kernel crashes.
Why it teaches crash dump analysis: The crash utility is to kernel debugging what GDB is to user-space. It’s the standard tool used by Red Hat, SUSE, and kernel developers worldwide. Mastering it opens doors to kernel debugging, enterprise support, and deep systems understanding.
Core challenges you will face:
- Navigating crash’s command set → Maps to bt, task, vm, kmem, and dozens more
- Understanding kernel data structures → Maps to task_struct, mm_struct, etc.
- Correlating kernel state → Maps to finding what led to the panic
- Reading kernel source alongside crash → Maps to effective debugging workflow
Real World Outcome
You will have a systematic workflow for analyzing kernel crashes using crash, with a command reference and documented investigation of real vmcores.
Example Output:
$ sudo crash /usr/lib/debug/lib/modules/5.15.0/vmlinux /var/crash/vmcore
crash 8.0.0
...
KERNEL: /usr/lib/debug/lib/modules/5.15.0/vmlinux
DUMPFILE: /var/crash/vmcore
DATE: Sat Jan 4 15:30:45 UTC 2025
UPTIME: 01:23:45
LOAD AVERAGE: 0.50, 0.35, 0.20
TASKS: 256
NODENAME: myserver
RELEASE: 5.15.0-generic
VERSION: #1 SMP
MACHINE: x86_64
MEMORY: 8 GB
PANIC: "Kernel panic - not syncing: Manually triggered panic"
crash> bt
PID: 1234 TASK: ffff888123456780 CPU: 2 COMMAND: "tee"
#0 [ffffc90001234000] machine_kexec at ffffffff81060abc
#1 [ffffc90001234050] __crash_kexec at ffffffff81123def
#2 [ffffc90001234100] panic at ffffffff818901ab
#3 [ffffc90001234180] panic_trigger_write at ffffffffc0123456 [panic_trigger]
#4 [ffffc900012341d0] kernfs_fop_write_iter at ffffffff8132abcd
#5 [ffffc90001234250] vfs_write at ffffffff8128def0
#6 [ffffc90001234290] ksys_write at ffffffff8128f123
#7 [ffffc900012342d0] do_syscall_64 at ffffffff81890456
#8 [ffffc90001234300] entry_SYSCALL_64_after_hwframe at ffffffff82000089
crash> task
PID: 1234 TASK: ffff888123456780 CPU: 2 COMMAND: "tee"
struct task_struct {
state = 0x0,
stack = 0xffffc90001230000,
pid = 1234,
tgid = 1234,
comm = "tee",
...
}
crash> vm
PID: 1234 TASK: ffff888123456780 CPU: 2 COMMAND: "tee"
MM PGD RSS TOTAL_VM
ffff888112233440 ffff88810abcd000 2048k 12288k
VMA START END FLAGS FILE
ffff888100001000 7f0000000000 7f0000021000 8000875 /usr/bin/tee
...
crash> log | tail -20
[ 123.456789] panic_trigger: Triggering kernel panic!
[ 123.456790] Kernel panic - not syncing: Manually triggered panic
The Core Question You Are Answering
“How do I investigate a kernel crash when I have gigabytes of kernel memory and thousands of data structures?”
A vmcore contains everything: all processes, all memory, all kernel state. The challenge is navigation. The crash utility provides commands to examine specific data structures, follow pointers, and correlate state across subsystems. It’s like GDB but for the entire operating system state.
Concepts You Must Understand First
- crash Command Categories
- Process commands: bt, task, ps, foreach
- Memory commands: vm, kmem, rd, wr
- Kernel state: log, timer, irq, runq
- System info: sys, mach, net
- Book Reference: crash whitepaper by Dave Anderson (Red Hat)
- Key Kernel Data Structures
- task_struct: Process descriptor
- mm_struct: Memory management
- inode, dentry: Filesystem
- sk_buff, socket: Networking
- Book Reference: “Understanding the Linux Kernel” by Bovet — throughout
- Kernel Stack Traces
- How to read kernel backtraces
- Identifying the panic function
- Finding the root cause vs symptom
- Book Reference: “Linux Kernel Development” by Love — Ch. 18
- Using Source Code
- Correlating crash output with kernel source
- Using LXR or elixir.bootlin.com
- Understanding inline functions and macros
- Book Reference: Kernel source code itself
Questions to Guide Your Design
- Learning Path
- What commands will you focus on first?
- How will you practice each command?
- Sample Vmcores
- Where will you get vmcores to practice with?
- Will you create different crash types in Project 8?
- Documentation
- What format will your crash command reference take?
- How will you document your analysis workflow?
- Advanced Features
- Will you explore crash extensions?
- Will you write custom crash macros?
Thinking Exercise
Trace a NULL Pointer Panic
Given this crash backtrace:
crash> bt
#0 page_fault at ffffffff81789abc
#1 do_page_fault at ffffffff8101def0
#2 async_page_fault at ffffffff82000123
#3 my_driver_read at ffffffffc0123456 [my_driver]
#4 vfs_read at ffffffff8128abc0
#5 ksys_read at ffffffff8128bcd0
#6 do_syscall_64 at ffffffff81890456
Questions:
- What is the immediate cause of the panic?
- Which frame contains your (or the third-party) code?
- What commands would you use to investigate my_driver_read?
Investigation steps:
crash> dis my_driver_read # Disassemble the function
crash> bt -f # Show frame addresses
crash> x/20x <frame_address> # Examine stack at crash point
crash> task # What process triggered this?
crash> files # What files did it have open?
The Interview Questions They Will Ask
- “Walk me through analyzing a kernel panic using the crash utility.”
- “What does the ‘bt’ command show, and how do you interpret kernel stack traces?”
- “How do you find what process was running when the kernel panicked?”
- “A customer reports a production kernel panic. What information do you need?”
- “How do you examine the contents of a kernel data structure in crash?”
- “What is the difference between ‘rd’ and ‘struct’ commands in crash?”
Hints in Layers
Hint 1: Essential Commands to Learn First
bt # Backtrace of current/specified task
log # Kernel ring buffer (dmesg)
task # Current task's task_struct
ps # Process list
vm # Virtual memory info
kmem -i # Memory usage summary
sys # System information
Hint 2: Investigating a Specific Task
crash> ps | grep suspicious_process
1234 1 0 ffff888123456780 RU 0.5 myprocess
crash> set 1234 # Set context to PID 1234
crash> bt # Backtrace of that process
crash> files # Open files
crash> vm # Memory maps
crash> task -R # Reveal task_struct contents
Hint 3: Examining Memory
crash> kmem -s # Slab cache info
crash> kmem <address> # What does this address point to?
crash> rd <address> 64 # Read 64 bytes at address
crash> struct task_struct <addr> # Interpret as task_struct
crash> list task_struct.tasks -H <head> # Walk a linked list
Hint 4: Using foreach
crash> foreach bt # Backtrace of ALL tasks
crash> foreach RU bt # Backtrace of running tasks only
crash> foreach files # Open files for all processes
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| crash Basics | crash whitepaper by Dave Anderson | Entire document |
| Kernel Internals | “Understanding the Linux Kernel” by Bovet | Ch. 3: Processes |
| Memory Management | “Understanding the Linux Kernel” by Bovet | Ch. 8: Memory |
| Kernel Debugging | “Linux Kernel Development” by Love | Ch. 18: Debugging |
Common Pitfalls and Debugging
Problem 1: “crash says ‘cannot find vmlinux’”
- Why: Debug symbols not installed or wrong path.
- Fix: Install kernel debuginfo package. Use full path to vmlinux.
- Quick test: `find /usr/lib/debug -name "vmlinux*"`
Problem 2: “Backtrace shows only addresses, no function names”
- Why: Missing debug symbols or wrong vmlinux.
- Fix: Ensure vmlinux matches the kernel that crashed exactly.
- Quick test: `crash -s vmlinux vmcore` shows version mismatch warnings
Problem 3: “‘struct’ command shows garbage”
- Why: Wrong address or data structure mismatch.
- Fix: Verify the address with `kmem <addr>` first. Check kernel version.
- Quick test: `kmem -s` to find valid slab addresses to practice with
Problem 4: “Can’t find the process that crashed”
- Why: The crashing context might be interrupt or kernel thread.
- Fix: Check `bt` output for the COMMAND field. Use `ps -k` for kernel threads.
- Quick test: `bt` shows PID and COMMAND at the top
Definition of Done
- Can load a vmcore into crash and get basic info
- Mastered bt, log, task, ps, vm commands
- Can investigate which process/thread triggered a panic
- Can examine kernel data structures with struct command
- Can navigate memory with rd, kmem, and address interpretation
- Created a crash command cheat sheet with examples
- Analyzed at least 3 different types of kernel crashes
- Documented a complete investigation workflow
Project 10: Centralized Crash Reporter — Production-Grade Infrastructure
- File: P10-centralized-crash-reporter-infrastructure.md
- Main Programming Language: Python (backend), C (collector)
- Alternative Programming Languages: Go, Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 5: Master
- Knowledge Area: System Design, Distributed Systems, DevOps
- Software or Tool: systemd-coredump, REST API, PostgreSQL, object storage
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you will build: A complete crash reporting infrastructure that collects crashes from multiple hosts, stores them centrally, generates analysis reports, fingerprints for deduplication, and provides a web interface for browsing crashes. This is a mini-version of Sentry, Crashlytics, or Mozilla’s crash-stats.
Why it teaches crash dump analysis: Everything you’ve learned comes together: core dump capture, GDB analysis, automation, fingerprinting. This project teaches you to think at scale: How do you handle 10,000 crashes per day? How do you identify the top 10 bugs? How do you integrate with alerting systems?
Core challenges you will face:
- Reliable crash collection → Maps to handling partial dumps, network failures
- Scalable storage → Maps to compression, retention, object storage
- Fingerprinting and deduplication → Maps to grouping same bugs together
- Analysis pipeline → Maps to automated report generation at scale
- User interface → Maps to presenting crash data usefully
Real World Outcome
You will have a working crash reporting system that you can deploy for real services, demonstrating end-to-end understanding of production crash analysis.
Example Output:
Web Interface:
┌─────────────────────────────────────────────────────────────────┐
│ Crash Reporter Dashboard Last 24h ▼ │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Total Crashes: 847 Unique Bugs: 23 Hosts Affected: 12 │
│ │
│ Top Crashes (by occurrence) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ #1 NULL_DEREF_process_request_server.c:234 (423 crashes)│ │
│ │ First: Jan 3, 14:22 Last: Jan 4, 08:45 │ │
│ │ Hosts: web-01, web-02, web-03 │ │
│ │ [View Stack] [View Crashes] [Mark Fixed] │ │
│ ├─────────────────────────────────────────────────────────────┤ │
│ │ #2 SEGV_handle_upload_api.c:892 (198 crashes) │ │
│ │ First: Jan 4, 02:15 Last: Jan 4, 08:30 │ │
│ │ Hosts: api-01, api-02 │ │
│ │ [View Stack] [View Crashes] [Mark Fixed] │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
API Output:
$ curl https://crash-reporter.internal/api/v1/crashes/latest
{
"crashes": [
{
"id": "crash-2025-01-04-abc123",
"fingerprint": "NULL_DEREF_process_request_server.c:234",
"timestamp": "2025-01-04T08:45:23Z",
"host": "web-01",
"executable": "/opt/myapp/server",
"signal": "SIGSEGV",
"backtrace": [
{"frame": 0, "function": "process_request", "file": "server.c", "line": 234},
{"frame": 1, "function": "handle_client", "file": "server.c", "line": 189}
],
"analysis": {
"crash_type": "null_pointer_dereference",
"variable": "req",
"recommendation": "Add null check for req parameter"
}
}
],
"total": 847,
"page": 1,
"per_page": 50
}
Collection Agent:
$ sudo crash-agent status
Crash Collection Agent v1.0.0
Status: Running
Server: https://crash-reporter.internal
Collected today: 12 crashes
Queue: 0 pending
$ # When a crash happens:
[2025-01-04 08:45:23] Detected new core dump: /var/lib/systemd/coredump/core.server.1234.xxx
[2025-01-04 08:45:24] Analyzing with GDB...
[2025-01-04 08:45:25] Generated report (fingerprint: NULL_DEREF_process_request_server.c:234)
[2025-01-04 08:45:26] Uploaded to server (crash-2025-01-04-abc123)
[2025-01-04 08:45:26] Core dump retained locally for 7 days
The Core Question You Are Answering
“How do you build a system that turns thousands of raw crashes into actionable insights?”
A crash dump is useless if it sits in /var/crash on one server. You need: collection (get dumps from everywhere), analysis (extract useful information), aggregation (group similar crashes), storage (keep them queryable), and presentation (show developers what matters). This is crash analysis at production scale.
Concepts You Must Understand First
- System Architecture
- Collection agents on each host
- Central server with API and storage
- Analysis workers (possibly distributed)
- Web frontend for humans
- Book Reference: “Designing Data-Intensive Applications” by Kleppmann — Ch. 1-2
- Crash Fingerprinting
- What identifies a unique bug? (crash location, call stack, signal)
- Hash functions for fingerprints
- Handling variability (ASLR, different call paths to same bug)
- Book Reference: Mozilla crash-stats documentation (online)
- Data Pipeline
- Collection: Watch for new core dumps, extract metadata
- Analysis: Run GDB, generate report, compute fingerprint
- Storage: Metadata in database, cores in object storage
- Book Reference: “Data Pipelines with Apache Airflow” patterns
- Scalability Considerations
- Rate limiting (prevent DoS from crash loops)
- Retention policies (can’t keep everything forever)
- Sampling (for very high volume)
- Book Reference: “Site Reliability Engineering” by Google
Questions to Guide Your Design
- Collection
- How will agents discover new crashes (inotify, polling, coredumpctl)?
- What happens if the network is down?
- How do you handle crash loops (same bug crashing repeatedly)?
- Analysis
- Will you analyze on the host or send raw cores to the server?
- How will you handle missing debug symbols?
- What timeout for analysis?
- Storage
- What metadata goes in the database?
- Where do you store actual core dumps (if at all)?
- What’s your retention policy?
- Fingerprinting
- What parts of the stack trace go into the fingerprint?
- How do you handle different code paths to the same bug?
- How do you detect when a bug is “fixed”?
- Interface
- What views do users need (top bugs, recent crashes, specific host)?
- How do you integrate with alerting (PagerDuty, Slack)?
- Can developers mark bugs as “known” or “fixed”?
Thinking Exercise
Design the Data Flow
Trace a crash from occurrence to dashboard:
Host (web-01) Central Server User
│ │ │
├──[1. Crash occurs] │ │
│ core.server.1234.xxx │ │
│ │ │
├──[2. Agent detects crash] │ │
│ inotify on /var/crash │ │
│ │ │
├──[3. Agent analyzes locally] │ │
│ GDB batch → JSON report │ │
│ │ │
├──[4. Agent POSTs to server]────►├──[5. Server receives] │
│ POST /api/v1/crashes │ Validates, stores │
│ {report, fingerprint} │ │
│ │ │
│ ├──[6. Fingerprint lookup] │
│ │ Existing bug or new? │
│ │ │
│ ├──[7. Update aggregates] │
│ │ Increment counter │
│ │ Update last_seen │
│ │ │
│ ├──[8. Trigger alerts] │
│ │ If new bug or spike │
│ │ │ │
│ │ └───────────────────►│
│ │ Slack/PagerDuty │
│ │ │
│ │ ◄──────────┤
│ │ [9. User views │
│ │ dashboard] │
Questions:
- What if step 4 fails (network issue)?
- What if the same crash happens 1000 times in 1 minute?
- How do you handle crashes from different versions of the same software?
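The first question is usually answered with a local spool directory: if the POST in step 4 fails, the agent writes the report to disk and retries later. A minimal sketch (the paths and helper names are illustrative, not a finished design):
import json
import os
import uuid

SPOOL_DIR = "/var/lib/crash-agent/queue"

def enqueue(report):
    """Persist a report locally when the upload fails."""
    os.makedirs(SPOOL_DIR, exist_ok=True)
    path = os.path.join(SPOOL_DIR, f"{uuid.uuid4()}.json")
    with open(path, "w") as f:
        json.dump(report, f)

def flush(upload):
    """Retry queued reports; `upload` returns True on a 2xx response."""
    if not os.path.isdir(SPOOL_DIR):
        return
    for name in os.listdir(SPOOL_DIR):
        path = os.path.join(SPOOL_DIR, name)
        with open(path) as f:
            report = json.load(f)
        if upload(report):
            os.remove(path)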
The Interview Questions They Will Ask
- “Design a crash reporting system for a company with 1000 servers.”
- “How would you implement crash fingerprinting to group similar crashes?”
- “What’s your retention strategy for crash dumps at scale?”
- “How do you handle a ‘crash storm’ where a bug causes thousands of crashes per minute?”
- “How would you integrate crash reporting with your deployment pipeline?”
- “What privacy concerns exist with crash dumps, and how do you address them?”
Hints in Layers
Hint 1: Start with the Agent Build the collection agent first—it’s the foundation:
# crash_agent.py - skeleton
import subprocess
import requests
import json

def watch_for_crashes():
    # Use coredumpctl or inotify to detect new crashes
    pass

def analyze_crash(core_path, exe_path):
    # Run GDB batch mode, return JSON report
    pass

def compute_fingerprint(report):
    # Hash crash location + top N frames
    pass

def upload_crash(report, fingerprint):
    # POST to central server
    pass
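One possible way to fill in the first stub is a polling watcher over systemd-coredump's spool directory (the directory path and the 5-second interval are assumptions; inotify or coredumpctl would serve the same purpose):
import os
import time

COREDUMP_DIR = "/var/lib/systemd/coredump"   # default systemd-coredump location

def watch_for_crashes(handle, interval=5):
    """Call handle(path) for every core file that appears after startup."""
    seen = set(os.listdir(COREDUMP_DIR))
    while True:
        current = set(os.listdir(COREDUMP_DIR))
        for name in sorted(current - seen):
            handle(os.path.join(COREDUMP_DIR, name))   # analyze + upload
        seen = current
        time.sleep(interval)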
Hint 2: Fingerprint Algorithm A simple but effective fingerprint:
import hashlib

def compute_fingerprint(report):
    components = [
        report['signal'],
        report['backtrace'][0]['function'],           # Crash function
        report['backtrace'][0].get('file', 'unknown'),
        report['backtrace'][0].get('line', 0),
    ]
    # Add a few more frames for uniqueness
    for frame in report['backtrace'][1:4]:
        components.append(frame['function'])
    fingerprint_string = '|'.join(str(c) for c in components)
    return hashlib.sha256(fingerprint_string.encode()).hexdigest()[:16]
Hint 3: Database Schema
-- Create bugs first so crashes can reference it
CREATE TABLE bugs (
    fingerprint VARCHAR(64) PRIMARY KEY,
    first_seen TIMESTAMP,
    last_seen TIMESTAMP,
    total_count INTEGER,
    status VARCHAR(32),        -- new, known, fixed
    title VARCHAR(255)         -- Human-readable summary
);

CREATE TABLE crashes (
    id UUID PRIMARY KEY,
    fingerprint VARCHAR(64),
    timestamp TIMESTAMP,
    host VARCHAR(255),
    executable VARCHAR(512),
    signal VARCHAR(32),
    report JSONB,              -- Full analysis report
    FOREIGN KEY (fingerprint) REFERENCES bugs(fingerprint)
);
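Steps 6 and 7 of the data-flow exercise above (fingerprint lookup, aggregate update) then reduce to an upsert against this schema. A sketch assuming psycopg2 and the column names above, where `crash` is a dict with the same keys as the API output:
import psycopg2
from psycopg2.extras import Json

def record_crash(conn, crash):
    """Insert one crash row and bump (or create) its bug aggregate."""
    with conn, conn.cursor() as cur:
        cur.execute("""
            INSERT INTO bugs (fingerprint, first_seen, last_seen, total_count, status)
            VALUES (%(fingerprint)s, %(timestamp)s, %(timestamp)s, 1, 'new')
            ON CONFLICT (fingerprint) DO UPDATE
               SET last_seen   = EXCLUDED.last_seen,
                   total_count = bugs.total_count + 1
            """, crash)
        cur.execute("""
            INSERT INTO crashes (id, fingerprint, timestamp, host, executable, signal, report)
            VALUES (%(id)s, %(fingerprint)s, %(timestamp)s, %(host)s,
                    %(executable)s, %(signal)s, %(report)s)
            """, {**crash, "report": Json(crash["report"])})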
Hint 4: Rate Limiting Crash Loops
import time
from collections import defaultdict

class CrashRateLimiter:
    def __init__(self, max_per_minute=10):
        self.max_per_minute = max_per_minute
        self.fingerprint_counts = defaultdict(list)

    def should_report(self, fingerprint):
        now = time.time()
        # Clean old entries (older than 60 seconds)
        self.fingerprint_counts[fingerprint] = [
            t for t in self.fingerprint_counts[fingerprint]
            if now - t < 60
        ]
        # Check limit
        if len(self.fingerprint_counts[fingerprint]) >= self.max_per_minute:
            return False
        self.fingerprint_counts[fingerprint].append(now)
        return True
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| System Design | “Designing Data-Intensive Applications” by Kleppmann | Ch. 1-3, 10-11 |
| API Design | “REST API Design Rulebook” by Masse | Throughout |
| Database Design | “Database Internals” by Petrov | Ch. 1-2 |
| Operations | “Site Reliability Engineering” by Google | Ch. 4, 12 |
Common Pitfalls and Debugging
Problem 1: “Agent can’t keep up with crash rate”
- Why: Analysis is slow, crashes queue up.
- Fix: Implement rate limiting, skip duplicates within time window, async analysis.
- Quick test: Monitor queue depth over time
Problem 2: “Fingerprints are too specific (every crash is ‘unique’)”
- Why: Including too much variable data (addresses, PIDs).
- Fix: Only use stable components (function names, file/line, signal). Strip addresses (see the normalization sketch below).
- Quick test: Same bug should produce same fingerprint across runs
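A small normalization pass along these lines helps; the regexes are illustrative, not exhaustive:

import re

def normalize_frame(function_name):
    """Strip the variable parts of a frame name so the same bug hashes the same way."""
    name = re.sub(r"\+0x[0-9a-fA-F]+$", "", function_name)  # drop offsets: func+0x1a -> func
    name = re.sub(r"0x[0-9a-fA-F]+", "ADDR", name)          # mask any remaining raw addresses
    name = re.sub(r"\d{4,}", "N", name)                     # mask long numbers (PIDs, request ids)
    return name

Feed normalize_frame(frame['function']) into compute_fingerprint instead of the raw frame name.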
Problem 3: “Database grows too fast”
- Why: Storing too much per crash, no retention.
- Fix: Store summary in DB, full report in object storage. Implement TTL.
- Quick test: Track DB size growth over time
Problem 4: “Can’t analyze crashes without debug symbols”
- Why: Agents might not have symbols for all software.
- Fix: Run a symbol server that agents can query (see the debuginfod sketch below). Fall back to an address-only fingerprint.
- Quick test: Test with stripped binary, verify graceful degradation
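If you do run a debuginfod-compatible symbol server, recent GDB releases can fetch missing symbols on demand; the agent only needs to point the batch run at it. A sketch; the server URL is hypothetical:

import os
import subprocess

def analyze_with_symbol_server(core_path, exe_path):
    """Let GDB pull missing debug info from the symbol server before backtracing."""
    env = dict(os.environ, DEBUGINFOD_URLS="https://debuginfod.example.internal")  # hypothetical URL
    cmd = [
        "gdb", "--batch", "--nx",
        "-ex", "set debuginfod enabled on",
        "-ex", "bt",
        exe_path, core_path,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, env=env, timeout=120).stdout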
Definition of Done
- Collection agent watches for and reports new crashes
- Central server receives and stores crash reports
- Fingerprinting correctly groups same bugs together
- Web interface shows crash list, bug list, and statistics
- API provides programmatic access to crash data
- Rate limiting prevents crash loop DoS
- Retention policy automatically removes old data
- Alerting integration (Slack webhook or similar) works
- Documentation covers deployment and configuration
- System tested with simulated crash load
Project Comparison Table
| # | Project | Difficulty | Time | Depth of Understanding | Fun Factor | Real-World Value |
|---|---|---|---|---|---|---|
| 1 | The First Crash — Core Dump Generation | Level 1: Beginner | 4-6 hours | ★★☆☆☆ Foundation | ★★★☆☆ | Essential baseline |
| 2 | The GDB Backtrace — Extracting Crash Context | Level 1: Beginner | 6-8 hours | ★★★☆☆ Practical | ★★★★☆ | Daily debugging |
| 3 | The Memory Inspector — Deep State Examination | Level 2: Intermediate | 10-15 hours | ★★★★☆ Deep | ★★★★☆ | Real investigation |
| 4 | Automated Crash Detective — GDB Scripting | Level 2: Intermediate | 15-20 hours | ★★★★☆ Automation | ★★★★★ | CI/CD integration |
| 5 | Multi-threaded Mayhem — Concurrent Crashes | Level 3: Advanced | 20-30 hours | ★★★★★ Expert | ★★★☆☆ | Production systems |
| 6 | The Stripped Binary Challenge | Level 3: Advanced | 15-25 hours | ★★★★★ Expert | ★★★★★ | Security/Forensics |
| 7 | Minidump Parser — Compact Crash Formats | Level 2: Intermediate | 15-20 hours | ★★★★☆ Format Mastery | ★★★★☆ | Cross-platform |
| 8 | Kernel Panic Anatomy — kdump Configuration | Level 3: Advanced | 20-30 hours | ★★★★★ Kernel-Level | ★★★☆☆ | System reliability |
| 9 | Analyzing Kernel Crashes with crash | Level 3: Advanced | 20-30 hours | ★★★★★ Kernel-Level | ★★★★☆ | Kernel debugging |
| 10 | Centralized Crash Reporter | Level 3: Advanced | 40-60 hours | ★★★★★ Architecture | ★★★★★ | Production essential |
Time Investment Summary
| Learning Path | Total Time | Projects |
|---|---|---|
| Minimum Viable | 10-14 hours | Projects 1, 2 |
| Working Professional | 40-60 hours | Projects 1-4 |
| Expert Track | 100-150 hours | Projects 1-6 |
| Full Mastery | 160-240 hours | All 10 projects |
Recommendation
If You Are New to Crash Analysis
Start with Project 1: The First Crash
This is non-negotiable. You need to understand:
- How core dumps are generated
- Where they’re stored
- What triggers their creation
Many developers have never seen a core dump because the core file size limit (ulimit -c) defaults to 0. Fix that first.
Then immediately do Project 2: The GDB Backtrace
This teaches you the 80% of crash debugging that covers 95% of real-world cases. Once you can read a backtrace and examine variables, you’re dangerous.
If You Already Debug with GDB
Start with Project 3: The Memory Inspector
Go deeper. Learn to examine arbitrary memory, understand heap corruption, and trace data structures. This separates the competent from the experts.
Then do Project 4: Automated Crash Detective
Automation pays dividends. A GDB Python script that runs on every crash in CI catches bugs before they hit production.
If You Work on Production Systems
Prioritize Projects 5 and 10
- Project 5 (Multi-threaded Mayhem) teaches you to debug the crashes that only happen under load.
- Project 10 (Centralized Crash Reporter) gives you visibility into crash patterns across your fleet.
If You’re a Kernel Developer or SRE
Projects 8 and 9 are essential
Kernel crashes are different. You need kdump configured, you need to know the crash utility, and you need to understand kernel data structures.
If You Work in Security or Forensics
Project 6: The Stripped Binary Challenge
Real-world malware and proprietary software rarely come with symbols. Learning to debug without them is a superpower.
Recommended Learning Order
For Most Developers: For Kernel/Systems:
┌──────────────────────┐ ┌──────────────────────┐
│ Project 1 (Foundation)│ │ Project 1 (Foundation)│
└───────────┬──────────┘ └───────────┬──────────┘
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ Project 2 (GDB Basics)│ │ Project 2 (GDB Basics)│
└───────────┬──────────┘ └───────────┬──────────┘
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ Project 3 (Deep Mem) │ │ Project 8 (kdump) │
└───────────┬──────────┘ └───────────┬──────────┘
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ Project 4 (Automation)│ │ Project 9 (crash) │
└───────────┬──────────┘ └───────────┬──────────┘
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ Project 5 (Threading)│ │ Project 5 (Threading)│
└───────────┬──────────┘ └───────────┬──────────┘
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ Project 10 (Reporter)│ │ Project 10 (Reporter)│
└──────────────────────┘ └──────────────────────┘
Final Overall Project: Production Crash Forensics Platform
The Goal
Combine everything you’ve learned into a complete crash forensics platform that could be deployed in a real production environment.
System Name: CrashLens
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ CrashLens Platform │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Collection │ │ Collection │ │ Collection │ │
│ │ Agent (Host 1) │ │ Agent (Host 2) │ │ Agent (Host N) │ │
│ │ │ │ │ │ │ │
│ │ • systemd hook │ │ • systemd hook │ │ • systemd hook │ │
│ │ • minidump gen │ │ • minidump gen │ │ • minidump gen │ │
│ │ • symbol fetch │ │ • symbol fetch │ │ • symbol fetch │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ └──────────────────────┼──────────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Message Queue (Redis/RabbitMQ) │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────┴───────────────────────┐ │
│ ▼ ▼ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ Analysis Workers │ │ Symbol Server │ │
│ │ │◄───────────────│ │ │
│ │ • GDB Python automation │ symbols │ • debuginfod-compatible│ │
│ │ • Fingerprint generation│ │ • Build ID indexed │ │
│ │ • Stack unwinding │ │ • S3 backend │ │
│ │ • Minidump parsing │ │ │ │
│ └───────────┬─────────────┘ └─────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ PostgreSQL Database │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ crashes │ │ bugs │ │ binaries │ │ users │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ • metadata │ │ • fingerp. │ │ • build_id │ │ • API keys │ │ │
│ │ │ • frames │ │ • status │ │ • symbols │ │ • teams │ │ │
│ │ │ • memory │ │ • issue_id │ │ • version │ │ • alerts │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────┴───────────────────┐ │
│ ▼ ▼ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ REST API │ │ Web Dashboard │ │
│ │ │ │ │ │
│ │ • Crash submission │ │ • Bug list view │ │
│ │ • Query/search │ │ • Crash timeline │ │
│ │ • Bug management │ │ • Stack trace viewer │ │
│ │ • Integration webhooks │ │ • Trend analysis │ │
│ └─────────────────────────┘ │ • Issue tracker link │ │
│ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Implementation Phases
Phase 1: Foundation (Projects 1, 2, 10)
- Implement basic collection agent using systemd-coredump hooks
- Create central database schema for crashes and bugs
- Build a simple REST API for crash submission (a minimal endpoint sketch follows this list)
- Implement basic fingerprinting from stack traces
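For the Phase 1 API, a single submission endpoint is enough to get agents reporting. A minimal sketch using Flask (one possible choice, not a requirement); the in-memory dict stands in for the real database layer:

import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)
CRASHES = {}  # in-memory stand-in for the PostgreSQL layer

@app.route("/api/v1/crashes", methods=["POST"])
def submit_crash():
    payload = request.get_json(silent=True)
    if not payload or "fingerprint" not in payload or "report" not in payload:
        return jsonify({"error": "fingerprint and report are required"}), 400
    crash_id = str(uuid.uuid4())
    CRASHES[crash_id] = payload  # swap for the real database insert
    return jsonify({"id": crash_id}), 201

if __name__ == "__main__":
    app.run(port=8080)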
Phase 2: Analysis Engine (Projects 3, 4, 6)
- Create GDB Python scripts for automated analysis (a frame-walking sketch follows this list)
- Implement minidump generation for reduced storage
- Handle stripped binaries with graceful degradation
- Add memory region extraction for heap analysis
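For the GDB Python scripts in Phase 2, the core of an analysis worker can be a short frame-walking script run as gdb --batch -x frames.py <exe> <core> (frames.py is a hypothetical file name; the gdb module only exists inside GDB):

# frames.py - runs inside GDB, which provides the gdb module
import json

import gdb

frames = []
frame = gdb.newest_frame()
while frame is not None and len(frames) < 32:
    sal = frame.find_sal()
    frames.append({
        "function": frame.name() or "??",
        "file": sal.symtab.filename if sal and sal.symtab else "unknown",
        "line": sal.line if sal else 0,
    })
    frame = frame.older()

print(json.dumps({"backtrace": frames}, indent=2))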
Phase 3: Multi-threading and Advanced (Project 5)
- Extend fingerprinting to handle concurrent crashes
- Detect and flag potential race conditions
- Implement deadlock detection in crash analysis
- Add thread state comparison tools
Phase 4: Kernel Support (Projects 8, 9)
- Add kdump collection support for kernel panics
- Implement crash utility integration for kernel analysis (a batch-mode sketch follows this list)
- Create kernel-specific fingerprinting
- Handle vmcore files in analysis pipeline
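For the Phase 4 pipeline, the crash utility can be scripted much like GDB: feed it commands non-interactively and parse the text it prints. A sketch; piping commands over stdin is an assumption here, and vmlinux must be the debug build matching the crashed kernel:

import subprocess

def analyze_vmcore(vmlinux_path, vmcore_path):
    """Run a few basic triage commands against a vmcore and return the raw output."""
    commands = "sys\nbt\nlog\nquit\n"  # system summary, panic backtrace, kernel log
    result = subprocess.run(
        ["crash", vmlinux_path, vmcore_path],
        input=commands, capture_output=True, text=True, timeout=300,
    )
    return result.stdout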
Phase 5: Production Polish (Project 7)
- Implement Breakpad minidump format support
- Add cross-platform crash ingestion (Windows, macOS)
- Create symbol server with debuginfod compatibility
- Build trend analysis and anomaly detection
Success Criteria
Your CrashLens platform is complete when:
- Collection: Agents automatically capture crashes from 10+ hosts
- Analysis: Crashes are analyzed within 60 seconds of occurrence
- Fingerprinting: Same bugs are correctly grouped (verify with synthetic crashes)
- Symbols: Debug symbols are automatically fetched for known binaries
- Dashboard: Web UI shows crash list, bug trends, and drill-down views
- API: Integration with issue trackers (GitHub/Jira) works
- Kernel: Kernel panics are captured and analyzed via kdump
- Scale: System handles 1000+ crashes/day without degradation
- Retention: Old data is automatically aged out per policy
- Documentation: Deployment guide covers all components
Stretch Goals
- Machine Learning: Classify crashes by root cause
- Automated Bisect: Integrate with git to find regressing commits
- Live Debugging: Connect to running process from dashboard
- Distributed Tracing: Correlate crashes with request traces
- Cost Analysis: Estimate business impact of each bug
From Learning to Production: What Is Next
After completing these projects, you understand crash dump analysis deeply. Here’s how your skills map to production tools and what gaps remain:
| Your Project | Production Equivalent | Gap to Fill |
|---|---|---|
| Project 1: Core Dump Generation | systemd-coredump, apport | Multi-distro configuration, storage policies |
| Project 2: GDB Backtrace | GDB, LLDB | IDE integration, remote debugging |
| Project 3: Memory Inspector | Valgrind, ASan, MSan | Runtime instrumentation, sanitizers |
| Project 4: GDB Scripting | Mozilla rr, Pernosco | Record/replay debugging |
| Project 5: Multi-threaded | Helgrind, ThreadSanitizer | Data race detection |
| Project 6: Stripped Binaries | Ghidra, IDA Pro, Binary Ninja | Full reverse engineering |
| Project 7: Minidump Parser | Breakpad, Crashpad | Client library integration |
| Project 8: kdump | Red Hat Crash, Oracle kdump | Enterprise kernel support |
| Project 9: crash Utility | crash + extensions | Custom crash plugins |
| Project 10: Crash Reporter | Sentry, Backtrace.io, Raygun | SaaS scale, ML classification |
Career Applications
Site Reliability Engineering (SRE)
- Your Project 10 skills directly apply to building observability infrastructure
- Project 8/9 kernel skills are essential for infrastructure debugging
- Fingerprinting knowledge helps reduce alert fatigue
Security Engineering
- Project 6 stripped binary skills are core to malware analysis
- Memory examination skills from Project 3 help in exploit development
- Core dump analysis is essential for incident response
Systems Programming
- Every project contributes to building robust, debuggable systems
- Understanding crash formats helps design better error handling
- Automation skills from Project 4 integrate into CI/CD pipelines
Kernel Development
- Projects 8/9 are essential prerequisites
- Understanding of ELF format and memory layout is foundational
- Crash analysis is a daily activity for kernel developers
Summary
This learning path covers Linux Crash Dump Analysis through 10 hands-on projects, taking you from basic core dump generation to building a production-grade crash reporting platform.
| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | The First Crash — Core Dump Generation | C | Level 1: Beginner | 4-6 hours |
| 2 | The GDB Backtrace — Extracting Crash Context | C | Level 1: Beginner | 6-8 hours |
| 3 | The Memory Inspector — Deep State Examination | C | Level 2: Intermediate | 10-15 hours |
| 4 | Automated Crash Detective — GDB Scripting | Python | Level 2: Intermediate | 15-20 hours |
| 5 | Multi-threaded Mayhem — Analyzing Concurrent Crashes | C | Level 3: Advanced | 20-30 hours |
| 6 | The Stripped Binary Challenge — Debugging Without Symbols | C/Assembly | Level 3: Advanced | 15-25 hours |
| 7 | Minidump Parser — Understanding Compact Crash Formats | C/Python | Level 2: Intermediate | 15-20 hours |
| 8 | Kernel Panic Anatomy — Triggering and Capturing with kdump | C | Level 3: Advanced | 20-30 hours |
| 9 | Analyzing Kernel Crashes with the crash Utility | C | Level 3: Advanced | 20-30 hours |
| 10 | Centralized Crash Reporter — Production-Grade Infrastructure | Python | Level 3: Advanced | 40-60 hours |
Recommended Learning Paths
For Beginners: Start with Projects 1 → 2 → 3 → 4
For Intermediate Developers: Projects 1 → 2 → 4 → 5 → 10
For Kernel/Systems Engineers: Projects 1 → 2 → 8 → 9 → 5
For Security Professionals: Projects 1 → 2 → 3 → 6 → 7
Expected Outcomes
After completing these projects, you will:
- Generate and configure core dumps on any Linux system with confidence
- Navigate GDB fluently using commands like bt, frame, print, x, and info
- Examine memory in depth including heap structures, stack frames, and data corruption
- Automate crash analysis using GDB’s Python API for CI/CD integration
- Debug multi-threaded crashes including race conditions and deadlocks
- Analyze stripped binaries using assembly-level debugging techniques
- Parse minidump formats for compact, portable crash representation
- Configure and use kdump to capture kernel panics
- Navigate kernel crashes using the crash utility and kernel data structures
- Design and build production crash reporting infrastructure
The Deeper Understanding
Beyond the technical skills, you’ll understand:
- Why crashes happen: Not just “null pointer,” but the architectural reasons software fails
- What memory really is: A flat array of bytes with conventions layered on top
- How debuggers work: They’re not magic—they read the same files you can read
- Why symbols matter: And how to work without them when necessary
- How to think forensically: Reconstructing what happened from incomplete evidence
You’ll have built 10+ working projects that demonstrate deep understanding of crash dump analysis from first principles.
Additional Resources and References
Standards and Specifications
- ELF Format: System V ABI and Linux Extensions
- DWARF Debugging Format: DWARF 5 Standard
- Breakpad Minidump Format: Google Breakpad Documentation
- x86-64 ABI: System V AMD64 ABI
Official Documentation
- GDB Manual: https://sourceware.org/gdb/current/onlinedocs/gdb/
- systemd-coredump: https://www.freedesktop.org/software/systemd/man/systemd-coredump.html
- Linux Kernel kdump: https://www.kernel.org/doc/html/latest/admin-guide/kdump/kdump.html
- crash Utility: https://crash-utility.github.io/
Books (Essential Reading)
| Book | Author | Why It Matters |
|---|---|---|
| Computer Systems: A Programmer’s Perspective | Bryant & O’Hallaron | Foundation for understanding memory, processes, and linking |
| The Linux Programming Interface | Michael Kerrisk | Comprehensive coverage of signals, process memory, and core dumps |
| Linux Kernel Development | Robert Love | Essential for kernel crash analysis |
| Understanding the Linux Kernel | Bovet & Cesati | Deep dive into kernel internals |
| Debugging with GDB | Richard Stallman et al. | Official GDB documentation in book form |
| Practical Binary Analysis | Dennis Andriesse | Reverse engineering and binary formats |
| Expert C Programming | Peter van der Linden | Deep C knowledge for understanding crashes |
Online Resources
- Julia Evans’ Debugging Zines: https://wizardzines.com/ — Approachable visual guides
- Brendan Gregg’s Blog: https://www.brendangregg.com/ — Performance and debugging
- LWN.net Kernel Articles: https://lwn.net/ — In-depth kernel coverage
- GDB Dashboard: https://github.com/cyrus-and/gdb-dashboard — Enhanced GDB interface
Tools Referenced
| Tool | Purpose | Installation |
|---|---|---|
| GDB | GNU Debugger | apt install gdb |
| LLDB | LLVM Debugger | apt install lldb |
| coredumpctl | systemd core dump management | Part of systemd |
| crash | Kernel crash analysis | apt install crash |
| Breakpad | Client crash reporting | Build from source |
| debuginfod | Symbol server | apt install debuginfod |
| objdump | Binary examination | Part of binutils |
| readelf | ELF file analysis | Part of binutils |
| addr2line | Address to source mapping | Part of binutils |
Community and Help
- GDB Mailing List: https://sourceware.org/gdb/mailing-lists/
- Stack Overflow Tags: [gdb], [core-dump], [crash-dump]
- Reddit: r/linux, r/linuxadmin, r/ReverseEngineering
- Linux Kernel Mailing List (LKML): For kernel crash questions
This learning path was designed to take you from zero knowledge to expert-level crash dump analysis skills. The projects are ordered to build on each other, with each one adding new concepts and reinforcing what came before. Complete them all, and you’ll have skills that set you apart in systems programming, site reliability engineering, security research, or kernel development.