Project 5: Multi-threaded Mayhem
Master the art of debugging concurrent crashes by analyzing core dumps from multi-threaded programs with data races, learning to correlate thread states and identify the root cause of elusive race conditions.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1-2 weeks |
| Language | C (with pthreads) |
| Prerequisites | Project 3, threading concepts, basic GDB |
| Key Topics | pthreads, data races, memory visibility, GDB thread debugging |
1. Learning Objectives
By completing this project, you will be able to:
- Create controlled data races that reliably cause crashes for analysis
- View all threads in a crashed process using `info threads`
- Switch between thread contexts using `thread <N>` to examine each thread’s state
- Generate complete backtraces for all threads with `thread apply all bt`
- Correlate thread states to identify which thread caused the corruption
- Understand memory visibility issues between threads without proper synchronization
- Debug production-like concurrency bugs from post-mortem analysis alone
- Recognize common patterns of race condition crashes in core dumps
2. Theoretical Foundation
2.1 Core Concepts
What is a Data Race?
A data race occurs when:
- Two or more threads access the same memory location
- At least one access is a write
- The accesses are not synchronized (no locks, atomics, or memory barriers)
Data Race Scenario:
Shared Memory: g_ptr = 0x1234 (valid pointer)
Time ─────────────────────────────────────────────────────────────────►
Thread 1 (Writer): Thread 2 (Reader):
... ...
g_ptr = NULL; ─────────────► value = *g_ptr; // May CRASH!
// Writes NULL // Reads g_ptr, then dereferences it
// NULL read -> SIGSEGV; stale 0x1234 read -> no crash
... ...
The reader may see the OLD value (0x1234) or the NEW value (NULL)
depending on timing, CPU caching, and compiler optimizations.
Memory Visibility Without Synchronization
Modern CPUs and compilers aggressively optimize code. Without explicit synchronization:
What You Write: What Actually Executes:
// Thread 1 // Thread 1 (optimized)
data = 42; ready = true; // REORDERED!
ready = true; data = 42;
// Thread 2 // Thread 2
if (ready) { if (ready) {
use(data); // Might see 0! use(data); // data not set yet!
} }
The CPU store buffer, cache coherence protocols, and compiler reordering can all cause one thread to see an inconsistent view of memory.
The Thread Model in Linux
Each thread in a Linux process shares:
- Address space: All threads see the same virtual memory
- File descriptors: Open files, sockets, pipes
- Signal handlers: Though signal masks can differ
- Process ID: All threads share the same PID
Each thread has its own:
- Thread ID (TID): Unique within the process
- Stack: Each thread gets its own stack
- Registers: Including the instruction pointer (RIP)
- Signal mask: Which signals are blocked
- errno: Thread-local storage for error codes
Process Memory Layout with Threads:
┌───────────────────────────────────────────────┐
│ KERNEL │
├───────────────────────────────────────────────┤
│ STACK (Thread 1) │ ← Each thread
│ Local variables, return addresses │ has its own
├───────────────────────────────────────────────┤ stack
│ STACK (Thread 2) │
│ Local variables, return addresses │
├───────────────────────────────────────────────┤
│ STACK (Thread 3) │
│ Local variables, return addresses │
├───────────────────────────────────────────────┤
│ ↓ ↓ ↓ │
│ (free space) │
│ ↑ ↑ ↑ │
├───────────────────────────────────────────────┤
│ HEAP │ ← SHARED
│ malloc'd data, global pointers │ by all
├───────────────────────────────────────────────┤ threads
│ .BSS (uninitialized) │ ← SHARED
│ Global variables like g_data │
├───────────────────────────────────────────────┤
│ .DATA (initialized) │ ← SHARED
│ Global variables with initial values │
├───────────────────────────────────────────────┤
│ .TEXT (code) │ ← SHARED
│ Executable instructions │
└───────────────────────────────────────────────┘
What Happens During a Multi-threaded Crash
When any thread triggers a fatal signal (like SIGSEGV):
- The kernel delivers the signal to the faulting thread
- The default action (terminate and dump core) applies to the whole process, so the kernel stops ALL threads
- Each thread’s state is frozen at its current instruction
- The core dump captures:
  - All thread stacks
  - All thread register sets
  - Shared memory (heap, data sections)
  - Thread-local storage
This means a core dump is a snapshot of the entire process, not just the crashing thread.
2.2 Why This Matters
Single-threaded debugging is straightforward: the backtrace shows the exact sequence of events. Multi-threaded debugging is fundamentally different:
The thread that crashes is often not the thread that caused the bug.
Consider:
- Thread A sets a pointer to NULL
- Thread B reads that pointer and crashes
The backtrace in Thread B shows where the crash occurred, but the root cause is in Thread A. Without examining all threads, you’ll never find it.
In practice:
- Race conditions account for a large share of hard-to-reproduce bugs in multi-threaded systems
- Data races are a common cause of intermittent failures
- Most race conditions don’t crash immediately; they corrupt data silently
2.3 Historical Context
The pthread (POSIX Threads) API was standardized in 1995 (POSIX.1c). Before this:
- Multi-threading was vendor-specific and non-portable
- Each Unix variant had its own threading model
- Programs had to be rewritten for each platform
The Linux threading implementation evolved significantly:
- LinuxThreads (1996-2003): Original implementation, many compatibility issues
- NPTL (2003-present): Native POSIX Thread Library, 1:1 thread-to-kernel mapping
GDB’s threading support has improved dramatically:
- Early GDB had minimal threading awareness
- Modern GDB can attach to processes with thousands of threads
- Thread-specific breakpoints and watchpoints
- Non-stop mode for debugging without stopping all threads
2.4 Common Misconceptions
Misconception 1: “My program works, so there are no races”
Reality: Data races are timing-dependent. A program can run correctly millions of times, then fail once under heavy load. The absence of crashes doesn’t mean the absence of races.
Misconception 2: “The crashing thread is always at fault”
Reality: In multi-threaded programs, Thread A can corrupt data that Thread B uses later. The crash happens in B, but the bug is in A.
Misconception 3: “Adding a mutex everywhere fixes races”
Reality: Incorrect mutex usage causes deadlocks. Over-synchronization destroys performance. The goal is correct synchronization, not more synchronization.
Misconception 4: “Volatile keyword prevents data races”
Reality: volatile only stops the compiler from caching or eliding accesses to the variable. It does NOT provide atomicity or memory ordering between threads. Use proper synchronization primitives.
// WRONG: volatile doesn't prevent races
volatile int counter = 0;
void thread_func(void) {
    counter++; // Still a race! Read-modify-write is not atomic
}
// CORRECT: Use atomics
#include <stdatomic.h>
_Atomic int counter = 0;
void thread_func(void) {
    atomic_fetch_add(&counter, 1); // Atomic operation
}
3. Project Specification
3.1 What You Will Build
You will create a C program called mt_crash.c that:
- Spawns multiple threads with clearly defined roles (writer, reader, main)
- Contains a deliberate data race on a shared pointer
- Crashes predictably in a thread other than the one causing the corruption
- Demonstrates thread analysis techniques in GDB
You will then analyze the resulting core dump to:
- View all threads and their states
- Switch between threads to examine their context
- Identify the “guilty” thread by correlating state
- Understand the timeline of events that led to the crash
3.2 Functional Requirements
FR1: Program Structure
- Create at least 3 threads: main, writer_thread, reader_thread
- Use pthreads API for thread creation and management
- Compile with `-g -pthread` for debug symbols and threading support
FR2: Data Race Implementation
- Declare a global pointer initialized to valid memory
- Writer thread modifies the pointer after a delay
- Reader thread or main thread dereferences the pointer
- No synchronization between threads (deliberate race)
FR3: Crash Behavior
- The crash must occur in a thread OTHER than the writer
- The crash must be a SIGSEGV (NULL pointer dereference)
- Timing should be controlled via sleep() to make race predictable
FR4: Debug Information
- Compile with the `-g` flag for full debug symbols
- Include meaningful function and variable names
- Add comments marking the race condition
3.3 Non-Functional Requirements
NFR1: Reproducibility
- The crash should occur reliably (>90% of runs)
- Sleep timings should be adjustable
- Should work on any modern Linux system
NFR2: Educational Clarity
- Code should be simple and readable
- Each thread should have a distinct, named function
- Variable names should clearly indicate shared state
NFR3: Analysis Friendliness
- Thread functions should not be inlined
- Global variables should have descriptive names
- Stack depth should be >1 frame for meaningful backtraces
3.4 Example Usage / Output
Compiling and Running:
$ gcc -g -pthread -o mt_crash mt_crash.c
$ ulimit -c unlimited
$ ./mt_crash
Main thread starting...
Writer thread starting...
Reader thread starting...
Main thread accessing shared data...
Segmentation fault (core dumped)
GDB Analysis Session:
$ gdb ./mt_crash core.12345
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7f4e8b200740 (LWP 12345) 0x000055555555519a in main ()
at mt_crash.c:45
2 Thread 0x7f4e8aa00700 (LWP 12346) 0x00007f4e8a9b7360 in writer_thread ()
at mt_crash.c:18
3 Thread 0x7f4e8a1ff700 (LWP 12347) 0x00007f4e8aab5d95 in reader_thread ()
at mt_crash.c:28
(gdb) thread apply all bt
Thread 3 (Thread 0x7f4e8a1ff700 (LWP 12347)):
#0 0x00007f4e8aab5d95 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f4e8aab5c3e in sleep () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x0000555555555175 in reader_thread (arg=0x0) at mt_crash.c:28
#3 0x00007f4e8a800ea7 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00007f4e8aabe9cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
Thread 2 (Thread 0x7f4e8aa00700 (LWP 12346)):
#0 0x00007f4e8a9b7360 in __GI___clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f4e8aab5c3e in sleep () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x0000555555555152 in writer_thread (arg=0x0) at mt_crash.c:18
#3 0x00007f4e8a800ea7 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00007f4e8aabe9cf in clone () from /lib/x86_64-linux-gnu/libc.so.6
Thread 1 (Thread 0x7f4e8b200740 (LWP 12345)):
#0 0x000055555555519a in main () at mt_crash.c:45
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f4e8aa00700 (LWP 12346))]
#0 0x00007f4e8a9b7360 in __GI___clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) frame 2
#2 0x0000555555555152 in writer_thread (arg=0x0) at mt_crash.c:18
18 g_shared_ptr = NULL; // THE CULPRIT!
(gdb) p g_shared_ptr
$1 = (char *) 0x0
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f4e8b200740 (LWP 12345))]
#0 0x000055555555519a in main () at mt_crash.c:45
45 char c = *g_shared_ptr; // CRASH HERE - dereferencing NULL
3.5 Real World Outcome
After completing this project, you will be able to:
- Triage production crashes from multi-threaded applications
- Identify data races by correlating thread states in core dumps
- Navigate complex thread interactions using GDB’s thread commands
- Distinguish symptoms from causes in concurrent bug reports
- Write effective bug reports that identify the actual race condition
4. Solution Architecture
4.1 High-Level Design
Program Flow:
┌──────────────────────────────────────────────────────────────────────┐
│ MAIN THREAD │
│ │
│ 1. Initialize g_shared_ptr = "Hello" │
│ 2. Create writer_thread │
│ 3. Create reader_thread │
│ 4. sleep(2) ─────────────────────────────────────────────┐ │
│ 5. Access *g_shared_ptr ← CRASH HERE │ │
│ │ │
└─────────────────────────────────────────────────────────────┼─────────┘
│
│ Time
│
┌─────────────────────────────────────────────────────────────┼─────────┐
│ WRITER THREAD │ │
│ │ │
│ 1. sleep(1) ◄─────────────────────────────────────────────┘ │
│ 2. g_shared_ptr = NULL ← ROOT CAUSE │
│ 3. sleep(5) (to keep thread alive in dump) │
│ │
└────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────┐
│ READER THREAD │
│ │
│ 1. sleep(10) (just to be present in dump) │
│ │
└────────────────────────────────────────────────────────────────────────┘
Timeline:
─────────────────────────────────────────────────────────────────────────►
0s 1s 2s 3s
│ │ │ │
│ │ │ │
├─ Main: │ │ │
│ Create │ │ │
│ threads │ ├─ Main wakes │
│ │ │ from sleep │
│ │ │ │
│ ├─ Writer wakes │ Accesses │
│ │ from sleep │ g_shared_ptr │
│ │ │ │
│ │ Sets ptr │ CRASH! │
│ │ to NULL │ (NULL deref) │
│ │ │ │
4.2 Key Components
Component 1: Global Shared State
char *g_shared_ptr = NULL; // The shared variable
char g_data[] = "Hello, World!"; // Initial valid data
Component 2: Writer Thread
- Waits for 1 second (ensures main thread is blocked)
- Sets `g_shared_ptr` to NULL
- Remains alive (sleeping) to appear in core dump
Component 3: Reader Thread
- Simply sleeps to demonstrate multiple threads
- Shows how to view “innocent” threads in dump
Component 4: Main Thread
- Initializes shared pointer to valid data
- Spawns worker threads
- Waits for 2 seconds (longer than writer’s 1-second delay)
- Dereferences the now-NULL pointer (CRASH)
4.3 Data Structures
// The race is on this single global pointer
char *g_shared_ptr;
// Thread arguments (not strictly needed, but good practice)
typedef struct {
int thread_id;
const char *name;
} thread_args_t;
// pthread types
pthread_t writer_tid;
pthread_t reader_tid;
4.4 Algorithm Overview
ALGORITHM: Creating a Controlled Data Race
INPUT: None
OUTPUT: Core dump with multi-threaded crash
1. INITIALIZE:
g_shared_ptr = &g_data[0] // Point to valid data
2. CREATE WRITER THREAD:
writer_thread():
sleep(1) // Wait for main to sleep
g_shared_ptr = NULL // Corrupt shared state
sleep(5) // Stay alive for dump
3. CREATE READER THREAD:
reader_thread():
sleep(10) // Just be present
4. MAIN THREAD CONTINUES:
sleep(2) // Wait longer than writer's delay
char c = *g_shared_ptr // Dereference (now NULL) -> CRASH
5. KERNEL HANDLES SIGSEGV:
- Stop all threads
- Save all thread states
- Write core dump
- Terminate process
5. Implementation Guide
5.1 Development Environment Setup
Required Packages:
# Debian/Ubuntu
sudo apt-get install build-essential gdb
# Fedora/RHEL
sudo dnf install gcc gdb
# Verify installations
gcc --version
gdb --version
Enable Core Dumps:
# Allow unlimited core file size
ulimit -c unlimited
# Check the setting
ulimit -c
# Optionally, set core file naming pattern
sudo sh -c 'echo "core.%e.%p" > /proc/sys/kernel/core_pattern'
Compiler Flags:
# Essential flags for this project:
# -g : Include debug symbols (CRITICAL for meaningful backtraces)
# -pthread : Enable POSIX threads support
# -O0 : Disable optimizations (prevents inlining, reordering)
gcc -g -pthread -O0 -o mt_crash mt_crash.c
5.2 Project Structure
mt_crash/
├── mt_crash.c # Main source file
├── Makefile # Build configuration
├── run_crash.sh # Helper script to generate crash
└── analyze.gdb # GDB commands for analysis
Suggested Makefile:
CC = gcc
CFLAGS = -g -pthread -O0 -Wall
mt_crash: mt_crash.c
	$(CC) $(CFLAGS) -o $@ $<
clean:
	rm -f mt_crash core.*
.PHONY: clean
5.3 The Core Question You’re Answering
“When a crash occurs in one thread, how do I find out if another thread caused the problem?”
This is the fundamental question of multi-threaded debugging. The techniques you learn here—viewing all threads, switching contexts, correlating state—are the same techniques used to debug the most complex concurrent systems.
5.4 Concepts You Must Understand First
Before starting implementation, verify you can answer these questions:
- Thread Creation: How does `pthread_create()` work? What is the signature of a thread function?
- Reference: “The Linux Programming Interface” Ch. 29
- Shared Memory: What memory is shared between threads? What is thread-local?
- Reference: “Computer Systems: A Programmer’s Perspective” Ch. 12
- Race Conditions: What makes a data race? Why are they dangerous?
- Reference: “Rust for Rustaceans” Ch. 10, “Concurrency (and Parallelism)” (useful even for non-Rust programmers)
- GDB Basics: Can you load a core dump in GDB? Get a backtrace? Print variables?
- Reference: Project 2 and 3 of this course
- Process/Thread Model: What is a LWP (Light Weight Process)? How does the kernel see threads?
- Reference: “Understanding the Linux Kernel” Ch. 3
5.5 Questions to Guide Your Design
About the Data Race:
- What shared variable will you race on?
- How will you ensure one thread writes before the other reads?
- What happens if the timing doesn’t work as expected?
About Thread Lifetimes:
- How long should each thread sleep?
- What happens if a thread exits before the crash?
- How do you ensure all threads appear in the core dump?
About Analysis:
- What evidence will show which thread wrote the NULL?
- Can you tell from the core dump what the pointer’s value was before the write?
- What local variables will help identify each thread’s state?
5.6 Thinking Exercise
Before writing any code, trace through this scenario mentally:
Time 0ms: main() initializes g_shared_ptr = &g_data
Time 10ms: main() creates writer_thread
Time 20ms: main() creates reader_thread
Time 30ms: main() calls sleep(2)
writer_thread() calls sleep(1)
reader_thread() calls sleep(10)
Time 1030ms: writer_thread() wakes up
Time 1031ms: writer_thread() executes: g_shared_ptr = NULL
Time 2030ms: main() wakes up
Time 2031ms: main() executes: char c = *g_shared_ptr
*g_shared_ptr is now *NULL -> SIGSEGV!
Questions to consider:
- At the moment of crash, what will each thread’s instruction pointer point to?
- What local variables will be visible in each thread’s stack frame?
- If you examine `g_shared_ptr` in GDB, what will you see?
5.7 Hints in Layers
Hint 1: Starting Point (Conceptual Direction)
Start with a minimal pthread program that just creates threads and has them print messages. Verify that you can create multiple threads and they all run. Then add the shared variable and the race.
Hint 2: Next Level (More Specific Guidance)
The key insight is timing. The writer thread must:
- Sleep LESS time than the main thread
- Write to the pointer BEFORE main wakes up
- Stay alive AFTER the crash (by sleeping again)
Consider: writer sleeps 1s, main sleeps 2s. This gives a 1-second window where the pointer is NULL before main tries to use it.
Hint 3: Technical Details (Approach/Pseudocode)
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

// Global shared state - no synchronization!
char g_data[] = "Valid data";
char *g_shared_ptr = NULL;

void *writer_thread(void *arg) {
    (void)arg;
    sleep(1); // Let main thread sleep first
    g_shared_ptr = NULL; // THE BUG: unsynchronized write
    sleep(5); // Stay alive for core dump
    return NULL;
}

void *reader_thread(void *arg) {
    (void)arg;
    sleep(10); // Just be present
    return NULL;
}

int main(void) {
    g_shared_ptr = g_data; // Initialize to valid pointer
    pthread_t writer, reader;
    pthread_create(&writer, NULL, writer_thread, NULL);
    pthread_create(&reader, NULL, reader_thread, NULL);
    sleep(2); // Sleep LONGER than writer's delay
    // By now, writer has set g_shared_ptr = NULL
    char c = *g_shared_ptr; // CRASH: NULL dereference
    printf("Read: %c\n", c); // Never reached; keeps c from being unused
    return 0;
}
Hint 4: Tools/Debugging (Verification Methods)
After the crash, verify your analysis with these GDB commands:
# 1. See all threads
info threads
# 2. Get ALL backtraces at once
thread apply all bt
# 3. Switch to the writer thread (usually thread 2)
thread 2
# 4. Select the writer_thread frame (frame numbers can vary; check the backtrace)
frame 2
# 5. Print the shared pointer
p g_shared_ptr
# 6. Look at assembly to confirm the NULL write
disassemble writer_thread
5.8 The Interview Questions They’ll Ask
Question 1: “How do you debug a race condition that only happens in production?”
Expected answer: Use core dump analysis. Even if you can’t reproduce the crash, the core dump captures the state of all threads. By examining each thread’s stack and variables, you can often identify which thread caused the corruption.
Question 2: “What GDB commands are essential for multi-threaded debugging?”
Expected answer:
- `info threads` - List all threads
- `thread N` - Switch to thread N
- `thread apply all bt` - Backtrace for ALL threads
- `thread apply all p variable` - Print a variable in all thread contexts
Question 3: “The crash is in thread A, but you suspect thread B caused it. How do you prove this?”
Expected answer: Switch to thread B with `thread <N>`, where N is its GDB thread number. Examine its local variables and the shared state it accessed. Look for evidence of the corrupting write in its stack frame or in the current value of shared variables.
Question 4: “What’s the difference between a race condition and a deadlock?”
Expected answer:
- Race condition: Threads compete for shared resources without proper synchronization, leading to unpredictable results or crashes.
- Deadlock: Threads wait for each other indefinitely, causing the program to hang (not crash).
In a core dump, a race condition shows threads running/crashed, while a deadlock shows threads blocked on lock acquisition.
Question 5: “How can you prevent this class of bugs?”
Expected answer:
- Use mutexes/locks to protect shared state
- Use atomic operations for simple counters
- Prefer message passing over shared memory
- Use thread sanitizer (TSan) during development
- Design with immutability where possible
5.9 Books That Will Help
| Topic | Book & Chapter |
|---|---|
| pthreads API | “The Linux Programming Interface” by Kerrisk, Ch. 29-30 |
| Thread synchronization | “The Linux Programming Interface” by Kerrisk, Ch. 30 |
| Memory model | “C++ Concurrency in Action” by Williams, Ch. 5 |
| Data race concepts | “Rust for Rustaceans” by Gjengset, Ch. 10 |
| GDB thread debugging | GDB Manual, “Debugging Programs with Multiple Threads” |
| Concurrent programming theory | “The Art of Multiprocessor Programming” by Herlihy & Shavit |
| Systems debugging | “Computer Systems: A Programmer’s Perspective” Ch. 12 |
5.10 Implementation Phases
Phase 1: Basic Multi-threaded Program (Day 1)
Goal: Create a program that spawns threads and they all complete successfully.
Steps:
- Write a minimal program with `pthread_create()`
- Have each thread print its ID and sleep briefly
- Use `pthread_join()` to wait for completion
- Verify all threads run with `printf()` statements
Validation: Program runs without crash, prints messages from all threads.
Phase 2: Add Shared State (Day 2)
Goal: Add a global pointer that all threads can access.
Steps:
- Declare `char *g_shared_ptr` as a global variable
- Initialize it in main to point to valid data
- Have one thread print the pointer’s value
- Have another thread print the dereferenced value
Validation: All threads see the same pointer value.
Phase 3: Create the Race (Day 3-4)
Goal: Create timing that causes a reliable crash.
Steps:
- Writer thread: sleep, then set pointer to NULL
- Main thread: sleep longer, then dereference pointer
- Adjust sleep times until crash is reliable
- Add reader thread as a “bystander”
Validation: Program crashes with SIGSEGV >90% of runs.
Phase 4: Core Dump Analysis (Day 5-7)
Goal: Master the GDB commands for thread debugging.
Steps:
- Generate core dump with crash
- Load in GDB: `gdb ./mt_crash core.12345`
- Practice `info threads`
- Practice `thread N` and `frame N`
- Practice `thread apply all bt`
- Identify the guilty thread by examining state
Validation: Can consistently identify writer thread as root cause.
Phase 5: Documentation and Variations (Day 8-14)
Goal: Solidify understanding with variations.
Steps:
- Create variations with different crash locations
- Try races with different data types (int, struct, pointer-to-pointer)
- Document your analysis process
- Create a “cheat sheet” of GDB thread commands
Validation: Can analyze any multi-threaded crash using learned techniques.
5.11 Key Implementation Decisions
Decision 1: Thread Function Signatures
Use the standard pthread signature:
void *thread_func(void *arg);
Even if you don’t use the argument, keep it. This matches the expected signature and allows future expansion.
Decision 2: Sleep Durations
Use generous sleep times (seconds, not milliseconds):
- Makes timing predictable
- Ensures core dump captures all threads
- Avoids race between thread creation and sleep
Decision 3: Global vs. Heap-allocated Shared State
Use a global variable for clarity:
char *g_shared_ptr; // Global, visible in all contexts
Heap allocation works too but adds complexity (need to pass pointer to threads).
Decision 4: No Thread Joining
Don’t call pthread_join() before the crash:
- The crash happens before threads complete
- Joining would change timing
- We WANT the threads alive in the dump
6. Testing Strategy
Test 1: Verify Thread Creation
# Add debug output to each thread
./mt_crash
# Expected: See print statements from main, writer, reader
Test 2: Verify Crash Occurs
ulimit -c unlimited
./mt_crash
# Expected: "Segmentation fault (core dumped)"
# AND a core file exists
Test 3: Verify All Threads in Dump
gdb ./mt_crash core.*
(gdb) info threads
# Expected: See 3 threads listed
Test 4: Verify Crash Location
(gdb) bt
# Expected: Frame 0 should be in main(), at the dereference line
Test 5: Verify Root Cause Identifiable
(gdb) thread 2
(gdb) frame 2
(gdb) list
# Expected: The listing shows the writer's source, including the line g_shared_ptr = NULL
Test 6: Timing Reliability
# Run 10 times
for i in {1..10}; do ./mt_crash; done
# Expected: Crashes every time (or at least 9/10)
7. Common Pitfalls & Debugging
Pitfall 1: Core Dump Not Generated
Symptom: “Segmentation fault” but no core file
Cause: Core dumps disabled or redirected
Fix:
ulimit -c unlimited
cat /proc/sys/kernel/core_pattern
# If pattern uses apport or systemd, dumps may be elsewhere
# Try: sudo sh -c 'echo core > /proc/sys/kernel/core_pattern'
Verification: ls -la core* shows new file after crash
Pitfall 2: Threads Exit Before Crash
Symptom: info threads shows only 1 or 2 threads
Cause: Worker threads completed and exited before crash
Fix: Add longer sleep at end of thread functions:
void *writer_thread(void *arg) {
sleep(1);
g_shared_ptr = NULL;
sleep(10); // Stay alive!
return NULL;
}
Verification: All 3 threads visible in info threads
Pitfall 3: Race Doesn’t Happen (No Crash)
Symptom: Program exits normally instead of crashing
Cause: Timing doesn’t create the race
Fix: Adjust sleep durations:
- Writer: sleep less before write
- Main: sleep more before read
- Ensure: main reads AFTER writer writes
Verification: Program crashes consistently
Pitfall 4: Crash Happens in Wrong Thread
Symptom: Crash is in writer thread, not main
Cause: Likely a different bug (wrong pointer, typo)
Fix: Review code carefully. The writer should only write to the pointer, never dereference it.
Verification: bt shows crash in main, not writer
Pitfall 5: No Debug Symbols
Symptom: GDB shows ?? instead of function names
Cause: Compiled without -g flag
Fix: Recompile with -g:
gcc -g -pthread -o mt_crash mt_crash.c
Verification: bt shows function names and line numbers
Pitfall 6: Thread Numbering Confusion
Symptom: Can’t find the writer thread
GDB Behavior: Thread numbers in GDB are not always predictable. Thread 1 is usually main, but others can vary.
Fix: Use the Frame information:
(gdb) info threads
# Look at the "Frame" column - it shows function names
# Find the one showing "writer_thread"
Verification: Look for function names, not thread numbers
8. Extensions & Challenges
Extension 1: Multiple Data Races
Create a program with TWO independent data races:
- Race A: Pointer becomes NULL (crash)
- Race B: Counter is corrupted (wrong value)
Analyze: Can you identify both races from a single core dump?
Extension 2: Race Condition Without NULL
Create a race where the pointer doesn’t become NULL but points to freed memory (use-after-free):
// Thread 1: free(g_data_ptr);
// Thread 2: reads *g_data_ptr (corrupted data, maybe crash)
Analyze: How does the core dump differ from a NULL dereference?
Extension 3: More Threads
Scale up to 10 threads with complex interactions:
- 3 writers modifying different globals
- 7 readers accessing various shared state
- Multiple potential crash points
Analyze: Practice navigating many threads in GDB
Extension 4: Atomic Version (Control)
Create a “fixed” version using atomics:
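char *_Atomic g_shared_ptr; // atomic pointer; "_Atomic char *" would be a plain pointer to atomic chars
Compare: Run both versions under ThreadSanitizer. It should report the race for the buggy version and stay silent for the atomic one. The atomic version still dereferences NULL in main, so removing the crash also requires a NULL check.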
Challenge: Real-World Multi-threaded Bug
Find an open-source multi-threaded program and introduce a bug:
- Add an unprotected shared variable
- Create access pattern that races
- Generate crash
- Analyze as if you didn’t know the bug
This simulates real-world debugging where you don’t know the cause.
9. Real-World Connections
Industry Examples
1. Database Connection Pools
Databases often have race conditions around connection state:
- Thread A: marks connection as “in use”
- Thread B: reads connection as “available”
- Thread B: uses connection while A is using it
- Result: Corrupted queries, crashes
2. Web Server Request Handling
Web servers share state across request handlers:
- Thread A: increments request counter
- Thread B: reads counter for logging
- Without atomics: Lost updates, wrong metrics
- With race: Potential integer overflow crash
3. GUI Applications
GUI frameworks have strict threading rules:
- Background thread: updates shared data structure
- UI thread: reads same structure for display
- Race: Crash when UI reads partially-written data
- Example: Many GTK/Qt bugs stem from this
4. Operating System Kernel
The Linux kernel has sophisticated locking, but bugs exist:
- CVE-2016-5195 “Dirty COW”: Race condition in memory management
- Caused by race between memory write and copy-on-write
- Allowed privilege escalation
Tools Used in Production
Thread Sanitizer (TSan)
# Compile with TSan
gcc -fsanitize=thread -g -pthread -o mt_crash mt_crash.c
# Run - TSan reports the race when main performs the racing read; the NULL dereference itself still crashes
./mt_crash
Output:
WARNING: ThreadSanitizer: data race (pid=12345)
Read of size 8 at 0x... by main thread:
#0 main mt_crash.c:45
Previous write of size 8 at 0x... by thread T1:
#0 writer_thread mt_crash.c:18
Helgrind (Valgrind)
valgrind --tool=helgrind ./mt_crash
Intel Inspector
Commercial tool for advanced race detection in HPC/enterprise environments.
10. Resources
Essential Reading
- “The Linux Programming Interface” by Michael Kerrisk, Chapters 29-30 (POSIX Threads)
- GDB Manual: “Debugging Programs with Multiple Threads”
- “Rust for Rustaceans” by Jon Gjengset, Chapter 10 (excellent race condition explanation)
Reference Documentation
- `man pthreads` - Overview of POSIX threads
- `man pthread_create` - Thread creation
- `man pthread_mutex_lock` - Mutexes (for understanding what we’re NOT using)
- GDB Info: `help info threads`, `help thread`
Tools
- GDB (required)
- ThreadSanitizer (`-fsanitize=thread`)
- Helgrind (Valgrind tool)
- `strace -f` (trace system calls across threads)
11. Self-Assessment Checklist
Before considering this project complete, verify:
Understanding:
- I can explain what a data race is and why it’s dangerous
- I understand why the crash occurs in main, not in writer_thread
- I can explain why `volatile` doesn’t fix data races
- I know the difference between a race condition and a deadlock
Skills:
- I can use `info threads` to list all threads in a crashed process
- I can use `thread N` to switch to a specific thread
- I can use `thread apply all bt` to get all backtraces at once
- I can identify the “guilty” thread by examining shared state
- I can navigate between stack frames in different threads
Practical:
- My program crashes reliably (>90% of runs)
- My core dump contains all 3 threads
- I can load the core dump in GDB and analyze it
- I have documented the GDB commands I used
Extension:
- I have tried at least one extension/variation
- I understand how TSan would detect this race
- I can explain how to fix the race (mutexes, atomics)
12. Submission / Completion Criteria
Your project is complete when you can demonstrate:
- Working Crash Program
  - `mt_crash.c` compiles with `-g -pthread`
  - Running `./mt_crash` produces a core dump
  - The crash is a SIGSEGV in the main thread
- Complete Analysis Session
  - Load core dump in GDB
  - Show all threads with `info threads`
  - Show all backtraces with `thread apply all bt`
  - Switch to writer thread and show the NULL assignment
  - Explain the timeline: which thread did what and when
- Written Documentation
  - Brief explanation of the data race
  - GDB commands used in analysis
  - How you identified the root cause
- Verification
  - Run your analysis on someone else’s machine (or a VM)
  - The analysis should work the same way
Success Criteria: You can take any multi-threaded core dump, examine all threads, and identify which thread caused the problem—even when the crash occurs in a different thread.
This project bridges the gap between single-threaded debugging and the complex reality of concurrent systems. The techniques you’ve learned here—viewing all threads, correlating state, thinking about timing—are the same techniques used by engineers debugging the most challenging production systems.