Project 5: Multi-threaded Mayhem

Master the art of debugging concurrent crashes by analyzing core dumps from multi-threaded programs with data races, learning to correlate thread states and identify the root cause of elusive race conditions.

Quick Reference

Attribute	Value
Difficulty	Advanced
Time Estimate	1-2 weeks
Language	C (with pthreads)
Prerequisites	Project 3, threading concepts, basic GDB
Key Topics	pthreads, data races, memory visibility, GDB thread debugging

1. Learning Objectives

By completing this project, you will be able to:

Create controlled data races that reliably cause crashes for analysis
View all threads in a crashed process using info threads
Switch between thread contexts using thread <N> to examine each thread’s state
Generate complete backtraces for all threads with thread apply all bt
Correlate thread states to identify which thread caused the corruption
Understand memory visibility issues between threads without proper synchronization
Debug production-like concurrency bugs from post-mortem analysis alone
Recognize common patterns of race condition crashes in core dumps

2. Theoretical Foundation

2.1 Core Concepts

What is a Data Race?

A data race occurs when:

Two or more threads access the same memory location
At least one access is a write
The accesses are not synchronized (no locks, atomics, or memory barriers)

Data Race Scenario:
                    Shared Memory: g_ptr = 0x1234 (valid pointer)

Time ─────────────────────────────────────────────────────────────────►

Thread 1 (Writer):                Thread 2 (Reader):

    ...                               ...
    g_ptr = NULL;  ─────────────►     value = *g_ptr;  // CRASH!
    // Writes NULL                    // Reads the new value (NULL)
                                      // Then dereferences NULL
    ...                               ...

The reader may see the OLD value (0x1234) or the NEW value (NULL)
depending on timing, CPU caching, and compiler optimizations.

Memory Visibility Without Synchronization

Modern CPUs and compilers aggressively optimize code. Without explicit synchronization:

What You Write:                   What Actually Executes:

// Thread 1                       // Thread 1 (optimized)
data = 42;                        ready = true;  // REORDERED!
ready = true;                     data = 42;

// Thread 2                       // Thread 2
if (ready) {                      if (ready) {
    use(data);  // Might see 0!       use(data);  // data not set yet!
}                                 }

The CPU store buffer, cache coherence protocols, and compiler reordering can all cause one thread to see an inconsistent view of memory.

The Thread Model in Linux

Each thread in a Linux process shares:

Address space: All threads see the same virtual memory
File descriptors: Open files, sockets, pipes
Signal handlers: Though signal masks can differ
Process ID: All threads share the same PID

Each thread has its own:

Thread ID (TID): Unique within the process
Stack: Each thread gets its own stack
Registers: Including the instruction pointer (RIP)
Signal mask: Which signals are blocked
errno: Thread-local storage for error codes

Process Memory Layout with Threads:

┌───────────────────────────────────────────────┐
│                   KERNEL                       │
├───────────────────────────────────────────────┤
│               STACK (Thread 1)                 │  ← Each thread
│  Local variables, return addresses             │    has its own
├───────────────────────────────────────────────┤    stack
│               STACK (Thread 2)                 │
│  Local variables, return addresses             │
├───────────────────────────────────────────────┤
│               STACK (Thread 3)                 │
│  Local variables, return addresses             │
├───────────────────────────────────────────────┤
│                    ↓ ↓ ↓                       │
│                 (free space)                   │
│                    ↑ ↑ ↑                       │
├───────────────────────────────────────────────┤
│                    HEAP                        │  ← SHARED
│  malloc'd data, global pointers                │    by all
├───────────────────────────────────────────────┤    threads
│                .BSS (uninitialized)            │  ← SHARED
│  Global variables like g_data                  │
├───────────────────────────────────────────────┤
│                .DATA (initialized)             │  ← SHARED
│  Global variables with initial values          │
├───────────────────────────────────────────────┤
│                .TEXT (code)                    │  ← SHARED
│  Executable instructions                       │
└───────────────────────────────────────────────┘

What Happens During a Multi-threaded Crash

When any thread causes a fatal signal (like SIGSEGV):

The kernel raises the signal for the faulting thread
With the default action, the signal terminates the whole process, so ALL threads stop
Each thread’s state is frozen at its current instruction
The core dump captures:
- All thread stacks
- All thread register sets
- Shared memory (heap, data sections)
- Thread-local storage

This means a core dump is a snapshot of the entire process, not just the crashing thread.

2.2 Why This Matters

Single-threaded debugging is straightforward: the backtrace shows the exact sequence of events. Multi-threaded debugging is fundamentally different:

The thread that crashes is often not the thread that caused the bug.

Consider:

Thread A sets a pointer to NULL
Thread B reads that pointer and crashes

The backtrace in Thread B shows where the crash occurred, but the root cause is in Thread A. Without examining all threads, you’ll never find it.

In practice:

Race conditions are a common source of production bugs in multi-threaded systems
Data races are a frequent cause of intermittent, hard-to-reproduce failures
Many race conditions do not crash immediately; they corrupt data silently

2.3 Historical Context

The pthread (POSIX Threads) API was standardized in 1995 (POSIX.1c). Before this:

Multi-threading was vendor-specific and non-portable
Each Unix variant had its own threading model
Programs had to be rewritten for each platform

The Linux threading implementation evolved significantly:

LinuxThreads (1996-2003): Original implementation, many compatibility issues
NPTL (2003-present): Native POSIX Thread Library, 1:1 thread-to-kernel mapping

GDB’s threading support has improved dramatically:

Early GDB had minimal threading awareness
Modern GDB can attach to processes with thousands of threads
Thread-specific breakpoints and watchpoints
Non-stop mode for debugging without stopping all threads

2.4 Common Misconceptions

Misconception 1: “My program works, so there are no races”

Reality: Data races are timing-dependent. A program can run correctly millions of times, then fail once under heavy load. The absence of crashes doesn’t mean the absence of races.

Misconception 2: “The crashing thread is always at fault”

Reality: In multi-threaded programs, Thread A can corrupt data that Thread B uses later. The crash happens in B, but the bug is in A.

Misconception 3: “Adding a mutex everywhere fixes races”

Reality: Incorrect mutex usage causes deadlocks. Over-synchronization destroys performance. The goal is correct synchronization, not more synchronization.

Misconception 4: “Volatile keyword prevents data races”

Reality: volatile only stops the compiler from caching or eliding accesses to the variable. It does NOT provide atomicity or inter-thread memory ordering. Use proper synchronization primitives.

// WRONG: volatile doesn't prevent races
volatile int counter = 0;

void thread_func() {
    counter++;  // Still a race! Read-modify-write is not atomic
}

// CORRECT: Use atomics (C11)
#include <stdatomic.h>
_Atomic int counter = 0;

void thread_func() {
    atomic_fetch_add(&counter, 1);  // Atomic operation
}

3. Project Specification

3.1 What You Will Build

You will create a C program called mt_crash.c that:

Spawns multiple threads with clearly defined roles (writer, reader, main)
Contains a deliberate data race on a shared pointer
Crashes predictably in a thread other than the one causing the corruption
Demonstrates thread analysis techniques in GDB

You will then analyze the resulting core dump to:

View all threads and their states
Switch between threads to examine their context
Identify the “guilty” thread by correlating state
Understand the timeline of events that led to the crash

3.2 Functional Requirements

FR1: Program Structure

Create at least 3 threads: main, writer_thread, reader_thread
Use pthreads API for thread creation and management
Compile with -g -pthread for debug symbols and threading support

FR2: Data Race Implementation

Declare a global pointer initialized to valid memory
Writer thread modifies the pointer after a delay
Reader thread or main thread dereferences the pointer
No synchronization between threads (deliberate race)

FR3: Crash Behavior

The crash must occur in a thread OTHER than the writer
The crash must be a SIGSEGV (NULL pointer dereference)
Timing should be controlled via sleep() to make race predictable

FR4: Debug Information

Compile with -g flag for full debug symbols
Include meaningful function and variable names
Add comments marking the race condition

3.3 Non-Functional Requirements

NFR1: Reproducibility

The crash should occur reliably (>90% of runs)
Sleep timings should be adjustable
Should work on any modern Linux system

NFR2: Educational Clarity

Code should be simple and readable
Each thread should have a distinct, named function
Variable names should clearly indicate shared state

NFR3: Analysis Friendliness

Thread functions should not be inlined
Global variables should have descriptive names
Stack depth should be >1 frame for meaningful backtraces

3.4 Example Usage / Output

Compiling and Running:

$ gcc -g -pthread -o mt_crash mt_crash.c

$ ulimit -c unlimited

$ ./mt_crash
Main thread starting...
Writer thread starting...
Reader thread starting...
Main thread accessing shared data...
Segmentation fault (core dumped)

GDB Analysis Session:

$ gdb ./mt_crash core.12345

(gdb) info threads
  Id   Target Id                    Frame
* 1    Thread 0x7f4e8b200740 (LWP 12345) 0x000055555555519a in main ()
                                          at mt_crash.c:45
  2    Thread 0x7f4e8aa00700 (LWP 12346) 0x00007f4e8a9b7360 in __GI___clock_nanosleep ()
                                          from /lib/x86_64-linux-gnu/libc.so.6
  3    Thread 0x7f4e8a1ff700 (LWP 12347) 0x00007f4e8aab5d95 in nanosleep ()
                                          from /lib/x86_64-linux-gnu/libc.so.6

(gdb) thread apply all bt

Thread 3 (Thread 0x7f4e8a1ff700 (LWP 12347)):
#0  0x00007f4e8aab5d95 in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f4e8aab5c3e in sleep () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x0000555555555175 in reader_thread (arg=0x0) at mt_crash.c:28
#3  0x00007f4e8a800ea7 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007f4e8aabe9cf in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 2 (Thread 0x7f4e8aa00700 (LWP 12346)):
#0  0x00007f4e8a9b7360 in __GI___clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f4e8aab5c3e in sleep () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x0000555555555152 in writer_thread (arg=0x0) at mt_crash.c:19
#3  0x00007f4e8a800ea7 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007f4e8aabe9cf in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 1 (Thread 0x7f4e8b200740 (LWP 12345)):
#0  0x000055555555519a in main () at mt_crash.c:45

(gdb) thread 2
[Switching to thread 2 (Thread 0x7f4e8aa00700 (LWP 12346))]
#0  0x00007f4e8a9b7360 in __GI___clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) frame 2
#2  0x0000555555555152 in writer_thread (arg=0x0) at mt_crash.c:19
19          sleep(5);  // Stay alive for core dump

(gdb) list 18,19
18          g_shared_ptr = NULL;  // THE CULPRIT!
19          sleep(5);  // Stay alive for core dump

(gdb) p g_shared_ptr
$1 = (char *) 0x0

(gdb) thread 1
[Switching to thread 1 (Thread 0x7f4e8b200740 (LWP 12345))]
#0  0x000055555555519a in main () at mt_crash.c:45
45          char c = *g_shared_ptr;  // CRASH HERE - dereferencing NULL

3.5 Real World Outcome

After completing this project, you will be able to:

Triage production crashes from multi-threaded applications
Identify data races by correlating thread states in core dumps
Navigate complex thread interactions using GDB’s thread commands
Distinguish symptoms from causes in concurrent bug reports
Write effective bug reports that identify the actual race condition

4. Solution Architecture

4.1 High-Level Design

Program Flow:

┌──────────────────────────────────────────────────────────────────────┐
│                            MAIN THREAD                                │
│                                                                       │
│  1. Initialize g_shared_ptr = "Hello"                                │
│  2. Create writer_thread                                             │
│  3. Create reader_thread                                             │
│  4. sleep(2)  ─────────────────────────────────────────────┐         │
│  5. Access *g_shared_ptr  ← CRASH HERE                     │         │
│                                                             │         │
└─────────────────────────────────────────────────────────────┼─────────┘
                                                              │
                                                              │ Time
                                                              │
┌─────────────────────────────────────────────────────────────┼─────────┐
│                          WRITER THREAD                      │          │
│                                                             │          │
│  1. sleep(1)  ◄─────────────────────────────────────────────┘          │
│  2. g_shared_ptr = NULL  ← ROOT CAUSE                                 │
│  3. sleep(5) (to keep thread alive in dump)                           │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────┐
│                          READER THREAD                                  │
│                                                                         │
│  1. sleep(10) (just to be present in dump)                             │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

Timeline:
─────────────────────────────────────────────────────────────────────────►
0s          1s              2s              3s
│           │               │               │
│           │               │               │
├─ Main:    │               │               │
│  Create   │               │               │
│  threads  │               ├─ Main wakes   │
│           │               │  from sleep   │
│           │               │               │
│           ├─ Writer wakes │  Accesses     │
│           │  from sleep   │  g_shared_ptr │
│           │               │               │
│           │  Sets ptr     │  CRASH!       │
│           │  to NULL      │  (NULL deref) │
│           │               │               │

4.2 Key Components

Component 1: Global Shared State

char *g_shared_ptr = NULL;  // The shared variable
char g_data[] = "Hello, World!";  // Initial valid data

Component 2: Writer Thread

Waits for 1 second (ensures main thread is blocked)
Sets g_shared_ptr to NULL
Remains alive (sleeping) to appear in core dump

Component 3: Reader Thread

Simply sleeps to demonstrate multiple threads
Shows how to view “innocent” threads in dump

Component 4: Main Thread

Initializes shared pointer to valid data
Spawns worker threads
Waits for 2 seconds (longer than writer’s 1-second delay)
Dereferences the now-NULL pointer (CRASH)

4.3 Data Structures

// The race is on this single global pointer
char *g_shared_ptr;

// Thread arguments (not strictly needed, but good practice)
typedef struct {
    int thread_id;
    const char *name;
} thread_args_t;

// pthread types
pthread_t writer_tid;
pthread_t reader_tid;

4.4 Algorithm Overview

ALGORITHM: Creating a Controlled Data Race

INPUT: None
OUTPUT: Core dump with multi-threaded crash

1. INITIALIZE:
   g_shared_ptr = &g_data[0]  // Point to valid data

2. CREATE WRITER THREAD:
   writer_thread():
     sleep(1)                   // Wait for main to sleep
     g_shared_ptr = NULL        // Corrupt shared state
     sleep(5)                   // Stay alive for dump

3. CREATE READER THREAD:
   reader_thread():
     sleep(10)                  // Just be present

4. MAIN THREAD CONTINUES:
   sleep(2)                     // Wait longer than writer's delay
   char c = *g_shared_ptr       // Dereference (now NULL) -> CRASH

5. KERNEL HANDLES SIGSEGV:
   - Stop all threads
   - Save all thread states
   - Write core dump
   - Terminate process

5. Implementation Guide

5.1 Development Environment Setup

Required Packages:

# Debian/Ubuntu
sudo apt-get install build-essential gdb

# Fedora/RHEL
sudo dnf install gcc gdb

# Verify installations
gcc --version
gdb --version

Enable Core Dumps:

# Allow unlimited core file size
ulimit -c unlimited

# Check the setting
ulimit -c

# Optionally, set core file naming pattern
sudo sh -c 'echo "core.%e.%p" > /proc/sys/kernel/core_pattern'

Compiler Flags:

# Essential flags for this project:
# -g: Include debug symbols (CRITICAL for meaningful backtraces)
# -pthread: Enable POSIX threads support
# -O0: Disable optimizations (prevents inlining, reordering)

gcc -g -pthread -O0 -o mt_crash mt_crash.c

5.2 Project Structure

mt_crash/
├── mt_crash.c          # Main source file
├── Makefile            # Build configuration
├── run_crash.sh        # Helper script to generate crash
└── analyze.gdb         # GDB commands for analysis

Suggested Makefile:

CC = gcc
CFLAGS = -g -pthread -O0 -Wall

mt_crash: mt_crash.c
	$(CC) $(CFLAGS) -o $@ $<

clean:
	rm -f mt_crash core.*

.PHONY: clean

5.3 The Core Question You’re Answering

“When a crash occurs in one thread, how do I find out if another thread caused the problem?”

This is the fundamental question of multi-threaded debugging. The techniques you learn here—viewing all threads, switching contexts, correlating state—are the same techniques used to debug the most complex concurrent systems.

5.4 Concepts You Must Understand First

Before starting implementation, verify you can answer these questions:

Thread Creation: How does pthread_create() work? What is the signature of a thread function?
- Reference: “The Linux Programming Interface” Ch. 29
Shared Memory: What memory is shared between threads? What is thread-local?
- Reference: “Computer Systems: A Programmer’s Perspective” Ch. 12
Race Conditions: What makes a data race? Why are they dangerous?
- Reference: “Rust for Rustaceans” Ch. 10 (explains even for non-Rust programmers)
GDB Basics: Can you load a core dump in GDB? Get a backtrace? Print variables?
- Reference: Project 2 and 3 of this course
Process/Thread Model: What is a LWP (Light Weight Process)? How does the kernel see threads?
- Reference: “Understanding the Linux Kernel” Ch. 3

5.5 Questions to Guide Your Design

About the Data Race:

What shared variable will you race on?
How will you ensure one thread writes before the other reads?
What happens if the timing doesn’t work as expected?

About Thread Lifetimes:

How long should each thread sleep?
What happens if a thread exits before the crash?
How do you ensure all threads appear in the core dump?

About Analysis:

What evidence will show which thread wrote the NULL?
Can you tell from the core dump what the pointer’s value was before the write?
What local variables will help identify each thread’s state?

5.6 Thinking Exercise

Before writing any code, trace through this scenario mentally:

Time 0ms:    main() initializes g_shared_ptr = g_data
Time 10ms:   main() creates writer_thread
Time 20ms:   main() creates reader_thread
Time 30ms:   main() calls sleep(2)
             writer_thread() calls sleep(1)
             reader_thread() calls sleep(10)
Time 1030ms: writer_thread() wakes up
Time 1031ms: writer_thread() executes: g_shared_ptr = NULL
Time 2030ms: main() wakes up
Time 2031ms: main() executes: char c = *g_shared_ptr
             *g_shared_ptr is now *NULL -> SIGSEGV!

Questions to consider:

At the moment of crash, what will each thread’s instruction pointer point to?
What local variables will be visible in each thread’s stack frame?
If you examine g_shared_ptr in GDB, what will you see?

5.7 Hints in Layers

Hint 1: Starting Point (Conceptual Direction)

Start with a minimal pthread program that just creates threads and has them print messages. Verify that you can create multiple threads and they all run. Then add the shared variable and the race.

Hint 2: Next Level (More Specific Guidance)

The key insight is timing. The writer thread must:

Sleep LESS time than the main thread
Write to the pointer BEFORE main wakes up
Stay alive AFTER the crash (by sleeping again)

Consider: writer sleeps 1s, main sleeps 2s. This gives a 1-second window where the pointer is NULL before main tries to use it.

Hint 3: Technical Details (Approach/Pseudocode)

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

// Global shared state - no synchronization!
char g_data[] = "Valid data";
char *g_shared_ptr = NULL;

void *writer_thread(void *arg) {
    (void)arg;
    sleep(1);  // Let main thread sleep first
    g_shared_ptr = NULL;  // THE BUG: unsynchronized write
    sleep(5);  // Stay alive for core dump
    return NULL;
}

void *reader_thread(void *arg) {
    (void)arg;
    sleep(10);  // Just be present
    return NULL;
}

int main(void) {
    g_shared_ptr = g_data;  // Initialize to valid pointer

    pthread_t writer, reader;
    pthread_create(&writer, NULL, writer_thread, NULL);
    pthread_create(&reader, NULL, reader_thread, NULL);

    sleep(2);  // Sleep LONGER than writer's delay

    // By now, writer has set g_shared_ptr = NULL
    char c = *g_shared_ptr;  // CRASH: NULL dereference
    printf("Read: %c\n", c);  // Not reached; uses c so the load is kept

    return 0;
}

Hint 4: Tools/Debugging (Verification Methods)

After the crash, verify your analysis with these GDB commands:

# 1. See all threads
info threads

# 2. Get ALL backtraces at once
thread apply all bt

# 3. Switch to the writer thread (usually thread 2)
thread 2

# 4. Go to the frame where NULL was assigned
frame 2

# 5. Print the shared pointer
p g_shared_ptr

# 6. Look at assembly to confirm the NULL write
disassemble writer_thread

5.8 The Interview Questions They’ll Ask

Question 1: “How do you debug a race condition that only happens in production?”

Expected answer: Use core dump analysis. Even if you can’t reproduce the crash, the core dump captures the state of all threads. By examining each thread’s stack and variables, you can often identify which thread caused the corruption.

Question 2: “What GDB commands are essential for multi-threaded debugging?”

Expected answer:

info threads - List all threads
thread N - Switch to thread N
thread apply all bt - Backtrace for ALL threads
thread apply all p variable - Print a variable in all thread contexts

Question 3: “The crash is in thread A, but you suspect thread B caused it. How do you prove this?”

Expected answer: Switch to thread B with thread <N>, where N is its GDB thread number. Examine its local variables and the shared state it accessed. Look for evidence of the corrupting write in its stack frame or in the current value of shared variables.

Question 4: “What’s the difference between a race condition and a deadlock?”

Expected answer:

Race condition: Threads compete for shared resources without proper synchronization, leading to unpredictable results or crashes.
Deadlock: Threads wait for each other indefinitely, causing the program to hang (not crash).

In a core dump, a race condition shows threads running/crashed, while a deadlock shows threads blocked on lock acquisition.

Question 5: “How can you prevent this class of bugs?”

Expected answer:

Use mutexes/locks to protect shared state
Use atomic operations for simple counters
Prefer message passing over shared memory
Use thread sanitizer (TSan) during development
Design with immutability where possible

5.9 Books That Will Help

Topic	Book & Chapter
pthreads API	“The Linux Programming Interface” by Kerrisk, Ch. 29-30
Thread synchronization	“The Linux Programming Interface” by Kerrisk, Ch. 30
Memory model	“C++ Concurrency in Action” by Williams, Ch. 5
Data race concepts	“Rust for Rustaceans” by Gjengset, Ch. 10
GDB thread debugging	GDB Manual, “Debugging Programs with Multiple Threads”
Concurrent programming theory	“The Art of Multiprocessor Programming” by Herlihy & Shavit
Systems debugging	“Computer Systems: A Programmer’s Perspective” Ch. 12

5.10 Implementation Phases

Phase 1: Basic Multi-threaded Program (Day 1)

Goal: Create a program that spawns threads and they all complete successfully.

Steps:

Write a minimal program with pthread_create()
Have each thread print its ID and sleep briefly
Use pthread_join() to wait for completion
Verify all threads run with printf() statements

Validation: Program runs without crash, prints messages from all threads.

Phase 2: Add Shared State (Day 2)

Goal: Add a global pointer that all threads can access.

Steps:

Declare char *g_shared_ptr as a global variable
Initialize it in main to point to valid data
Have one thread print the pointer’s value
Have another thread print the dereferenced value

Validation: All threads see the same pointer value.

Phase 3: Create the Race (Day 3-4)

Goal: Create timing that causes a reliable crash.

Steps:

Writer thread: sleep, then set pointer to NULL
Main thread: sleep longer, then dereference pointer
Adjust sleep times until crash is reliable
Add reader thread as a “bystander”

Validation: Program crashes with SIGSEGV >90% of runs.

Phase 4: Core Dump Analysis (Day 5-7)

Goal: Master the GDB commands for thread debugging.

Steps:

Generate core dump with crash
Load in GDB: gdb ./mt_crash core.12345
Practice info threads
Practice thread N and frame N
Practice thread apply all bt
Identify the guilty thread by examining state

Validation: Can consistently identify writer thread as root cause.

Phase 5: Documentation and Variations (Day 8-14)

Goal: Solidify understanding with variations.

Steps:

Create variations with different crash locations
Try races with different data types (int, struct, pointer-to-pointer)
Document your analysis process
Create a “cheat sheet” of GDB thread commands

Validation: Can analyze any multi-threaded crash using learned techniques.

5.11 Key Implementation Decisions

Decision 1: Thread Function Signatures

Use the standard pthread signature:

void *thread_func(void *arg);

Even if you don’t use the argument, keep it. This matches the expected signature and allows future expansion.

Decision 2: Sleep Durations

Use generous sleep times (seconds, not milliseconds):

Makes timing predictable
Ensures core dump captures all threads
Avoids race between thread creation and sleep

Decision 3: Global vs. Heap-allocated Shared State

Use a global variable for clarity:

char *g_shared_ptr;  // Global, visible in all contexts

Heap allocation works too but adds complexity (need to pass pointer to threads).

Decision 4: No Thread Joining

Don’t call pthread_join() before the crash:

The crash happens before threads complete
Joining would change timing
We WANT the threads alive in the dump

6. Testing Strategy

Test 1: Verify Thread Creation

# Add debug output to each thread
./mt_crash
# Expected: See print statements from main, writer, reader

Test 2: Verify Crash Occurs

ulimit -c unlimited
./mt_crash
# Expected: "Segmentation fault (core dumped)"
# AND a core file exists

Test 3: Verify All Threads in Dump

gdb ./mt_crash core.*
(gdb) info threads
# Expected: See 3 threads listed

Test 4: Verify Crash Location

(gdb) bt
# Expected: Frame 0 should be in main(), at the dereference line

Test 5: Verify Root Cause Identifiable

(gdb) thread 2
(gdb) frame 2
(gdb) list
# Expected: Should show the line where g_shared_ptr = NULL

Test 6: Timing Reliability

# Run 10 times
for i in {1..10}; do ./mt_crash; done
# Expected: Crashes every time (or at least 9/10)

7. Common Pitfalls & Debugging

Pitfall 1: Core Dump Not Generated

Symptom: “Segmentation fault” but no core file

Cause: Core dumps disabled or redirected

Fix:

ulimit -c unlimited
cat /proc/sys/kernel/core_pattern
# If pattern uses apport or systemd, dumps may be elsewhere
# Try: sudo sh -c 'echo core > /proc/sys/kernel/core_pattern'

Verification: ls -la core* shows new file after crash

Pitfall 2: Threads Exit Before Crash

Symptom: info threads shows only 1 or 2 threads

Cause: Worker threads completed and exited before crash

Fix: Add longer sleep at end of thread functions:

void *writer_thread(void *arg) {
    sleep(1);
    g_shared_ptr = NULL;
    sleep(10);  // Stay alive!
    return NULL;
}

Verification: All 3 threads visible in info threads

Pitfall 3: Race Doesn’t Happen (No Crash)

Symptom: Program exits normally instead of crashing

Cause: Timing doesn’t create the race

Fix: Adjust sleep durations:

Writer: sleep less before write
Main: sleep more before read
Ensure: main reads AFTER writer writes

Verification: Program crashes consistently

Pitfall 4: Crash Happens in Wrong Thread

Symptom: Crash is in writer thread, not main

Cause: Likely a different bug (wrong pointer, typo)

Fix: Review code carefully. The writer should only write to the pointer, never dereference it.

Verification: bt shows crash in main, not writer

Pitfall 5: No Debug Symbols

Symptom: GDB shows ?? instead of function names

Cause: Compiled without -g flag

Fix: Recompile with -g:

gcc -g -pthread -o mt_crash mt_crash.c

Verification: bt shows function names and line numbers

Pitfall 6: Thread Numbering Confusion

Symptom: Can’t find the writer thread

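GDB Behavior: Thread numbers in GDB are not always predictable. Thread 1 is usually main, but others can vary.

Fix: Identify threads by the functions in their backtraces. The Frame column of info threads shows only the innermost frame, which for a sleeping thread is a libc function such as nanosleep:

(gdb) thread apply all bt
# Look for the backtrace that contains "writer_thread"
# The "Thread N" header above it gives the number to pass to thread N

Verification: Look for function names, not thread numbers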

8. Extensions & Challenges

Extension 1: Multiple Data Races

Create a program with TWO independent data races:

Race A: Pointer becomes NULL (crash)
Race B: Counter is corrupted (wrong value)

Analyze: Can you identify both races from a single core dump?

Extension 2: Race Condition Without NULL

Create a race where the pointer doesn’t become NULL but points to freed memory (use-after-free):

// Thread 1: free(g_data_ptr);
// Thread 2: reads *g_data_ptr (corrupted data, maybe crash)

Analyze: How does the core dump differ from a NULL dereference?

Extension 3: More Threads

Scale up to 10 threads with complex interactions:

3 writers modifying different globals
7 readers accessing various shared state
Multiple potential crash points

Analyze: Practice navigating many threads in GDB

Extension 4: Atomic Version (Control)

Create a “fixed” version using atomics:

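char *_Atomic g_shared_ptr;  // atomic pointer (_Atomic char * would be a pointer to an atomic char)

Compare: Run both versions under ThreadSanitizer. It should report the race for the buggy version and stay silent for the atomic one. Note that atomics remove the data race but not the NULL value, so the reader must still check the pointer before dereferencing it.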

Challenge: Real-World Multi-threaded Bug

Find an open-source multi-threaded program and introduce a bug:

Add an unprotected shared variable
Create access pattern that races
Generate crash
Analyze as if you didn’t know the bug

This simulates real-world debugging where you don’t know the cause.

9. Real-World Connections

Industry Examples

1. Database Connection Pools: Databases often have race conditions around connection state:

Thread A: marks connection as “in use”
Thread B: reads connection as “available”
Thread B: uses connection while A is using it
Result: Corrupted queries, crashes

2. Web Server Request Handling: Web servers share state across request handlers:

Thread A: increments request counter
Thread B: reads counter for logging
Without atomics: Lost updates, wrong metrics
With race: Potential integer overflow crash

3. GUI Applications: GUI frameworks have strict threading rules:

Background thread: updates shared data structure
UI thread: reads same structure for display
Race: Crash when UI reads partially-written data
Example: Many GTK/Qt bugs stem from this

4. Operating System Kernel: The Linux kernel has sophisticated locking, but bugs exist:

CVE-2016-5195 “Dirty COW”: Race condition in memory management
Caused by race between memory write and copy-on-write
Allowed privilege escalation

Tools Used in Production

Thread Sanitizer (TSan)

# Compile with TSan
gcc -fsanitize=thread -g -pthread -o mt_crash mt_crash.c

# Run - TSan reports the race when main's read follows the writer's store;
# the NULL dereference itself still happens afterwards
./mt_crash

Output (abridged):

WARNING: ThreadSanitizer: data race (pid=12345)
  Read of size 8 at 0x... by main thread:
    #0 main mt_crash.c:45

  Previous write of size 8 at 0x... by thread T1:
    #0 writer_thread mt_crash.c:18

Helgrind (Valgrind)

valgrind --tool=helgrind ./mt_crash

Intel Inspector

Commercial tool for advanced race detection in HPC/enterprise environments.

10. Resources

Essential Reading

“The Linux Programming Interface” by Michael Kerrisk, Chapters 29-30 (POSIX Threads)
GDB Manual: “Debugging Programs with Multiple Threads”
“Rust for Rustaceans” by Jon Gjengset, Chapter 10 (excellent race condition explanation)

Reference Documentation

man pthreads - Overview of POSIX threads
man pthread_create - Thread creation
man pthread_mutex_lock - Mutexes (for understanding what we’re NOT using)
GDB Info: help info threads, help thread

Tools

GDB (required)
ThreadSanitizer (-fsanitize=thread)
Helgrind (Valgrind tool)
strace -f (trace system calls across threads)

11. Self-Assessment Checklist

Before considering this project complete, verify:

Understanding:

I can explain what a data race is and why it’s dangerous
I understand why the crash occurs in main, not in writer_thread
I can explain why volatile doesn’t fix data races
I know the difference between a race condition and a deadlock

Skills:

I can use info threads to list all threads in a crashed process
I can use thread N to switch to a specific thread
I can use thread apply all bt to get all backtraces at once
I can identify the “guilty” thread by examining shared state
I can navigate between stack frames in different threads

Practical:

My program crashes reliably (>90% of runs)
My core dump contains all 3 threads
I can load the core dump in GDB and analyze it
I have documented the GDB commands I used

Extension:

I have tried at least one extension/variation
I understand how TSan would detect this race
I can explain how to fix the race (mutexes, atomics)

12. Submission / Completion Criteria

Your project is complete when you can demonstrate:

Working Crash Program
- mt_crash.c compiles with -g -pthread
- Running ./mt_crash produces a core dump
- The crash is a SIGSEGV in the main thread
Complete Analysis Session
- Load core dump in GDB
- Show all threads with info threads
- Show all backtraces with thread apply all bt
- Switch to writer thread and show the NULL assignment
- Explain the timeline: which thread did what and when
Written Documentation
- Brief explanation of the data race
- GDB commands used in analysis
- How you identified the root cause
Verification
- Run your analysis on someone else’s machine (or a VM)
- The analysis should work the same way

Success Criteria: You can take any multi-threaded core dump, examine all threads, and identify which thread caused the problem—even when the crash occurs in a different thread.

This project bridges the gap between single-threaded debugging and the complex reality of concurrent systems. The techniques you’ve learned here—viewing all threads, correlating state, thinking about timing—are the same techniques used by engineers debugging the most challenging production systems.