P02: Work-Stealing Thread Pool

P02: Work-Stealing Thread Pool

Build a high-performance thread pool with work-stealing scheduling like Rayon or Goโ€™s runtime

Attribute Value
Main Language C++
Alternative Languages Rust, Go, Java
Difficulty Level 3: Advanced
Coolness Level Level 4: Hardcore Tech Flex
Knowledge Area Thread Pool Design, Work Scheduling
Key Tools pthreads, C++ atomics, perf
Main Book โ€œRust Atomics and Locksโ€ - Mara Bos

Learning Objectives

By completing this project, you will:

  1. Master atomic operations โ€” Understand memory ordering (acquire, release, seq_cst) and when each is needed
  2. Implement lock-free data structures โ€” Build a Chase-Lev work-stealing deque
  3. Understand cache coherency โ€” Learn why false sharing destroys performance and how to avoid it
  4. Design efficient schedulers โ€” Balance work across threads without creating bottlenecks
  5. Profile concurrent code โ€” Use perf and ThreadSanitizer to find issues
  6. Build parallel algorithms โ€” Create parallel map, reduce, and sort using your pool

The Core Question

โ€œHow do systems like Rayon achieve near-linear speedup, and why doesnโ€™t adding more threads always make things faster?โ€

Adding threads seems simple: more workers = more throughput. But reality is harsh:

  • Lock contention: Threads fighting over shared data waste cycles waiting
  • Cache invalidation: One thread modifying data invalidates anotherโ€™s cache
  • Load imbalance: Some threads finish early while others are overloaded
  • Synchronization overhead: Coordinating threads costs CPU time

Work-stealing solves these elegantly: each thread has its own work queue, and idle threads โ€œstealโ€ from busy ones. No central bottleneck, natural load balancing.

After this project, youโ€™ll understand why Goโ€™s goroutine scheduler and Rustโ€™s Rayon use work-stealing, and youโ€™ll have built one yourself.


Deep Theoretical Foundation

1. Why Work-Stealing?

The Problem with Centralized Queues

Traditional thread pools use a single shared queue:

                    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                    โ”‚ Task Queue  โ”‚
                    โ”‚ [T1][T2][T3]โ”‚
                    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                           โ”‚
        โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
        โ”‚                  โ”‚                  โ”‚
        โ–ผ                  โ–ผ                  โ–ผ
   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
   โ”‚ Worker 1โ”‚       โ”‚ Worker 2โ”‚       โ”‚ Worker 3โ”‚
   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Every worker must lock the queue to get work. With 8 workers hitting the same lock, contention dominates.

Work-Stealing Architecture

Each worker has its own deque (double-ended queue):

   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
   โ”‚ Worker 1โ”‚       โ”‚ Worker 2โ”‚       โ”‚ Worker 3โ”‚
   โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜       โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜       โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜
        โ”‚                 โ”‚                 โ”‚
   โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”       โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”       โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”
   โ”‚ Deque 1 โ”‚       โ”‚ Deque 2 โ”‚       โ”‚ Deque 3 โ”‚
   โ”‚[T1][T2] โ”‚โ—„โ”€โ”€โ”€โ”€โ”€โ”€โ”‚  [T5]   โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”‚   []    โ”‚
   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ steal โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ steal โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
        โ”‚
        โ–ผ push/pop (owner only)
  • Owner pushes/pops from the bottom (LIFO โ€” good for cache locality)
  • Thieves steal from the top (FIFO โ€” takes oldest, largest tasks)
  • Only contention is when stealing โ€” and thatโ€™s rare in balanced workloads

2. Memory Ordering and Atomics

The Chase-Lev deque requires careful use of atomic operations. Understanding memory ordering is essential.

What Are Atomics?

Atomic operations appear to happen instantaneously from other threadsโ€™ perspectives. No tearing (partial reads/writes).

std::atomic<int> counter{0};

// Thread 1
counter.fetch_add(1);  // Atomic increment

// Thread 2
int val = counter.load();  // Always sees complete value (0 or 1, never garbage)

Memory Ordering

The CPU and compiler reorder operations for performance. Memory ordering constrains this reordering.

Ordering Guarantee Use Case
relaxed Atomicity only, no ordering Counters, statistics
acquire All reads after this see all writes before a paired release Lock acquisition
release All writes before this are visible after a paired acquire Lock release
acq_rel Both acquire and release Read-modify-write in critical sections
seq_cst Total global order Simple but expensive

Example: Spinlock with Ordering

class Spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        while (locked.exchange(true, std::memory_order_acquire)) {
            // spin
        }
        // ACQUIRE: All reads below see all writes before unlock
    }

    void unlock() {
        // RELEASE: All writes above are visible after next acquire
        locked.store(false, std::memory_order_release);
    }
};

3. The Chase-Lev Deque

The work-stealing deque is the heart of the system. It must be:

  • Wait-free for owner (push/pop never block)
  • Lock-free for thieves (steal may retry but always progresses)
  • Cache-efficient (owner operations are local)

Deque Structure

struct Deque {
    std::atomic<int64_t> top;     // Thieves steal from here
    std::atomic<int64_t> bottom;  // Owner pushes/pops here
    std::atomic<Task**> buffer;   // Circular buffer of tasks
    int64_t capacity;
};

Visual Model

Circular buffer (capacity = 8):
     0   1   2   3   4   5   6   7
   โ”Œโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”
   โ”‚   โ”‚   โ”‚ T1โ”‚ T2โ”‚ T3โ”‚   โ”‚   โ”‚   โ”‚
   โ””โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”˜
             โ–ฒ           โ–ฒ
             โ”‚           โ”‚
            top=2     bottom=5

Owner pushes at bottom, pops at bottom-1
Thieves steal from top

Push Operation (Owner Only)

void push(Task* task) {
    int64_t b = bottom.load(relaxed);
    int64_t t = top.load(acquire);

    if (b - t >= capacity) {
        grow();  // Resize buffer if full
        b = bottom.load(relaxed);
    }

    buffer.load(relaxed)[b % capacity] = task;
    std::atomic_thread_fence(release);  // Ensure task visible before bottom update
    bottom.store(b + 1, relaxed);
}

Pop Operation (Owner Only)

Task* pop() {
    int64_t b = bottom.load(relaxed) - 1;
    bottom.store(b, relaxed);
    std::atomic_thread_fence(seq_cst);  // Full barrier for race with steal
    int64_t t = top.load(relaxed);

    if (t <= b) {
        Task* task = buffer.load(relaxed)[b % capacity];
        if (t == b) {
            // Last element โ€” race with steal
            if (!top.compare_exchange_strong(t, t + 1, seq_cst, relaxed)) {
                // Lost race, thief took it
                bottom.store(t + 1, relaxed);
                return nullptr;
            }
            bottom.store(t + 1, relaxed);
        }
        return task;
    } else {
        // Deque was empty
        bottom.store(t, relaxed);
        return nullptr;
    }
}

Steal Operation (Thieves)

Task* steal() {
    int64_t t = top.load(acquire);
    std::atomic_thread_fence(seq_cst);
    int64_t b = bottom.load(acquire);

    if (t < b) {
        Task* task = buffer.load(relaxed)[t % capacity];
        if (!top.compare_exchange_strong(t, t + 1, seq_cst, relaxed)) {
            // Lost race with another thief or pop
            return nullptr;  // Retry later
        }
        return task;
    }
    return nullptr;  // Empty
}

4. Cache Coherency and False Sharing

Cache Lines

Modern CPUs load memory in 64-byte cache lines. When one core writes to a cache line, all other coresโ€™ copies are invalidated.

Cache Line (64 bytes):
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ [counter_thread1] [counter_thread2] [counter_thread3] [...]    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Thread 1 increments counter_thread1 โ†’ entire line invalidated on all cores!

False Sharing

Variables that could be independent share a cache line, causing โ€œfalseโ€ contention:

// BAD: False sharing
struct WorkerState {
    std::atomic<int> tasks_completed;  // 4 bytes
    // Next worker's state might be in same cache line!
};
WorkerState workers[8];  // Packed together

// GOOD: Padded to cache line
struct alignas(64) WorkerState {
    std::atomic<int> tasks_completed;
    char padding[60];  // Fill to 64 bytes
};

Measuring False Sharing

Use perf to detect:

$ perf stat -e cache-misses,cache-references ./your_program

# High cache-miss ratio with many threads = likely false sharing

5. Worker Thread Design

Thread Lifecycle

void worker_loop(WorkerState* state) {
    while (!shutdown) {
        // 1. Try to get work from own deque
        Task* task = state->deque.pop();

        if (!task) {
            // 2. No local work โ€” try to steal
            task = try_steal_from_others(state);
        }

        if (task) {
            // 3. Execute task
            task->execute();
            // Task might push more work to our deque
        } else {
            // 4. No work anywhere โ€” park thread
            park_thread(state);
        }
    }
}

Stealing Strategy

Random victim selection works well:

Task* try_steal_from_others(WorkerState* me) {
    int start = random() % num_workers;
    for (int i = 0; i < num_workers; i++) {
        int victim = (start + i) % num_workers;
        if (victim != me->id) {
            Task* stolen = workers[victim].deque.steal();
            if (stolen) return stolen;
        }
    }
    return nullptr;  // Everyone is empty
}

Thread Parking

When no work is available, threads should sleep rather than spin:

void park_thread(WorkerState* state) {
    std::unique_lock<std::mutex> lock(state->park_mutex);
    state->parked = true;

    // Wait with timeout to handle spurious wakeups
    state->park_cv.wait_for(lock, 1ms, [&] {
        return !state->parked || shutdown;
    });
}

void unpark_thread(WorkerState* state) {
    {
        std::lock_guard<std::mutex> lock(state->park_mutex);
        state->parked = false;
    }
    state->park_cv.notify_one();
}

6. Task Representation

Function Object Wrapper

class Task {
public:
    virtual void execute() = 0;
    virtual ~Task() = default;
};

template<typename F>
class FunctionTask : public Task {
    F func;
public:
    FunctionTask(F&& f) : func(std::move(f)) {}
    void execute() override { func(); }
};

Fork-Join Pattern

Work-stealing excels at divide-and-conquer:

void parallel_quicksort(int* arr, int lo, int hi) {
    if (hi - lo < THRESHOLD) {
        std::sort(arr + lo, arr + hi);
        return;
    }

    int pivot = partition(arr, lo, hi);

    // Fork: push subtask for right half
    pool.spawn([=] { parallel_quicksort(arr, pivot + 1, hi); });

    // Execute left half locally (no stealing overhead)
    parallel_quicksort(arr, lo, pivot);

    // Join: wait for right half (implicit via work-stealing)
}

Project Specification

What Youโ€™ll Build

A work-stealing thread pool library providing:

// Core API
class ThreadPool {
public:
    ThreadPool(size_t num_threads);
    ~ThreadPool();

    // Spawn a task (returns immediately)
    template<typename F>
    void spawn(F&& task);

    // Spawn and wait for result
    template<typename F>
    auto async(F&& task) -> std::future<decltype(task())>;

    // Parallel algorithms
    template<typename Iter, typename F>
    void parallel_for(Iter begin, Iter end, F&& func);

    template<typename Iter, typename T, typename F>
    T parallel_reduce(Iter begin, Iter end, T init, F&& reducer);
};

Deliverables

  1. Header-only library: threadpool.hpp
  2. Chase-Lev deque implementation: Lock-free, resizable
  3. Demo application: Parallel Mandelbrot renderer
  4. Benchmark suite: Scaling tests, comparison with std::async
  5. Documentation: API reference and design notes

Success Criteria

# Correctness: No races detected
$ g++ -fsanitize=thread -o test test.cpp
$ ./test
# No ThreadSanitizer warnings

# Performance: Near-linear scaling
$ ./mandelbrot_bench --threads 1
Single-threaded: 2000 ms

$ ./mandelbrot_bench --threads 8
8 threads: 270 ms (7.4x speedup on 8 cores)

# Efficiency: Low overhead
$ ./micro_bench
Task spawn overhead: < 100 ns
Work-stealing latency: < 1 ยตs

Solution Architecture

Component Overview

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                          Public API                                  โ”‚
โ”‚  spawn() / async() / parallel_for() / parallel_reduce()            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                 โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                         Task Scheduler                               โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”‚
โ”‚  โ”‚                   Worker Thread Array                        โ”‚    โ”‚
โ”‚  โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”        โ”‚    โ”‚
โ”‚  โ”‚  โ”‚ Worker 0 โ”‚ โ”‚ Worker 1 โ”‚ โ”‚ Worker 2 โ”‚ โ”‚ Worker 3 โ”‚  ...   โ”‚    โ”‚
โ”‚  โ”‚  โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜        โ”‚    โ”‚
โ”‚  โ”‚       โ”‚            โ”‚            โ”‚            โ”‚               โ”‚    โ”‚
โ”‚  โ”‚  โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”         โ”‚    โ”‚
โ”‚  โ”‚  โ”‚ Deque 0 โ”‚  โ”‚ Deque 1 โ”‚  โ”‚ Deque 2 โ”‚  โ”‚ Deque 3 โ”‚         โ”‚    โ”‚
โ”‚  โ”‚  โ”‚ [Tasks] โ”‚  โ”‚ [Tasks] โ”‚  โ”‚ [Tasks] โ”‚  โ”‚ [Tasks] โ”‚         โ”‚    โ”‚
โ”‚  โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜         โ”‚    โ”‚
โ”‚  โ”‚       โ—„โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€stealโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บ                             โ”‚    โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Data Structures

Work-Stealing Deque

template<typename T>
class WSDeque {
    static constexpr size_t INITIAL_CAPACITY = 32;

    struct Array {
        std::atomic<T*> buffer;
        size_t capacity;
        // ... ring buffer implementation
    };

    alignas(64) std::atomic<int64_t> top;
    alignas(64) std::atomic<int64_t> bottom;
    alignas(64) std::atomic<Array*> array;

public:
    void push(T item);           // Owner only
    std::optional<T> pop();      // Owner only
    std::optional<T> steal();    // Thieves
};

Worker State (Cache-Line Aligned)

struct alignas(64) WorkerState {
    WSDeque<Task*> deque;
    std::thread thread;
    size_t id;

    // Parking mechanism
    std::mutex park_mutex;
    std::condition_variable park_cv;
    std::atomic<bool> parked{false};

    // Statistics
    std::atomic<uint64_t> tasks_executed{0};
    std::atomic<uint64_t> tasks_stolen{0};

    char padding[64];  // Ensure next worker is on different cache line
};

Key Algorithms

Task Distribution

spawn(task):
1. Get current thread's worker ID (thread_local)
2. If called from worker thread:
   - Push to current worker's deque (fast path)
3. If called from external thread:
   - Push to random worker's deque
   - Unpark a sleeping worker

Work Loop

worker_loop():
while not shutdown:
    1. task = own_deque.pop()
    2. if no task:
       - for each other worker (random order):
           task = other_deque.steal()
           if task: break
    3. if task:
       - execute(task)
       - update statistics
    4. else:
       - park_thread()
       - on wakeup, continue loop

Phased Implementation Guide

Phase 1: Basic Thread Pool (Days 1-2)

Goal: Working pool with a single shared queue.

Steps:

  1. Create worker threads that pop from a shared std::deque
  2. Protect with a mutex
  3. Implement spawn() that pushes to the queue
  4. Implement graceful shutdown

Test: Spawning 1000 tasks completes correctly.

Phase 2: Lock-Free Deque (Days 3-4)

Goal: Implement the Chase-Lev deque.

Steps:

  1. Implement push() (owner-only, relaxed atomics)
  2. Implement pop() (owner-only, handles race with steal)
  3. Implement steal() (lock-free, may fail and retry)
  4. Handle buffer growth when full

Test: Single-threaded tests pass. Multi-threaded stress test with TSan.

Phase 3: Work-Stealing Integration (Days 5-6)

Goal: Each worker has its own deque, idle workers steal.

Steps:

  1. Replace shared queue with per-worker deques
  2. Modify worker loop to try local pop first
  3. Add stealing logic (random victim selection)
  4. Add thread parking when no work

Test: Mandelbrot renders correctly with work-stealing.

Phase 4: Cache Optimization (Days 6-7)

Goal: Eliminate false sharing and improve cache performance.

Steps:

  1. Align WorkerState to cache lines
  2. Separate hot (deque) and cold (statistics) data
  3. Profile with perf for cache misses
  4. Optimize stealing pattern

Test: Scaling improves (7x+ speedup on 8 cores).

Phase 5: API and Algorithms (Days 8-10)

Goal: Ergonomic API with parallel algorithms.

Steps:

  1. Implement async() returning futures
  2. Implement parallel_for() with work-stealing
  3. Implement parallel_reduce()
  4. Add graceful shutdown and thread joining

Testing Strategy

Correctness Tests

// Test basic spawn/execute
TEST(ThreadPool, BasicSpawn) {
    ThreadPool pool(4);
    std::atomic<int> counter{0};

    for (int i = 0; i < 1000; i++) {
        pool.spawn([&] { counter++; });
    }

    pool.wait_all();
    EXPECT_EQ(counter.load(), 1000);
}

// Test work-stealing balance
TEST(ThreadPool, WorkStealing) {
    ThreadPool pool(4);
    std::array<std::atomic<int>, 4> per_thread_count{};

    // Push all work to thread 0
    for (int i = 0; i < 10000; i++) {
        pool.spawn_to(0, [&, tid = get_thread_id()] {
            per_thread_count[tid]++;
        });
    }

    pool.wait_all();

    // Verify work was stolen (all threads did some work)
    for (auto& count : per_thread_count) {
        EXPECT_GT(count.load(), 0);
    }
}

Stress Tests

// Concurrent spawn and steal
void stress_test() {
    ThreadPool pool(8);
    std::atomic<int> completed{0};

    // Spawn from multiple threads
    std::vector<std::thread> spawners;
    for (int t = 0; t < 4; t++) {
        spawners.emplace_back([&] {
            for (int i = 0; i < 10000; i++) {
                pool.spawn([&] { completed++; });
            }
        });
    }

    for (auto& t : spawners) t.join();
    pool.wait_all();

    assert(completed == 40000);
}

ThreadSanitizer

$ g++ -std=c++17 -fsanitize=thread -g -O1 -o test_tsan test.cpp
$ ./test_tsan
# Must report NO data races

Common Pitfalls and Debugging

Pitfall 1: ABA Problem in Deque

Symptom: Occasional corruption, tasks executed twice.

Cause: Between reading top and CAS, another thread steals and new tasks wrap around.

Fix: Use 64-bit indices (never wrap in practice) or add version counter:

struct alignas(16) TopCounter {
    int64_t index;
    int64_t version;
};
std::atomic<TopCounter> top;

Pitfall 2: Memory Ordering Bugs

Symptom: Works single-threaded, fails multi-threaded.

Debug:

  • Use TSan โ€” it catches most ordering bugs
  • Add assert() that check invariants
  • Start with seq_cst everywhere, then carefully weaken

Pitfall 3: Deadlock on Shutdown

Symptom: Program hangs on exit.

Cause: Workers parked waiting for work when shutdown requested.

Fix:

void shutdown() {
    shutdown_flag.store(true);
    // Wake all parked threads
    for (auto& worker : workers) {
        worker.park_cv.notify_all();
    }
    // Now join threads
}

Pitfall 4: Thundering Herd

Symptom: Performance tanks when new work arrives after idle period.

Cause: All parked threads wake simultaneously and contend.

Fix: Unpark threads incrementally:

void spawn(Task* task) {
    push_to_deque(task);
    // Only unpark one thread
    for (auto& w : workers) {
        if (w.parked.load()) {
            unpark_thread(&w);
            break;  // Only one!
        }
    }
}

Extensions and Challenges

Extension 1: Task Dependencies

Support task graphs:

auto a = pool.async([] { return compute_a(); });
auto b = pool.async([] { return compute_b(); });
auto c = pool.async([&] { return a.get() + b.get(); });  // Waits for a and b

Extension 2: Priority Queues

Some tasks are more important:

pool.spawn(task, Priority::High);    // Processed first
pool.spawn(task, Priority::Low);     // Processed when nothing else

Extension 3: Affinity and NUMA

Pin workers to cores, allocate from local memory:

ThreadPool pool(8, ThreadPool::AffinityMode::PinToCores);
pool.spawn_on_core(3, task);  // Run specifically on core 3

Extension 4: Continuations

Chain tasks without blocking:

pool.async(compute_a)
    .then([](int a) { return compute_b(a); })
    .then([](int b) { std::cout << b; });

Real-World Connections

Where Work-Stealing is Used

  1. Rayon (Rust): Powers parallel iterators with work-stealing
  2. Go Runtime: Goroutine scheduler uses work-stealing across Pโ€™s
  3. Intel TBB: parallel_for uses work-stealing
  4. Java ForkJoinPool: Standard library work-stealing executor

Industry Patterns

Pattern Work-Stealing Benefit
Parallel Sort Recursive subdivision naturally balances
Game Physics Particle systems with variable work per particle
Build Systems Compile tasks with dependencies
Ray Tracing Rays with variable depth naturally balance

Resources

Essential Reading

  1. โ€œRust Atomics and Locksโ€ by Mara Bos โ€” Best explanation of memory ordering
  2. โ€œThe Art of Multiprocessor Programmingโ€ by Herlihy & Shavit โ€” Lock-free algorithms
  3. โ€œC++ Concurrency in Actionโ€ by Anthony Williams โ€” Practical C++ threading

Papers

  • โ€œScheduling Multithreaded Computations by Work Stealingโ€ โ€” Blumofe & Leiserson (original algorithm)
  • โ€œDynamic Circular Work-Stealing Dequeโ€ โ€” Chase & Lev (the deque youโ€™ll implement)
  • โ€œCorrect and Efficient Work-Stealing for Weak Memory Modelsโ€ โ€” Lรช et al.

Code to Study


Self-Assessment Checklist

Before considering this project complete, verify:

Understanding

  • I can explain acquire/release semantics and when each is needed
  • I can describe why the Chase-Lev deque uses specific memory orderings
  • I can explain what false sharing is and identify it in code
  • I can explain why work-stealing achieves better load balance than a shared queue
  • I can describe the ABA problem and how to prevent it

Implementation

  • My deque passes single-threaded tests
  • ThreadSanitizer reports no races
  • My pool achieves 6x+ speedup on 8 cores for the Mandelbrot benchmark
  • Thread parking works (CPU usage drops when pool is idle)
  • Shutdown is clean (no hung threads)

Verification

  • Iโ€™ve stress-tested with millions of tasks
  • Iโ€™ve tested with more threads than cores (oversubscription)
  • Iโ€™ve profiled with perf and understand where time is spent
  • Iโ€™ve measured cache-miss rates before and after optimization

Work-stealing thread pools power some of the most important parallel computing systems. Building one teaches you concurrency at the deepest level.