P02: Work-Stealing Thread Pool
Build a high-performance thread pool with work-stealing scheduling, like Rayon or Go's runtime.
| Attribute | Value |
|---|---|
| Main Language | C++ |
| Alternative Languages | Rust, Go, Java |
| Difficulty | Level 3: Advanced |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Knowledge Area | Thread Pool Design, Work Scheduling |
| Key Tools | pthreads, C++ atomics, perf |
| Main Book | "Rust Atomics and Locks" - Mara Bos |
Learning Objectives
By completing this project, you will:
- Master atomic operations: understand memory ordering (acquire, release, seq_cst) and when each is needed
- Implement lock-free data structures: build a Chase-Lev work-stealing deque
- Understand cache coherency: learn why false sharing destroys performance and how to avoid it
- Design efficient schedulers: balance work across threads without creating bottlenecks
- Profile concurrent code: use perf and ThreadSanitizer to find issues
- Build parallel algorithms: create parallel map, reduce, and sort using your pool
The Core Question
"How do systems like Rayon achieve near-linear speedup, and why doesn't adding more threads always make things faster?"
Adding threads seems simple: more workers = more throughput. But reality is harsh:
- Lock contention: Threads fighting over shared data waste cycles waiting
- Cache invalidation: One thread modifying data invalidates another's cache
- Load imbalance: Some threads finish early while others are overloaded
- Synchronization overhead: Coordinating threads costs CPU time
Work-stealing solves these elegantly: each thread has its own work queue, and idle threads "steal" from busy ones. No central bottleneck, natural load balancing.
After this project, you'll understand why Go's goroutine scheduler and Rust's Rayon use work-stealing, and you'll have built one yourself.
Deep Theoretical Foundation
1. Why Work-Stealing?
The Problem with Centralized Queues
Traditional thread pools use a single shared queue:
             +--------------+
             |  Task Queue  |
             | [T1][T2][T3] |
             +------+-------+
                    |
        +-----------+-----------+
        |           |           |
        v           v           v
  +----------+ +----------+ +----------+
  | Worker 1 | | Worker 2 | | Worker 3 |
  +----------+ +----------+ +----------+
Every worker must lock the queue to get work. With 8 workers hitting the same lock, contention dominates.
Work-Stealing Architecture
Each worker has its own deque (double-ended queue):
  +----------+       +----------+       +----------+
  | Worker 1 |       | Worker 2 |       | Worker 3 |
  +----+-----+       +----+-----+       +----+-----+
       |                  |                  |
  +----v-----+  steal +---v------+  steal +--v-------+
  | Deque 1  | <----- | Deque 2  | <----- | Deque 3  |
  | [T1][T2] |        |   [T5]   |        |    []    |
  +----+-----+        +----------+        +----------+
       |
       v  push/pop (owner only)
- The owner pushes/pops at the bottom (LIFO: good for cache locality)
- Thieves steal from the top (FIFO: they take the oldest, typically largest tasks)
- The only contention is during steals, and that's rare in balanced workloads
2. Memory Ordering and Atomics
The Chase-Lev deque requires careful use of atomic operations. Understanding memory ordering is essential.
What Are Atomics?
Atomic operations appear to happen instantaneously from other threads' perspectives. No tearing (partial reads/writes).
std::atomic<int> counter{0};
// Thread 1
counter.fetch_add(1); // Atomic increment
// Thread 2
int val = counter.load(); // Always sees complete value (0 or 1, never garbage)
Memory Ordering
The CPU and compiler reorder operations for performance. Memory ordering constrains this reordering.
| Ordering | Guarantee | Use Case |
|---|---|---|
| relaxed | Atomicity only, no ordering | Counters, statistics |
| acquire | All reads after this see all writes before a paired release | Lock acquisition |
| release | All writes before this are visible to a paired acquire | Lock release |
| acq_rel | Both acquire and release | Read-modify-write in critical sections |
| seq_cst | Total global order | Simple but expensive |
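To make the acquire/release pairing concrete, here is a minimal message-passing sketch: the producer publishes data with a release store, and a consumer that observes the flag via an acquire load is guaranteed to also see the data written before it.
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                    // plain write
    ready.store(true, std::memory_order_release); // publish: writes above become visible
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    assert(data == 42); // guaranteed by the acquire/release pairing
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}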
Example: Spinlock with Ordering
class Spinlock {
std::atomic<bool> locked{false};
public:
void lock() {
while (locked.exchange(true, std::memory_order_acquire)) {
// spin
}
// ACQUIRE: All reads below see all writes before unlock
}
void unlock() {
// RELEASE: All writes above are visible after next acquire
locked.store(false, std::memory_order_release);
}
};
3. The Chase-Lev Deque
The work-stealing deque is the heart of the system. It must be:
- Wait-free for owner (push/pop never block)
- Lock-free for thieves (steal may retry but always progresses)
- Cache-efficient (owner operations are local)
Deque Structure
struct Deque {
std::atomic<int64_t> top; // Thieves steal from here
std::atomic<int64_t> bottom; // Owner pushes/pops here
std::atomic<Task**> buffer; // Circular buffer of tasks
int64_t capacity;
};
Visual Model
Circular buffer (capacity = 8):
  index:   0    1    2    3    4    5    6    7
         +----+----+----+----+----+----+----+----+
         |    |    | T1 | T2 | T3 |    |    |    |
         +----+----+----+----+----+----+----+----+
                      ^              ^
                      |              |
                    top=2         bottom=5
Owner pushes at bottom, pops at bottom-1
Thieves steal from top
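In the snippets below, relaxed, acquire, release, and seq_cst abbreviate the corresponding std::memory_order_* constants.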
Push Operation (Owner Only)
void push(Task* task) {
int64_t b = bottom.load(relaxed);
int64_t t = top.load(acquire);
if (b - t >= capacity) {
grow(); // Resize buffer if full
b = bottom.load(relaxed);
}
buffer.load(relaxed)[b % capacity] = task;
std::atomic_thread_fence(release); // Ensure task visible before bottom update
bottom.store(b + 1, relaxed);
}
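push() calls grow() when the ring is full. The Chase-Lev paper uses a resizable circular array; below is a simplified, owner-only sketch. It assumes the buffer pointer and its capacity are published together as one object (like the Array struct shown later in the Solution Architecture) so thieves always see a consistent pair, and it deliberately leaks the old array because a thief may still be reading from it; real implementations reclaim it with an epoch- or hazard-pointer-based scheme.
struct RingArray {
    int64_t capacity;
    Task**  slots;
};
std::atomic<RingArray*> array; // replaces the separate buffer/capacity fields in this sketch

void grow() {
    int64_t b = bottom.load(relaxed);
    int64_t t = top.load(relaxed);
    RingArray* old_arr = array.load(relaxed);
    RingArray* new_arr = new RingArray{old_arr->capacity * 2,
                                       new Task*[old_arr->capacity * 2]};
    for (int64_t i = t; i < b; i++) {
        // Copy the live range [top, bottom) into the larger ring
        new_arr->slots[i % new_arr->capacity] = old_arr->slots[i % old_arr->capacity];
    }
    array.store(new_arr, release); // publish buffer and capacity as one unit
    // old_arr is intentionally leaked here; see note above
}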
Pop Operation (Owner Only)
Task* pop() {
    int64_t b = bottom.load(relaxed) - 1;
    bottom.store(b, relaxed);
    std::atomic_thread_fence(seq_cst); // Full barrier for the race with steal
    int64_t t = top.load(relaxed);
    if (t <= b) {
        Task* task = buffer.load(relaxed)[b % capacity];
        if (t == b) {
            // Last element: race with a concurrent steal
            if (!top.compare_exchange_strong(t, t + 1, seq_cst, relaxed)) {
                // Lost the race; a thief took the task
                task = nullptr;
            }
            // Restore bottom to its pre-decrement value (b + 1); note that a failed
            // CAS overwrites t, so the new bottom must not be derived from t here
            bottom.store(b + 1, relaxed);
        }
        return task;
    } else {
        // Deque was empty; restore bottom
        bottom.store(b + 1, relaxed);
        return nullptr;
    }
}
Steal Operation (Thieves)
Task* steal() {
int64_t t = top.load(acquire);
std::atomic_thread_fence(seq_cst);
int64_t b = bottom.load(acquire);
if (t < b) {
Task* task = buffer.load(relaxed)[t % capacity];
if (!top.compare_exchange_strong(t, t + 1, seq_cst, relaxed)) {
// Lost race with another thief or pop
return nullptr; // Retry later
}
return task;
}
return nullptr; // Empty
}
4. Cache Coherency and False Sharing
Cache Lines
Modern CPUs load memory in 64-byte cache lines. When one core writes to a cache line, all other coresโ copies are invalidated.
Cache Line (64 bytes):
+------------------------------------------------------------------+
| [counter_thread1] [counter_thread2] [counter_thread3] [...]      |
+------------------------------------------------------------------+
Thread 1 increments counter_thread1 - the entire line is invalidated on all cores!
False Sharing
Variables that could be independent share a cache line, causing "false" contention:
// BAD: False sharing
struct WorkerState {
std::atomic<int> tasks_completed; // 4 bytes
// Next worker's state might be in same cache line!
};
WorkerState workers[8]; // Packed together
// GOOD: Padded to cache line
struct alignas(64) WorkerState {
std::atomic<int> tasks_completed;
char padding[60]; // Fill to 64 bytes
};
Measuring False Sharing
Use perf to detect:
$ perf stat -e cache-misses,cache-references ./your_program
# High cache-miss ratio with many threads = likely false sharing
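A small self-contained demo you could run under perf to see the effect: each thread increments its own counter, once with the counters packed into shared cache lines and once padded to separate lines. The padded version typically shows far fewer cache misses and runs several times faster; the struct and constant names here are just illustrative.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int  kThreads = 8;
constexpr long kIters   = 10'000'000;

struct Packed            { std::atomic<long> n{0}; }; // counters share cache lines
struct alignas(64) Padded { std::atomic<long> n{0}; }; // one counter per cache line

template <typename Slot>
long run() {
    std::vector<Slot> slots(kThreads);
    std::vector<std::thread> threads;
    for (int t = 0; t < kThreads; t++)
        threads.emplace_back([&, t] {
            for (long i = 0; i < kIters; i++)
                slots[t].n.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : threads) th.join();
    long total = 0;
    for (auto& s : slots) total += s.n.load();
    return total;
}

int main() {
    std::printf("packed total: %ld\n", run<Packed>());
    std::printf("padded total: %ld\n", run<Padded>());
}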
5. Worker Thread Design
Thread Lifecycle
void worker_loop(WorkerState* state) {
while (!shutdown) {
// 1. Try to get work from own deque
Task* task = state->deque.pop();
if (!task) {
// 2. No local work - try to steal
task = try_steal_from_others(state);
}
if (task) {
// 3. Execute task
task->execute();
// Task might push more work to our deque
} else {
// 4. No work anywhere - park thread
park_thread(state);
}
}
}
Stealing Strategy
Random victim selection works well:
Task* try_steal_from_others(WorkerState* me) {
int start = random() % num_workers;
for (int i = 0; i < num_workers; i++) {
int victim = (start + i) % num_workers;
if (victim != me->id) {
Task* stolen = workers[victim].deque.steal();
if (stolen) return stolen;
}
}
return nullptr; // Everyone is empty
}
Thread Parking
When no work is available, threads should sleep rather than spin:
void park_thread(WorkerState* state) {
std::unique_lock<std::mutex> lock(state->park_mutex);
state->parked = true;
// Wait with timeout to handle spurious wakeups
state->park_cv.wait_for(lock, 1ms, [&] {
return !state->parked || shutdown;
});
}
void unpark_thread(WorkerState* state) {
{
std::lock_guard<std::mutex> lock(state->park_mutex);
state->parked = false;
}
state->park_cv.notify_one();
}
6. Task Representation
Function Object Wrapper
class Task {
public:
virtual void execute() = 0;
virtual ~Task() = default;
};
template<typename F>
class FunctionTask : public Task {
F func;
public:
FunctionTask(F&& f) : func(std::move(f)) {}
void execute() override { func(); }
};
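A hedged sketch of how spawn() could turn an arbitrary callable into a heap-allocated Task. make_task and local_deque() are illustrative names, not part of the API shown later; the worker that eventually executes the task would delete it afterwards.
template<typename F>
Task* make_task(F f) {                        // take by value, then move into the task
    return new FunctionTask<F>(std::move(f));
}

// Possible use inside spawn(), assuming a hypothetical local_deque() accessor:
//   void ThreadPool::spawn(F&& f) { local_deque().push(make_task(std::forward<F>(f))); }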
Fork-Join Pattern
Work-stealing excels at divide-and-conquer:
void parallel_quicksort(int* arr, int lo, int hi) {
if (hi - lo < THRESHOLD) {
std::sort(arr + lo, arr + hi);
return;
}
int pivot = partition(arr, lo, hi);
// Fork: push subtask for right half
pool.spawn([=] { parallel_quicksort(arr, pivot + 1, hi); });
// Execute left half locally (no stealing overhead)
parallel_quicksort(arr, lo, pivot);
// Join: wait for the right half to finish; spawn() alone does not wait (see the sketch below)
}
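A minimal join sketch, assuming a hypothetical pool.try_run_one_task() helper that pops or steals and runs one pending task: the waiting worker keeps doing useful work instead of blocking or spinning.
template<typename F>
void spawn_and_join(ThreadPool& pool, F&& right_half) {
    std::atomic<bool> done{false};
    pool.spawn([&done, f = std::forward<F>(right_half)]() mutable {
        f();
        done.store(true, std::memory_order_release); // signal completion
    });
    while (!done.load(std::memory_order_acquire)) {
        pool.try_run_one_task(); // hypothetical: help with other work while waiting
    }
}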
Project Specification
What Youโll Build
A work-stealing thread pool library providing:
// Core API
class ThreadPool {
public:
ThreadPool(size_t num_threads);
~ThreadPool();
// Spawn a task (returns immediately)
template<typename F>
void spawn(F&& task);
// Spawn and wait for result
template<typename F>
auto async(F&& task) -> std::future<decltype(task())>;
// Parallel algorithms
template<typename Iter, typename F>
void parallel_for(Iter begin, Iter end, F&& func);
template<typename Iter, typename T, typename F>
T parallel_reduce(Iter begin, Iter end, T init, F&& reducer);
};
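Example usage of this API might look like the following sketch, assuming parallel_for passes each element by reference and parallel_reduce combines values pairwise (those details are up to your design):
ThreadPool pool(std::thread::hardware_concurrency());

// Fire-and-forget task
pool.spawn([] { /* background work */ });

// Task with a result
std::future<int> f = pool.async([] { return 6 * 7; });
int result = f.get(); // 42

// Data-parallel loop: double every element
std::vector<double> values(1'000'000, 1.0);
pool.parallel_for(values.begin(), values.end(), [](double& v) { v *= 2.0; });

// Parallel reduction: sum all elements
double sum = pool.parallel_reduce(values.begin(), values.end(), 0.0,
                                  [](double a, double b) { return a + b; });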
Deliverables
- Header-only library: threadpool.hpp
- Chase-Lev deque implementation: Lock-free, resizable
- Demo application: Parallel Mandelbrot renderer
- Benchmark suite: Scaling tests, comparison with std::async
- Documentation: API reference and design notes
Success Criteria
# Correctness: No races detected
$ g++ -fsanitize=thread -o test test.cpp
$ ./test
# No ThreadSanitizer warnings
# Performance: Near-linear scaling
$ ./mandelbrot_bench --threads 1
Single-threaded: 2000 ms
$ ./mandelbrot_bench --threads 8
8 threads: 270 ms (7.4x speedup on 8 cores)
# Efficiency: Low overhead
$ ./micro_bench
Task spawn overhead: < 100 ns
Work-stealing latency: < 1 µs
Solution Architecture
Component Overview
+---------------------------------------------------------------------+
|                             Public API                              |
|       spawn() / async() / parallel_for() / parallel_reduce()        |
+----------------------------------+----------------------------------+
                                   |
+----------------------------------v----------------------------------+
|                            Task Scheduler                           |
|  +----------------------------------------------------------------+ |
|  |                      Worker Thread Array                       | |
|  |  +----------+   +----------+   +----------+   +----------+     | |
|  |  | Worker 0 |   | Worker 1 |   | Worker 2 |   | Worker 3 | ... | |
|  |  +----+-----+   +----+-----+   +----+-----+   +----+-----+     | |
|  |       |              |              |              |           | |
|  |  +----v-----+   +----v-----+   +----v-----+   +----v-----+     | |
|  |  | Deque 0  |   | Deque 1  |   | Deque 2  |   | Deque 3  |     | |
|  |  | [Tasks]  |   | [Tasks]  |   | [Tasks]  |   | [Tasks]  |     | |
|  |  +----------+   +----------+   +----------+   +----------+     | |
|  |        <--------------------- steal --------------------->     | |
|  +----------------------------------------------------------------+ |
+---------------------------------------------------------------------+
Data Structures
Work-Stealing Deque
template<typename T>
class WSDeque {
static constexpr size_t INITIAL_CAPACITY = 32;
struct Array {
std::atomic<T*> buffer;
size_t capacity;
// ... ring buffer implementation
};
alignas(64) std::atomic<int64_t> top;
alignas(64) std::atomic<int64_t> bottom;
alignas(64) std::atomic<Array*> array;
public:
void push(T item); // Owner only
std::optional<T> pop(); // Owner only
std::optional<T> steal(); // Thieves
};
Worker State (Cache-Line Aligned)
struct alignas(64) WorkerState {
WSDeque<Task*> deque;
std::thread thread;
size_t id;
// Parking mechanism
std::mutex park_mutex;
std::condition_variable park_cv;
std::atomic<bool> parked{false};
// Statistics
std::atomic<uint64_t> tasks_executed{0};
std::atomic<uint64_t> tasks_stolen{0};
char padding[64]; // Ensure next worker is on different cache line
};
Key Algorithms
Task Distribution
spawn(task):
1. Get current thread's worker ID (thread_local)
2. If called from worker thread:
- Push to current worker's deque (fast path)
3. If called from external thread:
- Push to random worker's deque
- Unpark a sleeping worker
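A hedged C++ sketch of this routing; worker_index, workers, next_random(), unpark_thread(), and make_task() are assumed names. Note that Chase-Lev push() is owner-only, so a production pool usually routes external submissions through a separate injection queue (or a short lock) rather than pushing directly into another thread's deque.
// Each worker thread sets its own index at startup; external threads keep -1.
thread_local int worker_index = -1;

template<typename F>
void ThreadPool::spawn(F&& f) {
    Task* task = make_task(std::forward<F>(f)); // wrap the callable (see Task Representation)
    if (worker_index >= 0) {
        // Fast path: called from a worker thread, push to its own deque
        workers[worker_index].deque.push(task);
    } else {
        // External thread: hand the task to a worker and wake it if it is parked.
        // (A real implementation would use an injection queue here; see note above.)
        size_t victim = next_random() % workers.size();
        workers[victim].deque.push(task);
        unpark_thread(&workers[victim]);
    }
}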
Work Loop
worker_loop():
while not shutdown:
1. task = own_deque.pop()
2. if no task:
- for each other worker (random order):
task = other_deque.steal()
if task: break
3. if task:
- execute(task)
- update statistics
4. else:
- park_thread()
- on wakeup, continue loop
Phased Implementation Guide
Phase 1: Basic Thread Pool (Days 1-2)
Goal: Working pool with a single shared queue.
Steps:
- Create worker threads that pop from a shared std::deque
- Protect the queue with a mutex
- Implement spawn() that pushes to the queue
- Implement graceful shutdown
Test: Spawning 1000 tasks completes correctly.
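A minimal Phase 1 sketch using nothing beyond the standard library (SimplePool is an illustrative name; a single mutex-protected queue, no stealing yet):
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class SimplePool {
    std::deque<std::function<void()>> queue;
    std::mutex mtx;
    std::condition_variable cv;
    std::vector<std::thread> threads;
    bool stop = false;

public:
    explicit SimplePool(size_t n) {
        for (size_t i = 0; i < n; i++)
            threads.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mtx);
                        cv.wait(lock, [this] { return stop || !queue.empty(); });
                        if (stop && queue.empty()) return;   // graceful shutdown
                        task = std::move(queue.front());
                        queue.pop_front();
                    }
                    task();                                   // run outside the lock
                }
            });
    }

    template<typename F>
    void spawn(F&& f) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            queue.emplace_back(std::forward<F>(f));
        }
        cv.notify_one();
    }

    ~SimplePool() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stop = true;
        }
        cv.notify_all();
        for (auto& t : threads) t.join();
    }
};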
Phase 2: Lock-Free Deque (Days 3-4)
Goal: Implement the Chase-Lev deque.
Steps:
- Implement push() (owner-only, relaxed atomics)
- Implement pop() (owner-only, handles the race with steal)
- Implement steal() (lock-free, may fail and retry)
- Handle buffer growth when full
Test: Single-threaded tests pass. Multi-threaded stress test with TSan.
Phase 3: Work-Stealing Integration (Days 5-6)
Goal: Each worker has its own deque, idle workers steal.
Steps:
- Replace shared queue with per-worker deques
- Modify worker loop to try local pop first
- Add stealing logic (random victim selection)
- Add thread parking when no work
Test: Mandelbrot renders correctly with work-stealing.
Phase 4: Cache Optimization (Days 6-7)
Goal: Eliminate false sharing and improve cache performance.
Steps:
- Align WorkerState to cache lines
- Separate hot (deque) and cold (statistics) data
- Profile with perf for cache misses
- Optimize the stealing pattern
Test: Scaling improves (7x+ speedup on 8 cores).
Phase 5: API and Algorithms (Days 8-10)
Goal: Ergonomic API with parallel algorithms.
Steps:
- Implement async() returning futures
- Implement parallel_for() with work-stealing
- Implement parallel_reduce()
- Add graceful shutdown and thread joining
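A hedged sketch of async() layered on spawn(), using std::packaged_task; the signature mirrors the Public API declared earlier.
#include <future>
#include <memory>

template<typename F>
auto ThreadPool::async(F&& task) -> std::future<decltype(task())> {
    using R = decltype(task());
    // packaged_task is move-only; hold it in a shared_ptr so the spawned lambda stays copyable
    auto pt = std::make_shared<std::packaged_task<R()>>(std::forward<F>(task));
    std::future<R> result = pt->get_future();
    spawn([pt] { (*pt)(); });
    return result;
}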
Testing Strategy
Correctness Tests
// Test basic spawn/execute
TEST(ThreadPool, BasicSpawn) {
ThreadPool pool(4);
std::atomic<int> counter{0};
for (int i = 0; i < 1000; i++) {
pool.spawn([&] { counter++; });
}
pool.wait_all();
EXPECT_EQ(counter.load(), 1000);
}
// Test work-stealing balance
TEST(ThreadPool, WorkStealing) {
ThreadPool pool(4);
std::array<std::atomic<int>, 4> per_thread_count{};
// Push all work to thread 0
for (int i = 0; i < 10000; i++) {
pool.spawn_to(0, [&] {
    // Read the executing worker's id inside the task body, not at capture time
    per_thread_count[get_thread_id()]++;
});
}
pool.wait_all();
// Verify work was stolen (all threads did some work)
for (auto& count : per_thread_count) {
EXPECT_GT(count.load(), 0);
}
}
Stress Tests
// Concurrent spawn and steal
void stress_test() {
ThreadPool pool(8);
std::atomic<int> completed{0};
// Spawn from multiple threads
std::vector<std::thread> spawners;
for (int t = 0; t < 4; t++) {
spawners.emplace_back([&] {
for (int i = 0; i < 10000; i++) {
pool.spawn([&] { completed++; });
}
});
}
for (auto& t : spawners) t.join();
pool.wait_all();
assert(completed == 40000);
}
ThreadSanitizer
$ g++ -std=c++17 -fsanitize=thread -g -O1 -o test_tsan test.cpp
$ ./test_tsan
# Must report NO data races
Common Pitfalls and Debugging
Pitfall 1: ABA Problem in Deque
Symptom: Occasional corruption, tasks executed twice.
Cause: Between reading top and CAS, another thread steals and new tasks wrap around.
Fix: Use 64-bit indices (never wrap in practice) or add version counter:
struct alignas(16) TopCounter {
int64_t index;
int64_t version;
};
std::atomic<TopCounter> top;
Pitfall 2: Memory Ordering Bugs
Symptom: Works single-threaded, fails multi-threaded.
Debug:
- Use TSan: it catches most ordering bugs
- Add assert()s that check invariants
- Start with seq_cst everywhere, then carefully weaken
Pitfall 3: Deadlock on Shutdown
Symptom: Program hangs on exit.
Cause: Workers parked waiting for work when shutdown requested.
Fix:
void shutdown() {
shutdown_flag.store(true);
// Wake all parked threads
for (auto& worker : workers) {
worker.park_cv.notify_all();
}
// Now join threads
}
Pitfall 4: Thundering Herd
Symptom: Performance tanks when new work arrives after idle period.
Cause: All parked threads wake simultaneously and contend.
Fix: Unpark threads incrementally:
void spawn(Task* task) {
push_to_deque(task);
// Only unpark one thread
for (auto& w : workers) {
if (w.parked.load()) {
unpark_thread(&w);
break; // Only one!
}
}
}
Extensions and Challenges
Extension 1: Task Dependencies
Support task graphs:
auto a = pool.async([] { return compute_a(); });
auto b = pool.async([] { return compute_b(); });
auto c = pool.async([&] { return a.get() + b.get(); }); // Waits for a and b
Extension 2: Priority Queues
Some tasks are more important:
pool.spawn(task, Priority::High); // Processed first
pool.spawn(task, Priority::Low); // Processed when nothing else
Extension 3: Affinity and NUMA
Pin workers to cores, allocate from local memory:
ThreadPool pool(8, ThreadPool::AffinityMode::PinToCores);
pool.spawn_on_core(3, task); // Run specifically on core 3
Extension 4: Continuations
Chain tasks without blocking:
pool.async(compute_a)
.then([](int a) { return compute_b(a); })
.then([](int b) { std::cout << b; });
Real-World Connections
Where Work-Stealing is Used
- Rayon (Rust): Powers parallel iterators with work-stealing
- Go Runtime: Goroutine scheduler uses work-stealing across P's (processors)
- Intel TBB: parallel_for uses work-stealing
- Java ForkJoinPool: Standard library work-stealing executor
Industry Patterns
| Pattern | Work-Stealing Benefit |
|---|---|
| Parallel Sort | Recursive subdivision naturally balances |
| Game Physics | Particle systems with variable work per particle |
| Build Systems | Compile tasks with dependencies |
| Ray Tracing | Rays with variable depth naturally balance |
Resources
Essential Reading
- "Rust Atomics and Locks" by Mara Bos: the best explanation of memory ordering
- "The Art of Multiprocessor Programming" by Herlihy & Shavit: lock-free algorithms
- "C++ Concurrency in Action" by Anthony Williams: practical C++ threading
Papers
- "Scheduling Multithreaded Computations by Work Stealing" by Blumofe & Leiserson (the original algorithm)
- "Dynamic Circular Work-Stealing Deque" by Chase & Lev (the deque you'll implement)
- "Correct and Efficient Work-Stealing for Weak Memory Models" by Lê et al.
Code to Study
- Rayon source code: production work-stealing in Rust
- Go runtime scheduler: work-stealing for goroutines
- crossbeam-deque: an excellent Rust deque implementation
Self-Assessment Checklist
Before considering this project complete, verify:
Understanding
- I can explain acquire/release semantics and when each is needed
- I can describe why the Chase-Lev deque uses specific memory orderings
- I can explain what false sharing is and identify it in code
- I can explain why work-stealing achieves better load balance than a shared queue
- I can describe the ABA problem and how to prevent it
Implementation
- My deque passes single-threaded tests
- ThreadSanitizer reports no races
- My pool achieves 6x+ speedup on 8 cores for the Mandelbrot benchmark
- Thread parking works (CPU usage drops when pool is idle)
- Shutdown is clean (no hung threads)
Verification
- I've stress-tested with millions of tasks
- I've tested with more threads than cores (oversubscription)
- I've profiled with perf and understand where time is spent
- I've measured cache-miss rates before and after optimization
Work-stealing thread pools power some of the most important parallel computing systems. Building one teaches you concurrency at the deepest level.