Project 3: Work-Stealing Thread Pool
Build a production-quality thread pool with per-thread work queues and work-stealing for load balancing. This is what powers Intel TBB, Rust’s Tokio, and Go’s runtime scheduler.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time | 2-4 weeks |
| Language | C++17 or later |
| Alt Languages | Rust, Go, Java |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Value | Open Core Infrastructure |
| Knowledge Area | Thread Pool Design / Work Scheduling |
| Key Tools | std::atomic, std::thread, perf, ThreadSanitizer |
| Main Book | “C++ Concurrency in Action” by Anthony Williams |
Learning Objectives
By completing this project, you will:
- Master work-stealing architecture - Understand why per-thread queues with stealing outperform shared queues at scale
- Implement the Chase-Lev deque - Build the lock-free data structure that powers industrial work-stealers
- Understand cache locality - Learn why LIFO execution preserves hot cache lines while FIFO stealing balances load
- Apply memory ordering correctly - Use acquire/release semantics to build correct concurrent algorithms
- Avoid false sharing - Structure data to prevent cache line ping-pong between cores
- Build type-erased task submission - Use std::packaged_task and templates for a generic submit() that returns futures
- Design graceful shutdown - Drain work queues and coordinate thread termination without deadlock
- Benchmark parallel performance - Measure throughput, latency, and scaling efficiency
Theoretical Foundation
Core Concepts
The Problem: Contention in Shared Queues
Traditional thread pools use a single shared task queue:
┌─────────────────────┐
│ Global Queue │
│ [T1][T2][T3][T4]... │
└──────────┬──────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│Worker 0 │ │Worker 1 │ │Worker 2 │
└─────────┘ └─────────┘ └─────────┘
│ │ │
└──────────────────┴──────────────────┘
│
CONTENTION!
(All workers fight for lock)
The problem: Every worker must acquire a lock to get or submit work. With 8+ cores, threads spend more time contending than computing. This is called a lock convoy - workers serialize behind a single lock (see the hot-path sketch after the numbers below).
Measured impact:
- 2 threads: ~10% overhead from contention
- 8 threads: ~40% overhead from contention
- 16+ threads: Often slower than single-threaded execution
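For contrast, the hot path of a naive shared-queue pool looks like the sketch below; every worker funnels through the same mutex just to fetch a task (the names here are illustrative, not part of the project API):
// Shared-queue worker hot path: one mutex gates every task fetch.
std::function<void()> next_task;
{
    std::unique_lock<std::mutex> lock(queue_mutex);      // all workers contend here
    queue_cv.wait(lock, [&] { return stop || !tasks.empty(); });
    if (stop && tasks.empty()) return;
    next_task = std::move(tasks.front());
    tasks.pop();
}
next_task();                                             // useful work begins only now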
The Solution: Work-Stealing Architecture
Each worker maintains its own deque (double-ended queue):
┌─────────────────────────────────────────────────────────────────────┐
│ Work-Stealing Thread Pool │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Worker 0 │ │ Worker 1 │ │ Worker 2 │ │
│ │ │ │ │ │ │ │
│ │ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │ │
│ │ │ Deque 0 │ │ │ │ Deque 1 │ │ │ │ Deque 2 │ │ │
│ │ │[T1][T2] │ │ │ │ [T5] │ │ │ │ [] │ │ │
│ │ └────┬────┘ │ │ └────┬────┘ │ │ └────┬────┘ │ │
│ │ │ │ │ │ │ │ │ │ │
│ │ BOTTOM │ │ BOTTOM │ │ BOTTOM │ │
│ │ (private) │ │ (private) │ │ (private) │ │
│ └──────┼──────┘ └──────┼──────┘ └──────┼──────┘ │
│ │ │ │ │
│ │ ┌─────────────┴──────────────────┤ │
│ │ │ │ │
│ └────┼────────────────────────────────┘ │
│ │ │
│ ◄────┴────► STEALING HAPPENS HERE ◄────► │
│ (from TOP of victim's deque) │
│ │
└─────────────────────────────────────────────────────────────────────┘
Key insight:
- Owner pushes/pops from BOTTOM (LIFO - cache-hot tasks)
- Thieves steal from TOP (FIFO - oldest, likely largest tasks)
- Owner operations: nearly lock-free (fast path has no contention)
- Stealing: rare, and only contends with other thieves
Why this works:
- Owner operations are fast: Push/pop from the private end rarely contend
- Cache locality preserved: LIFO execution keeps recently touched data in cache
- Natural load balancing: Idle workers steal from busy ones
- Stealing takes large tasks: FIFO stealing grabs the oldest (often largest) tasks, maximizing work transfer per steal
Cache Locality and Why It Matters
When a task spawns subtasks, those subtasks reference the same data:
parallel_quicksort(array, 0, 1000000)
|
+-- spawns: parallel_quicksort(array, 0, 500000) // Left half
|
+-- spawns: parallel_quicksort(array, 500001, 1000000) // Right half
The left subtask was just pushed. Executing it immediately (LIFO) means:
- Array's left portion is still in L1/L2 cache
- Cache miss rate: LOW
If we used FIFO (like a regular queue):
- By the time we get to left subtask, array data is evicted
- Cache miss rate: HIGH
Benchmark evidence: LIFO execution can be 2-3x faster than FIFO for recursive algorithms due to cache effects alone.
Contention Reduction Metrics
| Architecture | Lock Operations/Task | Cache Coherence Traffic |
|---|---|---|
| Single shared queue | 2 (push + pop) | High (all cores invalidate) |
| Per-thread + stealing | 0.05 (rare steals) | Low (mostly local access) |
Work-stealing achieves near-zero contention because:
- 95%+ of operations are local (no lock, no coherence traffic)
- Stealing only happens when workers run out of local work
- When load is balanced, almost no stealing occurs
Why This Matters
Work-stealing is the architecture behind:
- Intel TBB (Threading Building Blocks): Powers parallel_for and parallel_reduce in C++
- Rust's Rayon: The par_iter() you use for parallel iteration
- Go's runtime: Goroutine scheduling uses work-stealing across P's
- Java’s ForkJoinPool: Standard library work-stealing executor
- Tokio (Rust async runtime): Work-stealing for async tasks
Understanding this architecture lets you:
- Use these libraries effectively (know when they’ll perform well)
- Debug scaling problems in parallel code
- Build custom schedulers for specialized workloads
- Reason about performance of concurrent systems
Historical Context
1999: Blumofe and Leiserson, the team behind Cilk at MIT, proved that work-stealing is near-optimal for fork-join parallelism: the expected execution time on P processors is at most T_1/P + O(T_inf), where T_1 is the total sequential work and T_inf is the critical-path length.
2005: The Chase-Lev paper introduced a practical lock-free deque that became the standard implementation.
2010s: TBB, Rayon, Go, and Tokio all adopted work-stealing as their core scheduling strategy.
Today: Work-stealing is considered the gold standard for fork-join and task-parallel scheduling.
Common Misconceptions
Misconception 1: “Work-stealing is just a thread pool with better load balancing” Reality: It’s a fundamentally different architecture. The per-thread deques, the LIFO/FIFO split, and the stealing protocol create qualitatively different behavior.
Misconception 2: “Stealing is expensive because of contention” Reality: When workloads are balanced, almost no stealing occurs. When imbalanced, stealing is rare per-task but transfers significant work per operation.
Misconception 3: “You need lock-free data structures” Reality: You can start with mutex-protected deques. The lock-free Chase-Lev deque improves performance, but a mutex-based version is correct and a fine starting point (a sketch follows these misconceptions).
Misconception 4: “More stealing = better load balance” Reality: Frequent stealing indicates workload imbalance. Optimal case: zero steals (perfectly distributed work).
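To make Misconception 3 concrete, a minimal mutex-protected deque that can stand in for Chase-Lev during early development might look like this (a sketch, with the same push/pop/steal interface as the lock-free version described later):
#include <deque>
#include <mutex>

template<typename T>
class LockedDeque {
    std::deque<T*> items_;
    mutable std::mutex mutex_;
public:
    void push(T* task) {                      // owner: push at the bottom
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(task);
    }
    T* pop() {                                // owner: pop from the bottom (LIFO)
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return nullptr;
        T* task = items_.back();
        items_.pop_back();
        return task;
    }
    T* steal() {                              // thief: take from the top (FIFO)
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return nullptr;
        T* task = items_.front();
        items_.pop_front();
        return task;
    }
};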
Project Specification
What You Will Build
A production-quality work-stealing thread pool library providing:
class WorkStealingPool {
public:
// Create pool with specified number of workers (default: hardware_concurrency)
explicit WorkStealingPool(size_t num_workers = 0);
// Destructor drains queues and joins threads
~WorkStealingPool();
// Submit a task and get a future for its result
template<typename F, typename... Args>
auto submit(F&& func, Args&&... args)
-> std::future<std::invoke_result_t<F, Args...>>;
// Spawn a fire-and-forget task (no result needed)
template<typename F>
void spawn(F&& func);
// Wait for all currently submitted tasks to complete
void wait_all();
// Query pool state
size_t num_workers() const;
size_t pending_tasks() const;
// Statistics
struct Stats {
uint64_t tasks_executed;
uint64_t tasks_stolen;
uint64_t steal_attempts;
uint64_t successful_steals;
};
Stats stats() const;
};
Deliverables
- Header-only library: work_stealing_pool.hpp (or split header/source)
- Chase-Lev deque implementation: work_stealing_deque.hpp
- Demo applications:
  - Parallel Mandelbrot renderer
  - Parallel merge sort
  - Parallel tree traversal
- Benchmark suite: Scaling tests, comparison with std::async and a simple thread pool
- Documentation: API reference, design rationale, performance analysis
Example Output
$ ./thread_pool_benchmark --threads 8
================================================================================
WORK-STEALING THREAD POOL BENCHMARK
================================================================================
Configuration:
Workers: 8 (matching hardware_concurrency)
Queue type: Lock-free Chase-Lev deque with work-stealing
CPU: Apple M1 Pro (8 cores)
--------------------------------------------------------------------------------
[BENCHMARK 1] Micro-tasks: 100,000 tasks (1us each)
--------------------------------------------------------------------------------
Submission time: 12.3 ms (8.1M tasks/sec submission rate)
Completion time: 145 ms
Throughput: 689,655 tasks/second
Work distribution:
Worker 0: 12,847 tasks (12.8%)
Worker 1: 12,523 tasks (12.5%)
Worker 2: 12,891 tasks (12.9%)
Worker 3: 12,445 tasks (12.4%)
Worker 4: 12,789 tasks (12.8%)
Worker 5: 12,612 tasks (12.6%)
Worker 6: 12,467 tasks (12.5%)
Worker 7: 11,426 tasks (11.4%)
Stealing statistics:
Total steal attempts: 1,247
Successful steals: 892 (71.5% success rate)
Work items stolen: 892
Comparison:
vs std::async: 2.3x faster
vs simple thread pool: 1.8x faster
--------------------------------------------------------------------------------
[BENCHMARK 2] Fork-join: Parallel quicksort of 10M integers
--------------------------------------------------------------------------------
Sequential time: 890 ms
Parallel time: 118 ms
Speedup: 7.54x (theoretical max: 8x)
Efficiency: 94.3%
Task tree depth: 20 levels
Total subtasks: 131,071
Stealing:
Steals performed: 2,847
Work balance: Excellent (max worker deviation: 3.2%)
--------------------------------------------------------------------------------
[BENCHMARK 3] Heterogeneous workload (mixed 1us, 100us, 10ms tasks)
--------------------------------------------------------------------------------
Task mix: 80% micro (1us), 15% medium (100us), 5% large (10ms)
Total tasks: 10,000
Completion time: 524 ms
Worker utilization:
Average active time: 96.2%
Max idle gap: 1.2 ms
Stealing adapts to workload:
- Micro tasks: minimal stealing (local execution)
- Large tasks: stolen efficiently to balance
vs Round-robin distribution: 1.6x faster
--------------------------------------------------------------------------------
[BENCHMARK 4] Stress test: 1M tasks from 8 external threads
--------------------------------------------------------------------------------
Concurrent submitters: 8 threads
Tasks per submitter: 125,000
All tasks completed: 1.2 seconds
No races detected: (TSan clean)
No deadlocks: (clean shutdown)
================================================================================
SUMMARY
================================================================================
Scaling efficiency at 8 cores: 94%+
Task spawn overhead: ~80 ns
Work-stealing latency: ~200 ns
Memory overhead per task: ~64 bytes
Recommendation: This pool is suitable for production use.
================================================================================
Solution Architecture
High-Level Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ WorkStealingPool │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Public API Layer │ │
│ │ submit<F,Args>() spawn<F>() wait_all() stats() │ │
│ └────────────────────────────────┬────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────────▼────────────────────────────────────┐ │
│ │ Task Distribution Layer │ │
│ │ │ │
│ │ External submit → Random worker selection │ │
│ │ Internal spawn → Push to current worker's deque │ │
│ │ │ │
│ └────────────────────────────────┬────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────────▼────────────────────────────────────┐ │
│ │ Worker Threads │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Worker Array │ │ │
│ │ │ │ │ │
│ │ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │ │
│ │ │ │Worker 0 │ │Worker 1 │ │Worker 2 │ │Worker 3 │ ... │ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ │ │ thread │ │ thread │ │ thread │ │ thread │ │ │ │
│ │ │ │ deque │ │ deque │ │ deque │ │ deque │ │ │ │
│ │ │ │ stats │ │ stats │ │ stats │ │ stats │ │ │ │
│ │ │ │ parking │ │ parking │ │ parking │ │ parking │ │ │ │
│ │ │ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ └────────────┴────────────┴────────────┘ │ │ │
│ │ │ │ │ │ │
│ │ │ STEALING │ │ │
│ │ │ (idle workers steal from busy ones) │ │ │
│ │ │ │ │ │
│ │ └────────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Coordination Layer │ │
│ │ shutdown_flag pending_count park/unpark mechanisms │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Components
1. Work-Stealing Deque (Chase-Lev)
template<typename T>
class WSDeque {
// Circular buffer with grow-on-full semantics
struct Buffer {
std::atomic<T*>* tasks;
size_t capacity;
size_t mask; // capacity - 1, for modular indexing
Buffer(size_t cap);
T* get(int64_t index);
void put(int64_t index, T* task);
Buffer* grow(int64_t bottom, int64_t top);
};
// Indices:
// bottom: owned exclusively by this worker (push/pop)
// top: shared with thieves (steal)
alignas(64) std::atomic<int64_t> bottom{0};
alignas(64) std::atomic<int64_t> top{0};
alignas(64) std::atomic<Buffer*> buffer;
public:
// Owner operations (single-threaded, fast)
void push(T* task); // O(1), may grow buffer
T* pop(); // O(1), returns nullptr if empty
// Thief operation (concurrent, lock-free)
T* steal(); // O(1), returns nullptr if empty or lost race
// Queries
bool empty() const;
size_t size() const;
};
Deque invariant: 0 <= top <= bottom <= top + capacity
Visual representation:
Circular buffer (capacity = 8):
indices: 0 1 2 3 4 5 6 7
┌───┬───┬───┬───┬───┬───┬───┬───┐
buffer: │ │ │ T1│ T2│ T3│ T4│ │ │
└───┴───┴───┴───┴───┴───┴───┴───┘
▲ ▲
│ │
top=2 bottom=6
Tasks in deque: T1, T2, T3, T4 (4 tasks)
Owner pushes at bottom:
buffer[6] = T5; bottom = 7;
Owner pops from bottom-1:
bottom = 5; return buffer[5]; // returns T4
Thief steals from top:
if (CAS(top, 2, 3)) return buffer[2]; // returns T1
2. Worker State (Cache-Line Aligned)
struct alignas(64) WorkerState {
// Hot data: accessed every task
WSDeque<Task> deque;
std::thread thread;
size_t id;
// Parking mechanism (for idle workers)
std::mutex park_mutex;
std::condition_variable park_cv;
std::atomic<bool> parked{false};
// Cold data: statistics (rarely accessed)
struct alignas(64) Statistics {
std::atomic<uint64_t> tasks_executed{0};
std::atomic<uint64_t> tasks_stolen{0};
std::atomic<uint64_t> steal_attempts{0};
} stats;
// Padding to prevent false sharing with next worker
char padding[64 - sizeof(Statistics) % 64];
};
Why alignas(64)? Cache lines are typically 64 bytes. Without alignment:
- Worker 0’s stats and Worker 1’s deque could share a cache line
- When Worker 1 modifies its deque, Worker 0’s stats are invalidated
- This “false sharing” causes cache misses even though data is independent
3. Task Representation
// Type-erased task wrapper
class Task {
public:
virtual void execute() = 0;
virtual ~Task() = default;
};
// Concrete task holding any callable
template<typename Func>
class FunctionTask : public Task {
Func func;
public:
explicit FunctionTask(Func&& f) : func(std::forward<Func>(f)) {}
void execute() override { func(); }
};
// Task with result (for submit())
template<typename Func>
class PackagedTask : public Task {
std::packaged_task<std::invoke_result_t<Func>()> task;
public:
explicit PackagedTask(Func&& f) : task(std::forward<Func>(f)) {}
void execute() override { task(); }
auto get_future() { return task.get_future(); }
};
Data Structures Summary
| Structure | Purpose | Thread Safety |
|---|---|---|
| WSDeque<Task*> | Per-worker task storage | Lock-free (owner + thieves) |
| WorkerState | Worker thread context | Owned by one thread |
| Task (type-erased) | Holds callable + result promise | Immutable after creation |
| std::atomic<bool> shutdown | Termination signal | Relaxed atomic |
| std::atomic<size_t> pending | Outstanding task count | Seq-cst atomic |
Implementation Guide
Phase 1: Basic Thread Pool with Shared Queue (Days 1-2)
Goal: Get a working pool first, optimize later.
Steps:
- Create shared task queue:
std::queue<std::function<void()>> queue_;
std::mutex queue_mutex_;
std::condition_variable queue_cv_;
- Implement worker loop:
void worker_loop() {
    while (true) {
        std::function<void()> task;
        {
            std::unique_lock lock(queue_mutex_);
            queue_cv_.wait(lock, [&] { return shutdown_ || !queue_.empty(); });
            if (shutdown_ && queue_.empty()) return;   // drain remaining work, then exit
            task = std::move(queue_.front());
            queue_.pop();
        }
        task();                                        // execute outside the lock
    }
}
- Implement submit():
template<typename F, typename... Args>
auto submit(F&& f, Args&&... args) {
    using ReturnType = std::invoke_result_t<F, Args...>;
    auto task = std::make_shared<std::packaged_task<ReturnType()>>(
        std::bind(std::forward<F>(f), std::forward<Args>(args)...));
    auto future = task->get_future();
    {
        std::lock_guard lock(queue_mutex_);
        queue_.push([task]() { (*task)(); });
    }
    queue_cv_.notify_one();
    return future;
}
Checkpoint: 1000 tasks complete correctly. Measure baseline throughput.
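To exercise the checkpoint, a minimal harness might look like this (the class name ThreadPool for the Phase 1 pool is an assumption; adapt it to whatever you called yours):
#include <atomic>
#include <cassert>
#include <future>
#include <vector>

int main() {
    ThreadPool pool(4);                        // Phase 1 pool with a shared queue
    std::atomic<int> counter{0};
    std::vector<std::future<void>> futures;
    for (int i = 0; i < 1000; ++i) {
        futures.push_back(pool.submit([&counter] { counter.fetch_add(1); }));
    }
    for (auto& f : futures) f.get();           // block until every task has run
    assert(counter.load() == 1000);            // all 1000 tasks completed exactly once
    return 0;
}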
Phase 2: Lock-Free Deque (Days 3-5)
Goal: Implement the Chase-Lev work-stealing deque.
Key algorithm for push (owner only):
void push(T* task) {
int64_t b = bottom_.load(std::memory_order_relaxed);
int64_t t = top_.load(std::memory_order_acquire);
Buffer* buf = buffer_.load(std::memory_order_relaxed);
if (b - t >= static_cast<int64_t>(buf->capacity)) {
// Buffer full, grow it
buf = buf->grow(b, t);
buffer_.store(buf, std::memory_order_release);
}
buf->put(b, task);
std::atomic_thread_fence(std::memory_order_release);
bottom_.store(b + 1, std::memory_order_relaxed);
}
Key algorithm for pop (owner only):
T* pop() {
int64_t b = bottom_.load(std::memory_order_relaxed) - 1;
Buffer* buf = buffer_.load(std::memory_order_relaxed);
bottom_.store(b, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_seq_cst);
int64_t t = top_.load(std::memory_order_relaxed);
if (t <= b) {
// Non-empty
T* task = buf->get(b);
if (t == b) {
// Last element, race with steal
if (!top_.compare_exchange_strong(
t, t + 1,
std::memory_order_seq_cst,
std::memory_order_relaxed)) {
// Lost race
bottom_.store(t + 1, std::memory_order_relaxed);
return nullptr;
}
bottom_.store(t + 1, std::memory_order_relaxed);
}
return task;
} else {
// Empty
bottom_.store(t, std::memory_order_relaxed);
return nullptr;
}
}
Key algorithm for steal (thieves):
T* steal() {
int64_t t = top_.load(std::memory_order_acquire);
std::atomic_thread_fence(std::memory_order_seq_cst);
int64_t b = bottom_.load(std::memory_order_acquire);
if (t < b) {
// Non-empty
Buffer* buf = buffer_.load(std::memory_order_relaxed);
T* task = buf->get(t);
if (!top_.compare_exchange_strong(
t, t + 1,
std::memory_order_seq_cst,
std::memory_order_relaxed)) {
// Lost race with another thief or pop
return nullptr;
}
return task;
}
return nullptr;
}
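push() calls Buffer::grow() when the ring is full; a minimal sketch is below. It deliberately does not free the old buffer, because a concurrent thief may still be reading from it; proper reclamation (epochs, hazard pointers) is discussed in the Chase-Lev and Le et al. papers.
// Sketch of WSDeque<T>::Buffer::grow — copy the live range into a larger ring.
Buffer* grow(int64_t bottom, int64_t top) {
    Buffer* bigger = new Buffer(capacity * 2);
    for (int64_t i = top; i < bottom; ++i) {
        bigger->put(i, get(i));               // put/get apply the index mask internally
    }
    // The old buffer is not freed here: a thief may still hold a pointer into it.
    // Retire it later via an epoch/hazard-pointer scheme, or accept the small leak.
    return bigger;                            // caller publishes it with a release store
}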
Checkpoint: Single-threaded tests pass. Multi-threaded stress test with TSan reports no races.
Phase 3: Work-Stealing Integration (Days 6-8)
Goal: Each worker has own deque, idle workers steal.
Worker loop with stealing:
void worker_loop(size_t worker_id) {
// Thread-local worker ID for spawn() to find current deque
thread_local_worker_id = worker_id;
while (!shutdown_.load(std::memory_order_relaxed)) {
Task* task = nullptr;
// 1. Try local deque first (hot path)
task = workers_[worker_id].deque.pop();
// 2. If empty, try stealing
if (!task) {
task = try_steal(worker_id);
}
if (task) {
task->execute();
delete task;
workers_[worker_id].stats.tasks_executed.fetch_add(
1, std::memory_order_relaxed);
} else {
// 3. No work anywhere, park thread
park_worker(worker_id);
}
}
}
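park_worker() is referenced above but not shown; a minimal sketch using the per-worker mutex and condition variable from WorkerState is below. It re-checks the shutdown flag and uses a short timeout so a missed notification cannot strand the worker forever.
void park_worker(size_t worker_id) {
    WorkerState& w = workers_[worker_id];
    std::unique_lock<std::mutex> lock(w.park_mutex);
    w.parked.store(true, std::memory_order_release);
    // Sleep until unparked or shut down; the timeout bounds the cost of a lost wake-up.
    w.park_cv.wait_for(lock, std::chrono::milliseconds(1), [&] {
        return !w.parked.load(std::memory_order_acquire) ||
               shutdown_.load(std::memory_order_acquire);
    });
    w.parked.store(false, std::memory_order_release);
}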
Stealing strategy (random victim):
Task* try_steal(size_t my_id) {
// Start at random victim to avoid herding
size_t start = random_() % num_workers_;
for (size_t i = 0; i < num_workers_; ++i) {
size_t victim = (start + i) % num_workers_;
if (victim == my_id) continue;
workers_[my_id].stats.steal_attempts.fetch_add(
1, std::memory_order_relaxed);
Task* stolen = workers_[victim].deque.steal();
if (stolen) {
workers_[my_id].stats.tasks_stolen.fetch_add(
1, std::memory_order_relaxed);
return stolen;
}
}
return nullptr;
}
Task distribution:
template<typename F>
void spawn(F&& func) {
Task* task = new FunctionTask(std::forward<F>(func));
if (thread_local_worker_id >= 0) {
// Called from worker thread: push to local deque (fast path)
workers_[thread_local_worker_id].deque.push(task);
} else {
// Called from external thread: random worker
size_t target = random_() % num_workers_;
workers_[target].deque.push(task);
}
// Wake one parked worker if any
unpark_one();
}
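The fast path in spawn() and the assignment at the top of worker_loop() rely on a thread-local worker index; a one-line sketch of its declaration (the -1 sentinel marks external, non-worker threads):
// -1 means "not a pool worker"; worker_loop() overwrites it with the real id at startup.
static thread_local int64_t thread_local_worker_id = -1;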
Checkpoint: Mandelbrot renders correctly. Work is distributed across workers.
Phase 4: Cache Optimization (Days 9-10)
Goal: Eliminate false sharing, optimize memory layout.
1. Ensure cache-line alignment:
struct alignas(64) WorkerState {
// Put hot data (deque) first
WSDeque<Task*> deque;
// Cold data (statistics) on separate cache line
struct alignas(64) Stats { ... } stats;
// Padding to ensure next WorkerState starts on new cache line
char padding[...];
};
2. Avoid false sharing in buffer:
// BAD: Adjacent task pointers share cache line
T* buffer[N];
// GOOD: Pad each slot to cache line (only if tasks are large)
struct alignas(64) Slot { T* task; };
Slot buffer[N];
3. Profile with perf:
$ perf stat -e cache-misses,cache-references ./benchmark
# Before: 15% cache miss rate
# After: 3% cache miss rate
Checkpoint: Scaling improves to 7x+ speedup on 8 cores.
Phase 5: API Completion (Days 11-14)
Goal: Complete public API with futures, wait_all, shutdown.
submit() with future:
template<typename F, typename... Args>
auto submit(F&& f, Args&&... args)
-> std::future<std::invoke_result_t<F, Args...>>
{
using ReturnType = std::invoke_result_t<F, Args...>;
// Bind arguments
auto bound = std::bind(std::forward<F>(f),
std::forward<Args>(args)...);
// Wrap in packaged_task for future
auto packaged = std::make_unique<std::packaged_task<ReturnType()>>(
std::move(bound));
auto future = packaged->get_future();
// Create type-erased task
Task* task = new PackagedTaskWrapper<ReturnType>(std::move(packaged));
pending_count_.fetch_add(1, std::memory_order_seq_cst);
// Submit to appropriate deque
distribute_task(task);
return future;
}
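PackagedTaskWrapper is the type-erased bridge between the packaged_task and the pool's Task interface; it is referenced above but not defined, so here is a minimal sketch consistent with the Task base class from the architecture section:
template<typename ReturnType>
class PackagedTaskWrapper : public Task {
    std::unique_ptr<std::packaged_task<ReturnType()>> task_;
public:
    explicit PackagedTaskWrapper(std::unique_ptr<std::packaged_task<ReturnType()>> task)
        : task_(std::move(task)) {}
    void execute() override { (*task_)(); }   // runs the callable and fulfills the future
};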
wait_all():
void wait_all() {
// Participate in work-stealing while waiting
while (pending_count_.load(std::memory_order_seq_cst) > 0) {
// Try to help complete work
Task* task = try_steal_any();
if (task) {
task->execute();
delete task;
pending_count_.fetch_sub(1, std::memory_order_seq_cst);
} else {
std::this_thread::yield();
}
}
}
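For pending_count_ to ever reach zero, the worker loop must decrement it after finishing a counted task; a sketch of that accounting, under the assumption that every task increments the counter when it is distributed (spawn() would need the same fetch_add as submit()):
// In worker_loop(), after a task has been obtained and run:
if (task) {
    task->execute();
    delete task;
    pending_count_.fetch_sub(1, std::memory_order_seq_cst);   // pairs with wait_all()
}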
Graceful shutdown:
~WorkStealingPool() {
// Signal shutdown
shutdown_.store(true, std::memory_order_release);
// Wake all parked workers
for (auto& worker : workers_) {
unpark_worker(worker);
}
// Join all threads
for (auto& worker : workers_) {
if (worker.thread.joinable()) {
worker.thread.join();
}
}
// Clean up any remaining tasks
for (auto& worker : workers_) {
while (Task* task = worker.deque.pop()) {
delete task;
}
}
}
Testing Strategy
Unit Tests
TEST(WSDeque, SingleThreadedPushPop) {
WSDeque<int*> deque;
int values[100];
for (int i = 0; i < 100; ++i) {
deque.push(&values[i]);
}
EXPECT_EQ(deque.size(), 100);
// LIFO order
for (int i = 99; i >= 0; --i) {
EXPECT_EQ(deque.pop(), &values[i]);
}
EXPECT_TRUE(deque.empty());
}
TEST(WSDeque, ConcurrentSteal) {
WSDeque<int*> deque;
int values[10000];
std::atomic<int> stolen_count{0};
// Owner pushes
for (int i = 0; i < 10000; ++i) {
deque.push(&values[i]);
}
// Multiple thieves steal concurrently
std::vector<std::thread> thieves;
for (int t = 0; t < 4; ++t) {
thieves.emplace_back([&] {
    // steal() can return nullptr on a lost race even when the deque is not
    // empty, so keep trying until every item has been accounted for.
    while (stolen_count.load() < 10000) {
        if (deque.steal() != nullptr) {
            stolen_count.fetch_add(1);
        }
    }
});
}
for (auto& t : thieves) t.join();
EXPECT_EQ(stolen_count.load(), 10000);
}
TEST(WorkStealingPool, BasicSubmit) {
WorkStealingPool pool(4);
std::atomic<int> counter{0};
std::vector<std::future<void>> futures;
for (int i = 0; i < 1000; ++i) {
futures.push_back(pool.submit([&counter] {
counter.fetch_add(1);
}));
}
for (auto& f : futures) f.wait();
EXPECT_EQ(counter.load(), 1000);
}
TEST(WorkStealingPool, ResultPropagation) {
WorkStealingPool pool(4);
auto future = pool.submit([](int a, int b) { return a + b; }, 20, 22);
EXPECT_EQ(future.get(), 42);
}
TEST(WorkStealingPool, WorkStealing) {
WorkStealingPool pool(4);
std::array<std::atomic<int>, 4> per_thread_count{};
// Submit many tasks to worker 0
for (int i = 0; i < 10000; ++i) {
pool.submit([&per_thread_count] {
size_t worker_id = get_current_worker_id();
per_thread_count[worker_id].fetch_add(1);
// Simulate work
std::this_thread::sleep_for(std::chrono::microseconds(10));
});
}
pool.wait_all();
// Verify work was distributed (stolen)
for (auto& count : per_thread_count) {
EXPECT_GT(count.load(), 0); // Each worker did some work
}
}
Stress Tests
TEST(WorkStealingPool, StressTest) {
WorkStealingPool pool(8);
std::atomic<int> completed{0};
constexpr int NUM_TASKS = 1'000'000;
// Hammer from multiple external threads
std::vector<std::thread> submitters;
for (int t = 0; t < 8; ++t) {
submitters.emplace_back([&pool, &completed] {
for (int i = 0; i < NUM_TASKS / 8; ++i) {
pool.spawn([&completed] {
completed.fetch_add(1);
});
}
});
}
for (auto& t : submitters) t.join();
pool.wait_all();
EXPECT_EQ(completed.load(), NUM_TASKS);
}
TEST(WorkStealingPool, ForkJoinRecursive) {
WorkStealingPool pool(8);
std::function<int64_t(int)> fib = [&pool, &fib](int n) -> int64_t {
if (n < 2) return n;
if (n < 20) {
// Below threshold, compute sequentially
return fib(n - 1) + fib(n - 2);
}
// Fork
auto f1 = pool.submit([&fib, n] { return fib(n - 1); });
auto f2 = pool.submit([&fib, n] { return fib(n - 2); });
// Join
return f1.get() + f2.get();
};
EXPECT_EQ(fib(35), 9227465);
}
ThreadSanitizer
$ clang++ -std=c++17 -fsanitize=thread -g -O1 -o test_tsan tests/*.cpp
$ ./test_tsan
# MUST report: NO data races detected
# Common issues TSan catches:
# - Missing memory order in atomics
# - Data race on task execution
# - Use-after-free on task deletion
Benchmarking
void benchmark_scaling() {
std::vector<size_t> thread_counts = {1, 2, 4, 8, 16};
for (size_t threads : thread_counts) {
WorkStealingPool pool(threads);
auto start = std::chrono::high_resolution_clock::now();
// Parallel quicksort of 10M integers
std::vector<int> data(10'000'000);
std::iota(data.begin(), data.end(), 0);
std::shuffle(data.begin(), data.end(), std::mt19937{42});
parallel_quicksort(pool, data.data(), 0, data.size());
auto end = std::chrono::high_resolution_clock::now();
auto ms = std::chrono::duration<double, std::milli>(end - start).count();
std::cout << threads << " threads: " << ms << " ms\n";
}
}
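parallel_quicksort() is used above but not defined; a minimal fork-join sketch follows. The 10,000-element cutoff is an arbitrary assumption, and the blocking left.get() mirrors the fork-join fib test above; a production version would have a blocked worker execute pending tasks while it waits rather than sleeping inside get().
#include <algorithm>

void parallel_quicksort(WorkStealingPool& pool, int* data, size_t lo, size_t hi) {
    if (hi - lo < 10'000) {                   // small ranges: sort sequentially
        std::sort(data + lo, data + hi);
        return;
    }
    size_t mid = lo + (hi - lo) / 2;
    // Partition around the median element so both halves can be sorted independently.
    std::nth_element(data + lo, data + mid, data + hi);
    auto left = pool.submit([&pool, data, lo, mid] {
        parallel_quicksort(pool, data, lo, mid);
    });
    parallel_quicksort(pool, data, mid, hi);  // recurse on the right half locally
    left.get();                               // join the forked left half
}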
Common Pitfalls & Debugging
Pitfall 1: ABA Problem in Lock-Free Deque
Symptom: Occasional corruption, tasks executed twice, or crashes after many iterations.
Cause: Classic ABA scenario in CAS loop:
- Thread A reads top = 5, gets task pointer P
- Thread B steals successfully, top becomes 6
- More operations wrap around, top becomes 5 again, but with a different task Q at that slot
- Thread A's CAS succeeds (5 == 5), returns stale pointer P
Solutions:
// Solution 1: Use 64-bit indices that never wrap (recommended)
// With 64-bit indices, even at one operation per nanosecond it takes ~584 years to wrap
std::atomic<int64_t> top, bottom; // int64_t, not int32_t
// Solution 2: Tagged pointer with version counter
struct TaggedTop {
int64_t index;
int64_t version;
};
std::atomic<TaggedTop> top;
// Solution 3: Hazard pointers (complex, rarely needed for deque)
Pitfall 2: Starvation Under Heavy Stealing
Symptom: Some workers consistently have zero work while others are overloaded.
Cause: Round-robin stealing always starts from worker 0, causing “hot spot”.
Solution: Random victim selection:
// BAD: Always starts from 0
for (size_t victim = 0; victim < num_workers_; ++victim) {
if (victim != my_id) try_steal(victim);
}
// GOOD: Random start
size_t start = rng() % num_workers_;
for (size_t i = 0; i < num_workers_; ++i) {
size_t victim = (start + i) % num_workers_;
if (victim != my_id) try_steal(victim);
}
Pitfall 3: False Sharing Between Workers
Symptom: Scaling stops at 4x even with 8 cores; perf shows high cache miss rate.
Cause: Worker state structures are adjacent in memory, sharing cache lines.
Debugging:
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./benchmark
# Before fix: 15% miss rate
# After fix: 3% miss rate
$ perf c2c record ./benchmark
$ perf c2c report
# Shows "false sharing" hot spots
Solution: Cache-line alignment:
// BAD: Workers share cache lines
struct WorkerState {
std::atomic<int> counter; // 4 bytes
WSDeque<Task*> deque; // starts immediately after
};
WorkerState workers[8]; // Adjacent, false sharing!
// GOOD: Each worker on own cache line(s)
struct alignas(64) WorkerState {
std::atomic<int> counter;
WSDeque<Task*> deque;
char padding[64 - ...]; // Ensure isolation
};
Pitfall 4: Deadlock on Shutdown
Symptom: Program hangs on exit; workers never join.
Cause: Workers are parked (waiting on condition variable) when shutdown is requested.
Solution: Wake all workers after setting shutdown flag:
~WorkStealingPool() {
shutdown_.store(true, std::memory_order_release);
// CRITICAL: Wake ALL parked workers
for (auto& worker : workers_) {
{
std::lock_guard lock(worker.park_mutex);
worker.parked = false;
}
worker.park_cv.notify_all(); // notify_all, not notify_one!
}
for (auto& worker : workers_) {
if (worker.thread.joinable()) {
worker.thread.join();
}
}
}
Pitfall 5: Memory Order Too Weak
Symptom: Works in debug mode, fails in release; works on x86, fails on ARM.
Cause: ARM has weaker memory model than x86. Code relying on x86’s strong ordering breaks.
Debugging:
# Run with ThreadSanitizer
$ clang++ -fsanitize=thread -g -O1 ...
# TSan will report missing synchronization
WARNING: ThreadSanitizer: data race
Write of size 8 at ... by thread T1:
Previous read of size 8 at ... by thread T2:
Solution: Start with memory_order_seq_cst everywhere, then carefully weaken:
// 1. Get it working with seq_cst
bottom_.store(b + 1, std::memory_order_seq_cst);
// 2. Analyze which orderings are actually needed
// 3. Weaken to minimum required (acquire/release usually sufficient)
bottom_.store(b + 1, std::memory_order_release);
// 4. Verify with TSan on both x86 and ARM (or use QEMU)
Pitfall 6: Thundering Herd on Task Submission
Symptom: Latency spike when submitting tasks after idle period.
Cause: All parked workers wake simultaneously, contend on stealing.
Solution: Wake workers incrementally:
void unpark_one() {
for (auto& worker : workers_) {
if (worker.parked.load(std::memory_order_acquire)) {
{
std::lock_guard lock(worker.park_mutex);
worker.parked = false;
}
worker.park_cv.notify_one();
return; // Only wake ONE worker
}
}
}
Extensions & Challenges
Extension 1: Task Priorities
Support priority levels for scheduling:
enum class Priority { Low, Normal, High, Critical };
pool.submit(Priority::High, []{ /* important work */ });
// Implementation: Multiple deques per worker, check high-priority first
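One simple realization of the multi-deque idea, sketched for the local pop path (field names are assumptions; the steal path would check victims' queues in the same order):
// Per worker: one deque per priority level, drained from highest to lowest.
Task* pop_local(WorkerState& w) {
    if (Task* t = w.critical_deque.pop()) return t;
    if (Task* t = w.high_deque.pop())     return t;
    if (Task* t = w.normal_deque.pop())   return t;
    return w.low_deque.pop();
}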
Extension 2: Task Dependencies (DAG Scheduling)
Support task graphs with dependencies:
auto a = pool.submit([]{ return compute_a(); });
auto b = pool.submit([]{ return compute_b(); });
auto c = pool.when_all(a, b).then([](int ra, int rb) {
return ra + rb;
});
Extension 3: NUMA Awareness
On multi-socket systems, prefer stealing from same NUMA node:
// Steal order: same-socket workers first, then cross-socket
std::vector<size_t> steal_order = get_numa_aware_order(worker_id);
for (size_t victim : steal_order) {
if (Task* t = try_steal(victim)) return t;
}
Extension 4: Continuations (Non-Blocking Then)
Add continuation support without blocking on get():
pool.submit(compute_a)
.then([](int a) { return compute_b(a); })
.then([](int b) { log_result(b); });
Extension 5: Coroutine Integration (C++20)
Make pool usable with co_await:
Task<int> async_computation(WorkStealingPool& pool) {
int a = co_await pool.submit(compute_a);
int b = co_await pool.submit(compute_b);
co_return a + b;
}
Challenge: Implement Wait-Free Stealing
The Chase-Lev deque is lock-free but not wait-free (thieves may fail repeatedly). Implement bounded wait-free stealing using:
- Helping mechanism: thieves help each other complete
- Combining: batch multiple steal requests
This is research-level difficulty.
Resources
Essential Reading
- “C++ Concurrency in Action” Chapter 9 - Anthony Williams
- Thread pool implementation, task submission, futures
- “The Art of Multiprocessor Programming” Chapter 16 - Herlihy & Shavit
- Work-stealing algorithms, Chase-Lev deque correctness
- “Scheduling Multithreaded Computations by Work Stealing” - Blumofe & Leiserson (1999)
- Original paper proving optimality of work-stealing
- “Dynamic Circular Work-Stealing Deque” - Chase & Lev (2005)
- The lock-free deque algorithm you will implement
- “Correct and Efficient Work-Stealing for Weak Memory Models” - Le et al. (2013)
- Memory ordering correctness for Chase-Lev
Code to Study
| Project | Language | Notes |
|---|---|---|
| BS::thread_pool | C++ | Modern, header-only, excellent docs |
| Rayon | Rust | Production work-stealing, par_iter |
| crossbeam-deque | Rust | Clean Chase-Lev implementation |
| Go runtime | Go | Goroutine work-stealing |
| Intel TBB | C++ | Industrial strength, complex |
Tools
- ThreadSanitizer: Compile with -fsanitize=thread to detect races
- perf: Profile cache behavior, find false sharing
- Valgrind/Helgrind: Alternative race detector
- QEMU: Test on ARM to verify memory ordering
Self-Assessment Checklist
Understanding
- I can explain why work-stealing outperforms shared queues at scale
- I can describe the Chase-Lev deque invariants and why they ensure correctness
- I can explain acquire/release semantics and identify where they’re needed in the deque
- I can explain what false sharing is and how to detect/fix it
- I can describe the ABA problem and how 64-bit indices avoid it
- I understand why LIFO execution preserves cache locality
- I can explain when and how stealing happens in a balanced vs imbalanced workload
Implementation
- My deque passes single-threaded push/pop tests
- My deque passes concurrent stealing tests
- ThreadSanitizer reports no data races
- submit() correctly returns futures with results
- Graceful shutdown works (no hung threads, no leaked tasks)
- wait_all() correctly blocks until completion
Performance
- Pool achieves 6x+ speedup on 8 cores for parallel quicksort
- Task spawn overhead is < 200 ns
- Cache miss rate is < 5% under load
- CPU usage drops to near zero when pool is idle (parking works)
- Performance matches or beats std::async for compute-bound work
Verification
- Stress tested with 1M+ tasks
- Tested with concurrent external submitters
- Tested fork-join recursive patterns (parallel fib, quicksort)
- Tested on both x86 and ARM (or verified with TSan)
- Profiled with perf to verify no false sharing
Submission / Completion Criteria
Minimum Viable Completion
- Basic thread pool with work-stealing deques
- spawn() and submit() work correctly
- Tasks execute on worker threads
- ThreadSanitizer reports no races
- Clean shutdown without deadlock
Full Completion
- Lock-free Chase-Lev deque implementation
- submit() returns working futures
- wait_all() correctly synchronizes
- Cache-line aligned worker state
- Benchmark showing 7x+ speedup on 8 cores
- Statistics collection (tasks executed, steals)
Excellence (Going Above & Beyond)
- Comparison benchmark against std::async, simple pool
- Parallel algorithms (parallel_for, parallel_reduce)
- Task priorities
- Continuations (.then())
- NUMA-aware stealing
- Detailed performance analysis with perf
Work-stealing thread pools power the most important parallel computing systems. Building one teaches you concurrency at the deepest level – from cache coherence to memory ordering to load balancing. After this project, you will understand what happens inside Rayon, Go’s runtime, and Intel TBB.
This guide was expanded from LEARN_CPP_CONCURRENCY_AND_PARALLELISM.md. For the complete learning path, see the project index.