P02: Work-Stealing Thread Pool
Build a high-performance thread pool with work-stealing scheduling, like Rayon or Go's runtime.
| Attribute | Value |
|---|---|
| Main Language | C++ |
| Alternative Languages | Rust, Go, Java |
| Difficulty | Level 3: Advanced |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Knowledge Area | Thread Pool Design, Work Scheduling |
| Key Tools | pthreads, C++ atomics, perf |
| Main Book | "Rust Atomics and Locks" - Mara Bos |
Learning Objectives
By completing this project, you will:
- Master atomic operations: understand memory ordering (acquire, release, seq_cst) and when each is needed
- Implement lock-free data structures: build a Chase-Lev work-stealing deque
- Understand cache coherency: learn why false sharing destroys performance and how to avoid it
- Design efficient schedulers: balance work across threads without creating bottlenecks
- Profile concurrent code: use perf and ThreadSanitizer to find issues
- Build parallel algorithms: create parallel map, reduce, and sort using your pool
The Core Question
"How do systems like Rayon achieve near-linear speedup, and why doesn't adding more threads always make things faster?"
Adding threads seems simple: more workers = more throughput. But reality is harsh:
- Lock contention: Threads fighting over shared data waste cycles waiting
- Cache invalidation: One thread modifying data invalidates another's cache
- Load imbalance: Some threads finish early while others are overloaded
- Synchronization overhead: Coordinating threads costs CPU time
Work-stealing solves these elegantly: each thread has its own work queue, and idle threads "steal" from busy ones. No central bottleneck, natural load balancing.
After this project, you'll understand why Go's goroutine scheduler and Rust's Rayon use work-stealing, and you'll have built one yourself.
Deep Theoretical Foundation
1. Why Work-Stealing?
The Problem with Centralized Queues
Traditional thread pools use a single shared queue:
             +--------------+
             |  Task Queue  |
             | [T1][T2][T3] |
             +------+-------+
                    |
        +-----------+-----------+
        |           |           |
        v           v           v
  +----------+ +----------+ +----------+
  | Worker 1 | | Worker 2 | | Worker 3 |
  +----------+ +----------+ +----------+
Every worker must lock the queue to get work. With 8 workers hitting the same lock, contention dominates.
Work-Stealing Architecture
Each worker has its own deque (double-ended queue):
  +----------+       +----------+       +----------+
  | Worker 1 |       | Worker 2 |       | Worker 3 |
  +----+-----+       +----+-----+       +----+-----+
       |                  |                  |
  +----v-----+  steal +---v------+  steal +--v-------+
  | Deque 1  | <----- | Deque 2  | <----- | Deque 3  |
  | [T1][T2] |        |   [T5]   |        |    []    |
  +----+-----+        +----------+        +----------+
       |
       v  push/pop (owner only)
- The owner pushes/pops at the bottom (LIFO: good for cache locality)
- Thieves steal from the top (FIFO: they take the oldest, typically largest tasks)
- The only contention is during steals, and that's rare in balanced workloads
2. Memory Ordering and Atomics
The Chase-Lev deque requires careful use of atomic operations. Understanding memory ordering is essential.
What Are Atomics?
Atomic operations appear to happen instantaneously from other threads' perspectives. No tearing (partial reads/writes).
std::atomic<int> counter{0};
// Thread 1
counter.fetch_add(1); // Atomic increment
// Thread 2
int val = counter.load(); // Always sees complete value (0 or 1, never garbage)
Memory Ordering
The CPU and compiler reorder operations for performance. Memory ordering constrains this reordering.
| Ordering | Guarantee | Use Case |
|---|---|---|
| relaxed | Atomicity only, no ordering | Counters, statistics |
| acquire | All reads after this see all writes before a paired release | Lock acquisition |
| release | All writes before this are visible to a paired acquire | Lock release |
| acq_rel | Both acquire and release | Read-modify-write in critical sections |
| seq_cst | Total global order | Simple but expensive |
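To make the acquire/release pairing concrete, here is a minimal message-passing sketch: the producer publishes data with a release store, and a consumer that observes the flag via an acquire load is guaranteed to also see the data written before it.
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                    // plain write
    ready.store(true, std::memory_order_release); // publish: writes above become visible
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    assert(data == 42); // guaranteed by the acquire/release pairing
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}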
Example: Spinlock with Ordering
class Spinlock {
std::atomic<bool> locked{false};
public:
void lock() {
while (locked.exchange(true, std::memory_order_acquire)) {
// spin
}
// ACQUIRE: All reads below see all writes before unlock
}
void unlock() {
// RELEASE: All writes above are visible after next acquire
locked.store(false, std::memory_order_release);
}
};
3. The Chase-Lev Deque
The work-stealing deque is the heart of the system. It must be:
- Wait-free for owner (push/pop never block)
- Lock-free for thieves (steal may retry but always progresses)
- Cache-efficient (owner operations are local)
Deque Structure
struct Deque {
std::atomic<int64_t> top; // Thieves steal from here
std::atomic<int64_t> bottom; // Owner pushes/pops here
std::atomic<Task**> buffer; // Circular buffer of tasks
int64_t capacity;
};
Visual Model
Circular buffer (capacity = 8):
  index:   0    1    2    3    4    5    6    7
         +----+----+----+----+----+----+----+----+
         |    |    | T1 | T2 | T3 |    |    |    |
         +----+----+----+----+----+----+----+----+
                      ^              ^
                      |              |
                    top=2         bottom=5
Owner pushes at bottom, pops at bottom-1
Thieves steal from top
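In the snippets below, relaxed, acquire, release, and seq_cst abbreviate the corresponding std::memory_order_* constants.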
Push Operation (Owner Only)
void push(Task* task) {
int64_t b = bottom.load(relaxed);
int64_t t = top.load(acquire);
if (b - t >= capacity) {
grow(); // Resize buffer if full
b = bottom.load(relaxed);
}
buffer.load(relaxed)[b % capacity] = task;
std::atomic_thread_fence(release); // Ensure task visible before bottom update
bottom.store(b + 1, relaxed);
}
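push() calls grow() when the ring is full. The Chase-Lev paper uses a resizable circular array; below is a simplified, owner-only sketch. It assumes the buffer pointer and its capacity are published together as one object (like the Array struct shown later in the Solution Architecture) so thieves always see a consistent pair, and it deliberately leaks the old array because a thief may still be reading from it; real implementations reclaim it with an epoch- or hazard-pointer-based scheme.
struct RingArray {
    int64_t capacity;
    Task**  slots;
};
std::atomic<RingArray*> array; // replaces the separate buffer/capacity fields in this sketch

void grow() {
    int64_t b = bottom.load(relaxed);
    int64_t t = top.load(relaxed);
    RingArray* old_arr = array.load(relaxed);
    RingArray* new_arr = new RingArray{old_arr->capacity * 2,
                                       new Task*[old_arr->capacity * 2]};
    for (int64_t i = t; i < b; i++) {
        // Copy the live range [top, bottom) into the larger ring
        new_arr->slots[i % new_arr->capacity] = old_arr->slots[i % old_arr->capacity];
    }
    array.store(new_arr, release); // publish buffer and capacity as one unit
    // old_arr is intentionally leaked here; see note above
}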
Pop Operation (Owner Only)
Task* pop() {
    int64_t b = bottom.load(relaxed) - 1;
    bottom.store(b, relaxed);
    std::atomic_thread_fence(seq_cst); // Full barrier for the race with steal
    int64_t t = top.load(relaxed);
    if (t <= b) {
        Task* task = buffer.load(relaxed)[b % capacity];
        if (t == b) {
            // Last element: race with a concurrent steal
            if (!top.compare_exchange_strong(t, t + 1, seq_cst, relaxed)) {
                // Lost the race; a thief took the task
                task = nullptr;
            }
            // Restore bottom to its pre-decrement value (b + 1); note that a failed
            // CAS overwrites t, so the new bottom must not be derived from t here
            bottom.store(b + 1, relaxed);
        }
        return task;
    } else {
        // Deque was empty; restore bottom
        bottom.store(b + 1, relaxed);
        return nullptr;
    }
}
Steal Operation (Thieves)
Task* steal() {
int64_t t = top.load(acquire);
std::atomic_thread_fence(seq_cst);
int64_t b = bottom.load(acquire);
if (t < b) {
Task* task = buffer.load(relaxed)[t % capacity];
if (!top.compare_exchange_strong(t, t + 1, seq_cst, relaxed)) {
// Lost race with another thief or pop
return nullptr; // Retry later
}
return task;
}
return nullptr; // Empty
}
4. Cache Coherency and False Sharing
Cache Lines
Modern CPUs load memory in 64-byte cache lines. When one core writes to a cache line, all other coresโ copies are invalidated.
Cache Line (64 bytes):
+------------------------------------------------------------------+
| [counter_thread1] [counter_thread2] [counter_thread3] [...]      |
+------------------------------------------------------------------+
Thread 1 increments counter_thread1 - the entire line is invalidated on all cores!
False Sharing
Variables that could be independent share a cache line, causing "false" contention:
// BAD: False sharing
struct WorkerState {
std::atomic<int> tasks_completed; // 4 bytes
// Next worker's state might be in same cache line!
};
WorkerState workers[8]; // Packed together
// GOOD: Padded to cache line
struct alignas(64) WorkerState {
std::atomic<int> tasks_completed;
char padding[60]; // Fill to 64 bytes
};
Measuring False Sharing
Use perf to detect:
$ perf stat -e cache-misses,cache-references ./your_program
# High cache-miss ratio with many threads = likely false sharing
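A small self-contained demo you could run under perf to see the effect: each thread increments its own counter, once with the counters packed into shared cache lines and once padded to separate lines. The padded version typically shows far fewer cache misses and runs several times faster; the struct and constant names here are just illustrative.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int  kThreads = 8;
constexpr long kIters   = 10'000'000;

struct Packed            { std::atomic<long> n{0}; }; // counters share cache lines
struct alignas(64) Padded { std::atomic<long> n{0}; }; // one counter per cache line

template <typename Slot>
long run() {
    std::vector<Slot> slots(kThreads);
    std::vector<std::thread> threads;
    for (int t = 0; t < kThreads; t++)
        threads.emplace_back([&, t] {
            for (long i = 0; i < kIters; i++)
                slots[t].n.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : threads) th.join();
    long total = 0;
    for (auto& s : slots) total += s.n.load();
    return total;
}

int main() {
    std::printf("packed total: %ld\n", run<Packed>());
    std::printf("padded total: %ld\n", run<Padded>());
}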
5. Worker Thread Design
Thread Lifecycle
void worker_loop(WorkerState* state) {
while (!shutdown) {
// 1. Try to get work from own deque
Task* task = state->deque.pop();
if (!task) {
// 2. No local work - try to steal
task = try_steal_from_others(state);
}
if (task) {
// 3. Execute task
task->execute();
// Task might push more work to our deque
} else {
// 4. No work anywhere - park thread
park_thread(state);
}
}
}
Stealing Strategy
Random victim selection works well:
Task* try_steal_from_others(WorkerState* me) {
int start = random() % num_workers;
for (int i = 0; i < num_workers; i++) {
int victim = (start + i) % num_workers;
if (victim != me->id) {
Task* stolen = workers[victim].deque.steal();
if (stolen) return stolen;
}
}
return nullptr; // Everyone is empty
}
Thread Parking
When no work is available, threads should sleep rather than spin:
void park_thread(WorkerState* state) {
std::unique_lock<std::mutex> lock(state->park_mutex);
state->parked = true;
// Wait with timeout to handle spurious wakeups
state->park_cv.wait_for(lock, 1ms, [&] {
return !state->parked || shutdown;
});
}
void unpark_thread(WorkerState* state) {
{
std::lock_guard<std::mutex> lock(state->park_mutex);
state->parked = false;
}
state->park_cv.notify_one();
}
6. Task Representation
Function Object Wrapper
class Task {
public:
virtual void execute() = 0;
virtual ~Task() = default;
};
template<typename F>
class FunctionTask : public Task {
F func;
public:
FunctionTask(F&& f) : func(std::move(f)) {}
void execute() override { func(); }
};
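A hedged sketch of how spawn() could turn an arbitrary callable into a heap-allocated Task. make_task and local_deque() are illustrative names, not part of the API shown later; the worker that eventually executes the task would delete it afterwards.
template<typename F>
Task* make_task(F f) {                        // take by value, then move into the task
    return new FunctionTask<F>(std::move(f));
}

// Possible use inside spawn(), assuming a hypothetical local_deque() accessor:
//   void ThreadPool::spawn(F&& f) { local_deque().push(make_task(std::forward<F>(f))); }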
Fork-Join Pattern
Work-stealing excels at divide-and-conquer:
void parallel_quicksort(int* arr, int lo, int hi) {
if (hi - lo < THRESHOLD) {
std::sort(arr + lo, arr + hi);
return;
}
int pivot = partition(arr, lo, hi);
// Fork: push subtask for right half
pool.spawn([=] { parallel_quicksort(arr, pivot + 1, hi); });
// Execute left half locally (no stealing overhead)
parallel_quicksort(arr, lo, pivot);
// Join: wait for the right half to finish; spawn() alone does not wait (see the sketch below)
}
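A minimal join sketch, assuming a hypothetical pool.try_run_one_task() helper that pops or steals and runs one pending task: the waiting worker keeps doing useful work instead of blocking or spinning.
template<typename F>
void spawn_and_join(ThreadPool& pool, F&& right_half) {
    std::atomic<bool> done{false};
    pool.spawn([&done, f = std::forward<F>(right_half)]() mutable {
        f();
        done.store(true, std::memory_order_release); // signal completion
    });
    while (!done.load(std::memory_order_acquire)) {
        pool.try_run_one_task(); // hypothetical: help with other work while waiting
    }
}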
Project Specification
What Youโll Build
A work-stealing thread pool library providing:
// Core API
class ThreadPool {
public:
ThreadPool(size_t num_threads);
~ThreadPool();
// Spawn a task (returns immediately)
template<typename F>
void spawn(F&& task);
// Spawn and wait for result
template<typename F>
auto async(F&& task) -> std::future<decltype(task())>;
// Parallel algorithms
template<typename Iter, typename F>
void parallel_for(Iter begin, Iter end, F&& func);
template<typename Iter, typename T, typename F>
T parallel_reduce(Iter begin, Iter end, T init, F&& reducer);
};
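Example usage of this API might look like the following sketch, assuming parallel_for passes each element by reference and parallel_reduce combines values pairwise (those details are up to your design):
ThreadPool pool(std::thread::hardware_concurrency());

// Fire-and-forget task
pool.spawn([] { /* background work */ });

// Task with a result
std::future<int> f = pool.async([] { return 6 * 7; });
int result = f.get(); // 42

// Data-parallel loop: double every element
std::vector<double> values(1'000'000, 1.0);
pool.parallel_for(values.begin(), values.end(), [](double& v) { v *= 2.0; });

// Parallel reduction: sum all elements
double sum = pool.parallel_reduce(values.begin(), values.end(), 0.0,
                                  [](double a, double b) { return a + b; });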
Deliverables
- Header-only library: threadpool.hpp
- Chase-Lev deque implementation: Lock-free, resizable
- Demo application: Parallel Mandelbrot renderer
- Benchmark suite: Scaling tests, comparison with std::async
- Documentation: API reference and design notes
Success Criteria
# Correctness: No races detected
$ g++ -fsanitize=thread -o test test.cpp
$ ./test
# No ThreadSanitizer warnings
# Performance: Near-linear scaling
$ ./mandelbrot_bench --threads 1
Single-threaded: 2000 ms
$ ./mandelbrot_bench --threads 8
8 threads: 270 ms (7.4x speedup on 8 cores)
# Efficiency: Low overhead
$ ./micro_bench
Task spawn overhead: < 100 ns
Work-stealing latency: < 1 µs
Solution Architecture
Component Overview
+---------------------------------------------------------------------+
|                             Public API                              |
|       spawn() / async() / parallel_for() / parallel_reduce()        |
+----------------------------------+----------------------------------+
                                   |
+----------------------------------v----------------------------------+
|                            Task Scheduler                           |
|  +----------------------------------------------------------------+ |
|  |                      Worker Thread Array                       | |
|  |  +----------+   +----------+   +----------+   +----------+     | |
|  |  | Worker 0 |   | Worker 1 |   | Worker 2 |   | Worker 3 | ... | |
|  |  +----+-----+   +----+-----+   +----+-----+   +----+-----+     | |
|  |       |              |              |              |           | |
|  |  +----v-----+   +----v-----+   +----v-----+   +----v-----+     | |
|  |  | Deque 0  |   | Deque 1  |   | Deque 2  |   | Deque 3  |     | |
|  |  | [Tasks]  |   | [Tasks]  |   | [Tasks]  |   | [Tasks]  |     | |
|  |  +----------+   +----------+   +----------+   +----------+     | |
|  |        <--------------------- steal --------------------->     | |
|  +----------------------------------------------------------------+ |
+---------------------------------------------------------------------+
Data Structures
Work-Stealing Deque
template<typename T>
class WSDeque {
static constexpr size_t INITIAL_CAPACITY = 32;
struct Array {
std::atomic<T*> buffer;
size_t capacity;
// ... ring buffer implementation
};
alignas(64) std::atomic<int64_t> top;
alignas(64) std::atomic<int64_t> bottom;
alignas(64) std::atomic<Array*> array;
public:
void push(T item); // Owner only
std::optional<T> pop(); // Owner only
std::optional<T> steal(); // Thieves
};
Worker State (Cache-Line Aligned)
struct alignas(64) WorkerState {
WSDeque<Task*> deque;
std::thread thread;
size_t id;
// Parking mechanism
std::mutex park_mutex;
std::condition_variable park_cv;
std::atomic<bool> parked{false};
// Statistics
std::atomic<uint64_t> tasks_executed{0};
std::atomic<uint64_t> tasks_stolen{0};
char padding[64]; // Ensure next worker is on different cache line
};
Key Algorithms
Task Distribution
spawn(task):
1. Get current thread's worker ID (thread_local)
2. If called from worker thread:
- Push to current worker's deque (fast path)
3. If called from external thread:
- Push to random worker's deque
- Unpark a sleeping worker
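A hedged C++ sketch of this routing; worker_index, workers, next_random(), unpark_thread(), and make_task() are assumed names. Note that Chase-Lev push() is owner-only, so a production pool usually routes external submissions through a separate injection queue (or a short lock) rather than pushing directly into another thread's deque.
// Each worker thread sets its own index at startup; external threads keep -1.
thread_local int worker_index = -1;

template<typename F>
void ThreadPool::spawn(F&& f) {
    Task* task = make_task(std::forward<F>(f)); // wrap the callable (see Task Representation)
    if (worker_index >= 0) {
        // Fast path: called from a worker thread, push to its own deque
        workers[worker_index].deque.push(task);
    } else {
        // External thread: hand the task to a worker and wake it if it is parked.
        // (A real implementation would use an injection queue here; see note above.)
        size_t victim = next_random() % workers.size();
        workers[victim].deque.push(task);
        unpark_thread(&workers[victim]);
    }
}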
Work Loop
worker_loop():
while not shutdown:
1. task = own_deque.pop()
2. if no task:
- for each other worker (random order):
task = other_deque.steal()
if task: break
3. if task:
- execute(task)
- update statistics
4. else:
- park_thread()
- on wakeup, continue loop
Phased Implementation Guide
Phase 1: Basic Thread Pool (Days 1-2)
Goal: Working pool with a single shared queue.
Steps:
- Create worker threads that pop from a shared std::deque
- Protect the queue with a mutex
- Implement spawn() that pushes to the queue
- Implement graceful shutdown
Test: Spawning 1000 tasks completes correctly.
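A minimal Phase 1 sketch using nothing beyond the standard library (SimplePool is an illustrative name; a single mutex-protected queue, no stealing yet):
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class SimplePool {
    std::deque<std::function<void()>> queue;
    std::mutex mtx;
    std::condition_variable cv;
    std::vector<std::thread> threads;
    bool stop = false;

public:
    explicit SimplePool(size_t n) {
        for (size_t i = 0; i < n; i++)
            threads.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mtx);
                        cv.wait(lock, [this] { return stop || !queue.empty(); });
                        if (stop && queue.empty()) return;   // graceful shutdown
                        task = std::move(queue.front());
                        queue.pop_front();
                    }
                    task();                                   // run outside the lock
                }
            });
    }

    template<typename F>
    void spawn(F&& f) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            queue.emplace_back(std::forward<F>(f));
        }
        cv.notify_one();
    }

    ~SimplePool() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stop = true;
        }
        cv.notify_all();
        for (auto& t : threads) t.join();
    }
};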
Phase 2: Lock-Free Deque (Days 3-4)
Goal: Implement the Chase-Lev deque.
Steps:
- Implement push() (owner-only, relaxed atomics)
- Implement pop() (owner-only, handles the race with steal)
- Implement steal() (lock-free, may fail and retry)
- Handle buffer growth when full
Test: Single-threaded tests pass. Multi-threaded stress test with TSan.
Phase 3: Work-Stealing Integration (Days 5-6)
Goal: Each worker has its own deque, idle workers steal.
Steps:
- Replace shared queue with per-worker deques
- Modify worker loop to try local pop first
- Add stealing logic (random victim selection)
- Add thread parking when no work
Test: Mandelbrot renders correctly with work-stealing.
Phase 4: Cache Optimization (Days 6-7)
Goal: Eliminate false sharing and improve cache performance.
Steps:
- Align WorkerState to cache lines
- Separate hot (deque) and cold (statistics) data
- Profile with perf for cache misses
- Optimize the stealing pattern
Test: Scaling improves (7x+ speedup on 8 cores).
Phase 5: API and Algorithms (Days 8-10)
Goal: Ergonomic API with parallel algorithms.
Steps:
- Implement async() returning futures
- Implement parallel_for() with work-stealing
- Implement parallel_reduce()
- Add graceful shutdown and thread joining
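A hedged sketch of async() layered on spawn(), using std::packaged_task; the signature mirrors the Public API declared earlier.
#include <future>
#include <memory>

template<typename F>
auto ThreadPool::async(F&& task) -> std::future<decltype(task())> {
    using R = decltype(task());
    // packaged_task is move-only; hold it in a shared_ptr so the spawned lambda stays copyable
    auto pt = std::make_shared<std::packaged_task<R()>>(std::forward<F>(task));
    std::future<R> result = pt->get_future();
    spawn([pt] { (*pt)(); });
    return result;
}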
Testing Strategy
Correctness Tests
// Test basic spawn/execute
TEST(ThreadPool, BasicSpawn) {
ThreadPool pool(4);
std::atomic<int> counter{0};
for (int i = 0; i < 1000; i++) {
pool.spawn([&] { counter++; });
}
pool.wait_all();
EXPECT_EQ(counter.load(), 1000);
}
// Test work-stealing balance
TEST(ThreadPool, WorkStealing) {
ThreadPool pool(4);
std::array<std::atomic<int>, 4> per_thread_count{};
// Push all work to thread 0
for (int i = 0; i < 10000; i++) {
pool.spawn_to(0, [&] {
    // Read the executing worker's id inside the task body, not at capture time
    per_thread_count[get_thread_id()]++;
});
}
pool.wait_all();
// Verify work was stolen (all threads did some work)
for (auto& count : per_thread_count) {
EXPECT_GT(count.load(), 0);
}
}
Stress Tests
// Concurrent spawn and steal
void stress_test() {
ThreadPool pool(8);
std::atomic<int> completed{0};
// Spawn from multiple threads
std::vector<std::thread> spawners;
for (int t = 0; t < 4; t++) {
spawners.emplace_back([&] {
for (int i = 0; i < 10000; i++) {
pool.spawn([&] { completed++; });
}
});
}
for (auto& t : spawners) t.join();
pool.wait_all();
assert(completed == 40000);
}
ThreadSanitizer
$ g++ -std=c++17 -fsanitize=thread -g -O1 -o test_tsan test.cpp
$ ./test_tsan
# Must report NO data races
Common Pitfalls and Debugging
Pitfall 1: ABA Problem in Deque
Symptom: Occasional corruption, tasks executed twice.
Cause: Between reading top and CAS, another thread steals and new tasks wrap around.
Fix: Use 64-bit indices (never wrap in practice) or add version counter:
struct alignas(16) TopCounter {
int64_t index;
int64_t version;
};
std::atomic<TopCounter> top;
Pitfall 2: Memory Ordering Bugs
Symptom: Works single-threaded, fails multi-threaded.
Debug:
- Use TSan: it catches most ordering bugs
- Add assert()s that check invariants
- Start with seq_cst everywhere, then carefully weaken
Pitfall 3: Deadlock on Shutdown
Symptom: Program hangs on exit.
Cause: Workers parked waiting for work when shutdown requested.
Fix:
void shutdown() {
shutdown_flag.store(true);
// Wake all parked threads
for (auto& worker : workers) {
worker.park_cv.notify_all();
}
// Now join threads
}
Pitfall 4: Thundering Herd
Symptom: Performance tanks when new work arrives after idle period.
Cause: All parked threads wake simultaneously and contend.
Fix: Unpark threads incrementally:
void spawn(Task* task) {
push_to_deque(task);
// Only unpark one thread
for (auto& w : workers) {
if (w.parked.load()) {
unpark_thread(&w);
break; // Only one!
}
}
}
Extensions and Challenges
Extension 1: Task Dependencies
Support task graphs:
auto a = pool.async([] { return compute_a(); });
auto b = pool.async([] { return compute_b(); });
auto c = pool.async([&] { return a.get() + b.get(); }); // Waits for a and b
Extension 2: Priority Queues
Some tasks are more important:
pool.spawn(task, Priority::High); // Processed first
pool.spawn(task, Priority::Low); // Processed when nothing else
Extension 3: Affinity and NUMA
Pin workers to cores, allocate from local memory:
ThreadPool pool(8, ThreadPool::AffinityMode::PinToCores);
pool.spawn_on_core(3, task); // Run specifically on core 3
Extension 4: Continuations
Chain tasks without blocking:
pool.async(compute_a)
.then([](int a) { return compute_b(a); })
.then([](int b) { std::cout << b; });
Real-World Connections
Where Work-Stealing is Used
- Rayon (Rust): Powers parallel iterators with work-stealing
- Go Runtime: Goroutine scheduler uses work-stealing across P's (processors)
- Intel TBB: parallel_for uses work-stealing
- Java ForkJoinPool: Standard library work-stealing executor
Industry Patterns
| Pattern | Work-Stealing Benefit |
|---|---|
| Parallel Sort | Recursive subdivision naturally balances |
| Game Physics | Particle systems with variable work per particle |
| Build Systems | Compile tasks with dependencies |
| Ray Tracing | Rays with variable depth naturally balance |
Resources
Essential Reading
- "Rust Atomics and Locks" by Mara Bos: the best explanation of memory ordering
- "The Art of Multiprocessor Programming" by Herlihy & Shavit: lock-free algorithms
- "C++ Concurrency in Action" by Anthony Williams: practical C++ threading
Papers
- "Scheduling Multithreaded Computations by Work Stealing" by Blumofe & Leiserson (the original algorithm)
- "Dynamic Circular Work-Stealing Deque" by Chase & Lev (the deque you'll implement)
- "Correct and Efficient Work-Stealing for Weak Memory Models" by Lê et al.
Code to Study
- Rayon source code: production work-stealing in Rust
- Go runtime scheduler: work-stealing for goroutines
- crossbeam-deque: an excellent Rust deque implementation
Self-Assessment Checklist
Before considering this project complete, verify:
Understanding
- I can explain acquire/release semantics and when each is needed
- I can describe why the Chase-Lev deque uses specific memory orderings
- I can explain what false sharing is and identify it in code
- I can explain why work-stealing achieves better load balance than a shared queue
- I can describe the ABA problem and how to prevent it
Implementation
- My deque passes single-threaded tests
- ThreadSanitizer reports no races
- My pool achieves 6x+ speedup on 8 cores for the Mandelbrot benchmark
- Thread parking works (CPU usage drops when pool is idle)
- Shutdown is clean (no hung threads)
Verification
- I've stress-tested with millions of tasks
- I've tested with more threads than cores (oversubscription)
- I've profiled with perf and understand where time is spent
- I've measured cache-miss rates before and after optimization
Work-stealing thread pools power some of the most important parallel computing systems. Building one teaches you concurrency at the deepest level.