Project 3: Work-Stealing Thread Pool

Build a production-quality thread pool with per-thread work queues and work-stealing for load balancing. This is what powers Intel TBB, Rust’s Tokio, and Go’s runtime scheduler.


Quick Reference

Difficulty:       Advanced
Time:             2-4 weeks
Language:         C++17 or later
Alt Languages:    Rust, Go, Java
Coolness Level:   Level 4: Hardcore Tech Flex
Business Value:   Open Core Infrastructure
Knowledge Area:   Thread Pool Design / Work Scheduling
Key Tools:        std::atomic, std::thread, perf, ThreadSanitizer
Main Book:        “C++ Concurrency in Action” by Anthony Williams

Learning Objectives

By completing this project, you will:

  1. Master work-stealing architecture - Understand why per-thread queues with stealing outperform shared queues at scale
  2. Implement the Chase-Lev deque - Build the lock-free data structure that powers industrial work-stealers
  3. Understand cache locality - Learn why LIFO execution preserves hot cache lines while FIFO stealing balances load
  4. Apply memory ordering correctly - Use acquire/release semantics to build correct concurrent algorithms
  5. Avoid false sharing - Structure data to prevent cache line ping-pong between cores
  6. Build type-erased task submission - Use std::packaged_task and templates for generic submit() that returns futures
  7. Design graceful shutdown - Drain work queues and coordinate thread termination without deadlock
  8. Benchmark parallel performance - Measure throughput, latency, and scaling efficiency

Theoretical Foundation

Core Concepts

The Problem: Contention in Shared Queues

Traditional thread pools use a single shared task queue:

                    ┌─────────────────────┐
                    │    Global Queue     │
                    │ [T1][T2][T3][T4]... │
                    └──────────┬──────────┘
                               │
            ┌──────────────────┼──────────────────┐
            │                  │                  │
       ┌────▼────┐        ┌────▼────┐        ┌────▼────┐
       │Worker 0 │        │Worker 1 │        │Worker 2 │
       └─────────┘        └─────────┘        └─────────┘
            │                  │                  │
            └──────────────────┴──────────────────┘
                               │
                          CONTENTION!
                    (All workers fight for lock)

The problem: Every worker must acquire a lock to get or submit work. With 8+ cores and fine-grained tasks, threads spend more time contending for the lock than computing. The result is a lock convoy: workers serialize behind a single lock.

Measured impact:

  • 2 threads: ~10% overhead from contention
  • 8 threads: ~40% overhead from contention
  • 16+ threads: Often slower than single-threaded execution

The Solution: Work-Stealing Architecture

Each worker maintains its own deque (double-ended queue):

┌─────────────────────────────────────────────────────────────────────┐
│                    Work-Stealing Thread Pool                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐             │
│   │  Worker 0   │    │  Worker 1   │    │  Worker 2   │             │
│   │             │    │             │    │             │             │
│   │ ┌─────────┐ │    │ ┌─────────┐ │    │ ┌─────────┐ │             │
│   │ │ Deque 0 │ │    │ │ Deque 1 │ │    │ │ Deque 2 │ │             │
│   │ │[T1][T2] │ │    │ │  [T5]   │ │    │ │   []    │ │             │
│   │ └────┬────┘ │    │ └────┬────┘ │    │ └────┬────┘ │             │
│   │      │      │    │      │      │    │      │      │             │
│   │   BOTTOM    │    │   BOTTOM    │    │   BOTTOM    │             │
│   │  (private)  │    │  (private)  │    │  (private)  │             │
│   └──────┼──────┘    └──────┼──────┘    └──────┼──────┘             │
│          │                  │                  │                     │
│          │    ┌─────────────┴──────────────────┤                     │
│          │    │                                │                     │
│          └────┼────────────────────────────────┘                     │
│               │                                                      │
│          ◄────┴────►  STEALING HAPPENS HERE  ◄────►                  │
│                       (from TOP of victim's deque)                   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Key insight:
  - Owner pushes/pops from BOTTOM (LIFO - cache-hot tasks)
  - Thieves steal from TOP (FIFO - oldest, likely largest tasks)
  - Owner operations: nearly lock-free (fast path has no contention)
  - Stealing: rare, and only contends with other thieves

Why this works:

  1. Owner operations are fast: Push/pop from the private end rarely contend
  2. Cache locality preserved: LIFO execution keeps recently touched data in cache
  3. Natural load balancing: Idle workers steal from busy ones
  4. Stealing takes large tasks: FIFO stealing grabs the oldest (often largest) tasks, maximizing work transfer per steal

Cache Locality and Why It Matters

When a task spawns subtasks, those subtasks reference the same data:

parallel_quicksort(array, 0, 1000000)
    |
    +-- spawns: parallel_quicksort(array, 0, 500000)   // Left half
    |
    +-- spawns: parallel_quicksort(array, 500001, 1000000)  // Right half

The subtask pushed most recently references data the worker just touched. Executing it next (LIFO pop from the bottom) means:
  - That portion of the array is still in L1/L2 cache
  - Cache miss rate: LOW

If tasks were taken FIFO (like a regular queue):
  - By the time the worker reaches that subtask, its data has been evicted
  - Cache miss rate: HIGH

Benchmark evidence: LIFO execution can be 2-3x faster than FIFO for recursive algorithms due to cache effects alone.

Contention Reduction Metrics

Architecture             Lock operations per task    Cache coherence traffic
Single shared queue      2 (push + pop)              High (all cores invalidate)
Per-thread + stealing    ~0.05 (rare steals)         Low (mostly local access)

Work-stealing achieves near-zero contention because:

  • 95%+ of operations are local (no lock, no coherence traffic)
  • Stealing only happens when workers run out of local work
  • When load is balanced, almost no stealing occurs

Why This Matters

Work-stealing is the architecture behind:

  • Intel TBB (Threading Building Blocks): Powers parallel_for, parallel_reduce in C++
  • Rust’s Rayon: The par_iter() you use for parallel iteration
  • Go’s runtime: Goroutine scheduling uses work-stealing across P’s
  • Java’s ForkJoinPool: Standard library work-stealing executor
  • Tokio (Rust async runtime): Work-stealing for async tasks

Understanding this architecture lets you:

  1. Use these libraries effectively (know when they’ll perform well)
  2. Debug scaling problems in parallel code
  3. Build custom schedulers for specialized workloads
  4. Reason about performance of concurrent systems

Historical Context

1990s: The Cilk project at MIT (Blumofe, Leiserson) proved that work-stealing is asymptotically optimal for fork-join parallelism: expected execution time on P processors is at most T_1/P + O(T_inf), where T_1 is total sequential work, P is the number of processors, and T_inf is the critical path length. (The journal version of the proof appeared in 1999.)
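
A quick worked instance of that bound (hypothetical numbers, chosen only to show the arithmetic):

    T_P <= T_1/P + O(T_inf)

    Example: T_1 = 800 ms of total work, T_inf = 5 ms critical path, P = 8
    =>  T_8  <=  800/8 + c*5  =  100 ms plus a small constant times 5 ms

Speedup stays near-linear as long as T_1/P dominates T_inf, i.e. the workload has far more total work than critical-path work.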

2005: The Chase-Lev paper introduced a practical lock-free deque that became the standard implementation.

2010s: TBB, Rayon, Go, and Tokio all adopted work-stealing as their core scheduling strategy.

Today: Work-stealing is considered the gold standard for fork-join and task-parallel scheduling.

Common Misconceptions

Misconception 1: “Work-stealing is just a thread pool with better load balancing” Reality: It’s a fundamentally different architecture. The per-thread deques, the LIFO/FIFO split, and the stealing protocol create qualitatively different behavior.

Misconception 2: “Stealing is expensive because of contention” Reality: When workloads are balanced, almost no stealing occurs. When imbalanced, stealing is rare per-task but transfers significant work per operation.

Misconception 3: “You need lock-free data structures” Reality: You can start with mutex-protected deques. Lock-free (Chase-Lev) improves performance but mutex version works correctly.

Misconception 4: “More stealing = better load balance” Reality: Frequent stealing indicates workload imbalance. Optimal case: zero steals (perfectly distributed work).


Project Specification

What You Will Build

A production-quality work-stealing thread pool library providing:

class WorkStealingPool {
public:
    // Create pool with specified number of workers (default: hardware_concurrency)
    explicit WorkStealingPool(size_t num_workers = 0);

    // Destructor drains queues and joins threads
    ~WorkStealingPool();

    // Submit a task and get a future for its result
    template<typename F, typename... Args>
    auto submit(F&& func, Args&&... args)
        -> std::future<std::invoke_result_t<F, Args...>>;

    // Spawn a fire-and-forget task (no result needed)
    template<typename F>
    void spawn(F&& func);

    // Wait for all currently submitted tasks to complete
    void wait_all();

    // Query pool state
    size_t num_workers() const;
    size_t pending_tasks() const;

    // Statistics
    struct Stats {
        uint64_t tasks_executed;
        uint64_t tasks_stolen;
        uint64_t steal_attempts;
        uint64_t successful_steals;
    };
    Stats stats() const;
};
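
A minimal usage sketch of this interface (illustrative only; it assumes the class compiles as declared above):

#include <iostream>

int main() {
    WorkStealingPool pool;   // defaults to hardware_concurrency workers

    // submit(): returns a future carrying the result
    auto sum = pool.submit([](int a, int b) { return a + b; }, 40, 2);

    // spawn(): fire-and-forget, no future allocated
    for (int i = 0; i < 1000; ++i) {
        pool.spawn([i] { /* independent work for item i */ });
    }

    pool.wait_all();                             // all spawned tasks finished
    std::cout << "sum = " << sum.get() << "\n";  // prints "sum = 42"

    auto s = pool.stats();
    std::cout << s.tasks_executed << " tasks executed, "
              << s.successful_steals << " steals\n";
}   // destructor drains remaining work and joins the workers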

Deliverables

  1. Header-only library: work_stealing_pool.hpp (or split header/source)
  2. Chase-Lev deque implementation: work_stealing_deque.hpp
  3. Demo applications:
    • Parallel Mandelbrot renderer
    • Parallel merge sort
    • Parallel tree traversal
  4. Benchmark suite: Scaling tests, comparison with std::async and simple thread pool
  5. Documentation: API reference, design rationale, performance analysis

Example Output

$ ./thread_pool_benchmark --threads 8

================================================================================
                 WORK-STEALING THREAD POOL BENCHMARK
================================================================================

Configuration:
  Workers: 8 (matching hardware_concurrency)
  Queue type: Lock-free Chase-Lev deque with work-stealing
  CPU: Apple M1 Pro (8 cores)

--------------------------------------------------------------------------------
[BENCHMARK 1] Micro-tasks: 100,000 tasks (1us each)
--------------------------------------------------------------------------------
  Submission time:     12.3 ms (8.1M tasks/sec submission rate)
  Completion time:     145 ms
  Throughput:          689,655 tasks/second

  Work distribution:
    Worker 0: 12,847 tasks (12.8%)
    Worker 1: 12,523 tasks (12.5%)
    Worker 2: 12,891 tasks (12.9%)
    Worker 3: 12,445 tasks (12.4%)
    Worker 4: 12,789 tasks (12.8%)
    Worker 5: 12,612 tasks (12.6%)
    Worker 6: 12,467 tasks (12.5%)
    Worker 7: 11,426 tasks (11.4%)

  Stealing statistics:
    Total steal attempts: 1,247
    Successful steals:    892 (71.5% success rate)
    Work items stolen:    892

  Comparison:
    vs std::async:        2.3x faster
    vs simple thread pool: 1.8x faster

--------------------------------------------------------------------------------
[BENCHMARK 2] Fork-join: Parallel quicksort of 10M integers
--------------------------------------------------------------------------------
  Sequential time:  890 ms
  Parallel time:    118 ms
  Speedup:          7.54x (theoretical max: 8x)
  Efficiency:       94.3%

  Task tree depth:  20 levels
  Total subtasks:   131,071

  Stealing:
    Steals performed: 2,847
    Work balance:     Excellent (max worker deviation: 3.2%)

--------------------------------------------------------------------------------
[BENCHMARK 3] Heterogeneous workload (mixed 1us, 100us, 10ms tasks)
--------------------------------------------------------------------------------
  Task mix: 80% micro (1us), 15% medium (100us), 5% large (10ms)
  Total tasks: 10,000

  Completion time:  524 ms

  Worker utilization:
    Average active time: 96.2%
    Max idle gap:        1.2 ms

  Stealing adapts to workload:
    - Micro tasks: minimal stealing (local execution)
    - Large tasks: stolen efficiently to balance

  vs Round-robin distribution: 1.6x faster

--------------------------------------------------------------------------------
[BENCHMARK 4] Stress test: 1M tasks from 8 external threads
--------------------------------------------------------------------------------
  Concurrent submitters: 8 threads
  Tasks per submitter:   125,000

  All tasks completed:   1.2 seconds
  No races detected:     (TSan clean)
  No deadlocks:          (clean shutdown)

================================================================================
SUMMARY
================================================================================
  Scaling efficiency at 8 cores: 94%+
  Task spawn overhead:           ~80 ns
  Work-stealing latency:         ~200 ns
  Memory overhead per task:      ~64 bytes

  Recommendation: This pool is suitable for production use.
================================================================================

Solution Architecture

High-Level Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                        WorkStealingPool                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                       Public API Layer                               │   │
│   │  submit<F,Args>()  spawn<F>()  wait_all()  stats()                  │   │
│   └────────────────────────────────┬────────────────────────────────────┘   │
│                                    │                                         │
│   ┌────────────────────────────────▼────────────────────────────────────┐   │
│   │                     Task Distribution Layer                          │   │
│   │                                                                      │   │
│   │  External submit → Random worker selection                           │   │
│   │  Internal spawn  → Push to current worker's deque                    │   │
│   │                                                                      │   │
│   └────────────────────────────────┬────────────────────────────────────┘   │
│                                    │                                         │
│   ┌────────────────────────────────▼────────────────────────────────────┐   │
│   │                        Worker Threads                                │   │
│   │                                                                      │   │
│   │  ┌────────────────────────────────────────────────────────────────┐ │   │
│   │  │                      Worker Array                              │ │   │
│   │  │                                                                │ │   │
│   │  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐           │ │   │
│   │  │  │Worker 0 │  │Worker 1 │  │Worker 2 │  │Worker 3 │  ...      │ │   │
│   │  │  │         │  │         │  │         │  │         │           │ │   │
│   │  │  │ thread  │  │ thread  │  │ thread  │  │ thread  │           │ │   │
│   │  │  │ deque   │  │ deque   │  │ deque   │  │ deque   │           │ │   │
│   │  │  │ stats   │  │ stats   │  │ stats   │  │ stats   │           │ │   │
│   │  │  │ parking │  │ parking │  │ parking │  │ parking │           │ │   │
│   │  │  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘           │ │   │
│   │  │       │            │            │            │                 │ │   │
│   │  │       └────────────┴────────────┴────────────┘                 │ │   │
│   │  │                         │                                      │ │   │
│   │  │                    STEALING                                    │ │   │
│   │  │           (idle workers steal from busy ones)                  │ │   │
│   │  │                                                                │ │   │
│   │  └────────────────────────────────────────────────────────────────┘ │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                    Coordination Layer                                │   │
│   │  shutdown_flag    pending_count    park/unpark mechanisms            │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Components

1. Work-Stealing Deque (Chase-Lev)

template<typename T>
class WSDeque {
    // Circular buffer with grow-on-full semantics
    struct Buffer {
        std::atomic<T*>* tasks;
        size_t capacity;
        size_t mask;  // capacity - 1, for modular indexing

        Buffer(size_t cap);
        T* get(int64_t index);
        void put(int64_t index, T* task);
        Buffer* grow(int64_t bottom, int64_t top);
    };

    // Indices:
    //   bottom: owned exclusively by this worker (push/pop)
    //   top: shared with thieves (steal)
    alignas(64) std::atomic<int64_t> bottom{0};
    alignas(64) std::atomic<int64_t> top{0};
    alignas(64) std::atomic<Buffer*> buffer;

public:
    // Owner operations (single-threaded, fast)
    void push(T* task);           // O(1), may grow buffer
    T* pop();                      // O(1), returns nullptr if empty

    // Thief operation (concurrent, lock-free)
    T* steal();                    // O(1), returns nullptr if empty or lost race

    // Queries
    bool empty() const;
    size_t size() const;
};

Deque invariant: 0 <= top <= bottom <= top + capacity (bottom may transiently equal top - 1 inside pop() when the deque is empty, before it is restored)

Visual representation:

Circular buffer (capacity = 8):

    indices:   0   1   2   3   4   5   6   7
             ┌───┬───┬───┬───┬───┬───┬───┬───┐
    buffer:  │   │   │ T1│ T2│ T3│ T4│   │   │
             └───┴───┴───┴───┴───┴───┴───┴───┘
                       ▲               ▲
                       │               │
                      top=2         bottom=6

    Tasks in deque: T1, T2, T3, T4 (4 tasks)

    Owner pushes at bottom:
      buffer[6] = T5; bottom = 7;

    Owner pops from bottom-1:
      bottom = 5; return buffer[5]; // returns T4

    Thief steals from top:
      if (CAS(top, 2, 3)) return buffer[2]; // returns T1

2. Worker State (Cache-Line Aligned)

struct alignas(64) WorkerState {
    // Hot data: accessed every task
    WSDeque<Task> deque;
    std::thread thread;
    size_t id;

    // Parking mechanism (for idle workers)
    std::mutex park_mutex;
    std::condition_variable park_cv;
    std::atomic<bool> parked{false};

    // Cold data: statistics (rarely accessed)
    struct alignas(64) Statistics {
        std::atomic<uint64_t> tasks_executed{0};
        std::atomic<uint64_t> tasks_stolen{0};
        std::atomic<uint64_t> steal_attempts{0};
    } stats;

    // Note: alignas(64) on WorkerState already rounds its size up to a multiple
    // of 64, so adjacent workers in an array cannot share a cache line; explicit
    // tail padding is only needed if that alignment is dropped.
};

Why alignas(64)? Cache lines are typically 64 bytes. Without alignment:

  • Worker 0’s stats and Worker 1’s deque could share a cache line
  • When Worker 1 modifies its deque, Worker 0’s stats are invalidated
  • This “false sharing” causes cache misses even though data is independent
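
A small compile-time sanity check helps catch layout regressions (a sketch assuming 64-byte cache lines; where available, std::hardware_destructive_interference_size from <new> gives the portable value):

static_assert(alignof(WorkerState) >= 64,
              "WorkerState must be cache-line aligned");
static_assert(sizeof(WorkerState) % 64 == 0,
              "WorkerState should occupy whole cache lines so adjacent "
              "workers in an array never share one");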

3. Task Representation

// Type-erased task wrapper
class Task {
public:
    virtual void execute() = 0;
    virtual ~Task() = default;
};

// Concrete task holding any callable
template<typename Func>
class FunctionTask : public Task {
    Func func;
public:
    explicit FunctionTask(Func&& f) : func(std::forward<Func>(f)) {}
    void execute() override { func(); }
};

// Task with result (for submit())
template<typename Func>
class PackagedTask : public Task {
    std::packaged_task<std::invoke_result_t<Func>()> task;
public:
    explicit PackagedTask(Func&& f) : task(std::forward<Func>(f)) {}
    void execute() override { task(); }
    auto get_future() { return task.get_future(); }
};

Data Structures Summary

Structure                      Purpose                            Thread Safety
WSDeque<Task>                  Per-worker task storage            Lock-free (owner + thieves)
WorkerState                    Worker thread context              Owned by one thread
Task (type-erased)             Holds callable + result promise    Immutable after creation
std::atomic<bool> shutdown     Termination signal                 Relaxed atomic
std::atomic<size_t> pending    Outstanding task count             Seq-cst atomic

Implementation Guide

Phase 1: Basic Thread Pool with Shared Queue (Days 1-2)

Goal: Get a working pool first, optimize later.

Steps:

  1. Create shared task queue:
    std::queue<std::function<void()>> queue_;   // matches the lambda pushed in submit()
    std::mutex queue_mutex_;
    std::condition_variable queue_cv_;
    
  2. Implement worker loop:
    void worker_loop() {
        while (true) {
            std::function<void()> task;
            {
                std::unique_lock lock(queue_mutex_);
                queue_cv_.wait(lock, [&] {
                    return shutdown_ || !queue_.empty();
                });
                // Drain remaining tasks before exiting on shutdown
                if (shutdown_ && queue_.empty()) return;
                task = std::move(queue_.front());
                queue_.pop();
            }
            task();
        }
    }
    
  3. Implement submit():
    template<typename F, typename... Args>
    auto submit(F&& f, Args&&... args) {
        using ReturnType = std::invoke_result_t<F, Args...>;
        auto task = std::make_shared<std::packaged_task<ReturnType()>>(
            std::bind(std::forward<F>(f), std::forward<Args>(args)...)
        );
        auto future = task->get_future();
        {
            std::lock_guard lock(queue_mutex_);
            queue_.push([task]() { (*task)(); });
        }
        queue_cv_.notify_one();
        return future;
    }
    

Checkpoint: 1000 tasks complete correctly. Measure baseline throughput.
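
A minimal sketch of that baseline measurement (ThreadPool is a hypothetical name for the Phase 1 class; exact numbers will vary by machine):

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

void measure_baseline(ThreadPool& pool) {
    constexpr int N = 100'000;
    std::atomic<int> done{0};

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        pool.submit([&done] { done.fetch_add(1, std::memory_order_relaxed); });
    }
    // Crude completion wait; a proper wait_all() arrives in Phase 5
    while (done.load(std::memory_order_relaxed) < N) {
        std::this_thread::yield();
    }
    auto secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();

    std::cout << N / secs << " tasks/second (shared-queue baseline)\n";
}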

Phase 2: Lock-Free Deque (Days 3-5)

Goal: Implement the Chase-Lev work-stealing deque.

Key algorithm for push (owner only):

void push(T* task) {
    int64_t b = bottom_.load(std::memory_order_relaxed);
    int64_t t = top_.load(std::memory_order_acquire);
    Buffer* buf = buffer_.load(std::memory_order_relaxed);

    if (b - t >= static_cast<int64_t>(buf->capacity)) {
        // Buffer full, grow it
        buf = buf->grow(b, t);
        buffer_.store(buf, std::memory_order_release);
    }

    buf->put(b, task);
    std::atomic_thread_fence(std::memory_order_release);
    bottom_.store(b + 1, std::memory_order_relaxed);
}

Key algorithm for pop (owner only):

T* pop() {
    int64_t b = bottom_.load(std::memory_order_relaxed) - 1;
    Buffer* buf = buffer_.load(std::memory_order_relaxed);
    bottom_.store(b, std::memory_order_relaxed);

    std::atomic_thread_fence(std::memory_order_seq_cst);

    int64_t t = top_.load(std::memory_order_relaxed);

    if (t <= b) {
        // Non-empty
        T* task = buf->get(b);
        if (t == b) {
            // Last element: race against a concurrent steal
            if (!top_.compare_exchange_strong(
                    t, t + 1,
                    std::memory_order_seq_cst,
                    std::memory_order_relaxed)) {
                // Lost the race; the thief took it
                task = nullptr;
            }
            // Restore bottom either way. Use b + 1 (not t + 1): a failed CAS
            // overwrites t with the value the thief published.
            bottom_.store(b + 1, std::memory_order_relaxed);
        }
        return task;
    } else {
        // Empty: restore bottom
        bottom_.store(b + 1, std::memory_order_relaxed);
        return nullptr;
    }
}

Key algorithm for steal (thieves):

T* steal() {
    int64_t t = top_.load(std::memory_order_acquire);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    int64_t b = bottom_.load(std::memory_order_acquire);

    if (t < b) {
        // Non-empty
        // acquire: must observe the buffer published by a concurrent grow()
        Buffer* buf = buffer_.load(std::memory_order_acquire);
        T* task = buf->get(t);

        if (!top_.compare_exchange_strong(
                t, t + 1,
                std::memory_order_seq_cst,
                std::memory_order_relaxed)) {
            // Lost race with another thief or pop
            return nullptr;
        }
        return task;
    }
    return nullptr;
}
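
push() above relies on Buffer::grow(), which was only declared in the architecture section; a minimal sketch (written as if defined inside Buffer). It intentionally does not free the old buffer, since a concurrent thief may still be reading from it; production code retires old buffers with an epoch scheme or frees them at pool shutdown.

Buffer* grow(int64_t bottom, int64_t top) {
    Buffer* bigger = new Buffer(capacity * 2);
    // Copy the live range [top, bottom); put()/get() mask indices internally,
    // so the same logical indices remain valid in the larger buffer
    for (int64_t i = top; i < bottom; ++i) {
        bigger->put(i, get(i));
    }
    return bigger;
}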

Checkpoint: Single-threaded tests pass. Multi-threaded stress test with TSan reports no races.

Phase 3: Work-Stealing Integration (Days 6-8)

Goal: Each worker has own deque, idle workers steal.

Worker loop with stealing:

void worker_loop(size_t worker_id) {
    // Thread-local worker ID for spawn() to find current deque
    thread_local_worker_id = worker_id;

    while (!shutdown_.load(std::memory_order_relaxed)) {
        Task* task = nullptr;

        // 1. Try local deque first (hot path)
        task = workers_[worker_id].deque.pop();

        // 2. If empty, try stealing
        if (!task) {
            task = try_steal(worker_id);
        }

        if (task) {
            task->execute();
            delete task;
            workers_[worker_id].stats.tasks_executed.fetch_add(
                1, std::memory_order_relaxed);
        } else {
            // 3. No work anywhere, park thread
            park_worker(worker_id);
        }
    }
}
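
park_worker() used above is not shown elsewhere; a minimal sketch using the park_mutex / park_cv / parked fields from WorkerState. Re-checking shutdown and local work under the lock prevents sleeping through a wakeup that raced with the empty check:

void park_worker(size_t worker_id) {
    auto& w = workers_[worker_id];
    std::unique_lock lock(w.park_mutex);
    w.parked.store(true, std::memory_order_relaxed);
    w.park_cv.wait(lock, [&] {
        return shutdown_.load(std::memory_order_relaxed) ||
               !w.deque.empty() ||
               !w.parked.load(std::memory_order_relaxed);   // cleared by unpark
    });
    w.parked.store(false, std::memory_order_relaxed);
}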

Stealing strategy (random victim):

Task* try_steal(size_t my_id) {
    // Start at random victim to avoid herding
    size_t start = random_() % num_workers_;

    for (size_t i = 0; i < num_workers_; ++i) {
        size_t victim = (start + i) % num_workers_;
        if (victim == my_id) continue;

        workers_[my_id].stats.steal_attempts.fetch_add(
            1, std::memory_order_relaxed);

        Task* stolen = workers_[victim].deque.steal();
        if (stolen) {
            workers_[my_id].stats.tasks_stolen.fetch_add(
                1, std::memory_order_relaxed);
            return stolen;
        }
    }
    return nullptr;
}
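
random_() above is deliberately unspecified; any cheap per-worker generator works, since only the fairness of victim selection matters, not statistical quality. A sketch using xorshift64 (the seed scheme is an assumption):

#include <cstdint>

struct XorShift64 {
    uint64_t state;
    explicit XorShift64(uint64_t seed) : state(seed | 1) {}   // avoid all-zero state
    uint64_t operator()() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        return state;
    }
};

// One generator per worker, e.g. seeded from the worker id:
//   XorShift64 random_{0x9E3779B97F4A7C15ull ^ worker_id};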

Task distribution:

template<typename F>
void spawn(F&& func) {
    Task* task = new FunctionTask<std::decay_t<F>>(std::forward<F>(func));

    // thread_local_worker_id is a thread_local int, set to the worker's index
    // inside worker_loop() and -1 on external (non-worker) threads
    if (thread_local_worker_id >= 0) {
        // Called from worker thread: push to local deque (fast path)
        workers_[thread_local_worker_id].deque.push(task);
    } else {
        // Called from external thread: pick a random worker's deque
        size_t target = random_() % num_workers_;
        workers_[target].deque.push(task);
    }

    // Wake one parked worker if any
    unpark_one();
}

Checkpoint: Mandelbrot renders correctly. Work is distributed across workers.

Phase 4: Cache Optimization (Days 9-10)

Goal: Eliminate false sharing, optimize memory layout.

1. Ensure cache-line alignment:

struct alignas(64) WorkerState {
    // Put hot data (deque) first
    WSDeque<Task*> deque;

    // Cold data (statistics) on separate cache line
    struct alignas(64) Stats { ... } stats;

    // Padding to ensure next WorkerState starts on new cache line
    char padding[...];
};

2. Avoid false sharing in buffer:

// BAD: Adjacent task pointers share cache line
T* buffer[N];

// GOOD: Pad each slot to cache line (only if tasks are large)
struct alignas(64) Slot { T* task; };
Slot buffer[N];

3. Profile with perf:

$ perf stat -e cache-misses,cache-references ./benchmark
# Before: 15% cache miss rate
# After:  3% cache miss rate

Checkpoint: Scaling improves to 7x+ speedup on 8 cores.

Phase 5: API Completion (Days 11-14)

Goal: Complete public API with futures, wait_all, shutdown.

submit() with future:

template<typename F, typename... Args>
auto submit(F&& f, Args&&... args)
    -> std::future<std::invoke_result_t<F, Args...>>
{
    using ReturnType = std::invoke_result_t<F, Args...>;

    // Bind arguments
    auto bound = std::bind(std::forward<F>(f),
                           std::forward<Args>(args)...);

    // Wrap in packaged_task for future
    auto packaged = std::make_unique<std::packaged_task<ReturnType()>>(
        std::move(bound));

    auto future = packaged->get_future();

    // Create type-erased task
    Task* task = new PackagedTaskWrapper<ReturnType>(std::move(packaged));

    pending_count_.fetch_add(1, std::memory_order_seq_cst);

    // Submit to appropriate deque
    distribute_task(task);

    return future;
}
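
distribute_task() referenced above is not defined in this guide; it follows the same routing rule as spawn(). A minimal sketch:

void distribute_task(Task* task) {
    if (thread_local_worker_id >= 0) {
        // Submitted from inside a worker: keep it local (LIFO, cache-hot)
        workers_[thread_local_worker_id].deque.push(task);
    } else {
        // Submitted from an external thread: pick a random worker's deque
        workers_[random_() % num_workers_].deque.push(task);
    }
    unpark_one();   // make sure at least one worker is awake to run it
}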

wait_all():

void wait_all() {
    // Participate in work-stealing while waiting.
    // Note: workers must also decrement pending_count_ after executing a task
    // in worker_loop(); otherwise this loop terminates only if the waiting
    // thread happens to execute every remaining task itself.
    while (pending_count_.load(std::memory_order_seq_cst) > 0) {
        // Try to help complete work
        Task* task = try_steal_any();
        if (task) {
            task->execute();
            delete task;
            pending_count_.fetch_sub(1, std::memory_order_seq_cst);
        } else {
            std::this_thread::yield();
        }
    }
}

Graceful shutdown:

~WorkStealingPool() {
    // Signal shutdown
    shutdown_.store(true, std::memory_order_release);

    // Wake all parked workers
    for (auto& worker : workers_) {
        unpark_worker(worker);
    }

    // Join all threads
    for (auto& worker : workers_) {
        if (worker.thread.joinable()) {
            worker.thread.join();
        }
    }

    // Clean up any remaining tasks
    for (auto& worker : workers_) {
        while (Task* task = worker.deque.pop()) {
            delete task;
        }
    }
}

Testing Strategy

Unit Tests

TEST(WSDeque, SingleThreadedPushPop) {
    WSDeque<int> deque;   // element type int; push/pop/steal traffic in int*
    int values[100];

    for (int i = 0; i < 100; ++i) {
        deque.push(&values[i]);
    }
    EXPECT_EQ(deque.size(), 100);

    // LIFO order
    for (int i = 99; i >= 0; --i) {
        EXPECT_EQ(deque.pop(), &values[i]);
    }
    EXPECT_TRUE(deque.empty());
}

TEST(WSDeque, ConcurrentSteal) {
    WSDeque<int> deque;
    int values[10000];
    std::atomic<int> stolen_count{0};

    // Owner pushes
    for (int i = 0; i < 10000; ++i) {
        deque.push(&values[i]);
    }

    // Multiple thieves steal concurrently
    std::vector<std::thread> thieves;
    for (int t = 0; t < 4; ++t) {
        thieves.emplace_back([&] {
            // steal() can return nullptr on a lost CAS race even while the
            // deque is non-empty, so keep retrying until it is drained
            while (!deque.empty()) {
                if (deque.steal() != nullptr) {
                    stolen_count.fetch_add(1);
                }
            }
        });
    }

    for (auto& t : thieves) t.join();
    EXPECT_EQ(stolen_count.load(), 10000);
}

TEST(WorkStealingPool, BasicSubmit) {
    WorkStealingPool pool(4);
    std::atomic<int> counter{0};

    std::vector<std::future<void>> futures;
    for (int i = 0; i < 1000; ++i) {
        futures.push_back(pool.submit([&counter] {
            counter.fetch_add(1);
        }));
    }

    for (auto& f : futures) f.wait();
    EXPECT_EQ(counter.load(), 1000);
}

TEST(WorkStealingPool, ResultPropagation) {
    WorkStealingPool pool(4);

    auto future = pool.submit([](int a, int b) { return a + b; }, 20, 22);
    EXPECT_EQ(future.get(), 42);
}

TEST(WorkStealingPool, WorkStealing) {
    WorkStealingPool pool(4);
    std::array<std::atomic<int>, 4> per_thread_count{};

    // Flood the pool with tasks from this external thread
    for (int i = 0; i < 10000; ++i) {
        pool.submit([&per_thread_count] {
            size_t worker_id = get_current_worker_id();
            per_thread_count[worker_id].fetch_add(1);

            // Simulate work
            std::this_thread::sleep_for(std::chrono::microseconds(10));
        });
    }

    pool.wait_all();

    // Verify work was distributed (stolen)
    for (auto& count : per_thread_count) {
        EXPECT_GT(count.load(), 0);  // Each worker did some work
    }
}

Stress Tests

TEST(WorkStealingPool, StressTest) {
    WorkStealingPool pool(8);
    std::atomic<int> completed{0};
    constexpr int NUM_TASKS = 1'000'000;

    // Hammer from multiple external threads
    std::vector<std::thread> submitters;
    for (int t = 0; t < 8; ++t) {
        submitters.emplace_back([&pool, &completed] {
            for (int i = 0; i < NUM_TASKS / 8; ++i) {
                pool.spawn([&completed] {
                    completed.fetch_add(1);
                });
            }
        });
    }

    for (auto& t : submitters) t.join();
    pool.wait_all();

    EXPECT_EQ(completed.load(), NUM_TASKS);
}

TEST(WorkStealingPool, ForkJoinRecursive) {
    WorkStealingPool pool(8);

    std::function<int64_t(int)> fib = [&pool, &fib](int n) -> int64_t {
        if (n < 2) return n;
        if (n < 20) {
            // Below threshold, compute sequentially
            return fib(n - 1) + fib(n - 2);
        }

        // Fork
        auto f1 = pool.submit([&fib, n] { return fib(n - 1); });
        auto f2 = pool.submit([&fib, n] { return fib(n - 2); });

        // Join. Caution: blocking in get() inside worker tasks can deadlock a
        // plain pool once every worker is blocked; this test assumes the pool
        // (or the calling thread) helps execute pending tasks while waiting.
        return f1.get() + f2.get();
    };

    EXPECT_EQ(fib(35), 9227465);
}

ThreadSanitizer

$ clang++ -std=c++17 -fsanitize=thread -g -O1 -o test_tsan tests/*.cpp
$ ./test_tsan

# MUST report: NO data races detected
# Common issues TSan catches:
#   - Missing memory order in atomics
#   - Data race on task execution
#   - Use-after-free on task deletion

Benchmarking

void benchmark_scaling() {
    std::vector<size_t> thread_counts = {1, 2, 4, 8, 16};

    for (size_t threads : thread_counts) {
        WorkStealingPool pool(threads);

        // Prepare the input before starting the timer so only the sort is measured
        std::vector<int> data(10'000'000);
        std::iota(data.begin(), data.end(), 0);
        std::shuffle(data.begin(), data.end(), std::mt19937{42});

        auto start = std::chrono::high_resolution_clock::now();

        // Parallel quicksort of 10M integers
        parallel_quicksort(pool, data.data(), 0, data.size());

        auto end = std::chrono::high_resolution_clock::now();
        auto ms = std::chrono::duration<double, std::milli>(end - start).count();

        std::cout << threads << " threads: " << ms << " ms\n";
    }
}
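
parallel_quicksort() used by the benchmark is not defined in this guide; a minimal fork-join sketch against the pool API above (the cutoff value and the blocking left.get() are simplifications; see the fork-join caveat in the tests):

#include <algorithm>

void parallel_quicksort(WorkStealingPool& pool, int* data, size_t lo, size_t hi) {
    constexpr size_t CUTOFF = 10'000;               // below this, sort sequentially
    if (hi - lo <= CUTOFF) {
        std::sort(data + lo, data + hi);
        return;
    }

    // Three-way partition so elements equal to the pivot are excluded from
    // both recursive calls (guarantees progress on duplicate-heavy input)
    int pivot = data[lo + (hi - lo) / 2];
    int* mid1 = std::partition(data + lo, data + hi,
                               [pivot](int x) { return x < pivot; });
    int* mid2 = std::partition(mid1, data + hi,
                               [pivot](int x) { return x == pivot; });
    size_t left_end    = static_cast<size_t>(mid1 - data);
    size_t right_begin = static_cast<size_t>(mid2 - data);

    // Fork the left part, recurse on the right part in this thread, then join
    auto left = pool.submit([&pool, data, lo, left_end] {
        parallel_quicksort(pool, data, lo, left_end);
    });
    parallel_quicksort(pool, data, right_begin, hi);
    left.get();
}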

Common Pitfalls & Debugging

Pitfall 1: ABA Problem in Lock-Free Deque

Symptom: Occasional corruption, tasks executed twice, or crashes after many iterations.

Cause: Classic ABA scenario in CAS loop:

  1. Thread A reads top = 5, gets task pointer P
  2. Thread B steals successfully, top becomes 6
  3. More operations wrap around, top becomes 5 again, but with different task Q at that slot
  4. Thread A’s CAS succeeds (5 == 5), returns stale pointer P

Solutions:

// Solution 1: Use 64-bit indices that never wrap in practice (recommended)
// At one billion deque operations per second, a 64-bit counter takes ~585 years to wrap
std::atomic<int64_t> top, bottom;  // int64_t, not int32_t

// Solution 2: Tagged pointer with version counter
struct TaggedTop {
    int64_t index;
    int64_t version;
};
std::atomic<TaggedTop> top;

// Solution 3: Hazard pointers (complex, rarely needed for deque)

Pitfall 2: Starvation Under Heavy Stealing

Symptom: Some workers consistently have zero work while others are overloaded.

Cause: Round-robin stealing that always starts from worker 0 turns it into a hot spot.

Solution: Random victim selection:

// BAD: Always starts from 0
for (size_t victim = 0; victim < num_workers_; ++victim) {
    if (victim != my_id) try_steal(victim);
}

// GOOD: Random start
size_t start = rng() % num_workers_;
for (size_t i = 0; i < num_workers_; ++i) {
    size_t victim = (start + i) % num_workers_;
    if (victim != my_id) try_steal(victim);
}

Pitfall 3: False Sharing Between Workers

Symptom: Scaling stops at 4x even with 8 cores; perf shows high cache miss rate.

Cause: Worker state structures are adjacent in memory, sharing cache lines.

Debugging:

$ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./benchmark

# Before fix: 15% miss rate
# After fix:  3% miss rate

$ perf c2c record ./benchmark
$ perf c2c report
# Shows "false sharing" hot spots

Solution: Cache-line alignment:

// BAD: Workers share cache lines
struct WorkerState {
    std::atomic<int> counter;  // 4 bytes
    WSDeque<Task*> deque;      // starts immediately after
};
WorkerState workers[8];  // Adjacent, false sharing!

// GOOD: Each worker on own cache line(s)
struct alignas(64) WorkerState {
    std::atomic<int> counter;
    WSDeque<Task*> deque;
    char padding[64 - ...];  // Ensure isolation
};

Pitfall 4: Deadlock on Shutdown

Symptom: Program hangs on exit; workers never join.

Cause: Workers are parked (waiting on condition variable) when shutdown is requested.

Solution: Wake all workers after setting shutdown flag:

~WorkStealingPool() {
    shutdown_.store(true, std::memory_order_release);

    // CRITICAL: Wake ALL parked workers
    for (auto& worker : workers_) {
        {
            std::lock_guard lock(worker.park_mutex);
            worker.parked = false;
        }
        worker.park_cv.notify_all();  // notify_all, not notify_one!
    }

    for (auto& worker : workers_) {
        if (worker.thread.joinable()) {
            worker.thread.join();
        }
    }
}

Pitfall 5: Memory Order Too Weak

Symptom: Works in debug mode, fails in release; works on x86, fails on ARM.

Cause: ARM has a weaker memory model than x86. Code that implicitly relies on x86’s strong (total store order) guarantees breaks on ARM.

Debugging:

# Run with ThreadSanitizer
$ clang++ -fsanitize=thread -g -O1 ...

# TSan will report missing synchronization
WARNING: ThreadSanitizer: data race
  Write of size 8 at ... by thread T1:
  Previous read of size 8 at ... by thread T2:

Solution: Start with memory_order_seq_cst everywhere, then carefully weaken:

// 1. Get it working with seq_cst
bottom_.store(b + 1, std::memory_order_seq_cst);

// 2. Analyze which orderings are actually needed
// 3. Weaken to minimum required (acquire/release usually sufficient)
bottom_.store(b + 1, std::memory_order_release);

// 4. Verify with TSan on both x86 and ARM (or use QEMU)

Pitfall 6: Thundering Herd on Task Submission

Symptom: Latency spike when submitting tasks after idle period.

Cause: All parked workers wake simultaneously, contend on stealing.

Solution: Wake workers incrementally:

void unpark_one() {
    for (auto& worker : workers_) {
        if (worker.parked.load(std::memory_order_acquire)) {
            {
                std::lock_guard lock(worker.park_mutex);
                worker.parked = false;
            }
            worker.park_cv.notify_one();
            return;  // Only wake ONE worker
        }
    }
}

Extensions & Challenges

Extension 1: Task Priorities

Support priority levels for scheduling:

enum class Priority { Low, Normal, High, Critical };

pool.submit(Priority::High, []{ /* important work */ });

// Implementation: Multiple deques per worker, check high-priority first

Extension 2: Task Dependencies (DAG Scheduling)

Support task graphs with dependencies:

auto a = pool.submit([]{ return compute_a(); });
auto b = pool.submit([]{ return compute_b(); });
auto c = pool.when_all(a, b).then([](int ra, int rb) {
    return ra + rb;
});

Extension 3: NUMA Awareness

On multi-socket systems, prefer stealing from same NUMA node:

// Steal order: same-socket workers first, then cross-socket
std::vector<size_t> steal_order = get_numa_aware_order(worker_id);
for (size_t victim : steal_order) {
    if (Task* t = try_steal(victim)) return t;
}

Extension 4: Continuations (Non-Blocking Then)

Add continuation support without blocking on get():

pool.submit(compute_a)
    .then([](int a) { return compute_b(a); })
    .then([](int b) { log_result(b); });

Extension 5: Coroutine Integration (C++20)

Make pool usable with co_await:

Task<int> async_computation(WorkStealingPool& pool) {
    int a = co_await pool.submit(compute_a);
    int b = co_await pool.submit(compute_b);
    co_return a + b;
}

Challenge: Implement Wait-Free Stealing

The Chase-Lev deque is lock-free but not wait-free (thieves may fail repeatedly). Implement bounded wait-free stealing using:

  • Helping mechanism: thieves help each other complete
  • Combining: batch multiple steal requests

This is research-level difficulty.


Resources

Essential Reading

  1. “C++ Concurrency in Action” Chapter 9 - Anthony Williams
    • Thread pool implementation, task submission, futures
  2. “The Art of Multiprocessor Programming” Chapter 16 - Herlihy & Shavit
    • Work-stealing algorithms, Chase-Lev deque correctness
  3. “Scheduling Multithreaded Computations by Work Stealing” - Blumofe & Leiserson (1999)
    • Original paper proving optimality of work-stealing
  4. “Dynamic Circular Work-Stealing Deque” - Chase & Lev (2005)
    • The lock-free deque algorithm you will implement
  5. “Correct and Efficient Work-Stealing for Weak Memory Models” - Le et al. (2013)
    • Memory ordering correctness for Chase-Lev

Code to Study

Project           Language   Notes
BS::thread_pool   C++        Modern, header-only, excellent docs
Rayon             Rust       Production work-stealing, par_iter
crossbeam-deque   Rust       Clean Chase-Lev implementation
Go runtime        Go         Goroutine work-stealing
Intel TBB         C++        Industrial strength, complex

Tools

  • ThreadSanitizer: Compile with -fsanitize=thread to detect races
  • perf: Profile cache behavior, find false sharing
  • Valgrind/Helgrind: Alternative race detector
  • QEMU: Test on ARM to verify memory ordering

Self-Assessment Checklist

Understanding

  • I can explain why work-stealing outperforms shared queues at scale
  • I can describe the Chase-Lev deque invariants and why they ensure correctness
  • I can explain acquire/release semantics and identify where they’re needed in the deque
  • I can explain what false sharing is and how to detect/fix it
  • I can describe the ABA problem and how 64-bit indices avoid it
  • I understand why LIFO execution preserves cache locality
  • I can explain when and how stealing happens in a balanced vs imbalanced workload

Implementation

  • My deque passes single-threaded push/pop tests
  • My deque passes concurrent stealing tests
  • ThreadSanitizer reports no data races
  • submit() correctly returns futures with results
  • Graceful shutdown works (no hung threads, no leaked tasks)
  • wait_all() correctly blocks until completion

Performance

  • Pool achieves 6x+ speedup on 8 cores for parallel quicksort
  • Task spawn overhead is < 200 ns
  • Cache miss rate is < 5% under load
  • CPU usage drops to near zero when pool is idle (parking works)
  • Performance matches or beats std::async for compute-bound work

Verification

  • Stress tested with 1M+ tasks
  • Tested with concurrent external submitters
  • Tested fork-join recursive patterns (parallel fib, quicksort)
  • Tested on both x86 and ARM (or verified with TSan)
  • Profiled with perf to verify no false sharing

Submission / Completion Criteria

Minimum Viable Completion

  • Basic thread pool with work-stealing deques
  • spawn() and submit() work correctly
  • Tasks execute on worker threads
  • ThreadSanitizer reports no races
  • Clean shutdown without deadlock

Full Completion

  • Lock-free Chase-Lev deque implementation
  • submit() returns working futures
  • wait_all() correctly synchronizes
  • Cache-line aligned worker state
  • Benchmark showing 7x+ speedup on 8 cores
  • Statistics collection (tasks executed, steals)

Excellence (Going Above & Beyond)

  • Comparison benchmark against std::async, simple pool
  • Parallel algorithms (parallel_for, parallel_reduce)
  • Task priorities
  • Continuations (.then())
  • NUMA-aware stealing
  • Detailed performance analysis with perf

Work-stealing thread pools power the most important parallel computing systems. Building one teaches you concurrency at the deepest level: cache coherence, memory ordering, and load balancing. After this project, you will understand what happens inside Rayon, Go’s runtime, and Intel TBB.


This guide was expanded from LEARN_CPP_CONCURRENCY_AND_PARALLELISM.md. For the complete learning path, see the project index.