Project 3: Work-Stealing Thread Pool

Build a production-quality thread pool with per-thread work queues and work-stealing for load balancing. This is what powers Intel TBB, Rust’s Tokio, and Go’s runtime scheduler.


Quick Reference

Difficulty:       Advanced
Time:             2-4 weeks
Language:         C++17 or later
Alt Languages:    Rust, Go, Java
Coolness Level:   Level 4: Hardcore Tech Flex
Business Value:   Open Core Infrastructure
Knowledge Area:   Thread Pool Design / Work Scheduling
Key Tools:        std::atomic, std::thread, perf, ThreadSanitizer
Main Book:        “C++ Concurrency in Action” by Anthony Williams

Learning Objectives

By completing this project, you will:

  1. Master work-stealing architecture - Understand why per-thread queues with stealing outperform shared queues at scale
  2. Implement the Chase-Lev deque - Build the lock-free data structure that powers industrial work-stealers
  3. Understand cache locality - Learn why LIFO execution preserves hot cache lines while FIFO stealing balances load
  4. Apply memory ordering correctly - Use acquire/release semantics to build correct concurrent algorithms
  5. Avoid false sharing - Structure data to prevent cache line ping-pong between cores
  6. Build type-erased task submission - Use std::packaged_task and templates for generic submit() that returns futures
  7. Design graceful shutdown - Drain work queues and coordinate thread termination without deadlock
  8. Benchmark parallel performance - Measure throughput, latency, and scaling efficiency

Theoretical Foundation

Core Concepts

The Problem: Contention in Shared Queues

Traditional thread pools use a single shared task queue:

                    ┌─────────────────────┐
                    │    Global Queue     │
                    │ [T1][T2][T3][T4]... │
                    └──────────┬──────────┘
                               │
            ┌──────────────────┼──────────────────┐
            │                  │                  │
       ┌────▼────┐        ┌────▼────┐        ┌────▼────┐
       │Worker 0 │        │Worker 1 │        │Worker 2 │
       └─────────┘        └─────────┘        └─────────┘
            │                  │                  │
            └──────────────────┴──────────────────┘
                               │
                          CONTENTION!
                    (All workers fight for lock)

The problem: Every worker must acquire a lock to get or submit work. With 8+ cores and fine-grained tasks, threads spend more time contending for the lock than computing. The result is a lock convoy: workers serialize behind a single lock.

Measured impact:

  • 2 threads: ~10% overhead from contention
  • 8 threads: ~40% overhead from contention
  • 16+ threads: Often slower than single-threaded execution

The Solution: Work-Stealing Architecture

Each worker maintains its own deque (double-ended queue):

┌─────────────────────────────────────────────────────────────────────┐
│                    Work-Stealing Thread Pool                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐             │
│   │  Worker 0   │    │  Worker 1   │    │  Worker 2   │             │
│   │             │    │             │    │             │             │
│   │ ┌─────────┐ │    │ ┌─────────┐ │    │ ┌─────────┐ │             │
│   │ │ Deque 0 │ │    │ │ Deque 1 │ │    │ │ Deque 2 │ │             │
│   │ │[T1][T2] │ │    │ │  [T5]   │ │    │ │   []    │ │             │
│   │ └────┬────┘ │    │ └────┬────┘ │    │ └────┬────┘ │             │
│   │      │      │    │      │      │    │      │      │             │
│   │   BOTTOM    │    │   BOTTOM    │    │   BOTTOM    │             │
│   │  (private)  │    │  (private)  │    │  (private)  │             │
│   └──────┼──────┘    └──────┼──────┘    └──────┼──────┘             │
│          │                  │                  │                     │
│          │    ┌─────────────┴──────────────────┤                     │
│          │    │                                │                     │
│          └────┼────────────────────────────────┘                     │
│               │                                                      │
│          ◄────┴────►  STEALING HAPPENS HERE  ◄────►                  │
│                       (from TOP of victim's deque)                   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Key insight:
  - Owner pushes/pops from BOTTOM (LIFO - cache-hot tasks)
  - Thieves steal from TOP (FIFO - oldest, likely largest tasks)
  - Owner operations: nearly lock-free (fast path has no contention)
  - Stealing: rare, and only contends with other thieves

Why this works:

  1. Owner operations are fast: Push/pop from the private end rarely contend
  2. Cache locality preserved: LIFO execution keeps recently touched data in cache
  3. Natural load balancing: Idle workers steal from busy ones
  4. Stealing takes large tasks: FIFO stealing grabs the oldest (often largest) tasks, maximizing work transfer per steal

Cache Locality and Why It Matters

When a task spawns subtasks, those subtasks reference the same data:

parallel_quicksort(array, 0, 1000000)
    |
    +-- spawns: parallel_quicksort(array, 0, 500000)   // Left half
    |
    +-- spawns: parallel_quicksort(array, 500001, 1000000)  // Right half

The subtask pushed most recently references data the worker just touched. Executing it next (LIFO pop from the bottom) means:
  - That portion of the array is still in L1/L2 cache
  - Cache miss rate: LOW

If tasks were taken FIFO (like a regular queue):
  - By the time the worker reaches that subtask, its data has been evicted
  - Cache miss rate: HIGH

Benchmark evidence: LIFO execution can be 2-3x faster than FIFO for recursive algorithms due to cache effects alone.

Contention Reduction Metrics

Architecture             Lock operations per task    Cache coherence traffic
Single shared queue      2 (push + pop)              High (all cores invalidate)
Per-thread + stealing    ~0.05 (rare steals)         Low (mostly local access)

Work-stealing achieves near-zero contention because:

  • 95%+ of operations are local (no lock, no coherence traffic)
  • Stealing only happens when workers run out of local work
  • When load is balanced, almost no stealing occurs

Why This Matters

Work-stealing is the architecture behind:

  • Intel TBB (Threading Building Blocks): Powers parallel_for, parallel_reduce in C++
  • Rust’s Rayon: The par_iter() you use for parallel iteration
  • Go’s runtime: Goroutine scheduling uses work-stealing across P’s
  • Java’s ForkJoinPool: Standard library work-stealing executor
  • Tokio (Rust async runtime): Work-stealing for async tasks

Understanding this architecture lets you:

  1. Use these libraries effectively (know when they’ll perform well)
  2. Debug scaling problems in parallel code
  3. Build custom schedulers for specialized workloads
  4. Reason about performance of concurrent systems

Historical Context

1990s: The Cilk project at MIT (Blumofe, Leiserson) proved that work-stealing is asymptotically optimal for fork-join parallelism: expected execution time on P processors is at most T_1/P + O(T_inf), where T_1 is total sequential work, P is the number of processors, and T_inf is the critical path length. (The journal version of the proof appeared in 1999.)
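
A quick worked instance of that bound (hypothetical numbers, chosen only to show the arithmetic):

    T_P <= T_1/P + O(T_inf)

    Example: T_1 = 800 ms of total work, T_inf = 5 ms critical path, P = 8
    =>  T_8  <=  800/8 + c*5  =  100 ms plus a small constant times 5 ms

Speedup stays near-linear as long as T_1/P dominates T_inf, i.e. the workload has far more total work than critical-path work.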

2005: The Chase-Lev paper introduced a practical lock-free deque that became the standard implementation.

2010s: TBB, Rayon, Go, and Tokio all adopted work-stealing as their core scheduling strategy.

Today: Work-stealing is considered the gold standard for fork-join and task-parallel scheduling.

Common Misconceptions

Misconception 1: “Work-stealing is just a thread pool with better load balancing” Reality: It’s a fundamentally different architecture. The per-thread deques, the LIFO/FIFO split, and the stealing protocol create qualitatively different behavior.

Misconception 2: “Stealing is expensive because of contention” Reality: When workloads are balanced, almost no stealing occurs. When imbalanced, stealing is rare per-task but transfers significant work per operation.

Misconception 3: “You need lock-free data structures” Reality: You can start with mutex-protected deques. Lock-free (Chase-Lev) improves performance but mutex version works correctly.

Misconception 4: “More stealing = better load balance” Reality: Frequent stealing indicates workload imbalance. Optimal case: zero steals (perfectly distributed work).


Project Specification

What You Will Build

A production-quality work-stealing thread pool library providing:

class WorkStealingPool {
public:
    // Create pool with specified number of workers (default: hardware_concurrency)
    explicit WorkStealingPool(size_t num_workers = 0);

    // Destructor drains queues and joins threads
    ~WorkStealingPool();

    // Submit a task and get a future for its result
    template<typename F, typename... Args>
    auto submit(F&& func, Args&&... args)
        -> std::future<std::invoke_result_t<F, Args...>>;

    // Spawn a fire-and-forget task (no result needed)
    template<typename F>
    void spawn(F&& func);

    // Wait for all currently submitted tasks to complete
    void wait_all();

    // Query pool state
    size_t num_workers() const;
    size_t pending_tasks() const;

    // Statistics
    struct Stats {
        uint64_t tasks_executed;
        uint64_t tasks_stolen;
        uint64_t steal_attempts;
        uint64_t successful_steals;
    };
    Stats stats() const;
};
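
A minimal usage sketch of this interface (illustrative only; it assumes the class compiles as declared above):

#include <iostream>

int main() {
    WorkStealingPool pool;   // defaults to hardware_concurrency workers

    // submit(): returns a future carrying the result
    auto sum = pool.submit([](int a, int b) { return a + b; }, 40, 2);

    // spawn(): fire-and-forget, no future allocated
    for (int i = 0; i < 1000; ++i) {
        pool.spawn([i] { /* independent work for item i */ });
    }

    pool.wait_all();                             // all spawned tasks finished
    std::cout << "sum = " << sum.get() << "\n";  // prints "sum = 42"

    auto s = pool.stats();
    std::cout << s.tasks_executed << " tasks executed, "
              << s.successful_steals << " steals\n";
}   // destructor drains remaining work and joins the workers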

Deliverables

  1. Header-only library: work_stealing_pool.hpp (or split header/source)
  2. Chase-Lev deque implementation: work_stealing_deque.hpp
  3. Demo applications:
    • Parallel Mandelbrot renderer
    • Parallel merge sort
    • Parallel tree traversal
  4. Benchmark suite: Scaling tests, comparison with std::async and simple thread pool
  5. Documentation: API reference, design rationale, performance analysis

Example Output

$ ./thread_pool_benchmark --threads 8

================================================================================
                 WORK-STEALING THREAD POOL BENCHMARK
================================================================================

Configuration:
  Workers: 8 (matching hardware_concurrency)
  Queue type: Lock-free Chase-Lev deque with work-stealing
  CPU: Apple M1 Pro (8 cores)

--------------------------------------------------------------------------------
[BENCHMARK 1] Micro-tasks: 100,000 tasks (1us each)
--------------------------------------------------------------------------------
  Submission time:     12.3 ms (8.1M tasks/sec submission rate)
  Completion time:     145 ms
  Throughput:          689,655 tasks/second

  Work distribution:
    Worker 0: 12,847 tasks (12.8%)
    Worker 1: 12,523 tasks (12.5%)
    Worker 2: 12,891 tasks (12.9%)
    Worker 3: 12,445 tasks (12.4%)
    Worker 4: 12,789 tasks (12.8%)
    Worker 5: 12,612 tasks (12.6%)
    Worker 6: 12,467 tasks (12.5%)
    Worker 7: 11,426 tasks (11.4%)

  Stealing statistics:
    Total steal attempts: 1,247
    Successful steals:    892 (71.5% success rate)
    Work items stolen:    892

  Comparison:
    vs std::async:        2.3x faster
    vs simple thread pool: 1.8x faster

--------------------------------------------------------------------------------
[BENCHMARK 2] Fork-join: Parallel quicksort of 10M integers
--------------------------------------------------------------------------------
  Sequential time:  890 ms
  Parallel time:    118 ms
  Speedup:          7.54x (theoretical max: 8x)
  Efficiency:       94.3%

  Task tree depth:  20 levels
  Total subtasks:   131,071

  Stealing:
    Steals performed: 2,847
    Work balance:     Excellent (max worker deviation: 3.2%)

--------------------------------------------------------------------------------
[BENCHMARK 3] Heterogeneous workload (mixed 1us, 100us, 10ms tasks)
--------------------------------------------------------------------------------
  Task mix: 80% micro (1us), 15% medium (100us), 5% large (10ms)
  Total tasks: 10,000

  Completion time:  524 ms

  Worker utilization:
    Average active time: 96.2%
    Max idle gap:        1.2 ms

  Stealing adapts to workload:
    - Micro tasks: minimal stealing (local execution)
    - Large tasks: stolen efficiently to balance

  vs Round-robin distribution: 1.6x faster

--------------------------------------------------------------------------------
[BENCHMARK 4] Stress test: 1M tasks from 8 external threads
--------------------------------------------------------------------------------
  Concurrent submitters: 8 threads
  Tasks per submitter:   125,000

  All tasks completed:   1.2 seconds
  No races detected:     (TSan clean)
  No deadlocks:          (clean shutdown)

================================================================================
SUMMARY
================================================================================
  Scaling efficiency at 8 cores: 94%+
  Task spawn overhead:           ~80 ns
  Work-stealing latency:         ~200 ns
  Memory overhead per task:      ~64 bytes

  Recommendation: This pool is suitable for production use.
================================================================================

Solution Architecture

High-Level Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                        WorkStealingPool                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                       Public API Layer                               │   │
│   │  submit<F,Args>()  spawn<F>()  wait_all()  stats()                  │   │
│   └────────────────────────────────┬────────────────────────────────────┘   │
│                                    │                                         │
│   ┌────────────────────────────────▼────────────────────────────────────┐   │
│   │                     Task Distribution Layer                          │   │
│   │                                                                      │   │
│   │  External submit → Random worker selection                           │   │
│   │  Internal spawn  → Push to current worker's deque                    │   │
│   │                                                                      │   │
│   └────────────────────────────────┬────────────────────────────────────┘   │
│                                    │                                         │
│   ┌────────────────────────────────▼────────────────────────────────────┐   │
│   │                        Worker Threads                                │   │
│   │                                                                      │   │
│   │  ┌────────────────────────────────────────────────────────────────┐ │   │
│   │  │                      Worker Array                              │ │   │
│   │  │                                                                │ │   │
│   │  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐           │ │   │
│   │  │  │Worker 0 │  │Worker 1 │  │Worker 2 │  │Worker 3 │  ...      │ │   │
│   │  │  │         │  │         │  │         │  │         │           │ │   │
│   │  │  │ thread  │  │ thread  │  │ thread  │  │ thread  │           │ │   │
│   │  │  │ deque   │  │ deque   │  │ deque   │  │ deque   │           │ │   │
│   │  │  │ stats   │  │ stats   │  │ stats   │  │ stats   │           │ │   │
│   │  │  │ parking │  │ parking │  │ parking │  │ parking │           │ │   │
│   │  │  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘           │ │   │
│   │  │       │            │            │            │                 │ │   │
│   │  │       └────────────┴────────────┴────────────┘                 │ │   │
│   │  │                         │                                      │ │   │
│   │  │                    STEALING                                    │ │   │
│   │  │           (idle workers steal from busy ones)                  │ │   │
│   │  │                                                                │ │   │
│   │  └────────────────────────────────────────────────────────────────┘ │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                    Coordination Layer                                │   │
│   │  shutdown_flag    pending_count    park/unpark mechanisms            │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Components

1. Work-Stealing Deque (Chase-Lev)

template<typename T>
class WSDeque {
    // Circular buffer with grow-on-full semantics
    struct Buffer {
        std::atomic<T*>* tasks;
        size_t capacity;
        size_t mask;  // capacity - 1, for modular indexing

        Buffer(size_t cap);
        T* get(int64_t index);
        void put(int64_t index, T* task);
        Buffer* grow(int64_t bottom, int64_t top);
    };

    // Indices:
    //   bottom: owned exclusively by this worker (push/pop)
    //   top: shared with thieves (steal)
    alignas(64) std::atomic<int64_t> bottom{0};
    alignas(64) std::atomic<int64_t> top{0};
    alignas(64) std::atomic<Buffer*> buffer;

public:
    // Owner operations (single-threaded, fast)
    void push(T* task);           // O(1), may grow buffer
    T* pop();                      // O(1), returns nullptr if empty

    // Thief operation (concurrent, lock-free)
    T* steal();                    // O(1), returns nullptr if empty or lost race

    // Queries
    bool empty() const;
    size_t size() const;
};

Deque invariant: 0 <= top <= bottom <= top + capacity (bottom may transiently equal top - 1 inside pop() when the deque is empty, before it is restored)

Visual representation:

Circular buffer (capacity = 8):

    indices:   0   1   2   3   4   5   6   7
             ┌───┬───┬───┬───┬───┬───┬───┬───┐
    buffer:  │   │   │ T1│ T2│ T3│ T4│   │   │
             └───┴───┴───┴───┴───┴───┴───┴───┘
                       ▲               ▲
                       │               │
                      top=2         bottom=6

    Tasks in deque: T1, T2, T3, T4 (4 tasks)

    Owner pushes at bottom:
      buffer[6] = T5; bottom = 7;

    Owner pops from bottom-1:
      bottom = 5; return buffer[5]; // returns T4

    Thief steals from top:
      if (CAS(top, 2, 3)) return buffer[2]; // returns T1

2. Worker State (Cache-Line Aligned)

struct alignas(64) WorkerState {
    // Hot data: accessed every task
    WSDeque<Task> deque;
    std::thread thread;
    size_t id;

    // Parking mechanism (for idle workers)
    std::mutex park_mutex;
    std::condition_variable park_cv;
    std::atomic<bool> parked{false};

    // Cold data: statistics (rarely accessed)
    struct alignas(64) Statistics {
        std::atomic<uint64_t> tasks_executed{0};
        std::atomic<uint64_t> tasks_stolen{0};
        std::atomic<uint64_t> steal_attempts{0};
    } stats;

    // Note: alignas(64) on WorkerState already rounds its size up to a multiple
    // of 64, so adjacent workers in an array cannot share a cache line; explicit
    // tail padding is only needed if that alignment is dropped.
};

Why alignas(64)? Cache lines are typically 64 bytes. Without alignment:

  • Worker 0’s stats and Worker 1’s deque could share a cache line
  • When Worker 1 modifies its deque, Worker 0’s stats are invalidated
  • This “false sharing” causes cache misses even though data is independent
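
A small compile-time sanity check helps catch layout regressions (a sketch assuming 64-byte cache lines; where available, std::hardware_destructive_interference_size from <new> gives the portable value):

static_assert(alignof(WorkerState) >= 64,
              "WorkerState must be cache-line aligned");
static_assert(sizeof(WorkerState) % 64 == 0,
              "WorkerState should occupy whole cache lines so adjacent "
              "workers in an array never share one");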

3. Task Representation

// Type-erased task wrapper
class Task {
public:
    virtual void execute() = 0;
    virtual ~Task() = default;
};

// Concrete task holding any callable
template<typename Func>
class FunctionTask : public Task {
    Func func;
public:
    explicit FunctionTask(Func&& f) : func(std::forward<Func>(f)) {}
    void execute() override { func(); }
};

// Task with result (for submit())
template<typename Func>
class PackagedTask : public Task {
    std::packaged_task<std::invoke_result_t<Func>()> task;
public:
    explicit PackagedTask(Func&& f) : task(std::forward<Func>(f)) {}
    void execute() override { task(); }
    auto get_future() { return task.get_future(); }
};

Data Structures Summary

Structure                      Purpose                            Thread Safety
WSDeque<Task>                  Per-worker task storage            Lock-free (owner + thieves)
WorkerState                    Worker thread context              Owned by one thread
Task (type-erased)             Holds callable + result promise    Immutable after creation
std::atomic<bool> shutdown     Termination signal                 Relaxed atomic
std::atomic<size_t> pending    Outstanding task count             Seq-cst atomic

Implementation Guide

Phase 1: Basic Thread Pool with Shared Queue (Days 1-2)

Goal: Get a working pool first, optimize later.

Steps:

  1. Create shared task queue:
    std::queue<std::function<void()>> queue_;   // matches the lambda pushed in submit()
    std::mutex queue_mutex_;
    std::condition_variable queue_cv_;
    
  2. Implement worker loop:
    void worker_loop() {
        while (true) {
            std::function<void()> task;
            {
                std::unique_lock lock(queue_mutex_);
                queue_cv_.wait(lock, [&] {
                    return shutdown_ || !queue_.empty();
                });
                // Drain remaining tasks before exiting on shutdown
                if (shutdown_ && queue_.empty()) return;
                task = std::move(queue_.front());
                queue_.pop();
            }
            task();
        }
    }
    
  3. Implement submit():
    template<typename F, typename... Args>
    auto submit(F&& f, Args&&... args) {
        using ReturnType = std::invoke_result_t<F, Args...>;
        auto task = std::make_shared<std::packaged_task<ReturnType()>>(
            std::bind(std::forward<F>(f), std::forward<Args>(args)...)
        );
        auto future = task->get_future();
        {
            std::lock_guard lock(queue_mutex_);
            queue_.push([task]() { (*task)(); });
        }
        queue_cv_.notify_one();
        return future;
    }
    

Checkpoint: 1000 tasks complete correctly. Measure baseline throughput.
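
A minimal sketch of that baseline measurement (ThreadPool is a hypothetical name for the Phase 1 class; exact numbers will vary by machine):

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

void measure_baseline(ThreadPool& pool) {
    constexpr int N = 100'000;
    std::atomic<int> done{0};

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        pool.submit([&done] { done.fetch_add(1, std::memory_order_relaxed); });
    }
    // Crude completion wait; a proper wait_all() arrives in Phase 5
    while (done.load(std::memory_order_relaxed) < N) {
        std::this_thread::yield();
    }
    auto secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();

    std::cout << N / secs << " tasks/second (shared-queue baseline)\n";
}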

Phase 2: Lock-Free Deque (Days 3-5)

Goal: Implement the Chase-Lev work-stealing deque.

Key algorithm for push (owner only):

void push(T* task) {
    int64_t b = bottom_.load(std::memory_order_relaxed);
    int64_t t = top_.load(std::memory_order_acquire);
    Buffer* buf = buffer_.load(std::memory_order_relaxed);

    if (b - t >= static_cast<int64_t>(buf->capacity)) {
        // Buffer full, grow it
        buf = buf->grow(b, t);
        buffer_.store(buf, std::memory_order_release);
    }

    buf->put(b, task);
    std::atomic_thread_fence(std::memory_order_release);
    bottom_.store(b + 1, std::memory_order_relaxed);
}

Key algorithm for pop (owner only):

T* pop() {
    int64_t b = bottom_.load(std::memory_order_relaxed) - 1;
    Buffer* buf = buffer_.load(std::memory_order_relaxed);
    bottom_.store(b, std::memory_order_relaxed);

    std::atomic_thread_fence(std::memory_order_seq_cst);

    int64_t t = top_.load(std::memory_order_relaxed);

    if (t <= b) {
        // Non-empty
        T* task = buf->get(b);
        if (t == b) {
            // Last element: race against a concurrent steal
            if (!top_.compare_exchange_strong(
                    t, t + 1,
                    std::memory_order_seq_cst,
                    std::memory_order_relaxed)) {
                // Lost the race; the thief took it
                task = nullptr;
            }
            // Restore bottom either way. Use b + 1 (not t + 1): a failed CAS
            // overwrites t with the value the thief published.
            bottom_.store(b + 1, std::memory_order_relaxed);
        }
        return task;
    } else {
        // Empty: restore bottom
        bottom_.store(b + 1, std::memory_order_relaxed);
        return nullptr;
    }
}

Key algorithm for steal (thieves):

T* steal() {
    int64_t t = top_.load(std::memory_order_acquire);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    int64_t b = bottom_.load(std::memory_order_acquire);

    if (t < b) {
        // Non-empty
        // acquire: must observe the buffer published by a concurrent grow()
        Buffer* buf = buffer_.load(std::memory_order_acquire);
        T* task = buf->get(t);

        if (!top_.compare_exchange_strong(
                t, t + 1,
                std::memory_order_seq_cst,
                std::memory_order_relaxed)) {
            // Lost race with another thief or pop
            return nullptr;
        }
        return task;
    }
    return nullptr;
}
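
push() above relies on Buffer::grow(), which was only declared in the architecture section; a minimal sketch (written as if defined inside Buffer). It intentionally does not free the old buffer, since a concurrent thief may still be reading from it; production code retires old buffers with an epoch scheme or frees them at pool shutdown.

Buffer* grow(int64_t bottom, int64_t top) {
    Buffer* bigger = new Buffer(capacity * 2);
    // Copy the live range [top, bottom); put()/get() mask indices internally,
    // so the same logical indices remain valid in the larger buffer
    for (int64_t i = top; i < bottom; ++i) {
        bigger->put(i, get(i));
    }
    return bigger;
}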

Checkpoint: Single-threaded tests pass. Multi-threaded stress test with TSan reports no races.

Phase 3: Work-Stealing Integration (Days 6-8)

Goal: Each worker has own deque, idle workers steal.

Worker loop with stealing:

void worker_loop(size_t worker_id) {
    // Thread-local worker ID for spawn() to find current deque
    thread_local_worker_id = worker_id;

    while (!shutdown_.load(std::memory_order_relaxed)) {
        Task* task = nullptr;

        // 1. Try local deque first (hot path)
        task = workers_[worker_id].deque.pop();

        // 2. If empty, try stealing
        if (!task) {
            task = try_steal(worker_id);
        }

        if (task) {
            task->execute();
            delete task;
            workers_[worker_id].stats.tasks_executed.fetch_add(
                1, std::memory_order_relaxed);
        } else {
            // 3. No work anywhere, park thread
            park_worker(worker_id);
        }
    }
}
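
park_worker() used above is not shown elsewhere; a minimal sketch using the park_mutex / park_cv / parked fields from WorkerState. Re-checking shutdown and local work under the lock prevents sleeping through a wakeup that raced with the empty check:

void park_worker(size_t worker_id) {
    auto& w = workers_[worker_id];
    std::unique_lock lock(w.park_mutex);
    w.parked.store(true, std::memory_order_relaxed);
    w.park_cv.wait(lock, [&] {
        return shutdown_.load(std::memory_order_relaxed) ||
               !w.deque.empty() ||
               !w.parked.load(std::memory_order_relaxed);   // cleared by unpark
    });
    w.parked.store(false, std::memory_order_relaxed);
}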

Stealing strategy (random victim):

Task* try_steal(size_t my_id) {
    // Start at random victim to avoid herding
    size_t start = random_() % num_workers_;

    for (size_t i = 0; i < num_workers_; ++i) {
        size_t victim = (start + i) % num_workers_;
        if (victim == my_id) continue;

        workers_[my_id].stats.steal_attempts.fetch_add(
            1, std::memory_order_relaxed);

        Task* stolen = workers_[victim].deque.steal();
        if (stolen) {
            workers_[my_id].stats.tasks_stolen.fetch_add(
                1, std::memory_order_relaxed);
            return stolen;
        }
    }
    return nullptr;
}
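
random_() above is deliberately unspecified; any cheap per-worker generator works, since only the fairness of victim selection matters, not statistical quality. A sketch using xorshift64 (the seed scheme is an assumption):

#include <cstdint>

struct XorShift64 {
    uint64_t state;
    explicit XorShift64(uint64_t seed) : state(seed | 1) {}   // avoid all-zero state
    uint64_t operator()() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        return state;
    }
};

// One generator per worker, e.g. seeded from the worker id:
//   XorShift64 random_{0x9E3779B97F4A7C15ull ^ worker_id};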

Task distribution:

template<typename F>
void spawn(F&& func) {
    Task* task = new FunctionTask<std::decay_t<F>>(std::forward<F>(func));

    // thread_local_worker_id is a thread_local int, set to the worker's index
    // inside worker_loop() and -1 on external (non-worker) threads
    if (thread_local_worker_id >= 0) {
        // Called from worker thread: push to local deque (fast path)
        workers_[thread_local_worker_id].deque.push(task);
    } else {
        // Called from external thread: pick a random worker's deque
        size_t target = random_() % num_workers_;
        workers_[target].deque.push(task);
    }

    // Wake one parked worker if any
    unpark_one();
}

Checkpoint: Mandelbrot renders correctly. Work is distributed across workers.

Phase 4: Cache Optimization (Days 9-10)

Goal: Eliminate false sharing, optimize memory layout.

1. Ensure cache-line alignment:

struct alignas(64) WorkerState {
    // Put hot data (deque) first
    WSDeque<Task*> deque;

    // Cold data (statistics) on separate cache line
    struct alignas(64) Stats { ... } stats;

    // Padding to ensure next WorkerState starts on new cache line
    char padding[...];
};

2. Avoid false sharing in buffer:

// BAD: Adjacent task pointers share cache line
T* buffer[N];

// GOOD: Pad each slot to cache line (only if tasks are large)
struct alignas(64) Slot { T* task; };
Slot buffer[N];

3. Profile with perf:

$ perf stat -e cache-misses,cache-references ./benchmark
# Before: 15% cache miss rate
# After:  3% cache miss rate

Checkpoint: Scaling improves to 7x+ speedup on 8 cores.

Phase 5: API Completion (Days 11-14)

Goal: Complete public API with futures, wait_all, shutdown.

submit() with future:

template<typename F, typename... Args>
auto submit(F&& f, Args&&... args)
    -> std::future<std::invoke_result_t<F, Args...>>
{
    using ReturnType = std::invoke_result_t<F, Args...>;

    // Bind arguments
    auto bound = std::bind(std::forward<F>(f),
                           std::forward<Args>(args)...);

    // Wrap in packaged_task for future
    auto packaged = std::make_unique<std::packaged_task<ReturnType()>>(
        std::move(bound));

    auto future = packaged->get_future();

    // Create type-erased task
    Task* task = new PackagedTaskWrapper<ReturnType>(std::move(packaged));

    pending_count_.fetch_add(1, std::memory_order_seq_cst);

    // Submit to appropriate deque
    distribute_task(task);

    return future;
}
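
distribute_task() referenced above is not defined in this guide; it follows the same routing rule as spawn(). A minimal sketch:

void distribute_task(Task* task) {
    if (thread_local_worker_id >= 0) {
        // Submitted from inside a worker: keep it local (LIFO, cache-hot)
        workers_[thread_local_worker_id].deque.push(task);
    } else {
        // Submitted from an external thread: pick a random worker's deque
        workers_[random_() % num_workers_].deque.push(task);
    }
    unpark_one();   // make sure at least one worker is awake to run it
}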

wait_all():

void wait_all() {
    // Participate in work-stealing while waiting.
    // Note: workers must also decrement pending_count_ after executing a task
    // in worker_loop(); otherwise this loop terminates only if the waiting
    // thread happens to execute every remaining task itself.
    while (pending_count_.load(std::memory_order_seq_cst) > 0) {
        // Try to help complete work
        Task* task = try_steal_any();
        if (task) {
            task->execute();
            delete task;
            pending_count_.fetch_sub(1, std::memory_order_seq_cst);
        } else {
            std::this_thread::yield();
        }
    }
}

Graceful shutdown:

~WorkStealingPool() {
    // Signal shutdown
    shutdown_.store(true, std::memory_order_release);

    // Wake all parked workers
    for (auto& worker : workers_) {
        unpark_worker(worker);
    }

    // Join all threads
    for (auto& worker : workers_) {
        if (worker.thread.joinable()) {
            worker.thread.join();
        }
    }

    // Clean up any remaining tasks
    for (auto& worker : workers_) {
        while (Task* task = worker.deque.pop()) {
            delete task;
        }
    }
}

Testing Strategy

Unit Tests

TEST(WSDeque, SingleThreadedPushPop) {
    WSDeque<int> deque;   // element type int; push/pop/steal traffic in int*
    int values[100];

    for (int i = 0; i < 100; ++i) {
        deque.push(&values[i]);
    }
    EXPECT_EQ(deque.size(), 100);

    // LIFO order
    for (int i = 99; i >= 0; --i) {
        EXPECT_EQ(deque.pop(), &values[i]);
    }
    EXPECT_TRUE(deque.empty());
}

TEST(WSDeque, ConcurrentSteal) {
    WSDeque<int> deque;
    int values[10000];
    std::atomic<int> stolen_count{0};

    // Owner pushes
    for (int i = 0; i < 10000; ++i) {
        deque.push(&values[i]);
    }

    // Multiple thieves steal concurrently
    std::vector<std::thread> thieves;
    for (int t = 0; t < 4; ++t) {
        thieves.emplace_back([&] {
            // steal() can return nullptr on a lost CAS race even while the
            // deque is non-empty, so keep retrying until it is drained
            while (!deque.empty()) {
                if (deque.steal() != nullptr) {
                    stolen_count.fetch_add(1);
                }
            }
        });
    }

    for (auto& t : thieves) t.join();
    EXPECT_EQ(stolen_count.load(), 10000);
}

TEST(WorkStealingPool, BasicSubmit) {
    WorkStealingPool pool(4);
    std::atomic<int> counter{0};

    std::vector<std::future<void>> futures;
    for (int i = 0; i < 1000; ++i) {
        futures.push_back(pool.submit([&counter] {
            counter.fetch_add(1);
        }));
    }

    for (auto& f : futures) f.wait();
    EXPECT_EQ(counter.load(), 1000);
}

TEST(WorkStealingPool, ResultPropagation) {
    WorkStealingPool pool(4);

    auto future = pool.submit([](int a, int b) { return a + b; }, 20, 22);
    EXPECT_EQ(future.get(), 42);
}

TEST(WorkStealingPool, WorkStealing) {
    WorkStealingPool pool(4);
    std::array<std::atomic<int>, 4> per_thread_count{};

    // Flood the pool with tasks from this external thread
    for (int i = 0; i < 10000; ++i) {
        pool.submit([&per_thread_count] {
            size_t worker_id = get_current_worker_id();
            per_thread_count[worker_id].fetch_add(1);

            // Simulate work
            std::this_thread::sleep_for(std::chrono::microseconds(10));
        });
    }

    pool.wait_all();

    // Verify work was distributed (stolen)
    for (auto& count : per_thread_count) {
        EXPECT_GT(count.load(), 0);  // Each worker did some work
    }
}

Stress Tests

TEST(WorkStealingPool, StressTest) {
    WorkStealingPool pool(8);
    std::atomic<int> completed{0};
    constexpr int NUM_TASKS = 1'000'000;

    // Hammer from multiple external threads
    std::vector<std::thread> submitters;
    for (int t = 0; t < 8; ++t) {
        submitters.emplace_back([&pool, &completed] {
            for (int i = 0; i < NUM_TASKS / 8; ++i) {
                pool.spawn([&completed] {
                    completed.fetch_add(1);
                });
            }
        });
    }

    for (auto& t : submitters) t.join();
    pool.wait_all();

    EXPECT_EQ(completed.load(), NUM_TASKS);
}

TEST(WorkStealingPool, ForkJoinRecursive) {
    WorkStealingPool pool(8);

    std::function<int64_t(int)> fib = [&pool, &fib](int n) -> int64_t {
        if (n < 2) return n;
        if (n < 20) {
            // Below threshold, compute sequentially
            return fib(n - 1) + fib(n - 2);
        }

        // Fork
        auto f1 = pool.submit([&fib, n] { return fib(n - 1); });
        auto f2 = pool.submit([&fib, n] { return fib(n - 2); });

        // Join. Caution: blocking in get() inside worker tasks can deadlock a
        // plain pool once every worker is blocked; this test assumes the pool
        // (or the calling thread) helps execute pending tasks while waiting.
        return f1.get() + f2.get();
    };

    EXPECT_EQ(fib(35), 9227465);
}

ThreadSanitizer

$ clang++ -std=c++17 -fsanitize=thread -g -O1 -o test_tsan tests/*.cpp
$ ./test_tsan

# MUST report: NO data races detected
# Common issues TSan catches:
#   - Missing memory order in atomics
#   - Data race on task execution
#   - Use-after-free on task deletion

Benchmarking

void benchmark_scaling() {
    std::vector<size_t> thread_counts = {1, 2, 4, 8, 16};

    for (size_t threads : thread_counts) {
        WorkStealingPool pool(threads);

        // Prepare the input before starting the timer so only the sort is measured
        std::vector<int> data(10'000'000);
        std::iota(data.begin(), data.end(), 0);
        std::shuffle(data.begin(), data.end(), std::mt19937{42});

        auto start = std::chrono::high_resolution_clock::now();

        // Parallel quicksort of 10M integers
        parallel_quicksort(pool, data.data(), 0, data.size());

        auto end = std::chrono::high_resolution_clock::now();
        auto ms = std::chrono::duration<double, std::milli>(end - start).count();

        std::cout << threads << " threads: " << ms << " ms\n";
    }
}
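
parallel_quicksort() used by the benchmark is not defined in this guide; a minimal fork-join sketch against the pool API above (the cutoff value and the blocking left.get() are simplifications; see the fork-join caveat in the tests):

#include <algorithm>

void parallel_quicksort(WorkStealingPool& pool, int* data, size_t lo, size_t hi) {
    constexpr size_t CUTOFF = 10'000;               // below this, sort sequentially
    if (hi - lo <= CUTOFF) {
        std::sort(data + lo, data + hi);
        return;
    }

    // Three-way partition so elements equal to the pivot are excluded from
    // both recursive calls (guarantees progress on duplicate-heavy input)
    int pivot = data[lo + (hi - lo) / 2];
    int* mid1 = std::partition(data + lo, data + hi,
                               [pivot](int x) { return x < pivot; });
    int* mid2 = std::partition(mid1, data + hi,
                               [pivot](int x) { return x == pivot; });
    size_t left_end    = static_cast<size_t>(mid1 - data);
    size_t right_begin = static_cast<size_t>(mid2 - data);

    // Fork the left part, recurse on the right part in this thread, then join
    auto left = pool.submit([&pool, data, lo, left_end] {
        parallel_quicksort(pool, data, lo, left_end);
    });
    parallel_quicksort(pool, data, right_begin, hi);
    left.get();
}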

Common Pitfalls & Debugging

Pitfall 1: ABA Problem in Lock-Free Deque

Symptom: Occasional corruption, tasks executed twice, or crashes after many iterations.

Cause: Classic ABA scenario in CAS loop:

  1. Thread A reads top = 5, gets task pointer P
  2. Thread B steals successfully, top becomes 6
  3. More operations wrap around, top becomes 5 again, but with different task Q at that slot
  4. Thread A’s CAS succeeds (5 == 5), returns stale pointer P

Solutions:

// Solution 1: Use 64-bit indices that never wrap in practice (recommended)
// At one billion deque operations per second, a 64-bit counter takes ~585 years to wrap
std::atomic<int64_t> top, bottom;  // int64_t, not int32_t

// Solution 2: Tagged pointer with version counter
struct TaggedTop {
    int64_t index;
    int64_t version;
};
std::atomic<TaggedTop> top;

// Solution 3: Hazard pointers (complex, rarely needed for deque)

Pitfall 2: Starvation Under Heavy Stealing

Symptom: Some workers consistently have zero work while others are overloaded.

Cause: Round-robin stealing that always starts from worker 0 turns it into a hot spot.

Solution: Random victim selection:

// BAD: Always starts from 0
for (size_t victim = 0; victim < num_workers_; ++victim) {
    if (victim != my_id) try_steal(victim);
}

// GOOD: Random start
size_t start = rng() % num_workers_;
for (size_t i = 0; i < num_workers_; ++i) {
    size_t victim = (start + i) % num_workers_;
    if (victim != my_id) try_steal(victim);
}

Pitfall 3: False Sharing Between Workers

Symptom: Scaling stops at 4x even with 8 cores; perf shows high cache miss rate.

Cause: Worker state structures are adjacent in memory, sharing cache lines.

Debugging:

$ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./benchmark

# Before fix: 15% miss rate
# After fix:  3% miss rate

$ perf c2c record ./benchmark
$ perf c2c report
# Shows "false sharing" hot spots

Solution: Cache-line alignment:

// BAD: Workers share cache lines
struct WorkerState {
    std::atomic<int> counter;  // 4 bytes
    WSDeque<Task*> deque;      // starts immediately after
};
WorkerState workers[8];  // Adjacent, false sharing!

// GOOD: Each worker on own cache line(s)
struct alignas(64) WorkerState {
    std::atomic<int> counter;
    WSDeque<Task*> deque;
    char padding[64 - ...];  // Ensure isolation
};

Pitfall 4: Deadlock on Shutdown

Symptom: Program hangs on exit; workers never join.

Cause: Workers are parked (waiting on condition variable) when shutdown is requested.

Solution: Wake all workers after setting shutdown flag:

~WorkStealingPool() {
    shutdown_.store(true, std::memory_order_release);

    // CRITICAL: Wake ALL parked workers
    for (auto& worker : workers_) {
        {
            std::lock_guard lock(worker.park_mutex);
            worker.parked = false;
        }
        worker.park_cv.notify_all();  // notify_all, not notify_one!
    }

    for (auto& worker : workers_) {
        if (worker.thread.joinable()) {
            worker.thread.join();
        }
    }
}

Pitfall 5: Memory Order Too Weak

Symptom: Works in debug mode, fails in release; works on x86, fails on ARM.

Cause: ARM has a weaker memory model than x86. Code that implicitly relies on x86’s strong (total store order) guarantees breaks on ARM.

Debugging:

# Run with ThreadSanitizer
$ clang++ -fsanitize=thread -g -O1 ...

# TSan will report missing synchronization
WARNING: ThreadSanitizer: data race
  Write of size 8 at ... by thread T1:
  Previous read of size 8 at ... by thread T2:

Solution: Start with memory_order_seq_cst everywhere, then carefully weaken:

// 1. Get it working with seq_cst
bottom_.store(b + 1, std::memory_order_seq_cst);

// 2. Analyze which orderings are actually needed
// 3. Weaken to minimum required (acquire/release usually sufficient)
bottom_.store(b + 1, std::memory_order_release);

// 4. Verify with TSan on both x86 and ARM (or use QEMU)

Pitfall 6: Thundering Herd on Task Submission

Symptom: Latency spike when submitting tasks after idle period.

Cause: All parked workers wake simultaneously, contend on stealing.

Solution: Wake workers incrementally:

void unpark_one() {
    for (auto& worker : workers_) {
        if (worker.parked.load(std::memory_order_acquire)) {
            {
                std::lock_guard lock(worker.park_mutex);
                worker.parked = false;
            }
            worker.park_cv.notify_one();
            return;  // Only wake ONE worker
        }
    }
}

Extensions & Challenges

Extension 1: Task Priorities

Support priority levels for scheduling:

enum class Priority { Low, Normal, High, Critical };

pool.submit(Priority::High, []{ /* important work */ });

// Implementation: Multiple deques per worker, check high-priority first

Extension 2: Task Dependencies (DAG Scheduling)

Support task graphs with dependencies:

auto a = pool.submit([]{ return compute_a(); });
auto b = pool.submit([]{ return compute_b(); });
auto c = pool.when_all(a, b).then([](int ra, int rb) {
    return ra + rb;
});

Extension 3: NUMA Awareness

On multi-socket systems, prefer stealing from same NUMA node:

// Steal order: same-socket workers first, then cross-socket
std::vector<size_t> steal_order = get_numa_aware_order(worker_id);
for (size_t victim : steal_order) {
    if (Task* t = try_steal(victim)) return t;
}

Extension 4: Continuations (Non-Blocking Then)

Add continuation support without blocking on get():

pool.submit(compute_a)
    .then([](int a) { return compute_b(a); })
    .then([](int b) { log_result(b); });

Extension 5: Coroutine Integration (C++20)

Make pool usable with co_await:

Task<int> async_computation(WorkStealingPool& pool) {
    int a = co_await pool.submit(compute_a);
    int b = co_await pool.submit(compute_b);
    co_return a + b;
}

Challenge: Implement Wait-Free Stealing

The Chase-Lev deque is lock-free but not wait-free (thieves may fail repeatedly). Implement bounded wait-free stealing using:

  • Helping mechanism: thieves help each other complete
  • Combining: batch multiple steal requests

This is research-level difficulty.


Resources

Essential Reading

  1. “C++ Concurrency in Action” Chapter 9 - Anthony Williams
    • Thread pool implementation, task submission, futures
  2. “The Art of Multiprocessor Programming” Chapter 16 - Herlihy & Shavit
    • Work-stealing algorithms, Chase-Lev deque correctness
  3. “Scheduling Multithreaded Computations by Work Stealing” - Blumofe & Leiserson (1999)
    • Original paper proving optimality of work-stealing
  4. “Dynamic Circular Work-Stealing Deque” - Chase & Lev (2005)
    • The lock-free deque algorithm you will implement
  5. “Correct and Efficient Work-Stealing for Weak Memory Models” - Le et al. (2013)
    • Memory ordering correctness for Chase-Lev

Code to Study

Project           Language   Notes
BS::thread_pool   C++        Modern, header-only, excellent docs
Rayon             Rust       Production work-stealing, par_iter
crossbeam-deque   Rust       Clean Chase-Lev implementation
Go runtime        Go         Goroutine work-stealing
Intel TBB         C++        Industrial strength, complex

Tools

  • ThreadSanitizer: Compile with -fsanitize=thread to detect races
  • perf: Profile cache behavior, find false sharing
  • Valgrind/Helgrind: Alternative race detector
  • QEMU: Test on ARM to verify memory ordering

Self-Assessment Checklist

Understanding

  • I can explain why work-stealing outperforms shared queues at scale
  • I can describe the Chase-Lev deque invariants and why they ensure correctness
  • I can explain acquire/release semantics and identify where they’re needed in the deque
  • I can explain what false sharing is and how to detect/fix it
  • I can describe the ABA problem and how 64-bit indices avoid it
  • I understand why LIFO execution preserves cache locality
  • I can explain when and how stealing happens in a balanced vs imbalanced workload

Implementation

  • My deque passes single-threaded push/pop tests
  • My deque passes concurrent stealing tests
  • ThreadSanitizer reports no data races
  • submit() correctly returns futures with results
  • Graceful shutdown works (no hung threads, no leaked tasks)
  • wait_all() correctly blocks until completion

Performance

  • Pool achieves 6x+ speedup on 8 cores for parallel quicksort
  • Task spawn overhead is < 200 ns
  • Cache miss rate is < 5% under load
  • CPU usage drops to near zero when pool is idle (parking works)
  • Performance matches or beats std::async for compute-bound work

Verification

  • Stress tested with 1M+ tasks
  • Tested with concurrent external submitters
  • Tested fork-join recursive patterns (parallel fib, quicksort)
  • Tested on both x86 and ARM (or verified with TSan)
  • Profiled with perf to verify no false sharing

Submission / Completion Criteria

Minimum Viable Completion

  • Basic thread pool with work-stealing deques
  • spawn() and submit() work correctly
  • Tasks execute on worker threads
  • ThreadSanitizer reports no races
  • Clean shutdown without deadlock

Full Completion

  • Lock-free Chase-Lev deque implementation
  • submit() returns working futures
  • wait_all() correctly synchronizes
  • Cache-line aligned worker state
  • Benchmark showing 7x+ speedup on 8 cores
  • Statistics collection (tasks executed, steals)

Excellence (Going Above & Beyond)

  • Comparison benchmark against std::async, simple pool
  • Parallel algorithms (parallel_for, parallel_reduce)
  • Task priorities
  • Continuations (.then())
  • NUMA-aware stealing
  • Detailed performance analysis with perf

Work-stealing thread pools power the most important parallel computing systems. Building one teaches you concurrency at the deepest level: cache coherence, memory ordering, and load balancing. After this project, you will understand what happens inside Rayon, Go’s runtime, and Intel TBB.


This guide was expanded from LEARN_CPP_CONCURRENCY_AND_PARALLELISM.md. For the complete learning path, see the project index.