Project 2: Work-Stealing Thread Pool

Build a scalable task scheduler with per-worker deques, work stealing, and deterministic benchmarks.

Quick Reference

Attribute                            Value
Difficulty                           Level 3: Advanced
Time Estimate                        2 to 4 weeks
Main Programming Language            C++ (C++17)
Alternative Programming Languages    C, Rust
Coolness Level                       Level 4: Hardcore Tech Flex
Business Potential                   3 - Runtime Core
Prerequisites                        Threads, mutexes, atomics, basic data structures
Key Topics                           Work stealing, Chase-Lev deque, memory ordering, cache behavior, scheduling

1. Learning Objectives

By completing this project, you will:

  1. Build a multi-threaded task scheduler that scales across cores.
  2. Implement a work-stealing deque with correct memory ordering.
  3. Diagnose and fix false sharing and contention bottlenecks.
  4. Provide a stable API for task submission, futures, and shutdown.
  5. Measure scheduler throughput with deterministic benchmarks.

2. All Theory Needed (Per-Concept Breakdown)

Concept 1: Atomics and Memory Ordering

Fundamentals

Thread pools depend on atomic operations to coordinate shared state without data races. Atomics provide a way to read and write memory with defined ordering guarantees. Without them, compilers and CPUs can reorder loads and stores in ways that break the scheduler. The core idea is that each shared variable in a concurrent data structure must be accessed with the correct atomic operations. Some operations only require relaxed ordering because they do not synchronize with other threads, while others need acquire and release semantics to ensure one thread sees the effects of another. If you use mutexes everywhere, you can avoid atomics but pay a high performance cost. A work-stealing deque is typically implemented with atomics to keep the fast path lock-free for the owning worker. Understanding the basic memory model is therefore essential for correctness and performance.

Deep Dive into the concept

In the C++ memory model, each atomic operation has an ordering: relaxed, acquire, release, acquire-release, or sequentially consistent. Relaxed atomics only guarantee atomicity; they do not establish a happens-before relationship. Acquire and release are the most common choices: a release store ensures that all writes before it are visible to a thread that performs a matching acquire load. For a deque, the owning thread often uses relaxed operations when pushing and popping from the bottom because no other thread touches those indices. But when a thief steals from the top, it must synchronize with the owner, and this requires acquire or release semantics.

The danger is that incorrect ordering can pass tests and still be wrong. For example, if you write a task pointer into a slot and then increment a size counter, another thread might see the size update before the task pointer is visible if the store is not released. This is a classic visibility bug. The fix is to ensure that the owner uses a release store when publishing the new task and that thieves use an acquire load when reading the task. Similarly, when a thief wins a race to steal the last element, it must use a compare-and-swap with acquire-release ordering to prevent both threads from returning the same task. The memory model gives you the tools but also lets you make subtle mistakes.

Another aspect is atomic contention. Atomic operations are still expensive because they can force cache line ownership changes. If all workers update the same global counter on every task, performance collapses. This is why task queues are per-worker and why global counters are often approximate. You can design metrics that are eventually consistent rather than perfectly accurate. For example, each worker can maintain a local count and occasionally publish it. This reduces cache bouncing.

Lock-free algorithms also have progress guarantees. A single atomic compare-and-swap in the steal path may be enough, but if you design it poorly you can create livelock where threads repeatedly fail and make no progress. Backoff strategies or stealing heuristics can prevent this. A correct design must define which operations are safe with relaxed semantics and which require stronger ordering. The Chase-Lev deque is the canonical example: the owner uses relaxed operations for push and pop, but the steal uses acquire and compare-and-swap to synchronize. This minimal ordering keeps performance high but still correct.

Finally, atomic memory ordering must align with the language and platform. On x86, the hardware ordering is relatively strong, so relaxed operations often work by accident. On ARM or POWER, the weaker ordering will break those assumptions. A real work-stealing pool must be correct across architectures. This is why you must explicitly choose orderings and document them. Your tests should run on at least one weak-memory emulator or use tools like ThreadSanitizer to detect data races. The key lesson is that atomics are not optional: they are the language of correct scheduling.

How This Fits Into the Projects

This concept controls the correctness of your deque in Section 4.4 and the concurrency decisions in Section 5.10 Phase 2. It also connects to allocator locking in P01-custom-memory-allocator.md and task scheduling in P03-mini-async-runtime.md.

Definitions & Key Terms

  • Atomic operation: A load, store, or read-modify-write that is indivisible and has defined ordering.
  • Acquire/Release: Memory orderings that establish a happens-before relationship between a release store and the acquire load that reads it.
  • CAS (Compare-and-Swap): Atomic operation that updates a value only if it still matches an expected value.
  • Happens-before: An ordering guarantee that the effects of one operation are visible to another.
  • Data race: Two threads access the same memory without synchronization and at least one of them writes.

Mental Model Diagram (ASCII)

Owner thread: write task -> release
Thief thread: acquire -> read task

How It Works (Step-by-Step)

  1. Owner writes task into deque slot.
  2. Owner publishes new tail index with release semantics.
  3. Thief reads head and tail with acquire semantics.
  4. Thief attempts CAS on head to claim task.

Invariants:

  • A task is visible before it is reported as available.
  • Only one thread can claim a task slot.

Failure modes:

  • Task pointer visible late, leading to null tasks.
  • Both threads returning the same task.

Minimal Concrete Example

std::atomic<size_t> head{0}, tail{0};
// Owner publishes the new tail with release so the task write
// becomes visible before the slot is reported as available.
void push(Task t) {
    size_t b = tail.load(std::memory_order_relaxed);  // owner-only index
    buffer[b % cap] = t;                              // write the task first
    tail.store(b + 1, std::memory_order_release);     // then publish it
}

Common Misconceptions

  • “Relaxed atomics are always safe” -> They can break visibility.
  • “x86 behavior is universal” -> ARM and POWER are weaker.
  • “Mutexes are always too slow” -> Sometimes a lock is simpler and fast enough.

Check-Your-Understanding Questions

  1. When do you need acquire/release instead of relaxed?
  2. Why can a CAS on the head index prevent duplicate steals?
  3. Why might a global atomic counter hurt performance?

Check-Your-Understanding Answers

  1. When one thread publishes data that another must see.
  2. It ensures only one thread successfully claims the same task.
  3. It causes cache line bouncing across cores.

Real-World Applications

  • Go runtime uses atomics to coordinate its scheduler.
  • Java ForkJoinPool uses atomic deques with stealing.
  • Lock-free queues in high-frequency trading rely on acquire/release.

Where You’ll Apply It

  • Section 4.4: choosing memory orders for the deque indices.
  • Section 5.10 Phase 2: the CAS-based steal path and its stress test.

References

  • “C++ Concurrency in Action” - Memory model and atomics chapters.
  • “Rust Atomics and Locks” - Practical memory ordering guidance.

Key Insights

Atomics are the language of correct lock-free scheduling; wrong ordering is a silent bug.

Summary

A work-stealing pool is only as correct as its atomic ordering. Explicitly choose memory orders and test on weak memory architectures to avoid latent correctness bugs.

Homework/Exercises to Practice the Concept

  1. Write a single-producer single-consumer ring buffer with atomics.
  2. Create two threads where one publishes a struct and the other reads it; test with relaxed vs acquire/release.
  3. Use ThreadSanitizer to detect a data race in a toy example.

Solutions to the Homework/Exercises

  1. The ring buffer uses release when publishing and acquire when consuming.
  2. Relaxed ordering can show stale values; acquire/release fixes it.
  3. TSAN reports races when atomics or locks are missing.

Concept 2: Work-Stealing Deques and Scheduling

Fundamentals

Work stealing is a scheduling strategy where each worker thread owns a deque of tasks. The worker pushes and pops tasks from the bottom, while idle workers steal from the top. This design minimizes contention because the owner thread uses a fast, mostly lock-free path while thieves perform a slower path only when they are idle. The Chase-Lev deque is the most common algorithm for this model. It uses a circular buffer with head and tail indices and a carefully chosen atomic protocol to allow one owner and many thieves. The scheduler decides when to create tasks, how to split work, and when to steal. If tasks are too large, stealing is rare and load imbalance occurs. If tasks are too small, overhead dominates. Work stealing is therefore both a data structure problem and a workload partitioning problem.

Deep Dive into the concept

The Chase-Lev deque has three core operations: push by the owner, pop by the owner, and steal by a thief. The owner updates the tail and writes into the buffer. Because no other thread writes tail, these operations are simple and often use relaxed atomics. Stealing is the only contention point: a thief reads head and tail and attempts to increment head using CAS. If head equals tail, the deque is empty. If head is one less than tail, there is only one element; in this case the owner and thief can race to claim it, which is why both pop and steal must use a CAS to resolve the race safely.

Scheduling decisions are layered on top. A common approach is to push new tasks to the local deque, allowing depth-first execution, which improves cache locality. Stealing tends to be breadth-first because thieves steal from the top, which contains older tasks. This combination provides good locality for the owner while allowing load balancing across workers. However, the policy is not universally optimal. For some workloads, pushing to the top or using a global queue can reduce overhead. Your scheduler should expose a policy hook or at least document the chosen strategy.

Task granularity is the dominant performance factor. If each task takes too long, idle workers will steal rarely and you get poor utilization. If tasks are too small, overhead of scheduling and stealing dominates. A good heuristic is to ensure that each task does enough work to amortize scheduling overhead by at least 10x. In recursive workloads, you can spawn tasks until a threshold and then process locally. This prevents an explosion of tiny tasks.

The work-stealing algorithm also needs a strategy for idling. When a worker has no tasks, it should attempt to steal from other workers. If it fails repeatedly, it can back off by sleeping or yielding. Aggressive stealing keeps latency low but wastes CPU. Conservative stealing saves CPU but can increase latency. Many runtimes use an exponential backoff or a hybrid strategy: spin for a short time, then sleep or wait on a condition variable.

A scheduler must also handle shutdown. Workers might be sleeping or spinning when the pool is asked to stop. A clean shutdown requires a global flag and a wakeup mechanism. Often a condition variable is used to wake sleeping workers. For lock-free deques, you may also want to inject sentinel tasks that signal workers to exit. The shutdown path is part of the correctness model; if you forget to handle it, the pool can hang at exit.

How This Fits Into the Projects

This concept shapes the scheduler in Section 3.2 Functional Requirements and the deque design in Section 4.4. It also informs the task model in P03-mini-async-runtime.md and the parallelism assumptions used in P05-high-performance-string-search.md.

Definitions & Key Terms

  • Work stealing: Idle threads steal tasks from busy threads to balance load.
  • Deque: Double-ended queue used by workers and thieves.
  • Chase-Lev deque: A lock-free deque algorithm optimized for single owner, multiple thieves.
  • Task granularity: The amount of work per scheduled task.
  • Backoff: Strategy for waiting when no work is available.

Mental Model Diagram (ASCII)

Worker A deque: [top ... bottom]
Owner pops bottom, thief steals top

How It Works (Step-by-Step)

  1. Worker pushes tasks to its local deque bottom.
  2. Worker pops tasks from the bottom and executes them.
  3. Idle worker reads another deque and attempts to steal from the top.
  4. If steal succeeds, the thief executes the stolen task.
  5. If all deques are empty, workers back off or sleep.

Invariants:

  • The owner is the only thread that mutates the tail index.
  • At most one thread can claim a task at a given deque index.

Failure modes:

  • A race on the last element causes duplicate execution.
  • Excessive stealing causes high contention.

Minimal Concrete Example

// Simplified steal path (thief side)
bool steal(Task& out) {
    size_t h = head.load(std::memory_order_acquire);
    size_t t = tail.load(std::memory_order_acquire);
    if (h >= t) return false;            // observed empty
    Task task = buffer[h % cap];         // read before claiming the slot
    if (head.compare_exchange_strong(h, h + 1,
            std::memory_order_acq_rel)) {
        out = task;                      // this thief won the slot
        return true;
    }
    return false;                        // lost the race; caller may retry
}

Common Misconceptions

  • “A global queue is enough” -> It becomes a contention hotspot.
  • “Stealing always helps” -> It can add overhead if tasks are tiny.
  • “Depth-first is always best” -> Some workloads benefit from breadth-first.

Check-Your-Understanding Questions

  1. Why do owners use the bottom and thieves use the top?
  2. What happens when owner and thief race for the last task?
  3. How does task granularity affect stealing frequency?

Check-Your-Understanding Answers

  1. It minimizes contention because owner and thieves operate on opposite ends.
  2. A CAS resolves which thread wins, preventing duplicate execution.
  3. Coarse tasks reduce stealing; fine tasks increase overhead.

Real-World Applications

  • Go runtime uses per-P queues with stealing.
  • Java ForkJoinPool is based on the same work-stealing model.
  • Many game engines schedule jobs using work stealing.

Where You’ll Apply It

  • Section 3.2: the work-stealing functional requirement.
  • Section 4.4: the deque data structure and steal algorithm.

References

  • “The Art of Multiprocessor Programming” - Work stealing chapters.
  • “Structured Parallel Programming” - Task scheduling models.

Key Insights

Work stealing provides scalable scheduling by keeping the fast path local and pushing contention to the idle path.

Summary

The Chase-Lev deque is the backbone of work stealing. It works because the owner is single-writer and thieves coordinate with atomic CAS. The scheduler policies on top determine real performance.

Homework/Exercises to Practice the Concept

  1. Implement a single-threaded deque and test push/pop.
  2. Add a second thread that only steals and measure contention.
  3. Simulate task sizes and observe throughput vs overhead.

Solutions to the Homework/Exercises

  1. The deque should preserve FIFO order from the top and LIFO from the bottom.
  2. Contention rises when tasks are small and steal frequency is high.
  3. Throughput peaks when task size is large enough to amortize scheduling cost.

Concept 3: Cache Coherence, False Sharing, and Performance

Fundamentals

Performance of a thread pool is often limited by cache coherence, not by CPU speed. When multiple threads repeatedly modify variables on the same cache line, the cache line bounces between cores, causing stalls. This is called false sharing. It happens when unrelated fields share a cache line and are written by different threads. A work-stealing pool is particularly sensitive because head and tail indices are updated frequently. If those indices share a cache line, you will see massive contention. The fix is to add padding or align those fields to cache line size. Another performance issue is oversubscription: too many worker threads create context switching overhead and hurt throughput. Good performance requires tuning thread counts, task granularity, and memory layout.

Deep Dive into the concept

Modern CPUs maintain coherence across caches using protocols like MESI. A cache line is usually 64 bytes. When one core writes to a line, other cores must invalidate their copies. If multiple threads write different variables in the same line, the line constantly invalidates and revalidates, creating coherence traffic. This is false sharing. In a deque, head and tail indices are updated by different threads. If they are on the same line, the owner and thieves fight over the line. The fix is to put them on separate cache lines using padding or alignment. Another hotspot is the global worker array: if each worker updates its own status field but the fields are adjacent, you get the same issue.

Performance also depends on how you measure it. Microbenchmarks can lie because they fit in cache and avoid real contention. A realistic benchmark should include task sizes, random distribution, and warm-up. Determinism matters: you should fix the random seed and report the configuration. Otherwise it is impossible to compare changes. Your benchmark CLI must therefore accept a seed and report it in output. You should also capture latency percentiles, not just throughput, because work stealing can create long-tail latency when the queue is empty or when stealing requires synchronization.

Another factor is oversubscription. If you run more worker threads than CPU cores, the OS scheduler will time-slice them. Context switching adds overhead and can break cache locality. For CPU-bound workloads, you generally want one worker per core. For mixed workloads with blocking tasks, you may want more, but then you need a way to detect blocking and compensate. Some runtimes separate blocking work into a dedicated pool. In this project, you can keep it simple: document that tasks should be CPU-bound and avoid blocking calls.

You also need to avoid global locks. Even if your deque is lock-free, a global queue or a global stats counter can dominate runtime. Use per-worker counters and aggregate them occasionally. When you need synchronization, use a condition variable only for idle workers, not on every task. The overall goal is to keep the fast path free of contention and to measure performance in ways that expose real bottlenecks.

How This Fits Into the Projects

This concept shapes the layout of your deque in Section 4.4 and the benchmark design in Section 3.4 and Section 5.10 Phase 3. It is also critical for allocator performance in P01-custom-memory-allocator.md and SIMD scanning in P05-high-performance-string-search.md.

Definitions & Key Terms

  • Cache line: The smallest unit of data moved between caches, typically 64 bytes.
  • False sharing: Unrelated variables on the same cache line that cause coherence traffic.
  • Coherence protocol: Hardware rules that keep caches consistent.
  • Oversubscription: More threads than CPU cores.
  • Tail latency: Slowest percentile of response times (p95, p99).

Mental Model Diagram (ASCII)

[head][tail] -> same cache line -> contention
[head]...... [tail] -> separate lines -> less contention

How It Works (Step-by-Step)

  1. Identify hot variables updated by different threads.
  2. Align and pad them to cache line size.
  3. Benchmark with realistic task sizes.
  4. Measure throughput and tail latency.

Invariants:

  • Hot fields updated by different threads should not share a cache line.
  • Benchmarks must be deterministic to compare changes.

Failure modes:

  • Unexplained throughput drops due to false sharing.
  • Misleading benchmark results due to non-deterministic workloads.

Minimal Concrete Example

// Give each hot index its own 64-byte cache line.
struct alignas(64) PaddedIndex {
    std::atomic<size_t> value;
    char pad[64 - sizeof(std::atomic<size_t>)];  // explicit; alignas alone also suffices
};
static_assert(sizeof(PaddedIndex) == 64, "one cache line per index");

Common Misconceptions

  • “Atomic means fast” -> Atomics can be slow if they cause cache contention.
  • “More threads always means faster” -> Oversubscription hurts throughput.
  • “Average latency is enough” -> Tail latency reveals contention spikes.

Check-Your-Understanding Questions

  1. Why does false sharing slow down a scheduler?
  2. Why should head and tail indices be on different cache lines?
  3. What is the impact of oversubscription on CPU-bound tasks?

Check-Your-Understanding Answers

  1. Cache line bouncing causes stalls on every write.
  2. It reduces coherence traffic between owner and thieves.
  3. Context switching overhead increases and cache locality decreases.

Real-World Applications

  • High-performance thread pools in web servers focus on cache line layout.
  • Game engines align job system data to avoid contention.

Where You’ll Apply It

  • Section 4.4: padding the deque indices onto separate cache lines.
  • Section 3.4 and Section 5.10 Phase 3: deterministic benchmark design.

References

  • “Computer Systems: A Programmer’s Perspective” - Cache memory chapters.
  • “Systems Performance” by Brendan Gregg - CPU cache behavior.

Key Insights

The fastest scheduler is the one that avoids shared cache lines and measures performance realistically.

Summary

False sharing and poor benchmarks are the silent killers of a thread pool. Align hot fields, avoid global counters, and use deterministic workloads to measure improvements.

Homework/Exercises to Practice the Concept

  1. Write a microbenchmark that increments two atomics with and without padding.
  2. Measure throughput with 1, 2, 4, and 8 worker threads.
  3. Compare throughput with task sizes of 100ns vs 10us.

Solutions to the Homework/Exercises

  1. Padded atomics should show higher throughput.
  2. Throughput saturates at core count, then declines.
  3. Larger task sizes improve efficiency by amortizing scheduling overhead.

3. Project Specification

3.1 What You Will Build

A work-stealing thread pool library that provides a clean API for task submission, futures, and controlled shutdown. The pool uses per-worker deques, implements the Chase-Lev algorithm for stealing, and includes a deterministic benchmark tool. The library must be safe under heavy contention and should scale on 4 or more cores. The project intentionally excludes advanced task cancellation or priorities, but it must handle task exceptions or failures gracefully.

3.2 Functional Requirements

  1. Pool creation: Create a pool with a configurable number of worker threads.
  2. Task submission: Submit tasks and receive a future-like handle.
  3. Work stealing: Idle workers steal from other workers.
  4. Shutdown: Gracefully stop workers and wait for completion.
  5. Benchmark CLI: Deterministic throughput and latency measurements.
  6. Error handling: Task failures are captured and reported to the caller.

3.3 Non-Functional Requirements

  • Performance: Linear or near-linear scaling for CPU-bound workloads.
  • Reliability: No data races under ThreadSanitizer; safe shutdown.
  • Usability: Simple API with examples and documented behavior.

3.4 Example Usage / Output

$ ./poolbench --seed 4242 --tasks 200000 --size 2000
poolbench 1.0
threads=8 seed=4242 tasks=200000
throughput=7.2M tasks/s
p50=1.1us p99=9.4us

3.5 Data Formats / Schemas / Protocols

Benchmark results are emitted as JSON:

{
  "threads": 8,
  "seed": 4242,
  "tasks": 200000,
  "task_size_ns": 2000,
  "throughput": 7200000,
  "latency_us": {"p50": 1.1, "p99": 9.4}
}

3.6 Edge Cases

  • Submitting zero tasks returns immediately.
  • Submitting tasks after shutdown returns an error.
  • Tasks that throw exceptions are captured in the future.
  • Worker count of 1 should behave like a serial executor.

3.7 Real World Outcome

A reusable thread pool library and a benchmark that proves scaling on multicore hardware.

3.7.1 How to Run (Copy/Paste)

make
./poolbench --seed 4242 --tasks 200000 --size 2000

3.7.2 Golden Path Demo (Deterministic)

  • Use seed 4242 and fixed task size.
  • Expect repeatable throughput within a small variance.

3.7.3 If CLI: Exact Terminal Transcript

$ ./poolbench --seed 4242 --tasks 50000 --size 1000
poolbench 1.0
threads=4 seed=4242 tasks=50000
throughput=4.1M tasks/s
p50=0.9us p99=7.8us
exit_code=0

$ ./poolbench --tasks -1
poolbench: invalid tasks value
usage: poolbench --tasks N --size NS
exit_code=2

3.7.4 If Web App

Not applicable.

3.7.5 If API

Not applicable.

3.7.6 If Library

Install/import

#include "thread_pool.h"

Minimal usage

ThreadPool pool(4);
auto fut = pool.submit([](){ return 42; });
int result = fut.get();

Error handling

auto fut = pool.submit([](){ throw std::runtime_error("fail"); });
try { fut.get(); } catch (...) { /* handle error */ }

3.7.7 If GUI / Desktop / Mobile

Not applicable.

3.7.8 If TUI

Not applicable.


4. Solution Architecture

4.1 High-Level Design

+--------------+    +-------------------+
| API Layer    | -> | Task Submission   |
+--------------+    +-------------------+
        |                   |
        v                   v
+--------------+    +-------------------+
| Worker Deque | <- | Work Stealing     |
+--------------+    +-------------------+
        |
        v
+----------------+
| Worker Threads |
+----------------+

4.2 Key Components

Component      Responsibility                     Key Decisions
Worker deque   Store tasks per worker             Chase-Lev algorithm
Scheduler      Decide where to push tasks         Local-first policy
Futures        Return results and errors          Shared state and condition variable
Benchmark      Generate deterministic workloads   Fixed seed and task size

4.3 Data Structures (No Full Code)

struct Deque {
    alignas(64) std::atomic<size_t> head;  // thieves CAS this; own cache line
    alignas(64) std::atomic<size_t> tail;  // owner-only writes; own cache line
    Task* buffer;                          // circular task storage
};

struct Worker {
    Deque deque;
    std::thread thread;
};

4.4 Algorithm Overview

Key Algorithm: Steal

  1. Read head and tail with acquire.
  2. If head < tail, try CAS to increment head.
  3. If CAS succeeds, return task.

Complexity Analysis:

  • Time: O(1) expected for push/pop/steal.
  • Space: O(n) for tasks in queues.

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install build-essential
make

5.2 Project Structure

thread_pool/
|-- src/
|   |-- thread_pool.cpp
|   |-- deque.cpp
|   `-- bench.cpp
|-- include/
|   `-- thread_pool.h
|-- tests/
|   `-- test_pool.cpp
`-- Makefile

5.3 The Core Question You’re Answering

“How do you schedule thousands of tasks across many cores without a global lock bottleneck?”

5.4 Concepts You Must Understand First

  1. Memory ordering and atomics
  2. Work-stealing deque mechanics
  3. Cache line behavior and false sharing

5.5 Questions to Guide Your Design

  1. How many tasks should be queued per worker before stealing begins?
  2. How will you implement shutdown without race conditions?
  3. What ordering do your atomic operations require?

5.6 Thinking Exercise

Consider a pool of 4 workers and 1 task. Two workers attempt to steal.

  • How do you prevent the same task from running twice?
  • What ordering must the CAS use?

5.7 The Interview Questions They’ll Ask

  1. How does work stealing reduce contention?
  2. What is the Chase-Lev deque and why is it used?
  3. What is false sharing and how do you avoid it?

5.8 Hints in Layers

Hint 1: Implement a single-producer deque first

Focus on push and pop for the owner only.

Hint 2: Add stealing with CAS

Implement a minimal steal path and test with 2 threads.

Hint 3: Add futures

Wrap tasks in a shared state and return a handle.

Hint 4: Benchmark and pad

Add padding to head/tail and measure speedup.

5.9 Books That Will Help

Topic         Book                                      Chapter
Concurrency   “C++ Concurrency in Action”               Atomics chapters
Scheduling    “The Art of Multiprocessor Programming”   Work stealing

5.10 Implementation Phases

Phase 1: Core Deque (1 week)

Goals:

  • Implement owner push/pop and buffer resizing

Tasks:

  1. Build ring buffer and push/pop.
  2. Add unit tests for ordering.

Checkpoint: Single-threaded deque tests pass.

Phase 2: Work Stealing (1 week)

Goals:

  • Implement steal path and atomic ordering

Tasks:

  1. Add CAS-based steal.
  2. Add two-thread stress test.

Checkpoint: No duplicate task execution.

Phase 3: Pool + Benchmark (1 to 2 weeks)

Goals:

  • Full thread pool with benchmarks

Tasks:

  1. Add worker threads and futures.
  2. Add benchmark CLI with seed.

Checkpoint: Throughput scales with cores.

5.11 Key Implementation Decisions

Decision     Options                             Recommendation     Rationale
Deque type   Chase-Lev vs lock-based             Chase-Lev          Fast owner operations
Task model   Function pointer vs std::function   std::function      Ease of use
Shutdown     Sentinel tasks vs global flag       Global flag + CV   Simpler and safe

6. Testing Strategy

6.1 Test Categories

Category            Purpose               Examples
Unit Tests          Deque correctness     push/pop/steal order
Integration Tests   Task execution        multiple workers, futures
Edge Case Tests     Shutdown and errors   submit after shutdown

6.2 Critical Test Cases

  1. Steal correctness: no task executes twice.
  2. Shutdown: pool exits cleanly without deadlocks.
  3. Scaling: throughput improves with more threads.

6.3 Test Data

seed: 4242
tasks: 100000
size_ns: 1000

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall                 Symptom          Solution
False sharing           Low throughput   Pad head/tail
Wrong memory ordering   Rare crashes     Use acquire/release
Busy spin on idle       High CPU usage   Add backoff or CV

7.2 Debugging Strategies

  • ThreadSanitizer: Detect data races in the deque.
  • Logging: Log steal attempts with thread ids.

7.3 Performance Traps

  • Global counters updated on every task.
  • Oversubscription causing context switching.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a simple global queue fallback.
  • Add task priorities (low and high).

8.2 Intermediate Extensions

  • Add work stealing across NUMA nodes with affinity.
  • Add a dedicated blocking task pool.

8.3 Advanced Extensions

  • Implement lock-free futures.
  • Add adaptive task splitting based on runtime metrics.

9. Real-World Connections

9.1 Industry Applications

  • Java ForkJoinPool: Work-stealing deques in a production JVM runtime.
  • Rayon: Rust data-parallel runtime built on work stealing.
  • Game engines: Job systems for parallel rendering and physics.
  • Go runtime: Scheduling model with per-P queues and stealing.

9.2 Interview Relevance

  • Explain why work stealing scales better than a global queue.
  • Describe how you avoid false sharing.

10. Resources

10.1 Essential Reading

  • “The Art of Multiprocessor Programming” - work stealing.
  • “C++ Concurrency in Action” - atomics and memory model.

10.2 Video Resources

  • Conference talks on work-stealing schedulers.

10.3 Tools & Documentation

  • ThreadSanitizer for race detection.
  • perf for contention profiling.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why work stealing reduces contention.
  • I can explain acquire vs release ordering.
  • I can describe false sharing and how to avoid it.

11.2 Implementation

  • All functional requirements are met.
  • ThreadSanitizer reports no data races.
  • Benchmarks are deterministic with fixed seeds.

11.3 Growth

  • I can compare work stealing to a global queue design.
  • I can explain my scheduling policy in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Work stealing deque implemented and tested.
  • Thread pool executes tasks correctly.
  • Benchmark CLI runs with deterministic output.

Full Completion:

  • Futures support error propagation.
  • Clean shutdown with no hanging threads.

Excellence (Going Above & Beyond):

  • Demonstrate near-linear scaling on 8 cores.
  • Include detailed perf analysis of false sharing fixes.