Project 9: NUMA-Aware Thread Pool
Build a thread pool with per-node queues and locality-aware work stealing to keep tasks close to their data.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 2-3 weeks |
| Main Programming Language | C++ |
| Alternative Programming Languages | C, Rust |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 3. The “Service & Support” Model |
| Prerequisites | Threading, queues, CPU affinity |
| Key Topics | thread pinning, topology, work stealing, locality metrics |
1. Learning Objectives
By completing this project, you will:
- Discover and represent NUMA topology for scheduling decisions.
- Pin worker threads to specific CPUs and nodes.
- Implement per-node work queues.
- Design a locality-aware work-stealing policy.
- Track local vs remote task execution rates.
- Provide deterministic benchmark workloads for testing.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Topology-Aware Thread Placement and Affinity
Fundamentals
CPU affinity binds a thread to specific CPUs so the OS does not migrate it. On NUMA systems, pinning threads to CPUs within a node keeps their memory accesses local. Without affinity, threads may migrate, causing cache coldness and remote memory access. A NUMA-aware thread pool must understand the machine’s topology, map CPUs to nodes, and pin each worker accordingly. This ensures that per-node queues and data structures are accessed locally.
Deep Dive into the Concept
Linux scheduling is optimized for fairness, not locality. By default, threads can migrate between CPUs, which can improve load balance but harms cache locality and NUMA placement. On NUMA systems, a migrating thread may access memory allocated on the original node, causing remote memory access and higher latency. This can negate the benefits of a NUMA-aware design.
Affinity APIs (such as pthread_setaffinity_np or sched_setaffinity) allow you to restrict a thread to a CPU set. For NUMA-aware scheduling, you typically map workers to nodes: each worker is pinned to a CPU on its node. This aligns with memory placement and reduces cross-node traffic. The mapping can be static (fixed CPU list per node) or dynamic (choose least loaded CPU). Static mapping is simpler and deterministic, which is useful for reproducible benchmarks.
Topology discovery is the first step. You can use libnuma or sysfs to map CPUs to nodes. The distance matrix can help order nodes for stealing: workers should prefer stealing from nearby nodes to reduce latency. You also need to consider hyper-threading: two logical CPUs may share the same physical core. For performance measurements, pinning to physical cores is preferable. Your thread pool should optionally support a mode that uses only one thread per physical core.
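The cpulist files under /sys/devices/system/node/node*/cpulist use a compact range syntax such as "0-3,8-11". A minimal sketch of parsing that format (the helper name parse_cpulist is illustrative, not a library API):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Parse a sysfs cpulist string such as "0-3,8-11" into CPU IDs.
// The same format appears in /sys/devices/system/node/node<N>/cpulist.
std::vector<int> parse_cpulist(const std::string& s) {
    std::vector<int> cpus;
    std::stringstream ss(s);
    std::string range;
    while (std::getline(ss, range, ',')) {
        auto dash = range.find('-');
        if (dash == std::string::npos) {
            cpus.push_back(std::stoi(range));      // single CPU, e.g. "8"
        } else {
            int lo = std::stoi(range.substr(0, dash));
            int hi = std::stoi(range.substr(dash + 1));
            for (int c = lo; c <= hi; ++c)          // inclusive range, e.g. "0-3"
                cpus.push_back(c);
        }
    }
    return cpus;
}
```

Reading one such file per node directory yields the CPU-to-node map without a libnuma dependency; libnuma's numa_node_of_cpu is the alternative route.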
Affinity also interacts with cpusets and containers. If the process is restricted to a subset of CPUs, your pool must adapt. It should detect the available CPUs and nodes and distribute workers accordingly. If only one node is visible, the pool should gracefully fall back to a UMA-style global queue.
A key design decision is whether to allow dynamic migration for load balancing. Allowing migration can improve throughput but may reduce locality. For this project, you will keep workers pinned and rely on work stealing to balance load without migration.
How This Fits into the Project
This concept defines how you place worker threads and why per-node queues are meaningful. Without affinity, your locality statistics are meaningless.
Definitions & Key Terms
- Affinity -> Binding threads to specific CPUs.
- Topology -> Mapping of CPUs to NUMA nodes.
- Hyper-threading -> Multiple logical CPUs sharing a physical core.
- Migration -> OS moving a thread between CPUs.
Mental Model Diagram (ASCII)
Node 0: CPU0 CPU1 CPU2 CPU3 -> Workers 0-3
Node 1: CPU8 CPU9 CPU10 CPU11 -> Workers 4-7
Pinned workers stay within their node
How It Works (Step-by-Step)
- Discover CPU-to-node mapping.
- Assign each worker to a CPU in its node.
- Pin the worker thread to that CPU.
- Create a per-node work queue.
- Track local vs remote executions.
Invariants: Workers remain pinned; each queue is local to a node.
Failure modes: cpuset restrictions, invalid CPU IDs, hyper-threading mismatch.
Minimal Concrete Example
cpu_set_t set; CPU_ZERO(&set); CPU_SET(cpu_id, &set);
int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
if (rc != 0) { /* handle error: cpu_id may be outside the allowed cpuset */ }
Common Misconceptions
- “Affinity is optional” -> Without it, locality assumptions break.
- “Any CPU in a node is equivalent” -> hyper-threading can skew results.
- “Pinning reduces throughput” -> it often improves cache locality.
Check-Your-Understanding Questions
- Why does thread migration hurt NUMA locality?
- How does hyper-threading affect thread placement?
- Why should a pool detect cpuset restrictions?
Check-Your-Understanding Answers
- Threads may access memory on remote nodes, increasing latency.
- Logical CPUs share core resources, which can reduce throughput.
- Because unavailable CPUs cannot be used for pinning or work distribution.
Real-World Applications
- NUMA-aware web servers and database worker pools.
- High-performance computing runtimes.
Where You’ll Apply It
- In this project: Sec. 3.1, Sec. 4.2, Sec. 5.10.
- Also used in: P01-numa-topology-explorer.
References
- “C++ Concurrency in Action” (Williams) – Ch. 2-4
- “Computer Architecture” – Ch. 2
Key Insights
Affinity is the foundation of NUMA-aware scheduling; without it, locality is accidental.
Summary
Topology-aware thread placement ensures that workers stay near their data. Affinity APIs allow you to pin threads, reduce migration, and make per-node queues meaningful.
Homework/Exercises to Practice the Concept
- Pin a thread to CPU 0 and verify with taskset -cp.
- Compare performance with and without affinity in a simple loop.
- Map CPUs to nodes using sysfs and print the mapping.
Solutions to the Homework/Exercises
- The CPU list should show only CPU 0.
- With affinity, variance is lower and throughput more stable.
- /sys/devices/system/node/node*/cpulist provides the mapping.
2.2 Work Stealing and Locality Metrics
Fundamentals
Work stealing balances load by allowing idle workers to take tasks from other queues. In a NUMA-aware pool, stealing should respect topology: workers should prefer stealing from the local node, then from nearby nodes, and only last from far nodes. This reduces remote memory access while still balancing load. Locality metrics (local vs remote task execution percentage) quantify how well the pool preserves locality.
Deep Dive into the Concept
Work stealing is a common technique in parallel runtimes. Each worker has a local queue; when it runs out of work, it steals from another queue. This reduces contention compared to a global queue because most operations are local. In NUMA systems, however, the choice of victim queue matters: stealing from a remote node increases latency and may cause the worker to touch data allocated on that node.
A locality-aware stealing policy uses the distance matrix to order potential victims. For example, a worker on node 0 might first steal from other queues on node 0, then from node 1, then from node 2. The goal is to balance load without excessive remote access. You can encode the policy as an ordered list of nodes per worker. This policy can be static (computed once at startup) or dynamic (adjusted based on runtime metrics).
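The per-worker victim ordering can be derived directly from the NUMA distance matrix (the values exposed in /sys/devices/system/node/node*/distance). A minimal sketch, where steal_order_for is a hypothetical helper, not part of any library:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Build a per-node victim ordering from a NUMA distance matrix.
// Nodes closer to `home` come first; the home node itself leads the
// list because its distance to itself is the smallest entry.
std::vector<int> steal_order_for(int home,
                                 const std::vector<std::vector<int>>& dist) {
    std::vector<int> order(dist.size());
    std::iota(order.begin(), order.end(), 0);   // 0, 1, 2, ...
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return dist[home][a] < dist[home][b]; });
    return order;
}
```

Computing this once at startup gives the static policy described above; a dynamic policy would re-rank victims using runtime steal statistics.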
Locality metrics are essential to validate the design. You can track the node of each task (if tasks carry a preferred node) and compare it to the executing worker’s node. The ratio of local executions to total executions is a key metric. You should also track the number of steals per node distance, and the distribution of queue lengths. These metrics reveal whether the pool is respecting locality or constantly stealing from far nodes.
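The local-vs-remote ratio described above reduces to two per-node counters. A minimal sketch (the LocalityStats name is illustrative); atomics let workers update counters without a lock:

```cpp
#include <atomic>

// Per-node counters for the locality metric: a task counts as local
// when the executing worker's node matches the task's preferred node.
struct LocalityStats {
    std::atomic<long> local{0};
    std::atomic<long> remote{0};

    double local_pct() const {
        long l = local.load(), r = remote.load();
        return (l + r) ? 100.0 * l / (l + r) : 0.0;  // 0% when no tasks ran
    }
};
```

One instance per node, plus a counter per steal distance, is enough to produce the stats output in Sec. 3.5.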
Another important consideration is data affinity. Tasks often operate on data allocated on a specific node. Your API can allow tasks to specify a preferred node, and the pool can enqueue the task on that node’s queue. If the task is stolen, the pool can still track it as a remote execution. This provides feedback to the user and helps tune load balancing thresholds.
Finally, the stealing algorithm must avoid contention. If all workers try to steal from the same queue, you get a new hot spot. A common approach is randomized stealing within a preferred node set. Your design should balance determinism (for tests) with randomness (for real workloads). You can provide a fixed seed for deterministic test runs and a random seed for production.
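Randomized victim selection within a preferred node set can be sketched as follows; VictimPicker is a hypothetical type, and seeding with a fixed value gives the deterministic test mode described above:

```cpp
#include <random>
#include <vector>

// Pick a steal victim uniformly among candidate queues (e.g. the queues
// on the nearest non-empty node tier). A fixed seed makes test runs
// reproducible; production would seed from std::random_device instead.
struct VictimPicker {
    std::mt19937 rng;
    explicit VictimPicker(unsigned seed) : rng(seed) {}

    int pick(const std::vector<int>& candidates) {
        std::uniform_int_distribution<size_t> d(0, candidates.size() - 1);
        return candidates[d(rng)];
    }
};
```

Spreading steals randomly across a tier avoids the hot-victim problem while the tier ordering still preserves distance awareness.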
How This Fits into the Project
This concept defines the pool’s balancing strategy and how you measure its effectiveness. It connects the data structure design (per-node queues) with real workload behavior.
Definitions & Key Terms
- Work stealing -> Idle workers take tasks from others’ queues.
- Locality metric -> Percentage of tasks executed on their preferred node.
- Victim selection -> Strategy for choosing which queue to steal from.
- Distance-aware stealing -> Ordering steals by NUMA distance.
Mental Model Diagram (ASCII)
Worker on Node 0
1) Steal from Node 0 queues
2) Steal from Node 1 queues
3) Steal from Node 2 queues
Distance-aware ordering reduces remote cost
How It Works (Step-by-Step)
- Worker checks local queue.
- If empty, pick victim queue based on distance order.
- Attempt steal; if failed, try next node.
- Execute task and record locality metrics.
Invariants: Local queues preferred; remote steals are last resort.
Failure modes: Excessive stealing, load imbalance, hot victim queue.
Minimal Concrete Example
for (int node : steal_order[worker_node]) {
if (try_steal_from(node)) break;
}
Common Misconceptions
- “Stealing always hurts locality” -> it can preserve throughput when local queues are empty.
- “Random stealing is enough” -> without distance awareness, remote access spikes.
- “Locality metrics are optional” -> they are required to validate design.
Check-Your-Understanding Questions
- Why should stealing be ordered by NUMA distance?
- How do you measure locality success?
- What happens if all workers steal from the same queue?
Check-Your-Understanding Answers
- To reduce remote access latency and interconnect traffic.
- By tracking the percentage of tasks executed on their preferred node.
- Contention increases and throughput drops.
Real-World Applications
- Task schedulers in databases and analytics engines.
- Runtime systems for parallel programming languages.
Where You’ll Apply It
- In this project: Sec. 4.2, Sec. 5.10, Sec. 7.1.
- Also used in: P06-numa-aware-data-structure-library.
References
- “The Art of Multiprocessor Programming” – Ch. 7
- “C++ Concurrency in Action” – Ch. 7
Key Insights
Work stealing preserves throughput; distance-aware stealing preserves locality.
Summary
A NUMA-aware thread pool must balance load without destroying locality. Distance-aware work stealing and explicit locality metrics provide a measurable way to achieve this balance.
Homework/Exercises to Practice the Concept
- Implement random stealing and compare locality metrics.
- Implement distance-aware stealing and compare results.
- Add a threshold where remote stealing is allowed only if local queues are empty for N attempts.
Solutions to the Homework/Exercises
- Random stealing yields higher remote task rates.
- Distance-aware stealing improves locality without reducing throughput much.
- The threshold reduces remote steals while keeping workers busy.
3. Project Specification
3.1 What You Will Build
A NUMA-aware thread pool with per-node queues, pinned workers, and locality-aware work stealing. Tasks can specify a preferred node, and the pool reports locality statistics.
Included: topology discovery, thread pinning, per-node queues, work stealing, stats. Excluded: distributed scheduling or preemption.
3.2 Functional Requirements
- Discover NUMA topology and map CPUs to nodes.
- Spawn and pin workers per node.
- Maintain per-node task queues.
- Allow tasks to specify preferred node.
- Implement distance-aware work stealing.
- Report local vs remote execution statistics.
- Provide deterministic benchmark workloads.
3.3 Non-Functional Requirements
- Performance: locality-aware pool outperforms global queue.
- Reliability: no lost tasks, correct shutdown.
- Usability: clear API and stats output.
3.4 Example Usage / Output
NUMAThreadPool pool(4); // 4 threads per node
pool.submit(task, preferred_node);
pool.stats();
// Node 0: 98.2% local tasks
// Node 1: 97.5% local tasks
// Cross-node steals: 0.6%
3.5 Data Formats / Schemas / Protocols
Stats output (JSON):
{
"node0": {"local_pct": 98.2, "steals": 5},
"node1": {"local_pct": 97.5, "steals": 7},
"cross_node_steals": 12
}
3.6 Edge Cases
- Only one NUMA node visible.
- Uneven queue distribution.
- Tasks without preferred node.
3.7 Real World Outcome
You can demonstrate a thread pool that keeps most tasks local while still balancing load through controlled stealing.
3.7.1 How to Run (Copy/Paste)
cmake -S . -B build && cmake --build build
./build/numa_pool_demo --threads-per-node=4 --seed=42
3.7.2 Golden Path Demo (Deterministic)
$ ./numa_pool_demo --threads-per-node=4 --seed=42
Node 0: 98.2% local tasks
Node 1: 97.5% local tasks
Cross-node steals: 0.6%
3.7.3 If CLI: Exact Terminal Transcript
$ ./numa_pool_demo --threads-per-node=4 --skew=0.8
Node 0: 85.1% local tasks
Node 1: 83.4% local tasks
Cross-node steals: 12.7%
3.7.4 Failure Demo (Deterministic)
$ ./numa_pool_demo --threads-per-node=999
ERROR: too many threads for available CPUs
EXIT: 1
Exit Codes:
- 0: success
- 1: invalid arguments
- 2: initialization failure
4. Solution Architecture
4.1 High-Level Design
+--------------+ +---------------+ +------------------+
| Topology |-->| Worker Pools |-->| Stats Reporter |
+--------------+ +---------------+ +------------------+
^ ^
| |
+--------------+ +---------------+
| Task Router |---->| Per-Node Queues|
+--------------+ +---------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Topology Manager | CPU-to-node mapping | libnuma vs sysfs |
| Worker Pool | Thread creation and pinning | per-node thread count |
| Task Router | Assign tasks to queues | preferred node hint |
| Stealing Engine | Balance load | distance-aware policy |
| Stats | Local vs remote metrics | per-node counters |
4.3 Data Structures (No Full Code)
struct Task { std::function<void()> fn; int preferred_node; };
struct Queue { std::deque<Task> tasks; std::mutex lock; };
4.4 Algorithm Overview
Key Algorithm: Task Execution
- Worker pops from local queue.
- If empty, steal from nearby nodes.
- Execute task and record locality metrics.
- Repeat until shutdown.
Complexity Analysis:
- Time: O(1) average per task
- Space: O(total tasks)
5. Implementation Guide
5.1 Development Environment Setup
sudo apt-get install -y build-essential cmake libnuma-dev
5.2 Project Structure
numa_pool/
|-- src/
| |-- topology.cpp
| |-- pool.cpp
| |-- queue.cpp
| +-- stats.cpp
|-- include/
| +-- numa_pool.h
|-- tests/
| +-- deterministic_workload.cpp
|-- CMakeLists.txt
+-- README.md
5.3 The Core Question You’re Answering
“How do I keep threads and tasks on the same NUMA node?”
5.4 Concepts You Must Understand First
- CPU affinity and topology.
- Work stealing algorithms.
- Locality measurement.
5.5 Questions to Guide Your Design
- How will you order stealing by distance?
- How will you define a task’s preferred node?
- How will you measure remote execution?
5.6 Thinking Exercise
If a task accesses data on node 1 but is executed on node 0, is it better to move the task or move the data? Why?
5.7 The Interview Questions They’ll Ask
- “How does a NUMA-aware thread pool differ from a normal one?”
- “What metrics show locality success?”
- “When should you allow cross-node stealing?”
5.8 Hints in Layers
Hint 1: Start with per-node queues.
Hint 2: Add affinity pinning.
Hint 3: Add distance-aware stealing.
Hint 4: Add metrics for local vs remote.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Concurrency | “C++ Concurrency in Action” | Ch. 2-4 |
| Parallel algorithms | “The Art of Multiprocessor Programming” | Ch. 7 |
5.10 Implementation Phases
Phase 1: Topology + Pinning (1 week)
Goals: discover topology and pin threads.
Tasks:
- Map CPUs to nodes.
- Pin threads to CPUs.
Checkpoint: workers stay on assigned CPUs.
Phase 2: Queues + Stealing (1 week)
Goals: implement per-node queues and work stealing.
Tasks:
- Add per-node queues.
- Implement stealing policy.
Checkpoint: idle workers can steal tasks.
Phase 3: Metrics + Benchmarks (1 week)
Goals: measure locality and performance.
Tasks:
- Track local vs remote tasks.
- Provide deterministic benchmark workloads.
Checkpoint: local tasks >95% under node-local workload.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Steal order | random vs distance-aware | distance-aware | preserves locality |
| Queue type | lock vs lock-free | lock | simpler and deterministic |
| Task routing | explicit node vs auto | explicit | clearer API |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Queue correctness | enqueue/dequeue order |
| Integration Tests | Locality metrics | local vs remote ratio |
| Edge Case Tests | Single node | fallback to global queue |
6.2 Critical Test Cases
- Locality >95% for node-local workload.
- Skewed workload triggers remote steals.
- Invalid thread count exits with code 1.
6.3 Test Data
seed=42, tasks=1,000,000, skew=0.8
expected_local_pct >= 85%
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| No pinning | High variance | Set affinity |
| Over-stealing | Locality drops | Tighten steal thresholds |
| Hot queues | Contention | Balance task routing |
7.2 Debugging Strategies
- Log task routing decisions.
- Track queue lengths per node.
7.3 Performance Traps
- Excessive cross-node stealing can erase NUMA benefits.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add per-task latency metrics.
- Add CSV export of stats.
8.2 Intermediate Extensions
- Add topology-aware task batching.
- Implement adaptive steal thresholds.
8.3 Advanced Extensions
- Integrate with NUMA-aware allocator to co-locate task data.
9. Real-World Connections
9.1 Industry Applications
- Thread pools in database engines.
- Runtime schedulers for parallel workloads.
9.2 Related Open Source Projects
- tbb – work-stealing scheduler.
- folly – executor framework.
9.3 Interview Relevance
- Discussing task scheduling under NUMA constraints.
10. Resources
10.1 Essential Reading
- “C++ Concurrency in Action” – Ch. 2-4
- “The Art of Multiprocessor Programming” – Ch. 7
10.2 Video Resources
- Work-stealing schedulers talks.
10.3 Tools & Documentation
- libnuma – topology and placement APIs.
10.4 Related Projects in This Series
- P01-numa-topology-explorer.
- P06-numa-aware-data-structure-library.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why affinity matters.
- I can explain work stealing and locality metrics.
11.2 Implementation
- Pool runs tasks correctly with locality stats.
- Deterministic benchmarks are reproducible.
11.3 Growth
- I can explain this design in interviews.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Per-node queues and pinned workers.
- Locality stats reported.
Full Completion:
- Distance-aware stealing with deterministic benchmarks.
Excellence (Going Above & Beyond):
- Adaptive scheduling with workload-aware data placement.