Project 9: NUMA-Aware Thread Pool
Build a thread pool with per-node queues and locality-aware work stealing to keep tasks close to their data.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 2-3 weeks |
| Main Programming Language | C++ |
| Alternative Programming Languages | C, Rust |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 3. The “Service & Support” Model |
| Prerequisites | Threading, queues, CPU affinity |
| Key Topics | thread pinning, topology, work stealing, locality metrics |
1. Learning Objectives
By completing this project, you will:
- Discover and represent NUMA topology for scheduling decisions.
- Pin worker threads to specific CPUs and nodes.
- Implement per-node work queues.
- Design a locality-aware work-stealing policy.
- Track local vs remote task execution rates.
- Provide deterministic benchmark workloads for testing.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Topology-Aware Thread Placement and Affinity
Fundamentals
CPU affinity binds a thread to specific CPUs so the OS does not migrate it. On NUMA systems, pinning threads to CPUs within a node keeps their memory accesses local. Without affinity, threads may migrate, causing cache coldness and remote memory access. A NUMA-aware thread pool must understand the machine’s topology, map CPUs to nodes, and pin each worker accordingly. This ensures that per-node queues and data structures are accessed locally.
Deep Dive into the Concept
Linux scheduling is optimized for fairness, not locality. By default, threads can migrate between CPUs, which can improve load balance but harms cache locality and NUMA placement. On NUMA systems, a migrating thread may access memory allocated on the original node, causing remote memory access and higher latency. This can negate the benefits of a NUMA-aware design.
Affinity APIs (such as pthread_setaffinity_np or sched_setaffinity) allow you to restrict a thread to a CPU set. For NUMA-aware scheduling, you typically map workers to nodes: each worker is pinned to a CPU on its node. This aligns with memory placement and reduces cross-node traffic. The mapping can be static (fixed CPU list per node) or dynamic (choose least loaded CPU). Static mapping is simpler and deterministic, which is useful for reproducible benchmarks.
Topology discovery is the first step. You can use libnuma or sysfs to map CPUs to nodes. The distance matrix can help order nodes for stealing: workers should prefer stealing from nearby nodes to reduce latency. You also need to consider hyper-threading: two logical CPUs may share the same physical core. For performance measurements, pinning to physical cores is preferable. Your thread pool should optionally support a mode that uses only one thread per physical core.
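The cpulist files under /sys/devices/system/node/node*/cpulist use a compact range syntax such as "0-3,8-11". A minimal sketch of parsing that format (the helper name parse_cpulist is illustrative, not a library API):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Parse a sysfs cpulist string such as "0-3,8-11" into CPU IDs.
// The same format appears in /sys/devices/system/node/node<N>/cpulist.
std::vector<int> parse_cpulist(const std::string& s) {
    std::vector<int> cpus;
    std::stringstream ss(s);
    std::string range;
    while (std::getline(ss, range, ',')) {
        auto dash = range.find('-');
        if (dash == std::string::npos) {
            cpus.push_back(std::stoi(range));      // single CPU, e.g. "8"
        } else {
            int lo = std::stoi(range.substr(0, dash));
            int hi = std::stoi(range.substr(dash + 1));
            for (int c = lo; c <= hi; ++c)          // inclusive range, e.g. "0-3"
                cpus.push_back(c);
        }
    }
    return cpus;
}
```

Reading one such file per node directory yields the CPU-to-node map without a libnuma dependency; libnuma's numa_node_of_cpu is the alternative route.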
Affinity also interacts with cpusets and containers. If the process is restricted to a subset of CPUs, your pool must adapt. It should detect the available CPUs and nodes and distribute workers accordingly. If only one node is visible, the pool should gracefully fall back to a UMA-style global queue.
A key design decision is whether to allow dynamic migration for load balancing. Allowing migration can improve throughput but may reduce locality. For this project, you will keep workers pinned and rely on work stealing to balance load without migration.
How This Fits into the Project
This concept defines how you place worker threads and why per-node queues are meaningful. Without affinity, your locality statistics are meaningless.
Definitions & Key Terms
- Affinity -> Binding threads to specific CPUs.
- Topology -> Mapping of CPUs to NUMA nodes.
- Hyper-threading -> Multiple logical CPUs sharing a physical core.
- Migration -> OS moving a thread between CPUs.
Mental Model Diagram (ASCII)
Node 0: CPU0 CPU1 CPU2 CPU3 -> Workers 0-3
Node 1: CPU8 CPU9 CPU10 CPU11 -> Workers 4-7
Pinned workers stay within their node
How It Works (Step-by-Step)
- Discover CPU-to-node mapping.
- Assign each worker to a CPU in its node.
- Pin the worker thread to that CPU.
- Create a per-node work queue.
- Track local vs remote executions.
Invariants: Workers remain pinned; each queue is local to a node.
Failure modes: cpuset restrictions, invalid CPU IDs, hyper-threading mismatch.
Minimal Concrete Example
cpu_set_t set; CPU_ZERO(&set); CPU_SET(cpu_id, &set);
int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
if (rc != 0) { /* handle error: cpu_id may be outside the allowed cpuset */ }
Common Misconceptions
- “Affinity is optional” -> Without it, locality assumptions break.
- “Any CPU in a node is equivalent” -> hyper-threading can skew results.
- “Pinning reduces throughput” -> it often improves cache locality.
Check-Your-Understanding Questions
- Why does thread migration hurt NUMA locality?
- How does hyper-threading affect thread placement?
- Why should a pool detect cpuset restrictions?
Check-Your-Understanding Answers
- Threads may access memory on remote nodes, increasing latency.
- Logical CPUs share core resources, which can reduce throughput.
- Because unavailable CPUs cannot be used for pinning or work distribution.
Real-World Applications
- NUMA-aware web servers and database worker pools.
- High-performance computing runtimes.
Where You’ll Apply It
- In this project: Sec. 3.1, Sec. 4.2, Sec. 5.10.
- Also used in: P01-numa-topology-explorer.
References
- “C++ Concurrency in Action” (Williams) – Ch. 2-4
- “Computer Architecture” – Ch. 2
Key Insights
Affinity is the foundation of NUMA-aware scheduling; without it, locality is accidental.
Summary
Topology-aware thread placement ensures that workers stay near their data. Affinity APIs allow you to pin threads, reduce migration, and make per-node queues meaningful.
Homework/Exercises to Practice the Concept
- Pin a thread to CPU 0 and verify with taskset -cp.
- Compare performance with and without affinity in a simple loop.
- Map CPUs to nodes using sysfs and print the mapping.
Solutions to the Homework/Exercises
- The CPU list should show only CPU 0.
- With affinity, variance is lower and throughput more stable.
- /sys/devices/system/node/node*/cpulist provides the mapping.
2.2 Work Stealing and Locality Metrics
Fundamentals
Work stealing balances load by allowing idle workers to take tasks from other queues. In a NUMA-aware pool, stealing should respect topology: workers should prefer stealing from the local node, then from nearby nodes, and only last from far nodes. This reduces remote memory access while still balancing load. Locality metrics (local vs remote task execution percentage) quantify how well the pool preserves locality.
Deep Dive into the Concept
Work stealing is a common technique in parallel runtimes. Each worker has a local queue; when it runs out of work, it steals from another queue. This reduces contention compared to a global queue because most operations are local. In NUMA systems, however, the choice of victim queue matters: stealing from a remote node increases latency and may cause the worker to touch data allocated on that node.
A locality-aware stealing policy uses the distance matrix to order potential victims. For example, a worker on node 0 might first steal from other queues on node 0, then from node 1, then from node 2. The goal is to balance load without excessive remote access. You can encode the policy as an ordered list of nodes per worker. This policy can be static (computed once at startup) or dynamic (adjusted based on runtime metrics).
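The per-worker victim ordering can be derived directly from the NUMA distance matrix (the values exposed in /sys/devices/system/node/node*/distance). A minimal sketch, where steal_order_for is a hypothetical helper, not part of any library:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Build a per-node victim ordering from a NUMA distance matrix.
// Nodes closer to `home` come first; the home node itself leads the
// list because its distance to itself is the smallest entry.
std::vector<int> steal_order_for(int home,
                                 const std::vector<std::vector<int>>& dist) {
    std::vector<int> order(dist.size());
    std::iota(order.begin(), order.end(), 0);   // 0, 1, 2, ...
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return dist[home][a] < dist[home][b]; });
    return order;
}
```

Computing this once at startup gives the static policy described above; a dynamic policy would re-rank victims using runtime steal statistics.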
Locality metrics are essential to validate the design. You can track the node of each task (if tasks carry a preferred node) and compare it to the executing worker’s node. The ratio of local executions to total executions is a key metric. You should also track the number of steals per node distance, and the distribution of queue lengths. These metrics reveal whether the pool is respecting locality or constantly stealing from far nodes.
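The local-vs-remote ratio described above reduces to two per-node counters. A minimal sketch (the LocalityStats name is illustrative); atomics let workers update counters without a lock:

```cpp
#include <atomic>

// Per-node counters for the locality metric: a task counts as local
// when the executing worker's node matches the task's preferred node.
struct LocalityStats {
    std::atomic<long> local{0};
    std::atomic<long> remote{0};

    double local_pct() const {
        long l = local.load(), r = remote.load();
        return (l + r) ? 100.0 * l / (l + r) : 0.0;  // 0% when no tasks ran
    }
};
```

One instance per node, plus a counter per steal distance, is enough to produce the stats output in Sec. 3.5.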
Another important consideration is data affinity. Tasks often operate on data allocated on a specific node. Your API can allow tasks to specify a preferred node, and the pool can enqueue the task on that node’s queue. If the task is stolen, the pool can still track it as a remote execution. This provides feedback to the user and helps tune load balancing thresholds.
Finally, the stealing algorithm must avoid contention. If all workers try to steal from the same queue, you get a new hot spot. A common approach is randomized stealing within a preferred node set. Your design should balance determinism (for tests) with randomness (for real workloads). You can provide a fixed seed for deterministic test runs and a random seed for production.
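Randomized victim selection within a preferred node set can be sketched as follows; VictimPicker is a hypothetical type, and seeding with a fixed value gives the deterministic test mode described above:

```cpp
#include <random>
#include <vector>

// Pick a steal victim uniformly among candidate queues (e.g. the queues
// on the nearest non-empty node tier). A fixed seed makes test runs
// reproducible; production would seed from std::random_device instead.
struct VictimPicker {
    std::mt19937 rng;
    explicit VictimPicker(unsigned seed) : rng(seed) {}

    int pick(const std::vector<int>& candidates) {
        std::uniform_int_distribution<size_t> d(0, candidates.size() - 1);
        return candidates[d(rng)];
    }
};
```

Spreading steals randomly across a tier avoids the hot-victim problem while the tier ordering still preserves distance awareness.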
How This Fits into the Project
This concept defines the pool’s balancing strategy and how you measure its effectiveness. It connects the data structure design (per-node queues) with real workload behavior.
Definitions & Key Terms
- Work stealing -> Idle workers take tasks from others’ queues.
- Locality metric -> Percentage of tasks executed on their preferred node.
- Victim selection -> Strategy for choosing which queue to steal from.
- Distance-aware stealing -> Ordering steals by NUMA distance.
Mental Model Diagram (ASCII)
Worker on Node 0
1) Steal from Node 0 queues
2) Steal from Node 1 queues
3) Steal from Node 2 queues
Distance-aware ordering reduces remote cost
How It Works (Step-by-Step)
- Worker checks local queue.
- If empty, pick victim queue based on distance order.
- Attempt steal; if failed, try next node.
- Execute task and record locality metrics.
Invariants: Local queues preferred; remote steals are last resort.
Failure modes: Excessive stealing, load imbalance, hot victim queue.
Minimal Concrete Example
for (int node : steal_order[worker_node]) {
if (try_steal_from(node)) break;
}
Common Misconceptions
- “Stealing always hurts locality” -> it can preserve throughput when local queues are empty.
- “Random stealing is enough” -> without distance awareness, remote access spikes.
- “Locality metrics are optional” -> they are required to validate design.
Check-Your-Understanding Questions
- Why should stealing be ordered by NUMA distance?
- How do you measure locality success?
- What happens if all workers steal from the same queue?
Check-Your-Understanding Answers
- To reduce remote access latency and interconnect traffic.
- By tracking the percentage of tasks executed on their preferred node.
- Contention increases and throughput drops.
Real-World Applications
- Task schedulers in databases and analytics engines.
- Runtime systems for parallel programming languages.
Where You’ll Apply It
- In this project: Sec. 4.2, Sec. 5.10, Sec. 7.1.
- Also used in: P06-numa-aware-data-structure-library.
References
- “The Art of Multiprocessor Programming” – Ch. 7
- “C++ Concurrency in Action” – Ch. 7
Key Insights
Work stealing preserves throughput; distance-aware stealing preserves locality.
Summary
A NUMA-aware thread pool must balance load without destroying locality. Distance-aware work stealing and explicit locality metrics provide a measurable way to achieve this balance.
Homework/Exercises to Practice the Concept
- Implement random stealing and compare locality metrics.
- Implement distance-aware stealing and compare results.
- Add a threshold where remote stealing is allowed only if local queues are empty for N attempts.
Solutions to the Homework/Exercises
- Random stealing yields higher remote task rates.
- Distance-aware stealing improves locality without reducing throughput much.
- The threshold reduces remote steals while keeping workers busy.
3. Project Specification
3.1 What You Will Build
A NUMA-aware thread pool with per-node queues, pinned workers, and locality-aware work stealing. Tasks can specify a preferred node, and the pool reports locality statistics.
Included: topology discovery, thread pinning, per-node queues, work stealing, stats. Excluded: distributed scheduling or preemption.
3.2 Functional Requirements
- Discover NUMA topology and map CPUs to nodes.
- Spawn and pin workers per node.
- Maintain per-node task queues.
- Allow tasks to specify preferred node.
- Implement distance-aware work stealing.
- Report local vs remote execution statistics.
- Provide deterministic benchmark workloads.
3.3 Non-Functional Requirements
- Performance: locality-aware pool outperforms global queue.
- Reliability: no lost tasks, correct shutdown.
- Usability: clear API and stats output.
3.4 Example Usage / Output
NUMAThreadPool pool(4); // 4 threads per node
pool.submit(task, preferred_node);
pool.stats();
// Node 0: 98.2% local tasks
// Node 1: 97.5% local tasks
// Cross-node steals: 0.6%
3.5 Data Formats / Schemas / Protocols
Stats output (JSON):
{
"node0": {"local_pct": 98.2, "steals": 5},
"node1": {"local_pct": 97.5, "steals": 7},
"cross_node_steals": 12
}
3.6 Edge Cases
- Only one NUMA node visible.
- Uneven queue distribution.
- Tasks without preferred node.
3.7 Real World Outcome
You can demonstrate a thread pool that keeps most tasks local while still balancing load through controlled stealing.
3.7.1 How to Run (Copy/Paste)
cmake -S . -B build && cmake --build build
./build/numa_pool_demo --threads-per-node=4 --seed=42
3.7.2 Golden Path Demo (Deterministic)
$ ./numa_pool_demo --threads-per-node=4 --seed=42
Node 0: 98.2% local tasks
Node 1: 97.5% local tasks
Cross-node steals: 0.6%
3.7.3 If CLI: Exact Terminal Transcript
$ ./numa_pool_demo --threads-per-node=4 --skew=0.8
Node 0: 85.1% local tasks
Node 1: 83.4% local tasks
Cross-node steals: 12.7%
3.7.4 Failure Demo (Deterministic)
$ ./numa_pool_demo --threads-per-node=999
ERROR: too many threads for available CPUs
EXIT: 1
Exit Codes:
- 0: success
- 1: invalid arguments
- 2: initialization failure
4. Solution Architecture
4.1 High-Level Design
+--------------+ +---------------+ +------------------+
| Topology |-->| Worker Pools |-->| Stats Reporter |
+--------------+ +---------------+ +------------------+
^ ^
| |
+--------------+ +---------------+
| Task Router |---->| Per-Node Queues|
+--------------+ +---------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Topology Manager | CPU-to-node mapping | libnuma vs sysfs |
| Worker Pool | Thread creation and pinning | per-node thread count |
| Task Router | Assign tasks to queues | preferred node hint |
| Stealing Engine | Balance load | distance-aware policy |
| Stats | Local vs remote metrics | per-node counters |
4.3 Data Structures (No Full Code)
struct Task { std::function<void()> fn; int preferred_node; };
struct Queue { std::deque<Task> tasks; std::mutex lock; };
4.4 Algorithm Overview
Key Algorithm: Task Execution
- Worker pops from local queue.
- If empty, steal from nearby nodes.
- Execute task and record locality metrics.
- Repeat until shutdown.
Complexity Analysis:
- Time: O(1) average per task
- Space: O(total tasks)
5. Implementation Guide
5.1 Development Environment Setup
sudo apt-get install -y build-essential cmake libnuma-dev
5.2 Project Structure
numa_pool/
|-- src/
| |-- topology.cpp
| |-- pool.cpp
| |-- queue.cpp
| +-- stats.cpp
|-- include/
| +-- numa_pool.h
|-- tests/
| +-- deterministic_workload.cpp
|-- CMakeLists.txt
+-- README.md
5.3 The Core Question You’re Answering
“How do I keep threads and tasks on the same NUMA node?”
5.4 Concepts You Must Understand First
- CPU affinity and topology.
- Work stealing algorithms.
- Locality measurement.
5.5 Questions to Guide Your Design
- How will you order stealing by distance?
- How will you define a task’s preferred node?
- How will you measure remote execution?
5.6 Thinking Exercise
If a task accesses data on node 1 but is executed on node 0, is it better to move the task or move the data? Why?
5.7 The Interview Questions They’ll Ask
- “How does a NUMA-aware thread pool differ from a normal one?”
- “What metrics show locality success?”
- “When should you allow cross-node stealing?”
5.8 Hints in Layers
Hint 1: Start with per-node queues.
Hint 2: Add affinity pinning.
Hint 3: Add distance-aware stealing.
Hint 4: Add metrics for local vs remote.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Concurrency | “C++ Concurrency in Action” | Ch. 2-4 |
| Parallel algorithms | “The Art of Multiprocessor Programming” | Ch. 7 |
5.10 Implementation Phases
Phase 1: Topology + Pinning (1 week)
Goals: discover topology and pin threads.
Tasks:
- Map CPUs to nodes.
- Pin threads to CPUs.
Checkpoint: workers stay on assigned CPUs.
Phase 2: Queues + Stealing (1 week)
Goals: implement per-node queues and work stealing.
Tasks:
- Add per-node queues.
- Implement stealing policy.
Checkpoint: idle workers can steal tasks.
Phase 3: Metrics + Benchmarks (1 week)
Goals: measure locality and performance.
Tasks:
- Track local vs remote tasks.
- Provide deterministic benchmark workloads.
Checkpoint: local tasks >95% under node-local workload.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Steal order | random vs distance-aware | distance-aware | preserves locality |
| Queue type | lock vs lock-free | lock | simpler and deterministic |
| Task routing | explicit node vs auto | explicit | clearer API |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Queue correctness | enqueue/dequeue order |
| Integration Tests | Locality metrics | local vs remote ratio |
| Edge Case Tests | Single node | fallback to global queue |
6.2 Critical Test Cases
- Locality >95% for node-local workload.
- Skewed workload triggers remote steals.
- Invalid thread count exits with code 1.
6.3 Test Data
seed=42, tasks=1,000,000, skew=0.8
expected_local_pct >= 85%
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| No pinning | High variance | Set affinity |
| Over-stealing | Locality drops | Tighten steal thresholds |
| Hot queues | Contention | Balance task routing |
7.2 Debugging Strategies
- Log task routing decisions.
- Track queue lengths per node.
7.3 Performance Traps
- Excessive cross-node stealing can erase NUMA benefits.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add per-task latency metrics.
- Add CSV export of stats.
8.2 Intermediate Extensions
- Add topology-aware task batching.
- Implement adaptive steal thresholds.
8.3 Advanced Extensions
- Integrate with NUMA-aware allocator to co-locate task data.
9. Real-World Connections
9.1 Industry Applications
- Thread pools in database engines.
- Runtime schedulers for parallel workloads.
9.2 Related Open Source Projects
- tbb – work-stealing scheduler.
- folly – executor framework.
9.3 Interview Relevance
- Discussing task scheduling under NUMA constraints.
10. Resources
10.1 Essential Reading
- “C++ Concurrency in Action” – Ch. 2-4
- “The Art of Multiprocessor Programming” – Ch. 7
10.2 Video Resources
- Work-stealing schedulers talks.
10.3 Tools & Documentation
- libnuma – topology and placement APIs.
10.4 Related Projects in This Series
- P01-numa-topology-explorer.
- P06-numa-aware-data-structure-library.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why affinity matters.
- I can explain work stealing and locality metrics.
11.2 Implementation
- Pool runs tasks correctly with locality stats.
- Deterministic benchmarks are reproducible.
11.3 Growth
- I can explain this design in interviews.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Per-node queues and pinned workers.
- Locality stats reported.
Full Completion:
- Distance-aware stealing with deterministic benchmarks.
Excellence (Going Above & Beyond):
- Adaptive scheduling with workload-aware data placement.