Project 6: NUMA-Aware Data Structure Library
Build a library of NUMA-aware data structures (sharded map, per-node queue, replicated read cache) that maximize local access.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 3-4 weeks |
| Main Programming Language | C++ |
| Alternative Programming Languages | C, Rust |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 3. The “Service & Support” Model |
| Prerequisites | Concurrency, memory allocation, NUMA basics |
| Key Topics | sharding, replication, locality, synchronization |
1. Learning Objectives
By completing this project, you will:
- Design data structures that keep most operations on local memory.
- Implement sharded and replicated structures with explicit placement.
- Choose key-to-node mapping strategies that avoid hot spots.
- Measure locality (local vs remote access ratios).
- Handle cross-node operations without destroying scalability.
- Evaluate trade-offs between sharding and replication.
- Provide deterministic benchmarks for comparison.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Sharding and Locality-Oriented Design
Fundamentals
Sharding splits a data structure into multiple independent partitions (shards), each owned by a specific NUMA node. Operations on a shard are handled by threads running on that node, keeping memory access local and reducing contention. Sharding is the primary strategy for NUMA-aware design because it aligns data ownership with physical memory placement. The core challenge is mapping keys or tasks to shards in a way that balances load and minimizes cross-node access.
Deep Dive into the Concept
In a shared-memory multiprocessor, the naive approach is to store all data in a single structure protected by a global lock. This ensures correctness but destroys scalability: every thread contends on the same lock and touches the same cache lines. On a NUMA system, it also causes remote memory access because the shared structure resides on a single node. Sharding solves this by partitioning the data. Each shard can be stored in memory allocated on a specific node, and threads pinned to that node operate on it. The lock scope is reduced, and memory access becomes mostly local.
The key question is how to assign work to shards. For key-value structures, the mapping function might be a hash of the key modulo the number of nodes. This works if the key distribution is uniform. For skewed workloads, you need strategies like consistent hashing or dynamic rebalancing. For queues, you might shard by producer or by consumer. For a scheduler, you might shard by task type or affinity. Each choice affects locality and load balance.
Sharding is not free. Cross-shard operations (like range queries or multi-key updates) require coordination and can introduce latency. Some data structures require global ordering (e.g., a priority queue), which is hard to shard without losing semantics. You can mitigate this by introducing a two-level design: local shards for fast operations and a global aggregator for rare cross-node operations. The aggregator can also be replicated to reduce access costs.
NUMA-aware sharding also involves physical placement. Each shard’s memory should be allocated on the node that owns it, using a NUMA-aware allocator (Project 5). Threads should be pinned to that node. Without pinning, the shard might still be accessed remotely, undermining the design. You also need to track locality metrics. A common metric is the percentage of operations that are local vs remote. This can be measured by sampling thread node IDs and checking whether the target shard is local.
Finally, sharding interacts with cache coherence. By reducing shared writes, sharding reduces cache-line bouncing and coherence traffic. This is one of the main reasons it scales. The trade-off is complexity: you must design routing, handle cross-shard operations, and possibly rebalance shards. This project teaches you to make those trade-offs explicitly.
How This Fits in the Project
Sharding is the core design pattern for your NUMA-aware data structures. You will implement a sharded map and a per-node queue using this approach.
Definitions & Key Terms
- Shard -> A partition of a data structure owned by a specific node.
- Key-to-node mapping -> Function that assigns keys to shards.
- Locality -> Accessing data on the same NUMA node.
- Skew -> Uneven distribution of keys or workload.
Mental Model Diagram (ASCII)
Node 0: Shard A (keys hash to 0)
Node 1: Shard B (keys hash to 1)
Threads pinned to Node 0 -> Shard A
Threads pinned to Node 1 -> Shard B
How It Works (Step-by-Step)
- Choose a shard count (often equal to node count).
- Allocate each shard’s memory on its node.
- Pin threads to nodes and route requests by shard.
- Perform operations locally within a shard.
- For cross-shard operations, use a global coordinator.
Invariants: Each shard’s memory is local to its owning node.
Failure modes: skewed key distribution, thread migration, mis-sized shards.
Minimal Concrete Example
#include <cstddef>
#include <cstdint>

// Fibonacci hashing: multiplying by 2^64 / golden ratio scatters
// sequential keys evenly before mapping them onto the shard count.
size_t shard_for_key(uint64_t key, size_t nodes) {
    return (key * 11400714819323198485ull) % nodes;
}
Common Misconceptions
- “Sharding always scales” -> skew can create hot shards.
- “Sharding eliminates all locks” -> each shard still needs synchronization.
- “NUMA locality is automatic” -> you must pin threads and allocate locally.
Check-Your-Understanding Questions
- Why does sharding improve locality on NUMA systems?
- How can skewed keys hurt a sharded design?
- When might you need a global coordinator?
Check-Your-Understanding Answers
- Because each shard is stored locally and accessed by local threads.
- A hot shard becomes a bottleneck while others are underutilized.
- For operations that span multiple shards (range queries, global ordering).
Real-World Applications
- Sharded caches in distributed systems.
- Per-core or per-socket counters in telemetry.
Where You’ll Apply It
- In this project: Sec. 3.1, Sec. 4.2, Sec. 5.10.
- Also used in: P09-numa-aware-thread-pool, P10-numa-aware-database-buffer-pool.
References
- “The Art of Multiprocessor Programming” – Ch. 7
- “Computer Systems: A Programmer’s Perspective” – Ch. 12
Key Insights
Sharding turns global contention into local work, which is exactly what NUMA rewards.
Summary
Sharding is the primary design pattern for NUMA-aware data structures. By partitioning data and aligning each shard with a node, you improve locality and scalability, at the cost of more complex routing and cross-shard coordination.
Homework/Exercises to Practice the Concept
- Implement a simple sharded counter and measure throughput.
- Test uniform vs skewed key distributions.
- Add a global coordinator for a cross-shard query.
Solutions to the Homework/Exercises
- Throughput should increase compared to a single global counter.
- Skewed keys reduce throughput due to hot shard contention.
- Coordinator adds overhead but enables cross-shard queries.
2.2 Replication, Consistency, and Synchronization
Fundamentals
Replication duplicates data across nodes to improve read locality. Each node keeps a local copy, so reads are fast, but writes must update multiple replicas. The consistency model determines how and when updates propagate. Strong consistency ensures all replicas are identical but requires synchronization and communication; eventual consistency allows temporary divergence but improves performance. In NUMA-aware data structures, replication is often used for read-heavy workloads, while writes are either centralized or coordinated with lightweight protocols.
Deep Dive into the Concept
Replication is a classic trade-off between read latency and write complexity. In a NUMA system, reads from remote memory are slower than local reads. Replicating read-mostly data structures (such as read-only caches or configuration data) allows each node to serve reads locally. Writes, however, must update all replicas. This can be done with a global lock (strong consistency), with versioning and epoch-based updates (read-copy-update), or with a single writer thread that propagates updates asynchronously.
For this project, you can implement a replicated read cache where each node has a local map. Writes are funneled through a global update queue. Each node periodically applies updates to its local copy. This provides eventual consistency but keeps read latency low. For stronger consistency, you can use a global lock or sequence number: writers increment a global version and update each replica in order, and readers can optionally check the version if they need stronger guarantees.
Synchronization is another core concept. Each shard or replica must protect its internal state. For per-node structures, a simple mutex may suffice. For read-heavy structures, reader-writer locks or RCU-style techniques reduce contention. For queues, you might use lock-free algorithms, but correctness becomes harder. The key is to align synchronization with locality: each node should operate mostly on its local locks and data, and cross-node synchronization should be minimized.
Replication also interacts with cache coherence. Having replicas reduces cross-node sharing because reads hit local memory, but the update process itself can create bursts of coherence traffic. This is why batched updates are common. You can design the library so that updates are applied in batches, reducing overhead while keeping replicas reasonably fresh.
Finally, consistency must be explicit. The library’s API should document whether reads are strongly consistent or eventually consistent. Users must understand the trade-off. In your project, you will include a clear configuration flag for consistency mode and report the latency trade-offs in your benchmark output.
How This Fits in the Project
This concept guides the design of replicated structures in your library and determines how you coordinate updates across nodes.
Definitions & Key Terms
- Replication -> Keeping multiple copies of data across nodes.
- Consistency model -> Rules for how replicas are synchronized.
- Eventual consistency -> Replicas converge over time but may diverge temporarily.
- RCU (Read-Copy-Update) -> Synchronization technique for read-heavy workloads.
- Epoch -> Versioning mechanism for updates.
Mental Model Diagram (ASCII)
Node 0 Replica <-- updates -- Global Writer -- updates --> Node 1 Replica
      |                                                          |
  local reads                                               local reads
How It Works (Step-by-Step)
- Each node maintains its own replica.
- Writes go to a global update queue.
- Replicas periodically apply updates in batches.
- Reads are served locally without cross-node access.
Invariants: Replicas converge to the same state eventually.
Failure modes: update lag, stale reads, excessive cross-node synchronization.
Minimal Concrete Example
// Pseudocode: apply updates in batches
while (updates_available()) {
auto batch = dequeue_updates();
apply_batch(local_replica, batch);
}
Common Misconceptions
- “Replication is always faster” -> Writes become more expensive.
- “Eventual consistency is always acceptable” -> some workloads require strong consistency.
- “Replication eliminates locks” -> replicas still need synchronization.
Check-Your-Understanding Questions
- Why is replication beneficial for read-heavy workloads?
- What trade-off does eventual consistency introduce?
- How can batching reduce update overhead?
Check-Your-Understanding Answers
- Reads are served locally without remote access.
- Reads may see stale data temporarily.
- It amortizes synchronization cost over multiple updates.
Real-World Applications
- Replicated metadata in distributed databases.
- Read-only caches in analytics systems.
Where You’ll Apply It
- In this project: Sec. 3.1, Sec. 4.2, Sec. 5.10.
- Also used in: P10-numa-aware-database-buffer-pool.
References
- “The Art of Multiprocessor Programming” – Ch. 5-6
- “Operating Systems: Three Easy Pieces” – Ch. 30 (concurrency)
Key Insights
Replication trades write complexity for local reads, which NUMA systems reward.
Summary
Replication can dramatically reduce read latency by keeping local copies on each node. The cost is more complex synchronization and a clear choice of consistency model. Your library should make this trade-off explicit and measurable.
Homework/Exercises to Practice the Concept
- Implement a replicated counter with batched updates.
- Measure read latency before and after replication.
- Compare strong vs eventual consistency behavior.
Solutions to the Homework/Exercises
- Batched updates reduce overhead and improve throughput.
- Read latency drops significantly when reads are local.
- Strong consistency increases write latency; eventual consistency risks stale reads.
3. Project Specification
3.1 What You Will Build
A library numa_ds containing at least:
- A sharded hash map.
- A per-node queue.
- A replicated read cache with configurable consistency.
Included: NUMA-aware placement, sharding logic, benchmarking harness. Excluded: full-blown distributed system features or persistence.
3.2 Functional Requirements
- Implement sharded map with key-to-node routing.
- Implement per-node queue with local producers/consumers.
- Implement replicated read cache with update propagation.
- Support explicit node placement using libnuma.
- Provide locality statistics for operations.
- Include a benchmark suite with deterministic workload seeds.
3.3 Non-Functional Requirements
- Performance: Local operations outperform global structures.
- Reliability: Correctness under multi-threaded access.
- Usability: Clear API and example usage.
3.4 Example Usage / Output
$ ./ds_bench --structure=sharded_map --threads=16
Throughput: 4.2 M ops/sec
Remote reads: 3.1%
3.5 Data Formats / Schemas / Protocols
Benchmark JSON output:
{
"structure": "sharded_map",
"threads": 16,
"throughput_mops": 4.2,
"remote_reads_pct": 3.1
}
3.6 Edge Cases
- Skewed key distribution causes hot shard.
- Thread migration breaks locality assumptions.
- Replica update lag leads to stale reads.
3.7 Real World Outcome
You can demonstrate clear locality improvements: sharded and replicated structures show higher throughput and lower remote access compared to global shared structures.
3.7.1 How to Run (Copy/Paste)
cmake -S . -B build && cmake --build build
./build/ds_bench --structure=sharded_map --threads=16 --seed=42
3.7.2 Golden Path Demo (Deterministic)
$ ./ds_bench --structure=sharded_map --threads=16 --seed=42
Throughput: 4.2 M ops/sec
Remote reads: 3.1%
3.7.3 If CLI: Exact Terminal Transcript
$ ./ds_bench --structure=global_map --threads=16 --seed=42
Throughput: 1.7 M ops/sec
Remote reads: 41.6%
3.7.4 Failure Demo (Deterministic)
$ ./ds_bench --structure=unknown
ERROR: unknown structure type
EXIT: 1
Exit Codes:
- 0: success
- 1: invalid arguments
- 2: benchmark failure
4. Solution Architecture
4.1 High-Level Design
+-----------+     +-----------------+     +-----------------+
| API Layer |---->| Shards/Replicas |---->| Benchmark Suite |
+-----------+     +-----------------+     +-----------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Sharded Map | Key routing, local storage | hash vs consistent hashing |
| Per-Node Queue | Local enqueue/dequeue | lock vs lock-free |
| Replica Cache | Update propagation | strong vs eventual consistency |
| Benchmark Suite | Measure locality | fixed seed workloads |
4.3 Data Structures (No Full Code)
struct Shard {
    std::mutex lock;                     // per-shard lock; contention stays local
    std::unordered_map<Key, Value> map;  // shard-local storage on the owning node
};
// Consider alignas(64) on Shard so adjacent shards do not share a cache line.
4.4 Algorithm Overview
Key Algorithm: Sharded Map Get/Put
- Hash key to shard.
- Acquire shard lock.
- Perform operation locally.
- Release lock.
Complexity Analysis:
- Time: O(1) average per operation
- Space: O(N) keys
5. Implementation Guide
5.1 Development Environment Setup
sudo apt-get install -y build-essential cmake libnuma-dev
5.2 Project Structure
numa_ds/
|-- src/
| |-- sharded_map.cpp
| |-- per_node_queue.cpp
| |-- replica_cache.cpp
| +-- bench.cpp
|-- include/
| +-- numa_ds.h
|-- tests/
| +-- deterministic_bench.cpp
|-- CMakeLists.txt
+-- README.md
5.3 The Core Question You’re Answering
“How do I design data structures so that threads mostly touch local memory?”
5.4 Concepts You Must Understand First
- Sharding and key routing.
- Replication and consistency.
- NUMA placement and thread pinning.
5.5 Questions to Guide Your Design
- How will you map keys to shards without hot spots?
- When should you replicate vs shard?
- How will you measure remote access percentage?
5.6 Thinking Exercise
If 90% of operations are reads and 10% are writes, which strategy is better: sharding or replication? Explain.
5.7 The Interview Questions They’ll Ask
- “What is the difference between sharding and replication?”
- “How does NUMA affect lock contention?”
- “What metrics show locality success?”
5.8 Hints in Layers
Hint 1: Start with a simple sharded map.
Hint 2: Add per-node queues.
Hint 3: Add a replicated cache with batched updates.
Hint 4: Benchmark global vs NUMA-aware designs.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Concurrency | “The Art of Multiprocessor Programming” | Ch. 3-5 |
| System performance | “Computer Systems: A Programmer’s Perspective” | Ch. 12 |
5.10 Implementation Phases
Phase 1: Sharded Map (1 week)
Goals: implement key routing and shard-local map.
Tasks:
- Implement shard selection.
- Allocate shards on specific nodes.
Checkpoint: local access >90% under node-local workload.
Phase 2: Per-Node Queue (1 week)
Goals: implement node-local queue.
Tasks:
- Add enqueue/dequeue operations.
- Pin threads to node-local queue.
Checkpoint: throughput improves vs global queue.
Phase 3: Replicated Cache (1 week)
Goals: implement replica cache with update propagation.
Tasks:
- Add update queue and batch apply.
- Expose consistency modes.
Checkpoint: read latency drops with replication.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Key routing | hash vs consistent | hash | simple and fast |
| Replication | strong vs eventual | eventual | higher read performance |
| Queue design | lock vs lock-free | lock | simpler and deterministic |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate correctness | map insert/get, queue order |
| Integration Tests | Verify locality | remote access % |
| Edge Case Tests | Handle skew | hot shard detection |
6.2 Critical Test Cases
- Sharded map throughput > global map throughput.
- Replicated cache reads are local >95% of the time.
- Skewed keys produce detectable imbalance warnings.
6.3 Test Data
keys=1,000,000, skew=zipf(0.9), seed=42
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Hot shard | Low throughput | Rebalance or use consistent hashing |
| Thread migration | Remote access rises | Pin threads |
| Replica lag | Stale reads | Reduce batch interval |
7.2 Debugging Strategies
- Track per-shard hit rates.
- Log remote access percentages per node.
7.3 Performance Traps
- Over-synchronization across nodes destroys locality gains.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add metrics for per-shard queue length.
- Add read-only replica mode.
8.2 Intermediate Extensions
- Implement consistent hashing.
- Add adaptive rebalancing.
8.3 Advanced Extensions
- Add lock-free shard queues.
- Integrate with NUMA-aware allocator from Project 5.
9. Real-World Connections
9.1 Industry Applications
- Sharded caches in large-scale services.
- NUMA-aware in-memory databases.
9.2 Related Open Source Projects
- folly – NUMA-aware data structures.
- tbb – task scheduling and concurrent containers.
9.3 Interview Relevance
- Designing for locality is a core systems interview topic.
10. Resources
10.1 Essential Reading
- “The Art of Multiprocessor Programming” – Ch. 3-5
- “Computer Systems: A Programmer’s Perspective” – Ch. 12
10.2 Video Resources
- Talks on NUMA-aware data structures.
10.3 Tools & Documentation
- libnuma – placement APIs.
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain sharding vs replication.
- I can explain how locality metrics are measured.
11.2 Implementation
- Sharded and replicated structures function correctly.
- Benchmarks show locality improvements.
11.3 Growth
- I can explain these trade-offs in interviews.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Sharded map and per-node queue implemented.
- Basic benchmark shows improvement.
Full Completion:
- Replicated cache implemented with consistency mode.
- Local access >90% under node-local workload.
Excellence (Going Above & Beyond):
- Adaptive rebalancing and lock-free extensions.