Learn NUMA vs UMA Architectures: From Memory Basics to High-Performance Systems
Goal: By the end of this guide you will understand how the memory hierarchy, cache coherence, and NUMA topology interact to produce real-world performance on multi-core systems. You will be able to explain when UMA is sufficient, when NUMA is unavoidable, and how thread placement, memory policies, and allocators change latency and bandwidth. You will build tools that discover topology, measure local vs remote access, detect false sharing, and verify page placement. You will design NUMA-aware allocators, data structures, and schedulers, and you will be able to reason about performance cliffs using evidence from measurements.
Introduction
NUMA (Non-Uniform Memory Access) is a shared-memory architecture where the time to access memory depends on which CPU core and which memory node are involved. UMA (Uniform Memory Access) systems expose one uniform memory latency to all cores. Modern servers and workstations are usually NUMA at the socket or chiplet level, which means performance depends on where data lives and how your threads move across the machine.
What you will build (by the end of this guide):
- A topology explorer that discovers NUMA nodes, CPUs, memory sizes, and distance matrices
- Microbenchmarks that measure local vs remote latency and per-node bandwidth
- A false-sharing detector using hardware performance counters
- NUMA-aware allocators, data structures, and a NUMA-aware thread pool
- A page-migration tool that proves your placement decisions are working
- A capstone NUMA-aware database buffer pool with locality-aware execution
Scope (what is included):
- Linux NUMA topology discovery (sysfs, numactl, libnuma, hwloc)
- Memory policies and page placement (first-touch, bind, interleave, preferred)
- Thread placement, CPU affinity, and NUMA-aware scheduling patterns
- Cache coherence, false sharing, and cache-to-cache traffic analysis
- Measurement and profiling for latency, bandwidth, and locality
Out of scope (for this guide):
- Distributed memory systems (MPI clusters, RDMA fabrics)
- GPU memory architecture and HBM tuning
- Vendor-specific microarchitectural tuning beyond OS-visible NUMA controls
The Big Picture (Mental Model)
Application Threads
|
v
Scheduler + Affinity Memory Policy
| |
v v
CPU Cores ---> Cache Hierarchy --> Page Faults --> Page Allocation
| |
v v
Local Node DRAM Remote Node DRAM
| |
+-------- Interconnect -------+
Key Terms You Will See Everywhere
- NUMA node: A group of CPUs with local access to a subset of memory.
- Local vs remote memory: Memory attached to the same node vs another node.
- First-touch: Pages are allocated on the node that first faults them in.
- Distance matrix: Relative cost of accessing each node from each node.
- False sharing: Independent variables sharing a cache line causing contention.
- HITM: Cache-to-cache transfer of a modified line (expensive coherence traffic).
How to Use This Guide
- Read the Theory Primer first. It is the mini-book that explains the mental model behind every project.
- Start with Project 1 even if you are experienced. You need to see your machine’s topology before anything else.
- Measure before you optimize. The benchmarks are not optional; they teach you what your hardware actually does.
- Use the design questions. They turn a vague implementation into a deliberate architecture.
- Write down your observations. NUMA work is about evidence. Keep a log of numbers and topology outputs.
- Repeat projects with different policies. Bind vs interleave vs preferred will change your results.
Prerequisites & Background Knowledge
Before starting these projects, you should have foundational understanding in these areas:
Essential Prerequisites (Must Have)
Programming Skills:
- Confident C or C++ programming (pointers, structs, memory allocation)
- Comfortable with Linux command-line tools and process inspection
- Basic concurrency (threads, locks, and shared memory)
Systems Fundamentals:
- Memory hierarchy basics (caches vs DRAM)
- Virtual memory and page faults
- CPU scheduling and affinity concepts
- Recommended Reading: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron — Ch. 5-6, 9, 12
Architecture Basics:
- Cache lines and spatial/temporal locality
- Multi-core execution and shared caches
- Recommended Reading: “Computer Organization and Design RISC-V Edition” by Patterson & Hennessy — Ch. 5-6
Helpful But Not Required
Advanced Performance Tools:
- perf, perf c2c, and hardware counters
- Can learn during: Projects 2, 3, 4, 8
Allocator and Data Structure Internals:
- Arena allocators, per-thread pools
- Can learn during: Projects 5, 6
Database internals or storage engines:
- Buffer pools, page caching
- Can learn during: Project 10
Self-Assessment Questions
Before starting, ask yourself:
- Can I explain the difference between latency and bandwidth in one paragraph?
- Can I write a C program that spawns threads and pins them to CPUs?
- Do I know what a cache line is and why it matters for performance?
- Can I read a /proc or /sys file and interpret its contents?
- Have I used a profiler or benchmark tool on Linux before?
If you answered “no” to questions 1-3: Spend 1-2 weeks reading the recommended chapters before starting.
If you answered “yes” to all 5: You’re ready to begin.
Development Environment Setup
Required Tools:
- A Linux machine (Ubuntu 22.04+ or Debian 12+ recommended)
- GCC or Clang (C11 or later)
- numactl and numastat
- perf (Linux perf tools)
- hwloc (lstopo)
Recommended Tools:
- cmake or meson for builds
- gdb or lldb for debugging
- taskset for quick affinity checks
Testing Your Setup:
# Verify NUMA visibility
$ lscpu | grep -i "NUMA"
NUMA node(s): 2
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
# Verify tools exist
$ which numactl numastat perf lstopo
/usr/bin/numactl
/usr/bin/numastat
/usr/bin/perf
/usr/bin/lstopo
Time Investment
- Simple projects (1-2): Weekend (4-8 hours each)
- Moderate projects (3-4): 1 week each
- Complex projects (5-8): 2-3 weeks each
- Capstone (9-10): 3-6 weeks
- Total sprint: 8-12 weeks if done sequentially
Important Reality Check
NUMA is a performance discipline, not a syntax tutorial. You will often build something that “works” but performs poorly. The learning happens when you measure, explain the numbers, and then change the architecture to fix the results. Expect to iterate.
Big Picture / Mental Model
Workload -> Threads -> Affinity -> Memory Policy -> Pages -> NUMA Nodes
| | | | | |
| | | | | +--> Local DRAM (fast)
| | | | +------------> Remote DRAM (slow)
| | | +------------------------> Interleave / Bind
| | +--------------------------------------> CPU placement
| +-------------------------------------------------> Scheduling
+------------------------------------------------------------> Data layout
The core idea: performance is a product of where threads run and where data is placed. Your projects will measure and control both sides.
Theory Primer (Read This Before Coding)
This section is the mini-book that explains the mental model behind every project. Each chapter is a concept you must understand deeply.
Chapter 1: Memory Hierarchy, Latency, and Bandwidth
Fundamentals
Modern CPUs are orders of magnitude faster than DRAM, so computers are built around a memory hierarchy: tiny, fast caches close to the core and larger, slower memory further away. Latency is the time to fetch one piece of data; bandwidth is the total volume of data you can move per second. A workload that does random pointer chasing is latency-bound, while a streaming workload is often bandwidth-bound. Caches exploit temporal and spatial locality by moving data in cache-line-sized chunks, which means a single load may pull in extra neighboring bytes. The operating system and hardware prefetchers try to guess what you will need next, but that only works when access patterns are regular. Understanding which layer you hit (L1, L2, L3, or DRAM) is essential to explaining performance. This chapter gives you the baseline intuition you will use when comparing UMA and NUMA.
Deep Dive into the Concept
Think of memory access as a pipeline. A load instruction begins in the core, checks the L1 data cache, then L2, then the last-level cache (LLC), and only if all caches miss does it go to the memory controller and DRAM. Each miss adds latency, but the CPU can overlap some of that latency by issuing multiple outstanding memory requests, which is called memory-level parallelism (MLP). If your code issues dependent loads (pointer chasing), MLP collapses to one outstanding request, and latency dominates. If your code streams through large arrays, MLP grows and bandwidth becomes the limiter.
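The contrast is easy to see in code. Below is a minimal sketch (not from the guide's project code) that assumes the buffers have already been built as pointer chains, as in Project 2; chase_one and chase_four are illustrative names. The single dependent chain exposes raw latency, while four independent chains let the core keep several misses in flight.
#include <stddef.h>
#include <stdint.h>

// MLP = 1: every load depends on the previous one, so latency dominates.
uint64_t chase_one(volatile uint64_t *p, size_t steps) {
    for (size_t i = 0; i < steps; i++)
        p = (volatile uint64_t *)*p;
    return (uint64_t)(uintptr_t)p;
}

// MLP up to 4: the chains are independent, so misses can overlap.
uint64_t chase_four(volatile uint64_t *a, volatile uint64_t *b,
                    volatile uint64_t *c, volatile uint64_t *d, size_t steps) {
    for (size_t i = 0; i < steps; i++) {
        a = (volatile uint64_t *)*a;
        b = (volatile uint64_t *)*b;
        c = (volatile uint64_t *)*c;
        d = (volatile uint64_t *)*d;
    }
    return (uint64_t)((uintptr_t)a ^ (uintptr_t)b ^ (uintptr_t)c ^ (uintptr_t)d);
}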
The memory hierarchy is also shaped by cache lines and write policies. Cache lines are fixed-size blocks (commonly 64 bytes) that move between caches and memory as a unit. A store that misses in cache usually triggers a read-for-ownership (write-allocate), which means a write can cause extra read traffic before the write is committed. This behavior matters when you think you are only writing but are actually causing read bandwidth pressure. Similarly, a small write in a cache line can invalidate the line in other cores, creating coherence traffic even if the other core only touched adjacent data.
Latency and bandwidth are different metrics. Latency is a per-access delay, while bandwidth is aggregate throughput. A system can have high bandwidth but still poor single-load latency, and vice versa. This is why benchmarks like STREAM focus on sustained bandwidth, while pointer-chasing microbenchmarks focus on latency. The well-known “latency numbers” give an order-of-magnitude feel (L1 ~0.5 ns, main memory ~100 ns), but the exact numbers on your machine will vary, so you must measure them. Intel’s Memory Latency Checker (MLC) explicitly measures local and cross-socket latencies and bandwidth, showing how the memory hierarchy behaves under load. STREAM is designed to measure sustained memory bandwidth using simple vector kernels where data reuse is intentionally minimal.
Another key factor is the TLB (Translation Lookaside Buffer). The CPU translates virtual addresses to physical addresses using page tables. The TLB caches those translations. If your workload touches huge amounts of memory with poor locality, you can become TLB-bound. Large pages (2 MB or 1 GB huge pages) reduce TLB pressure but can interact with NUMA placement because page migration moves larger chunks. This is why NUMA-aware systems often combine memory policy, page sizes, and workload structure.
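As a minimal sketch of the huge-page hint, assuming Transparent Huge Pages are enabled in "madvise" mode on your kernel: map an anonymous region, ask for huge pages, and touch every page before any timed measurement. Error handling is deliberately thin; the 1 GiB size is only an example.
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t len = 1UL << 30;                        /* 1 GiB working set */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    if (madvise(buf, len, MADV_HUGEPAGE) != 0)     /* hint: back with 2 MB pages */
        perror("madvise(MADV_HUGEPAGE)");          /* not fatal; fall back to 4 KB */

    /* Touch every page so allocation (and first-touch placement) happens
       now, not inside the timed region of a benchmark. */
    for (size_t i = 0; i < len; i += 4096)
        ((volatile char *)buf)[i] = 0;
    munmap(buf, len);
    return 0;
}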
Finally, remember that the memory hierarchy is not just a hardware story. The OS influences it through page placement, NUMA balancing, and scheduling decisions. If a thread migrates to another CPU, its cache working set becomes cold and must be rebuilt. If memory is allocated on one node but the thread runs on another, you pay the remote latency cost every time. This is why your projects always pair measurement with placement control.
A subtle factor is write buffering and store forwarding: a store can retire before data is visible to other cores, and the coherence protocol later propagates it. This matters because microbenchmarks that mix reads and writes can observe different latencies than read-only tests. Aligning buffers and warming pages before timing avoids measuring page faults instead of steady-state memory latency.
How This Fits in the Projects
- Project 2 uses pointer-chasing to measure latency differences.
- Project 3 measures bandwidth with STREAM-like kernels.
- Project 4 shows how cache lines and coherence traffic create false sharing.
Definitions & Key Terms
- Latency: Time for a single memory access to complete.
- Bandwidth: Total data moved per unit time.
- Cache line: The unit of data transfer between cache and memory.
- Working set: The active set of memory a program touches repeatedly.
- MLP: Memory-level parallelism; multiple outstanding misses.
- TLB: Cache of virtual-to-physical translations.
Mental Model Diagram
CPU Core
|
v
L1 (very fast) -> L2 -> L3 (shared)
| |
+------> Memory Controller ------> DRAM
How It Works (Step-by-Step)
- The CPU issues a load.
- L1 cache is checked; if hit, latency is minimal.
- On miss, L2 and L3 caches are checked.
- If all caches miss, the request goes to the memory controller.
- DRAM access completes and fills caches with the cache line.
- The CPU resumes once the data arrives.
Minimal Concrete Example
// Pointer-chasing loop to measure latency
volatile uint64_t *p = buffer;
uint64_t sum = 0;
for (size_t i = 0; i < iterations; i++) {
p = (volatile uint64_t *)*p;
sum += (uint64_t)p;
}
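The loop above assumes buffer already forms a pointer chain. A minimal sketch of one way to build it (build_chain is an illustrative name): link the elements into a randomly shuffled cycle so the hardware prefetcher cannot predict the next address, which is exactly what a latency benchmark needs.
#include <stdint.h>
#include <stdlib.h>

// Turn buffer[0..n-1] into one random cycle: element i stores the address
// of the next element to visit.
static void build_chain(uint64_t *buffer, size_t n) {
    size_t *order = malloc(n * sizeof *order);
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {           /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (size_t i = 0; i < n; i++)                 /* link the cycle */
        buffer[order[i]] = (uint64_t)(uintptr_t)&buffer[order[(i + 1) % n]];
    free(order);
}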
Common Misconceptions
- “High bandwidth means low latency.” (Not true; they are different metrics.)
- “A cache miss is always bad.” (Some workloads are bandwidth-bound and tolerate misses.)
- “If I use a fast CPU, memory won’t matter.” (Memory stalls dominate many workloads.)
Check-Your-Understanding Questions
- Why can pointer-chasing loops be slower than streaming loops?
- What is the difference between latency and bandwidth?
- How does cache-line size affect spatial locality?
Check-Your-Understanding Answers
- Pointer chasing creates dependent loads, which prevents MLP and exposes raw latency.
- Latency is per-access delay; bandwidth is throughput of many accesses.
- Larger cache lines pull in more adjacent data, which helps if you use it and hurts if you don’t.
Real-World Applications
- Database buffer pools and in-memory analytics
- HPC stencil codes and linear algebra
- High-frequency trading systems with tight latency budgets
Where You Will Apply It
- Project 2: Memory Latency Microbenchmark
- Project 3: Memory Bandwidth Benchmark
- Project 4: False Sharing Detector
References
- https://gist.github.com/jboner/2841832 (Latency numbers)
- https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html (Intel MLC)
- https://www.cs.virginia.edu/~mccalpin/papers/bandwidth/node2.html (STREAM benchmark description)
Key Insights
Performance is limited by where data is, not just how fast the CPU is.
Summary
You now have the baseline model: caches hide latency when locality is good, and memory bandwidth becomes the limiter when working sets are large. This mental model will be used in every project.
Homework/Exercises to Practice the Concept
- Write a microbenchmark that measures sequential vs random access latency.
- Increase array size until it no longer fits in L3 and observe the latency jump.
- Measure the effect of huge pages on TLB misses using perf stat.
Solutions to the Homework/Exercises
- Use a pointer-chasing loop for random access and a for-loop for sequential access.
- Double the array size each run and plot cycles per access; the step-up indicates a cache boundary.
- Run with and without madvise huge pages or transparent_hugepage and compare dTLB-load-misses.
Chapter 2: UMA vs NUMA Architecture and Topology
Fundamentals
UMA systems provide uniform access time to all memory, which makes programming simple but limits scalability. NUMA systems split memory into nodes, each local to a set of CPUs. Access to local memory is faster; access to remote memory is slower because it crosses an interconnect. NUMA lets machines scale to more cores by spreading memory controllers across sockets, but the tradeoff is non-uniform latency. This means that two threads accessing the same data can see different performance depending on where they run. Understanding node topology, distance matrices, and how the OS exposes NUMA nodes is the basis for every optimization you will perform.
Deep Dive into the Concept
In UMA, the memory subsystem appears as a single, uniform pool. A shared bus or crossbar connects CPUs to memory controllers, and access latency is roughly the same regardless of which CPU issues the request. This simplicity is attractive, but the bus becomes a bottleneck as core counts rise. NUMA is the answer: each socket (or chiplet group) has its own memory controller and attached DRAM, forming a node. The system still presents a single shared address space, but the physical placement of pages determines how long access takes.
NUMA systems are often cache-coherent (ccNUMA), meaning the hardware maintains a coherent view of memory across all caches. That makes the programming model similar to UMA, but the cost of coherence traffic rises with distance. When one core writes to a cache line that another core on a different node has, the interconnect must carry invalidation or update traffic. This is why NUMA performance depends on locality and why false sharing is more painful on multi-socket machines.
The OS exposes NUMA topology in multiple ways. On Linux, lscpu reports the number of NUMA nodes and which CPUs belong to each node. The numactl --hardware command prints node sizes and a distance matrix. Under the hood, sysfs exposes this data in /sys/devices/system/node/, including node*/cpulist, node*/meminfo, and node*/distance. Libraries like libnuma wrap this into a programmatic API, giving you functions such as numa_max_node, numa_node_to_cpus, and numa_distance. These tools are critical for discovering the topology you will optimize.
NUMA topology is not always as simple as “socket = node”. Some systems split a single socket into multiple nodes for latency reasons. Others expose memory-only nodes (for example, persistent memory tiers). This is why measuring distance and inspecting topology is step one in any NUMA-aware design. The distance matrix is not a physical latency in nanoseconds; it is a relative cost metric that the kernel uses to pick “nearest” nodes for allocations and migrations. The exact values and ratios vary by platform.
When a thread accesses memory, the OS and hardware must decide whether the access is local or remote. If the page is local, the memory controller on that node services the request. If remote, the request is forwarded over the interconnect to the remote node’s memory controller, adding delay and consuming interconnect bandwidth. In multiprocessor systems, this interconnect becomes a shared resource, so remote access can cause contention even if the remote memory is idle.
The practical consequence: you must design data placement and thread placement together. If you schedule threads on node 1 but allocate pages on node 0, you have created a remote memory workload. If you interleave memory across nodes, you balance bandwidth but increase average latency. These tradeoffs are not theoretical; they will be visible in your benchmark results.
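A minimal sketch of the co-location pattern using libnuma (numa_run_on_node, numa_alloc_onnode), assuming node 0 exists and linking with -lnuma: the thread is restricted to node 0's CPUs and its buffer is placed on node 0's memory, so accesses stay local.
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

    int node = 0;                                  /* assumption: node 0 exists */
    numa_run_on_node(node);                        /* run only on node 0's CPUs */
    size_t size = 64UL << 20;                      /* 64 MiB example buffer */
    char *buf = numa_alloc_onnode(size, node);     /* pages placed on node 0 */
    if (!buf) { perror("numa_alloc_onnode"); return 1; }

    for (size_t i = 0; i < size; i += 4096)
        buf[i] = 1;                                /* fault pages in locally */
    numa_free(buf, size);
    return 0;
}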
The kernel uses the distance matrix as a heuristic, not a physics constant. Two nodes with distance 20 are “closer” than nodes with distance 30, but the actual nanosecond delta depends on load, interconnect traffic, and frequency scaling. Virtualized environments may compress or hide distances entirely, which is why you should always check what the guest exposes rather than assuming bare-metal behavior.
How This Fits in the Projects
- Project 1 discovers the topology and distance matrix.
- Project 2 measures local vs remote latency.
- Project 5 and 6 make placement decisions based on node topology.
Definitions & Key Terms
- UMA: Uniform access time to all memory locations.
- NUMA: Access time depends on memory location relative to CPU.
- NUMA node: A set of CPUs with local access to a subset of memory.
- Distance matrix: Relative cost of node-to-node access.
- ccNUMA: Cache-coherent NUMA, where caches remain coherent across nodes.
Mental Model Diagram
Node 0 (CPUs 0-7) Node 1 (CPUs 8-15)
| Local DRAM | | Local DRAM |
+-----+------+ +-----+------+
| |
+------ Interconnect --------+
(remote access path)
How It Works (Step-by-Step)
- The CPU issues a load to a physical address.
- The hardware determines which node owns the memory page.
- If local, the request goes to the local memory controller.
- If remote, the request travels over the interconnect to the remote node.
- The remote controller services the request and returns data.
Minimal Concrete Example
# Show NUMA topology
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32768 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node distances:
node 0 1
0: 10 21
1: 21 10
Common Misconceptions
- “NUMA only matters on 4+ sockets.” (Even 2-socket systems show measurable differences.)
- “Interleaving always improves performance.” (It improves bandwidth but can increase latency.)
- “NUMA is the same as a cluster.” (NUMA is a single shared address space.)
Check-Your-Understanding Questions
- Why does NUMA improve scalability compared to UMA?
- What does a distance matrix represent?
- Why is interleaving not always the best policy?
Check-Your-Understanding Answers
- NUMA spreads memory controllers across sockets, removing a single shared-bus bottleneck.
- It is a relative cost metric of accessing memory from one node to another.
- Interleaving trades latency for bandwidth and can increase average access time.
Real-World Applications
- Multi-socket database servers
- Large in-memory analytics platforms
- Scientific simulations on shared-memory supernodes
Where You Will Apply It
- Project 1: NUMA Topology Explorer
- Project 2: Memory Latency Microbenchmark
- Project 5: NUMA-Aware Memory Allocator
- Project 10: NUMA-Aware Database Buffer Pool
References
- https://man7.org/linux/man-pages/man7/numa.7.html (NUMA overview)
- https://en.wikipedia.org/wiki/Uniform_memory_access (UMA definition)
- https://en.wikipedia.org/wiki/Non-uniform_memory_access (NUMA definition)
- https://man7.org/linux/man-pages/man8/numactl.8.html (numactl hardware and policies)
Key Insights
NUMA scales by distributing memory controllers, but it makes locality a first-order performance concern.
Summary
UMA is simple but does not scale well; NUMA scales but forces you to reason about data placement and thread placement explicitly.
Homework/Exercises to Practice the Concept
- Inspect numactl --hardware and write a short summary of your node distances.
- Pin a process to node 0 and read memory allocated on node 1; measure the slowdown.
- Compare numactl --interleave=all vs --membind=0 for a bandwidth test.
Solutions to the Homework/Exercises
- Record the distance matrix and identify which nodes are closest.
- Use numactl --cpunodebind=0 --membind=1 and compare latency to the local case.
- Interleave should increase aggregate bandwidth but may reduce single-thread latency.
Chapter 3: Cache Coherence, False Sharing, and Contention
Fundamentals
When multiple cores cache the same memory line, the system must keep those caches coherent. Coherence protocols (like MESI/MOESI) track cache line states and ensure writes are visible to other cores. This coherence traffic is cheap inside a socket but becomes expensive across NUMA nodes. False sharing happens when unrelated variables share a cache line: one core writes its variable, invalidating the line in other cores, forcing reloads even though the data is logically unrelated. This creates cache-line ping-pong and can destroy scalability. Understanding coherence states, invalidation, and cache-line granularity is essential for diagnosing performance cliffs. The cost of a single write can include remote invalidations and acknowledgments, which is why the same code can scale on UMA but collapse on NUMA.
Deep Dive into the Concept
Cache coherence exists because each core has private caches. Without coherence, two cores could see different values for the same memory location. MESI (Modified, Exclusive, Shared, Invalid) is a widely used invalidate-based protocol that tracks cache lines across cores. When a core writes a line, other caches must invalidate their copies, and when another core reads the line, a cache-to-cache transfer (or memory read) brings it back. MOESI and MESIF add extra states to optimize shared or owned lines, reducing unnecessary memory write-backs or redundant responses.
In UMA systems, coherence traffic usually stays on a shared bus or ring. In NUMA systems, coherence traffic can cross sockets, which means it travels over a slower interconnect and competes with other traffic. This is why false sharing is more expensive on NUMA: every invalidation may traverse the interconnect, and every reload may fetch remote data. The “HITM” (Hit Modified) event represents a cache-to-cache transfer where another cache holds the modified line. This is a classic signal of false sharing or heavy sharing, and tools like perf c2c are designed to surface it.
False sharing is subtle because it looks like normal sharing in your code. Two threads may update separate counters, but if those counters fall on the same cache line, they contend. Coherence doesn’t know the difference; it invalidates the whole line. The fix is often simple: align or pad the data so each thread writes to its own cache line. Another mitigation is to restructure data so writes are batched, or use per-thread/per-node accumulators with a final reduction step.
Cache coherence protocols can be snooping-based or directory-based. Snooping protocols broadcast coherence messages to all caches, which works for small systems but scales poorly. Directory-based protocols track which caches hold each line, allowing targeted invalidations or updates. ccNUMA systems often rely on directory mechanisms to scale beyond a single socket. The cost of tracking directory state is real (metadata overhead), but it avoids the broadcast storm of snooping.
When you measure performance, coherence shows up as increased latency, reduced bandwidth, and high HITM counts. The perf c2c tool can attribute cacheline contention to specific addresses and show which CPUs and nodes are involved. This makes false sharing visible and gives you a concrete path to fix it. In later projects, you will intentionally create false sharing to observe its cost, then eliminate it by padding and partitioning.
Coherence also interacts with the memory consistency model. Atomic operations and fences often force stronger ordering, which can increase traffic or stall pipelines while ownership of a cache line is negotiated. On NUMA, the cost of acquiring exclusive ownership for a line can include cross-socket invalidations and acknowledgments, which is why highly contended locks scale poorly. A useful mental model is that every write is a distributed coordination event for that cache line.
Even read-mostly workloads can suffer if a single writer repeatedly flips the line between Shared and Modified states. In practice, this means you should design for write isolation, not just read scalability. Cache-line granularity is the unit of coherence, so align data structures to that reality.
How This Fits in the Projects
- Project 4 detects false sharing using perf c2c.
- Project 6 uses padding and partitioning to avoid coherence hotspots.
- Project 8 simulates MESI state transitions to make coherence visible.
Definitions & Key Terms
- Cache coherence: Mechanisms that keep cached copies consistent.
- MESI/MOESI: Common cache coherence protocols and states.
- HITM: Hit Modified; cache-to-cache transfer of a modified line.
- False sharing: Unrelated variables on the same cache line.
- Directory protocol: Tracks which caches hold a line to avoid broadcasts.
Mental Model Diagram
Core A Cache (line X) <----> Core B Cache (line X)
| write invalidates |
+---------- Interconnect --------+
How It Works (Step-by-Step)
- Core A writes to a cache line, moving it to Modified state.
- Coherence invalidates other copies of the line.
- Core B reads the line and triggers a cache-to-cache transfer (HITM).
- The line becomes Shared or Owned, depending on protocol.
- Repeat writes cause ping-pong across cores/nodes.
Minimal Concrete Example
// False sharing: two counters on one cache line
struct Counters {
long a;
long b; // likely same cache line as a
};
// Fix: give each counter its own 64-byte cache line
struct AlignedCounters {
    _Alignas(64) long a;   // line 0
    _Alignas(64) long b;   // line 1; sizeof == 128, safe even in arrays
};
Common Misconceptions
- “If variables are different, they can’t contend.” (They can if on same line.)
- “Cache coherence is free.” (It can dominate runtime under contention.)
- “Locks are the only cause of contention.” (False sharing can be worse.)
Check-Your-Understanding Questions
- What is false sharing and why does it happen?
- Why is HITM a red flag in NUMA profiling?
- How does padding reduce coherence traffic?
Check-Your-Understanding Answers
- False sharing occurs when unrelated variables share a cache line and writes invalidate each other.
- HITM means a cache-to-cache transfer of a modified line, often caused by sharing or false sharing.
- Padding isolates variables into separate cache lines, preventing unnecessary invalidations.
Real-World Applications
- High-throughput counters and metrics systems
- Parallel reductions and per-thread accumulators
- Queue and hash table designs in concurrent systems
Where You Will Apply It
- Project 4: False Sharing Detector
- Project 6: NUMA-Aware Data Structure Library
- Project 8: Cache Coherence Simulator
References
- https://en.wikipedia.org/wiki/MESI_protocol (MESI overview)
- https://en.wikipedia.org/wiki/MOESI_protocol (MOESI overview)
- https://en.wikipedia.org/wiki/False_sharing (False sharing definition)
- https://en.wikipedia.org/wiki/Directory-based_cache_coherence (Directory-based coherence)
- https://man7.org/linux/man-pages/man1/perf-c2c.1.html (perf c2c)
Key Insights
Coherence traffic is the hidden tax on parallelism, and false sharing makes it explode.
Summary
Cache coherence makes shared memory possible, but it can sabotage scalability when lines bounce across cores and nodes.
Homework/Exercises to Practice the Concept
- Write a microbenchmark with two counters and observe the performance difference with padding.
- Use perf c2c on a tight loop and find the hottest cache line.
- Explain why a single shared counter becomes a scalability bottleneck.
Solutions to the Homework/Exercises
- Align each counter to 64 bytes and compare throughput before and after.
- Run perf c2c record and perf c2c report to identify HITM-heavy lines.
- Each increment causes invalidations across cores; per-thread counters avoid this.
Chapter 4: OS Exposure and Topology Discovery
Fundamentals
The operating system exposes NUMA topology so that applications and tools can reason about locality. On Linux, lscpu shows the number of NUMA nodes and CPU mappings, numactl --hardware prints node sizes and distance matrices, and hwloc can visualize the topology as a tree. Under the hood, sysfs exposes node directories in /sys/devices/system/node/, and libnuma provides a C API that reads this information and enforces memory policies. Understanding these interfaces is the foundation for building topology-aware tools and for validating that your memory placement is actually happening. Topology discovery is not just curiosity; it is a prerequisite for any allocator or scheduler decision you will make later in this guide.
Deep Dive into the Concept
Linux exposes NUMA topology through multiple layers. The simplest human-readable view is lscpu, which parses sysfs and /proc/cpuinfo to show sockets, cores, threads, caches, and NUMA nodes. numactl --hardware gives you the node list, per-node memory sizes, and a distance matrix. This is often the first tool you should run on a new system.
Sysfs is the authoritative source of topology. Each node appears as /sys/devices/system/node/nodeN/ and contains files like cpulist, meminfo, and distance. These can be parsed directly, which is what you will do in Project 1. Libnuma provides a friendlier API on top of these, including numa_available(), numa_max_node(), numa_node_to_cpus(), and numa_distance(). It also exposes helpers for allocating memory on specific nodes and for binding threads to nodes. The libnuma documentation notes that a NUMA node is defined as a memory region with uniform access speed from a CPU, and that the library focuses on memory nodes rather than cache levels.
Topology can be more complex than you expect. Some nodes have CPUs but no memory. Some systems expose multiple nodes per socket (for example, when a socket is split into multiple NUMA domains). Memory-only nodes can exist when using memory tiering or persistent memory. This is why you should not assume a fixed mapping such as “node = socket”. Tools like hwloc provide a richer graph that includes caches, NUMA nodes, and I/O devices, which can be useful when you want to place threads near storage or NICs.
Another important layer is the process’s allowed CPUs and nodes. The current cpuset and cgroup configuration can restrict which nodes a process may allocate from. Libnuma reads /proc/self/status to determine Mems_allowed and Cpus_allowed, and it will respect those restrictions. This is critical in containerized environments or HPC schedulers where your job is isolated to a subset of the machine.
By the end of this chapter, you should be able to answer: How many NUMA nodes does this machine have? Which CPUs belong to each node? What is the relative cost of accessing each node? And does my process have access to all nodes or a restricted subset?
Beyond discovery, you must validate placement at runtime. Tools like numastat -p <pid> summarize where a process’s pages reside, while /proc/<pid>/numa_maps gives a detailed per-range view. You can combine these with your own sysfs parsing to confirm that allocations are landing on the intended nodes. This validation step is critical because cpusets, containers, or job schedulers may silently restrict your allowed nodes.
hwloc provides a topology tree that includes cache levels and NUMA nodes, and it can export XML for programmatic use. This is useful when you want to persist topology snapshots or compare machines. For CPU-to-node mapping in code, you can also read /sys/devices/system/cpu/cpu*/node*/ or use numa_node_of_cpu to derive the node directly.
For programmatic validation, numa_node_size64 can report per-node memory sizes, and numa_distance gives you the distance matrix used by the kernel’s placement heuristics. Use these APIs to cross-check sysfs parsing.
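A minimal sketch of such a cross-check, printing per-node sizes and the distance matrix with libnuma (link with -lnuma); compare its output against numactl --hardware.
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    int max = numa_max_node();

    for (int n = 0; n <= max; n++) {
        long long free_b = 0;
        long long total = numa_node_size64(n, &free_b);   /* bytes */
        printf("node %d: %lld MB total, %lld MB free\n",
               n, total >> 20, free_b >> 20);
    }
    printf("distances:\n");
    for (int a = 0; a <= max; a++) {
        for (int b = 0; b <= max; b++)
            printf("%4d", numa_distance(a, b));           /* relative cost */
        printf("\n");
    }
    return 0;
}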
How This Fits in the Projects
- Project 1 uses sysfs and libnuma to build a topology explorer.
- Project 5 uses libnuma to allocate memory on specific nodes.
- Project 9 uses topology to build a NUMA-aware thread pool.
Definitions & Key Terms
- sysfs: Kernel-exposed filesystem with topology info.
- libnuma: User-space library for NUMA policies and queries.
- cpuset: Kernel mechanism that restricts CPU and memory nodes.
- distance: Relative NUMA cost between nodes.
- hwloc: Portable topology discovery library.
Mental Model Diagram
/sys/devices/system/node/
node0/ -> cpulist, meminfo, distance
node1/ -> cpulist, meminfo, distance
How It Works (Step-by-Step)
- The kernel enumerates nodes and CPUs at boot.
- Sysfs exposes node directories and metadata.
- Tools like lscpu and numactl read sysfs.
- libnuma wraps sysfs into a programmatic API.
- Your application queries libnuma to discover topology.
Minimal Concrete Example
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        printf("NUMA not available\n");
        return 1;
    }
    int max = numa_max_node();
    printf("Max node: %d\n", max);
    return 0;
}
Common Misconceptions
- “NUMA topology is always static.” (It can change with hotplug or cpusets.)
- “Every node has memory and CPUs.” (Some nodes are memory-only.)
- “numactl shows absolute latency.” (It shows relative distances.)
Check-Your-Understanding Questions
- Where does lscpu get its NUMA information?
- What does numa_distance(a, b) represent?
- How can cpusets affect your NUMA allocations?
Check-Your-Understanding Answers
- It reads sysfs and /proc to gather topology and NUMA node mappings.
- It is a relative cost metric used by the kernel for placement decisions.
- Cpusets restrict which nodes your process is allowed to allocate from.
Real-World Applications
- HPC schedulers that pin jobs to specific sockets
- Container runtimes enforcing resource isolation
- Performance tools that visualize system topology
Where You Will Apply It
- Project 1: NUMA Topology Explorer
- Project 5: NUMA-Aware Memory Allocator
- Project 9: NUMA-Aware Thread Pool
References
- https://man7.org/linux/man-pages/man1/lscpu.1.html (lscpu NUMA reporting)
- https://man7.org/linux/man-pages/man3/numa.3.html (libnuma API)
- https://man7.org/linux/man-pages/man8/numactl.8.html (numactl hardware view)
- https://www.open-mpi.org/projects/hwloc/doc/v2.10.0/a00339.php (hwloc overview)
Key Insights
You cannot optimize NUMA without first discovering the topology you are actually running on.
Summary
Linux exposes NUMA topology through sysfs, tools, and libnuma. Master those interfaces before writing any NUMA-aware code.
Homework/Exercises to Practice the Concept
- Write a script that parses /sys/devices/system/node/node*/cpulist.
- Compare lscpu and numactl --hardware outputs for consistency.
- Use hwloc-ls or lstopo to visualize your topology.
Solutions to the Homework/Exercises
- Read each cpulist file and map CPU ranges to nodes.
- The node lists and CPU mappings should match; if not, check cpuset restrictions.
- The diagram should show NUMA nodes, caches, and cores in a hierarchy.
Chapter 5: Memory Placement Policies and Page Migration
Fundamentals
NUMA performance depends on where pages are allocated. Linux provides memory policies that control which node a page should come from. The default behavior is local allocation (often called first-touch), meaning the page is allocated on the node of the CPU that first faults it in. You can override this with policies such as bind (restrict to specific nodes), preferred (try a node but allow fallback), or interleave (spread pages across nodes for bandwidth). Policies can be set per-thread with set_mempolicy or per-memory range with mbind. Pages can also be migrated after allocation using move_pages or migrate_pages. Automatic NUMA balancing can move pages based on observed access patterns. Understanding these policies lets you shape locality instead of hoping it happens by accident.
Deep Dive into the Concept
Linux implements NUMA policies as a combination of mode and node mask. The kernel documentation defines primary modes such as default, bind, preferred, interleave, and local allocation. The policy can be applied at different scopes: per-thread (default policy) or per-VMA (range of virtual memory). The set_mempolicy() system call sets the thread’s default policy, while mbind() sets a policy for a range of addresses. Policies only take effect when pages are actually allocated, which for anonymous memory usually happens at first touch. This is why initialization order matters: if a single thread zeroes a large buffer, all pages land on that thread’s node.
The details matter. MPOL_BIND forces allocations onto specific nodes; if they run out of memory, allocation fails (or falls back depending on flags and kernel version). MPOL_PREFERRED chooses one node but allows fallback to others when memory is scarce. MPOL_INTERLEAVE spreads pages across nodes, which can maximize bandwidth for streaming workloads but increases the average latency for single-threaded access. MPOL_LOCAL explicitly chooses local allocation, which is similar to default but can be applied to a range.
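A minimal sketch of the interleave policy, assuming a machine with at least nodes 0 and 1: map a large anonymous region, apply MPOL_INTERLEAVE to the range with mbind, then fault the pages in so the policy takes effect (link with -lnuma for the syscall wrapper).
#define _GNU_SOURCE
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t len = 1UL << 30;                          /* 1 GiB example region */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned long nodemask = (1UL << 0) | (1UL << 1); /* nodes 0 and 1 */
    if (mbind(buf, len, MPOL_INTERLEAVE, &nodemask,
              sizeof(nodemask) * 8, 0) != 0)
        perror("mbind");                              /* policy applies to pages faulted later */

    for (size_t i = 0; i < len; i += 4096)            /* fault pages in now */
        ((volatile char *)buf)[i] = 0;
    munmap(buf, len);
    return 0;
}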
Page migration changes where pages live after allocation. move_pages() can query page locations or move them to target nodes. migrate_pages() can move all pages of a process from one set of nodes to another. These operations are not free: the page must be copied, and in-use pages may be busy or require writeback. The man pages list the specific error codes you will see when migration fails (e.g., EBUSY, EACCES, ENOMEM). Understanding these is essential when building a migration tool.
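A minimal sketch of the query side of move_pages (nodes = NULL), which you will expand in Project 7; report_placement is an illustrative helper, the buffer's pages must already be faulted in, and the program links with -lnuma.
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void report_placement(void *buf, size_t len) {
    long page = sysconf(_SC_PAGESIZE);
    size_t count = len / (size_t)page;
    void **pages = malloc(count * sizeof *pages);
    int *status = malloc(count * sizeof *status);

    for (size_t i = 0; i < count; i++)
        pages[i] = (char *)buf + i * (size_t)page;

    /* nodes == NULL means "query": status[i] is a node ID, or a negative
       errno such as -ENOENT for pages that were never faulted in. */
    if (move_pages(0, count, pages, NULL, status, 0) == 0) {
        for (size_t i = 0; i < count; i++)
            printf("page %zu -> node %d\n", i, status[i]);
    } else {
        perror("move_pages");
    }
    free(pages);
    free(status);
}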
Automatic NUMA balancing is a kernel feature that tries to fix bad placement by sampling memory accesses. The kernel periodically unmaps pages, causing a fault when they are next accessed. Based on the faulting CPU, the kernel can migrate the page to a more local node. This can improve performance for poorly placed workloads, but it adds overhead and is not universally beneficial. The sysctl kernel.numa_balancing controls this behavior, and the kernel documentation explicitly warns that the overhead may outweigh the benefit if the workload is already NUMA-aware.
Memory policies also interact with cpusets. Even if you request a node in your policy, the kernel may restrict allocations to the nodes allowed by the cpuset. This means that NUMA-aware code must always consider the process’s allowed node mask, not just the machine’s global topology.
The key lesson is that memory placement is a policy decision, not a hardware inevitability. With the right tools, you can force locality, spread bandwidth, or migrate pages to follow threads. Every project after Project 2 depends on understanding these tradeoffs.
Migration has hidden costs: page copies, TLB shootdowns, and potential NUMA contention while pages are locked. Large pages magnify both the benefit and the cost because moving a 2 MB page relocates more data but can also stall longer. Policies should therefore be chosen with workload stability in mind. This is why benchmarks should always report policy and placement alongside performance numbers.
How This Fits in the Projects
- Project 2 uses first-touch and explicit binding to compare latencies.
- Project 5 builds an allocator that enforces placement policies.
- Project 7 migrates pages and verifies placement changes.
Definitions & Key Terms
- First-touch: Pages allocated on the node of the first accessing CPU.
- MPOL_BIND: Restrict allocations to specific nodes.
- MPOL_INTERLEAVE: Round-robin allocations across nodes.
- MPOL_PREFERRED: Prefer a node but allow fallback.
- Page migration: Moving allocated pages between nodes.
- Automatic NUMA balancing: Kernel feature that migrates pages based on access.
Mental Model Diagram
Virtual Pages -> Policy -> Node Selection -> Physical Pages
^ | |
| | +--> Local node
| +--------------> Remote node (fallback)
+-- First touch
How It Works (Step-by-Step)
- A thread touches an unmapped page.
- The kernel consults the thread or VMA policy.
- The kernel selects a node according to policy and cpuset constraints.
- The page is allocated from that node.
- Later, move_pages or migrate_pages can relocate it.
Minimal Concrete Example
#include <numaif.h>
#include <stdio.h>

int main(void) {
    // Prefer node 1 for this thread's future allocations
    unsigned long mask = 1UL << 1;
    if (set_mempolicy(MPOL_PREFERRED, &mask, 8 * sizeof(mask)) != 0)
        perror("set_mempolicy");
    return 0;
}
Common Misconceptions
- “set_mempolicy moves existing pages.” (It affects future allocations only.)
- “Interleave always improves performance.” (It can increase latency.)
- “NUMA balancing always helps.” (It can add overhead and hurt tuned apps.)
Check-Your-Understanding Questions
- When does a page actually get allocated under Linux NUMA policies?
- What is the difference between set_mempolicy and mbind?
- Why might automatic NUMA balancing be disabled for tuned workloads?
Check-Your-Understanding Answers
- When the page is first faulted in (usually on first write for anonymous memory).
- set_mempolicy sets a thread default; mbind sets a policy on a memory range.
- The sampling faults add overhead and may conflict with explicit placement.
Real-World Applications
- NUMA-aware allocators (jemalloc arenas, per-node pools)
- HPC codes that must control placement to scale
- Database buffer pools with per-socket partitioning
Where You Will Apply It
- Project 2: Memory Latency Microbenchmark
- Project 5: NUMA-Aware Memory Allocator
- Project 7: NUMA Memory Migration Tool
References
- https://www.kernel.org/doc/html/latest/admin-guide/mm/numa_memory_policy.html (Kernel NUMA policy)
- https://man7.org/linux/man-pages/man2/set_mempolicy.2.html (set_mempolicy)
- https://man7.org/linux/man-pages/man2/mbind.2.html (mbind)
- https://man7.org/linux/man-pages/man2/move_pages.2.html (move_pages)
- https://man7.org/linux/man-pages/man2/migrate_pages.2.html (migrate_pages)
- https://www.kernel.org/doc/html/v6.7/admin-guide/sysctl/kernel.html (numa_balancing)
Key Insights
NUMA performance is largely a policy problem: where you place pages determines where you pay latency.
Summary
Linux gives you explicit control over NUMA placement. First-touch is only the beginning; real performance comes from choosing the right policy for the workload.
Homework/Exercises to Practice the Concept
- Allocate a large array and initialize it from different threads; observe placement changes.
- Use numactl --membind vs --interleave for the same program and compare results.
- Write a small tool that calls move_pages to query page locations.
Solutions to the Homework/Exercises
- Use numastat -p <pid> or /proc/<pid>/numa_maps to observe node distribution.
- Interleave should reduce node imbalance but may increase latency for single-threaded access.
- Call move_pages with nodes = NULL to query and print node IDs.
Chapter 6: Thread Placement, Scheduling, and CPU Affinity
Fundamentals
Thread placement determines which CPU cores execute your code. On NUMA systems, this directly affects which memory is local. Linux provides CPU affinity APIs (sched_setaffinity, pthread_setaffinity_np) to pin threads to CPUs, and tools like numactl and taskset to control placement at the process level. Cpusets can further restrict which CPUs and memory nodes a process is allowed to use. If a thread migrates across nodes, its cache working set becomes cold and its memory accesses may become remote. NUMA-aware performance requires controlling where threads run. Pinning affects cache warmth and memory locality, so it is a core design decision rather than a late-stage optimization tweak.
Deep Dive into the Concept
CPU affinity is a per-thread mask that restricts which CPUs a thread may execute on. sched_setaffinity() sets the mask for a given thread ID; pthread_setaffinity_np() does the same at the pthread layer. When you set affinity, the kernel migrates the thread if it is currently running on a CPU not in the mask. Threads inherit affinity from their parent, so setting the mask early avoids surprising migrations.
Affinity does not exist in isolation. Cpusets (via cgroups) can restrict which CPUs and memory nodes a process is allowed to use. The kernel intersects your requested affinity mask with the cpuset’s allowed CPUs. This means that pinning a thread to CPU 0 may silently map to a different CPU if CPU 0 is not in the cpuset. Understanding the allowed CPU set is critical, especially in containerized environments.
Thread placement interacts with memory placement. If you bind memory to node 0 but allow threads to run on node 1, every access becomes remote. If you pin threads to node 0 but allocate with interleave, you trade latency for bandwidth. The most robust designs co-locate threads and their data: allocate on the node where the thread runs, and keep the thread there.
Schedulers also make decisions based on load. If a system is oversubscribed or if you do not pin threads, the scheduler may move them between CPUs. This can improve throughput but harms locality. The right strategy depends on workload: latency-sensitive services often use strict pinning, while throughput-oriented batch jobs may allow more flexibility. You will build both approaches in the thread-pool project.
Finally, remember that affinity is not always enough. For some workloads, you want to group threads by node and build per-node queues to keep work local. This is the difference between naive pinning and NUMA-aware scheduling: the latter understands topology and uses it for both thread placement and task dispatch.
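A minimal sketch of the shape of per-node dispatch (this is not Project 9's code; MAX_NODES, QUEUE_CAP, and the helper names are illustrative): one bounded task queue per node, and each worker pops from the queue of the node it is pinned to, stealing from other nodes only when its own is empty.
#include <pthread.h>
#include <stdbool.h>

#define MAX_NODES 8
#define QUEUE_CAP 1024

typedef struct { void (*fn)(void *); void *arg; } task_t;

typedef struct {
    pthread_mutex_t lock;
    task_t items[QUEUE_CAP];
    int head, tail;
} node_queue_t;

static node_queue_t queues[MAX_NODES];

static void init_queues(int num_nodes) {
    for (int n = 0; n < num_nodes && n < MAX_NODES; n++)
        pthread_mutex_init(&queues[n].lock, NULL);
}

// A worker calls this with its own node first, then other nodes to steal.
static bool try_pop(int node, task_t *out) {
    node_queue_t *q = &queues[node];
    pthread_mutex_lock(&q->lock);
    bool ok = q->head != q->tail;
    if (ok) { *out = q->items[q->head]; q->head = (q->head + 1) % QUEUE_CAP; }
    pthread_mutex_unlock(&q->lock);
    return ok;
}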
Hyperthreads complicate placement. Two logical CPUs may share the same physical core and caches, which can be beneficial for latency when they share data but harmful when they compete for resources. A common strategy is to pin latency-sensitive threads to separate physical cores first, then use siblings for background work. lscpu --extended and hwloc can reveal these relationships, and you can build a topology-aware placement policy that prefers cores over siblings.
Linux also has NUMA-aware scheduling features that attempt to keep tasks on the same node as their memory, but they are heuristics and can be overridden by explicit affinity. This means your application should make a deliberate choice: either trust the scheduler and avoid strict pinning, or take full control and accept the responsibility to balance load yourself. In practice, NUMA-critical services often pin long-lived worker threads and run a separate, unpinned thread pool for background tasks.
Always verify placement. ps -o pid,psr,comm -p <pid> shows the last CPU a thread ran on, and taskset -cp <pid> reveals the current affinity mask. These small checks prevent hours of debugging when results look inconsistent.
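You can also do the check from inside the program. A minimal sketch, assuming libnuma is linked (log_placement is an illustrative helper): each worker logs the CPU it is currently on and the node that CPU belongs to.
#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>
#include <stdio.h>

static void log_placement(const char *tag) {
    int cpu = sched_getcpu();             /* CPU this thread is running on now */
    int node = numa_node_of_cpu(cpu);     /* NUMA node that CPU belongs to */
    printf("%s: cpu=%d node=%d\n", tag, cpu, node);
}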
How This Fits in the Projects
- Project 1 uses CPU affinity to map CPUs to nodes.
- Project 9 builds a NUMA-aware thread pool with per-node queues.
- Project 10 uses affinity to co-locate query execution with buffer pool pages.
Definitions & Key Terms
- Affinity mask: The set of CPUs a thread is allowed to run on.
- cpuset: Kernel mechanism that constrains CPUs and memory nodes.
- Pinning: Fixing a thread to a specific CPU or node.
- Migration: Moving a thread between CPUs, often harming locality.
Mental Model Diagram
Thread -> Affinity Mask -> Allowed CPUs -> Node -> Local Memory
How It Works (Step-by-Step)
- The thread starts with an inherited CPU mask.
- You call pthread_setaffinity_np or sched_setaffinity.
- The kernel applies cpuset restrictions and migrates the thread if needed.
- The thread runs on the allowed CPUs until its mask changes.
Minimal Concrete Example
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Allow the calling thread to run only on CPUs 0 and 1
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(0, &set);
CPU_SET(1, &set);
pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
Common Misconceptions
- “Pinning is always faster.” (It can reduce scheduler flexibility and load balance.)
- “Affinity guarantees memory locality.” (Only if you also control placement.)
- “Affinity is process-wide.” (It is per-thread.)
Check-Your-Understanding Questions
- What happens if you pin a thread to a CPU outside its cpuset?
- Why does thread migration hurt cache locality?
- How does per-node work stealing differ from global work stealing?
Check-Your-Understanding Answers
- The kernel silently intersects with the allowed cpuset; the thread may run elsewhere.
- The thread loses warm caches and may access remote memory after migration.
- Per-node work stealing prefers local tasks, reducing remote accesses.
Real-World Applications
- Low-latency trading systems with dedicated core pinning
- Database servers with per-socket worker pools
- HPC runtimes that bind threads to cores
Where You Will Apply It
- Project 1: NUMA Topology Explorer
- Project 9: NUMA-Aware Thread Pool
- Project 10: NUMA-Aware Database Buffer Pool
References
- https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html (sched_setaffinity)
- https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html (pthread affinity)
- https://man7.org/linux/man-pages/man7/cpuset.7.html (cpusets)
- https://man7.org/linux/man-pages/man8/numactl.8.html (numactl affinity)
Key Insights
Thread placement and data placement must be designed together or NUMA penalties will dominate.
Summary
Affinity controls where threads run, cpusets control where they are allowed to run, and NUMA-aware designs align threads with their data.
Homework/Exercises to Practice the Concept
- Pin a thread to a single CPU and measure a microbenchmark.
- Allow the thread to run on all CPUs and compare performance variance.
- Create two thread groups, one per NUMA node, and measure local vs remote access.
Solutions to the Homework/Exercises
- Use sched_setaffinity or pthread_setaffinity_np and log cycles.
- Expect higher variance due to migrations and cache warmup.
- Use numactl --cpunodebind to enforce node affinity and compare.
Chapter 7: NUMA-Aware Allocation and Data Structures
Fundamentals
NUMA-aware allocators and data structures place memory close to the threads that use it. This can be as simple as per-node arenas or as advanced as partitioned data structures that avoid cross-node sharing. Common strategies include sharding, replication for read-mostly data, interleaving for bandwidth, and per-node pools to reduce contention. NUMA-aware algorithms often trade consistency or simplicity for locality. This chapter shows how to design data structures that respect topology. Without this discipline, the best algorithm can be dominated by remote accesses and coherence traffic. The simplest pattern is to avoid cross-node writes whenever possible and merge results only at well-defined synchronization points.
Deep Dive into the Concept
The default allocator typically treats memory as uniform. On NUMA systems, this can lead to a single thread allocating memory on its node while other threads on different nodes access it remotely. NUMA-aware allocators address this by creating per-node arenas or per-thread caches. When a thread requests memory, the allocator chooses a local arena and uses numa_alloc_onnode or memory policy APIs to place pages near that thread. This reduces remote accesses and lowers coherence traffic.
Data structures can also be partitioned by NUMA node. A sharded hash table, for example, can have one shard per node. Threads on that node access the local shard, and cross-node access is minimized. For read-heavy data, replication can be more effective: each node maintains a local copy, and occasional updates propagate across nodes. For write-heavy shared structures, batching and delegation can reduce contention. SmartPQ, a NUMA-aware priority queue, demonstrates this by switching between NUMA-oblivious and NUMA-aware modes based on contention, achieving performance gains in measured workloads.
NUMA-aware design is not only about data placement. It is also about access patterns. If a data structure forces every operation to touch a global lock or a shared counter, the coherence traffic will dominate regardless of placement. Per-node counters, relaxed consistency, or hierarchical aggregation are common solutions. This is why many NUMA-aware designs use a “reduce” phase: each node accumulates locally and only occasionally combines global state.
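A minimal sketch of the "accumulate locally, combine rarely" pattern: one padded counter per node, incremented by local threads with relaxed atomics, and summed only when a total is needed. MAX_NODES and the 64-byte line size are assumptions for the sketch; link with -lnuma.
#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>
#include <stdatomic.h>

#define MAX_NODES 8

struct node_counter {
    _Alignas(64) atomic_long value;       /* one cache line per node */
};

static struct node_counter counters[MAX_NODES];

static void add_local(long delta) {
    int node = numa_node_of_cpu(sched_getcpu());
    atomic_fetch_add_explicit(&counters[node].value, delta,
                              memory_order_relaxed);
}

static long read_total(void) {            /* occasional reduction step */
    long total = 0;
    for (int n = 0; n < MAX_NODES; n++)
        total += atomic_load_explicit(&counters[n].value,
                                      memory_order_relaxed);
    return total;
}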
Research shows that NUMA-aware memory management can deliver significant speedups. The JArena paper reports up to 4.3x performance improvements on a 256-core ccNUMA node by eliminating false page sharing and providing a NUMA-aware heap. The SmartPQ paper reports 1.87x average performance improvements for priority queues in NUMA systems. These are not marginal gains; they reflect architectural realities.
The key tradeoff is complexity. NUMA-aware allocators and data structures are more complex to implement, test, and reason about. They often require topology discovery, policy selection, and careful measurement. But for performance-critical systems, the payoff is worth it.
Data layout transformations are often necessary. An array-of-structs layout may interleave fields used by different threads on the same cache line, while a struct-of-arrays layout can isolate per-thread fields and improve prefetching. On NUMA, this also affects which pages are touched by which threads during initialization, influencing first-touch placement. If you partition data by node, you should initialize it on that node to avoid remote placement from the beginning.
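A minimal sketch of parallel first-touch initialization (init_arg and first_touch_init are illustrative names, and each worker is assumed to already be pinned to its node): because the thread that first writes a page determines its placement, every worker zeroes only its own slice.
#include <pthread.h>
#include <stddef.h>
#include <string.h>

struct init_arg {
    double *base;       /* shared array */
    size_t begin, end;  /* this worker's slice, in elements */
};

// Run one of these per worker after the worker has been pinned to its node;
// the memset faults the slice's pages in on that node (first-touch).
static void *first_touch_init(void *p) {
    struct init_arg *a = p;
    memset(a->base + a->begin, 0, (a->end - a->begin) * sizeof(double));
    return NULL;
}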
Some allocators (like glibc malloc) are not NUMA-aware by default, while others (jemalloc, tcmalloc) provide arena controls that can be combined with NUMA policies. Even without a custom allocator, you can use mbind or numa_alloc_interleaved for large, shared arrays where bandwidth matters more than latency. The key is to match policy to access pattern, not apply a single rule everywhere.
Memory reclamation is also topology-sensitive. If one node frees memory that another node reuses, you can accidentally create remote hot spots. Per-node free lists or epoch-based reclamation tied to node-local threads help keep freed pages near the threads that will reuse them.
How This Fits in the Projects
- Project 5 builds a NUMA-aware allocator with per-node arenas.
- Project 6 implements NUMA-aware data structures and sharding.
- Project 10 uses these techniques in a buffer pool manager.
Definitions & Key Terms
- Arena allocator: Memory pool managed separately to reduce contention.
- Sharding: Partitioning data into independent subsets per node.
- Replication: Maintaining local copies to avoid remote reads.
- Delegation: Routing operations to a node-local owner thread.
Mental Model Diagram
Node 0: [Shard 0 | Arena 0]
Node 1: [Shard 1 | Arena 1]
Threads mostly touch their local shard.
How It Works (Step-by-Step)
- Discover node topology and thread placement.
- Create per-node arenas or shards.
- Route allocations/operations to the local node.
- Handle cross-node operations with batching or delegation.
- Aggregate results with a reduction step.
Minimal Concrete Example
// Allocate memory on the calling thread's current NUMA node
// (needs <numa.h> and <sched.h>; link with -lnuma)
int node = numa_node_of_cpu(sched_getcpu());
void *ptr = numa_alloc_onnode(size, node);
// ... use ptr, then release it with numa_free(ptr, size)
Common Misconceptions
- “NUMA-aware data structures always outperform.” (Only if locality is real.)
- “Replication is free.” (It increases memory use and update cost.)
- “Sharding removes all contention.” (Hot keys can still concentrate load.)
Check-Your-Understanding Questions
- Why do per-node arenas reduce contention?
- When is replication preferable to sharding?
- What is the tradeoff of delegation-based designs?
Check-Your-Understanding Answers
- They reduce cross-node allocation traffic and cacheline sharing.
- When reads dominate and updates are infrequent.
- Delegation can add latency but reduces coherence traffic and contention.
Real-World Applications
- High-concurrency in-memory databases
- Runtime memory allocators (jemalloc, tcmalloc)
- Priority queues and graph analytics engines
Where You Will Apply It
- Project 5: NUMA-Aware Memory Allocator
- Project 6: NUMA-Aware Data Structure Library
- Project 10: NUMA-Aware Database Buffer Pool
References
- https://man7.org/linux/man-pages/man3/numa.3.html (libnuma allocation)
- https://arxiv.org/abs/1902.07590 (JArena)
- https://arxiv.org/abs/2406.06900 (SmartPQ)
Key Insights
NUMA-aware allocation and data structures can deliver multi-x speedups when locality is preserved.
Summary
Designing for locality is as important as algorithmic complexity in NUMA systems. The right placement can outperform clever code that ignores topology.
Homework/Exercises to Practice the Concept
- Implement a per-thread allocator and compare it to malloc on a multi-core system.
- Build a sharded hash map with one shard per NUMA node.
- Measure the effect of replication for a read-heavy workload.
Solutions to the Homework/Exercises
- Use numa_alloc_onnode for each thread’s arena and measure remote access stats.
- Route operations by node ID and compare throughput to a single shared map.
- Replicate data and measure reduced remote reads but increased memory use.
Chapter 8: Measurement and Profiling for NUMA
Fundamentals
NUMA performance cannot be optimized without measurement. You need benchmarks that isolate latency (pointer chasing) and bandwidth (STREAM-like kernels) and tools that show locality and coherence traffic. On Linux, numastat shows per-node memory usage, /proc/<pid>/numa_maps reveals where pages live, perf shows hardware counters, and perf c2c highlights cacheline contention. Intel MLC provides a latency/bandwidth matrix. This chapter explains how to measure correctly and interpret the results. The goal is to produce repeatable numbers that explain why a workload is slow, not just to generate a benchmark score. Treat your benchmarks like experiments: control inputs, record conditions, and repeat enough times to understand variance.
Deep Dive into the Concept
NUMA measurement begins with topology awareness. Always record node counts, CPU mappings, and distance matrices before running benchmarks. Use numactl --cpunodebind and --membind to force locality, and then compare with interleaved or remote allocations. For latency, pointer-chasing loops are the gold standard because they prevent memory-level parallelism and reveal raw access costs. For bandwidth, STREAM-style kernels (copy, scale, sum, triad) stress the memory subsystem with large arrays that exceed cache size.
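A pointer-chasing latency kernel can be as small as the following sketch. The buffer size, the shuffle, and the use of clock_gettime (rather than rdtsc) are assumptions to adapt to your machine; run it under different numactl --cpunodebind/--membind combinations to compare local and remote latency.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64UL * 1024 * 1024 / sizeof(void *))   // ~64 MiB, larger than a typical LLC

int main(void) {
    void **chain = malloc(N * sizeof(void *));
    size_t *order = malloc(N * sizeof(size_t));
    for (size_t i = 0; i < N; i++) order[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {           // Fisher-Yates shuffle -> random walk
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < N; i++)                 // link elements in shuffled order
        chain[order[i]] = &chain[order[(i + 1) % N]];

    struct timespec t0, t1;
    void **p = &chain[order[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)                 // dependent loads: no memory-level parallelism
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg load latency: %.1f ns (sink %p)\n", ns / N, (void *)p);  // use p so the loop survives optimization
    free(chain); free(order);
    return 0;
}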
Measurement requires control. Disable CPU frequency scaling if possible, pin threads to CPUs, and ensure you are not competing with other workloads. Use large enough data sizes to avoid cache effects unless you are explicitly measuring caches. Record multiple runs and report distributions, not just averages.
NUMA-specific tools add critical context. numastat reads /proc/*/numa_maps and /sys/devices/system/node/node*/numastat to show memory distribution and access patterns. perf c2c records cache-to-cache transfers and identifies hot cache lines with HITM events, which often signal false sharing. perf stat can show LLC misses, remote accesses (on some platforms), and TLB behavior.
Intel MLC provides a latency and bandwidth matrix across sockets, which makes local vs remote differences explicit. It can also measure cache-to-cache transfer latency. STREAM defines standardized kernels for measuring sustained bandwidth, and its design intentionally avoids data reuse. These tools help you answer: Is my workload latency-bound or bandwidth-bound? Are remote accesses dominating? Is coherence traffic causing contention?
Finally, measurement is not just observation; it is validation. Every optimization you attempt in later projects must be measured against a baseline. NUMA-aware code is only successful if it produces measurable improvements in latency, bandwidth, or scalability.
A good benchmark separates allocation, initialization, and timing. If you include allocation in your timed region, you’re mostly measuring page faults and zeroing costs, not steady-state access. Warm up the data to ensure pages are mapped, then start timing. If you want to measure first-touch effects, time the initialization separately and record which thread touched each page. For bandwidth tests, use non-temporal stores if you want to bypass caches, and clearly document whether caches are part of the measurement.
Verification should be explicit. After a run, inspect /proc/<pid>/numa_maps and numastat -p <pid> to confirm placement. For fine-grained checks, move_pages can return the node for individual pages, letting you sample and verify distributions. This is especially important in mixed workloads where some arrays are local and others are intentionally interleaved.
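As a sketch of that fine-grained check, numa_move_pages can be called with a NULL node list to report where each page of a buffer currently resides. Buffer and error handling here are deliberately simplified.
#include <numa.h>              // link with -lnuma
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

// Print the NUMA node of every page in [buf, buf + bytes).
void print_page_nodes(void *buf, size_t bytes) {
    long page = sysconf(_SC_PAGESIZE);
    unsigned long count = bytes / page;
    void **pages = malloc(count * sizeof *pages);
    int *status = malloc(count * sizeof *status);
    for (unsigned long i = 0; i < count; i++)
        pages[i] = (char *)buf + i * page;
    // nodes == NULL means "query": status[i] receives the node of page i,
    // or a negative errno (e.g. -ENOENT for pages that were never touched).
    if (numa_move_pages(0 /* self */, count, pages, NULL, status, 0) < 0)
        perror("numa_move_pages");
    else
        for (unsigned long i = 0; i < count; i++)
            printf("page %lu -> node %d\n", i, status[i]);
    free(pages);
    free(status);
}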
Counter-based profiling requires careful event selection. perf stat can track LLC misses, cache references, and context switches; perf c2c focuses on HITM traffic. On some systems, uncore events can show remote memory accesses. Always record the CPU frequency and system load so you can interpret results and reproduce them later. Measurement is a controlled experiment, not a one-off script.
Report results with context. Include the CPU model, NUMA topology, memory policy, thread count, and data size in every output. Without this metadata, comparisons across machines or over time are meaningless.
How This Fits in the Projects
- Project 2 and 3 are measurement projects by design.
- Project 4 uses perf c2c to quantify false sharing.
- Project 7 validates page migration.
Definitions & Key Terms
- Baseline: A reference measurement before optimization.
- HITM: Cache-to-cache transfer of a modified line.
- numa_maps: Per-process mapping of pages to nodes.
- STREAM: Standard bandwidth benchmark.
- MLC: Intel Memory Latency Checker.
Mental Model Diagram
Benchmark -> Measurements -> Interpretation -> Optimization
How It Works (Step-by-Step)
- Record topology (lscpu, numactl --hardware).
- Pin threads and choose memory policy.
- Run latency and bandwidth benchmarks.
- Inspect numastat and perf counters.
- Compare to baseline and iterate.
Minimal Concrete Example
# Measure remote vs local latency with numactl
$ numactl --cpunodebind=0 --membind=0 ./latency_bench
$ numactl --cpunodebind=0 --membind=1 ./latency_bench
Common Misconceptions
- “One run is enough.” (NUMA results vary; always repeat.)
- “If latency improved, bandwidth must too.” (They trade off.)
- “perf c2c always works.” (Requires hardware support and correct events.)
Check-Your-Understanding Questions
- Why does pointer-chasing measure latency effectively?
- What does numastat tell you that perf doesn’t?
- Why is interleaving useful for bandwidth tests?
Check-Your-Understanding Answers
- It creates dependent loads, preventing MLP and exposing raw latency.
- numastat shows per-node memory placement and access counts.
- Interleaving spreads pages across nodes to use more memory channels.
Real-World Applications
- Performance regression testing in data centers
- Capacity planning for in-memory databases
- Benchmarking HPC nodes for workload placement
Where You Will Apply It
- Project 2: Memory Latency Microbenchmark
- Project 3: Memory Bandwidth Benchmark
- Project 4: False Sharing Detector
- Project 7: NUMA Memory Migration Tool
References
- https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html (Intel MLC)
- https://www.cs.virginia.edu/~mccalpin/papers/bandwidth/node2.html (STREAM benchmark)
- https://man7.org/linux/man-pages/man8/numastat.8.html (numastat)
- https://man7.org/linux/man-pages/man1/perf-c2c.1.html (perf c2c)
Key Insights
NUMA optimization is measurement-driven; without evidence, you are guessing.
Summary
You now know the tools and methodology for measuring locality, latency, bandwidth, and coherence. You will use these in every project.
Homework/Exercises to Practice the Concept
- Measure local and remote latency with a pointer-chasing benchmark.
- Run STREAM-like kernels under --membind and --interleave.
- Use perf c2c to identify a false-sharing hotspot.
Solutions to the Homework/Exercises
- Use numactl --cpunodebind and --membind to force locality.
- Interleave should increase bandwidth by using multiple nodes.
- The cache line with highest HITM count is the likely hotspot.
Glossary (High-Signal)
- NUMA node: A set of CPUs with faster access to a subset of memory.
- UMA: Uniform memory access; all memory latency is equal.
- ccNUMA: Cache-coherent NUMA with hardware coherence across nodes.
- Distance matrix: Relative access cost between nodes.
- First-touch: Allocation policy based on the first faulting CPU.
- Affinity: Restricting a thread to specific CPUs.
- Interleave: Spread pages across nodes for bandwidth.
- False sharing: Unrelated data sharing a cache line, causing contention.
- HITM: Cache-to-cache transfer of a modified line.
Why NUMA vs UMA Matters
The Modern Problem It Solves
Modern CPUs have far more cores than a single memory controller can feed. NUMA solves this by distributing memory controllers across sockets, but it introduces non-uniform access times. In practice this means two machines with the same CPU count can have dramatically different performance depending on memory placement and thread scheduling.
Real-world impact (examples with sources and year):
- NUMA-aware memory management can deliver multi-x speedups: JArena reports up to 4.3x improvement on a 256-core ccNUMA node (2019). (Source: JArena paper)
- NUMA-aware data structures can be ~2x faster: SmartPQ reports 1.87x average improvement over a NUMA-oblivious queue (2024). (Source: SmartPQ paper)
- Latency gaps are huge even within one machine: “Latency numbers” show L1 cache ~0.5 ns vs main memory ~100 ns (2025 update), and NUMA adds an additional remote penalty on top of that. (Source: latency numbers)
- Local vs remote measurements are explicit: Intel MLC documentation reports separate local and cross-socket latency/bandwidth matrices, making locality differences visible (2025). (Source: Intel MLC)
Sources:
- https://arxiv.org/abs/1902.07590
- https://arxiv.org/abs/2406.06900
- https://gist.github.com/jboner/2841832/82fe487e68d6fcc354eac5cec5c72975b32493d0
- https://www.intel.com/content/www/us/en/download/736633/intel-memory-latency-checker-intel-mlc.html
        OLD (UMA)                          NEW (NUMA)
┌───────────────────────┐       ┌────────────┐    ┌────────────┐
│ Single memory pool    │       │   Node 0   │    │   Node 1   │
│ Uniform latency       │       │ CPUs+DRAM  │    │ CPUs+DRAM  │
└───────────┬───────────┘       └─────┬──────┘    └─────┬──────┘
            |                         |                 |
            v                         +--interconnect---+
    Scalability wall                    (remote penalty)
Context & Evolution (Optional)
As core counts grew, shared buses and uniform memory architectures became bottlenecks. NUMA evolved to scale memory bandwidth with core count, but the cost is a more complex performance model that depends on placement and locality.
Concept Summary Table
This section provides a map of the mental models you will build during these projects.
| Concept Cluster | What You Need to Internalize |
|---|---|
| Memory Hierarchy | How latency and bandwidth interact with caches and DRAM |
| UMA vs NUMA | Why topology changes access costs and scaling behavior |
| Cache Coherence | How false sharing and HITM traffic destroy scalability |
| Topology Discovery | How Linux exposes nodes, CPUs, and distances |
| Memory Policies | How placement (bind, preferred, interleave) changes performance |
| Thread Placement | How affinity and scheduling determine locality |
| NUMA-Aware Data Structures | How sharding/replication reduce remote access |
| Measurement & Profiling | How to measure, validate, and iterate |
Project-to-Concept Map
| Project | What It Builds | Primer Chapters It Uses |
|---|---|---|
| Project 1: NUMA Topology Explorer | Topology discovery tool | Ch. 2, 4 |
| Project 2: Memory Latency Microbenchmark | Local vs remote latency data | Ch. 1, 2, 5, 8 |
| Project 3: Memory Bandwidth Benchmark | Bandwidth measurements | Ch. 1, 5, 8 |
| Project 4: False Sharing Detector | Cacheline contention analysis | Ch. 1, 3, 8 |
| Project 5: NUMA-Aware Memory Allocator | Per-node arenas | Ch. 5, 6, 7 |
| Project 6: NUMA-Aware Data Structures | Sharded and replicated structures | Ch. 3, 7 |
| Project 7: NUMA Memory Migration Tool | Page migration + verification | Ch. 5, 8 |
| Project 8: Cache Coherence Simulator | MESI/MOESI simulation | Ch. 3 |
| Project 9: NUMA-Aware Thread Pool | Locality-aware scheduling | Ch. 2, 6, 7 |
| Project 10: NUMA-Aware Database Buffer Pool | Locality-aware cache engine | Ch. 1, 2, 5, 7 |
Deep Dive Reading by Concept
Fundamentals & Architecture
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Memory hierarchy | Computer Systems: A Programmer’s Perspective — Ch. 6 | Cache and memory behavior from a programmer view |
| Memory hierarchy | Computer Architecture (Hennessy & Patterson) — Ch. 2 | Formal memory hierarchy models |
| UMA/NUMA basics | Computer Organization and Design (RISC-V) — Ch. 6 | Architectural foundations for multiprocessors |
OS and Memory Policies
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Virtual memory & paging | Operating Systems: Three Easy Pieces — Ch. 13-21 | Page faults and placement mechanics |
| Scheduling & affinity | Operating Systems: Three Easy Pieces — Ch. 6-8 | Scheduling decisions and locality |
| Linux memory management | The Linux Programming Interface — Ch. 6, 49 | Practical OS interfaces for memory |
Concurrency & Coherence
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Concurrency & caches | Computer Systems: A Programmer’s Perspective — Ch. 12 | Synchronization and memory behavior |
| Concurrent data structures | The Art of Multiprocessor Programming — Ch. 3-5 | Locking, contention, and scalability |
| Multiprocessor coherence | Computer Architecture (Hennessy & Patterson) — Ch. 5-6 | Coherence protocols and scaling |
Allocators & Data Structures
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Memory allocation | C Interfaces and Implementations — Ch. 5-6 | Allocator internals and tradeoffs |
| Performance tuning | Computer Systems: A Programmer’s Perspective — Ch. 5 | Measurement-driven optimization |
Quick Start: Your First 48 Hours
Feeling overwhelmed? Start here instead of reading everything:
Day 1 (4 hours):
- Skim “Chapter 2: UMA vs NUMA” and “Chapter 4: Topology Discovery”
- Run numactl --hardware and lscpu on your machine
- Start Project 1 and get a minimal topology report
- Ignore distance matrix parsing for now; just print nodes and CPU lists
Day 2 (4 hours):
- Add distance matrix parsing to Project 1
- Compare your output with numactl --hardware
- Read the “Core Question” section for Project 1
- Capture a screenshot of your topology output for your notes
End of Weekend: You now understand the shape of your system and can explain what a NUMA node is. That mental model will make every later project clearer.
Recommended Learning Paths
Path 1: The Systems Performance Engineer (Recommended Start)
Best for: Developers who care about latency and throughput
- Project 1 (Topology Explorer)
- Project 2 (Latency Benchmark)
- Project 3 (Bandwidth Benchmark)
- Project 4 (False Sharing Detector)
- Project 5 (NUMA-Aware Allocator)
Path 2: The Concurrency Specialist
Best for: Developers building multi-threaded systems
- Project 1
- Project 4
- Project 6
- Project 8
- Project 9
Path 3: The Database / Data Engineer
Best for: Developers building storage engines or analytics
- Project 1
- Project 2
- Project 5
- Project 9
- Project 10
Path 4: The Completionist
Best for: Those building a full NUMA performance lab
Phase 1: Foundation (Weeks 1-2)
- Project 1, 2, 3
Phase 2: Coherence & Measurement (Weeks 3-4)
- Project 4, 8
Phase 3: Placement & Structures (Weeks 5-8)
- Project 5, 6, 7
Phase 4: Integration (Weeks 9-12)
- Project 9, 10
Success Metrics
By the end of this guide, you should be able to:
- Explain and measure the latency difference between local and remote memory on your machine
- Produce a topology report that matches numactl --hardware
- Demonstrate false sharing and eliminate it with padding
- Show measurable improvements from NUMA-aware allocation (lower remote access %, higher throughput)
- Build a NUMA-aware thread pool that keeps >95% of tasks local
- Explain, with numbers, why a given workload is latency-bound or bandwidth-bound
Appendix: Linux NUMA Tooling Cheatsheet
# Topology
lscpu
numactl --hardware
lstopo
# Policies
numactl --membind=0 --cpunodebind=0 ./app
numactl --interleave=all ./app
# Observability
numastat -p <pid>
cat /proc/<pid>/numa_maps
perf c2c record -- ./app
perf c2c report
Project Overview Table
| Project | Focus | Difficulty | Time | Outcome |
|---|---|---|---|---|
| 1. NUMA Topology Explorer | Topology discovery | Beginner | Weekend | Topology report tool |
| 2. Memory Latency Microbenchmark | Latency measurement | Intermediate | 1 week | Local vs remote latency data |
| 3. Memory Bandwidth Benchmark | Bandwidth measurement | Intermediate | 1 week | STREAM-like bandwidth report |
| 4. False Sharing Detector | Coherence analysis | Intermediate | 1 week | HITM/false-sharing report |
| 5. NUMA-Aware Memory Allocator | Placement control | Advanced | 2-3 weeks | Per-node allocator |
| 6. NUMA-Aware Data Structures | Locality-aware DS | Advanced | 3-4 weeks | Sharded/replicated library |
| 7. NUMA Memory Migration Tool | Page migration | Advanced | 2 weeks | Page relocation tool |
| 8. Cache Coherence Simulator | MESI/MOESI model | Expert | 2-3 weeks | Coherence simulator |
| 9. NUMA-Aware Thread Pool | Scheduling | Advanced | 2-3 weeks | Locality-aware pool |
| 10. NUMA-Aware Database Buffer Pool | Integration | Expert | 4-6 weeks | NUMA-aware cache engine |
Project List
Project 1: NUMA Topology Explorer
- Main Programming Language: C
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: NUMA Topology / System Information
- Software or Tool: libnuma, Linux sysfs, lscpu
- Main Book: “Computer Architecture” by Hennessy & Patterson
What you’ll build: A CLI tool that discovers and prints NUMA nodes, CPU lists, per-node memory sizes, and the distance matrix. It also reports cache information and compares its output to numactl --hardware.
Why it teaches NUMA basics: You cannot optimize what you cannot see. This project forces you to learn how Linux exposes NUMA topology and how tools derive their output.
Core challenges you’ll face:
- Reading sysfs → mapping node directories and parsing CPU lists
- Using libnuma → programmatic topology queries
- Distance matrix → interpreting relative costs and presenting them clearly
- Validation → matching output with numactl --hardware
Real World Outcome
You will run your tool and get a report like this:
$ ./numa-topology
=== NUMA Topology Report ===
Nodes: 2
CPUs: 16
Node 0:
CPUs: 0-7
Memory: 32768 MB (28912 MB free)
Distance to Node 0: 10
Distance to Node 1: 21
Node 1:
CPUs: 8-15
Memory: 32768 MB (30120 MB free)
Distance to Node 0: 21
Distance to Node 1: 10
Distance Matrix:
N0 N1
N0: 10 21
N1: 21 10
Validation:
numactl --hardware: MATCH
The Core Question You’re Answering
“How do I discover and trust the real NUMA topology of a Linux machine?”
This question matters because every placement decision depends on topology. If your model of the machine is wrong, your optimizations will be too.
Concepts You Must Understand First
- NUMA nodes and distance matrices
- What does a distance value represent?
- Why is it relative rather than absolute?
- Book Reference: “Computer Architecture” — Ch. 2
- Linux sysfs topology
- Where do /sys/devices/system/node/* files come from?
- What does node*/cpulist encode?
- Book Reference: “The Linux Programming Interface” — Ch. 6
- libnuma basics
- What does numa_available() return?
- How do you map nodes to CPUs?
- Book Reference: “Linux System Programming” — Ch. 6
Questions to Guide Your Design
- Topology accuracy
- Will you trust libnuma, sysfs parsing, or both?
- How will you handle systems without NUMA?
- Parsing and representation
- How will you parse CPU ranges like “0-3,8-11”?
- How will you represent distance matrices in a readable table?
- Validation
- How will you compare your output to numactl --hardware?
- What is your fallback if the outputs differ?
Thinking Exercise
The “CPULIST Puzzle”
Given this cpulist:
0-3,8-11,16,18-19
- How many CPUs is that?
- Which CPUs are contiguous ranges?
- How would you encode this in a bitmask?
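One hedged way to approach the bitmask question is a small parser sketch. The 64-CPU limit and the plain uint64_t mask are simplifying assumptions; real tools use arbitrary-width masks such as cpu_set_t or libnuma bitmasks.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Parse a Linux cpulist such as "0-3,8-11,16,18-19" into a 64-bit mask.
uint64_t parse_cpulist(const char *s) {
    uint64_t mask = 0;
    while (*s) {
        char *end;
        long lo = strtol(s, &end, 10);
        long hi = lo;
        if (*end == '-') hi = strtol(end + 1, &end, 10);   // range like "0-3"
        for (long cpu = lo; cpu <= hi && cpu < 64; cpu++)
            mask |= 1ULL << cpu;
        s = (*end == ',') ? end + 1 : end;
    }
    return mask;
}

int main(void) {
    uint64_t m = parse_cpulist("0-3,8-11,16,18-19");
    printf("mask = 0x%llx, CPUs = %d\n",
           (unsigned long long)m, __builtin_popcountll(m));
    return 0;
}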
The Interview Questions They’ll Ask
- “How does Linux expose NUMA topology to user space?”
- “What is a NUMA distance matrix and how is it used?”
- “Why might node distance be different across platforms?”
- “How would you handle systems without NUMA support?”
- “How do cpusets affect what your tool reports?”
Hints in Layers
Hint 1: Starting Point Use libnuma to check if NUMA is available.
if (numa_available() < 0) { /* handle no NUMA */ }
Hint 2: Next Step
Use numa_max_node() and numa_node_to_cpus() to map CPUs to nodes.
Hint 3: Key Technique
Parse /sys/devices/system/node/nodeN/distance for the matrix.
Hint 4: Debugging/Verification Compare your output with:
numactl --hardware
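Putting Hint 3 into practice, a sketch that reads one distance row from sysfs might look like this. The path follows the standard Linux layout; error handling is kept minimal on purpose.
#include <stdio.h>

// Print the distance row for one node, as stored in sysfs.
void print_distance_row(int node) {
    char path[128];
    snprintf(path, sizeof path, "/sys/devices/system/node/node%d/distance", node);
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return; }
    int d;
    printf("N%d:", node);
    while (fscanf(f, "%d", &d) == 1)    // the file holds one space-separated row
        printf(" %3d", d);
    printf("\n");
    fclose(f);
}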
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| NUMA basics | “Computer Architecture” by Hennessy & Patterson | Ch. 2 |
| Linux sysfs | “The Linux Programming Interface” by Kerrisk | Ch. 6 |
| CPU topology | “Linux System Programming” by Love | Ch. 6 |
Common Pitfalls & Debugging
Problem 1: “NUMA not available”
- Why: Kernel not compiled with NUMA or running in a VM without NUMA exposure
- Fix: Test on bare metal or enable NUMA in VM settings
- Quick test: lscpu | grep NUMA
Problem 2: “CPU list parsing wrong”
- Why: Not handling ranges like 0-3,8-11
- Fix: Implement robust range parsing
- Quick test: Compare count with lscpu
Problem 3: “Distances look weird”
- Why: Interpreting relative numbers as absolute latency
- Fix: Label them as relative costs
- Verification: Compare to numactl --hardware
Definition of Done
- Tool reports correct node count, CPU lists, and memory sizes
- Distance matrix matches numactl --hardware
- Handles non-NUMA systems gracefully
- Output is readable and consistent
Project 2: Memory Latency Microbenchmark
- Main Programming Language: C
- Alternative Programming Languages: Rust, Assembly
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Memory Performance / Benchmarking
- Software or Tool: rdtsc/rdtscp, libnuma, numactl
- Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron
What you’ll build: A microbenchmark that measures latency of local vs remote memory using pointer chasing. It outputs latency in cycles and nanoseconds for each node pair.
Why it teaches memory performance: Latency numbers are abstract until you measure them. This project forces you to build a reliable benchmark and validates the NUMA penalty on your machine.
Core challenges you’ll face:
- Accurate timing → using rdtsc/rdtscp correctly
- Eliminating MLP → pointer-chasing to force serialized loads
- Placement control → binding memory and threads to specific nodes
- Noise reduction → pinning threads and repeating runs
Real World Outcome
$ numactl --cpunodebind=0 --membind=0 ./latency_bench
Latency (cycles @ 3.0 GHz):
Node 0 -> Node 0: 240 cycles (80 ns)
Node 0 -> Node 1: 390 cycles (130 ns)
$ numactl --cpunodebind=0 --membind=1 ./latency_bench
Node 0 -> Node 1: 395 cycles (132 ns)
The Core Question You’re Answering
“How much slower is remote memory on my machine, and how do I measure it correctly?”
Concepts You Must Understand First
- Pointer chasing and MLP
- Why does it serialize loads?
- How does it expose latency?
- Book Reference: CS:APP Ch. 6
- NUMA policies
- How does --membind affect allocation?
- What is first-touch?
- Book Reference: OSTEP Ch. 13-21
- Timing and rdtsc
- How do you convert cycles to nanoseconds?
- How do you avoid out-of-order noise?
- Book Reference: CS:APP Ch. 5
Questions to Guide Your Design
- How will you prevent prefetching from hiding latency?
- How will you ensure the benchmark is pinned to one node?
- How will you convert cycles into time reliably?
Thinking Exercise
Latency Trap
If you measure a simple for loop over an array, why might you see bandwidth rather than latency? How does pointer chasing fix that?
The Interview Questions They’ll Ask
- “Why is pointer chasing used for latency measurement?”
- “How does rdtsc differ from clock_gettime?”
- “What is first-touch and how does it affect your benchmark?”
- “Why do you need to pin the thread?”
- “What makes latency benchmarks tricky to trust?”
Hints in Layers
Hint 1: Starting Point Allocate a large array and create a random pointer chain.
Hint 2: Next Step
Use rdtscp to serialize timing around the load.
Hint 3: Key Technique
Pin the benchmark thread with sched_setaffinity or numactl.
Hint 4: Debugging/Verification
Compare results with and without --membind to verify NUMA penalty.
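Combining the hints, a skeleton that pins the thread and brackets the measured region with rdtscp could look like the sketch below. Pinning to CPU 0 and the x86 intrinsic are assumptions; on other architectures, fall back to clock_gettime. The pointer-chase loop from Hint 1 goes in the marked spot.
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                            // pin this thread to CPU 0 (assumed)
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    unsigned aux;
    uint64_t start = __rdtscp(&aux);             // timestamp read before the measured region
    /* ... pointer-chase loop goes here ... */
    uint64_t end = __rdtscp(&aux);
    printf("elapsed cycles: %llu\n", (unsigned long long)(end - start));
    return 0;
}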
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Memory hierarchy | CS:APP | Ch. 6 |
| Performance measurement | CS:APP | Ch. 5 |
| Virtual memory | OSTEP | Ch. 13-21 |
Common Pitfalls & Debugging
Problem 1: “Latency seems too low”
- Why: Prefetching or MLP hiding latency
- Fix: Use pointer chasing and dependent loads
- Quick test: Increase stride and see latency jump
Problem 2: “Results vary wildly”
- Why: CPU frequency scaling or thread migration
- Fix: Pin the thread and disable scaling
- Quick test: Compare results across 10 runs
Problem 3: “Remote is not slower”
- Why: Memory not actually bound to remote node
- Fix: Use numactl --membind and validate with numastat
Definition of Done
- Benchmark measures stable local and remote latencies
- Local and remote results differ by roughly 20-50% on a real NUMA system
- Thread is pinned and placement validated
- Output includes cycles and nanoseconds
Project 3: Memory Bandwidth Benchmark
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 3: Genuinely Useful
- Business Potential: 2. The “Internal Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Memory Bandwidth / Benchmarking
- Software or Tool: STREAM-style kernels, numactl
- Main Book: “Computer Architecture” by Hennessy & Patterson
What you’ll build: A STREAM-like benchmark that measures copy/scale/add/triad bandwidth across NUMA nodes and policies.
Why it teaches bandwidth: NUMA doesn’t just change latency; it changes how many memory channels you can use. This project shows how interleaving and placement affect bandwidth.
Core challenges you’ll face:
- Large working sets → ensuring caches are bypassed
- Policy control → binding or interleaving memory
- Thread placement → using one thread vs many threads
Real World Outcome
$ numactl --cpunodebind=0 --membind=0 ./bandwidth_bench
COPY: 38.2 GB/s
SCALE: 36.9 GB/s
SUM: 35.1 GB/s
TRIAD: 34.8 GB/s
$ numactl --cpunodebind=0 --interleave=all ./bandwidth_bench
COPY: 60.4 GB/s
SCALE: 58.7 GB/s
SUM: 56.9 GB/s
TRIAD: 56.1 GB/s
The Core Question You’re Answering
“How does memory placement change achievable bandwidth?”
Concepts You Must Understand First
- Bandwidth vs latency
- Why does interleaving improve bandwidth?
- Book Reference: CS:APP Ch. 6
- STREAM kernels
- What do copy/scale/sum/triad represent?
- Book Reference: Computer Architecture Ch. 2
- NUMA policies
- When should you interleave vs bind?
- Book Reference: OSTEP Ch. 13-21
Questions to Guide Your Design
- What array sizes ensure you are measuring DRAM, not cache?
- How will you control thread count and affinity?
- How will you report results across nodes?
Thinking Exercise
If you run a single thread on node 0 with memory interleaved across nodes, what happens to latency and bandwidth? Why?
The Interview Questions They’ll Ask
- “What is the STREAM benchmark measuring?”
- “Why do you need large arrays for bandwidth measurement?”
- “How does interleaving affect bandwidth?”
- “What does a TRIAD kernel represent?”
Hints in Layers
Hint 1: Starting Point Implement copy/scale/sum/triad loops over large arrays.
Hint 2: Next Step
Time each kernel with clock_gettime and compute GB/s.
Hint 3: Key Technique
Use numactl --interleave=all to maximize bandwidth.
Hint 4: Debugging/Verification
Compare your numbers with STREAM results from similar hardware.
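A minimal triad kernel with a GB/s calculation might look like the sketch below; the ~512 MiB per-array size and single-threaded execution are assumptions to adjust for your machine.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1UL << 26)   // 64M doubles per array (~512 MiB each), well beyond the LLC

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }   // first-touch initialization

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];                // TRIAD: a = b + scalar * c
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = 3.0 * N * sizeof(double);     // two reads and one write per element
    printf("TRIAD: %.2f GB/s (a[0]=%.1f)\n", bytes / sec / 1e9, a[0]);
    free(a); free(b); free(c);
    return 0;
}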
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Memory bandwidth | Computer Architecture | Ch. 2 |
| Performance | CS:APP | Ch. 5 |
| Virtual memory | OSTEP | Ch. 13-21 |
Common Pitfalls & Debugging
Problem 1: “Bandwidth looks too high”
- Why: Arrays fit in cache
- Fix: Increase array size beyond LLC
- Quick test: Double size and see if bandwidth drops
Problem 2: “Results vary per run”
- Why: Thread migration or CPU scaling
- Fix: Pin threads and disable scaling
- Quick test: Run 10 times and compare variance
Problem 3: “Interleave slower than bind”
- Why: Single-threaded latency penalty outweighs bandwidth gains
- Fix: Use multiple threads to saturate channels
Definition of Done
- Benchmark reports copy/scale/sum/triad GB/s
- Shows measurable difference between bind and interleave
- Uses data sizes larger than LLC
- Runs are repeatable with low variance
Project 4: False Sharing Detector
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Internal Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Cache Coherence / Performance Profiling
- Software or Tool: perf c2c, perf stat
- Main Book: “The Art of Multiprocessor Programming” by Herlihy & Shavit
What you’ll build: A microbenchmark that creates false sharing and a profiling workflow that detects it using perf c2c.
Why it teaches coherence: False sharing is one of the most common performance killers in multi-core systems, and NUMA magnifies it.
Core challenges you’ll face:
- Generating false sharing → designing a workload that contends on one cache line
- Using perf c2c → recording and interpreting HITM events
- Fixing the issue → padding or reordering data to eliminate sharing
Real World Outcome
$ ./false_sharing_bench --threads=8 --padding=0
Throughput: 120 M ops/sec
$ perf c2c record -- ./false_sharing_bench --threads=8 --padding=0
$ perf c2c report
# Hottest cache line:
# 0x7f8c12345000 HITM: 92% CPUs: 0,1,2,3,8,9,10,11
$ ./false_sharing_bench --threads=8 --padding=64
Throughput: 680 M ops/sec
The Core Question You’re Answering
“How do I detect and eliminate false sharing on real hardware?”
Concepts You Must Understand First
- Cache lines and coherence
- Why does sharing a line matter?
- Book Reference: Art of Multiprocessor Programming Ch. 3-4
- HITM events
- What does HITM mean?
- Book Reference: Computer Architecture Ch. 5
- perf c2c workflow
- How do you record and interpret the report?
- Book Reference: CS:APP Ch. 5
Questions to Guide Your Design
- How will you generate predictable false sharing?
- What counters or tools confirm it?
- How will you prove the fix works?
Thinking Exercise
If each thread updates its own counter, why might they still contend? Sketch the cache line layout.
The Interview Questions They’ll Ask
- “What is false sharing and how do you fix it?”
- “What does HITM mean in cache-coherence profiling?”
- “Why is false sharing worse on NUMA?”
- “How would you detect false sharing without perf c2c?”
Hints in Layers
Hint 1: Starting Point Put thread counters in an array and increment them in a tight loop.
Hint 2: Next Step
Use perf c2c record while running the benchmark.
Hint 3: Key Technique Pad each counter to 64 bytes and rerun.
Hint 4: Debugging/Verification Compare throughput and HITM rates before and after padding.
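A stripped-down version of the workload the hints describe is sketched below. The thread count, iteration count, and the -DPAD=1 compile-time switch are assumptions for illustration.
#include <pthread.h>           // link with -lpthread
#include <stdint.h>
#include <stdio.h>

#define THREADS 8
#define ITERS   100000000UL
#ifndef PAD
#define PAD 0                  // build with -DPAD=1 to give each counter its own line
#endif

#if PAD
struct counter { uint64_t value; char pad[56]; } __attribute__((aligned(64)));
#else
struct counter { uint64_t value; };   // adjacent counters share cache lines -> false sharing
#endif

static struct counter counters[THREADS];

static void *worker(void *arg) {
    struct counter *c = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        c->value++;                    // each thread writes only its own counter
    return NULL;
}

int main(void) {
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, &counters[i]);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter[0] = %llu\n", (unsigned long long)counters[0].value);
    return 0;
}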
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Coherence | Art of Multiprocessor Programming | Ch. 3-4 |
| Performance profiling | CS:APP | Ch. 5 |
Common Pitfalls & Debugging
Problem 1: “perf c2c shows no data”
- Why: Hardware support missing or kernel not configured
- Fix: Check perf c2c -e list and kernel version
- Quick test: Use perf stat to ensure counters work
Problem 2: “Padding did not help”
- Why: Padding size is smaller than cache line
- Fix: Align to 64 bytes or use alignas(64)
Problem 3: “No HITM events”
- Why: Workload not actually sharing lines
- Fix: Verify counters are adjacent in memory
Definition of Done
- Benchmark reliably produces false sharing
- perf c2c identifies the hot cache line
- Padding eliminates HITM-heavy line
- Throughput improvement is measurable
Project 5: NUMA-Aware Memory Allocator
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Memory Allocation / NUMA Policies
- Software or Tool: libnuma, mbind, set_mempolicy
- Main Book: “C Interfaces and Implementations” by David Hanson
What you’ll build: A per-node allocator that maintains separate arenas for each NUMA node and exposes APIs like numa_malloc(node, size).
Why it teaches NUMA placement: Allocators are where placement decisions become real. This project makes you control where memory lives rather than relying on first-touch.
Core challenges you’ll face:
- Per-node arenas → managing separate pools
- Policy control → using libnuma to place pages
- Thread-local fast paths → reducing contention
- Statistics → measuring local vs remote allocation ratio
Real World Outcome
$ ./numa_alloc_demo --threads=8
Allocator stats:
Node 0: 3.2 GB allocated, 98.1% local
Node 1: 3.1 GB allocated, 97.6% local
Remote allocations: 2.1%
$ ./numa_alloc_demo --policy=interleave
Node 0: 3.1 GB allocated, 50.2% local
Node 1: 3.2 GB allocated, 49.8% local
The Core Question You’re Answering
“How do I guarantee that memory allocations land on the node I want?”
Concepts You Must Understand First
- Memory policies (bind/preferred/interleave)
- How do policies affect allocation?
- Book Reference: OSTEP Ch. 13-21
- libnuma allocation APIs
- numa_alloc_onnode, numa_alloc_interleaved
- Book Reference: TLPI Ch. 6
- Allocator design
- Why do arenas reduce contention?
- Book Reference: C Interfaces and Implementations Ch. 5-6
Questions to Guide Your Design
- How will you map threads to arenas?
- How will you handle cross-node allocations?
- What stats will you record to prove placement?
Thinking Exercise
Design a layout where each NUMA node has its own arena and a shared fallback arena. What are the pros and cons?
The Interview Questions They’ll Ask
- “How does first-touch differ from explicit allocation on a node?”
- “Why do allocators use per-thread or per-node arenas?”
- “How would you measure allocator locality?”
- “What happens if a node runs out of memory?”
Hints in Layers
Hint 1: Starting Point
Start with a simple wrapper around numa_alloc_onnode.
Hint 2: Next Step Add per-node free lists and reuse freed blocks.
Hint 3: Key Technique Keep per-thread caches for small allocations and refill from node arena.
Hint 4: Debugging/Verification
Use numastat -p <pid> to confirm local allocation percentages.
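Putting the hints together, one possible starting point is a per-node bump arena: a large numa_alloc_onnode region per node, carved out with an atomic offset. The region size, node limit, and the absence of free() are simplifying assumptions, not the project's required design.
#define _GNU_SOURCE                // for sched_getcpu()
#include <numa.h>                  // link with -lnuma
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define ARENA_BYTES (64UL << 20)   // 64 MiB per node (assumed)
#define MAX_NODES   8

struct arena {
    char *base;
    atomic_size_t offset;
};

static struct arena arenas[MAX_NODES];

void arenas_init(void) {
    int nodes = numa_max_node() + 1;
    for (int n = 0; n < nodes && n < MAX_NODES; n++) {
        arenas[n].base = numa_alloc_onnode(ARENA_BYTES, n);
        atomic_init(&arenas[n].offset, 0);
    }
}

void *numa_malloc_local(size_t size) {
    int node = numa_node_of_cpu(sched_getcpu());
    if (arenas[node].base == NULL)
        return NULL;                                   // real allocator: fall back to another node
    size = (size + 63) & ~(size_t)63;                  // 64-byte align to avoid false sharing
    size_t off = atomic_fetch_add(&arenas[node].offset, size);
    if (off + size > ARENA_BYTES)
        return NULL;                                   // real allocator: refill the arena
    return arenas[node].base + off;
}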
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Allocation internals | C Interfaces and Implementations | Ch. 5-6 |
| Memory policies | OSTEP | Ch. 13-21 |
| Performance tuning | CS:APP | Ch. 5 |
Common Pitfalls & Debugging
Problem 1: “Allocations still remote”
- Why: Thread not pinned to the target node
- Fix: Pin thread or set policy before allocation
- Quick test: Compare with numactl --membind
Problem 2: “Performance worse than malloc”
- Why: Contention or poor free list design
- Fix: Add per-thread caches and reduce locks
Problem 3: “Out of memory on node”
- Why: Strict binding with insufficient free pages
- Fix: Use preferred policy with fallback or handle errors
Definition of Done
- Allocator can allocate from a specified node
- Per-node arenas reduce cross-node allocations
- Local allocation ratio >95% under typical workload
- Fallback handling for node exhaustion
Project 6: NUMA-Aware Data Structure Library
- Main Programming Language: C++
- Alternative Programming Languages: C, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Concurrency / Data Structures
- Software or Tool: pthreads, libnuma
- Main Book: “The Art of Multiprocessor Programming” by Herlihy & Shavit
What you’ll build: A library of NUMA-aware data structures (e.g., sharded hash map, per-node queue, replicated read cache).
Why it teaches NUMA-aware design: Data structure layout determines memory locality and coherence traffic. This project forces you to design for locality rather than naive sharing.
Core challenges you’ll face:
- Sharding → mapping keys or tasks to nodes
- Replication → maintaining consistency across nodes
- Contention management → reducing shared hotspots
Real World Outcome
$ ./ds_bench --structure=sharded_map --threads=16
Throughput: 4.2 M ops/sec
Remote reads: 3.1%
$ ./ds_bench --structure=global_map
Throughput: 1.7 M ops/sec
Remote reads: 41.6%
The Core Question You’re Answering
“How do I design data structures so that threads mostly touch local memory?”
Concepts You Must Understand First
- Cache coherence and false sharing
- Why does a global lock destroy scalability?
- Book Reference: Art of Multiprocessor Programming Ch. 3-4
- NUMA-aware allocation
- How does per-node allocation change locality?
- Book Reference: C Interfaces and Implementations Ch. 5-6
- Sharding/Replication
- When should you shard vs replicate?
- Book Reference: CS:APP Ch. 12
Questions to Guide Your Design
- How will you map keys to nodes?
- How will you handle cross-node operations?
- What metrics will show locality improvements?
Thinking Exercise
If you have a shared queue used by all nodes, how can you redesign it so that 90% of operations are local?
The Interview Questions They’ll Ask
- “What is the difference between sharding and replication?”
- “How does NUMA affect lock contention?”
- “How would you design a NUMA-aware queue?”
- “What metrics show success?”
Hints in Layers
Hint 1: Starting Point Implement a per-node shard and route operations by node ID.
Hint 2: Next Step Introduce a global aggregator for cross-node operations.
Hint 3: Key Technique Use padding and per-node counters to avoid false sharing.
Hint 4: Debugging/Verification
Track remote access percentage with numastat.
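As a sketch of Hint 1, one shard per NUMA node can be allocated on its own node and operations routed by the caller's node. The fixed-size table is a placeholder for a real hash map, and the node limit is an assumption; link with -lnuma -lpthread.
#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

#define SHARD_SLOTS 4096
#define MAX_NODES   8

struct shard {
    pthread_mutex_t lock;
    uint64_t keys[SHARD_SLOTS];
    uint64_t vals[SHARD_SLOTS];
};

static struct shard *shards[MAX_NODES];   // each shard lives on its own node

void shards_init(void) {
    for (int n = 0; n <= numa_max_node() && n < MAX_NODES; n++) {
        shards[n] = numa_alloc_onnode(sizeof(struct shard), n);
        pthread_mutex_init(&shards[n]->lock, NULL);
    }
}

void shard_put(uint64_t key, uint64_t val) {
    struct shard *s = shards[numa_node_of_cpu(sched_getcpu())];  // route to the local shard
    pthread_mutex_lock(&s->lock);
    size_t i = key % SHARD_SLOTS;         // placeholder probe; real code handles collisions
    s->keys[i] = key;
    s->vals[i] = val;
    pthread_mutex_unlock(&s->lock);
}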
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Concurrent data structures | Art of Multiprocessor Programming | Ch. 3-5 |
| Concurrency | CS:APP | Ch. 12 |
| Allocation internals | C Interfaces and Implementations | Ch. 5-6 |
Common Pitfalls & Debugging
Problem 1: “Shards unevenly loaded”
- Why: Skewed key distribution
- Fix: Use consistent hashing or rebalance
Problem 2: “Replication causes stale reads”
- Why: Update propagation lag
- Fix: Use versioning or epoch-based updates
Problem 3: “Still high remote access”
- Why: Threads not pinned to node of shard
- Fix: Pin threads and align shard placement
Definition of Done
- Library includes at least one sharded and one replicated structure
- Local access >90% under node-local workloads
- Throughput improves vs global shared structure
- False sharing eliminated in hot paths
Project 7: NUMA Memory Migration Tool
- Main Programming Language: C
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 3: Genuinely Useful
- Business Potential: 2. The “Internal Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Memory Management / Migration
- Software or Tool: move_pages, migrate_pages
- Main Book: “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau
What you’ll build: A CLI tool that inspects page locations for a process and migrates them to specific nodes, then verifies the result.
Why it teaches NUMA control: It demonstrates that placement is not static and can be corrected after the fact.
Core challenges you’ll face:
- Page location queries → using move_pages with nodes=NULL
- Verification → confirming pages moved to target nodes
Real World Outcome
$ ./numa_migrate --pid 1234 --range 1G --to-node 1
Pages examined: 262144
Moved: 248901
Busy: 1023
Failed: 121
$ ./numa_migrate --pid 1234 --check
Node 0: 4.2% pages
Node 1: 95.8% pages
The Core Question You’re Answering
“Can I repair bad page placement without restarting the process?”
Concepts You Must Understand First
- Page migration APIs
- How does move_pages work?
- Book Reference: OSTEP Ch. 13-21
- NUMA policies
- Why might pages be on the wrong node?
- Book Reference: OSTEP Ch. 13-21
- Permissions and errors
- Why does migration fail with EBUSY or EACCES?
- Book Reference: TLPI Ch. 6
Questions to Guide Your Design
- How will you iterate over pages in a memory range?
- How will you report errors and partial migration?
- How will you verify success?
Thinking Exercise
If you migrate a page but keep threads on the old node, does performance improve or worsen? Why?
The Interview Questions They’ll Ask
- “What is the difference between move_pages and migrate_pages?”
- “Why might page migration fail?”
- “How does page migration interact with memory policy?”
Hints in Layers
Hint 1: Starting Point
Call move_pages with nodes = NULL to query locations.
Hint 2: Next Step
Provide a target nodes array and retry with MPOL_MF_MOVE.
Hint 3: Key Technique
Handle -EBUSY and retry later.
Hint 4: Debugging/Verification
Use /proc/<pid>/numa_maps to validate node distribution.
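Combining Hints 2 and 3, a sketch that migrates a buffer in the current process to one target node and tallies per-page results could look like this. Flags and error handling are kept minimal; link with -lnuma.
#include <errno.h>
#include <numa.h>
#include <numaif.h>            // MPOL_MF_MOVE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int migrate_buffer(void *buf, size_t bytes, int target_node) {
    long page = sysconf(_SC_PAGESIZE);
    unsigned long count = bytes / page;
    void **pages = malloc(count * sizeof *pages);
    int *nodes = malloc(count * sizeof *nodes);
    int *status = malloc(count * sizeof *status);
    for (unsigned long i = 0; i < count; i++) {
        pages[i] = (char *)buf + i * page;
        nodes[i] = target_node;                   // desired node for every page
    }
    long rc = numa_move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE);
    if (rc < 0) perror("numa_move_pages");
    unsigned long moved = 0, busy = 0;
    for (unsigned long i = 0; i < count; i++) {
        if (status[i] == target_node) moved++;
        else if (status[i] == -EBUSY) busy++;     // page locked or under I/O; retry later
    }
    printf("moved %lu/%lu pages, %lu busy\n", moved, count, busy);
    free(pages); free(nodes); free(status);
    return rc < 0 ? -1 : 0;
}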
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Virtual memory | OSTEP | Ch. 13-21 |
| Linux syscalls | TLPI | Ch. 6 |
Common Pitfalls & Debugging
Problem 1: “EPERM on migration”
- Why: Lacking CAP_SYS_NICE for moving other processes’ pages
- Fix: Run as root or target own process
Problem 2: “Many pages EBUSY”
- Why: Pages are in I/O or locked
- Fix: Retry or ignore busy pages
Problem 3: “Pages move back”
- Why: NUMA balancing or policy still favors old node
- Fix: Adjust policy or disable automatic balancing
Definition of Done
- Tool can query page locations
- Tool migrates pages and reports success/failure
- Verification step confirms new placement
- Handles errors gracefully
Project 8: Cache Coherence Simulator
- Main Programming Language: C++
- Alternative Programming Languages: Python, Rust
- Coolness Level: Level 5: Pure Magic
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Cache Coherence / Architecture
- Software or Tool: MESI/MOESI simulator
- Main Book: “Computer Architecture” by Hennessy & Patterson
What you’ll build: A small simulator that models MESI (or MOESI) states across multiple cores and prints state transitions for a sequence of reads/writes.
Why it teaches coherence deeply: Hardware coherence can feel mystical. A simulator makes it deterministic and observable.
Core challenges you’ll face:
- State machine design → implementing MESI transitions
- Bus/directory model → handling read/write requests
- Visualization → making transitions easy to follow
Real World Outcome
$ ./mesi_sim --sequence "R1 W1 R2 W2"
Step 1: R1 -> Line state: E
Step 2: W1 -> Line state: M
Step 3: R2 -> P1:M -> S, P2: S
Step 4: W2 -> P1:I, P2:M
The Core Question You’re Answering
“How do coherence protocols actually move cache lines between cores?”
Concepts You Must Understand First
- MESI state machine
- What do M/E/S/I states mean?
- Book Reference: Computer Architecture Ch. 5
- Bus vs directory
- What is snooping?
- Book Reference: Computer Architecture Ch. 5
- False sharing
- How do writes cause invalidation storms?
- Book Reference: Art of Multiprocessor Programming Ch. 3
Questions to Guide Your Design
- Will you simulate a snooping bus or directory?
- How will you represent caches and their lines?
- How will you visualize transitions?
Thinking Exercise
Simulate two cores writing alternatingly to the same line. How many invalidations occur?
The Interview Questions They’ll Ask
- “Explain the MESI protocol in your own words.”
- “What problem does MOESI solve?”
- “Why do directory protocols scale better?”
- “How does coherence relate to NUMA?”
Hints in Layers
Hint 1: Starting Point Implement a MESI enum and a table of transitions.
Hint 2: Next Step Create a cache object that handles local reads/writes.
Hint 3: Key Technique Implement a bus or directory broadcast for invalidations.
Hint 4: Debugging/Verification Use known MESI examples from textbooks and compare output.
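A hedged sketch of the state machine at the heart of the simulator: given a line's current state in one cache and an event, return the next state. The event names and the other_sharers flag are assumptions, and this deliberately omits writebacks and bus responses.
#include <stdio.h>

typedef enum { M, E, S, I } mesi_t;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } event_t;   // processor read/write, bus read, read-for-ownership

mesi_t next_state(mesi_t s, event_t e, int other_sharers) {
    switch (e) {
    case PR_RD:   return (s == I) ? (other_sharers ? S : E) : s;  // miss fills E or S; hits keep state
    case PR_WR:   return M;                                        // write gains exclusive ownership
    case BUS_RD:  return (s == M || s == E) ? S : s;               // another core reads: downgrade
    case BUS_RDX: return I;                                        // another core writes: invalidate
    }
    return s;
}

int main(void) {
    mesi_t p1 = I, p2 = I;
    p1 = next_state(p1, PR_RD, 0);                 // R1: line fills as E in P1
    p1 = next_state(p1, PR_WR, 0);                 // W1: P1 upgrades to M
    p1 = next_state(p1, BUS_RD, 1);                // R2 seen on the bus: P1 M -> S
    p2 = next_state(p2, PR_RD, 1);                 // R2: P2 fills as S
    printf("after R1 W1 R2: P1=%d P2=%d (0=M,1=E,2=S,3=I)\n", p1, p2);
    return 0;
}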
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Coherence protocols | Computer Architecture | Ch. 5 |
| Concurrency | Art of Multiprocessor Programming | Ch. 3 |
Common Pitfalls & Debugging
Problem 1: “States don’t match expected”
- Why: Missing a transition case
- Fix: Use a table-driven state machine
Problem 2: “Inconsistent sharing”
- Why: Not updating other caches on bus events
- Fix: Broadcast invalidations on write
Problem 3: “Simulator too complex”
- Why: Trying to model real hardware too closely
- Fix: Keep it minimal and educational
Definition of Done
- Simulator implements MESI or MOESI correctly
- State transitions are visible and explainable
- Includes examples for false sharing scenarios
- Can be used as a teaching tool
Project 9: NUMA-Aware Thread Pool
- Main Programming Language: C++
- Alternative Programming Languages: C, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Threading / NUMA Scheduling
- Software or Tool: pthreads, libnuma
- Main Book: “C++ Concurrency in Action” by Anthony Williams
What you’ll build: A thread pool that is topology-aware, with per-node queues and locality-aware work stealing.
Why it teaches NUMA threading: Standard thread pools ignore locality. NUMA-aware pools keep tasks near their data.
Core challenges you’ll face:
- Thread pinning → mapping workers to nodes
- Local queues → keeping tasks local
- Work stealing → prefer local, then nearest node
- Stats → measuring local vs remote execution
Real World Outcome
NUMAThreadPool pool(4); // 4 threads per node
pool.submit(task, preferred_node);
pool.stats();
// Node 0: 98.2% local tasks
// Node 1: 97.5% local tasks
// Cross-node steals: 0.6%
The Core Question You’re Answering
“How do I keep threads and tasks on the same NUMA node?”
Concepts You Must Understand First
- CPU affinity
- How do you pin threads?
- Book Reference: C++ Concurrency in Action Ch. 2
- NUMA topology
- How do you map CPUs to nodes?
- Book Reference: Computer Architecture Ch. 2
- Work stealing
- Why do you steal locally first?
- Book Reference: Art of Multiprocessor Programming Ch. 7
Questions to Guide Your Design
- How will you represent per-node queues?
- How will you choose steal order based on distance?
- How will you measure locality?
Thinking Exercise
If a task on node 0 needs data allocated on node 1, is it better to move the task or the data? Explain.
The Interview Questions They’ll Ask
- “How does a NUMA-aware thread pool differ from a normal one?”
- “What metrics show locality success?”
- “When should you allow cross-node stealing?”
- “How does affinity affect throughput?”
Hints in Layers
Hint 1: Starting Point Create a vector of per-node queues.
Hint 2: Next Step
Pin threads to node CPUs with pthread_setaffinity_np.
Hint 3: Key Technique Use distance matrix to order steal attempts.
Hint 4: Debugging/Verification Collect statistics: local tasks vs remote tasks.
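As a sketch of Hints 1 and 2, each worker can pin itself to the CPUs of its node using libnuma's cpumask helpers and pthread_setaffinity_np; the queue logic is omitted. Link with -lnuma -lpthread.
#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    int node = *(int *)arg;
    struct bitmask *cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, cpus) == 0) {     // CPUs belonging to this node
        cpu_set_t set;
        CPU_ZERO(&set);
        for (unsigned i = 0; i < cpus->size; i++)
            if (numa_bitmask_isbitset(cpus, i))
                CPU_SET(i, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }
    numa_free_cpumask(cpus);
    /* ... pop tasks from this node's queue here ... */
    return NULL;
}

int main(void) {
    if (numa_available() < 0) return 1;
    int nodes = numa_max_node() + 1;
    pthread_t t[nodes];
    int ids[nodes];
    for (int n = 0; n < nodes; n++) {
        ids[n] = n;
        pthread_create(&t[n], NULL, worker, &ids[n]);
    }
    for (int n = 0; n < nodes; n++)
        pthread_join(t[n], NULL);
    return 0;
}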
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Concurrency | C++ Concurrency in Action | Ch. 2-4 |
| NUMA basics | Computer Architecture | Ch. 2 |
Common Pitfalls & Debugging
Problem 1: “Threads migrate anyway”
- Why: Affinity not set or overridden by cpusets
- Fix: Verify with taskset -cp <pid>
Problem 2: “Too much stealing”
- Why: Local queues too small or imbalanced
- Fix: Adjust load balancing thresholds
Problem 3: “Deadlocks in queue”
- Why: Incorrect locking or condition variable use
- Fix: Use per-queue mutex and careful notification
Definition of Done
- Threads pinned per node
- Local task execution >95%
- Work stealing respects distance
- Stats show measurable improvement vs global queue
Project 10: NUMA-Aware Database Buffer Pool
- Main Programming Language: C++
- Alternative Programming Languages: C, Rust
- Coolness Level: Level 5: Pure Magic
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Database Systems / NUMA
- Software or Tool: libnuma, mmap, async I/O
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A NUMA-aware buffer pool manager that places pages on the node where they are most accessed and runs queries near those pages.
Why it teaches real-world NUMA: Databases are the canonical NUMA workload. This project integrates placement, threading, and data structures into a realistic system.
Core challenges you’ll face:
- Per-node caches → partitioned buffer pool
- Placement policy → deciding where pages live
- Execution locality → running queries near data
- Migration → moving pages when access patterns change
Real World Outcome
Buffer Pool Stats:
Node 0: 1.2M pages, 98.9% local hits
Node 1: 1.1M pages, 98.3% local hits
Remote hits: 1.2%
Page migrations: 842
The Core Question You’re Answering
“How do real systems like databases exploit NUMA locality at scale?”
Concepts You Must Understand First
- Buffer pools
- Why do databases cache pages?
- Book Reference: Database Internals Ch. 5
- NUMA-aware allocation
- How do you allocate pages per node?
- Book Reference: C Interfaces and Implementations Ch. 5-6
- Thread placement
- How do you run queries near data?
- Book Reference: C++ Concurrency in Action Ch. 2
Questions to Guide Your Design
- How will you map pages to nodes?
- How will you migrate pages when access changes?
- How will you schedule queries to local nodes?
Thinking Exercise
Suppose a table is hot on node 0 during the day and node 1 at night. How should your buffer pool adapt?
The Interview Questions They’ll Ask
- “Why is NUMA important for databases?”
- “How do you choose which node holds a page?”
- “What is the cost of page migration?”
- “How do you keep query execution local?”
Hints in Layers
Hint 1: Starting Point Partition your buffer pool into per-node pools.
Hint 2: Next Step
Use numa_alloc_onnode for page allocations.
Hint 3: Key Technique Track access stats per page and migrate if locality changes.
Hint 4: Debugging/Verification Measure local hit rate and remote hit rate per node.
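One possible skeleton for Hints 1 and 2: a buffer pool split into per-node frame regions, with a trivial hash from page id to home node. The sizes, the hash policy, and the missing page table, eviction, and I/O are all simplifying assumptions; link with -lnuma.
#include <numa.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE       4096
#define FRAMES_PER_NODE 4096
#define MAX_NODES       8

static char *frames[MAX_NODES];       // one contiguous frame region per node

void pool_init(void) {
    for (int n = 0; n <= numa_max_node() && n < MAX_NODES; n++)
        frames[n] = numa_alloc_onnode((size_t)FRAMES_PER_NODE * PAGE_SIZE, n);
}

int home_node(uint64_t page_id) {     // placeholder policy: hash page id to a node
    return (int)(page_id % (uint64_t)(numa_max_node() + 1));
}

char *frame_for(uint64_t page_id) {
    int node = home_node(page_id);
    size_t slot = page_id % FRAMES_PER_NODE;      // real pool: page table + eviction
    return frames[node] + slot * PAGE_SIZE;
}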
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Buffer pools | Database Internals | Ch. 5 |
| NUMA placement | Computer Architecture | Ch. 2 |
| Concurrency | C++ Concurrency in Action | Ch. 2-4 |
Common Pitfalls & Debugging
Problem 1: “Remote hits too high”
- Why: Queries not co-located with pages
- Fix: Pin worker threads to nodes and schedule per-node
Problem 2: “Migration thrashing”
- Why: Pages move too frequently
- Fix: Add hysteresis or threshold for migration
Problem 3: “Lock contention”
- Why: Global locks across nodes
- Fix: Partition locks per node or shard by table
Definition of Done
- Buffer pool is partitioned by node
- Local hit rate >95%
- Page migration improves locality without thrashing
- Query execution is co-located with pages
Project Comparison Table
| Project | Difficulty | Time | Depth | Focus |
|---|---|---|---|---|
| 1. Topology Explorer | Beginner | Weekend | ⭐⭐ | Discovery |
| 2. Latency Benchmark | Intermediate | 1 week | ⭐⭐⭐⭐ | Measurement |
| 3. Bandwidth Benchmark | Intermediate | 1 week | ⭐⭐⭐⭐ | Measurement |
| 4. False Sharing Detector | Intermediate | 1 week | ⭐⭐⭐⭐ | Coherence |
| 5. NUMA Allocator | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | Placement |
| 6. NUMA Data Structures | Advanced | 3-4 weeks | ⭐⭐⭐⭐⭐ | Locality |
| 7. Migration Tool | Advanced | 2 weeks | ⭐⭐⭐⭐ | Control |
| 8. Coherence Simulator | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | Theory |
| 9. NUMA Thread Pool | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | Scheduling |
| 10. Buffer Pool | Expert | 4-6 weeks | ⭐⭐⭐⭐⭐ | Integration |
Summary
| # | Project | Outcome |
|---|---|---|
| 1 | Topology Explorer | Discover nodes, CPUs, distances |
| 2 | Latency Benchmark | Measure local vs remote latency |
| 3 | Bandwidth Benchmark | Measure NUMA bandwidth effects |
| 4 | False Sharing Detector | Detect coherence contention |
| 5 | NUMA Allocator | Control page placement |
| 6 | NUMA Data Structures | Locality-aware structures |
| 7 | Migration Tool | Move and verify pages |
| 8 | Coherence Simulator | Understand MESI/MOESI |
| 9 | NUMA Thread Pool | Locality-aware scheduling |
| 10 | NUMA Buffer Pool | Real-world NUMA integration |
Master these projects and you will understand why some code scales and some code collapses on modern hardware.