LEARN NUMA VS UMA ARCHITECTURES DEEP DIVE
Learn NUMA vs UMA Architectures: From Memory Basics to High-Performance Systems
Goal: Deeply understand memory architectures in multiprocessor systems—from uniform memory access basics to NUMA topology, cache coherence, memory placement policies, and building NUMA-aware high-performance applications.
Why Memory Architecture Matters
Practically every modern multi-socket server uses NUMA, and even high-core-count workstations and desktops increasingly behave like NUMA machines. When your database is slow, your parallel code doesn’t scale, or your application has mysterious performance cliffs—the answer often lies in memory architecture.
Understanding NUMA vs UMA means understanding:
- Why adding more CPU cores doesn’t always make things faster
- Why the same code runs at different speeds on different servers
- How memory placement affects cache behavior
- Why memory allocation strategy can 10x your performance
- How to write code that scales on modern hardware
After completing these projects, you will:
- Understand the physical reality of memory in modern systems
- Measure and visualize NUMA topology and latencies
- Write NUMA-aware applications that scale efficiently
- Implement custom memory allocators for NUMA systems
- Debug and optimize memory placement issues
- Understand cache coherence and its performance implications
Core Concept Analysis
The Physical Reality
Modern systems have a hierarchy of memory access times:
┌─────────────────────────────────────────────────────────────────┐
│ CPU Socket 0 │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Core 0 │ │ Core 1 │ │ Core 2 │ │ Core 3 │ │
│ │┌───────┐│ │┌───────┐│ │┌───────┐│ │┌───────┐│ │
│ ││ L1 ││ ││ L1 ││ ││ L1 ││ ││ L1 ││ ~1ns │
│ │└───────┘│ │└───────┘│ │└───────┘│ │└───────┘│ │
│ │┌───────┐│ │┌───────┐│ │┌───────┐│ │┌───────┐│ │
│ ││ L2 ││ ││ L2 ││ ││ L2 ││ ││ L2 ││ ~4ns │
│ │└───────┘│ │└───────┘│ │└───────┘│ │└───────┘│ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ └──────────────┬───────────────────┘ │
│ ┌───────┴───────┐ │
│ │ Shared L3 │ ~20ns │
│ │ (LLC) │ │
│ └───────┬───────┘ │
│ │ │
│ ┌───────┴───────┐ │
│ │ Memory Ctrl │ │
│ └───────┬───────┘ │
│ │ │
│ ┌───────┴───────┐ │
│ │ Local DRAM │ ~80ns (LOCAL) │
│ │ (NUMA Node 0)│ │
│ └───────────────┘ │
└─────────────────────┬───────────────────────────────────────────┘
│
QPI/UPI Link (~40ns additional)
│
┌─────────────────────┴───────────────────────────────────────────┐
│ CPU Socket 1 │
│ ┌───────────────┐ │
│ │ Remote DRAM │ ~120ns (REMOTE) │
│ │ (NUMA Node 1)│ │
│ └───────────────┘ │
│ │ │
│ ┌───────┴───────┐ │
│ │ Memory Ctrl │ │
│ └───────┬───────┘ │
│ │ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Core 4 │ │ Core 5 │ │ Core 6 │ │ Core 7 │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────────┘
UMA (Uniform Memory Access)
In UMA systems, all processors have equal access time to all memory:
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ CPU 0 │ │ CPU 1 │ │ CPU 2 │ │ CPU 3 │
└───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘
│ │ │ │
└───────────┴─────┬─────┴───────────┘
│
┌───────┴───────┐
│ Shared Bus │
└───────┬───────┘
│
┌────────────┼────────────┐
│ │ │
┌───┴───┐ ┌───┴───┐ ┌───┴───┐
│ Bank 0│ │ Bank 1│ │ Bank 2│
└───────┘ └───────┘ └───────┘
SHARED MEMORY
(Equal latency from all CPUs)
Characteristics:
- Simple programming model
- All memory accesses take the same time
- Limited scalability (bus becomes bottleneck)
- Found in: older systems, small multicore chips
- Example: Single-socket desktop computers
NUMA (Non-Uniform Memory Access)
In NUMA systems, memory access time depends on memory location relative to the CPU:
NUMA Node 0 NUMA Node 1
┌─────────────────────┐ ┌─────────────────────┐
│ ┌────┐ ┌────┐ │ │ ┌────┐ ┌────┐ │
│ │CPU0│ │CPU1│ │ │ │CPU2│ │CPU3│ │
│ └──┬─┘ └──┬─┘ │ │ └──┬─┘ └──┬─┘ │
│ └────┬──┘ │ │ └────┬──┘ │
│ │ │ │ │ │
│ ┌─────┴─────┐ │◄──────►│ ┌─────┴─────┐ │
│ │Mem Ctrl 0 │ │ QPI/ │ │Mem Ctrl 1 │ │
│ └─────┬─────┘ │ UPI │ └─────┬─────┘ │
│ │ │ Inter- │ │ │
│ ┌─────┴─────┐ │connect │ ┌─────┴─────┐ │
│ │ Local RAM │ │ │ │ Local RAM │ │
│ │ (Fast) │ │ │ │ (Fast) │ │
│ │ ~80ns │ │ │ │ ~80ns │ │
│ └───────────┘ │ │ └───────────┘ │
└─────────────────────┘ └─────────────────────┘
│ │
│ Remote Access ~120ns │
└──────────────────────────────┘
Characteristics:
- Local memory access is fast (~80ns)
- Remote memory access is slower (~120-150ns or more)
- Scales better than UMA
- Requires careful memory placement for performance
- Found in: servers, workstations, multi-socket systems
Memory Access Latency Comparison
| Memory Level | Latency | Cycles (3GHz) | Notes |
|---|---|---|---|
| L1 Cache | ~1ns | 3-4 | Per-core |
| L2 Cache | ~4ns | 12 | Per-core |
| L3 Cache | ~20ns | 60 | Shared within socket |
| Local DRAM | ~80ns | 240 | Same NUMA node |
| Remote DRAM | ~120ns+ | 360+ | Different NUMA node |
| Remote (2 hop) | ~150ns+ | 450+ | Through another node |
Key Insight: Remote memory access can be 50-100% slower than local. This is why NUMA awareness matters!
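To see what this means in aggregate, it helps to work out the average latency for a mix of local and remote accesses. The sketch below is purely illustrative, using the rough numbers from the table above (80 ns local, 120 ns remote):

```c
#include <stdio.h>

// Illustrative only: effective average DRAM latency for a workload whose
// accesses are split between local and remote memory, using the rough
// numbers from the table above.
int main(void) {
    double local_ns = 80.0, remote_ns = 120.0;
    for (double remote_frac = 0.0; remote_frac <= 1.0; remote_frac += 0.25) {
        double effective = (1.0 - remote_frac) * local_ns + remote_frac * remote_ns;
        printf("%3.0f%% remote -> %.0f ns average DRAM latency\n",
               remote_frac * 100.0, effective);
    }
    return 0;
}
```

Even a 25% remote-access fraction pushes the average from 80 ns to 90 ns—and real penalties are often larger once interconnect contention is involved.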
Cache Coherence (MESI Protocol)
When multiple cores cache the same memory location, they must stay synchronized:
MESI States:
┌─────────────────────────────────────────────────────────────┐
│ Modified (M) │ Only copy, dirty (different from memory) │
├─────────────────────────────────────────────────────────────┤
│ Exclusive (E) │ Only copy, clean (same as memory) │
├─────────────────────────────────────────────────────────────┤
│ Shared (S) │ Multiple copies exist, all clean │
├─────────────────────────────────────────────────────────────┤
│ Invalid (I) │ Not valid, must fetch from memory/cache │
└─────────────────────────────────────────────────────────────┘
State Transitions (for one cache's copy of a line):
- I → E: local read miss, no other cache holds the line
- I → S: local read miss, another cache holds the line (cache-to-cache transfer)
- I → M: local write miss (read-for-ownership: fetch the line and invalidate other copies)
- S → M: local write (other caches' copies are invalidated)
- E → M: local write (silent upgrade, no bus traffic)
- E → S: another core reads the line
- M → S: another core reads the line (the dirty data is written back first)
- M/E/S → I: another core writes the line, or the line is evicted
Why This Matters for NUMA:
- Cache coherence traffic crosses NUMA nodes
- “False sharing” causes cache line ping-pong between nodes
- Directory-based protocols (ccNUMA) scale better than bus snooping
NUMA Policies in Linux
Linux provides several memory allocation policies:
| Policy | Behavior |
|---|---|
| MPOL_DEFAULT | Allocate on the local node (first-touch) |
| MPOL_BIND | Allocate only on the specified nodes (fail if unavailable) |
| MPOL_PREFERRED | Prefer the specified node, fall back to others |
| MPOL_INTERLEAVE | Round-robin across nodes (good for shared data) |
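These policies can be applied to a whole process with the numactl tool, or programmatically via set_mempolicy(2)/mbind(2). A minimal sketch (assuming a two-node system with nodes 0 and 1; link with -lnuma):

```c
#define _GNU_SOURCE
#include <numaif.h>      // set_mempolicy, MPOL_INTERLEAVE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    // Interleave this thread's future allocations across nodes 0 and 1.
    unsigned long nodemask = (1UL << 0) | (1UL << 1);
    if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8) != 0) {
        perror("set_mempolicy");
        return 1;
    }
    // Pages of this buffer are placed round-robin across the two nodes
    // as they are first touched.
    size_t size = 64 * 1024 * 1024;
    char* buf = malloc(size);
    memset(buf, 0, size);
    puts("allocations are now interleaved across nodes 0 and 1");
    free(buf);
    return 0;
}
```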
First-Touch Policy (Default):
- Memory is allocated on the node of the first thread to write to it
- This is why initialization patterns matter!
```c
// BAD: the main thread touches every page, so all pages land on its node
char* buffer = malloc(SIZE);          // virtual allocation only, no pages yet
memset(buffer, 0, SIZE);              // first touch: every page placed on the main thread's node
// Workers running on the other node now access remote memory - SLOW!
```

```c
// GOOD: each worker first-touches the portion it will use
char* buffer = malloc(SIZE);          // allocated but not yet touched
#pragma omp parallel for
for (size_t i = 0; i < SIZE; i++) {
    buffer[i] = 0;                    // each page is placed on the touching thread's node
}
// Memory ends up distributed across nodes according to which thread touched it
```
Project List
Projects are ordered from understanding concepts to building production-quality NUMA-aware systems.
Project 1: NUMA Topology Explorer
- File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: NUMA Topology / System Information
- Software or Tool: libnuma, Linux sysfs
- Main Book: “Computer Architecture: A Quantitative Approach” by Hennessy & Patterson
What you’ll build: A tool that discovers and visualizes the NUMA topology of any Linux system—showing nodes, CPUs, memory, and distances between nodes.
Why it teaches NUMA basics: Before optimizing for NUMA, you must understand what you’re working with. This project teaches you to read the hardware topology that determines performance.
Core challenges you’ll face:
- Reading /sys/devices/system/node/ → maps to understanding Linux NUMA interfaces
- Parsing CPU affinity masks → maps to understanding which CPUs belong to which nodes
- Interpreting distance matrix → maps to understanding relative memory access costs
- Using libnuma APIs → maps to programmatic topology discovery
Key Concepts:
- NUMA Topology: Linux Kernel NUMA Documentation
- libnuma Interface: numa(3) man page
- CPU Affinity: “Linux System Programming” Chapter 6 - Robert Love
- Memory Hierarchy: “Computer Architecture: A Quantitative Approach” Chapter 2 - Hennessy & Patterson
Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic C programming, Linux familiarity
Real world outcome:
$ ./numa-explorer
=== NUMA Topology Report ===
System: 2 NUMA nodes, 16 CPUs, 64GB RAM
Node 0:
CPUs: 0-7
Memory: 32GB (28GB free)
Distance to Node 0: 10
Distance to Node 1: 21
Node 1:
CPUs: 8-15
Memory: 32GB (30GB free)
Distance to Node 0: 21
Distance to Node 1: 10
Distance Matrix (relative latency):
Node 0 Node 1
Node 0: 10 21
Node 1: 21 10
Cache Topology:
L1d: 32KB per core
L1i: 32KB per core
L2: 256KB per core
L3: 20MB shared per node
Interconnect: Intel QPI @ 8.0 GT/s
Implementation Hints:
- Discover NUMA Nodes:

```c
#include <numa.h>

if (numa_available() < 0) {
    printf("NUMA not available on this system\n");
    return 1;
}
int num_nodes = numa_max_node() + 1;
printf("System has %d NUMA nodes\n", num_nodes);
```

- Read from sysfs (alternative to libnuma):

```c
// Nodes are in:   /sys/devices/system/node/node*/
// CPUs per node:  /sys/devices/system/node/node0/cpulist
// Memory info:    /sys/devices/system/node/node0/meminfo
// Distances:      /sys/devices/system/node/node0/distance
```

- Get CPUs per Node:

```c
struct bitmask* cpus = numa_allocate_cpumask();
for (int node = 0; node <= numa_max_node(); node++) {
    numa_node_to_cpus(node, cpus);
    printf("Node %d CPUs: ", node);
    for (int cpu = 0; cpu < numa_num_configured_cpus(); cpu++) {
        if (numa_bitmask_isbitset(cpus, cpu)) {
            printf("%d ", cpu);
        }
    }
    printf("\n");
}
```

- Get Memory per Node:

```c
long long free_mem;
long long total_mem = numa_node_size64(node, &free_mem);
printf("Node %d: %lld MB total, %lld MB free\n",
       node, total_mem / (1024*1024), free_mem / (1024*1024));
```

- Get Distance Matrix:

```c
for (int i = 0; i <= numa_max_node(); i++) {
    for (int j = 0; j <= numa_max_node(); j++) {
        int distance = numa_distance(i, j);
        printf("%4d ", distance);
    }
    printf("\n");
}
```
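The sysfs paths listed above can be read with ordinary file I/O, with no libnuma dependency. A minimal sketch, assuming the standard Linux layout and that node 0 exists:

```c
#include <stdio.h>

// Print the CPU list of NUMA node 0 straight from sysfs.
int main(void) {
    char buf[256];
    FILE* f = fopen("/sys/devices/system/node/node0/cpulist", "r");
    if (!f) {
        perror("open cpulist");
        return 1;
    }
    if (fgets(buf, sizeof(buf), f)) {
        printf("Node 0 CPUs: %s", buf);   // e.g. "0-7"
    }
    fclose(f);
    return 0;
}
```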
Learning milestones:
- Tool discovers node count → You understand NUMA basics
- CPU mapping works → You understand node-CPU relationships
- Memory info works → You understand node memory
- Distance matrix displays → You understand NUMA distances
- Compare with numactl → Your tool matches system tools
Project 2: Memory Latency Microbenchmark
- File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Assembly
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Memory Performance / Benchmarking
- Software or Tool: libnuma, rdtsc/rdtscp
- Main Book: “Computer Architecture: A Quantitative Approach” by Hennessy & Patterson
What you’ll build: A microbenchmark that measures memory access latency to different NUMA nodes, cache levels, and memory regions—revealing the true cost of memory architecture decisions.
Why it teaches memory performance: Numbers like “120ns remote latency” are abstract until you measure them yourself. This project gives you intuition for memory performance that will guide all your optimization work.
Core challenges you’ll face:
- Pointer chasing to defeat prefetchers → maps to understanding hardware prefetching
- High-resolution timing → maps to using TSC, avoiding measurement overhead
- Controlling memory placement → maps to numa_alloc_onnode, mbind
- Statistical analysis → maps to handling variance in measurements
Key Concepts:
- Memory Latency Measurement: Intel MLC Documentation
- Pointer Chasing: “Computer Architecture: A Quantitative Approach” Chapter 2 - Hennessy & Patterson
- TSC Timing: Intel Software Developer’s Manual
- NUMA Allocation: libnuma documentation
Difficulty: Intermediate Time estimate: 1 week Prerequisites: Project 1 completed, understanding of memory hierarchy
Real world outcome:
$ ./numa-latency
Memory Latency Benchmark
========================
Testing from Node 0 (CPUs 0-7):
L1 Cache (32KB): 1.2 ns
L2 Cache (256KB): 4.1 ns
L3 Cache (20MB): 18.7 ns
Local DRAM: 78.3 ns
Remote DRAM (Node 1): 127.6 ns
Testing from Node 1 (CPUs 8-15):
L1 Cache: 1.2 ns
L2 Cache: 4.0 ns
L3 Cache: 19.1 ns
Local DRAM: 79.1 ns
Remote DRAM (Node 0): 125.4 ns
NUMA Penalty: 1.63x slower for remote access
Full Latency Matrix (ns):
To Node 0 To Node 1
From Node 0: 78.3 127.6
From Node 1: 125.4 79.1
Implementation Hints:
- Pointer Chasing (defeats prefetcher) — TSC calibration is sketched after this list:

```c
#include <stdlib.h>

// Each node occupies exactly one 64-byte cache line
typedef struct Node {
    struct Node* next;
    char padding[56];
} Node;

// Link the nodes into a single cycle in random order so the hardware
// prefetcher cannot predict the next address (Fisher-Yates shuffle of
// an index permutation, then link along it).
void build_chase_list(Node* nodes, size_t count) {
    size_t* order = malloc(count * sizeof(size_t));
    for (size_t i = 0; i < count; i++) order[i] = i;
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (size_t i = 0; i < count; i++) {
        nodes[order[i]].next = &nodes[order[(i + 1) % count]];
    }
    free(order);
}
```

- High-Resolution Timing:

```c
#include <stdint.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

static inline uint64_t rdtscp(void) {
    uint32_t lo, hi, aux;
    __asm__ volatile ("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
    return ((uint64_t)hi << 32) | lo;
}

// Measure the average latency of one dependent load
uint64_t start = rdtsc();
for (int i = 0; i < iterations; i++) {
    p = p->next;                       // pointer chase: each load depends on the previous
}
uint64_t end = rdtscp();
uint64_t cycles = (end - start) / iterations;
double ns = cycles / cpu_ghz;          // cpu_ghz: TSC frequency in GHz
```

- Allocate on Specific Node:

```c
#include <numa.h>
#include <numaif.h>                    // mbind, MPOL_BIND

void* buffer = numa_alloc_onnode(size, target_node);
if (!buffer) {
    perror("numa_alloc_onnode failed");
}

// Or use mbind for existing memory
unsigned long nodemask = 1UL << target_node;
mbind(buffer, size, MPOL_BIND, &nodemask, numa_max_node() + 2, 0);
```

- Pin Thread to Specific CPU:

```c
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(cpu_id, &cpuset);
pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
```

- Test Different Sizes (to hit different cache levels):

```c
size_t test_sizes[] = {
    16 * 1024,             // L1 (32KB)
    128 * 1024,            // L2 (256KB)
    8 * 1024 * 1024,       // L3 (varies)
    256 * 1024 * 1024      // DRAM
};
```
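The timing snippet above divides by `cpu_ghz` without saying where that number comes from. One option is to calibrate the TSC against CLOCK_MONOTONIC; this helper is an assumption of this guide, not part of the original hints, and assumes an invariant TSC (true on recent x86 CPUs):

```c
#include <stdint.h>
#include <time.h>
#include <unistd.h>

// Estimate the TSC frequency in GHz by comparing TSC ticks against
// CLOCK_MONOTONIC over a ~100 ms window. Uses rdtsc()/rdtscp() from
// the timing snippet above.
static double calibrate_tsc_ghz(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = rdtsc();
    usleep(100 * 1000);
    uint64_t c1 = rdtscp();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return (double)(c1 - c0) / ns;     // ticks per nanosecond == GHz
}
```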
Learning milestones:
- Measure cache latencies → You understand cache hierarchy
- Measure local vs remote DRAM → You see NUMA penalty
- Results match Intel MLC → Your benchmark is accurate
- Understand variance → You understand measurement challenges
- Can predict performance → Real understanding achieved
Project 3: Memory Bandwidth Benchmark
- File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: C++ with intrinsics, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Memory Bandwidth / SIMD
- Software or Tool: libnuma, AVX intrinsics
- Main Book: “Computer Architecture: A Quantitative Approach” by Hennessy & Patterson
What you’ll build: A bandwidth benchmark (like STREAM) that measures memory throughput across NUMA nodes, with different access patterns and thread configurations.
Why it teaches bandwidth: Latency can be hidden by pipelining and prefetching, but bandwidth is a hard limit. Understanding bandwidth limits tells you when you’re memory-bound and how parallelism affects memory performance.
Core challenges you’ll face:
- Saturating memory bandwidth → maps to using SIMD, multiple threads
- NUMA-aware allocation → maps to controlling where data lives
- Measuring aggregate bandwidth → maps to combining thread results
- Understanding bandwidth scaling → maps to when more threads help/hurt
Key Concepts:
- STREAM Benchmark: STREAM Documentation
- Memory Bandwidth: “Computer Architecture: A Quantitative Approach” Chapter 2 - Hennessy & Patterson
- SIMD Instructions: Intel Intrinsics Guide
- NUMA Bandwidth: Measuring NUMA effects with STREAM
Difficulty: Intermediate Time estimate: 1 week Prerequisites: Project 2 completed, understanding of SIMD basics
Real world outcome:
$ ./numa-bandwidth --threads 16
NUMA Memory Bandwidth Benchmark
===============================
Single-threaded (pinned to Node 0):
Copy (Local): 12.4 GB/s
Copy (Remote): 6.8 GB/s
Scale (Local): 11.9 GB/s
Triad (Local): 13.2 GB/s
Multi-threaded bandwidth:
Threads Local BW Remote BW Interleaved
1 12.4 GB/s 6.8 GB/s 9.5 GB/s
2 24.6 GB/s 11.2 GB/s 17.8 GB/s
4 45.3 GB/s 18.7 GB/s 31.4 GB/s
8 68.2 GB/s 24.1 GB/s 43.6 GB/s
16 72.1 GB/s 25.3 GB/s 47.2 GB/s
Peak aggregate bandwidth: 72.1 GB/s (8 threads saturate)
Remote access penalty: 2.85x slower
Per-Node breakdown:
Node 0: 72.1 GB/s peak (8 channels × 2666 MT/s)
Node 1: 71.8 GB/s peak
Implementation Hints:
- STREAM-style Operations:

```c
// Copy: c[i] = a[i]
void copy(double* restrict c, double* restrict a, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i];
    }
}

// Scale: b[i] = scalar * c[i]
void scale(double* restrict b, double* restrict c, double scalar, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {
        b[i] = scalar * c[i];
    }
}

// Add: c[i] = a[i] + b[i]
void add(double* restrict c, double* restrict a, double* restrict b, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// Triad: a[i] = b[i] + scalar * c[i]
void triad(double* restrict a, double* restrict b, double* restrict c,
           double scalar, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + scalar * c[i];
    }
}
```

- AVX-512 for Maximum Bandwidth:

```c
#include <immintrin.h>

// Assumes n is a multiple of 8 and both buffers are 64-byte aligned
void copy_avx512(double* restrict dst, double* restrict src, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m512d v = _mm512_load_pd(&src[i]);
        _mm512_store_pd(&dst[i], v);
    }
}
```

- NUMA-Aware Allocation:

```c
// Allocate on a specific node
double* local_array = numa_alloc_onnode(size, local_node);

// Allocate interleaved across all nodes
double* interleaved = numa_alloc_interleaved(size);

// First-touch initialization for NUMA placement
#pragma omp parallel for
for (size_t i = 0; i < n; i++) {
    array[i] = 0.0;      // each thread touches its portion
}
```

- Bandwidth Calculation:

```c
double bytes = n * sizeof(double) * 2;      // read + write for Copy
double seconds = end_time - start_time;
double bandwidth_gb = (bytes / 1e9) / seconds;
```

- Scaling Test:

```c
for (int threads = 1; threads <= max_threads; threads *= 2) {
    omp_set_num_threads(threads);
    // Run the benchmark and record bandwidth
}
```
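The bandwidth calculation above needs wall-clock timestamps for `start_time`/`end_time`. One simple option (an illustrative helper, not from the original hints) is CLOCK_MONOTONIC:

```c
#include <time.h>

// Wall-clock time in seconds as a double, suitable for the
// end_time - start_time arithmetic in the bandwidth calculation above.
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}
```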
Learning milestones:
- Single-threaded bandwidth measured → You understand basic measurement
- Local vs remote difference clear → You see NUMA bandwidth penalty
- Scaling curve shows saturation → You understand bandwidth limits
- SIMD version faster → You understand importance of vectorization
- Interleaved vs local trade-offs → Deep NUMA bandwidth understanding
Project 4: False Sharing Detector
- File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Cache Coherence / Performance Analysis
- Software or Tool: perf, libnuma
- Main Book: “The Art of Multiprocessor Programming” by Herlihy & Shavit
What you’ll build: A tool that demonstrates and detects false sharing—when threads access different data that happens to share a cache line, causing severe performance degradation.
Why it teaches cache coherence: False sharing is one of the most common performance bugs in parallel code. It’s invisible to correctness but devastating to performance—sometimes 10-100x slower.
Core challenges you’ll face:
- Creating false sharing conditions → maps to understanding cache line granularity
- Measuring cache coherence traffic → maps to using hardware performance counters
- Padding to eliminate false sharing → maps to cache-line alignment
- Detecting in real code → maps to analyzing memory access patterns
Key Concepts:
- False Sharing: “The Art of Multiprocessor Programming” Chapter 7 - Herlihy & Shavit
- MESI Protocol: ARM MESI Documentation
- Cache Line Size: Intel/AMD processor documentation
- Performance Counters: perf stat documentation
Difficulty: Intermediate Time estimate: 1 week Prerequisites: Project 1 completed, understanding of caching
Real world outcome:
$ ./false-sharing-demo
=== False Sharing Demonstration ===
Test 1: Adjacent counters (FALSE SHARING)
Struct layout: [cnt0|cnt1|cnt2|cnt3] (8 bytes each, same cache line)
4 threads incrementing their own counter...
Time: 2847 ms
Throughput: 35.1 M ops/sec
Cache misses: 847,293,102
Test 2: Padded counters (NO FALSE SHARING)
Struct layout: [cnt0|pad...][cnt1|pad...] (64 bytes each, separate lines)
4 threads incrementing their own counter...
Time: 127 ms
Throughput: 787.4 M ops/sec
Cache misses: 1,234
Speedup from eliminating false sharing: 22.4x
=== Detecting False Sharing ===
Monitoring process 12345 for 5 seconds...
Potential false sharing detected:
Address range: 0x7f8a12340000 - 0x7f8a12340040
Threads accessing: 0, 1, 2, 3
Cache coherence invalidations: 12,847,293
Recommendation: Add padding between variables at offsets 0, 8, 16, 24
Implementation Hints:
- Create False Sharing:

```c
// BAD: all four counters share one 64-byte cache line
struct {
    volatile long counter[4];      // 32 bytes, fits in a single cache line
} shared;

void* increment_bad(void* arg) {
    int id = *(int*)arg;
    for (int i = 0; i < 100000000; i++) {
        shared.counter[id]++;      // every write invalidates the line in the other cores
    }
    return NULL;
}
```

- Eliminate False Sharing with Padding:

```c
// GOOD: each counter gets its own cache line
#define CACHE_LINE_SIZE 64

struct PaddedCounter {
    volatile long counter;
    char padding[CACHE_LINE_SIZE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE_SIZE)));

struct PaddedCounter padded[4];

void* increment_good(void* arg) {
    int id = *(int*)arg;
    for (int i = 0; i < 100000000; i++) {
        padded[id].counter++;      // no false sharing
    }
    return NULL;
}
```

- Measure with Performance Counters:

```
# Run with perf to see cache effects
perf stat -e cache-misses,cache-references,LLC-load-misses,LLC-store-misses \
    ./false_sharing_demo

# Or use perf c2c for cache-to-cache analysis
perf c2c record ./false_sharing_demo
perf c2c report
```

- Programmatic Counter Access:

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
}

// Set up a counter for cache misses on the calling thread
struct perf_event_attr pe;
memset(&pe, 0, sizeof(pe));
pe.type = PERF_TYPE_HARDWARE;
pe.config = PERF_COUNT_HW_CACHE_MISSES;
pe.disabled = 1;
pe.exclude_kernel = 1;
int fd = perf_event_open(&pe, 0, -1, -1, 0);
```

- Alignment Utilities:

```c
// C11: alignas from <stdalign.h> (alignas keyword in C++11)
#include <stdalign.h>
struct CacheAlignedCounter {
    alignas(64) long value;
};

// Or manual padding plus a GCC/Clang alignment attribute
struct PaddedValue {
    long value;
    char padding[64 - sizeof(long)];
} __attribute__((aligned(64)));
```
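CACHE_LINE_SIZE is hard-coded to 64 above. If you prefer to query it at runtime, glibc on Linux exposes it through sysconf; a small illustrative sketch:

```c
#include <stdio.h>
#include <unistd.h>

// Query the L1 data cache line size at runtime (glibc on Linux).
// Falls back to 64 bytes if the value is not reported.
int main(void) {
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line <= 0) {
        line = 64;
    }
    printf("L1 data cache line size: %ld bytes\n", line);
    return 0;
}
```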
Learning milestones:
- Demo shows massive slowdown → You see false sharing impact
- Padding fixes the problem → You understand the solution
- Perf counters confirm diagnosis → You can detect it
- You can explain MESI states → Deep understanding
- Find false sharing in real code → Practical application
Project 5: NUMA-Aware Memory Allocator
- File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Memory Allocators / NUMA
- Software or Tool: libnuma, mmap
- Main Book: “The Art of Memory Management” and AMD NUMA allocator paper
What you’ll build: A custom memory allocator that is NUMA-aware—allocating memory from the local node when possible and providing explicit control over memory placement.
Why it teaches NUMA allocation: Standard malloc() doesn’t know about NUMA. High-performance applications need allocators that understand memory topology and place data strategically.
Core challenges you’ll face:
- Per-node memory pools → maps to maintaining separate free lists per node
- Thread-to-node mapping → maps to determining which node a thread runs on
- Allocation strategies → maps to local-first vs interleaved
- Fragmentation management → maps to coalescing, large page support
Key Concepts:
- NUMA Allocator Design: AMD NUMA-aware heap manager
- Memory Allocator Basics: “The Art of Multiprocessor Programming” Chapter 14 - Herlihy & Shavit
- libnuma Allocation: numa(3) man page
- Page Allocation: Linux mmap() documentation
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-2 completed, understanding of memory allocators
Real world outcome:
// Your allocator API
void* numa_malloc(size_t size); // Allocate on local node
void* numa_malloc_node(size_t size, int node); // Allocate on specific node
void* numa_malloc_interleaved(size_t size); // Interleave across nodes
void numa_free(void* ptr);
// Statistics
numa_allocator_stats();
// Output:
// Node 0: 1.2 GB allocated, 847 allocations, 0 remote accesses
// Node 1: 0.8 GB allocated, 523 allocations, 0 remote accesses
// Thread-local cache hits: 94.2%
Implementation Hints:
- Per-Node Arena Structure:

```c
#define MAX_NODES 8
#define ARENA_SIZE (64 * 1024 * 1024)   // 64MB per arena

typedef struct {
    void* base;                  // numa_alloc'd region
    size_t size;                 // total size
    size_t used;                 // bytes allocated
    pthread_mutex_t lock;
    FreeList* free_lists[NUM_SIZE_CLASSES];
} NodeArena;

typedef struct {
    NodeArena arenas[MAX_NODES];
    int num_nodes;
} NUMAAllocator;
```

- Initialize Per-Node Arenas:

```c
void numa_allocator_init(NUMAAllocator* alloc) {
    alloc->num_nodes = numa_max_node() + 1;
    for (int node = 0; node < alloc->num_nodes; node++) {
        // Allocate arena memory on a specific node
        void* base = numa_alloc_onnode(ARENA_SIZE, node);
        if (!base) {
            perror("numa_alloc_onnode failed");
            exit(1);
        }
        alloc->arenas[node].base = base;
        alloc->arenas[node].size = ARENA_SIZE;
        alloc->arenas[node].used = 0;
        pthread_mutex_init(&alloc->arenas[node].lock, NULL);
    }
}
```

- Determine Current Node:

```c
int get_current_node() {
    int cpu = sched_getcpu();
    return numa_node_of_cpu(cpu);
}
```

- Local Allocation:

```c
void* numa_malloc(size_t size) {
    int node = get_current_node();
    return allocate_from_node(&allocator.arenas[node], size);
}

void* allocate_from_node(NodeArena* arena, size_t size) {
    pthread_mutex_lock(&arena->lock);

    // Find the appropriate size class
    int size_class = get_size_class(size);

    // Check the free list
    if (arena->free_lists[size_class]) {
        void* ptr = arena->free_lists[size_class];
        arena->free_lists[size_class] = *(void**)ptr;
        pthread_mutex_unlock(&arena->lock);
        return ptr;
    }

    // Allocate from the arena
    size_t alloc_size = size_class_to_size(size_class);
    if (arena->used + alloc_size > arena->size) {
        pthread_mutex_unlock(&arena->lock);
        return NULL;                  // arena full
    }
    void* ptr = (char*)arena->base + arena->used;
    arena->used += alloc_size;
    pthread_mutex_unlock(&arena->lock);
    return ptr;
}
```

- Thread-Local Caching (for performance):

```c
__thread struct {
    void* cache[NUM_SIZE_CLASSES][CACHE_SIZE];
    int count[NUM_SIZE_CLASSES];
    int current_node;
} tls_cache;

void* numa_malloc_fast(size_t size) {
    int sc = get_size_class(size);
    if (tls_cache.count[sc] > 0) {
        return tls_cache.cache[sc][--tls_cache.count[sc]];
    }
    // Slow path: go to the arena
    return numa_malloc(size);
}
```

- Interleaved Allocation:

```c
__thread int interleave_node = 0;

void* numa_malloc_interleaved(size_t size) {
    int node = interleave_node;
    interleave_node = (interleave_node + 1) % allocator.num_nodes;
    return allocate_from_node(&allocator.arenas[node], size);
}
```
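The arena code calls get_size_class() and size_class_to_size() without defining them. A minimal power-of-two sketch (these helpers and the constants are assumptions of this guide, just to make the hints self-contained):

```c
#include <stddef.h>

#define NUM_SIZE_CLASSES 16
#define MIN_ALLOC        16          // smallest size class: 16 bytes

// Map a request size to a power-of-two size class: 16, 32, 64, ...
static int get_size_class(size_t size) {
    size_t cls_size = MIN_ALLOC;
    int cls = 0;
    while (cls_size < size && cls < NUM_SIZE_CLASSES - 1) {
        cls_size <<= 1;
        cls++;
    }
    return cls;
}

// Inverse mapping: size class index back to its allocation size
static size_t size_class_to_size(int size_class) {
    return (size_t)MIN_ALLOC << size_class;
}
```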
Learning milestones:
- Basic per-node allocation works → You understand the structure
- Correct node determined → You understand thread-node mapping
- Free lists work → You understand memory reuse
- Thread-local caching works → You understand performance optimization
- Benchmark shows improvement → Your allocator beats standard malloc for NUMA
Project 6: NUMA-Aware Data Structure Library
- File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
- Main Programming Language: C++
- Alternative Programming Languages: C, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Data Structures / NUMA Optimization
- Software or Tool: libnuma, C++ allocators
- Main Book: “The Art of Multiprocessor Programming” by Herlihy & Shavit
What you’ll build: A library of NUMA-aware data structures—hash tables, queues, and trees that understand memory locality and minimize cross-node traffic.
Why it teaches NUMA data structures: Generic data structures ignore NUMA. High-performance systems need data structures that partition data across nodes and minimize remote accesses.
Core challenges you’ll face:
- Partitioned hash tables → maps to per-node buckets, local lookups
- NUMA-aware queues → maps to minimizing producer-consumer cross-node traffic
- Tree partitioning → maps to locality-aware tree construction
- Balancing locality vs load → maps to when to access remote data
Key Concepts:
- Concurrent Data Structures: “The Art of Multiprocessor Programming” Chapters 9-13 - Herlihy & Shavit
- NUMA-Aware Hashing: Research papers on NUMA hash tables
- Work Stealing: “Cilk” scheduler documentation
- Lock-Free Programming: “C++ Concurrency in Action” by Anthony Williams
Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Project 5 completed, understanding of concurrent data structures
Real world outcome:
// NUMA-aware hash table
NUMAHashTable<std::string, UserData> table(numa_nodes);
// Insert - data placed on local node
table.insert("user123", userData); // O(1), local memory
// Lookup - checks local node first
auto result = table.find("user123"); // Usually local
// NUMA-aware work queue
NUMAWorkQueue<Task> queue;
queue.push(task); // Goes to local node's queue
// Worker pops from local queue first, steals from remote if empty
auto task = queue.pop(); // Local-first, then steal
// Statistics
table.stats();
// Output:
// Node 0: 1.2M entries, 99.2% local lookups
// Node 1: 1.1M entries, 98.7% local lookups
// Cross-node lookups: 1.1%
Implementation Hints:
- NUMA-Aware Hash Table:

```cpp
template<typename K, typename V>
class NUMAHashTable {
private:
    struct NodePartition {
        std::vector<Bucket> buckets;
        mutable std::shared_mutex mutex;
        alignas(64) std::atomic<size_t> count{0};
    };

    std::vector<NodePartition> partitions;

    int key_to_node(const K& key) {
        // The hash determines which node owns this key
        return std::hash<K>{}(key) % partitions.size();
    }

public:
    NUMAHashTable(int num_nodes) : partitions(num_nodes) {
        for (int i = 0; i < num_nodes; i++) {
            // Allocate buckets on a specific node
            numa_run_on_node(i);
            partitions[i].buckets.resize(INITIAL_BUCKETS);
        }
    }

    void insert(const K& key, const V& value) {
        int node = key_to_node(key);
        auto& partition = partitions[node];
        std::unique_lock lock(partition.mutex);
        // Insert into the partition
    }

    std::optional<V> find(const K& key) {
        int node = key_to_node(key);
        auto& partition = partitions[node];
        std::shared_lock lock(partition.mutex);
        // Look up in the partition
    }
};
```

- NUMA-Aware Work-Stealing Queue:

```cpp
template<typename T>
class NUMAWorkQueue {
private:
    struct alignas(64) NodeQueue {
        std::deque<T> local_queue;
        std::mutex mutex;
    };

    std::vector<NodeQueue> queues;

public:
    void push(T item) {
        int node = numa_node_of_cpu(sched_getcpu());
        std::lock_guard lock(queues[node].mutex);
        queues[node].local_queue.push_back(std::move(item));
    }

    std::optional<T> pop() {
        int my_node = numa_node_of_cpu(sched_getcpu());

        // Try the local queue first
        {
            std::lock_guard lock(queues[my_node].mutex);
            if (!queues[my_node].local_queue.empty()) {
                T item = std::move(queues[my_node].local_queue.back());
                queues[my_node].local_queue.pop_back();
                return item;
            }
        }

        // Steal from other nodes (in distance order)
        for (int node = 0; node < (int)queues.size(); node++) {
            if (node == my_node) continue;
            std::lock_guard lock(queues[node].mutex);
            if (!queues[node].local_queue.empty()) {
                T item = std::move(queues[node].local_queue.front());
                queues[node].local_queue.pop_front();
                return item;
            }
        }
        return std::nullopt;
    }
};
```

- NUMA-Aware Vector:

```cpp
template<typename T>
class NUMAVector {
private:
    struct Chunk {
        T* data;
        size_t size;
        int node;
    };

    std::vector<Chunk> chunks;
    size_t chunk_size = 1024 * 1024 / sizeof(T);   // 1MB chunks

public:
    void push_back(const T& value) {
        int node = get_current_node();
        // Allocate a new chunk on the local node if needed
        if (needs_new_chunk()) {
            T* data = (T*)numa_alloc_onnode(chunk_size * sizeof(T), node);
            chunks.push_back({data, 0, node});
        }
        // Add to the current chunk
    }

    T& operator[](size_t index) {
        auto& chunk = find_chunk(index);
        return chunk.data[index % chunk_size];
    }
};
```
Learning milestones:
- Partitioned hash table works → You understand data partitioning
- Work stealing works → You understand locality-aware scheduling
- Benchmark shows improvement → Your structures are faster
- Statistics confirm locality → You’re achieving local access
- Can integrate with real application → Practical value achieved
Project 7: NUMA Memory Migration Tool
- File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Memory Management / Page Migration
- Software or Tool: libnuma, move_pages(), perf
- Main Book: Linux Kernel documentation on NUMA
What you’ll build: A tool that analyzes process memory placement and migrates pages to optimal NUMA nodes based on access patterns—like a “defragmenter” for NUMA locality.
Why it teaches memory migration: Sometimes memory ends up on the wrong node (first-touch gone wrong, thread migration). Understanding how to detect and fix this is crucial for NUMA optimization.
Core challenges you’ll face:
- Determining page node locations → maps to get_mempolicy(), /proc/pid/numa_maps
- Identifying hot pages → maps to performance counters, sampling
- Migrating pages → maps to move_pages() system call
- Balancing migration cost → maps to when migration pays off
Key Concepts:
- NUMA Memory Policy: Linux Kernel NUMA documentation
- move_pages(): Linux man pages
- Page Migration: Linux mm/migrate.c documentation
- numastat: numactl documentation
Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Projects 1-2 completed, understanding of virtual memory
Real world outcome:
$ ./numa-migrate --analyze --pid 12345
=== NUMA Memory Analysis for PID 12345 ===
Virtual Memory Map:
0x7f0000000000 - 0x7f0080000000 (2GB heap): 87% Node 0, 13% Node 1
0x7f0100000000 - 0x7f0140000000 (1GB mmap): 100% Node 1
Thread-Node Affinity:
Thread 1 (TID 12345): Running on Node 0
Thread 2 (TID 12346): Running on Node 0
Thread 3 (TID 12347): Running on Node 1
Thread 4 (TID 12348): Running on Node 1
Memory Access Analysis (sampled):
Node 0 threads accessing Node 1 memory: 847,293 accesses/sec
Node 1 threads accessing Node 0 memory: 23,412 accesses/sec
Recommendations:
1. Migrate 256MB from Node 1 to Node 0 (heap region)
Estimated improvement: 12% fewer remote accesses
$ ./numa-migrate --migrate --pid 12345
Migrating 256MB (65536 pages) from Node 1 to Node 0...
Migration complete in 1.2 seconds
Pages migrated: 65536
Pages failed: 0
Post-migration analysis:
Remote accesses reduced by 14%
Implementation Hints:
- Read /proc/pid/numa_maps — a sketch of parsing the per-node counts follows this list:

```c
void analyze_numa_maps(pid_t pid) {
    char path[256];
    snprintf(path, sizeof(path), "/proc/%d/numa_maps", pid);
    FILE* f = fopen(path, "r");
    if (!f) return;

    char line[1024];
    while (fgets(line, sizeof(line), f)) {
        // Lines look like:
        // 7f8a1c000000 default heap anon=524288 dirty=524288 N0=400000 N1=124288
        unsigned long addr;
        char policy[32], type[32];
        sscanf(line, "%lx %31s %31s", &addr, policy, type);
        // Parse the node counts (N0=X N1=Y) and compute the percentage on each node
    }
    fclose(f);
}
```

- Get Page Node Locations:

```c
#include <numaif.h>

void get_page_nodes(void* addr, size_t length) {
    size_t page_size = getpagesize();
    size_t num_pages = length / page_size;

    void** pages = malloc(num_pages * sizeof(void*));
    int* status = malloc(num_pages * sizeof(int));

    for (size_t i = 0; i < num_pages; i++) {
        pages[i] = (char*)addr + i * page_size;
    }

    // With nodes == NULL, move_pages only queries page locations
    if (move_pages(0, num_pages, pages, NULL, status, 0) < 0) {
        perror("move_pages query failed");
    }

    // status[i] contains the node number (or -errno) for each page
    for (size_t i = 0; i < num_pages; i++) {
        printf("Page %zu: Node %d\n", i, status[i]);
    }

    free(pages);
    free(status);
}
```

- Migrate Pages:

```c
int migrate_pages_to_node(void** pages, size_t num_pages, int target_node) {
    int* nodes = malloc(num_pages * sizeof(int));
    int* status = malloc(num_pages * sizeof(int));

    // Request the target node for every page
    for (size_t i = 0; i < num_pages; i++) {
        nodes[i] = target_node;
    }

    // Perform the migration
    int result = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
    if (result < 0) {
        perror("move_pages migration failed");
        free(nodes);
        free(status);
        return -1;
    }

    // Check the per-page status
    int success = 0, failed = 0;
    for (size_t i = 0; i < num_pages; i++) {
        if (status[i] == target_node) {
            success++;
        } else if (status[i] < 0) {
            failed++;                  // status[i] is -errno
        }
    }
    printf("Migrated %d pages, failed %d\n", success, failed);

    free(nodes);
    free(status);
    return success;
}
```

- Sample Memory Accesses (using perf):

```c
// Use perf_event_open to sample memory accesses
struct perf_event_attr pe = {
    .type = PERF_TYPE_RAW,
    .config = 0x01d3,      // MEM_LOAD_RETIRED.L3_MISS (event code varies by CPU)
    .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR,
    .sample_period = 1000,
    .precise_ip = 2,
};
// Read the samples to determine which addresses cause remote accesses
```

- Migration Cost Model:

```c
// Estimate whether migration is worth it
double migration_cost_ns = pages_to_migrate * 1000;   // ~1us per page
double savings_per_access_ns = 40;                    // remote vs local latency difference
double accesses_per_second = measured_remote_accesses;

double break_even_seconds =
    migration_cost_ns / (savings_per_access_ns * accesses_per_second);

if (break_even_seconds < 1.0) {
    printf("Migration recommended: pays off in %.2f seconds\n", break_even_seconds);
}
```
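The numa_maps hint above leaves the `N0=...`/`N1=...` parsing as a comment. A small illustrative helper (the function name and MAX_NODES limit are assumptions of this sketch):

```c
#include <stdio.h>
#include <string.h>

#define MAX_NODES 8

// Extract per-node page counts ("N0=400000 N1=124288") from one numa_maps line.
// Returns the total number of pages found across all nodes.
static long parse_node_counts(const char* line, long counts[MAX_NODES]) {
    long total = 0;
    memset(counts, 0, MAX_NODES * sizeof(long));

    const char* p = line;
    while ((p = strstr(p, " N")) != NULL) {
        int node;
        long pages;
        if (sscanf(p, " N%d=%ld", &node, &pages) == 2 && node < MAX_NODES) {
            counts[node] += pages;
            total += pages;
        }
        p += 2;      // move past " N" and keep scanning the rest of the line
    }
    return total;
}
```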
Learning milestones:
- Can read numa_maps → You understand memory placement
- Can query page nodes → You understand move_pages()
- Migration works → You can fix bad placement
- Access sampling works → You can identify hot spots
- Tool improves real application → Practical value achieved
Project 8: Cache Coherence Simulator
- File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
- Main Programming Language: C++ or Rust
- Alternative Programming Languages: Python (for visualization)
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Cache Coherence / Computer Architecture
- Software or Tool: Custom simulator
- Main Book: “Computer Architecture: A Quantitative Approach” by Hennessy & Patterson
What you’ll build: A visual simulator of the MESI cache coherence protocol, showing how cache lines transition between states as multiple cores read and write shared data.
Why it teaches cache coherence: Cache coherence is fundamental to parallel programming but hard to visualize. Building a simulator forces you to understand every state transition and see why false sharing is so expensive.
Core challenges you’ll face:
- Modeling cache states → maps to MESI state machine
- Simulating bus/directory traffic → maps to coherence protocol messages
- Tracking cache line ownership → maps to which core has which state
- Visualizing transitions → maps to making the invisible visible
Key Concepts:
- MESI Protocol: Wikipedia MESI
- Cache Coherence: “Computer Architecture: A Quantitative Approach” Chapter 5 - Hennessy & Patterson
- Directory-Based Protocols: UMD Cache Coherence
- SyncdSim: Coherence Simulator
Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Understanding of caching, state machines
Real world outcome:
$ ./mesi-simulator
=== MESI Cache Coherence Simulator ===
Initial State:
Cache Line 0x1000: Memory value = 42
Core 0 Cache: [I]nvalid
Core 1 Cache: [I]nvalid
Core 2 Cache: [I]nvalid
Core 3 Cache: [I]nvalid
> core0 read 0x1000
Core 0: I → E (exclusive, loaded from memory)
Bus: ReadExclusive 0x1000
Memory read: 1
> core1 read 0x1000
Core 0: E → S (downgrade, sharing)
Core 1: I → S (shared, from Core 0's cache)
Bus: Read 0x1000, Snoop hit on Core 0
Cache-to-cache transfer: 1
> core0 write 0x1000 = 100
Core 0: S → M (modified, invalidating others)
Core 1: S → I (invalidated)
Bus: Invalidate 0x1000
Invalidations sent: 1
> core1 read 0x1000
Core 0: M → S (writeback + share)
Core 1: I → S (shared)
Bus: Read 0x1000, Snoop hit dirty
Memory writeback: 1
Cache-to-cache transfer: 1
Statistics:
Memory reads: 1
Memory writes: 1
Cache-to-cache transfers: 2
Invalidations: 1
Bus transactions: 4
Implementation Hints:
- Cache Line State:

```cpp
enum class MESIState { Invalid, Shared, Exclusive, Modified };

struct CacheLine {
    MESIState state = MESIState::Invalid;
    uint64_t tag = 0;
    uint64_t data = 0;
};

struct Cache {
    int core_id;
    std::unordered_map<uint64_t, CacheLine> lines;

    MESIState get_state(uint64_t addr) {
        auto it = lines.find(addr);
        return it != lines.end() ? it->second.state : MESIState::Invalid;
    }
};
```

- Bus/Directory:

```cpp
struct CoherenceMessage {
    enum Type { Read, ReadExclusive, Invalidate, Writeback, Data };
    Type type;
    int source_core;
    uint64_t address;
    uint64_t data;
};

class Bus {
public:
    std::vector<Cache*> caches;
    std::function<uint64_t(uint64_t)> read_memory;
    std::function<void(uint64_t, uint64_t)> write_memory;

    void broadcast(CoherenceMessage msg) {
        // All caches snoop the bus
        for (auto* cache : caches) {
            if (cache->core_id != msg.source_core) {
                cache->snoop(msg);
            }
        }
    }
};
```

- MESI State Transitions:

```cpp
void Cache::handle_local_read(uint64_t addr) {
    auto& line = lines[addr];
    switch (line.state) {
    case MESIState::Invalid:
        // Need to fetch from memory or another cache
        if (any_cache_has(addr)) {
            line.state = MESIState::Shared;      // cache-to-cache transfer
        } else {
            line.state = MESIState::Exclusive;   // only copy in the system
        }
        line.data = fetch_data(addr);
        break;
    case MESIState::Shared:
    case MESIState::Exclusive:
    case MESIState::Modified:
        // Already have valid data
        break;
    }
}

void Cache::handle_local_write(uint64_t addr, uint64_t data) {
    auto& line = lines[addr];
    switch (line.state) {
    case MESIState::Invalid:
        // Read-for-ownership: fetch the line and invalidate other copies
        bus->broadcast({CoherenceMessage::ReadExclusive, core_id, addr});
        line.state = MESIState::Modified;
        break;
    case MESIState::Shared:
        // Upgrade: invalidate the other copies
        bus->broadcast({CoherenceMessage::Invalidate, core_id, addr});
        line.state = MESIState::Modified;
        break;
    case MESIState::Exclusive:
        // Silent upgrade, no bus traffic
        line.state = MESIState::Modified;
        break;
    case MESIState::Modified:
        // Already hold an exclusive dirty copy
        break;
    }
    line.data = data;
}
```

- Snoop Logic:

```cpp
void Cache::snoop(CoherenceMessage msg) {
    auto it = lines.find(msg.address);
    if (it == lines.end()) return;
    auto& line = it->second;

    switch (msg.type) {
    case CoherenceMessage::Read:
        if (line.state == MESIState::Exclusive) {
            line.state = MESIState::Shared;
        } else if (line.state == MESIState::Modified) {
            // Write back the dirty data, then share
            bus->write_memory(msg.address, line.data);
            line.state = MESIState::Shared;
        }
        break;
    case CoherenceMessage::ReadExclusive:
    case CoherenceMessage::Invalidate:
        if (line.state == MESIState::Modified) {
            bus->write_memory(msg.address, line.data);   // write back before invalidating
        }
        line.state = MESIState::Invalid;
        break;
    default:
        break;
    }
}
```
Learning milestones:
- State transitions correct → You understand MESI
- Bus traffic visible → You see coherence cost
- False sharing visible → You see why it’s expensive
- Directory variant works → You understand scalability
- Can explain to others → Deep understanding achieved
Project 9: NUMA-Aware Thread Pool
- File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
- Main Programming Language: C++
- Alternative Programming Languages: C, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Threading / NUMA Scheduling
- Software or Tool: libnuma, pthreads
- Main Book: “C++ Concurrency in Action” by Anthony Williams
What you’ll build: A thread pool that understands NUMA topology—pinning threads to nodes, scheduling tasks for locality, and implementing work stealing that prefers local nodes.
Why it teaches NUMA threading: Standard thread pools ignore NUMA. High-performance systems need thread pools that co-locate threads with their data and minimize cross-node communication.
Core challenges you’ll face:
- Thread pinning → maps to CPU affinity, NUMA node binding
- Locality-aware scheduling → maps to tasks prefer nodes where their data lives
- NUMA-aware work stealing → maps to steal from local nodes first
- Load balancing vs locality → maps to trade-off decisions
Key Concepts:
- Thread Affinity: pthread_setaffinity_np documentation
- Work Stealing: “Cilk” and Intel TBB documentation
- NUMA Binding: libnuma numa_run_on_node
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Project 5-6 completed, understanding of thread pools
Real world outcome:
// Create NUMA-aware thread pool
NUMAThreadPool pool(/* threads_per_node= */ 4);
// Submit task with node hint
pool.submit([data] {
process(data); // Runs on node where 'data' lives
}, numa_node_of(data));
// Submit without hint (runs on any available thread)
pool.submit([] { generic_work(); });
// Statistics
pool.stats();
// Output:
// Node 0: 4 threads, 12847 tasks, 98.3% local execution
// Node 1: 4 threads, 11923 tasks, 97.9% local execution
// Work steals: 312 (1.3% of tasks)
// Cross-node steals: 47 (0.2% of tasks)
Implementation Hints:
- Per-Node Thread Groups:

```cpp
struct NodeThreads {
    int node_id;
    std::vector<std::thread> threads;
    std::deque<std::function<void()>> local_queue;
    std::mutex queue_mutex;
    std::condition_variable cv;
    std::atomic<bool> stop{false};
};

class NUMAThreadPool {
    std::vector<NodeThreads> node_groups;
};
```

- Pin Threads to Nodes:

```cpp
void worker_thread(NodeThreads* group) {
    // Pin this thread to the node
    numa_run_on_node(group->node_id);

    // Or, more specifically, to the CPUs of this node
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    struct bitmask* cpus = numa_allocate_cpumask();
    numa_node_to_cpus(group->node_id, cpus);
    for (int i = 0; i < numa_num_configured_cpus(); i++) {
        if (numa_bitmask_isbitset(cpus, i)) {
            CPU_SET(i, &cpuset);
        }
    }
    pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);

    // Now run the task loop
    while (!group->stop) {
        auto task = get_task(group);
        if (task) task();
    }
}
```

- Submit with Node Affinity:

```cpp
template<typename F>
void submit(F&& func, int preferred_node = -1) {
    if (preferred_node < 0) {
        // No preference: use the current node (or round-robin)
        preferred_node = get_current_node();
    }
    auto& group = node_groups[preferred_node];
    {
        std::lock_guard lock(group.queue_mutex);
        group.local_queue.push_back(std::forward<F>(func));
    }
    group.cv.notify_one();
}
```

- NUMA-Aware Work Stealing:

```cpp
std::function<void()> get_task(NodeThreads* my_group) {
    // Try the local queue first
    {
        std::unique_lock lock(my_group->queue_mutex);
        if (!my_group->local_queue.empty()) {
            auto task = std::move(my_group->local_queue.front());
            my_group->local_queue.pop_front();
            return task;
        }
    }

    // Try to steal from other nodes (prefer closer nodes)
    std::vector<int> steal_order = get_nodes_by_distance(my_group->node_id);
    for (int node : steal_order) {
        if (node == my_group->node_id) continue;
        auto& victim = node_groups[node];
        std::unique_lock lock(victim.queue_mutex, std::try_to_lock);
        if (lock.owns_lock() && !victim.local_queue.empty()) {
            // Steal from the back (the opposite end from local pops)
            auto task = std::move(victim.local_queue.back());
            victim.local_queue.pop_back();
            return task;
        }
    }

    // Nothing to steal; wait briefly on the local queue
    std::unique_lock lock(my_group->queue_mutex);
    my_group->cv.wait_for(lock, std::chrono::milliseconds(10));
    return nullptr;
}
```

- Task with Data Locality:

```cpp
template<typename F, typename Data>
void submit_with_data(F&& func, Data* data) {
    int node = numa_node_of(data);    // determine the data's node
    submit([func = std::forward<F>(func), data]() { func(data); }, node);
}

int numa_node_of(void* ptr) {
    int status;
    void* pages[] = {ptr};
    move_pages(0, 1, pages, nullptr, &status, 0);
    return status >= 0 ? status : 0;
}
```
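The work-stealing hint calls get_nodes_by_distance() without defining it. A sketch of one possible implementation, ordering nodes by their SLIT distance as reported by numa_distance(); the function name matches the call above, but the body is an assumption of this guide:

```cpp
#include <numa.h>
#include <algorithm>
#include <vector>

// All node ids ordered by distance from 'from_node' (nearest first),
// so that stealing prefers nearby nodes.
std::vector<int> get_nodes_by_distance(int from_node) {
    int num_nodes = numa_max_node() + 1;
    std::vector<int> order(num_nodes);
    for (int n = 0; n < num_nodes; n++) order[n] = n;
    std::sort(order.begin(), order.end(), [from_node](int a, int b) {
        return numa_distance(from_node, a) < numa_distance(from_node, b);
    });
    return order;
}
```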
Learning milestones:
- Threads pinned correctly → You understand affinity
- Local tasks stay local → You understand scheduling
- Work stealing works → You understand load balancing
- Steal order respects distance → NUMA-aware stealing
- Benchmark shows improvement → Real performance gain
Project 10: NUMA-Aware Database Buffer Pool
- File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
- Main Programming Language: C++
- Alternative Programming Languages: C, Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Database Systems / NUMA
- Software or Tool: libnuma, mmap, AIO
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A database buffer pool manager that is NUMA-aware—caching pages on the optimal node, handling page migration, and coordinating with a NUMA-aware query executor.
Why it teaches real-world NUMA: Databases are where NUMA matters most. Large in-memory databases can see 2-3x performance improvements from NUMA-aware design. This project applies everything you’ve learned.
Core challenges you’ll face:
- Per-node buffer pools → maps to partitioned caching
- Page placement strategy → maps to which node gets which pages
- Query executor integration → maps to running queries near their data
- Page migration on access pattern changes → maps to adaptive placement
Key Concepts:
- Buffer Pool Management: “Database Internals” Chapter 5 - Alex Petrov
- NUMA Databases: Research papers on NUMA-aware databases
- Page Caching: Linux page cache documentation
- Query Processing: “Database System Concepts” by Silberschatz et al.
Difficulty: Expert Time estimate: 4-6 weeks Prerequisites: Projects 5-9 completed, understanding of database internals
Real world outcome:
// NUMA-aware buffer pool
NUMABufferPool pool(/* per_node_size= */ 4_GB);
// Read page - automatically cached on optimal node
auto page = pool.get_page(table_id, page_id);
// Query executor uses NUMA hints
pool.execute_query(query, [](Page* page) {
// This lambda runs on the node where 'page' lives
return process(page);
});
// Statistics
pool.stats();
// Output:
// Buffer Pool:
// Node 0: 4GB, 1.2M pages, 99.1% hit rate
// Node 1: 4GB, 1.1M pages, 98.7% hit rate
//
// Access Patterns:
// Local page access: 97.3%
// Remote page access: 2.7%
// Page migrations: 1,234 (optimizing placement)
//
// Query Execution:
// Queries co-located with data: 94.2%
// Cross-node data shipping: 5.8%
Implementation Hints:
- Per-Node Buffer Pool:

```cpp
struct NodeBufferPool {
    int node_id;
    void* buffer_memory;               // from numa_alloc_onnode
    size_t capacity;
    LRUCache<PageId, Page*> cache;
    std::shared_mutex mutex;
    std::atomic<size_t> hits{0};
    std::atomic<size_t> misses{0};
};

class NUMABufferPool {
    std::vector<NodeBufferPool> node_pools;
    std::unordered_map<PageId, int> page_node_map;   // which node holds each page
};
```

- Page Placement Strategy:

```cpp
int choose_node_for_page(PageId page_id, const AccessPattern& pattern) {
    // Strategy 1: hash-based (deterministic)
    // return page_id.table_id % num_nodes;

    // Strategy 2: range-based (for range scans)
    // return (page_id.page_num / pages_per_node) % num_nodes;

    // Strategy 3: access-pattern based
    if (pattern.is_single_thread()) {
        return pattern.primary_thread_node();
    }
    if (pattern.is_read_heavy()) {
        return pattern.most_frequent_reader_node();
    }

    // Interleave write-heavy shared data
    return page_id.page_num % num_nodes;
}
```

- Page Fetch with NUMA Awareness:

```cpp
Page* get_page(PageId page_id) {
    int node = determine_best_node(page_id);
    auto& pool = node_pools[node];

    {
        std::shared_lock lock(pool.mutex);
        if (auto* page = pool.cache.get(page_id)) {
            pool.hits++;
            return page;
        }
    }

    // Cache miss - need to load
    pool.misses++;
    return load_page_to_node(page_id, node);
}

Page* load_page_to_node(PageId page_id, int node) {
    auto& pool = node_pools[node];
    std::unique_lock lock(pool.mutex);

    // Allocate the page on the target node
    Page* page = allocate_page_on_node(node);

    // Read from disk
    read_page_from_disk(page_id, page);

    // Add to the cache, evicting if necessary
    if (pool.cache.is_full()) {
        auto evicted = pool.cache.evict_lru();
        free_page(evicted);
    }
    pool.cache.insert(page_id, page);
    return page;
}
```

- Query Execution with Data Locality:

```cpp
template<typename F>
auto execute_on_page(PageId page_id, F&& func) {
    Page* page = get_page(page_id);
    int page_node = page_node_map[page_id];
    int current_node = get_current_node();

    if (page_node == current_node) {
        // Already on the right node - execute directly
        return func(page);
    }

    // Otherwise execute on the page's node
    return numa_thread_pool.submit_and_wait([&]() {
        return func(page);
    }, page_node);
}
```

- Adaptive Page Migration:

```cpp
void maybe_migrate_page(PageId page_id, int accessing_node) {
    auto& stats = page_access_stats[page_id];
    stats.access_count[accessing_node]++;

    // Check whether migration would help
    int current_node = page_node_map[page_id];
    int best_node = stats.most_frequent_accessor();

    if (best_node != current_node) {
        double current_local_ratio = stats.local_access_ratio(current_node);
        double potential_local_ratio = stats.local_access_ratio(best_node);

        // Migrate only if a significant improvement is expected
        if (potential_local_ratio > current_local_ratio + 0.1) {
            migrate_page(page_id, best_node);
        }
    }
}
```
Learning milestones:
- Per-node caching works → You understand partitioned buffers
- Page placement is deterministic → You understand placement
- Queries run near data → You understand co-location
- Migration improves locality → Adaptive optimization works
- Benchmark shows 2x+ improvement → Real database speedup
Project Comparison Table
| Project | Difficulty | Time | Depth | Fun Factor |
|---|---|---|---|---|
| 1. NUMA Topology Explorer | Beginner | Weekend | ⭐⭐ | ⭐⭐ |
| 2. Memory Latency Benchmark | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 3. Memory Bandwidth Benchmark | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 4. False Sharing Detector | Intermediate | 1 week | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 5. NUMA-Aware Allocator | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| 6. NUMA Data Structures | Advanced | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7. Memory Migration Tool | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 8. Cache Coherence Simulator | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9. NUMA Thread Pool | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10. NUMA Buffer Pool | Expert | 4-6 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
Recommended Learning Path
Phase 1: Understanding (Weeks 1-2)
- Project 1: NUMA Topology Explorer - See your system
- Project 2: Memory Latency Benchmark - Measure the reality
- Project 3: Memory Bandwidth Benchmark - Understand limits
Phase 2: Problems (Weeks 3-4)
- Project 4: False Sharing Detector - See cache coherence cost
- Project 8: Cache Coherence Simulator - Understand MESI deeply
Phase 3: Solutions (Weeks 5-8)
- Project 5: NUMA-Aware Allocator - Control memory placement
- Project 6: NUMA Data Structures - Build locality-aware code
- Project 9: NUMA Thread Pool - Control thread placement
Phase 4: Integration (Weeks 9-12)
- Project 7: Memory Migration Tool - Fix bad placement
- Project 10: NUMA Buffer Pool - Real-world application
Summary
| # | Project | Main Language |
|---|---|---|
| 1 | NUMA Topology Explorer | C |
| 2 | Memory Latency Benchmark | C |
| 3 | Memory Bandwidth Benchmark | C |
| 4 | False Sharing Detector | C |
| 5 | NUMA-Aware Memory Allocator | C |
| 6 | NUMA-Aware Data Structure Library | C++ |
| 7 | NUMA Memory Migration Tool | C |
| 8 | Cache Coherence Simulator | C++ |
| 9 | NUMA-Aware Thread Pool | C++ |
| 10 | NUMA-Aware Database Buffer Pool | C++ |
Essential Resources
Books
- “Computer Architecture: A Quantitative Approach” by Hennessy & Patterson - Chapter 5 on memory hierarchy and multiprocessors
- “The Art of Multiprocessor Programming” by Herlihy & Shavit - Concurrent data structures and cache coherence
- “Linux System Programming” by Robert Love - Low-level Linux interfaces
- “Database Internals” by Alex Petrov - Buffer pool management
Online Resources
- Linux Kernel NUMA Documentation
- libnuma man page
- Intel Memory Latency Checker
- Awesome NUMA on GitHub
- numactl GitHub Repository
- MESI Protocol Wikipedia
Tools
- numactl - NUMA policy control
- numastat - NUMA memory statistics
- numatop - Real-time NUMA monitoring
- Intel MLC - Memory latency checker
- perf - Linux performance counters
- perf c2c - Cache-to-cache analysis
Master these projects and you’ll understand why some code runs fast and some runs slow—and have the skills to make the slow code fast.