

Learn NUMA vs UMA Architectures: From Memory Basics to High-Performance Systems

Goal: Deeply understand memory architectures in multiprocessor systems—from uniform memory access basics to NUMA topology, cache coherence, memory placement policies, and building NUMA-aware high-performance applications.


Why Memory Architecture Matters

Nearly every modern multi-socket server, and many high-core-count workstations and desktops, is a NUMA machine. When your database is slow, your parallel code doesn't scale, or your application hits mysterious performance cliffs, the answer often lies in memory architecture.

Understanding NUMA vs UMA means understanding:

  • Why adding more CPU cores doesn’t always make things faster
  • Why the same code runs at different speeds on different servers
  • How memory placement affects cache behavior
  • Why memory allocation strategy can 10x your performance
  • How to write code that scales on modern hardware

After completing these projects, you will:

  • Understand the physical reality of memory in modern systems
  • Measure and visualize NUMA topology and latencies
  • Write NUMA-aware applications that scale efficiently
  • Implement custom memory allocators for NUMA systems
  • Debug and optimize memory placement issues
  • Understand cache coherence and its performance implications

Core Concept Analysis

The Physical Reality

Modern systems have a hierarchy of memory access times:

┌─────────────────────────────────────────────────────────────────┐
│                         CPU Socket 0                             │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐               │
│  │ Core 0  │ │ Core 1  │ │ Core 2  │ │ Core 3  │               │
│  │┌───────┐│ │┌───────┐│ │┌───────┐│ │┌───────┐│               │
│  ││  L1   ││ ││  L1   ││ ││  L1   ││ ││  L1   ││  ~1ns         │
│  │└───────┘│ │└───────┘│ │└───────┘│ │└───────┘│               │
│  │┌───────┐│ │┌───────┐│ │┌───────┐│ │┌───────┐│               │
│  ││  L2   ││ ││  L2   ││ ││  L2   ││ ││  L2   ││  ~4ns         │
│  │└───────┘│ │└───────┘│ │└───────┘│ │└───────┘│               │
│  └─────────┘ └─────────┘ └─────────┘ └─────────┘               │
│       └──────────────┬───────────────────┘                      │
│              ┌───────┴───────┐                                  │
│              │  Shared L3    │  ~20ns                           │
│              │  (LLC)        │                                  │
│              └───────┬───────┘                                  │
│                      │                                          │
│              ┌───────┴───────┐                                  │
│              │ Memory Ctrl   │                                  │
│              └───────┬───────┘                                  │
│                      │                                          │
│              ┌───────┴───────┐                                  │
│              │  Local DRAM   │  ~80ns (LOCAL)                   │
│              │  (NUMA Node 0)│                                  │
│              └───────────────┘                                  │
└─────────────────────┬───────────────────────────────────────────┘
                      │
                QPI/UPI Link  (~40ns additional)
                      │
┌─────────────────────┴───────────────────────────────────────────┐
│                         CPU Socket 1                             │
│              ┌───────────────┐                                  │
│              │  Remote DRAM  │  ~120ns (REMOTE)                 │
│              │  (NUMA Node 1)│                                  │
│              └───────────────┘                                  │
│                      │                                          │
│              ┌───────┴───────┐                                  │
│              │ Memory Ctrl   │                                  │
│              └───────┬───────┘                                  │
│                      │                                          │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐               │
│  │ Core 4  │ │ Core 5  │ │ Core 6  │ │ Core 7  │               │
│  └─────────┘ └─────────┘ └─────────┘ └─────────┘               │
└─────────────────────────────────────────────────────────────────┘

UMA (Uniform Memory Access)

In UMA systems, all processors have equal access time to all memory:

┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐
│ CPU 0  │  │ CPU 1  │  │ CPU 2  │  │ CPU 3  │
└───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘
    │           │           │           │
    └───────────┴─────┬─────┴───────────┘
                      │
              ┌───────┴───────┐
              │   Shared Bus   │
              └───────┬───────┘
                      │
         ┌────────────┼────────────┐
         │            │            │
     ┌───┴───┐   ┌───┴───┐   ┌───┴───┐
     │ Bank 0│   │ Bank 1│   │ Bank 2│
     └───────┘   └───────┘   └───────┘
              SHARED MEMORY
              (Equal latency from all CPUs)

Characteristics:

  • Simple programming model
  • All memory accesses take the same time
  • Limited scalability (bus becomes bottleneck)
  • Found in: older systems, small multicore chips
  • Example: Single-socket desktop computers

NUMA (Non-Uniform Memory Access)

In NUMA systems, memory access time depends on memory location relative to the CPU:

     NUMA Node 0                    NUMA Node 1
┌─────────────────────┐        ┌─────────────────────┐
│  ┌────┐  ┌────┐     │        │  ┌────┐  ┌────┐     │
│  │CPU0│  │CPU1│     │        │  │CPU2│  │CPU3│     │
│  └──┬─┘  └──┬─┘     │        │  └──┬─┘  └──┬─┘     │
│     └────┬──┘       │        │     └────┬──┘       │
│          │          │        │          │          │
│    ┌─────┴─────┐    │◄──────►│    ┌─────┴─────┐    │
│    │Mem Ctrl 0 │    │  QPI/  │    │Mem Ctrl 1 │    │
│    └─────┬─────┘    │  UPI   │    └─────┬─────┘    │
│          │          │ Inter- │          │          │
│    ┌─────┴─────┐    │connect │    ┌─────┴─────┐    │
│    │ Local RAM │    │        │    │ Local RAM │    │
│    │  (Fast)   │    │        │    │  (Fast)   │    │
│    │  ~80ns    │    │        │    │  ~80ns    │    │
│    └───────────┘    │        │    └───────────┘    │
└─────────────────────┘        └─────────────────────┘
         │                              │
         │     Remote Access ~120ns     │
         └──────────────────────────────┘

Characteristics:

  • Local memory access is fast (~80ns)
  • Remote memory access is slower (~120-150ns or more)
  • Scales better than UMA
  • Requires careful memory placement for performance
  • Found in: servers, workstations, multi-socket systems

Memory Access Latency Comparison

Memory Level      Latency     Cycles (3 GHz)   Notes
L1 Cache          ~1ns        3-4              Per-core
L2 Cache          ~4ns        12               Per-core
L3 Cache          ~20ns       60               Shared within socket
Local DRAM        ~80ns       240              Same NUMA node
Remote DRAM       ~120ns+     360+             Different NUMA node
Remote (2-hop)    ~150ns+     450+             Through another node

Key Insight: Remote memory access can be 50-100% slower than local. This is why NUMA awareness matters!

Cache Coherence (MESI Protocol)

When multiple cores cache the same memory location, they must stay synchronized:

MESI States:
┌─────────────────────────────────────────────────────────────┐
│  Modified (M)  │ Only copy, dirty (different from memory)  │
├─────────────────────────────────────────────────────────────┤
│  Exclusive (E) │ Only copy, clean (same as memory)         │
├─────────────────────────────────────────────────────────────┤
│  Shared (S)    │ Multiple copies exist, all clean          │
├─────────────────────────────────────────────────────────────┤
│  Invalid (I)   │ Not valid, must fetch from memory/cache   │
└─────────────────────────────────────────────────────────────┘

State Transitions (key transitions):

  Invalid    --local read, no other copies------>  Exclusive
  Invalid    --local read, copies elsewhere----->  Shared
  Invalid    --local write (read-for-ownership)->  Modified
  Exclusive  --local write (silent upgrade)----->  Modified
  Exclusive  --remote read---------------------->  Shared
  Shared     --local write (invalidate others)-->  Modified
  Modified   --remote read (write back first)--->  Shared
  Any state  --remote write / eviction---------->  Invalid

Why This Matters for NUMA:

  • Cache coherence traffic crosses NUMA nodes
  • “False sharing” causes cache line ping-pong between nodes
  • Directory-based protocols (ccNUMA) scale better than bus snooping
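
Project 4 below demonstrates false sharing in depth. As a minimal illustration (a sketch, assuming 64-byte cache lines), compare a layout where two independently updated fields share a line with one where each field gets its own line:

// Two threads update different fields, but both fields live on the same
// 64-byte cache line, so the line ping-pongs between the cores' caches
struct { long a; long b; } hot;

// C11 _Alignas gives each field its own cache line, eliminating the coherence traffic
struct {
    _Alignas(64) long a;
    _Alignas(64) long b;
} padded;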

NUMA Policies in Linux

Linux provides several memory allocation policies:

Policy            Behavior
MPOL_DEFAULT      Allocate on the local node (first-touch)
MPOL_BIND         Allocate only on the specified nodes (fail if unavailable)
MPOL_PREFERRED    Prefer the specified node, fall back to others
MPOL_INTERLEAVE   Round-robin across nodes (good for shared data)
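
These policies can also be applied explicitly per allocation. A minimal sketch using libnuma and mbind() (assumes libnuma is installed and you link with -lnuma; the node number is illustrative):

#include <numa.h>
#include <numaif.h>

void policy_examples(void) {
    size_t size = 1UL << 30;  // 1 GB

    // MPOL_INTERLEAVE: pages are spread round-robin across all allowed nodes
    void* shared = numa_alloc_interleaved(size);

    // MPOL_BIND: restrict an existing mapping to node 0 only
    void* bound = numa_alloc_local(size);
    unsigned long nodemask = 1UL << 0;  // Bitmask with node 0 set
    mbind(bound, size, MPOL_BIND, &nodemask, numa_max_node() + 2, 0);

    numa_free(shared, size);
    numa_free(bound, size);
}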

First-Touch Policy (Default):

  • Memory is allocated on the node of the first thread to write to it
  • This is why initialization patterns matter!
// BAD: The main thread touches every page, so all pages land on its node
size_t size = 1UL << 30;            // 1 GB
char* buffer = malloc(size);        // Virtual allocation only, no pages placed yet
memset(buffer, 0, size);            // First touch: every page ends up on the main thread's node
// Workers running on other nodes now access remote memory - SLOW!

// GOOD: Each worker first-touches its own portion
char* buffer = malloc(size);        // Allocated but not touched
#pragma omp parallel for
for (size_t i = 0; i < size; i++) {
    buffer[i] = 0;                  // Each thread touches its own portion
}
// Pages are distributed across nodes according to which thread touched them first

Project List

Projects are ordered from understanding concepts to building production-quality NUMA-aware systems.


Project 1: NUMA Topology Explorer

  • File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Python
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: NUMA Topology / System Information
  • Software or Tool: libnuma, Linux sysfs
  • Main Book: “Computer Architecture: A Quantitative Approach” by Hennessy & Patterson

What you’ll build: A tool that discovers and visualizes the NUMA topology of any Linux system—showing nodes, CPUs, memory, and distances between nodes.

Why it teaches NUMA basics: Before optimizing for NUMA, you must understand what you’re working with. This project teaches you to read the hardware topology that determines performance.

Core challenges you’ll face:

  • Reading /sys/devices/system/node/ → maps to understanding Linux NUMA interfaces
  • Parsing CPU affinity masks → maps to understanding which CPUs belong to which nodes
  • Interpreting distance matrix → maps to understanding relative memory access costs
  • Using libnuma APIs → maps to programmatic topology discovery

Key Concepts:

  • NUMA Topology: Linux Kernel NUMA Documentation
  • libnuma Interface: numa(3) man page
  • CPU Affinity: “Linux System Programming” Chapter 6 - Robert Love
  • Memory Hierarchy: “Computer Architecture: A Quantitative Approach” Chapter 2 - Hennessy & Patterson

Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic C programming, Linux familiarity

Real world outcome:

$ ./numa-explorer

=== NUMA Topology Report ===

System: 2 NUMA nodes, 16 CPUs, 64GB RAM

Node 0:
  CPUs: 0-7
  Memory: 32GB (28GB free)
  Distance to Node 0: 10
  Distance to Node 1: 21

Node 1:
  CPUs: 8-15
  Memory: 32GB (30GB free)
  Distance to Node 0: 21
  Distance to Node 1: 10

Distance Matrix (relative latency):
        Node 0  Node 1
Node 0:   10      21
Node 1:   21      10

Cache Topology:
  L1d: 32KB per core
  L1i: 32KB per core
  L2: 256KB per core
  L3: 20MB shared per node

Interconnect: Intel QPI @ 8.0 GT/s

Implementation Hints:

  1. Discover NUMA Nodes:
    #include <numa.h>
    
    if (numa_available() < 0) {
        printf("NUMA not available on this system\n");
        return 1;
    }
    
    int num_nodes = numa_max_node() + 1;
    printf("System has %d NUMA nodes\n", num_nodes);
    
  2. Read from sysfs (alternative to libnuma):
    // Nodes are in /sys/devices/system/node/node*/
    // CPUs per node: /sys/devices/system/node/node0/cpulist
    // Memory info: /sys/devices/system/node/node0/meminfo
    // Distances: /sys/devices/system/node/node0/distance
    
  3. Get CPUs per Node:
    struct bitmask* cpus = numa_allocate_cpumask();
    for (int node = 0; node <= numa_max_node(); node++) {
        numa_node_to_cpus(node, cpus);
        printf("Node %d CPUs: ", node);
        for (int cpu = 0; cpu < numa_num_configured_cpus(); cpu++) {
            if (numa_bitmask_isbitset(cpus, cpu)) {
                printf("%d ", cpu);
            }
        }
        printf("\n");
    }
    
  4. Get Memory per Node:
    long long free_mem;
    long long total_mem = numa_node_size64(node, &free_mem);
    printf("Node %d: %lld MB total, %lld MB free\n",
           node, total_mem / (1024*1024), free_mem / (1024*1024));
    
  5. Get Distance Matrix:
    for (int i = 0; i <= numa_max_node(); i++) {
        for (int j = 0; j <= numa_max_node(); j++) {
            int distance = numa_distance(i, j);
            printf("%4d ", distance);
        }
        printf("\n");
    }
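  6. Read Cache Topology (sketch): the sample output also lists cache sizes; on Linux these live in sysfs under /sys/devices/system/cpu/cpu*/cache/index*/ (the paths are standard, but stopping after four index directories is an assumption):
    char path[128], size[32];
    for (int idx = 0; idx < 4; idx++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/index%d/size", idx);
        FILE* f = fopen(path, "r");
        if (!f) break;                               // No more cache levels
        if (fgets(size, sizeof(size), f))
            printf("cache index%d: %s", idx, size);  // e.g. "32K"
        fclose(f);
    }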
    

Learning milestones:

  1. Tool discovers node count → You understand NUMA basics
  2. CPU mapping works → You understand node-CPU relationships
  3. Memory info works → You understand node memory
  4. Distance matrix displays → You understand NUMA distances
  5. Compare with numactl → Your tool matches system tools

Project 2: Memory Latency Microbenchmark

  • File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Assembly
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Memory Performance / Benchmarking
  • Software or Tool: libnuma, rdtsc/rdtscp
  • Main Book: “Computer Architecture: A Quantitative Approach” by Hennessy & Patterson

What you’ll build: A microbenchmark that measures memory access latency to different NUMA nodes, cache levels, and memory regions—revealing the true cost of memory architecture decisions.

Why it teaches memory performance: Numbers like “120ns remote latency” are abstract until you measure them yourself. This project gives you intuition for memory performance that will guide all your optimization work.

Core challenges you’ll face:

  • Pointer chasing to defeat prefetchers → maps to understanding hardware prefetching
  • High-resolution timing → maps to using TSC, avoiding measurement overhead
  • Controlling memory placement → maps to numa_alloc_onnode, mbind
  • Statistical analysis → maps to handling variance in measurements

Key Concepts:

  • Memory Latency Measurement: Intel MLC Documentation
  • Pointer Chasing: “Computer Architecture: A Quantitative Approach” Chapter 2 - Hennessy & Patterson
  • TSC Timing: Intel Software Developer’s Manual
  • NUMA Allocation: libnuma documentation

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Project 1 completed, understanding of memory hierarchy

Real world outcome:

$ ./numa-latency

Memory Latency Benchmark
========================

Testing from Node 0 (CPUs 0-7):

  L1 Cache (32KB):     1.2 ns
  L2 Cache (256KB):    4.1 ns
  L3 Cache (20MB):    18.7 ns
  Local DRAM:         78.3 ns
  Remote DRAM (Node 1): 127.6 ns

Testing from Node 1 (CPUs 8-15):

  L1 Cache:            1.2 ns
  L2 Cache:            4.0 ns
  L3 Cache:           19.1 ns
  Local DRAM:         79.1 ns
  Remote DRAM (Node 0): 125.4 ns

NUMA Penalty: 1.63x slower for remote access

Full Latency Matrix (ns):
            To Node 0   To Node 1
From Node 0:   78.3       127.6
From Node 1:  125.4        79.1

Implementation Hints:

  1. Pointer Chasing (defeats prefetcher):
    // Create a pointer-chasing chain that visits nodes in random order
    typedef struct Node {
        struct Node* next;
        char padding[56];  // Make each node 64 bytes (one cache line)
    } Node;
    
    // Link the nodes into one random cycle so the prefetcher can't predict the
    // next access (shuffling existing next pointers can split the chain into
    // several short cycles that fit in cache and skew the measurement)
    void build_chain(Node* nodes, size_t count) {
        size_t* order = malloc(count * sizeof(size_t));
        for (size_t i = 0; i < count; i++) order[i] = i;
        for (size_t i = count - 1; i > 0; i--) {        // Fisher-Yates shuffle
            size_t j = rand() % (i + 1);
            size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        for (size_t i = 0; i < count; i++) {            // Close the cycle
            nodes[order[i]].next = &nodes[order[(i + 1) % count]];
        }
        free(order);
    }
    }
    
  2. High-Resolution Timing:
    static inline uint64_t rdtsc() {
        uint32_t lo, hi;
        __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }
    
    static inline uint64_t rdtscp() {
        uint32_t lo, hi, aux;
        __asm__ volatile ("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
        return ((uint64_t)hi << 32) | lo;
    }
    
    // Measure latency
    uint64_t start = rdtsc();
    for (int i = 0; i < iterations; i++) {
        p = p->next;  // Pointer chase
    }
    uint64_t end = rdtscp();
    uint64_t cycles = (end - start) / iterations;
    double ns = cycles / cpu_ghz;
    
  3. Allocate on Specific Node:
    #include <numa.h>
    
    void* buffer = numa_alloc_onnode(size, target_node);
    if (!buffer) {
        perror("numa_alloc_onnode failed");
    }
    
    // Or use mbind() (declared in <numaif.h>) for existing memory
    unsigned long nodemask = 1UL << target_node;
    mbind(buffer, size, MPOL_BIND, &nodemask, numa_max_node() + 2, 0);
    
  4. Pin Thread to Specific CPU:
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu_id, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
    
  5. Test Different Sizes (to hit different cache levels):
    size_t test_sizes[] = {
        16 * 1024,        // L1 (32KB)
        128 * 1024,       // L2 (256KB)
        8 * 1024 * 1024,  // L3 (varies)
        256 * 1024 * 1024 // DRAM
    };
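  6. Calibrate the TSC Frequency (sketch): hint 2 divides cycles by cpu_ghz, which you have to measure first; one simple approach (an assumption, not the only way) is to time a known sleep interval against the TSC:
    #include <time.h>
    
    double measure_tsc_ghz(void) {
        struct timespec ts = {1, 0};          // ~1 second
        uint64_t t0 = rdtsc();
        nanosleep(&ts, NULL);
        uint64_t t1 = rdtsc();
        return (double)(t1 - t0) / 1e9;       // TSC cycles per nanosecond
    }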
    

Learning milestones:

  1. Measure cache latencies → You understand cache hierarchy
  2. Measure local vs remote DRAM → You see NUMA penalty
  3. Results match Intel MLC → Your benchmark is accurate
  4. Understand variance → You understand measurement challenges
  5. Can predict performance → Real understanding achieved

Project 3: Memory Bandwidth Benchmark

  • File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++ with intrinsics, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Memory Bandwidth / SIMD
  • Software or Tool: libnuma, AVX intrinsics
  • Main Book: “Computer Architecture: A Quantitative Approach” by Hennessy & Patterson

What you’ll build: A bandwidth benchmark (like STREAM) that measures memory throughput across NUMA nodes, with different access patterns and thread configurations.

Why it teaches bandwidth: Latency can be hidden by pipelining, prefetching, and out-of-order execution; bandwidth limits cannot be hidden. Understanding bandwidth limits tells you when you're memory-bound and how parallelism affects memory performance.

Core challenges you’ll face:

  • Saturating memory bandwidth → maps to using SIMD, multiple threads
  • NUMA-aware allocation → maps to controlling where data lives
  • Measuring aggregate bandwidth → maps to combining thread results
  • Understanding bandwidth scaling → maps to when more threads help/hurt

Key Concepts:

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Project 2 completed, understanding of SIMD basics

Real world outcome:

$ ./numa-bandwidth --threads 16

NUMA Memory Bandwidth Benchmark
===============================

Single-threaded (pinned to Node 0):
  Copy (Local):   12.4 GB/s
  Copy (Remote):   6.8 GB/s
  Scale (Local):  11.9 GB/s
  Triad (Local):  13.2 GB/s

Multi-threaded bandwidth:

Threads   Local BW    Remote BW   Interleaved
   1      12.4 GB/s    6.8 GB/s     9.5 GB/s
   2      24.6 GB/s   11.2 GB/s    17.8 GB/s
   4      45.3 GB/s   18.7 GB/s    31.4 GB/s
   8      68.2 GB/s   24.1 GB/s    43.6 GB/s
  16      72.1 GB/s   25.3 GB/s    47.2 GB/s

Peak aggregate bandwidth: 72.1 GB/s (8 threads saturate)
Remote access penalty: 2.85x slower

Per-Node breakdown:
  Node 0: 72.1 GB/s peak (4 channels × 2666 MT/s)
  Node 1: 71.8 GB/s peak

Implementation Hints:

  1. STREAM-style Operations:
    // Copy: c[i] = a[i]
    void copy(double* restrict c, double* restrict a, size_t n) {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++) {
            c[i] = a[i];
        }
    }
    
    // Scale: b[i] = scalar * c[i]
    void scale(double* restrict b, double* restrict c, double scalar, size_t n) {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++) {
            b[i] = scalar * c[i];
        }
    }
    
    // Add: c[i] = a[i] + b[i]
    void add(double* restrict c, double* restrict a, double* restrict b, size_t n) {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }
    
    // Triad: a[i] = b[i] + scalar * c[i]
    void triad(double* restrict a, double* restrict b, double* restrict c,
               double scalar, size_t n) {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++) {
            a[i] = b[i] + scalar * c[i];
        }
    }
    
  2. AVX-512 for Maximum Bandwidth:
    #include <immintrin.h>
    
    // Assumes 64-byte-aligned buffers and n divisible by 8; non-temporal stores
    // (_mm512_stream_pd) can raise measured bandwidth further by bypassing the cache
    void copy_avx512(double* restrict dst, double* restrict src, size_t n) {
        for (size_t i = 0; i < n; i += 8) {
            __m512d v = _mm512_load_pd(&src[i]);
            _mm512_store_pd(&dst[i], v);
        }
    }
    
  3. NUMA-Aware Allocation:
    // Allocate on specific node
    double* local_array = numa_alloc_onnode(size, local_node);
    
    // Allocate interleaved across all nodes
    double* interleaved = numa_alloc_interleaved(size);
    
    // First-touch initialization for NUMA placement
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {
        array[i] = 0.0;  // Each thread touches its portion
    }
    
  4. Bandwidth Calculation:
    double bytes = n * sizeof(double) * 2;  // Read + Write for Copy
    double seconds = end_time - start_time;
    double bandwidth_gb = (bytes / 1e9) / seconds;
    
  5. Scaling Test:
    for (int threads = 1; threads <= max_threads; threads *= 2) {
        omp_set_num_threads(threads);
        // Run benchmark and record bandwidth
    }
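  6. Measuring Remote Bandwidth (sketch): for the "Remote BW" column, run the kernels on one node while the arrays live on another; numa_run_on_node() and numa_alloc_onnode() are the relevant libnuma calls, and the node numbers and 'bytes' size below are illustrative:
    numa_run_on_node(0);                               // Execute threads on node 0
    double* local  = numa_alloc_onnode(bytes, 0);      // Arrays on node 0 (local)
    double* remote = numa_alloc_onnode(bytes, 1);      // Arrays on node 1 (remote)
    // Run copy/scale/add/triad against each buffer and compare the resulting GB/s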
    

Learning milestones:

  1. Single-threaded bandwidth measured → You understand basic measurement
  2. Local vs remote difference clear → You see NUMA bandwidth penalty
  3. Scaling curve shows saturation → You understand bandwidth limits
  4. SIMD version faster → You understand importance of vectorization
  5. Interleaved vs local trade-offs → Deep NUMA bandwidth understanding

Project 4: False Sharing Detector

  • File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Cache Coherence / Performance Analysis
  • Software or Tool: perf, libnuma
  • Main Book: “The Art of Multiprocessor Programming” by Herlihy & Shavit

What you’ll build: A tool that demonstrates and detects false sharing—when threads access different data that happens to share a cache line, causing severe performance degradation.

Why it teaches cache coherence: False sharing is one of the most common performance bugs in parallel code. It’s invisible to correctness but devastating to performance—sometimes 10-100x slower.

Core challenges you’ll face:

  • Creating false sharing conditions → maps to understanding cache line granularity
  • Measuring cache coherence traffic → maps to using hardware performance counters
  • Padding to eliminate false sharing → maps to cache-line alignment
  • Detecting in real code → maps to analyzing memory access patterns

Key Concepts:

  • False Sharing: “The Art of Multiprocessor Programming” Chapter 7 - Herlihy & Shavit
  • MESI Protocol: ARM MESI Documentation
  • Cache Line Size: Intel/AMD processor documentation
  • Performance Counters: perf stat documentation

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Project 1 completed, understanding of caching

Real world outcome:

$ ./false-sharing-demo

=== False Sharing Demonstration ===

Test 1: Adjacent counters (FALSE SHARING)
  Struct layout: [cnt0|cnt1|cnt2|cnt3] (8 bytes each, same cache line)
  4 threads incrementing their own counter...
  Time: 2847 ms
  Throughput: 35.1 M ops/sec
  Cache misses: 847,293,102

Test 2: Padded counters (NO FALSE SHARING)
  Struct layout: [cnt0|pad...][cnt1|pad...] (64 bytes each, separate lines)
  4 threads incrementing their own counter...
  Time: 127 ms
  Throughput: 787.4 M ops/sec
  Cache misses: 1,234

Speedup from eliminating false sharing: 22.4x

=== Detecting False Sharing ===
Monitoring process 12345 for 5 seconds...

Potential false sharing detected:
  Address range: 0x7f8a12340000 - 0x7f8a12340040
  Threads accessing: 0, 1, 2, 3
  Cache coherence invalidations: 12,847,293
  Recommendation: Add padding between variables at offsets 0, 8, 16, 24

Implementation Hints:

  1. Create False Sharing:
    // BAD: All counters on same cache line
    struct {
        volatile long counter[4];  // 32 bytes, fits in one 64-byte cache line
    } shared;
    
    void* increment_bad(void* arg) {
        int id = *(int*)arg;
        for (int i = 0; i < 100000000; i++) {
            shared.counter[id]++;  // Causes cache line invalidation!
        }
        return NULL;
    }
    
  2. Eliminate False Sharing with Padding:
    // GOOD: Each counter on its own cache line
    #define CACHE_LINE_SIZE 64
    
    struct PaddedCounter {
        _Alignas(CACHE_LINE_SIZE) volatile long counter;  // C11 _Alignas (alignas in C++)
        char padding[CACHE_LINE_SIZE - sizeof(long)];
    };
    struct PaddedCounter padded[4];
    
    void* increment_good(void* arg) {
        int id = *(int*)arg;
        for (int i = 0; i < 100000000; i++) {
            padded[id].counter++;  // No false sharing
        }
        return NULL;
    }
    
  3. Measure with Performance Counters:
    # Run with perf to see cache effects
    perf stat -e cache-misses,cache-references,\
               LLC-load-misses,LLC-store-misses \
               ./false_sharing_demo
    
    # Or use perf c2c for cache-to-cache analysis
    perf c2c record ./false_sharing_demo
    perf c2c report
    
  4. Programmatic Counter Access:
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    
    static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                                int cpu, int group_fd, unsigned long flags) {
        return syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
    }
    
    // Setup counter for cache misses
    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.config = PERF_COUNT_HW_CACHE_MISSES;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    
    int fd = perf_event_open(&pe, 0, -1, -1, 0);
    
  5. Alignment Utilities:
    // C11/C++11 alignas
    struct alignas(64) CacheAlignedCounter {
        long value;
    };
    
    // Or manual padding
    struct PaddedCounter {
        long value;
        char padding[64 - sizeof(long)];
    } __attribute__((aligned(64)));
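  6. Enable and Read the Counter (sketch, continuing hint 4; assumes fd is the descriptor returned by perf_event_open and run_workload() stands in for the code under test):
    #include <sys/ioctl.h>
    #include <unistd.h>
    
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    run_workload();                           // Hypothetical: the code being measured
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    
    long long misses;
    read(fd, &misses, sizeof(misses));        // Single counter, default read_format
    printf("Cache misses: %lld\n", misses);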
    

Learning milestones:

  1. Demo shows massive slowdown → You see false sharing impact
  2. Padding fixes the problem → You understand the solution
  3. Perf counters confirm diagnosis → You can detect it
  4. You can explain MESI states → Deep understanding
  5. Find false sharing in real code → Practical application

Project 5: NUMA-Aware Memory Allocator

  • File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Memory Allocators / NUMA
  • Software or Tool: libnuma, mmap
  • Main Book: “The Art of Memory Management” and AMD NUMA allocator paper

What you’ll build: A custom memory allocator that is NUMA-aware—allocating memory from the local node when possible and providing explicit control over memory placement.

Why it teaches NUMA allocation: Standard malloc() doesn’t know about NUMA. High-performance applications need allocators that understand memory topology and place data strategically.

Core challenges you’ll face:

  • Per-node memory pools → maps to maintaining separate free lists per node
  • Thread-to-node mapping → maps to determining which node a thread runs on
  • Allocation strategies → maps to local-first vs interleaved
  • Fragmentation management → maps to coalescing, large page support

Key Concepts:

  • NUMA Allocator Design: AMD NUMA-aware heap manager
  • Memory Allocator Basics: “The Art of Multiprocessor Programming” Chapter 14 - Herlihy & Shavit
  • libnuma Allocation: numa(3) man page
  • Page Allocation: Linux mmap() documentation

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-2 completed, understanding of memory allocators

Real world outcome:

// Your allocator API
void* numa_malloc(size_t size);              // Allocate on local node
void* numa_malloc_node(size_t size, int node); // Allocate on specific node
void* numa_malloc_interleaved(size_t size);  // Interleave across nodes
void numa_free(void* ptr);

// Statistics
numa_allocator_stats();
// Output:
// Node 0: 1.2 GB allocated, 847 allocations, 0 remote accesses
// Node 1: 0.8 GB allocated, 523 allocations, 0 remote accesses
// Thread-local cache hits: 94.2%

Implementation Hints:

  1. Per-Node Arena Structure:
    #define MAX_NODES 8
    #define ARENA_SIZE (64 * 1024 * 1024)  // 64MB per arena
    
    typedef struct {
        void* base;           // mmap'd region
        size_t size;          // Total size
        size_t used;          // Bytes allocated
        pthread_mutex_t lock;
        FreeList* free_lists[NUM_SIZE_CLASSES];
    } NodeArena;
    
    typedef struct {
        NodeArena arenas[MAX_NODES];
        int num_nodes;
    } NUMAAllocator;
    
  2. Initialize Per-Node Arenas:
    void numa_allocator_init(NUMAAllocator* alloc) {
        alloc->num_nodes = numa_max_node() + 1;
    
        for (int node = 0; node < alloc->num_nodes; node++) {
            // Allocate arena memory on specific node
            void* base = numa_alloc_onnode(ARENA_SIZE, node);
            if (!base) {
                perror("numa_alloc_onnode failed");
                exit(1);
            }
    
            alloc->arenas[node].base = base;
            alloc->arenas[node].size = ARENA_SIZE;
            alloc->arenas[node].used = 0;
            pthread_mutex_init(&alloc->arenas[node].lock, NULL);
        }
    }
    
  3. Determine Current Node:
    int get_current_node() {
        int cpu = sched_getcpu();
        return numa_node_of_cpu(cpu);
    }
    
  4. Local Allocation:
    void* numa_malloc(size_t size) {
        int node = get_current_node();
        return allocate_from_node(&allocator.arenas[node], size);
    }
    
    void* allocate_from_node(NodeArena* arena, size_t size) {
        pthread_mutex_lock(&arena->lock);
    
        // Find appropriate size class
        int size_class = get_size_class(size);
    
        // Check free list
        if (arena->free_lists[size_class]) {
            void* ptr = arena->free_lists[size_class];
            arena->free_lists[size_class] = *(void**)ptr;
            pthread_mutex_unlock(&arena->lock);
            return ptr;
        }
    
        // Allocate from arena
        size_t alloc_size = size_class_to_size(size_class);
        if (arena->used + alloc_size > arena->size) {
            pthread_mutex_unlock(&arena->lock);
            return NULL;  // Arena full
        }
    
        void* ptr = (char*)arena->base + arena->used;
        arena->used += alloc_size;
        pthread_mutex_unlock(&arena->lock);
        return ptr;
    }
    
  5. Thread-Local Caching (for performance):
    __thread struct {
        void* cache[NUM_SIZE_CLASSES][CACHE_SIZE];
        int count[NUM_SIZE_CLASSES];
        int current_node;
    } tls_cache;
    
    void* numa_malloc_fast(size_t size) {
        int sc = get_size_class(size);
        if (tls_cache.count[sc] > 0) {
            return tls_cache.cache[sc][--tls_cache.count[sc]];
        }
        // Slow path: go to arena
        return numa_malloc(size);
    }
    
  6. Interleaved Allocation:
    __thread int interleave_node = 0;
    
    void* numa_malloc_interleaved(size_t size) {
        int node = interleave_node;
        interleave_node = (interleave_node + 1) % allocator.num_nodes;
        return allocate_from_node(&allocator.arenas[node], size);
    }
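  7. Freeing Memory (sketch): the API above also declares numa_free(); one simple scheme (an assumption, since these hints store no per-block metadata) is to find the owning arena by address range and push the block onto that arena's free list:
    void numa_free(void* ptr) {
        for (int node = 0; node < allocator.num_nodes; node++) {
            NodeArena* a = &allocator.arenas[node];
            char* base = (char*)a->base;
            if ((char*)ptr >= base && (char*)ptr < base + a->size) {
                pthread_mutex_lock(&a->lock);
                int sc = size_class_of(ptr);         // Hypothetical: recover the block's size class
                *(void**)ptr = a->free_lists[sc];    // Link the block into the free list
                a->free_lists[sc] = ptr;
                pthread_mutex_unlock(&a->lock);
                return;
            }
        }
    }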
    

Learning milestones:

  1. Basic per-node allocation works → You understand the structure
  2. Correct node determined → You understand thread-node mapping
  3. Free lists work → You understand memory reuse
  4. Thread-local caching works → You understand performance optimization
  5. Benchmark shows improvement → Your allocator beats standard malloc for NUMA

Project 6: NUMA-Aware Data Structure Library

  • File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
  • Main Programming Language: C++
  • Alternative Programming Languages: C, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Data Structures / NUMA Optimization
  • Software or Tool: libnuma, C++ allocators
  • Main Book: “The Art of Multiprocessor Programming” by Herlihy & Shavit

What you’ll build: A library of NUMA-aware data structures—hash tables, queues, and trees that understand memory locality and minimize cross-node traffic.

Why it teaches NUMA data structures: Generic data structures ignore NUMA. High-performance systems need data structures that partition data across nodes and minimize remote accesses.

Core challenges you’ll face:

  • Partitioned hash tables → maps to per-node buckets, local lookups
  • NUMA-aware queues → maps to minimizing producer-consumer cross-node traffic
  • Tree partitioning → maps to locality-aware tree construction
  • Balancing locality vs load → maps to when to access remote data

Key Concepts:

  • Concurrent Data Structures: “The Art of Multiprocessor Programming” Chapters 9-13 - Herlihy & Shavit
  • NUMA-Aware Hashing: Research papers on NUMA hash tables
  • Work Stealing: “Cilk” scheduler documentation
  • Lock-Free Programming: “C++ Concurrency in Action” by Anthony Williams

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Project 5 completed, understanding of concurrent data structures

Real world outcome:

// NUMA-aware hash table
NUMAHashTable<std::string, UserData> table(numa_nodes);

// Insert - data placed on local node
table.insert("user123", userData);  // O(1), local memory

// Lookup - checks local node first
auto result = table.find("user123");  // Usually local

// NUMA-aware work queue
NUMAWorkQueue<Task> queue;
queue.push(task);  // Goes to local node's queue

// Worker pops from local queue first, steals from remote if empty
auto task = queue.pop();  // Local-first, then steal

// Statistics
table.stats();
// Output:
// Node 0: 1.2M entries, 99.2% local lookups
// Node 1: 1.1M entries, 98.7% local lookups
// Cross-node lookups: 1.1%

Implementation Hints:

  1. NUMA-Aware Hash Table:
    template<typename K, typename V>
    class NUMAHashTable {
    private:
        struct NodePartition {
            std::vector<Bucket> buckets;
            mutable std::shared_mutex mutex;
            alignas(64) std::atomic<size_t> count{0};
        };
    
        std::vector<NodePartition> partitions;
    
        int key_to_node(const K& key) {
            // Hash determines which node owns this key
            return std::hash<K>{}(key) % partitions.size();
        }
    
    public:
        NUMAHashTable(int num_nodes) : partitions(num_nodes) {
            for (int i = 0; i < num_nodes; i++) {
                // Allocate buckets on specific node
                numa_run_on_node(i);
                partitions[i].buckets.resize(INITIAL_BUCKETS);
            }
        }
    
        void insert(const K& key, const V& value) {
            int node = key_to_node(key);
            auto& partition = partitions[node];
            std::unique_lock lock(partition.mutex);
            // Insert into partition
        }
    
        std::optional<V> find(const K& key) {
            int node = key_to_node(key);
            auto& partition = partitions[node];
            std::shared_lock lock(partition.mutex);
            // Lookup in partition
        }
    };
    
  2. NUMA-Aware Work-Stealing Queue:
    template<typename T>
    class NUMAWorkQueue {
    private:
        struct alignas(64) NodeQueue {
            std::deque<T> local_queue;
            std::mutex mutex;
        };
    
        std::vector<NodeQueue> queues;
    
    public:
        void push(T item) {
            int node = numa_node_of_cpu(sched_getcpu());
            std::lock_guard lock(queues[node].mutex);
            queues[node].local_queue.push_back(std::move(item));
        }
    
        std::optional<T> pop() {
            int my_node = numa_node_of_cpu(sched_getcpu());
    
            // Try local queue first
            {
                std::lock_guard lock(queues[my_node].mutex);
                if (!queues[my_node].local_queue.empty()) {
                    T item = std::move(queues[my_node].local_queue.back());
                    queues[my_node].local_queue.pop_back();
                    return item;
                }
            }
    
            // Steal from other nodes (in distance order)
            for (int node = 0; node < queues.size(); node++) {
                if (node == my_node) continue;
                std::lock_guard lock(queues[node].mutex);
                if (!queues[node].local_queue.empty()) {
                    T item = std::move(queues[node].local_queue.front());
                    queues[node].local_queue.pop_front();
                    return item;
                }
            }
    
            return std::nullopt;
        }
    };
    
  3. NUMA-Aware Vector:
    template<typename T>
    class NUMAVector {
    private:
        struct Chunk {
            T* data;
            size_t size;
            int node;
        };
    
        std::vector<Chunk> chunks;
        size_t chunk_size = 1024 * 1024 / sizeof(T);  // 1MB chunks
    
    public:
        void push_back(const T& value) {
            int node = get_current_node();
            // Allocate new chunk on local node if needed
            if (needs_new_chunk()) {
                T* data = (T*)numa_alloc_onnode(chunk_size * sizeof(T), node);
                chunks.push_back({data, 0, node});
            }
            // Add to current chunk
        }
    
        T& operator[](size_t index) {
            auto& chunk = find_chunk(index);
            return chunk.data[index % chunk_size];
        }
    };
    

Learning milestones:

  1. Partitioned hash table works → You understand data partitioning
  2. Work stealing works → You understand locality-aware scheduling
  3. Benchmark shows improvement → Your structures are faster
  4. Statistics confirm locality → You’re achieving local access
  5. Can integrate with real application → Practical value achieved

Project 7: NUMA Memory Migration Tool

  • File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Memory Management / Page Migration
  • Software or Tool: libnuma, move_pages(), perf
  • Main Book: Linux Kernel documentation on NUMA

What you’ll build: A tool that analyzes process memory placement and migrates pages to optimal NUMA nodes based on access patterns—like a “defragmenter” for NUMA locality.

Why it teaches memory migration: Sometimes memory ends up on the wrong node (first-touch gone wrong, thread migration). Understanding how to detect and fix this is crucial for NUMA optimization.

Core challenges you’ll face:

  • Determining page node locations → maps to get_mempolicy(), /proc/pid/numa_maps
  • Identifying hot pages → maps to performance counters, sampling
  • Migrating pages → maps to move_pages() system call
  • Balancing migration cost → maps to when migration pays off

Key Concepts:

  • NUMA Memory Policy: Linux Kernel NUMA documentation
  • move_pages(): Linux man pages
  • Page Migration: Linux mm/migrate.c documentation
  • numastat: numactl documentation

Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Projects 1-2 completed, understanding of virtual memory

Real world outcome:

$ ./numa-migrate --analyze --pid 12345

=== NUMA Memory Analysis for PID 12345 ===

Virtual Memory Map:
  0x7f0000000000 - 0x7f0080000000 (2GB heap): 87% Node 0, 13% Node 1
  0x7f0100000000 - 0x7f0140000000 (1GB mmap): 100% Node 1

Thread-Node Affinity:
  Thread 1 (TID 12345): Running on Node 0
  Thread 2 (TID 12346): Running on Node 0
  Thread 3 (TID 12347): Running on Node 1
  Thread 4 (TID 12348): Running on Node 1

Memory Access Analysis (sampled):
  Node 0 threads accessing Node 1 memory: 847,293 accesses/sec
  Node 1 threads accessing Node 0 memory: 23,412 accesses/sec

Recommendations:
  1. Migrate 256MB from Node 1 to Node 0 (heap region)
     Estimated improvement: 12% fewer remote accesses

$ ./numa-migrate --migrate --pid 12345

Migrating 256MB (65536 pages) from Node 1 to Node 0...
Migration complete in 1.2 seconds
Pages migrated: 65536
Pages failed: 0

Post-migration analysis:
  Remote accesses reduced by 14%

Implementation Hints:

  1. Read /proc/pid/numa_maps:
    void analyze_numa_maps(pid_t pid) {
        char path[256];
        snprintf(path, sizeof(path), "/proc/%d/numa_maps", pid);
    
        FILE* f = fopen(path, "r");
        char line[1024];
    
        while (fgets(line, sizeof(line), f)) {
            // Parse lines like:
            // 7f8a1c000000 default heap anon=524288 dirty=524288 N0=400000 N1=124288
            unsigned long addr;
            char policy[32], type[32];
            sscanf(line, "%lx %s %s", &addr, policy, type);
    
            // Parse node counts: N0=X N1=Y
            // Calculate percentage on each node
        }
    }
    
  2. Get Page Node Locations:
    #include <numaif.h>
    
    void get_page_nodes(void* addr, size_t length) {
        size_t page_size = getpagesize();
        size_t num_pages = length / page_size;
    
        void** pages = malloc(num_pages * sizeof(void*));
        int* nodes = malloc(num_pages * sizeof(int));
        int* status = malloc(num_pages * sizeof(int));
    
        for (size_t i = 0; i < num_pages; i++) {
            pages[i] = (char*)addr + i * page_size;
        }
    
        // Query page locations
        if (move_pages(0, num_pages, pages, NULL, status, 0) < 0) {
            perror("move_pages query failed");
        }
    
        // status[i] contains the node number for each page
        for (size_t i = 0; i < num_pages; i++) {
            printf("Page %zu: Node %d\n", i, status[i]);
        }
    }
    
  3. Migrate Pages:
    int migrate_pages_to_node(void** pages, size_t num_pages, int target_node) {
        int* nodes = malloc(num_pages * sizeof(int));
        int* status = malloc(num_pages * sizeof(int));
    
        // Set all pages to target node
        for (size_t i = 0; i < num_pages; i++) {
            nodes[i] = target_node;
        }
    
        // Perform migration
        int result = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
    
        if (result < 0) {
            perror("move_pages migration failed");
            return -1;
        }
    
        // Check status for each page
        int success = 0, failed = 0;
        for (size_t i = 0; i < num_pages; i++) {
            if (status[i] == target_node) {
                success++;
            } else if (status[i] < 0) {
                failed++;
                // status[i] is -errno
            }
        }
    
        printf("Migrated %d pages, failed %d\n", success, failed);
        return success;
    }
    
  4. Sample Memory Accesses (using perf):
    // Use perf_event_open to sample memory accesses
    struct perf_event_attr pe = {
        .type = PERF_TYPE_RAW,
        .config = 0x01d3,  // MEM_LOAD_RETIRED.L3_MISS (varies by CPU)
        .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR,
        .sample_period = 1000,
        .precise_ip = 2,
    };
    
    // Read samples to determine which addresses cause remote accesses
    
  5. Migration Cost Model:
    // Estimate if migration is worth it
    double migration_cost_ns = pages_to_migrate * 1000;  // ~1us per page
    double savings_per_access_ns = 40;  // Remote vs local latency difference
    double accesses_per_second = measured_remote_accesses;
    
    double break_even_seconds = migration_cost_ns / (savings_per_access_ns * accesses_per_second);
    
    if (break_even_seconds < 1.0) {
        printf("Migration recommended: pays off in %.2f seconds\n", break_even_seconds);
    }
    

Learning milestones:

  1. Can read numa_maps → You understand memory placement
  2. Can query page nodes → You understand move_pages()
  3. Migration works → You can fix bad placement
  4. Access sampling works → You can identify hot spots
  5. Tool improves real application → Practical value achieved

Project 8: Cache Coherence Simulator

  • File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
  • Main Programming Language: C++ or Rust
  • Alternative Programming Languages: Python (for visualization)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Cache Coherence / Computer Architecture
  • Software or Tool: Custom simulator
  • Main Book: “Computer Architecture: A Quantitative Approach” by Hennessy & Patterson

What you’ll build: A visual simulator of the MESI cache coherence protocol, showing how cache lines transition between states as multiple cores read and write shared data.

Why it teaches cache coherence: Cache coherence is fundamental to parallel programming but hard to visualize. Building a simulator forces you to understand every state transition and see why false sharing is so expensive.

Core challenges you’ll face:

  • Modeling cache states → maps to MESI state machine
  • Simulating bus/directory traffic → maps to coherence protocol messages
  • Tracking cache line ownership → maps to which core has which state
  • Visualizing transitions → maps to making the invisible visible

Key Concepts:

Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Understanding of caching, state machines

Real world outcome:

$ ./mesi-simulator

=== MESI Cache Coherence Simulator ===

Initial State:
  Cache Line 0x1000: Memory value = 42

  Core 0 Cache: [I]nvalid
  Core 1 Cache: [I]nvalid
  Core 2 Cache: [I]nvalid
  Core 3 Cache: [I]nvalid

> core0 read 0x1000
  Core 0: I → E (exclusive: no other cache holds the line)
  Bus: Read 0x1000 (no snoop hits)
  Memory read: 1

> core1 read 0x1000
  Core 0: E → S (downgrade, sharing)
  Core 1: I → S (shared, from Core 0's cache)
  Bus: Read 0x1000, Snoop hit on Core 0
  Cache-to-cache transfer: 1

> core0 write 0x1000 = 100
  Core 0: S → M (modified, invalidating others)
  Core 1: S → I (invalidated)
  Bus: Invalidate 0x1000
  Invalidations sent: 1

> core1 read 0x1000
  Core 0: M → S (writeback + share)
  Core 1: I → S (shared)
  Bus: Read 0x1000, Snoop hit dirty
  Memory writeback: 1
  Cache-to-cache transfer: 1

Statistics:
  Memory reads: 1
  Memory writes: 1
  Cache-to-cache transfers: 2
  Invalidations: 1
  Bus transactions: 4

Implementation Hints:

  1. Cache Line State:
    enum class MESIState { Invalid, Shared, Exclusive, Modified };
    
    struct CacheLine {
        MESIState state = MESIState::Invalid;
        uint64_t tag = 0;
        uint64_t data = 0;
    };
    
    struct Cache {
        int core_id;
        std::unordered_map<uint64_t, CacheLine> lines;
    
        MESIState get_state(uint64_t addr) {
            auto it = lines.find(addr);
            return it != lines.end() ? it->second.state : MESIState::Invalid;
        }
    };
    
  2. Bus/Directory:
    struct CoherenceMessage {
        enum Type { Read, ReadExclusive, Invalidate, Writeback, Data };
        Type type;
        int source_core;
        uint64_t address;
        uint64_t data;
    };
    
    class Bus {
    public:
        std::vector<Cache*> caches;
        std::function<uint64_t(uint64_t)> read_memory;
        std::function<void(uint64_t, uint64_t)> write_memory;
    
        void broadcast(CoherenceMessage msg) {
            // All caches snoop the bus
            for (auto* cache : caches) {
                if (cache->core_id != msg.source_core) {
                    cache->snoop(msg);
                }
            }
        }
    };
    
  3. MESI State Transitions:
    void Cache::handle_local_read(uint64_t addr) {
        auto& line = lines[addr];
    
        switch (line.state) {
            case MESIState::Invalid:
                // Need to fetch from memory/other cache
                if (any_cache_has(addr)) {
                    // Cache-to-cache transfer
                    line.state = MESIState::Shared;
                } else {
                    // Exclusive access
                    line.state = MESIState::Exclusive;
                }
                line.data = fetch_data(addr);
                break;
    
            case MESIState::Shared:
            case MESIState::Exclusive:
            case MESIState::Modified:
                // Already have valid data
                break;
        }
    }
    
    void Cache::handle_local_write(uint64_t addr, uint64_t data) {
        auto& line = lines[addr];
    
        switch (line.state) {
            case MESIState::Invalid:
                // Read-for-ownership: fetch the line and invalidate other copies
                bus->broadcast({CoherenceMessage::ReadExclusive, core_id, addr});
                line.state = MESIState::Modified;
                break;
    
            case MESIState::Shared:
                // Upgrade: invalidate other copies before modifying
                bus->broadcast({CoherenceMessage::Invalidate, core_id, addr});
                line.state = MESIState::Modified;
                break;
    
            case MESIState::Exclusive:
                // Silent upgrade, no bus traffic needed
                line.state = MESIState::Modified;
                break;
    
            case MESIState::Modified:
                // Already hold the only, dirty copy
                break;
        }
        line.data = data;
    }
    
  4. Snoop Logic:
    void Cache::snoop(CoherenceMessage msg) {
        auto it = lines.find(msg.address);
        if (it == lines.end()) return;
    
        auto& line = it->second;
    
        switch (msg.type) {
            case CoherenceMessage::Read:
                if (line.state == MESIState::Exclusive) {
                    line.state = MESIState::Shared;
                } else if (line.state == MESIState::Modified) {
                    // Write back dirty data, then share
                    bus->write_memory(msg.address, line.data);
                    line.state = MESIState::Shared;
                }
                break;
    
            case CoherenceMessage::ReadExclusive:
            case CoherenceMessage::Invalidate:
                if (line.state == MESIState::Modified) {
                    bus->write_memory(msg.address, line.data);  // Write back before dropping
                }
                line.state = MESIState::Invalid;
                break;
    
            default:
                break;
        }
    }
    

Learning milestones:

  1. State transitions correct → You understand MESI
  2. Bus traffic visible → You see coherence cost
  3. False sharing visible → You see why it’s expensive
  4. Directory variant works → You understand scalability
  5. Can explain to others → Deep understanding achieved

Project 9: NUMA-Aware Thread Pool

  • File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
  • Main Programming Language: C++
  • Alternative Programming Languages: C, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Threading / NUMA Scheduling
  • Software or Tool: libnuma, pthreads
  • Main Book: “C++ Concurrency in Action” by Anthony Williams

What you’ll build: A thread pool that understands NUMA topology—pinning threads to nodes, scheduling tasks for locality, and implementing work stealing that prefers local nodes.

Why it teaches NUMA threading: Standard thread pools ignore NUMA. High-performance systems need thread pools that co-locate threads with their data and minimize cross-node communication.

Core challenges you’ll face:

  • Thread pinning → maps to CPU affinity, NUMA node binding
  • Locality-aware scheduling → maps to tasks prefer nodes where their data lives
  • NUMA-aware work stealing → maps to steal from local nodes first
  • Load balancing vs locality → maps to trade-off decisions

Key Concepts:

  • Thread Affinity: pthread_setaffinity_np documentation
  • Work Stealing: “Cilk” and Intel TBB documentation
  • NUMA Binding: libnuma numa_run_on_node

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Project 5-6 completed, understanding of thread pools

Real world outcome:

// Create NUMA-aware thread pool
NUMAThreadPool pool(/* threads_per_node= */ 4);

// Submit task with node hint
pool.submit([data] {
    process(data);  // Runs on node where 'data' lives
}, numa_node_of(data));

// Submit without hint (runs on any available thread)
pool.submit([] { generic_work(); });

// Statistics
pool.stats();
// Output:
// Node 0: 4 threads, 12847 tasks, 98.3% local execution
// Node 1: 4 threads, 11923 tasks, 97.9% local execution
// Work steals: 312 (1.3% of tasks)
// Cross-node steals: 47 (0.2% of tasks)

Implementation Hints:

  1. Per-Node Thread Groups:
    struct NodeThreads {
        int node_id;
        std::vector<std::thread> threads;
        std::deque<std::function<void()>> local_queue;
        std::mutex queue_mutex;
        std::condition_variable cv;
        std::atomic<bool> stop{false};
    };
    
    class NUMAThreadPool {
        std::vector<NodeThreads> node_groups;
    };
    
  2. Pin Threads to Nodes:
    void worker_thread(NodeThreads* group) {
        // Pin this thread to the node
        numa_run_on_node(group->node_id);
    
        // Or more specifically to CPUs on this node
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        struct bitmask* cpus = numa_allocate_cpumask();
        numa_node_to_cpus(group->node_id, cpus);
        for (int i = 0; i < numa_num_configured_cpus(); i++) {
            if (numa_bitmask_isbitset(cpus, i)) {
                CPU_SET(i, &cpuset);
            }
        }
        pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
    
        // Now run task loop
        while (!group->stop) {
            auto task = get_task(group);
            if (task) task();
        }
    }
    
  3. Submit with Node Affinity:
    template<typename F>
    void submit(F&& func, int preferred_node = -1) {
        if (preferred_node < 0) {
            // No preference: use current node or round-robin
            preferred_node = get_current_node();
        }
    
        auto& group = node_groups[preferred_node];
        {
            std::lock_guard lock(group.queue_mutex);
            group.local_queue.push_back(std::forward<F>(func));
        }
        group.cv.notify_one();
    }
    
  4. NUMA-Aware Work Stealing:
    std::function<void()> get_task(NodeThreads* my_group) {
        // Try local queue first
        {
            std::unique_lock lock(my_group->queue_mutex);
            if (!my_group->local_queue.empty()) {
                auto task = std::move(my_group->local_queue.front());
                my_group->local_queue.pop_front();
                return task;
            }
        }
    
        // Try to steal from other nodes (prefer closer nodes)
        std::vector<int> steal_order = get_nodes_by_distance(my_group->node_id);
    
        for (int node : steal_order) {
            if (node == my_group->node_id) continue;
    
            auto& victim = node_groups[node];
            std::unique_lock lock(victim.queue_mutex, std::try_to_lock);
            if (lock.owns_lock() && !victim.local_queue.empty()) {
                // Steal from back (opposite end from pop)
                auto task = std::move(victim.local_queue.back());
                victim.local_queue.pop_back();
                return task;
            }
        }
    
        // Nothing to steal, wait on local queue
        std::unique_lock lock(my_group->queue_mutex);
        my_group->cv.wait_for(lock, std::chrono::milliseconds(10));
        return nullptr;
    }
    
  5. Task with Data Locality:
    template<typename F, typename Data>
    void submit_with_data(F&& func, Data* data) {
        int node = numa_node_of(data);  // Determine data's node
        submit([func = std::forward<F>(func), data]() {
            func(data);
        }, node);
    }
    
    int numa_node_of(void* ptr) {
        // With nodes == nullptr, move_pages() (from <numaif.h>) only queries:
        // status receives the node the page resides on, or a negative errno
        // (e.g. -ENOENT if the page has never been touched).
        int status;
        void* pages[] = {ptr};
        move_pages(0, 1, pages, nullptr, &status, 0);
        return status >= 0 ? status : 0;  // fall back to node 0 on error
    }
    

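Hints 3 and 4 above call get_current_node() and get_nodes_by_distance() without showing them. Here is a minimal sketch, assuming libnuma: sched_getcpu(), numa_node_of_cpu(), numa_max_node(), and numa_distance() are the real interfaces (link with -lnuma); the two function names themselves are just the placeholders used in the hints.

    #include <sched.h>      // sched_getcpu (needs _GNU_SOURCE with glibc; g++ defines it by default)
    #include <numa.h>       // numa_node_of_cpu, numa_max_node, numa_distance
    #include <algorithm>
    #include <vector>

    // NUMA node of the CPU this thread is running on right now.
    int get_current_node() {
        int node = numa_node_of_cpu(sched_getcpu());
        return node >= 0 ? node : 0;   // fall back to node 0 on error
    }

    // All nodes ordered by SLIT distance from 'from_node'; numa_distance()
    // returns 10 for the local node and larger values for remote nodes.
    std::vector<int> get_nodes_by_distance(int from_node) {
        std::vector<int> nodes;
        for (int n = 0; n <= numa_max_node(); n++) nodes.push_back(n);
        std::sort(nodes.begin(), nodes.end(), [from_node](int a, int b) {
            return numa_distance(from_node, a) < numa_distance(from_node, b);
        });
        return nodes;   // from_node itself sorts first
    }
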
Learning milestones:

  1. Threads pinned correctly → You understand affinity
  2. Local tasks stay local → You understand scheduling
  3. Work stealing works → You understand load balancing
  4. Steal order respects distance → NUMA-aware stealing
  5. Benchmark shows improvement → Real performance gain

Project 10: NUMA-Aware Database Buffer Pool

  • File: LEARN_NUMA_VS_UMA_ARCHITECTURES_DEEP_DIVE.md
  • Main Programming Language: C++
  • Alternative Programming Languages: C, Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Database Systems / NUMA
  • Software or Tool: libnuma, mmap, AIO
  • Main Book: “Database Internals” by Alex Petrov

What you’ll build: A database buffer pool manager that is NUMA-aware—caching pages on the optimal node, handling page migration, and coordinating with a NUMA-aware query executor.

Why it teaches real-world NUMA: Databases are where NUMA matters most. Large in-memory databases can see 2-3x performance improvements from NUMA-aware design. This project applies everything you’ve learned.

Core challenges you’ll face:

  • Per-node buffer pools → maps to partitioned caching
  • Page placement strategy → maps to which node gets which pages
  • Query executor integration → maps to running queries near their data
  • Page migration on access pattern changes → maps to adaptive placement

Key Concepts:

  • Buffer Pool Management: “Database Internals” Chapter 5 - Alex Petrov
  • NUMA Databases: Research papers on NUMA-aware databases
  • Page Caching: Linux page cache documentation
  • Query Processing: “Database System Concepts” by Silberschatz et al.

Difficulty: Expert Time estimate: 4-6 weeks Prerequisites: Projects 5-9 completed, understanding of database internals

Real world outcome:

// NUMA-aware buffer pool
NUMABufferPool pool(/* per_node_size= */ 4_GB);

// Read page - automatically cached on optimal node
auto page = pool.get_page(table_id, page_id);

// Query executor uses NUMA hints
pool.execute_query(query, [](Page* page) {
    // This lambda runs on the node where 'page' lives
    return process(page);
});

// Statistics
pool.stats();
// Output:
// Buffer Pool:
//   Node 0: 4GB, 1.2M pages, 99.1% hit rate
//   Node 1: 4GB, 1.1M pages, 98.7% hit rate
//
// Access Patterns:
//   Local page access: 97.3%
//   Remote page access: 2.7%
//   Page migrations: 1,234 (optimizing placement)
//
// Query Execution:
//   Queries co-located with data: 94.2%
//   Cross-node data shipping: 5.8%

Implementation Hints:

  1. Per-Node Buffer Pool:
    struct NodeBufferPool {
        int node_id;
        void* buffer_memory;  // numa_alloc_onnode
        size_t capacity;
        LRUCache<PageId, Page*> cache;
        std::shared_mutex mutex;
    
        std::atomic<size_t> hits{0};
        std::atomic<size_t> misses{0};
    };
    
    class NUMABufferPool {
        // NodeBufferPool holds a shared_mutex and atomics, so it is not movable:
        // construct node_pools at its final size (one entry per node) up front.
        std::vector<NodeBufferPool> node_pools;
        std::unordered_map<PageId, int> page_node_map;  // Which node has each page
    };
    
  2. Page Placement Strategy:
    int choose_node_for_page(PageId page_id, const AccessPattern& pattern) {
        // Strategy 1: Hash-based (deterministic)
        // return page_id.table_id % num_nodes;
    
        // Strategy 2: Range-based (for range scans)
        // return (page_id.page_num / pages_per_node) % num_nodes;
    
        // Strategy 3: Access-pattern based
        if (pattern.is_single_thread()) {
            return pattern.primary_thread_node();
        }
        if (pattern.is_read_heavy()) {
            return pattern.most_frequent_reader_node();
        }
        // Interleave for write-heavy shared data
        return page_id.page_num % num_nodes;
    }
    
  3. Page Fetch with NUMA Awareness:
    Page* get_page(PageId page_id) {
        int node = determine_best_node(page_id);
        auto& pool = node_pools[node];
    
        {
            std::shared_lock lock(pool.mutex);
            if (auto* page = pool.cache.get(page_id)) {
                pool.hits++;
                return page;
            }
        }
    
        // Cache miss - need to load
        pool.misses++;
        return load_page_to_node(page_id, node);
    }
    
    Page* load_page_to_node(PageId page_id, int node) {
        auto& pool = node_pools[node];
    
        std::unique_lock lock(pool.mutex);
    
        // Another thread may have loaded the page while we held only the shared lock
        if (auto* existing = pool.cache.get(page_id)) {
            return existing;
        }
    
        // Allocate page on target node (see the sketch after these hints)
        Page* page = allocate_page_on_node(node);
    
        // Read from disk
        read_page_from_disk(page_id, page);
    
        // Add to cache, possibly evicting
        if (pool.cache.is_full()) {
            auto evicted = pool.cache.evict_lru();
            free_page(evicted);
        }
        pool.cache.insert(page_id, page);
    
        return page;
    }
    
  4. Query Execution with Data Locality:
    template<typename F>
    auto execute_on_page(PageId page_id, F&& func) {
        Page* page = get_page(page_id);
        int page_node = page_node_map[page_id];
        int current_node = get_current_node();
    
        if (page_node == current_node) {
            // Already on correct node - execute directly
            return func(page);
        }
    
        // Execute on the page's node
        return numa_thread_pool.submit_and_wait([&]() {
            return func(page);
        }, page_node);
    }
    
  5. Adaptive Page Migration:
    void maybe_migrate_page(PageId page_id, int accessing_node) {
        auto& stats = page_access_stats[page_id];
        stats.access_count[accessing_node]++;
    
        // Check if migration would help
        int current_node = page_node_map[page_id];
        int best_node = stats.most_frequent_accessor();
    
        if (best_node != current_node) {
            double current_local_ratio = stats.local_access_ratio(current_node);
            double potential_local_ratio = stats.local_access_ratio(best_node);
    
            // Migrate if significant improvement expected
            if (potential_local_ratio > current_local_ratio + 0.1) {
                migrate_page(page_id, best_node);  // see the sketch after these hints
            }
        }
    }
    

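Two helpers used above are only named, not shown: allocate_page_on_node() (hint 3) and migrate_page() (hint 5). A minimal sketch, assuming each page is backed by libnuma memory; numa_alloc_onnode() and the move_pages() syscall wrapper are the real interfaces (link with -lnuma), while Page, PageId, get_page(), and page_node_map are the illustrative names from the hints.

    #include <numa.h>      // numa_alloc_onnode, numa_free
    #include <numaif.h>    // move_pages, MPOL_MF_MOVE
    #include <new>         // placement new

    // Back a buffer-pool page with memory physically placed on 'node'.
    // A real pool would carve pages out of the node's pre-allocated
    // buffer_memory arena instead of calling numa_alloc_onnode per page.
    Page* allocate_page_on_node(int node) {
        void* mem = numa_alloc_onnode(sizeof(Page), node);  // rounded up to whole OS pages
        return new (mem) Page{};
    }

    // Move a cached page's memory to 'target_node' and record its new home.
    // Assumes a Page occupies a single OS page; a larger Page would need one
    // move_pages entry per OS page it spans.
    void migrate_page(PageId page_id, int target_node) {
        Page* page = get_page(page_id);        // make sure the page is resident
        void* pages[]  = { page };
        int   nodes[]  = { target_node };
        int   status[] = { 0 };
        // pid 0 = calling process; status[0] is the node the page ended up on,
        // or a negative errno if the kernel could not move it.
        if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) == 0 &&
            status[0] == target_node) {
            page_node_map[page_id] = target_node;
        }
    }
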
Learning milestones:

  1. Per-node caching works → You understand partitioned buffers
  2. Page placement is deterministic → You understand placement
  3. Queries run near data → You understand co-location
  4. Migration improves locality → Adaptive optimization works
  5. Benchmark shows 2x+ improvement → Real database speedup

Project Comparison Table

| Project                       | Difficulty   | Time      | Depth | Fun Factor |
|-------------------------------|--------------|-----------|-------|------------|
| 1. NUMA Topology Explorer     | Beginner     | Weekend   | ⭐⭐    | ⭐⭐        |
| 2. Memory Latency Benchmark   | Intermediate | 1 week    | ⭐⭐⭐⭐  | ⭐⭐⭐       |
| 3. Memory Bandwidth Benchmark | Intermediate | 1 week    | ⭐⭐⭐⭐  | ⭐⭐⭐       |
| 4. False Sharing Detector     | Intermediate | 1 week    | ⭐⭐⭐⭐  | ⭐⭐⭐⭐      |
| 5. NUMA-Aware Allocator       | Advanced     | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐       |
| 6. NUMA Data Structures       | Advanced     | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐      |
| 7. Memory Migration Tool      | Advanced     | 2 weeks   | ⭐⭐⭐⭐  | ⭐⭐⭐       |
| 8. Cache Coherence Simulator  | Expert       | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐      |
| 9. NUMA Thread Pool           | Advanced     | 2-3 weeks | ⭐⭐⭐⭐  | ⭐⭐⭐⭐      |
| 10. NUMA Buffer Pool          | Expert       | 4-6 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐     |

Phase 1: Understanding (Weeks 1-2)

  1. Project 1: NUMA Topology Explorer - See your system
  2. Project 2: Memory Latency Benchmark - Measure the reality
  3. Project 3: Memory Bandwidth Benchmark - Understand limits

Phase 2: Problems (Weeks 3-4)

  1. Project 4: False Sharing Detector - See cache coherence cost
  2. Project 8: Cache Coherence Simulator - Understand MESI deeply

Phase 3: Solutions (Weeks 5-8)

  1. Project 5: NUMA-Aware Allocator - Control memory placement
  2. Project 6: NUMA Data Structures - Build locality-aware code
  3. Project 9: NUMA Thread Pool - Control thread placement

Phase 4: Integration (Weeks 9-12)

  1. Project 7: Memory Migration Tool - Fix bad placement
  2. Project 10: NUMA Buffer Pool - Real-world application

Summary

| #  | Project                           | Main Language |
|----|-----------------------------------|---------------|
| 1  | NUMA Topology Explorer            | C             |
| 2  | Memory Latency Benchmark          | C             |
| 3  | Memory Bandwidth Benchmark        | C             |
| 4  | False Sharing Detector            | C             |
| 5  | NUMA-Aware Memory Allocator       | C             |
| 6  | NUMA-Aware Data Structure Library | C++           |
| 7  | NUMA Memory Migration Tool        | C             |
| 8  | Cache Coherence Simulator         | C++           |
| 9  | NUMA-Aware Thread Pool            | C++           |
| 10 | NUMA-Aware Database Buffer Pool   | C++           |

Essential Resources

Books

  • “Computer Architecture: A Quantitative Approach” by Hennessy & Patterson - Chapter 5 on memory hierarchy and multiprocessors
  • “The Art of Multiprocessor Programming” by Herlihy & Shavit - Concurrent data structures and cache coherence
  • “Linux System Programming” by Robert Love - Low-level Linux interfaces
  • “Database Internals” by Alex Petrov - Buffer pool management

Tools

  • numactl - NUMA policy control
  • numastat - NUMA memory statistics
  • numatop - Real-time NUMA monitoring
  • Intel MLC - Memory latency checker
  • perf - Linux performance counters
  • perf c2c - Cache-to-cache analysis

Master these projects and you’ll understand why some code runs fast and some runs slow—and have the skills to make the slow code fast.