Learn NUMA vs UMA Architectures: From Memory Basics to High-Performance Systems
Goal: By the end of this guide you will understand how the memory hierarchy, cache coherence, and NUMA topology interact to produce real-world performance on multi-core systems. You will be able to explain when UMA is sufficient, when NUMA is unavoidable, and how thread placement, memory policies, and allocators change latency and bandwidth. You will build tools that discover topology, measure local vs remote access, detect false sharing, and verify page placement. You will design NUMA-aware allocators, data structures, and schedulers, and you will be able to reason about performance cliffs using evidence from measurements.
Introduction
NUMA (Non-Uniform Memory Access) is a shared-memory architecture where the time to access memory depends on which CPU core and which memory node are involved. UMA (Uniform Memory Access) systems expose one uniform memory latency to all cores. Modern servers and workstations are usually NUMA at the socket or chiplet level, which means performance depends on where data lives and how your threads move across the machine.
What you will build (by the end of this guide):
- A topology explorer that discovers NUMA nodes, CPUs, memory sizes, and distance matrices
- Microbenchmarks that measure local vs remote latency and per-node bandwidth
- A false-sharing detector using hardware performance counters
- NUMA-aware allocators, data structures, and a NUMA-aware thread pool
- A page-migration tool that proves your placement decisions are working
- A capstone NUMA-aware database buffer pool with locality-aware execution
Scope (what is included):
- Linux NUMA topology discovery (sysfs, numactl, libnuma, hwloc)
- Memory policies and page placement (first-touch, bind, interleave, preferred)
- Thread placement, CPU affinity, and NUMA-aware scheduling patterns
- Cache coherence, false sharing, and cache-to-cache traffic analysis
- Measurement and profiling for latency, bandwidth, and locality
Out of scope (for this guide):
- Distributed memory systems (MPI clusters, RDMA fabrics)
- GPU memory architecture and HBM tuning
- Vendor-specific microarchitectural tuning beyond OS-visible NUMA controls
The Big Picture (Mental Model)
Application Threads
|
v
Scheduler + Affinity Memory Policy
| |
v v
CPU Cores ---> Cache Hierarchy --> Page Faults --> Page Allocation
| |
v v
Local Node DRAM Remote Node DRAM
| |
+-------- Interconnect -------+
Key Terms You Will See Everywhere
- NUMA node: A group of CPUs with local access to a subset of memory.
- Local vs remote memory: Memory attached to the same node vs another node.
- First-touch: Pages are allocated on the node that first faults them in.
- Distance matrix: Relative cost of accessing each node from each node.
- False sharing: Independent variables sharing a cache line causing contention.
- HITM: Cache-to-cache transfer of a modified line (expensive coherence traffic).
How to Use This Guide
- Read the Theory Primer first. It is the mini-book that explains the mental model behind every project.
- Start with Project 1 even if you are experienced. You need to see your machine’s topology before anything else.
- Measure before you optimize. The benchmarks are not optional; they teach you what your hardware actually does.
- Use the design questions. They turn a vague implementation into a deliberate architecture.
- Write down your observations. NUMA work is about evidence. Keep a log of numbers and topology outputs.
- Repeat projects with different policies. Bind vs interleave vs preferred will change your results.
Prerequisites & Background Knowledge
Before starting these projects, you should have foundational understanding in these areas:
Essential Prerequisites (Must Have)
Programming Skills:
- Confident C or C++ programming (pointers, structs, memory allocation)
- Comfortable with Linux command-line tools and process inspection
- Basic concurrency (threads, locks, and shared memory)
Systems Fundamentals:
- Memory hierarchy basics (caches vs DRAM)
- Virtual memory and page faults
- CPU scheduling and affinity concepts
- Recommended Reading: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron — Ch. 5-6, 9, 12
Architecture Basics:
- Cache lines and spatial/temporal locality
- Multi-core execution and shared caches
- Recommended Reading: “Computer Organization and Design RISC-V Edition” by Patterson & Hennessy — Ch. 5-6
Helpful But Not Required
Advanced Performance Tools:
- perf, perf c2c, and hardware counters
- Can learn during: Projects 2, 3, 4, 8
Allocator and Data Structure Internals:
- Arena allocators, per-thread pools
- Can learn during: Projects 5, 6
Database internals or storage engines:
- Buffer pools, page caching
- Can learn during: Project 10
Self-Assessment Questions
Before starting, ask yourself:
- Can I explain the difference between latency and bandwidth in one paragraph?
- Can I write a C program that spawns threads and pins them to CPUs?
- Do I know what a cache line is and why it matters for performance?
- Can I read a /proc or /sys file and interpret its contents?
- Have I used a profiler or benchmark tool on Linux before?
If you answered “no” to questions 1-3: Spend 1-2 weeks reading the recommended chapters before starting.
If you answered “yes” to all 5: You’re ready to begin.
Development Environment Setup
Required Tools:
- A Linux machine (Ubuntu 22.04+ or Debian 12+ recommended)
- GCC or Clang (C11 or later)
- numactl and numastat
- perf (Linux perf tools)
- hwloc (lstopo)
Recommended Tools:
- cmake or meson for builds
- gdb or lldb for debugging
- taskset for quick affinity checks
Testing Your Setup:
# Verify NUMA visibility
$ lscpu | grep -i "NUMA"
NUMA node(s): 2
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
# Verify tools exist
$ which numactl numastat perf lstopo
/usr/bin/numactl
/usr/bin/numastat
/usr/bin/perf
/usr/bin/lstopo
Time Investment
- Simple projects (1-2): Weekend (4-8 hours each)
- Moderate projects (3-4): 1 week each
- Complex projects (5-8): 2-3 weeks each
- Capstone (9-10): 3-6 weeks
- Total sprint: 8-12 weeks if done sequentially
Important Reality Check
NUMA is a performance discipline, not a syntax tutorial. You will often build something that “works” but performs poorly. The learning happens when you measure, explain the numbers, and then change the architecture to fix the results. Expect to iterate.
Big Picture / Mental Model
Workload -> Threads -> Affinity -> Memory Policy -> Pages -> NUMA Nodes
| | | | | |
| | | | | +--> Local DRAM (fast)
| | | | +------------> Remote DRAM (slow)
| | | +------------------------> Interleave / Bind
| | +--------------------------------------> CPU placement
| +-------------------------------------------------> Scheduling
+------------------------------------------------------------> Data layout
The core idea: performance is a product of where threads run and where data is placed. Your projects will measure and control both sides.
Theory Primer (Read This Before Coding)
This section is the mini-book that explains the mental model behind every project. Each chapter is a concept you must understand deeply.
Chapter 1: Memory Hierarchy, Latency, and Bandwidth
Fundamentals
Modern CPUs are orders of magnitude faster than DRAM, so computers are built around a memory hierarchy: tiny, fast caches close to the core and larger, slower memory further away. Latency is the time to fetch one piece of data; bandwidth is the total volume of data you can move per second. A workload that does random pointer chasing is latency-bound, while a streaming workload is often bandwidth-bound. Caches exploit temporal and spatial locality by moving data in cache-line-sized chunks, which means a single load may pull in extra neighboring bytes. The operating system and hardware prefetchers try to guess what you will need next, but that only works when access patterns are regular. Understanding which layer you hit (L1, L2, L3, or DRAM) is essential to explaining performance. This chapter gives you the baseline intuition you will use when comparing UMA and NUMA.
Deep Dive into the Concept
Think of memory access as a pipeline. A load instruction begins in the core, checks the L1 data cache, then L2, then the last-level cache (LLC), and only if all caches miss does it go to the memory controller and DRAM. Each miss adds latency, but the CPU can overlap some of that latency by issuing multiple outstanding memory requests, which is called memory-level parallelism (MLP). If your code issues dependent loads (pointer chasing), MLP collapses to one outstanding request, and latency dominates. If your code streams through large arrays, MLP grows and bandwidth becomes the limiter.
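The contrast is easy to see in code. Below is a minimal sketch (not from the guide's project code) that assumes the buffers have already been built as pointer chains, as in Project 2; chase_one and chase_four are illustrative names. The single dependent chain exposes raw latency, while four independent chains let the core keep several misses in flight.
#include <stddef.h>
#include <stdint.h>

// MLP = 1: every load depends on the previous one, so latency dominates.
uint64_t chase_one(volatile uint64_t *p, size_t steps) {
    for (size_t i = 0; i < steps; i++)
        p = (volatile uint64_t *)*p;
    return (uint64_t)(uintptr_t)p;
}

// MLP up to 4: the chains are independent, so misses can overlap.
uint64_t chase_four(volatile uint64_t *a, volatile uint64_t *b,
                    volatile uint64_t *c, volatile uint64_t *d, size_t steps) {
    for (size_t i = 0; i < steps; i++) {
        a = (volatile uint64_t *)*a;
        b = (volatile uint64_t *)*b;
        c = (volatile uint64_t *)*c;
        d = (volatile uint64_t *)*d;
    }
    return (uint64_t)((uintptr_t)a ^ (uintptr_t)b ^ (uintptr_t)c ^ (uintptr_t)d);
}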
The memory hierarchy is also shaped by cache lines and write policies. Cache lines are fixed-size blocks (commonly 64 bytes) that move between caches and memory as a unit. A store that misses in cache usually triggers a read-for-ownership (write-allocate), which means a write can cause extra read traffic before the write is committed. This behavior matters when you think you are only writing but are actually causing read bandwidth pressure. Similarly, a small write in a cache line can invalidate the line in other cores, creating coherence traffic even if the other core only touched adjacent data.
Latency and bandwidth are different metrics. Latency is a per-access delay, while bandwidth is aggregate throughput. A system can have high bandwidth but still poor single-load latency, and vice versa. This is why benchmarks like STREAM focus on sustained bandwidth, while pointer-chasing microbenchmarks focus on latency. The well-known “latency numbers” give an order-of-magnitude feel (L1 ~0.5 ns, main memory ~100 ns), but the exact numbers on your machine will vary, so you must measure them. Intel’s Memory Latency Checker (MLC) explicitly measures local and cross-socket latencies and bandwidth, showing how the memory hierarchy behaves under load. STREAM is designed to measure sustained memory bandwidth using simple vector kernels where data reuse is intentionally minimal.
Another key factor is the TLB (Translation Lookaside Buffer). The CPU translates virtual addresses to physical addresses using page tables. The TLB caches those translations. If your workload touches huge amounts of memory with poor locality, you can become TLB-bound. Large pages (2 MB or 1 GB huge pages) reduce TLB pressure but can interact with NUMA placement because page migration moves larger chunks. This is why NUMA-aware systems often combine memory policy, page sizes, and workload structure.
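As a minimal sketch of the huge-page hint, assuming Transparent Huge Pages are enabled in "madvise" mode on your kernel: map an anonymous region, ask for huge pages, and touch every page before any timed measurement. Error handling is deliberately thin; the 1 GiB size is only an example.
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t len = 1UL << 30;                        /* 1 GiB working set */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    if (madvise(buf, len, MADV_HUGEPAGE) != 0)     /* hint: back with 2 MB pages */
        perror("madvise(MADV_HUGEPAGE)");          /* not fatal; fall back to 4 KB */

    /* Touch every page so allocation (and first-touch placement) happens
       now, not inside the timed region of a benchmark. */
    for (size_t i = 0; i < len; i += 4096)
        ((volatile char *)buf)[i] = 0;
    munmap(buf, len);
    return 0;
}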
Finally, remember that the memory hierarchy is not just a hardware story. The OS influences it through page placement, NUMA balancing, and scheduling decisions. If a thread migrates to another CPU, its cache working set becomes cold and must be rebuilt. If memory is allocated on one node but the thread runs on another, you pay the remote latency cost every time. This is why your projects always pair measurement with placement control.
A subtle factor is write buffering and store forwarding: a store can retire before data is visible to other cores, and the coherence protocol later propagates it. This matters because microbenchmarks that mix reads and writes can observe different latencies than read-only tests. Aligning buffers and warming pages before timing avoids measuring page faults instead of steady-state memory latency.
How This Fits in the Projects
- Project 2 uses pointer-chasing to measure latency differences.
- Project 3 measures bandwidth with STREAM-like kernels.
- Project 4 shows how cache lines and coherence traffic create false sharing.
Definitions & Key Terms
- Latency: Time for a single memory access to complete.
- Bandwidth: Total data moved per unit time.
- Cache line: The unit of data transfer between cache and memory.
- Working set: The active set of memory a program touches repeatedly.
- MLP: Memory-level parallelism; multiple outstanding misses.
- TLB: Cache of virtual-to-physical translations.
Mental Model Diagram
CPU Core
|
v
L1 (very fast) -> L2 -> L3 (shared)
| |
+------> Memory Controller ------> DRAM
How It Works (Step-by-Step)
- The CPU issues a load.
- L1 cache is checked; if hit, latency is minimal.
- On miss, L2 and L3 caches are checked.
- If all caches miss, the request goes to the memory controller.
- DRAM access completes and fills caches with the cache line.
- The CPU resumes once the data arrives.
Minimal Concrete Example
// Pointer-chasing loop to measure latency
volatile uint64_t *p = buffer;
uint64_t sum = 0;
for (size_t i = 0; i < iterations; i++) {
p = (volatile uint64_t *)*p;
sum += (uint64_t)p;
}
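The loop above assumes buffer already forms a pointer chain. A minimal sketch of one way to build it (build_chain is an illustrative name): link the elements into a randomly shuffled cycle so the hardware prefetcher cannot predict the next address, which is exactly what a latency benchmark needs.
#include <stdint.h>
#include <stdlib.h>

// Turn buffer[0..n-1] into one random cycle: element i stores the address
// of the next element to visit.
static void build_chain(uint64_t *buffer, size_t n) {
    size_t *order = malloc(n * sizeof *order);
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {           /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (size_t i = 0; i < n; i++)                 /* link the cycle */
        buffer[order[i]] = (uint64_t)(uintptr_t)&buffer[order[(i + 1) % n]];
    free(order);
}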
Common Misconceptions
- “High bandwidth means low latency.” (Not true; they are different metrics.)
- “A cache miss is always bad.” (Some workloads are bandwidth-bound and tolerate misses.)
- “If I use a fast CPU, memory won’t matter.” (Memory stalls dominate many workloads.)
Check-Your-Understanding Questions
- Why can pointer-chasing loops be slower than streaming loops?
- What is the difference between latency and bandwidth?
- How does cache-line size affect spatial locality?
Check-Your-Understanding Answers
- Pointer chasing creates dependent loads, which prevents MLP and exposes raw latency.
- Latency is per-access delay; bandwidth is throughput of many accesses.
- Larger cache lines pull in more adjacent data, which helps if you use it and hurts if you don’t.
Real-World Applications
- Database buffer pools and in-memory analytics
- HPC stencil codes and linear algebra
- High-frequency trading systems with tight latency budgets
Where You Will Apply It
- Project 2: Memory Latency Microbenchmark
- Project 3: Memory Bandwidth Benchmark
- Project 4: False Sharing Detector
References
- https://gist.github.com/jboner/2841832 (Latency numbers)
- https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html (Intel MLC)
- https://www.cs.virginia.edu/~mccalpin/papers/bandwidth/node2.html (STREAM benchmark description)
Key Insights
Performance is limited by where data is, not just how fast the CPU is.
Summary
You now have the baseline model: caches hide latency when locality is good, and memory bandwidth becomes the limiter when working sets are large. This mental model will be used in every project.
Homework/Exercises to Practice the Concept
- Write a microbenchmark that measures sequential vs random access latency.
- Increase array size until it no longer fits in L3 and observe the latency jump.
- Measure the effect of huge pages on TLB misses using perf stat.
Solutions to the Homework/Exercises
- Use a pointer-chasing loop for random access and a for-loop for sequential access.
- Double the array size each run and plot cycles per access; the step-up indicates a cache boundary.
- Run with and without madvise huge pages or transparent_hugepage and compare dTLB-load-misses.
Chapter 2: UMA vs NUMA Architecture and Topology
Fundamentals
UMA systems provide uniform access time to all memory, which makes programming simple but limits scalability. NUMA systems split memory into nodes, each local to a set of CPUs. Access to local memory is faster; access to remote memory is slower because it crosses an interconnect. NUMA lets machines scale to more cores by spreading memory controllers across sockets, but the tradeoff is non-uniform latency. This means that two threads accessing the same data can see different performance depending on where they run. Understanding node topology, distance matrices, and how the OS exposes NUMA nodes is the basis for every optimization you will perform.
Deep Dive into the Concept
In UMA, the memory subsystem appears as a single, uniform pool. A shared bus or crossbar connects CPUs to memory controllers, and access latency is roughly the same regardless of which CPU issues the request. This simplicity is attractive, but the bus becomes a bottleneck as core counts rise. NUMA is the answer: each socket (or chiplet group) has its own memory controller and attached DRAM, forming a node. The system still presents a single shared address space, but the physical placement of pages determines how long access takes.
NUMA systems are often cache-coherent (ccNUMA), meaning the hardware maintains a coherent view of memory across all caches. That makes the programming model similar to UMA, but the cost of coherence traffic rises with distance. When one core writes to a cache line that another core on a different node has, the interconnect must carry invalidation or update traffic. This is why NUMA performance depends on locality and why false sharing is more painful on multi-socket machines.
The OS exposes NUMA topology in multiple ways. On Linux, lscpu reports the number of NUMA nodes and which CPUs belong to each node. The numactl --hardware command prints node sizes and a distance matrix. Under the hood, sysfs exposes this data in /sys/devices/system/node/, including node*/cpulist, node*/meminfo, and node*/distance. Libraries like libnuma wrap this into a programmatic API, giving you functions such as numa_max_node, numa_node_to_cpus, and numa_distance. These tools are critical for discovering the topology you will optimize.
NUMA topology is not always as simple as “socket = node”. Some systems split a single socket into multiple nodes for latency reasons. Others expose memory-only nodes (for example, persistent memory tiers). This is why measuring distance and inspecting topology is step one in any NUMA-aware design. The distance matrix is not a physical latency in nanoseconds; it is a relative cost metric that the kernel uses to pick “nearest” nodes for allocations and migrations. The exact values and ratios vary by platform.
When a thread accesses memory, the OS and hardware must decide whether the access is local or remote. If the page is local, the memory controller on that node services the request. If remote, the request is forwarded over the interconnect to the remote node’s memory controller, adding delay and consuming interconnect bandwidth. In multiprocessor systems, this interconnect becomes a shared resource, so remote access can cause contention even if the remote memory is idle.
The practical consequence: you must design data placement and thread placement together. If you schedule threads on node 1 but allocate pages on node 0, you have created a remote memory workload. If you interleave memory across nodes, you balance bandwidth but increase average latency. These tradeoffs are not theoretical; they will be visible in your benchmark results.
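A minimal sketch of the co-location pattern using libnuma (numa_run_on_node, numa_alloc_onnode), assuming node 0 exists and linking with -lnuma: the thread is restricted to node 0's CPUs and its buffer is placed on node 0's memory, so accesses stay local.
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

    int node = 0;                                  /* assumption: node 0 exists */
    numa_run_on_node(node);                        /* run only on node 0's CPUs */
    size_t size = 64UL << 20;                      /* 64 MiB example buffer */
    char *buf = numa_alloc_onnode(size, node);     /* pages placed on node 0 */
    if (!buf) { perror("numa_alloc_onnode"); return 1; }

    for (size_t i = 0; i < size; i += 4096)
        buf[i] = 1;                                /* fault pages in locally */
    numa_free(buf, size);
    return 0;
}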
The kernel uses the distance matrix as a heuristic, not a physics constant. Two nodes with distance 20 are “closer” than nodes with distance 30, but the actual nanosecond delta depends on load, interconnect traffic, and frequency scaling. Virtualized environments may compress or hide distances entirely, which is why you should always check what the guest exposes rather than assuming bare-metal behavior.
How This Fits in the Projects
- Project 1 discovers the topology and distance matrix.
- Project 2 measures local vs remote latency.
- Project 5 and 6 make placement decisions based on node topology.
Definitions & Key Terms
- UMA: Uniform access time to all memory locations.
- NUMA: Access time depends on memory location relative to CPU.
- NUMA node: A set of CPUs with local access to a subset of memory.
- Distance matrix: Relative cost of node-to-node access.
- ccNUMA: Cache-coherent NUMA, where caches remain coherent across nodes.
Mental Model Diagram
Node 0 (CPUs 0-7) Node 1 (CPUs 8-15)
| Local DRAM | | Local DRAM |
+-----+------+ +-----+------+
| |
+------ Interconnect --------+
(remote access path)
How It Works (Step-by-Step)
- The CPU issues a load to a physical address.
- The hardware determines which node owns the memory page.
- If local, the request goes to the local memory controller.
- If remote, the request travels over the interconnect to the remote node.
- The remote controller services the request and returns data.
Minimal Concrete Example
# Show NUMA topology
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32768 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node distances:
node 0 1
0: 10 21
1: 21 10
Common Misconceptions
- “NUMA only matters on 4+ sockets.” (Even 2-socket systems show measurable differences.)
- “Interleaving always improves performance.” (It improves bandwidth but can increase latency.)
- “NUMA is the same as a cluster.” (NUMA is a single shared address space.)
Check-Your-Understanding Questions
- Why does NUMA improve scalability compared to UMA?
- What does a distance matrix represent?
- Why is interleaving not always the best policy?
Check-Your-Understanding Answers
- NUMA spreads memory controllers across sockets, removing a single shared-bus bottleneck.
- It is a relative cost metric of accessing memory from one node to another.
- Interleaving trades latency for bandwidth and can increase average access time.
Real-World Applications
- Multi-socket database servers
- Large in-memory analytics platforms
- Scientific simulations on shared-memory supernodes
Where You Will Apply It
- Project 1: NUMA Topology Explorer
- Project 2: Memory Latency Microbenchmark
- Project 5: NUMA-Aware Memory Allocator
- Project 10: NUMA-Aware Database Buffer Pool
References
- https://man7.org/linux/man-pages/man7/numa.7.html (NUMA overview)
- https://en.wikipedia.org/wiki/Uniform_memory_access (UMA definition)
- https://en.wikipedia.org/wiki/Non-uniform_memory_access (NUMA definition)
- https://man7.org/linux/man-pages/man8/numactl.8.html (numactl hardware and policies)
Key Insights
NUMA scales by distributing memory controllers, but it makes locality a first-order performance concern.
Summary
UMA is simple but does not scale well; NUMA scales but forces you to reason about data placement and thread placement explicitly.
Homework/Exercises to Practice the Concept
- Inspect numactl --hardware and write a short summary of your node distances.
- Pin a process to node 0 and read memory allocated on node 1; measure the slowdown.
- Compare numactl --interleave=all vs --membind=0 for a bandwidth test.
Solutions to the Homework/Exercises
- Record the distance matrix and identify which nodes are closest.
- Use numactl --cpunodebind=0 --membind=1 and compare latency to the local case.
- Interleave should increase aggregate bandwidth but may reduce single-thread latency.
Chapter 3: Cache Coherence, False Sharing, and Contention
Fundamentals
When multiple cores cache the same memory line, the system must keep those caches coherent. Coherence protocols (like MESI/MOESI) track cache line states and ensure writes are visible to other cores. This coherence traffic is cheap inside a socket but becomes expensive across NUMA nodes. False sharing happens when unrelated variables share a cache line: one core writes its variable, invalidating the line in other cores, forcing reloads even though the data is logically unrelated. This creates cache-line ping-pong and can destroy scalability. Understanding coherence states, invalidation, and cache-line granularity is essential for diagnosing performance cliffs. The cost of a single write can include remote invalidations and acknowledgments, which is why the same code can scale on UMA but collapse on NUMA.
Deep Dive into the Concept
Cache coherence exists because each core has private caches. Without coherence, two cores could see different values for the same memory location. MESI (Modified, Exclusive, Shared, Invalid) is a widely used invalidate-based protocol that tracks cache lines across cores. When a core writes a line, other caches must invalidate their copies, and when another core reads the line, a cache-to-cache transfer (or memory read) brings it back. MOESI and MESIF add extra states to optimize shared or owned lines, reducing unnecessary memory write-backs or redundant responses.
In UMA systems, coherence traffic usually stays on a shared bus or ring. In NUMA systems, coherence traffic can cross sockets, which means it travels over a slower interconnect and competes with other traffic. This is why false sharing is more expensive on NUMA: every invalidation may traverse the interconnect, and every reload may fetch remote data. The “HITM” (Hit Modified) event represents a cache-to-cache transfer where another cache holds the modified line. This is a classic signal of false sharing or heavy sharing, and tools like perf c2c are designed to surface it.
False sharing is subtle because it looks like normal sharing in your code. Two threads may update separate counters, but if those counters fall on the same cache line, they contend. Coherence doesn’t know the difference; it invalidates the whole line. The fix is often simple: align or pad the data so each thread writes to its own cache line. Another mitigation is to restructure data so writes are batched, or use per-thread/per-node accumulators with a final reduction step.
Cache coherence protocols can be snooping-based or directory-based. Snooping protocols broadcast coherence messages to all caches, which works for small systems but scales poorly. Directory-based protocols track which caches hold each line, allowing targeted invalidations or updates. ccNUMA systems often rely on directory mechanisms to scale beyond a single socket. The cost of tracking directory state is real (metadata overhead), but it avoids the broadcast storm of snooping.
When you measure performance, coherence shows up as increased latency, reduced bandwidth, and high HITM counts. The perf c2c tool can attribute cacheline contention to specific addresses and show which CPUs and nodes are involved. This makes false sharing visible and gives you a concrete path to fix it. In later projects, you will intentionally create false sharing to observe its cost, then eliminate it by padding and partitioning.
Coherence also interacts with the memory consistency model. Atomic operations and fences often force stronger ordering, which can increase traffic or stall pipelines while ownership of a cache line is negotiated. On NUMA, the cost of acquiring exclusive ownership for a line can include cross-socket invalidations and acknowledgments, which is why highly contended locks scale poorly. A useful mental model is that every write is a distributed coordination event for that cache line.
Even read-mostly workloads can suffer if a single writer repeatedly flips the line between Shared and Modified states. In practice, this means you should design for write isolation, not just read scalability. Cache-line granularity is the unit of coherence, so align data structures to that reality.
How This Fits in the Projects
- Project 4 detects false sharing using perf c2c.
- Project 6 uses padding and partitioning to avoid coherence hotspots.
- Project 8 simulates MESI state transitions to make coherence visible.
Definitions & Key Terms
- Cache coherence: Mechanisms that keep cached copies consistent.
- MESI/MOESI: Common cache coherence protocols and states.
- HITM: Hit Modified; cache-to-cache transfer of a modified line.
- False sharing: Unrelated variables on the same cache line.
- Directory protocol: Tracks which caches hold a line to avoid broadcasts.
Mental Model Diagram
Core A Cache (line X) <----> Core B Cache (line X)
| write invalidates |
+---------- Interconnect --------+
How It Works (Step-by-Step)
- Core A writes to a cache line, moving it to Modified state.
- Coherence invalidates other copies of the line.
- Core B reads the line and triggers a cache-to-cache transfer (HITM).
- The line becomes Shared or Owned, depending on protocol.
- Repeat writes cause ping-pong across cores/nodes.
Minimal Concrete Example
// False sharing: two counters on one cache line
struct Counters {
long a;
long b; // likely same cache line as a
};
// Fix: give each counter its own 64-byte cache line
struct AlignedCounters {
    _Alignas(64) long a;   // line 0
    _Alignas(64) long b;   // line 1; sizeof == 128, safe even in arrays
};
Common Misconceptions
- “If variables are different, they can’t contend.” (They can if on same line.)
- “Cache coherence is free.” (It can dominate runtime under contention.)
- “Locks are the only cause of contention.” (False sharing can be worse.)
Check-Your-Understanding Questions
- What is false sharing and why does it happen?
- Why is HITM a red flag in NUMA profiling?
- How does padding reduce coherence traffic?
Check-Your-Understanding Answers
- False sharing occurs when unrelated variables share a cache line and writes invalidate each other.
- HITM means a cache-to-cache transfer of a modified line, often caused by sharing or false sharing.
- Padding isolates variables into separate cache lines, preventing unnecessary invalidations.
Real-World Applications
- High-throughput counters and metrics systems
- Parallel reductions and per-thread accumulators
- Queue and hash table designs in concurrent systems
Where You Will Apply It
- Project 4: False Sharing Detector
- Project 6: NUMA-Aware Data Structure Library
- Project 8: Cache Coherence Simulator
References
- https://en.wikipedia.org/wiki/MESI_protocol (MESI overview)
- https://en.wikipedia.org/wiki/MOESI_protocol (MOESI overview)
- https://en.wikipedia.org/wiki/False_sharing (False sharing definition)
- https://en.wikipedia.org/wiki/Directory-based_cache_coherence (Directory-based coherence)
- https://man7.org/linux/man-pages/man1/perf-c2c.1.html (perf c2c)
Key Insights
Coherence traffic is the hidden tax on parallelism, and false sharing makes it explode.
Summary
Cache coherence makes shared memory possible, but it can sabotage scalability when lines bounce across cores and nodes.
Homework/Exercises to Practice the Concept
- Write a microbenchmark with two counters and observe the performance difference with padding.
- Use perf c2c on a tight loop and find the hottest cache line.
- Explain why a single shared counter becomes a scalability bottleneck.
Solutions to the Homework/Exercises
- Align each counter to 64 bytes and compare throughput before and after.
- Run perf c2c record and perf c2c report to identify HITM-heavy lines.
- Each increment causes invalidations across cores; per-thread counters avoid this.
Chapter 4: OS Exposure and Topology Discovery
Fundamentals
The operating system exposes NUMA topology so that applications and tools can reason about locality. On Linux, lscpu shows the number of NUMA nodes and CPU mappings, numactl --hardware prints node sizes and distance matrices, and hwloc can visualize the topology as a tree. Under the hood, sysfs exposes node directories in /sys/devices/system/node/, and libnuma provides a C API that reads this information and enforces memory policies. Understanding these interfaces is the foundation for building topology-aware tools and for validating that your memory placement is actually happening. Topology discovery is not just curiosity; it is a prerequisite for any allocator or scheduler decision you will make later in this guide.
Deep Dive into the Concept
Linux exposes NUMA topology through multiple layers. The simplest human-readable view is lscpu, which parses sysfs and /proc/cpuinfo to show sockets, cores, threads, caches, and NUMA nodes. numactl --hardware gives you the node list, per-node memory sizes, and a distance matrix. This is often the first tool you should run on a new system.
Sysfs is the authoritative source of topology. Each node appears as /sys/devices/system/node/nodeN/ and contains files like cpulist, meminfo, and distance. These can be parsed directly, which is what you will do in Project 1. Libnuma provides a friendlier API on top of these, including numa_available(), numa_max_node(), numa_node_to_cpus(), and numa_distance(). It also exposes helpers for allocating memory on specific nodes and for binding threads to nodes. The libnuma documentation notes that a NUMA node is defined as a memory region with uniform access speed from a CPU, and that the library focuses on memory nodes rather than cache levels.
Topology can be more complex than you expect. Some nodes have CPUs but no memory. Some systems expose multiple nodes per socket (for example, when a socket is split into multiple NUMA domains). Memory-only nodes can exist when using memory tiering or persistent memory. This is why you should not assume a fixed mapping such as “node = socket”. Tools like hwloc provide a richer graph that includes caches, NUMA nodes, and I/O devices, which can be useful when you want to place threads near storage or NICs.
Another important layer is the process’s allowed CPUs and nodes. The current cpuset and cgroup configuration can restrict which nodes a process may allocate from. Libnuma reads /proc/self/status to determine Mems_allowed and Cpus_allowed, and it will respect those restrictions. This is critical in containerized environments or HPC schedulers where your job is isolated to a subset of the machine.
By the end of this chapter, you should be able to answer: How many NUMA nodes does this machine have? Which CPUs belong to each node? What is the relative cost of accessing each node? And does my process have access to all nodes or a restricted subset?
Beyond discovery, you must validate placement at runtime. Tools like numastat -p <pid> summarize where a process’s pages reside, while /proc/<pid>/numa_maps gives a detailed per-range view. You can combine these with your own sysfs parsing to confirm that allocations are landing on the intended nodes. This validation step is critical because cpusets, containers, or job schedulers may silently restrict your allowed nodes.
hwloc provides a topology tree that includes cache levels and NUMA nodes, and it can export XML for programmatic use. This is useful when you want to persist topology snapshots or compare machines. For CPU-to-node mapping in code, you can also read /sys/devices/system/cpu/cpu*/node*/ or use numa_node_of_cpu to derive the node directly.
For programmatic validation, numa_node_size64 can report per-node memory sizes, and numa_distance gives you the distance matrix used by the kernel’s placement heuristics. Use these APIs to cross-check sysfs parsing.
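A minimal sketch of such a cross-check, printing per-node sizes and the distance matrix with libnuma (link with -lnuma); compare its output against numactl --hardware.
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    int max = numa_max_node();

    for (int n = 0; n <= max; n++) {
        long long free_b = 0;
        long long total = numa_node_size64(n, &free_b);   /* bytes */
        printf("node %d: %lld MB total, %lld MB free\n",
               n, total >> 20, free_b >> 20);
    }
    printf("distances:\n");
    for (int a = 0; a <= max; a++) {
        for (int b = 0; b <= max; b++)
            printf("%4d", numa_distance(a, b));           /* relative cost */
        printf("\n");
    }
    return 0;
}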
How This Fits in the Projects
- Project 1 uses sysfs and libnuma to build a topology explorer.
- Project 5 uses libnuma to allocate memory on specific nodes.
- Project 9 uses topology to build a NUMA-aware thread pool.
Definitions & Key Terms
- sysfs: Kernel-exposed filesystem with topology info.
- libnuma: User-space library for NUMA policies and queries.
- cpuset: Kernel mechanism that restricts CPU and memory nodes.
- distance: Relative NUMA cost between nodes.
- hwloc: Portable topology discovery library.
Mental Model Diagram
/sys/devices/system/node/
node0/ -> cpulist, meminfo, distance
node1/ -> cpulist, meminfo, distance
How It Works (Step-by-Step)
- The kernel enumerates nodes and CPUs at boot.
- Sysfs exposes node directories and metadata.
- Tools like lscpu and numactl read sysfs.
- libnuma wraps sysfs into a programmatic API.
- Your application queries libnuma to discover topology.
Minimal Concrete Example
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        printf("NUMA not available\n");
        return 1;
    }
    int max = numa_max_node();
    printf("Max node: %d\n", max);
    return 0;
}
Common Misconceptions
- “NUMA topology is always static.” (It can change with hotplug or cpusets.)
- “Every node has memory and CPUs.” (Some nodes are memory-only.)
- “numactl shows absolute latency.” (It shows relative distances.)
Check-Your-Understanding Questions
- Where does lscpu get its NUMA information?
- What does numa_distance(a, b) represent?
- How can cpusets affect your NUMA allocations?
Check-Your-Understanding Answers
- It reads sysfs and /proc to gather topology and NUMA node mappings.
- It is a relative cost metric used by the kernel for placement decisions.
- Cpusets restrict which nodes your process is allowed to allocate from.
Real-World Applications
- HPC schedulers that pin jobs to specific sockets
- Container runtimes enforcing resource isolation
- Performance tools that visualize system topology
Where You Will Apply It
- Project 1: NUMA Topology Explorer
- Project 5: NUMA-Aware Memory Allocator
- Project 9: NUMA-Aware Thread Pool
References
- https://man7.org/linux/man-pages/man1/lscpu.1.html (lscpu NUMA reporting)
- https://man7.org/linux/man-pages/man3/numa.3.html (libnuma API)
- https://man7.org/linux/man-pages/man8/numactl.8.html (numactl hardware view)
- https://www.open-mpi.org/projects/hwloc/doc/v2.10.0/a00339.php (hwloc overview)
Key Insights
You cannot optimize NUMA without first discovering the topology you are actually running on.
Summary
Linux exposes NUMA topology through sysfs, tools, and libnuma. Master those interfaces before writing any NUMA-aware code.
Homework/Exercises to Practice the Concept
- Write a script that parses /sys/devices/system/node/node*/cpulist.
- Compare lscpu and numactl --hardware outputs for consistency.
- Use hwloc-ls or lstopo to visualize your topology.
Solutions to the Homework/Exercises
- Read each cpulist file and map CPU ranges to nodes.
- The node lists and CPU mappings should match; if not, check cpuset restrictions.
- The diagram should show NUMA nodes, caches, and cores in a hierarchy.
Chapter 5: Memory Placement Policies and Page Migration
Fundamentals
NUMA performance depends on where pages are allocated. Linux provides memory policies that control which node a page should come from. The default behavior is local allocation (often called first-touch), meaning the page is allocated on the node of the CPU that first faults it in. You can override this with policies such as bind (restrict to specific nodes), preferred (try a node but allow fallback), or interleave (spread pages across nodes for bandwidth). Policies can be set per-thread with set_mempolicy or per-memory range with mbind. Pages can also be migrated after allocation using move_pages or migrate_pages. Automatic NUMA balancing can move pages based on observed access patterns. Understanding these policies lets you shape locality instead of hoping it happens by accident.
Deep Dive into the Concept
Linux implements NUMA policies as a combination of mode and node mask. The kernel documentation defines primary modes such as default, bind, preferred, interleave, and local allocation. The policy can be applied at different scopes: per-thread (default policy) or per-VMA (range of virtual memory). The set_mempolicy() system call sets the thread’s default policy, while mbind() sets a policy for a range of addresses. Policies only take effect when pages are actually allocated, which for anonymous memory usually happens at first touch. This is why initialization order matters: if a single thread zeroes a large buffer, all pages land on that thread’s node.
The details matter. MPOL_BIND forces allocations onto specific nodes; if they run out of memory, allocation fails (or falls back depending on flags and kernel version). MPOL_PREFERRED chooses one node but allows fallback to others when memory is scarce. MPOL_INTERLEAVE spreads pages across nodes, which can maximize bandwidth for streaming workloads but increases the average latency for single-threaded access. MPOL_LOCAL explicitly chooses local allocation, which is similar to default but can be applied to a range.
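A minimal sketch of the interleave policy, assuming a machine with at least nodes 0 and 1: map a large anonymous region, apply MPOL_INTERLEAVE to the range with mbind, then fault the pages in so the policy takes effect (link with -lnuma for the syscall wrapper).
#define _GNU_SOURCE
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t len = 1UL << 30;                          /* 1 GiB example region */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned long nodemask = (1UL << 0) | (1UL << 1); /* nodes 0 and 1 */
    if (mbind(buf, len, MPOL_INTERLEAVE, &nodemask,
              sizeof(nodemask) * 8, 0) != 0)
        perror("mbind");                              /* policy applies to pages faulted later */

    for (size_t i = 0; i < len; i += 4096)            /* fault pages in now */
        ((volatile char *)buf)[i] = 0;
    munmap(buf, len);
    return 0;
}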
Page migration changes where pages live after allocation. move_pages() can query page locations or move them to target nodes. migrate_pages() can move all pages of a process from one set of nodes to another. These operations are not free: the page must be copied, and in-use pages may be busy or require writeback. The man pages list the specific error codes you will see when migration fails (e.g., EBUSY, EACCES, ENOMEM). Understanding these is essential when building a migration tool.
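A minimal sketch of the query side of move_pages (nodes = NULL), which you will expand in Project 7; report_placement is an illustrative helper, the buffer's pages must already be faulted in, and the program links with -lnuma.
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void report_placement(void *buf, size_t len) {
    long page = sysconf(_SC_PAGESIZE);
    size_t count = len / (size_t)page;
    void **pages = malloc(count * sizeof *pages);
    int *status = malloc(count * sizeof *status);

    for (size_t i = 0; i < count; i++)
        pages[i] = (char *)buf + i * (size_t)page;

    /* nodes == NULL means "query": status[i] is a node ID, or a negative
       errno such as -ENOENT for pages that were never faulted in. */
    if (move_pages(0, count, pages, NULL, status, 0) == 0) {
        for (size_t i = 0; i < count; i++)
            printf("page %zu -> node %d\n", i, status[i]);
    } else {
        perror("move_pages");
    }
    free(pages);
    free(status);
}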
Automatic NUMA balancing is a kernel feature that tries to fix bad placement by sampling memory accesses. The kernel periodically unmaps pages, causing a fault when they are next accessed. Based on the faulting CPU, the kernel can migrate the page to a more local node. This can improve performance for poorly placed workloads, but it adds overhead and is not universally beneficial. The sysctl kernel.numa_balancing controls this behavior, and the kernel documentation explicitly warns that the overhead may outweigh the benefit if the workload is already NUMA-aware.
Memory policies also interact with cpusets. Even if you request a node in your policy, the kernel may restrict allocations to the nodes allowed by the cpuset. This means that NUMA-aware code must always consider the process’s allowed node mask, not just the machine’s global topology.
The key lesson is that memory placement is a policy decision, not a hardware inevitability. With the right tools, you can force locality, spread bandwidth, or migrate pages to follow threads. Every project after Project 2 depends on understanding these tradeoffs.
Migration has hidden costs: page copies, TLB shootdowns, and potential NUMA contention while pages are locked. Large pages magnify both the benefit and the cost because moving a 2 MB page relocates more data but can also stall longer. Policies should therefore be chosen with workload stability in mind. This is why benchmarks should always report policy and placement alongside performance numbers.
How This Fits in the Projects
- Project 2 uses first-touch and explicit binding to compare latencies.
- Project 5 builds an allocator that enforces placement policies.
- Project 7 migrates pages and verifies placement changes.
Definitions & Key Terms
- First-touch: Pages allocated on the node of the first accessing CPU.
- MPOL_BIND: Restrict allocations to specific nodes.
- MPOL_INTERLEAVE: Round-robin allocations across nodes.
- MPOL_PREFERRED: Prefer a node but allow fallback.
- Page migration: Moving allocated pages between nodes.
- Automatic NUMA balancing: Kernel feature that migrates pages based on access.
Mental Model Diagram
Virtual Pages -> Policy -> Node Selection -> Physical Pages
^ | |
| | +--> Local node
| +--------------> Remote node (fallback)
+-- First touch
How It Works (Step-by-Step)
- A thread touches an unmapped page.
- The kernel consults the thread or VMA policy.
- The kernel selects a node according to policy and cpuset constraints.
- The page is allocated from that node.
- Later, move_pages or migrate_pages can relocate it.
Minimal Concrete Example
#include <numaif.h>
#include <stdio.h>

int main(void) {
    // Prefer node 1 for this thread's future allocations
    unsigned long mask = 1UL << 1;
    if (set_mempolicy(MPOL_PREFERRED, &mask, 8 * sizeof(mask)) != 0)
        perror("set_mempolicy");
    return 0;
}
Common Misconceptions
- “set_mempolicy moves existing pages.” (It affects future allocations only.)
- “Interleave always improves performance.” (It can increase latency.)
- “NUMA balancing always helps.” (It can add overhead and hurt tuned apps.)
Check-Your-Understanding Questions
- When does a page actually get allocated under Linux NUMA policies?
- What is the difference between set_mempolicy and mbind?
- Why might automatic NUMA balancing be disabled for tuned workloads?
Check-Your-Understanding Answers
- When the page is first faulted in (usually on first write for anonymous memory).
- set_mempolicy sets a thread default; mbind sets a policy on a memory range.
- The sampling faults add overhead and may conflict with explicit placement.
Real-World Applications
- NUMA-aware allocators (jemalloc arenas, per-node pools)
- HPC codes that must control placement to scale
- Database buffer pools with per-socket partitioning
Where You Will Apply It
- Project 2: Memory Latency Microbenchmark
- Project 5: NUMA-Aware Memory Allocator
- Project 7: NUMA Memory Migration Tool
References
- https://www.kernel.org/doc/html/latest/admin-guide/mm/numa_memory_policy.html (Kernel NUMA policy)
- https://man7.org/linux/man-pages/man2/set_mempolicy.2.html (set_mempolicy)
- https://man7.org/linux/man-pages/man2/mbind.2.html (mbind)
- https://man7.org/linux/man-pages/man2/move_pages.2.html (move_pages)
- https://man7.org/linux/man-pages/man2/migrate_pages.2.html (migrate_pages)
- https://www.kernel.org/doc/html/v6.7/admin-guide/sysctl/kernel.html (numa_balancing)
Key Insights
NUMA performance is largely a policy problem: where you place pages determines where you pay latency.
Summary
Linux gives you explicit control over NUMA placement. First-touch is only the beginning; real performance comes from choosing the right policy for the workload.
Homework/Exercises to Practice the Concept
- Allocate a large array and initialize it from different threads; observe placement changes.
- Use numactl --membind vs --interleave for the same program and compare results.
- Write a small tool that calls move_pages to query page locations.
Solutions to the Homework/Exercises
- Use numastat -p <pid> or /proc/<pid>/numa_maps to observe node distribution.
- Interleave should reduce node imbalance but may increase latency for single-threaded access.
- Call move_pages with nodes = NULL to query and print node IDs.
Chapter 6: Thread Placement, Scheduling, and CPU Affinity
Fundamentals
Thread placement determines which CPU cores execute your code. On NUMA systems, this directly affects which memory is local. Linux provides CPU affinity APIs (sched_setaffinity, pthread_setaffinity_np) to pin threads to CPUs, and tools like numactl and taskset to control placement at the process level. Cpusets can further restrict which CPUs and memory nodes a process is allowed to use. If a thread migrates across nodes, its cache working set becomes cold and its memory accesses may become remote. NUMA-aware performance requires controlling where threads run. Pinning affects cache warmth and memory locality, so it is a core design decision rather than a late-stage optimization tweak.
Deep Dive into the Concept
CPU affinity is a per-thread mask that restricts which CPUs a thread may execute on. sched_setaffinity() sets the mask for a given thread ID; pthread_setaffinity_np() does the same at the pthread layer. When you set affinity, the kernel migrates the thread if it is currently running on a CPU not in the mask. Threads inherit affinity from their parent, so setting the mask early avoids surprising migrations.
Affinity does not exist in isolation. Cpusets (via cgroups) can restrict which CPUs and memory nodes a process is allowed to use. The kernel intersects your requested affinity mask with the cpuset’s allowed CPUs. This means that pinning a thread to CPU 0 may silently map to a different CPU if CPU 0 is not in the cpuset. Understanding the allowed CPU set is critical, especially in containerized environments.
Thread placement interacts with memory placement. If you bind memory to node 0 but allow threads to run on node 1, every access becomes remote. If you pin threads to node 0 but allocate with interleave, you trade latency for bandwidth. The most robust designs co-locate threads and their data: allocate on the node where the thread runs, and keep the thread there.
Schedulers also make decisions based on load. If a system is oversubscribed or if you do not pin threads, the scheduler may move them between CPUs. This can improve throughput but harms locality. The right strategy depends on workload: latency-sensitive services often use strict pinning, while throughput-oriented batch jobs may allow more flexibility. You will build both approaches in the thread-pool project.
Finally, remember that affinity is not always enough. For some workloads, you want to group threads by node and build per-node queues to keep work local. This is the difference between naive pinning and NUMA-aware scheduling: the latter understands topology and uses it for both thread placement and task dispatch.
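A minimal sketch of the shape of per-node dispatch (this is not Project 9's code; MAX_NODES, QUEUE_CAP, and the helper names are illustrative): one bounded task queue per node, and each worker pops from the queue of the node it is pinned to, stealing from other nodes only when its own is empty.
#include <pthread.h>
#include <stdbool.h>

#define MAX_NODES 8
#define QUEUE_CAP 1024

typedef struct { void (*fn)(void *); void *arg; } task_t;

typedef struct {
    pthread_mutex_t lock;
    task_t items[QUEUE_CAP];
    int head, tail;
} node_queue_t;

static node_queue_t queues[MAX_NODES];

static void init_queues(int num_nodes) {
    for (int n = 0; n < num_nodes && n < MAX_NODES; n++)
        pthread_mutex_init(&queues[n].lock, NULL);
}

// A worker calls this with its own node first, then other nodes to steal.
static bool try_pop(int node, task_t *out) {
    node_queue_t *q = &queues[node];
    pthread_mutex_lock(&q->lock);
    bool ok = q->head != q->tail;
    if (ok) { *out = q->items[q->head]; q->head = (q->head + 1) % QUEUE_CAP; }
    pthread_mutex_unlock(&q->lock);
    return ok;
}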
Hyperthreads complicate placement. Two logical CPUs may share the same physical core and caches, which can be beneficial for latency when they share data but harmful when they compete for resources. A common strategy is to pin latency-sensitive threads to separate physical cores first, then use siblings for background work. lscpu --extended and hwloc can reveal these relationships, and you can build a topology-aware placement policy that prefers cores over siblings.
Linux also has NUMA-aware scheduling features that attempt to keep tasks on the same node as their memory, but they are heuristics and can be overridden by explicit affinity. This means your application should make a deliberate choice: either trust the scheduler and avoid strict pinning, or take full control and accept the responsibility to balance load yourself. In practice, NUMA-critical services often pin long-lived worker threads and run a separate, unpinned thread pool for background tasks.
Always verify placement. ps -o pid,psr,comm -p <pid> shows the last CPU a thread ran on, and taskset -cp <pid> reveals the current affinity mask. These small checks prevent hours of debugging when results look inconsistent.
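You can also do the check from inside the program. A minimal sketch, assuming libnuma is linked (log_placement is an illustrative helper): each worker logs the CPU it is currently on and the node that CPU belongs to.
#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>
#include <stdio.h>

static void log_placement(const char *tag) {
    int cpu = sched_getcpu();             /* CPU this thread is running on now */
    int node = numa_node_of_cpu(cpu);     /* NUMA node that CPU belongs to */
    printf("%s: cpu=%d node=%d\n", tag, cpu, node);
}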
How This Fits in the Projects
- Project 1 uses CPU affinity to map CPUs to nodes.
- Project 9 builds a NUMA-aware thread pool with per-node queues.
- Project 10 uses affinity to co-locate query execution with buffer pool pages.
Definitions & Key Terms
- Affinity mask: The set of CPUs a thread is allowed to run on.
- cpuset: Kernel mechanism that constrains CPUs and memory nodes.
- Pinning: Fixing a thread to a specific CPU or node.
- Migration: Moving a thread between CPUs, often harming locality.
Mental Model Diagram
Thread -> Affinity Mask -> Allowed CPUs -> Node -> Local Memory
How It Works (Step-by-Step)
- The thread starts with an inherited CPU mask.
- You call pthread_setaffinity_np or sched_setaffinity.
- The kernel applies cpuset restrictions and migrates the thread if needed.
- The thread runs on the allowed CPUs until its mask changes.
Minimal Concrete Example
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Allow the calling thread to run only on CPUs 0 and 1
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(0, &set);
CPU_SET(1, &set);
pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
Common Misconceptions
- “Pinning is always faster.” (It can reduce scheduler flexibility and load balance.)
- “Affinity guarantees memory locality.” (Only if you also control placement.)
- “Affinity is process-wide.” (It is per-thread.)
Check-Your-Understanding Questions
- What happens if you pin a thread to a CPU outside its cpuset?
- Why does thread migration hurt cache locality?
- How does per-node work stealing differ from global work stealing?
Check-Your-Understanding Answers
- The kernel silently intersects with the allowed cpuset; the thread may run elsewhere.
- The thread loses warm caches and may access remote memory after migration.
- Per-node work stealing prefers local tasks, reducing remote accesses.
Real-World Applications
- Low-latency trading systems with dedicated core pinning
- Database servers with per-socket worker pools
- HPC runtimes that bind threads to cores
Where You Will Apply It
- Project 1: NUMA Topology Explorer
- Project 9: NUMA-Aware Thread Pool
- Project 10: NUMA-Aware Database Buffer Pool
References
- https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html (sched_setaffinity)
- https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html (pthread affinity)
- https://man7.org/linux/man-pages/man7/cpuset.7.html (cpusets)
- https://man7.org/linux/man-pages/man8/numactl.8.html (numactl affinity)
Key Insights
Thread placement and data placement must be designed together or NUMA penalties will dominate.
Summary
Affinity controls where threads run, cpusets control where they are allowed to run, and NUMA-aware designs align threads with their data.
Homework/Exercises to Practice the Concept
- Pin a thread to a single CPU and measure a microbenchmark.
- Allow the thread to run on all CPUs and compare performance variance.
- Create two thread groups, one per NUMA node, and measure local vs remote access.
Solutions to the Homework/Exercises
- Use sched_setaffinity or pthread_setaffinity_np and log cycles.
- Expect higher variance due to migrations and cache warmup.
- Use numactl --cpunodebind to enforce node affinity and compare.
Chapter 7: NUMA-Aware Allocation and Data Structures
Fundamentals
NUMA-aware allocators and data structures place memory close to the threads that use it. This can be as simple as per-node arenas or as advanced as partitioned data structures that avoid cross-node sharing. Common strategies include sharding, replication for read-mostly data, interleaving for bandwidth, and per-node pools to reduce contention. NUMA-aware algorithms often trade consistency or simplicity for locality. This chapter shows how to design data structures that respect topology. Without this discipline, the best algorithm can be dominated by remote accesses and coherence traffic. The simplest pattern is to avoid cross-node writes whenever possible and merge results only at well-defined synchronization points.
Deep Dive into the Concept
The default allocator typically treats memory as uniform. On NUMA systems, this can lead to a single thread allocating memory on its node while other threads on different nodes access it remotely. NUMA-aware allocators address this by creating per-node arenas or per-thread caches. When a thread requests memory, the allocator chooses a local arena and uses numa_alloc_onnode or memory policy APIs to place pages near that thread. This reduces remote accesses and lowers coherence traffic.
Data structures can also be partitioned by NUMA node. A sharded hash table, for example, can have one shard per node. Threads on that node access the local shard, and cross-node access is minimized. For read-heavy data, replication can be more effective: each node maintains a local copy, and occasional updates propagate across nodes. For write-heavy shared structures, batching and delegation can reduce contention. SmartPQ, a NUMA-aware priority queue, demonstrates this by switching between NUMA-oblivious and NUMA-aware modes based on contention, achieving performance gains in measured workloads.
NUMA-aware design is not only about data placement. It is also about access patterns. If a data structure forces every operation to touch a global lock or a shared counter, the coherence traffic will dominate regardless of placement. Per-node counters, relaxed consistency, or hierarchical aggregation are common solutions. This is why many NUMA-aware designs use a “reduce” phase: each node accumulates locally and only occasionally combines global state.
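A minimal sketch of the "accumulate locally, combine rarely" pattern: one padded counter per node, incremented by local threads with relaxed atomics, and summed only when a total is needed. MAX_NODES and the 64-byte line size are assumptions for the sketch; link with -lnuma.
#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>
#include <stdatomic.h>

#define MAX_NODES 8

struct node_counter {
    _Alignas(64) atomic_long value;       /* one cache line per node */
};

static struct node_counter counters[MAX_NODES];

static void add_local(long delta) {
    int node = numa_node_of_cpu(sched_getcpu());
    atomic_fetch_add_explicit(&counters[node].value, delta,
                              memory_order_relaxed);
}

static long read_total(void) {            /* occasional reduction step */
    long total = 0;
    for (int n = 0; n < MAX_NODES; n++)
        total += atomic_load_explicit(&counters[n].value,
                                      memory_order_relaxed);
    return total;
}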
Research shows that NUMA-aware memory management can deliver significant speedups. The JArena paper reports up to 4.3x performance improvements on a 256-core ccNUMA node by eliminating false page sharing and providing a NUMA-aware heap. The SmartPQ paper reports 1.87x average performance improvements for priority queues in NUMA systems. These are not marginal gains; they reflect architectural realities.
The key tradeoff is complexity. NUMA-aware allocators and data structures are more complex to implement, test, and reason about. They often require topology discovery, policy selection, and careful measurement. But for performance-critical systems, the payoff is worth it.
Data layout transformations are often necessary. An array-of-structs layout may interleave fields used by different threads on the same cache line, while a struct-of-arrays layout can isolate per-thread fields and improve prefetching. On NUMA, this also affects which pages are touched by which threads during initialization, influencing first-touch placement. If you partition data by node, you should initialize it on that node to avoid remote placement from the beginning.
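A minimal sketch of parallel first-touch initialization (init_arg and first_touch_init are illustrative names, and each worker is assumed to already be pinned to its node): because the thread that first writes a page determines its placement, every worker zeroes only its own slice.
#include <pthread.h>
#include <stddef.h>
#include <string.h>

struct init_arg {
    double *base;       /* shared array */
    size_t begin, end;  /* this worker's slice, in elements */
};

// Run one of these per worker after the worker has been pinned to its node;
// the memset faults the slice's pages in on that node (first-touch).
static void *first_touch_init(void *p) {
    struct init_arg *a = p;
    memset(a->base + a->begin, 0, (a->end - a->begin) * sizeof(double));
    return NULL;
}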
Some allocators (like glibc malloc) are not NUMA-aware by default, while others (jemalloc, tcmalloc) provide arena controls that can be combined with NUMA policies. Even without a custom allocator, you can use mbind or numa_alloc_interleaved for large, shared arrays where bandwidth matters more than latency. The key is to match policy to access pattern, not apply a single rule everywhere.
Memory reclamation is also topology-sensitive. If one node frees memory that another node reuses, you can accidentally create remote hot spots. Per-node free lists or epoch-based reclamation tied to node-local threads help keep freed pages near the threads that will reuse them.
How This Fits in the Projects
- Project 5 builds a NUMA-aware allocator with per-node arenas.
- Project 6 implements NUMA-aware data structures and sharding.
- Project 10 uses these techniques in a buffer pool manager.
Definitions & Key Terms
- Arena allocator: Memory pool managed separately to reduce contention.
- Sharding: Partitioning data into independent subsets per node.
- Replication: Maintaining local copies to avoid remote reads.
- Delegation: Routing operations to a node-local owner thread.
Mental Model Diagram
Node 0: [Shard 0 | Arena 0]
Node 1: [Shard 1 | Arena 1]
Threads mostly touch their local shard.
How It Works (Step-by-Step)
- Discover node topology and thread placement.
- Create per-node arenas or shards.
- Route allocations/operations to the local node.
- Handle cross-node operations with batching or delegation.
- Aggregate results with a reduction step.
Minimal Concrete Example
// Allocate memory on the calling thread's current NUMA node
// (needs <numa.h> and <sched.h>; link with -lnuma)
int node = numa_node_of_cpu(sched_getcpu());
void *ptr = numa_alloc_onnode(size, node);
// ... use ptr, then release it with numa_free(ptr, size)
Common Misconceptions
- “NUMA-aware data structures always outperform.” (Only if locality is real.)
- “Replication is free.” (It increases memory use and update cost.)
- “Sharding removes all contention.” (Hot keys can still concentrate load.)
Check-Your-Understanding Questions
- Why do per-node arenas reduce contention?
- When is replication preferable to sharding?
- What is the tradeoff of delegation-based designs?
Check-Your-Understanding Answers
- They reduce cross-node allocation traffic and cacheline sharing.
- When reads dominate and updates are infrequent.
- Delegation can add latency but reduces coherence traffic and contention.
Real-World Applications
- High-concurrency in-memory databases
- Runtime memory allocators (jemalloc, tcmalloc)
- Priority queues and graph analytics engines
Where You Will Apply It
- Project 5: NUMA-Aware Memory Allocator
- Project 6: NUMA-Aware Data Structure Library
- Project 10: NUMA-Aware Database Buffer Pool
References
- https://man7.org/linux/man-pages/man3/numa.3.html (libnuma allocation)
- https://arxiv.org/abs/1902.07590 (JArena)
- https://arxiv.org/abs/2406.06900 (SmartPQ)
Key Insights
NUMA-aware allocation and data structures can deliver multi-x speedups when locality is preserved.
Summary
Designing for locality is as important as algorithmic complexity in NUMA systems. The right placement can outperform clever code that ignores topology.
Homework/Exercises to Practice the Concept
- Implement a per-thread allocator and compare it to malloc on a multi-core system.
- Build a sharded hash map with one shard per NUMA node.
- Measure the effect of replication for a read-heavy workload.
Solutions to the Homework/Exercises
- Use numa_alloc_onnode for each thread’s arena and measure remote access stats.
- Route operations by node ID and compare throughput to a single shared map.
- Replicate data and measure reduced remote reads but increased memory use.
Chapter 8: Measurement and Profiling for NUMA
Fundamentals
NUMA performance cannot be optimized without measurement. You need benchmarks that isolate latency (pointer chasing) and bandwidth (STREAM-like kernels) and tools that show locality and coherence traffic. On Linux, numastat shows per-node memory usage, /proc/<pid>/numa_maps reveals where pages live, perf shows hardware counters, and perf c2c highlights cacheline contention. Intel MLC provides a latency/bandwidth matrix. This chapter explains how to measure correctly and interpret the results. The goal is to produce repeatable numbers that explain why a workload is slow, not just to generate a benchmark score. Treat your benchmarks like experiments: control inputs, record conditions, and repeat enough times to understand variance.
Deep Dive into the Concept
NUMA measurement begins with topology awareness. Always record node counts, CPU mappings, and distance matrices before running benchmarks. Use numactl --cpunodebind and --membind to force locality, and then compare with interleaved or remote allocations. For latency, pointer-chasing loops are the gold standard because they prevent memory-level parallelism and reveal raw access costs. For bandwidth, STREAM-style kernels (copy, scale, sum, triad) stress the memory subsystem with large arrays that exceed cache size.
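A pointer-chasing latency kernel can be as small as the following sketch. The buffer size, the shuffle, and the use of clock_gettime (rather than rdtsc) are assumptions to adapt to your machine; run it under different numactl --cpunodebind/--membind combinations to compare local and remote latency.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64UL * 1024 * 1024 / sizeof(void *))   // ~64 MiB, larger than a typical LLC

int main(void) {
    void **chain = malloc(N * sizeof(void *));
    size_t *order = malloc(N * sizeof(size_t));
    for (size_t i = 0; i < N; i++) order[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {           // Fisher-Yates shuffle -> random walk
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < N; i++)                 // link elements in shuffled order
        chain[order[i]] = &chain[order[(i + 1) % N]];

    struct timespec t0, t1;
    void **p = &chain[order[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)                 // dependent loads: no memory-level parallelism
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg load latency: %.1f ns (sink %p)\n", ns / N, (void *)p);  // use p so the loop survives optimization
    free(chain); free(order);
    return 0;
}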
Measurement requires control. Disable CPU frequency scaling if possible, pin threads to CPUs, and ensure you are not competing with other workloads. Use large enough data sizes to avoid cache effects unless you are explicitly measuring caches. Record multiple runs and report distributions, not just averages.
NUMA-specific tools add critical context. numastat reads /proc/*/numa_maps and /sys/devices/system/node/node*/numastat to show memory distribution and access patterns. perf c2c records cache-to-cache transfers and identifies hot cache lines with HITM events, which often signal false sharing. perf stat can show LLC misses, remote accesses (on some platforms), and TLB behavior.
Intel MLC provides a latency and bandwidth matrix across sockets, which makes local vs remote differences explicit. It can also measure cache-to-cache transfer latency. STREAM defines standardized kernels for measuring sustained bandwidth, and its design intentionally avoids data reuse. These tools help you answer: Is my workload latency-bound or bandwidth-bound? Are remote accesses dominating? Is coherence traffic causing contention?
Finally, measurement is not just observation; it is validation. Every optimization you attempt in later projects must be measured against a baseline. NUMA-aware code is only successful if it produces measurable improvements in latency, bandwidth, or scalability.
A good benchmark separates allocation, initialization, and timing. If you include allocation in your timed region, you’re mostly measuring page faults and zeroing costs, not steady-state access. Warm up the data to ensure pages are mapped, then start timing. If you want to measure first-touch effects, time the initialization separately and record which thread touched each page. For bandwidth tests, use non-temporal stores if you want to bypass caches, and clearly document whether caches are part of the measurement.
Verification should be explicit. After a run, inspect /proc/<pid>/numa_maps and numastat -p <pid> to confirm placement. For fine-grained checks, move_pages can return the node for individual pages, letting you sample and verify distributions. This is especially important in mixed workloads where some arrays are local and others are intentionally interleaved.
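As a sketch of that fine-grained check, numa_move_pages can be called with a NULL node list to report where each page of a buffer currently resides. Buffer and error handling here are deliberately simplified.
#include <numa.h>              // link with -lnuma
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

// Print the NUMA node of every page in [buf, buf + bytes).
void print_page_nodes(void *buf, size_t bytes) {
    long page = sysconf(_SC_PAGESIZE);
    unsigned long count = bytes / page;
    void **pages = malloc(count * sizeof *pages);
    int *status = malloc(count * sizeof *status);
    for (unsigned long i = 0; i < count; i++)
        pages[i] = (char *)buf + i * page;
    // nodes == NULL means "query": status[i] receives the node of page i,
    // or a negative errno (e.g. -ENOENT for pages that were never touched).
    if (numa_move_pages(0 /* self */, count, pages, NULL, status, 0) < 0)
        perror("numa_move_pages");
    else
        for (unsigned long i = 0; i < count; i++)
            printf("page %lu -> node %d\n", i, status[i]);
    free(pages);
    free(status);
}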
Counter-based profiling requires careful event selection. perf stat can track LLC misses, cache references, and context switches; perf c2c focuses on HITM traffic. On some systems, uncore events can show remote memory accesses. Always record the CPU frequency and system load so you can interpret results and reproduce them later. Measurement is a controlled experiment, not a one-off script.
Report results with context. Include the CPU model, NUMA topology, memory policy, thread count, and data size in every output. Without this metadata, comparisons across machines or over time are meaningless.
How This Fits in the Projects
- Project 2 and 3 are measurement projects by design.
- Project 4 uses perf c2c to quantify false sharing.
- Project 7 validates page migration.
Definitions & Key Terms
- Baseline: A reference measurement before optimization.
- HITM: Cache-to-cache transfer of a modified line.
- numa_maps: Per-process mapping of pages to nodes.
- STREAM: Standard bandwidth benchmark.
- MLC: Intel Memory Latency Checker.
Mental Model Diagram
Benchmark -> Measurements -> Interpretation -> Optimization
How It Works (Step-by-Step)
- Record topology (lscpu, numactl --hardware).
- Pin threads and choose memory policy.
- Run latency and bandwidth benchmarks.
- Inspect numastat and perf counters.
- Compare to baseline and iterate.
Minimal Concrete Example
# Measure remote vs local latency with numactl
$ numactl --cpunodebind=0 --membind=0 ./latency_bench
$ numactl --cpunodebind=0 --membind=1 ./latency_bench
Common Misconceptions
- “One run is enough.” (NUMA results vary; always repeat.)
- “If latency improved, bandwidth must too.” (They trade off.)
- “perf c2c always works.” (Requires hardware support and correct events.)
Check-Your-Understanding Questions
- Why does pointer-chasing measure latency effectively?
- What does numastat tell you that perf doesn’t?
- Why is interleaving useful for bandwidth tests?
Check-Your-Understanding Answers
- It creates dependent loads, preventing MLP and exposing raw latency.
- numastat shows per-node memory placement and access counts.
- Interleaving spreads pages across nodes to use more memory channels.
Real-World Applications
- Performance regression testing in data centers
- Capacity planning for in-memory databases
- Benchmarking HPC nodes for workload placement
Where You Will Apply It
- Project 2: Memory Latency Microbenchmark
- Project 3: Memory Bandwidth Benchmark
- Project 4: False Sharing Detector
- Project 7: NUMA Memory Migration Tool
References
- https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html (Intel MLC)
- https://www.cs.virginia.edu/~mccalpin/papers/bandwidth/node2.html (STREAM benchmark)
- https://man7.org/linux/man-pages/man8/numastat.8.html (numastat)
- https://man7.org/linux/man-pages/man1/perf-c2c.1.html (perf c2c)
Key Insights
NUMA optimization is measurement-driven; without evidence, you are guessing.
Summary
You now know the tools and methodology for measuring locality, latency, bandwidth, and coherence. You will use these in every project.
Homework/Exercises to Practice the Concept
- Measure local and remote latency with a pointer-chasing benchmark.
- Run STREAM-like kernels under --membind and --interleave.
- Use perf c2c to identify a false-sharing hotspot.
Solutions to the Homework/Exercises
- Use numactl --cpunodebind and --membind to force locality.
- Interleave should increase bandwidth by using multiple nodes.
- The cache line with highest HITM count is the likely hotspot.
Glossary (High-Signal)
- NUMA node: A set of CPUs with faster access to a subset of memory.
- UMA: Uniform memory access; all memory latency is equal.
- ccNUMA: Cache-coherent NUMA with hardware coherence across nodes.
- Distance matrix: Relative access cost between nodes.
- First-touch: Allocation policy based on the first faulting CPU.
- Affinity: Restricting a thread to specific CPUs.
- Interleave: Spread pages across nodes for bandwidth.
- False sharing: Unrelated data sharing a cache line, causing contention.
- HITM: Cache-to-cache transfer of a modified line.
Why NUMA vs UMA Matters
The Modern Problem It Solves
Modern CPUs have far more cores than a single memory controller can feed. NUMA solves this by distributing memory controllers across sockets, but it introduces non-uniform access times. In practice this means two machines with the same CPU count can have dramatically different performance depending on memory placement and thread scheduling.
Real-world impact (examples with sources and year):
- NUMA-aware memory management can deliver multi-x speedups: JArena reports up to 4.3x improvement on a 256-core ccNUMA node (2019). (Source: JArena paper)
- NUMA-aware data structures can be ~2x faster: SmartPQ reports 1.87x average improvement over a NUMA-oblivious queue (2024). (Source: SmartPQ paper)
- Latency gaps are huge even within one machine: “Latency numbers” show L1 cache ~0.5 ns vs main memory ~100 ns (2025 update), and NUMA adds an additional remote penalty on top of that. (Source: latency numbers)
- Local vs remote measurements are explicit: Intel MLC documentation reports separate local and cross-socket latency/bandwidth matrices, making locality differences visible (2025). (Source: Intel MLC)
Sources:
- https://arxiv.org/abs/1902.07590
- https://arxiv.org/abs/2406.06900
- https://gist.github.com/jboner/2841832/82fe487e68d6fcc354eac5cec5c72975b32493d0
- https://www.intel.com/content/www/us/en/download/736633/intel-memory-latency-checker-intel-mlc.html
        OLD (UMA)                          NEW (NUMA)
┌───────────────────────┐       ┌────────────┐    ┌────────────┐
│ Single memory pool    │       │   Node 0   │    │   Node 1   │
│ Uniform latency       │       │ CPUs+DRAM  │    │ CPUs+DRAM  │
└───────────┬───────────┘       └─────┬──────┘    └─────┬──────┘
            |                         |                 |
            v                         +--interconnect---+
    Scalability wall                    (remote penalty)
Context & Evolution (Optional)
As core counts grew, shared buses and uniform memory architectures became bottlenecks. NUMA evolved to scale memory bandwidth with core count, but the cost is a more complex performance model that depends on placement and locality.
Concept Summary Table
This section provides a map of the mental models you will build during these projects.
| Concept Cluster | What You Need to Internalize |
|---|---|
| Memory Hierarchy | How latency and bandwidth interact with caches and DRAM |
| UMA vs NUMA | Why topology changes access costs and scaling behavior |
| Cache Coherence | How false sharing and HITM traffic destroy scalability |
| Topology Discovery | How Linux exposes nodes, CPUs, and distances |
| Memory Policies | How placement (bind, preferred, interleave) changes performance |
| Thread Placement | How affinity and scheduling determine locality |
| NUMA-Aware Data Structures | How sharding/replication reduce remote access |
| Measurement & Profiling | How to measure, validate, and iterate |
Project-to-Concept Map
| Project | What It Builds | Primer Chapters It Uses |
|---|---|---|
| Project 1: NUMA Topology Explorer | Topology discovery tool | Ch. 2, 4 |
| Project 2: Memory Latency Microbenchmark | Local vs remote latency data | Ch. 1, 2, 5, 8 |
| Project 3: Memory Bandwidth Benchmark | Bandwidth measurements | Ch. 1, 5, 8 |
| Project 4: False Sharing Detector | Cacheline contention analysis | Ch. 1, 3, 8 |
| Project 5: NUMA-Aware Memory Allocator | Per-node arenas | Ch. 5, 6, 7 |
| Project 6: NUMA-Aware Data Structures | Sharded and replicated structures | Ch. 3, 7 |
| Project 7: NUMA Memory Migration Tool | Page migration + verification | Ch. 5, 8 |
| Project 8: Cache Coherence Simulator | MESI/MOESI simulation | Ch. 3 |
| Project 9: NUMA-Aware Thread Pool | Locality-aware scheduling | Ch. 2, 6, 7 |
| Project 10: NUMA-Aware Database Buffer Pool | Locality-aware cache engine | Ch. 1, 2, 5, 7 |
Deep Dive Reading by Concept
Fundamentals & Architecture
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Memory hierarchy | Computer Systems: A Programmer’s Perspective — Ch. 6 | Cache and memory behavior from a programmer view |
| Memory hierarchy | Computer Architecture (Hennessy & Patterson) — Ch. 2 | Formal memory hierarchy models |
| UMA/NUMA basics | Computer Organization and Design (RISC-V) — Ch. 6 | Architectural foundations for multiprocessors |
OS and Memory Policies
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Virtual memory & paging | Operating Systems: Three Easy Pieces — Ch. 13-21 | Page faults and placement mechanics |
| Scheduling & affinity | Operating Systems: Three Easy Pieces — Ch. 6-8 | Scheduling decisions and locality |
| Linux memory management | The Linux Programming Interface — Ch. 6, 49 | Practical OS interfaces for memory |
Concurrency & Coherence
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Concurrency & caches | Computer Systems: A Programmer’s Perspective — Ch. 12 | Synchronization and memory behavior |
| Concurrent data structures | The Art of Multiprocessor Programming — Ch. 3-5 | Locking, contention, and scalability |
| Multiprocessor coherence | Computer Architecture (Hennessy & Patterson) — Ch. 5-6 | Coherence protocols and scaling |
Allocators & Data Structures
| Concept | Book & Chapter | Why This Matters |
|---|---|---|
| Memory allocation | C Interfaces and Implementations — Ch. 5-6 | Allocator internals and tradeoffs |
| Performance tuning | Computer Systems: A Programmer’s Perspective — Ch. 5 | Measurement-driven optimization |
Quick Start: Your First 48 Hours
Feeling overwhelmed? Start here instead of reading everything:
Day 1 (4 hours):
- Skim “Chapter 2: UMA vs NUMA” and “Chapter 4: Topology Discovery”
- Run numactl --hardware and lscpu on your machine
- Start Project 1 and get a minimal topology report
- Ignore distance matrix parsing for now; just print nodes and CPU lists
Day 2 (4 hours):
- Add distance matrix parsing to Project 1
- Compare your output with numactl --hardware
- Read the “Core Question” section for Project 1
- Capture a screenshot of your topology output for your notes
End of Weekend: You now understand the shape of your system and can explain what a NUMA node is. That mental model will make every later project clearer.
Recommended Learning Paths
Path 1: The Systems Performance Engineer (Recommended Start)
Best for: Developers who care about latency and throughput
- Project 1 (Topology Explorer)
- Project 2 (Latency Benchmark)
- Project 3 (Bandwidth Benchmark)
- Project 4 (False Sharing Detector)
- Project 5 (NUMA-Aware Allocator)
Path 2: The Concurrency Specialist
Best for: Developers building multi-threaded systems
- Project 1
- Project 4
- Project 6
- Project 8
- Project 9
Path 3: The Database / Data Engineer
Best for: Developers building storage engines or analytics
- Project 1
- Project 2
- Project 5
- Project 9
- Project 10
Path 4: The Completionist
Best for: Those building a full NUMA performance lab
Phase 1: Foundation (Weeks 1-2)
- Project 1, 2, 3
Phase 2: Coherence & Measurement (Weeks 3-4)
- Project 4, 8
Phase 3: Placement & Structures (Weeks 5-8)
- Project 5, 6, 7
Phase 4: Integration (Weeks 9-12)
- Project 9, 10
Success Metrics
By the end of this guide, you should be able to:
- Explain and measure the latency difference between local and remote memory on your machine
- Produce a topology report that matches numactl --hardware
- Demonstrate false sharing and eliminate it with padding
- Show measurable improvements from NUMA-aware allocation (lower remote access %, higher throughput)
- Build a NUMA-aware thread pool that keeps >95% of tasks local
- Explain, with numbers, why a given workload is latency-bound or bandwidth-bound
Appendix: Linux NUMA Tooling Cheatsheet
# Topology
lscpu
numactl --hardware
lstopo
# Policies
numactl --membind=0 --cpunodebind=0 ./app
numactl --interleave=all ./app
# Observability
numastat -p <pid>
cat /proc/<pid>/numa_maps
perf c2c record -- ./app
perf c2c report
Project Overview Table
| Project | Focus | Difficulty | Time | Outcome |
|---|---|---|---|---|
| 1. NUMA Topology Explorer | Topology discovery | Beginner | Weekend | Topology report tool |
| 2. Memory Latency Microbenchmark | Latency measurement | Intermediate | 1 week | Local vs remote latency data |
| 3. Memory Bandwidth Benchmark | Bandwidth measurement | Intermediate | 1 week | STREAM-like bandwidth report |
| 4. False Sharing Detector | Coherence analysis | Intermediate | 1 week | HITM/false-sharing report |
| 5. NUMA-Aware Memory Allocator | Placement control | Advanced | 2-3 weeks | Per-node allocator |
| 6. NUMA-Aware Data Structures | Locality-aware DS | Advanced | 3-4 weeks | Sharded/replicated library |
| 7. NUMA Memory Migration Tool | Page migration | Advanced | 2 weeks | Page relocation tool |
| 8. Cache Coherence Simulator | MESI/MOESI model | Expert | 2-3 weeks | Coherence simulator |
| 9. NUMA-Aware Thread Pool | Scheduling | Advanced | 2-3 weeks | Locality-aware pool |
| 10. NUMA-Aware Database Buffer Pool | Integration | Expert | 4-6 weeks | NUMA-aware cache engine |
Project List
Project 1: NUMA Topology Explorer
- Main Programming Language: C
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: NUMA Topology / System Information
- Software or Tool: libnuma, Linux sysfs, lscpu
- Main Book: “Computer Architecture” by Hennessy & Patterson
What you’ll build: A CLI tool that discovers and prints NUMA nodes, CPU lists, per-node memory sizes, and the distance matrix. It also reports cache information and compares its output to numactl --hardware.
Why it teaches NUMA basics: You cannot optimize what you cannot see. This project forces you to learn how Linux exposes NUMA topology and how tools derive their output.
Core challenges you’ll face:
- Reading sysfs → mapping node directories and parsing CPU lists
- Using libnuma → programmatic topology queries
- Distance matrix → interpreting relative costs and presenting them clearly
- Validation → matching output with numactl --hardware
Real World Outcome
You will run your tool and get a report like this:
$ ./numa-topology
=== NUMA Topology Report ===
Nodes: 2
CPUs: 16
Node 0:
CPUs: 0-7
Memory: 32768 MB (28912 MB free)
Distance to Node 0: 10
Distance to Node 1: 21
Node 1:
CPUs: 8-15
Memory: 32768 MB (30120 MB free)
Distance to Node 0: 21
Distance to Node 1: 10
Distance Matrix:
N0 N1
N0: 10 21
N1: 21 10
Validation:
numactl --hardware: MATCH
The Core Question You’re Answering
“How do I discover and trust the real NUMA topology of a Linux machine?”
This question matters because every placement decision depends on topology. If your model of the machine is wrong, your optimizations will be too.
Concepts You Must Understand First
- NUMA nodes and distance matrices
- What does a distance value represent?
- Why is it relative rather than absolute?
- Book Reference: “Computer Architecture” — Ch. 2
- Linux sysfs topology
- Where do /sys/devices/system/node/* files come from?
- What does node*/cpulist encode?
- Book Reference: “The Linux Programming Interface” — Ch. 6
- libnuma basics
- What does numa_available() return?
- How do you map nodes to CPUs?
- Book Reference: “Linux System Programming” — Ch. 6
Questions to Guide Your Design
- Topology accuracy
- Will you trust libnuma, sysfs parsing, or both?
- How will you handle systems without NUMA?
- Parsing and representation
- How will you parse CPU ranges like “0-3,8-11”?
- How will you represent distance matrices in a readable table?
- Validation
- How will you compare your output to numactl --hardware?
- What is your fallback if the outputs differ?
Thinking Exercise
The “CPULIST Puzzle”
Given this cpulist:
0-3,8-11,16,18-19
- How many CPUs is that?
- Which CPUs are contiguous ranges?
- How would you encode this in a bitmask?
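One hedged way to approach the bitmask question is a small parser sketch. The 64-CPU limit and the plain uint64_t mask are simplifying assumptions; real tools use arbitrary-width masks such as cpu_set_t or libnuma bitmasks.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Parse a Linux cpulist such as "0-3,8-11,16,18-19" into a 64-bit mask.
uint64_t parse_cpulist(const char *s) {
    uint64_t mask = 0;
    while (*s) {
        char *end;
        long lo = strtol(s, &end, 10);
        long hi = lo;
        if (*end == '-') hi = strtol(end + 1, &end, 10);   // range like "0-3"
        for (long cpu = lo; cpu <= hi && cpu < 64; cpu++)
            mask |= 1ULL << cpu;
        s = (*end == ',') ? end + 1 : end;
    }
    return mask;
}

int main(void) {
    uint64_t m = parse_cpulist("0-3,8-11,16,18-19");
    printf("mask = 0x%llx, CPUs = %d\n",
           (unsigned long long)m, __builtin_popcountll(m));
    return 0;
}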
The Interview Questions They’ll Ask
- “How does Linux expose NUMA topology to user space?”
- “What is a NUMA distance matrix and how is it used?”
- “Why might node distance be different across platforms?”
- “How would you handle systems without NUMA support?”
- “How do cpusets affect what your tool reports?”
Hints in Layers
Hint 1: Starting Point Use libnuma to check if NUMA is available.
if (numa_available() < 0) { /* handle no NUMA */ }
Hint 2: Next Step
Use numa_max_node() and numa_node_to_cpus() to map CPUs to nodes.
Hint 3: Key Technique
Parse /sys/devices/system/node/nodeN/distance for the matrix.
Hint 4: Debugging/Verification Compare your output with:
numactl --hardware
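Putting Hint 3 into practice, a sketch that reads one distance row from sysfs might look like this. The path follows the standard Linux layout; error handling is kept minimal on purpose.
#include <stdio.h>

// Print the distance row for one node, as stored in sysfs.
void print_distance_row(int node) {
    char path[128];
    snprintf(path, sizeof path, "/sys/devices/system/node/node%d/distance", node);
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return; }
    int d;
    printf("N%d:", node);
    while (fscanf(f, "%d", &d) == 1)    // the file holds one space-separated row
        printf(" %3d", d);
    printf("\n");
    fclose(f);
}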
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| NUMA basics | “Computer Architecture” by Hennessy & Patterson | Ch. 2 |
| Linux sysfs | “The Linux Programming Interface” by Kerrisk | Ch. 6 |
| CPU topology | “Linux System Programming” by Love | Ch. 6 |
Common Pitfalls & Debugging
Problem 1: “NUMA not available”
- Why: Kernel not compiled with NUMA or running in a VM without NUMA exposure
- Fix: Test on bare metal or enable NUMA in VM settings
- Quick test: lscpu | grep NUMA
Problem 2: “CPU list parsing wrong”
- Why: Not handling ranges like 0-3,8-11
- Fix: Implement robust range parsing
- Quick test: Compare count with lscpu
Problem 3: “Distances look weird”
- Why: Interpreting relative numbers as absolute latency
- Fix: Label them as relative costs
- Verification: Compare to numactl --hardware
Definition of Done
- Tool reports correct node count, CPU lists, and memory sizes
- Distance matrix matches numactl --hardware
- Handles non-NUMA systems gracefully
- Output is readable and consistent
Project 2: Memory Latency Microbenchmark
- Main Programming Language: C
- Alternative Programming Languages: Rust, Assembly
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Memory Performance / Benchmarking
- Software or Tool: rdtsc/rdtscp, libnuma, numactl
- Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron
What you’ll build: A microbenchmark that measures latency of local vs remote memory using pointer chasing. It outputs latency in cycles and nanoseconds for each node pair.
Why it teaches memory performance: Latency numbers are abstract until you measure them. This project forces you to build a reliable benchmark and validates the NUMA penalty on your machine.
Core challenges you’ll face:
- Accurate timing → using rdtsc/rdtscp correctly
- Eliminating MLP → pointer-chasing to force serialized loads
- Placement control → binding memory and threads to specific nodes
- Noise reduction → pinning threads and repeating runs
Real World Outcome
$ numactl --cpunodebind=0 --membind=0 ./latency_bench
Latency (cycles @ 3.0 GHz):
Node 0 -> Node 0: 240 cycles (80 ns)
Node 0 -> Node 1: 390 cycles (130 ns)
$ numactl --cpunodebind=0 --membind=1 ./latency_bench
Node 0 -> Node 1: 395 cycles (132 ns)
The Core Question You’re Answering
“How much slower is remote memory on my machine, and how do I measure it correctly?”
Concepts You Must Understand First
- Pointer chasing and MLP
- Why does it serialize loads?
- How does it expose latency?
- Book Reference: CS:APP Ch. 6
- NUMA policies
- How does --membind affect allocation?
- What is first-touch?
- Book Reference: OSTEP Ch. 13-21
- Timing and rdtsc
- How do you convert cycles to nanoseconds?
- How do you avoid out-of-order noise?
- Book Reference: CS:APP Ch. 5
Questions to Guide Your Design
- How will you prevent prefetching from hiding latency?
- How will you ensure the benchmark is pinned to one node?
- How will you convert cycles into time reliably?
Thinking Exercise
Latency Trap
If you measure a simple for loop over an array, why might you see bandwidth rather than latency? How does pointer chasing fix that?
The Interview Questions They’ll Ask
- “Why is pointer chasing used for latency measurement?”
- “How does rdtsc differ from clock_gettime?”
- “What is first-touch and how does it affect your benchmark?”
- “Why do you need to pin the thread?”
- “What makes latency benchmarks tricky to trust?”
Hints in Layers
Hint 1: Starting Point Allocate a large array and create a random pointer chain.
Hint 2: Next Step
Use rdtscp to serialize timing around the load.
Hint 3: Key Technique
Pin the benchmark thread with sched_setaffinity or numactl.
Hint 4: Debugging/Verification
Compare results with and without --membind to verify NUMA penalty.
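Combining the hints, a skeleton that pins the thread and brackets the measured region with rdtscp could look like the sketch below. Pinning to CPU 0 and the x86 intrinsic are assumptions; on other architectures, fall back to clock_gettime. The pointer-chase loop from Hint 1 goes in the marked spot.
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                            // pin this thread to CPU 0 (assumed)
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    unsigned aux;
    uint64_t start = __rdtscp(&aux);             // timestamp read before the measured region
    /* ... pointer-chase loop goes here ... */
    uint64_t end = __rdtscp(&aux);
    printf("elapsed cycles: %llu\n", (unsigned long long)(end - start));
    return 0;
}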
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Memory hierarchy | CS:APP | Ch. 6 |
| Performance measurement | CS:APP | Ch. 5 |
| Virtual memory | OSTEP | Ch. 13-21 |
Common Pitfalls & Debugging
Problem 1: “Latency seems too low”
- Why: Prefetching or MLP hiding latency
- Fix: Use pointer chasing and dependent loads
- Quick test: Increase stride and see latency jump
Problem 2: “Results vary wildly”
- Why: CPU frequency scaling or thread migration
- Fix: Pin the thread and disable scaling
- Quick test: Compare results across 10 runs
Problem 3: “Remote is not slower”
- Why: Memory not actually bound to remote node
- Fix: Use numactl --membind and validate with numastat
Definition of Done
- Benchmark measures stable local and remote latencies
- Local and remote results differ by roughly 20-50% on a real NUMA system
- Thread is pinned and placement validated
- Output includes cycles and nanoseconds
Project 3: Memory Bandwidth Benchmark
- Main Programming Language: C
- Alternative Programming Languages: Rust, C++
- Coolness Level: Level 3: Genuinely Useful
- Business Potential: 2. The “Internal Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Memory Bandwidth / Benchmarking
- Software or Tool: STREAM-style kernels, numactl
- Main Book: “Computer Architecture” by Hennessy & Patterson
What you’ll build: A STREAM-like benchmark that measures copy/scale/add/triad bandwidth across NUMA nodes and policies.
Why it teaches bandwidth: NUMA doesn’t just change latency; it changes how many memory channels you can use. This project shows how interleaving and placement affect bandwidth.
Core challenges you’ll face:
- Large working sets → ensuring caches are bypassed
- Policy control → binding or interleaving memory
- Thread placement → using one thread vs many threads
Real World Outcome
$ numactl --cpunodebind=0 --membind=0 ./bandwidth_bench
COPY: 38.2 GB/s
SCALE: 36.9 GB/s
SUM: 35.1 GB/s
TRIAD: 34.8 GB/s
$ numactl --cpunodebind=0 --interleave=all ./bandwidth_bench
COPY: 60.4 GB/s
SCALE: 58.7 GB/s
SUM: 56.9 GB/s
TRIAD: 56.1 GB/s
The Core Question You’re Answering
“How does memory placement change achievable bandwidth?”
Concepts You Must Understand First
- Bandwidth vs latency
- Why does interleaving improve bandwidth?
- Book Reference: CS:APP Ch. 6
- STREAM kernels
- What do copy/scale/sum/triad represent?
- Book Reference: Computer Architecture Ch. 2
- NUMA policies
- When should you interleave vs bind?
- Book Reference: OSTEP Ch. 13-21
Questions to Guide Your Design
- What array sizes ensure you are measuring DRAM, not cache?
- How will you control thread count and affinity?
- How will you report results across nodes?
Thinking Exercise
If you run a single thread on node 0 with memory interleaved across nodes, what happens to latency and bandwidth? Why?
The Interview Questions They’ll Ask
- “What is the STREAM benchmark measuring?”
- “Why do you need large arrays for bandwidth measurement?”
- “How does interleaving affect bandwidth?”
- “What does a TRIAD kernel represent?”
Hints in Layers
Hint 1: Starting Point Implement copy/scale/sum/triad loops over large arrays.
Hint 2: Next Step
Time each kernel with clock_gettime and compute GB/s.
Hint 3: Key Technique
Use numactl --interleave=all to maximize bandwidth.
Hint 4: Debugging/Verification
Compare your numbers with STREAM results from similar hardware.
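A minimal triad kernel with a GB/s calculation might look like the sketch below; the ~512 MiB per-array size and single-threaded execution are assumptions to adjust for your machine.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1UL << 26)   // 64M doubles per array (~512 MiB each), well beyond the LLC

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }   // first-touch initialization

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];                // TRIAD: a = b + scalar * c
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = 3.0 * N * sizeof(double);     // two reads and one write per element
    printf("TRIAD: %.2f GB/s (a[0]=%.1f)\n", bytes / sec / 1e9, a[0]);
    free(a); free(b); free(c);
    return 0;
}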
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Memory bandwidth | Computer Architecture | Ch. 2 |
| Performance | CS:APP | Ch. 5 |
| Virtual memory | OSTEP | Ch. 13-21 |
Common Pitfalls & Debugging
Problem 1: “Bandwidth looks too high”
- Why: Arrays fit in cache
- Fix: Increase array size beyond LLC
- Quick test: Double size and see if bandwidth drops
Problem 2: “Results vary per run”
- Why: Thread migration or CPU scaling
- Fix: Pin threads and disable scaling
- Quick test: Run 10 times and compare variance
Problem 3: “Interleave slower than bind”
- Why: Single-threaded latency penalty outweighs bandwidth gains
- Fix: Use multiple threads to saturate channels
Definition of Done
- Benchmark reports copy/scale/sum/triad GB/s
- Shows measurable difference between bind and interleave
- Uses data sizes larger than LLC
- Runs are repeatable with low variance
Project 4: False Sharing Detector
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Internal Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Cache Coherence / Performance Profiling
- Software or Tool: perf c2c, perf stat
- Main Book: “The Art of Multiprocessor Programming” by Herlihy & Shavit
What you’ll build: A microbenchmark that creates false sharing and a profiling workflow that detects it using perf c2c.
Why it teaches coherence: False sharing is one of the most common performance killers in multi-core systems, and NUMA magnifies it.
Core challenges you’ll face:
- Generating false sharing → designing a workload that contends on one cache line
- Using perf c2c → recording and interpreting HITM events
- Fixing the issue → padding or reordering data to eliminate sharing
Real World Outcome
$ ./false_sharing_bench --threads=8 --padding=0
Throughput: 120 M ops/sec
$ perf c2c record -- ./false_sharing_bench --threads=8 --padding=0
$ perf c2c report
# Hottest cache line:
# 0x7f8c12345000 HITM: 92% CPUs: 0,1,2,3,8,9,10,11
$ ./false_sharing_bench --threads=8 --padding=64
Throughput: 680 M ops/sec
The Core Question You’re Answering
“How do I detect and eliminate false sharing on real hardware?”
Concepts You Must Understand First
- Cache lines and coherence
- Why does sharing a line matter?
- Book Reference: Art of Multiprocessor Programming Ch. 3-4
- HITM events
- What does HITM mean?
- Book Reference: Computer Architecture Ch. 5
- perf c2c workflow
- How do you record and interpret the report?
- Book Reference: CS:APP Ch. 5
Questions to Guide Your Design
- How will you generate predictable false sharing?
- What counters or tools confirm it?
- How will you prove the fix works?
Thinking Exercise
If each thread updates its own counter, why might they still contend? Sketch the cache line layout.
The Interview Questions They’ll Ask
- “What is false sharing and how do you fix it?”
- “What does HITM mean in cache-coherence profiling?”
- “Why is false sharing worse on NUMA?”
- “How would you detect false sharing without perf c2c?”
Hints in Layers
Hint 1: Starting Point Put thread counters in an array and increment them in a tight loop.
Hint 2: Next Step
Use perf c2c record while running the benchmark.
Hint 3: Key Technique Pad each counter to 64 bytes and rerun.
Hint 4: Debugging/Verification Compare throughput and HITM rates before and after padding.
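A stripped-down version of the workload the hints describe is sketched below. The thread count, iteration count, and the -DPAD=1 compile-time switch are assumptions for illustration.
#include <pthread.h>           // link with -lpthread
#include <stdint.h>
#include <stdio.h>

#define THREADS 8
#define ITERS   100000000UL
#ifndef PAD
#define PAD 0                  // build with -DPAD=1 to give each counter its own line
#endif

#if PAD
struct counter { uint64_t value; char pad[56]; } __attribute__((aligned(64)));
#else
struct counter { uint64_t value; };   // adjacent counters share cache lines -> false sharing
#endif

static struct counter counters[THREADS];

static void *worker(void *arg) {
    struct counter *c = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        c->value++;                    // each thread writes only its own counter
    return NULL;
}

int main(void) {
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, &counters[i]);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter[0] = %llu\n", (unsigned long long)counters[0].value);
    return 0;
}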
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Coherence | Art of Multiprocessor Programming | Ch. 3-4 |
| Performance profiling | CS:APP | Ch. 5 |
Common Pitfalls & Debugging
Problem 1: “perf c2c shows no data”
- Why: Hardware support missing or kernel not configured
- Fix: Check perf c2c -e list and kernel version
- Quick test: Use perf stat to ensure counters work
Problem 2: “Padding did not help”
- Why: Padding size is smaller than cache line
- Fix: Align to 64 bytes or use alignas(64)
Problem 3: “No HITM events”
- Why: Workload not actually sharing lines
- Fix: Verify counters are adjacent in memory
Definition of Done
- Benchmark reliably produces false sharing
- perf c2c identifies the hot cache line
- Padding eliminates HITM-heavy line
- Throughput improvement is measurable
Project 5: NUMA-Aware Memory Allocator
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Memory Allocation / NUMA Policies
- Software or Tool: libnuma, mbind, set_mempolicy
- Main Book: “C Interfaces and Implementations” by David Hanson
What you’ll build: A per-node allocator that maintains separate arenas for each NUMA node and exposes APIs like numa_malloc(node, size).
Why it teaches NUMA placement: Allocators are where placement decisions become real. This project makes you control where memory lives rather than relying on first-touch.
Core challenges you’ll face:
- Per-node arenas → managing separate pools
- Policy control → using libnuma to place pages
- Thread-local fast paths → reducing contention
- Statistics → measuring local vs remote allocation ratio
Real World Outcome
$ ./numa_alloc_demo --threads=8
Allocator stats:
Node 0: 3.2 GB allocated, 98.1% local
Node 1: 3.1 GB allocated, 97.6% local
Remote allocations: 2.1%
$ ./numa_alloc_demo --policy=interleave
Node 0: 3.1 GB allocated, 50.2% local
Node 1: 3.2 GB allocated, 49.8% local
The Core Question You’re Answering
“How do I guarantee that memory allocations land on the node I want?”
Concepts You Must Understand First
- Memory policies (bind/preferred/interleave)
- How do policies affect allocation?
- Book Reference: OSTEP Ch. 13-21
- libnuma allocation APIs
- numa_alloc_onnode, numa_alloc_interleaved
- Book Reference: TLPI Ch. 6
- Allocator design
- Why do arenas reduce contention?
- Book Reference: C Interfaces and Implementations Ch. 5-6
Questions to Guide Your Design
- How will you map threads to arenas?
- How will you handle cross-node allocations?
- What stats will you record to prove placement?
Thinking Exercise
Design a layout where each NUMA node has its own arena and a shared fallback arena. What are the pros and cons?
The Interview Questions They’ll Ask
- “How does first-touch differ from explicit allocation on a node?”
- “Why do allocators use per-thread or per-node arenas?”
- “How would you measure allocator locality?”
- “What happens if a node runs out of memory?”
Hints in Layers
Hint 1: Starting Point
Start with a simple wrapper around numa_alloc_onnode.
Hint 2: Next Step Add per-node free lists and reuse freed blocks.
Hint 3: Key Technique Keep per-thread caches for small allocations and refill from node arena.
Hint 4: Debugging/Verification
Use numastat -p <pid> to confirm local allocation percentages.
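Putting the hints together, one possible starting point is a per-node bump arena: a large numa_alloc_onnode region per node, carved out with an atomic offset. The region size, node limit, and the absence of free() are simplifying assumptions, not the project's required design.
#define _GNU_SOURCE                // for sched_getcpu()
#include <numa.h>                  // link with -lnuma
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define ARENA_BYTES (64UL << 20)   // 64 MiB per node (assumed)
#define MAX_NODES   8

struct arena {
    char *base;
    atomic_size_t offset;
};

static struct arena arenas[MAX_NODES];

void arenas_init(void) {
    int nodes = numa_max_node() + 1;
    for (int n = 0; n < nodes && n < MAX_NODES; n++) {
        arenas[n].base = numa_alloc_onnode(ARENA_BYTES, n);
        atomic_init(&arenas[n].offset, 0);
    }
}

void *numa_malloc_local(size_t size) {
    int node = numa_node_of_cpu(sched_getcpu());
    if (arenas[node].base == NULL)
        return NULL;                                   // real allocator: fall back to another node
    size = (size + 63) & ~(size_t)63;                  // 64-byte align to avoid false sharing
    size_t off = atomic_fetch_add(&arenas[node].offset, size);
    if (off + size > ARENA_BYTES)
        return NULL;                                   // real allocator: refill the arena
    return arenas[node].base + off;
}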
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Allocation internals | C Interfaces and Implementations | Ch. 5-6 |
| Memory policies | OSTEP | Ch. 13-21 |
| Performance tuning | CS:APP | Ch. 5 |
Common Pitfalls & Debugging
Problem 1: “Allocations still remote”
- Why: Thread not pinned to the target node
- Fix: Pin thread or set policy before allocation
- Quick test: Compare with numactl --membind
Problem 2: “Performance worse than malloc”
- Why: Contention or poor free list design
- Fix: Add per-thread caches and reduce locks
Problem 3: “Out of memory on node”
- Why: Strict binding with insufficient free pages
- Fix: Use preferred policy with fallback or handle errors
Definition of Done
- Allocator can allocate from a specified node
- Per-node arenas reduce cross-node allocations
- Local allocation ratio >95% under typical workload
- Fallback handling for node exhaustion
Project 6: NUMA-Aware Data Structure Library
- Main Programming Language: C++
- Alternative Programming Languages: C, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Concurrency / Data Structures
- Software or Tool: pthreads, libnuma
- Main Book: “The Art of Multiprocessor Programming” by Herlihy & Shavit
What you’ll build: A library of NUMA-aware data structures (e.g., sharded hash map, per-node queue, replicated read cache).
Why it teaches NUMA-aware design: Data structure layout determines memory locality and coherence traffic. This project forces you to design for locality rather than naive sharing.
Core challenges you’ll face:
- Sharding → mapping keys or tasks to nodes
- Replication → maintaining consistency across nodes
- Contention management → reducing shared hotspots
Real World Outcome
$ ./ds_bench --structure=sharded_map --threads=16
Throughput: 4.2 M ops/sec
Remote reads: 3.1%
$ ./ds_bench --structure=global_map
Throughput: 1.7 M ops/sec
Remote reads: 41.6%
The Core Question You’re Answering
“How do I design data structures so that threads mostly touch local memory?”
Concepts You Must Understand First
- Cache coherence and false sharing
- Why does a global lock destroy scalability?
- Book Reference: Art of Multiprocessor Programming Ch. 3-4
- NUMA-aware allocation
- How does per-node allocation change locality?
- Book Reference: C Interfaces and Implementations Ch. 5-6
- Sharding/Replication
- When should you shard vs replicate?
- Book Reference: CS:APP Ch. 12
Questions to Guide Your Design
- How will you map keys to nodes?
- How will you handle cross-node operations?
- What metrics will show locality improvements?
Thinking Exercise
If you have a shared queue used by all nodes, how can you redesign it so that 90% of operations are local?
The Interview Questions They’ll Ask
- “What is the difference between sharding and replication?”
- “How does NUMA affect lock contention?”
- “How would you design a NUMA-aware queue?”
- “What metrics show success?”
Hints in Layers
Hint 1: Starting Point Implement a per-node shard and route operations by node ID.
Hint 2: Next Step Introduce a global aggregator for cross-node operations.
Hint 3: Key Technique Use padding and per-node counters to avoid false sharing.
Hint 4: Debugging/Verification
Track remote access percentage with numastat.
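As a sketch of Hint 1, one shard per NUMA node can be allocated on its own node and operations routed by the caller's node. The fixed-size table is a placeholder for a real hash map, and the node limit is an assumption; link with -lnuma -lpthread.
#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

#define SHARD_SLOTS 4096
#define MAX_NODES   8

struct shard {
    pthread_mutex_t lock;
    uint64_t keys[SHARD_SLOTS];
    uint64_t vals[SHARD_SLOTS];
};

static struct shard *shards[MAX_NODES];   // each shard lives on its own node

void shards_init(void) {
    for (int n = 0; n <= numa_max_node() && n < MAX_NODES; n++) {
        shards[n] = numa_alloc_onnode(sizeof(struct shard), n);
        pthread_mutex_init(&shards[n]->lock, NULL);
    }
}

void shard_put(uint64_t key, uint64_t val) {
    struct shard *s = shards[numa_node_of_cpu(sched_getcpu())];  // route to the local shard
    pthread_mutex_lock(&s->lock);
    size_t i = key % SHARD_SLOTS;         // placeholder probe; real code handles collisions
    s->keys[i] = key;
    s->vals[i] = val;
    pthread_mutex_unlock(&s->lock);
}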
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Concurrent data structures | Art of Multiprocessor Programming | Ch. 3-5 |
| Concurrency | CS:APP | Ch. 12 |
| Allocation internals | C Interfaces and Implementations | Ch. 5-6 |
Common Pitfalls & Debugging
Problem 1: “Shards unevenly loaded”
- Why: Skewed key distribution
- Fix: Use consistent hashing or rebalance
Problem 2: “Replication causes stale reads”
- Why: Update propagation lag
- Fix: Use versioning or epoch-based updates
Problem 3: “Still high remote access”
- Why: Threads not pinned to node of shard
- Fix: Pin threads and align shard placement
Definition of Done
- Library includes at least one sharded and one replicated structure
- Local access >90% under node-local workloads
- Throughput improves vs global shared structure
- False sharing eliminated in hot paths
Project 7: NUMA Memory Migration Tool
- Main Programming Language: C
- Alternative Programming Languages: Rust, Python
- Coolness Level: Level 3: Genuinely Useful
- Business Potential: 2. The “Internal Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Memory Management / Migration
- Software or Tool: move_pages, migrate_pages
- Main Book: “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau
What you’ll build: A CLI tool that inspects page locations for a process and migrates them to specific nodes, then verifies the result.
Why it teaches NUMA control: It demonstrates that placement is not static and can be corrected after the fact.
Core challenges you’ll face:
- Page location queries → using move_pages with nodes=NULL
- Verification → confirming pages moved to target nodes
Real World Outcome
$ ./numa_migrate --pid 1234 --range 1G --to-node 1
Pages examined: 262144
Moved: 248901
Busy: 1023
Failed: 121
$ ./numa_migrate --pid 1234 --check
Node 0: 4.2% pages
Node 1: 95.8% pages
The Core Question You’re Answering
“Can I repair bad page placement without restarting the process?”
Concepts You Must Understand First
- Page migration APIs
- How does move_pages work?
- Book Reference: OSTEP Ch. 13-21
- NUMA policies
- Why might pages be on the wrong node?
- Book Reference: OSTEP Ch. 13-21
- Permissions and errors
- Why does migration fail with EBUSY or EACCES?
- Book Reference: TLPI Ch. 6
Questions to Guide Your Design
- How will you iterate over pages in a memory range?
- How will you report errors and partial migration?
- How will you verify success?
Thinking Exercise
If you migrate a page but keep threads on the old node, does performance improve or worsen? Why?
The Interview Questions They’ll Ask
- “What is the difference between move_pages and migrate_pages?”
- “Why might page migration fail?”
- “How does page migration interact with memory policy?”
Hints in Layers
Hint 1: Starting Point
Call move_pages with nodes = NULL to query locations.
Hint 2: Next Step
Provide a target nodes array and retry with MPOL_MF_MOVE.
Hint 3: Key Technique
Handle -EBUSY and retry later.
Hint 4: Debugging/Verification
Use /proc/<pid>/numa_maps to validate node distribution.
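Combining Hints 2 and 3, a sketch that migrates a buffer in the current process to one target node and tallies per-page results could look like this. Flags and error handling are kept minimal; link with -lnuma.
#include <errno.h>
#include <numa.h>
#include <numaif.h>            // MPOL_MF_MOVE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int migrate_buffer(void *buf, size_t bytes, int target_node) {
    long page = sysconf(_SC_PAGESIZE);
    unsigned long count = bytes / page;
    void **pages = malloc(count * sizeof *pages);
    int *nodes = malloc(count * sizeof *nodes);
    int *status = malloc(count * sizeof *status);
    for (unsigned long i = 0; i < count; i++) {
        pages[i] = (char *)buf + i * page;
        nodes[i] = target_node;                   // desired node for every page
    }
    long rc = numa_move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE);
    if (rc < 0) perror("numa_move_pages");
    unsigned long moved = 0, busy = 0;
    for (unsigned long i = 0; i < count; i++) {
        if (status[i] == target_node) moved++;
        else if (status[i] == -EBUSY) busy++;     // page locked or under I/O; retry later
    }
    printf("moved %lu/%lu pages, %lu busy\n", moved, count, busy);
    free(pages); free(nodes); free(status);
    return rc < 0 ? -1 : 0;
}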
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Virtual memory | OSTEP | Ch. 13-21 |
| Linux syscalls | TLPI | Ch. 6 |
Common Pitfalls & Debugging
Problem 1: “EPERM on migration”
- Why: Lacking CAP_SYS_NICE for moving other processes’ pages
- Fix: Run as root or target own process
Problem 2: “Many pages EBUSY”
- Why: Pages are in I/O or locked
- Fix: Retry or ignore busy pages
Problem 3: “Pages move back”
- Why: NUMA balancing or policy still favors old node
- Fix: Adjust policy or disable automatic balancing
Definition of Done
- Tool can query page locations
- Tool migrates pages and reports success/failure
- Verification step confirms new placement
- Handles errors gracefully
Project 8: Cache Coherence Simulator
- Main Programming Language: C++
- Alternative Programming Languages: Python, Rust
- Coolness Level: Level 5: Pure Magic
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Cache Coherence / Architecture
- Software or Tool: MESI/MOESI simulator
- Main Book: “Computer Architecture” by Hennessy & Patterson
What you’ll build: A small simulator that models MESI (or MOESI) states across multiple cores and prints state transitions for a sequence of reads/writes.
Why it teaches coherence deeply: Hardware coherence can feel mystical. A simulator makes it deterministic and observable.
Core challenges you’ll face:
- State machine design → implementing MESI transitions
- Bus/directory model → handling read/write requests
- Visualization → making transitions easy to follow
Real World Outcome
$ ./mesi_sim --sequence "R1 W1 R2 W2"
Step 1: R1 -> Line state: E
Step 2: W1 -> Line state: M
Step 3: R2 -> P1:M -> S, P2: S
Step 4: W2 -> P1:I, P2:M
The Core Question You’re Answering
“How do coherence protocols actually move cache lines between cores?”
Concepts You Must Understand First
- MESI state machine
- What do M/E/S/I states mean?
- Book Reference: Computer Architecture Ch. 5
- Bus vs directory
- What is snooping?
- Book Reference: Computer Architecture Ch. 5
- False sharing
- How do writes cause invalidation storms?
- Book Reference: Art of Multiprocessor Programming Ch. 3
Questions to Guide Your Design
- Will you simulate a snooping bus or directory?
- How will you represent caches and their lines?
- How will you visualize transitions?
Thinking Exercise
Simulate two cores writing alternatingly to the same line. How many invalidations occur?
The Interview Questions They’ll Ask
- “Explain the MESI protocol in your own words.”
- “What problem does MOESI solve?”
- “Why do directory protocols scale better?”
- “How does coherence relate to NUMA?”
Hints in Layers
Hint 1: Starting Point Implement a MESI enum and a table of transitions.
Hint 2: Next Step Create a cache object that handles local reads/writes.
Hint 3: Key Technique Implement a bus or directory broadcast for invalidations.
Hint 4: Debugging/Verification Use known MESI examples from textbooks and compare output.
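A hedged sketch of the state machine at the heart of the simulator: given a line's current state in one cache and an event, return the next state. The event names and the other_sharers flag are assumptions, and this deliberately omits writebacks and bus responses.
#include <stdio.h>

typedef enum { M, E, S, I } mesi_t;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } event_t;   // processor read/write, bus read, read-for-ownership

mesi_t next_state(mesi_t s, event_t e, int other_sharers) {
    switch (e) {
    case PR_RD:   return (s == I) ? (other_sharers ? S : E) : s;  // miss fills E or S; hits keep state
    case PR_WR:   return M;                                        // write gains exclusive ownership
    case BUS_RD:  return (s == M || s == E) ? S : s;               // another core reads: downgrade
    case BUS_RDX: return I;                                        // another core writes: invalidate
    }
    return s;
}

int main(void) {
    mesi_t p1 = I, p2 = I;
    p1 = next_state(p1, PR_RD, 0);                 // R1: line fills as E in P1
    p1 = next_state(p1, PR_WR, 0);                 // W1: P1 upgrades to M
    p1 = next_state(p1, BUS_RD, 1);                // R2 seen on the bus: P1 M -> S
    p2 = next_state(p2, PR_RD, 1);                 // R2: P2 fills as S
    printf("after R1 W1 R2: P1=%d P2=%d (0=M,1=E,2=S,3=I)\n", p1, p2);
    return 0;
}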
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Coherence protocols | Computer Architecture | Ch. 5 |
| Concurrency | Art of Multiprocessor Programming | Ch. 3 |
Common Pitfalls & Debugging
Problem 1: “States don’t match expected”
- Why: Missing a transition case
- Fix: Use a table-driven state machine
Problem 2: “Inconsistent sharing”
- Why: Not updating other caches on bus events
- Fix: Broadcast invalidations on write
Problem 3: “Simulator too complex”
- Why: Trying to model real hardware too closely
- Fix: Keep it minimal and educational
Definition of Done
- Simulator implements MESI or MOESI correctly
- State transitions are visible and explainable
- Includes examples for false sharing scenarios
- Can be used as a teaching tool
Project 9: NUMA-Aware Thread Pool
- Main Programming Language: C++
- Alternative Programming Languages: C, Rust
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Threading / NUMA Scheduling
- Software or Tool: pthreads, libnuma
- Main Book: “C++ Concurrency in Action” by Anthony Williams
What you’ll build: A thread pool that is topology-aware, with per-node queues and locality-aware work stealing.
Why it teaches NUMA threading: Standard thread pools ignore locality. NUMA-aware pools keep tasks near their data.
Core challenges you’ll face:
- Thread pinning → mapping workers to nodes
- Local queues → keeping tasks local
- Work stealing → prefer local, then nearest node
- Stats → measuring local vs remote execution
Real World Outcome
NUMAThreadPool pool(4); // 4 threads per node
pool.submit(task, preferred_node);
pool.stats();
// Node 0: 98.2% local tasks
// Node 1: 97.5% local tasks
// Cross-node steals: 0.6%
The Core Question You’re Answering
“How do I keep threads and tasks on the same NUMA node?”
Concepts You Must Understand First
- CPU affinity
- How do you pin threads?
- Book Reference: C++ Concurrency in Action Ch. 2
- NUMA topology
- How do you map CPUs to nodes?
- Book Reference: Computer Architecture Ch. 2
- Work stealing
- Why do you steal locally first?
- Book Reference: Art of Multiprocessor Programming Ch. 7
Questions to Guide Your Design
- How will you represent per-node queues?
- How will you choose steal order based on distance?
- How will you measure locality?
Thinking Exercise
If a task on node 0 needs data allocated on node 1, is it better to move the task or the data? Explain.
The Interview Questions They’ll Ask
- “How does a NUMA-aware thread pool differ from a normal one?”
- “What metrics show locality success?”
- “When should you allow cross-node stealing?”
- “How does affinity affect throughput?”
Hints in Layers
Hint 1: Starting Point Create a vector of per-node queues.
Hint 2: Next Step
Pin threads to node CPUs with pthread_setaffinity_np.
Hint 3: Key Technique Use distance matrix to order steal attempts.
Hint 4: Debugging/Verification Collect statistics: local tasks vs remote tasks.
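As a sketch of Hints 1 and 2, each worker can pin itself to the CPUs of its node using libnuma's cpumask helpers and pthread_setaffinity_np; the queue logic is omitted. Link with -lnuma -lpthread.
#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    int node = *(int *)arg;
    struct bitmask *cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, cpus) == 0) {     // CPUs belonging to this node
        cpu_set_t set;
        CPU_ZERO(&set);
        for (unsigned i = 0; i < cpus->size; i++)
            if (numa_bitmask_isbitset(cpus, i))
                CPU_SET(i, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }
    numa_free_cpumask(cpus);
    /* ... pop tasks from this node's queue here ... */
    return NULL;
}

int main(void) {
    if (numa_available() < 0) return 1;
    int nodes = numa_max_node() + 1;
    pthread_t t[nodes];
    int ids[nodes];
    for (int n = 0; n < nodes; n++) {
        ids[n] = n;
        pthread_create(&t[n], NULL, worker, &ids[n]);
    }
    for (int n = 0; n < nodes; n++)
        pthread_join(t[n], NULL);
    return 0;
}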
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Concurrency | C++ Concurrency in Action | Ch. 2-4 |
| NUMA basics | Computer Architecture | Ch. 2 |
Common Pitfalls & Debugging
Problem 1: “Threads migrate anyway”
- Why: Affinity not set or overridden by cpusets
- Fix: Verify with taskset -cp <pid>
Problem 2: “Too much stealing”
- Why: Local queues too small or imbalanced
- Fix: Adjust load balancing thresholds
Problem 3: “Deadlocks in queue”
- Why: Incorrect locking or condition variable use
- Fix: Use per-queue mutex and careful notification
Definition of Done
- Threads pinned per node
- Local task execution >95%
- Work stealing respects distance
- Stats show measurable improvement vs global queue
Project 10: NUMA-Aware Database Buffer Pool
- Main Programming Language: C++
- Alternative Programming Languages: C, Rust
- Coolness Level: Level 5: Pure Magic
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Database Systems / NUMA
- Software or Tool: libnuma, mmap, async I/O
- Main Book: “Database Internals” by Alex Petrov
What you’ll build: A NUMA-aware buffer pool manager that places pages on the node where they are most accessed and runs queries near those pages.
Why it teaches real-world NUMA: Databases are the canonical NUMA workload. This project integrates placement, threading, and data structures into a realistic system.
Core challenges you’ll face:
- Per-node caches → partitioned buffer pool
- Placement policy → deciding where pages live
- Execution locality → running queries near data
- Migration → moving pages when access patterns change
Real World Outcome
Buffer Pool Stats:
Node 0: 1.2M pages, 98.9% local hits
Node 1: 1.1M pages, 98.3% local hits
Remote hits: 1.2%
Page migrations: 842
The Core Question You’re Answering
“How do real systems like databases exploit NUMA locality at scale?”
Concepts You Must Understand First
- Buffer pools
- Why do databases cache pages?
- Book Reference: Database Internals Ch. 5
- NUMA-aware allocation
- How do you allocate pages per node?
- Book Reference: C Interfaces and Implementations Ch. 5-6
- Thread placement
- How do you run queries near data?
- Book Reference: C++ Concurrency in Action Ch. 2
Questions to Guide Your Design
- How will you map pages to nodes?
- How will you migrate pages when access changes?
- How will you schedule queries to local nodes?
Thinking Exercise
Suppose a table is hot on node 0 during the day and node 1 at night. How should your buffer pool adapt?
The Interview Questions They’ll Ask
- “Why is NUMA important for databases?”
- “How do you choose which node holds a page?”
- “What is the cost of page migration?”
- “How do you keep query execution local?”
Hints in Layers
Hint 1: Starting Point Partition your buffer pool into per-node pools.
Hint 2: Next Step
Use numa_alloc_onnode for page allocations.
Hint 3: Key Technique Track access stats per page and migrate if locality changes.
Hint 4: Debugging/Verification Measure local hit rate and remote hit rate per node.
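One possible skeleton for Hints 1 and 2: a buffer pool split into per-node frame regions, with a trivial hash from page id to home node. The sizes, the hash policy, and the missing page table, eviction, and I/O are all simplifying assumptions; link with -lnuma.
#include <numa.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE       4096
#define FRAMES_PER_NODE 4096
#define MAX_NODES       8

static char *frames[MAX_NODES];       // one contiguous frame region per node

void pool_init(void) {
    for (int n = 0; n <= numa_max_node() && n < MAX_NODES; n++)
        frames[n] = numa_alloc_onnode((size_t)FRAMES_PER_NODE * PAGE_SIZE, n);
}

int home_node(uint64_t page_id) {     // placeholder policy: hash page id to a node
    return (int)(page_id % (uint64_t)(numa_max_node() + 1));
}

char *frame_for(uint64_t page_id) {
    int node = home_node(page_id);
    size_t slot = page_id % FRAMES_PER_NODE;      // real pool: page table + eviction
    return frames[node] + slot * PAGE_SIZE;
}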
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Buffer pools | Database Internals | Ch. 5 |
| NUMA placement | Computer Architecture | Ch. 2 |
| Concurrency | C++ Concurrency in Action | Ch. 2-4 |
Common Pitfalls & Debugging
Problem 1: “Remote hits too high”
- Why: Queries not co-located with pages
- Fix: Pin worker threads to nodes and schedule per-node
Problem 2: “Migration thrashing”
- Why: Pages move too frequently
- Fix: Add hysteresis or threshold for migration
Problem 3: “Lock contention”
- Why: Global locks across nodes
- Fix: Partition locks per node or shard by table
Definition of Done
- Buffer pool is partitioned by node
- Local hit rate >95%
- Page migration improves locality without thrashing
- Query execution is co-located with pages
Project Comparison Table
| Project | Difficulty | Time | Depth | Focus |
|---|---|---|---|---|
| 1. Topology Explorer | Beginner | Weekend | ⭐⭐ | Discovery |
| 2. Latency Benchmark | Intermediate | 1 week | ⭐⭐⭐⭐ | Measurement |
| 3. Bandwidth Benchmark | Intermediate | 1 week | ⭐⭐⭐⭐ | Measurement |
| 4. False Sharing Detector | Intermediate | 1 week | ⭐⭐⭐⭐ | Coherence |
| 5. NUMA Allocator | Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ | Placement |
| 6. NUMA Data Structures | Advanced | 3-4 weeks | ⭐⭐⭐⭐⭐ | Locality |
| 7. Migration Tool | Advanced | 2 weeks | ⭐⭐⭐⭐ | Control |
| 8. Coherence Simulator | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | Theory |
| 9. NUMA Thread Pool | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | Scheduling |
| 10. Buffer Pool | Expert | 4-6 weeks | ⭐⭐⭐⭐⭐ | Integration |
Summary
| # | Project | Outcome |
|---|---|---|
| 1 | Topology Explorer | Discover nodes, CPUs, distances |
| 2 | Latency Benchmark | Measure local vs remote latency |
| 3 | Bandwidth Benchmark | Measure NUMA bandwidth effects |
| 4 | False Sharing Detector | Detect coherence contention |
| 5 | NUMA Allocator | Control page placement |
| 6 | NUMA Data Structures | Locality-aware structures |
| 7 | Migration Tool | Move and verify pages |
| 8 | Coherence Simulator | Understand MESI/MOESI |
| 9 | NUMA Thread Pool | Locality-aware scheduling |
| 10 | NUMA Buffer Pool | Real-world NUMA integration |
Master these projects and you will understand why some code scales and some code collapses on modern hardware.