Project 2: Memory Latency Microbenchmark
Build a NUMA-aware microbenchmark that measures local vs remote memory latency using pointer chasing and cycle-accurate timing.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Assembly |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 1. The “Resume Gold” |
| Prerequisites | C pointers, CPU affinity basics, Linux CLI tools |
| Key Topics | pointer chasing, MLP, rdtsc/rdtscp timing, NUMA placement |
1. Learning Objectives
By completing this project, you will:
- Explain why pointer chasing reveals true memory latency.
- Implement a reproducible pointer-chasing benchmark that avoids prefetching.
- Use rdtscp to measure cycles with serialization guarantees.
- Bind threads and memory to specific NUMA nodes.
- Convert cycles to nanoseconds using calibrated CPU frequency.
- Produce stable, deterministic benchmark output for comparison.
- Interpret latency differences between local and remote memory.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Pointer Chasing and Memory-Level Parallelism (MLP)
Fundamentals
Modern CPUs can overlap multiple memory accesses, hiding latency with memory-level parallelism (MLP). If you measure a simple loop over an array, the CPU issues multiple loads in flight, so the observed time reflects bandwidth rather than latency. Pointer chasing forces a dependency chain: each load depends on the previous load’s result, so the CPU cannot overlap them. This serializes memory access and exposes true latency. A good latency benchmark therefore uses dependent loads, randomizes the access order to defeat prefetching, and uses a working set large enough to exceed cache capacity. Without these controls, you will measure cache effects or bandwidth, not raw DRAM latency.
Deep Dive into the Concept
MLP is a critical optimization in modern out-of-order cores. Each core can have dozens of outstanding cache misses; hardware prefetchers detect streaming patterns and pre-load cache lines; speculative execution may access memory ahead of the instruction pointer. These mechanisms mean that a naive loop such as sum += a[i] can hide most of the latency because multiple cache misses are in flight and the pipeline continues to execute independent instructions. The observed time is a function of throughput (bandwidth) rather than the latency of a single access.
Pointer chasing builds a strict dependency chain: the address of the next load is stored in the data of the current load. That means the CPU cannot issue the next load until the current load completes, eliminating MLP. This turns the benchmark into a serial measurement of latency. To make it effective, you must also defeat hardware prefetchers. Prefetchers can still learn simple patterns, so you randomize the pointer chain so that each cache line points to a seemingly random next line. You must also ensure the chain walks across a working set larger than the last-level cache (LLC), otherwise you will measure cache latency, not DRAM. Many practitioners choose 2-4x the LLC size and align the array to cache line boundaries.
Another subtlety: TLB behavior. If the pointer chain crosses many pages, TLB misses can add latency. You can either accept this as part of “real” latency or control it by using huge pages or by restricting the chain to a few pages. For this project, you should make the TLB effect explicit: provide a flag that enables huge pages or prints the effective page working set.
Finally, pointer chasing creates a single load stream. That is ideal for latency measurement but not for measuring bandwidth. Therefore the benchmark should explicitly state its goal and avoid misleading output. If a user wants bandwidth, they should use a STREAM-style test (Project 3). The key is to communicate this clearly in your output and documentation.
How This Fits the Project
This concept is the heart of the latency benchmark. You will build the pointer chain generator, choose the working set size, and ensure that timing reflects serialized loads rather than throughput effects.
Definitions & Key Terms
- Pointer chasing -> A data-dependent load chain that forces serialized memory access.
- MLP (Memory-Level Parallelism) -> Overlapping multiple memory misses to hide latency.
- Prefetcher -> Hardware that predicts and loads future cache lines.
- Working set -> The set of memory locations actively used by a program.
- Dependent load -> A load whose address depends on a prior load’s value.
Mental Model Diagram (ASCII)
A -> C -> F -> B -> E -> D
^    ^    ^    ^    ^    ^
|    |    |    |    |    |
each load depends on the previous load's result
MLP-free chain = true latency measurement
How It Works (Step-by-Step)
- Allocate a large array of pointers aligned to cache lines.
- Randomly permute indices and build a linked cycle.
- Touch each element once to fault pages in (first-touch policy).
- Run a tight loop that follows the chain for many iterations.
- Measure total cycles and divide by iterations to get latency.
Invariants: Each step depends on the previous pointer, so the CPU cannot issue parallel loads.
Failure modes: Sequential patterns enable prefetching; working set fits in cache; TLB misses dominate.
Minimal Concrete Example
```c
// Build a randomized pointer cycle over N cache lines (64 B each).
// On a 64-bit system one cache line holds 8 uintptr_t slots; only the
// first slot of each line stores the address of the next line.
size_t lines = N;
uintptr_t *buf = aligned_alloc(64, lines * 64);
int *order = shuffle(lines);            // random permutation of 0..lines-1
for (size_t i = 0; i < lines; i++) {
    size_t cur  = order[i];
    size_t next = order[(i + 1) % lines];
    buf[cur * 8] = (uintptr_t)&buf[next * 8];
}

// Chase the pointers: each load's address is the previous load's data.
uintptr_t *p = (uintptr_t *)buf[order[0] * 8];
for (size_t i = 0; i < ITERS; i++) {
    p = (uintptr_t *)*p;
}
// Consume p afterwards (e.g. store it to a volatile) so the compiler
// cannot discard the loop as dead code.
```
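The `shuffle` helper above is left undefined; here is a minimal Fisher-Yates sketch that returns a random permutation of `0..n-1` (a sketch, not the project's required implementation — seed the PRNG with `srand` to get the deterministic mode described later):

```c
#include <stdlib.h>

// Fisher-Yates shuffle: allocate and return indices 0..n-1 in random
// order. Calling srand with a fixed seed first makes the permutation,
// and therefore the pointer chain, reproducible across runs.
int *shuffle(size_t n) {
    int *order = malloc(n * sizeof(int));
    if (!order) return NULL;
    for (size_t i = 0; i < n; i++)
        order[i] = (int)i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);  // slight modulo bias; fine here
        int tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
    return order;
}
```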
Common Misconceptions
- “Any loop over an array measures latency” -> It often measures bandwidth.
- “Random access always defeats prefetchers” -> Some prefetchers detect patterns across pages.
- “Short tests are fine” -> Short runs are noisy; you need enough iterations to stabilize.
Check-Your-Understanding Questions
- Why does pointer chasing eliminate memory-level parallelism?
- What happens if the working set fits in L3 cache?
- How could TLB misses distort a latency measurement?
- Why might a sequential access pattern still appear to have low latency?
Check-Your-Understanding Answers
- Each load depends on the previous load’s result, preventing overlap.
- You measure cache latency, which is much lower than DRAM latency.
- TLB misses add extra latency per access, inflating the measured value.
- Prefetchers fetch future lines, hiding the real miss penalty.
Real-World Applications
- HPC codes diagnosing unexpected memory latency on large nodes.
- Database tuning where query latency is dominated by random memory access.
- Cloud providers evaluating NUMA penalties for VM placement.
Where You’ll Apply It
- In this project: Sec. 3.1, Sec. 4.4, and Sec. 5.10.
- Also used in: P03-memory-bandwidth-benchmark, P04-false-sharing-detector.
References
- “Computer Systems: A Programmer’s Perspective” (Bryant & O’Hallaron) – Ch. 5-6
- “Computer Architecture” (Hennessy & Patterson) – Ch. 5
Key Insights
Latency is only visible when you prevent the CPU from overlapping memory requests.
Summary
Pointer chasing builds a dependency chain that forces the CPU to wait for each memory access. By randomizing the access pattern and using a large working set, you eliminate prefetching and cache effects, turning the benchmark into a true latency measurement tool.
Homework/Exercises to Practice the Concept
- Build a pointer chain over 1 MB and 1 GB, compare measured latency.
- Replace random order with sequential order and observe the change.
- Add a flag to optionally use huge pages and compare results.
Solutions to the Homework/Exercises
- The 1 GB chain will show higher latency once it exceeds LLC.
- Sequential order will often show lower latency due to prefetching.
- Huge pages reduce TLB misses; latency becomes more stable.
2.2 High-Resolution Timing and NUMA Placement Control
Fundamentals
Accurate latency measurement requires high-resolution timing. The rdtsc instruction reads the CPU’s time-stamp counter, but it is not serialized; out-of-order execution can move loads across the timing calls. rdtscp provides serialization and is preferred for microbenchmarks. You must also control where the benchmark runs and where memory is allocated: pin the thread to a CPU and bind memory to a NUMA node using numactl, sched_setaffinity, or libnuma. Without placement control, the OS may migrate the thread or allocate memory on a different node, invalidating your results. Correct timing and placement together make the benchmark trustworthy.
Deep Dive into the Concept
The time-stamp counter (TSC) is a per-core counter that increments at a constant rate on modern systems ("constant TSC"). `rdtsc` reads this counter, but the CPU can reorder instructions around it. `rdtscp` is partially serializing: it waits until all previous instructions have completed before reading the counter, but it does not stop later instructions from starting early. This makes it suitable for microbenchmarks, although you should still pair it with `cpuid` or `lfence` when you need a full barrier. A typical pattern is `lfence; rdtsc` before and `rdtscp; lfence` after the measured code.
Converting cycles to time requires knowing the TSC frequency. If the CPU runs at a fixed frequency, you can read it from /proc/cpuinfo or use clock_gettime to calibrate cycles over a known interval. Turbo and power-saving states can skew results, so you should either disable scaling (e.g., set the governor to performance) or report both cycles and nanoseconds to avoid misleading users.
Placement is equally important. Linux scheduling can migrate threads across CPUs, and first-touch policy can place memory on the node where the page is first accessed. If you allocate memory on node 0 but your thread migrates to node 1, your “local” measurement becomes remote. You need to pin the thread with sched_setaffinity or pthread_setaffinity_np and allocate memory with numa_alloc_onnode or numactl --membind. You should also verify placement using numastat or /proc/<pid>/numa_maps to ensure the intended node is used.
A robust benchmark therefore includes: (1) explicit affinity setting, (2) memory binding, (3) optional verification, and (4) a safety warning when the system lacks NUMA. It also offers a deterministic mode by fixing the random seed used for the pointer chain. This makes the benchmark output stable across runs and supports comparisons over time.
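The affinity-plus-verification half of this checklist can be sketched as below; the memory-binding half (`numa_alloc_onnode`) is omitted because it requires libnuma, and `pin_and_verify` is an illustrative name, not a project API:

```c
#define _GNU_SOURCE     // for sched_setaffinity and sched_getcpu
#include <sched.h>
#include <stdio.h>

// Pin the calling thread to `cpu`, then report where we actually run.
// Returns the current CPU id, or -1 if pinning failed.
int pin_and_verify(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    // After a successful call the kernel runs us only on the target CPU,
    // so sched_getcpu() doubles as a cheap placement check.
    return sched_getcpu();
}
```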
How This Fits the Project
This concept ensures your benchmark is trustworthy. Without proper timing and placement control, you will measure noise rather than actual NUMA latency.
Definitions & Key Terms
- TSC -> Time-stamp counter, read via `rdtsc`/`rdtscp`.
- Serialization -> Ensuring instruction order around a timing boundary.
- Affinity -> Binding a thread to specific CPUs.
- First-touch -> Memory placement determined by the first access to a page.
- membind -> Policy that forces memory allocation on a specific node.
Mental Model Diagram (ASCII)
[Thread pinned to CPU 0] ---> [Memory bound to Node 0]
| |
+-- rdtscp timing -- dependent loads -+
If thread migrates or memory is not bound,
"local" becomes "remote" and numbers lie.
How It Works (Step-by-Step)
- Pin the benchmark thread to a chosen CPU (node-local).
- Allocate memory using `numa_alloc_onnode` or `--membind`.
- Warm up the pointer chain to fault pages in.
- Use `rdtscp` around the chase loop to measure cycles.
- Convert cycles to nanoseconds using calibrated frequency.
Invariants: Thread stays on the chosen CPU; memory remains bound to intended node.
Failure modes: CPU frequency scaling, thread migration, incorrect membind policy.
Minimal Concrete Example
```c
#define _GNU_SOURCE     // for sched_setaffinity
#include <sched.h>
#include <stdint.h>

static inline uint64_t rdtscp(void) {
    uint32_t lo, hi;
    // rdtscp writes the TSC to EDX:EAX and the core ID to ECX (clobbered).
    __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi) :: "rcx");
    return ((uint64_t)hi << 32) | lo;
}

// Pin to CPU 0 so all readings come from one core's TSC.
cpu_set_t set; CPU_ZERO(&set); CPU_SET(0, &set);
sched_setaffinity(0, sizeof(set), &set);

uint64_t start = rdtscp();
for (size_t i = 0; i < ITERS; i++) p = (uintptr_t *)*p;
uint64_t end = rdtscp();
```
Common Misconceptions
- “rdtsc is always accurate” -> It can be reordered without fences.
- “numactl guarantees placement” -> It only sets policy; first-touch still matters.
- “Cycles equal time” -> You must convert using a stable frequency.
Check-Your-Understanding Questions
- Why is `rdtscp` preferred over `rdtsc` for microbenchmarks?
- What happens if you allocate memory before binding the thread?
- Why might frequency scaling distort results?
Check-Your-Understanding Answers
- `rdtscp` provides serialization, reducing instruction reordering noise.
- First-touch may place pages on a different node than intended.
- The TSC-to-time conversion becomes invalid if frequency changes.
Real-World Applications
- Benchmarking DRAM latency for capacity planning.
- Comparing hardware generations in data centers.
- Validating NUMA penalties after BIOS upgrades.
Where You’ll Apply It
- In this project: Sec. 3.7, Sec. 5.1, Sec. 5.10.
- Also used in: P03-memory-bandwidth-benchmark, P09-numa-aware-thread-pool.
References
- “Computer Systems: A Programmer’s Perspective” (Bryant & O’Hallaron) – Ch. 5
- “Low-Level Programming” (Zhirkov) – Ch. 9
Key Insights
Timing without placement control is just noise; timing with placement reveals the NUMA truth.
Summary
Accurate latency measurement requires serialized timing and strict control over where threads and memory live. Use rdtscp with fences, pin the thread, bind memory, and report both cycles and nanoseconds to make results trustworthy and reproducible.
Homework/Exercises to Practice the Concept
- Measure latency with and without `sched_setaffinity` and compare variance.
- Calibrate TSC frequency using `clock_gettime` over 1 second.
- Add a fixed random seed and verify repeated runs are identical.
Solutions to the Homework/Exercises
- Without affinity, results vary as the thread migrates; with affinity, variance drops.
- Measure cycles over a known interval and divide to get cycles/sec.
- A fixed seed yields the same pointer chain and stable latency numbers.
3. Project Specification
3.1 What You Will Build
A CLI tool latency-bench that measures memory latency between NUMA nodes using pointer chasing. It supports binding CPU and memory to chosen nodes, reports cycles and nanoseconds, and outputs a stable JSON summary.
Included: pointer-chasing generator, rdtscp timing, node binding, JSON output. Excluded: bandwidth measurement, cache benchmarking, or GPU memory latency.
3.2 Functional Requirements
- Pointer Chain Generator: Build a randomized cycle across cache-line-sized elements.
- Placement Control: Bind thread and memory to selected nodes.
- Cycle Timing: Measure cycles with rdtscp and fences.
- Frequency Calibration: Convert cycles to nanoseconds.
- Warm-up Phase: Ensure pages are faulted in on the correct node.
- Repeat Runs: Run multiple trials and report median.
- JSON Output: Emit stable schema with per-node-pair latency.
- Deterministic Mode: Fixed random seed for reproducible results.
- Validation: Confirm placement via `numastat` or `/proc/<pid>/numa_maps`.
3.3 Non-Functional Requirements
- Performance: Low overhead outside the measured loop.
- Reliability: Stable results across runs (variance under 5%).
- Usability: Clear output with both cycles and nanoseconds.
3.4 Example Usage / Output
$ numactl --cpunodebind=0 --membind=0 ./latency-bench --iters=10000000
Node 0 -> Node 0: 240 cycles (80 ns)
Node 0 -> Node 1: 390 cycles (130 ns)
3.5 Data Formats / Schemas / Protocols
JSON output (--json):
```json
{
  "node_pairs": [
    {"from": 0, "to": 0, "cycles": 240, "ns": 80},
    {"from": 0, "to": 1, "cycles": 390, "ns": 130}
  ],
  "iters": 10000000,
  "seed": 42,
  "freq_ghz": 3.0
}
```
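One way to emit this schema without a JSON library is plain `snprintf`; the sketch below is illustrative (the struct layout and `render_json` name are assumptions; only the field names come from the schema above):

```c
#include <stdio.h>
#include <string.h>

typedef struct { int from, to; unsigned long cycles; int ns; } Pair;

// Render the result schema into buf. Assumes cap is ample for the
// output; a production version would check for truncation.
int render_json(char *buf, size_t cap, const Pair *p, int n,
                long iters, int seed, double ghz) {
    int off = snprintf(buf, cap, "{\"node_pairs\":[");
    for (int i = 0; i < n; i++)
        off += snprintf(buf + off, cap - (size_t)off,
                        "{\"from\":%d,\"to\":%d,\"cycles\":%lu,\"ns\":%d}%s",
                        p[i].from, p[i].to, p[i].cycles, p[i].ns,
                        i + 1 < n ? "," : "");
    off += snprintf(buf + off, cap - (size_t)off,
                    "],\"iters\":%ld,\"seed\":%d,\"freq_ghz\":%.1f}",
                    iters, seed, ghz);
    return off;
}
```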
3.6 Edge Cases
- System without NUMA (single node).
- CPU frequency scaling enabled (numbers unstable).
- Pointer chain small enough to fit in cache.
- Memory policy not honored due to cpusets.
3.7 Real World Outcome
A developer can run latency-bench and see a clear, reproducible local vs remote latency difference, validating the NUMA penalty on their machine.
3.7.1 How to Run (Copy/Paste)
cc -O2 -Wall -o latency-bench src/main.c -lnuma
numactl --cpunodebind=0 --membind=0 ./latency-bench --iters=10000000 --seed=42
3.7.2 Golden Path Demo (Deterministic)
$ ./latency-bench --cpunode=0 --memnode=1 --iters=5000000 --seed=42
Node 0 -> Node 1: 395 cycles (132 ns)
3.7.3 If CLI: Exact Terminal Transcript
$ ./latency-bench --cpunode=0 --memnode=0 --iters=5000000 --seed=42
Node 0 -> Node 0: 238 cycles (79 ns)
$ ./latency-bench --cpunode=0 --memnode=1 --iters=5000000 --seed=42
Node 0 -> Node 1: 392 cycles (131 ns)
3.7.4 Failure Demo (Deterministic)
$ ./latency-bench --cpunode=9 --memnode=0
ERROR: invalid node id 9 (max node = 1)
EXIT: 1
Exit Codes:
| Code | Meaning |
|---|---|
| 0 | success |
| 1 | invalid arguments |
| 2 | placement failure |
| 3 | timing error |
4. Solution Architecture
4.1 High-Level Design
+-------------+ +--------------+ +------------------+
| Chain Gen |-->| Benchmark Loop|-->| Report Renderer |
+-------------+ +--------------+ +------------------+
^ ^ ^
| | |
+-------------+ +--------------+ +---------------+
| Placement |----->| Timing (TSC) | | JSON Encoder |
+-------------+ +--------------+ +---------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Chain Generator | Build randomized pointer cycle | Use fixed seed for determinism |
| Placement Layer | Pin thread and bind memory | libnuma vs numactl support |
| Timer | Measure cycles and convert to ns | rdtscp + fences |
| Reporter | Aggregate trials and emit results | median vs average |
4.3 Data Structures (No Full Code)
```c
typedef struct {
    int from_node;
    int to_node;
    uint64_t cycles;
    double ns;
} Result;
```
4.4 Algorithm Overview
Key Algorithm: Latency Measurement
- Build pointer chain across N cache lines.
- Pin thread and bind memory to selected nodes.
- Warm up by traversing the chain once.
- Measure cycles over many iterations.
- Repeat trials and compute median latency.
Complexity Analysis:
- Time: O(iters) per trial
- Space: O(N) for chain storage
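The median-of-trials step can be sketched with `qsort`; the helper names are illustrative:

```c
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b) {
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x > y) - (x < y);
}

// Median of per-trial cycle counts: robust to the occasional slow
// trial (interrupt, migration) that would drag up a mean.
// Note: sorts the input array in place.
unsigned long long median_cycles(unsigned long long *trials, size_t n) {
    qsort(trials, n, sizeof(trials[0]), cmp_u64);
    return (n % 2) ? trials[n / 2]
                   : (trials[n / 2 - 1] + trials[n / 2]) / 2;
}
```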
5. Implementation Guide
5.1 Development Environment Setup
sudo apt-get install -y build-essential libnuma-dev
5.2 Project Structure
latency-bench/
|-- src/
| |-- main.c
| |-- chain.c
| |-- timing.c
| |-- placement.c
| +-- report.c
|-- tests/
| +-- fixtures/
|-- Makefile
+-- README.md
5.3 The Core Question You’re Answering
“How much slower is remote memory on my machine, and how do I measure it correctly?”
5.4 Concepts You Must Understand First
- Pointer chasing – why dependent loads reveal latency.
- rdtscp timing – how to avoid reordering noise.
- NUMA placement – how to bind threads and memory.
5.5 Questions to Guide Your Design
- How will you ensure the chain exceeds LLC size?
- How will you verify memory placement?
- How many trials are needed for stable results?
5.6 Thinking Exercise
If you increase the chain size from 32 MB to 1 GB, what happens to latency and why?
5.7 The Interview Questions They’ll Ask
- “Why is pointer chasing used for latency measurement?”
- “What does rdtscp provide that rdtsc does not?”
- “How does first-touch affect latency measurements?”
5.8 Hints in Layers
Hint 1: Build a fixed-seed shuffle of indices.
Hint 2: Use sched_setaffinity and numa_alloc_onnode.
Hint 3: Report cycles and nanoseconds for clarity.
Hint 4: Validate placement with /proc/<pid>/numa_maps.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Memory hierarchy | “Computer Systems: A Programmer’s Perspective” | Ch. 5-6 |
| OS placement policies | “Operating Systems: Three Easy Pieces” | Ch. 13-21 |
| Low-level timing | “Low-Level Programming” | Ch. 9 |
5.10 Implementation Phases
Phase 1: Chain + Timing (3-4 days)
Goals: build pointer chain and rdtscp timing.
Tasks:
- Implement chain generator with fixed seed.
- Implement rdtscp timing helpers.
Checkpoint: cycles per access are stable within 5%.
Phase 2: Placement + Validation (3-4 days)
Goals: bind threads and memory to nodes.
Tasks:
- Add CPU affinity binding.
- Add memory binding and verification.
Checkpoint: local vs remote latency differs measurably.
Phase 3: Reporting + JSON (2-3 days)
Goals: produce deterministic output and JSON.
Tasks:
- Add median aggregation across trials.
- Implement JSON output schema.
Checkpoint: repeated runs produce identical output with fixed seed.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Timing | rdtsc vs rdtscp | rdtscp | reduces reordering noise |
| Chain size | LLC size vs 4x LLC | 4x LLC | ensure DRAM access |
| Aggregation | mean vs median | median | robust to outliers |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate chain generator | fixed seed produces known cycle |
| Integration Tests | Validate placement | memory bound to target node |
| Edge Case Tests | Handle invalid nodes | expect exit code 1 |
6.2 Critical Test Cases
- Fixed seed chain produces identical order across runs.
- Invalid node returns exit code 1 with clear message.
- Local vs remote yields measurable latency difference.
6.3 Test Data
- seed=42, iters=1000000
- expected_cycles_per_access ~ 240 (local)
- expected_cycles_per_access ~ 390 (remote)
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Working set too small | Latency unusually low | Increase size beyond LLC |
| Thread migration | High variance | Pin thread to CPU |
| No serialization | Unstable cycles | Use rdtscp + fences |
7.2 Debugging Strategies
- Compare results under `numactl --membind`.
- Use `perf stat` to see cache-miss rates.
- Print the TSC frequency and ensure it matches expectations.
7.3 Performance Traps
- Measuring too few iterations introduces timer overhead.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add `--csv` output.
- Add a `--warmup-iters` flag.
8.2 Intermediate Extensions
- Add huge page mode to reduce TLB effects.
- Add per-node pair heatmap output.
8.3 Advanced Extensions
- Measure latency distribution (p50/p95/p99).
- Integrate hardware counters to track cache misses.
9. Real-World Connections
9.1 Industry Applications
- Memory latency benchmarking for hardware procurement.
- Cloud performance regression detection after BIOS updates.
9.2 Related Open Source Projects
- lmbench – classic latency microbenchmarks.
- STREAM – complementary bandwidth benchmark.
9.3 Interview Relevance
- Explaining MLP vs latency vs bandwidth.
- Showing how to control placement in Linux.
10. Resources
10.1 Essential Reading
- “Computer Systems: A Programmer’s Perspective” – Ch. 5-6
- “Operating Systems: Three Easy Pieces” – Ch. 13-21
10.2 Video Resources
- CPU microbenchmarking lectures.
- NUMA memory access talks.
10.3 Tools & Documentation
- numactl – set memory policy and affinity.
- perf – verify cache-miss rates.
10.4 Related Projects in This Series
- P01: NUMA Topology Explorer – pick local/remote node pairs.
- P03: Memory Bandwidth Benchmark – contrast bandwidth vs latency.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why pointer chasing reveals latency.
- I can describe how to serialize rdtscp timing.
- I can explain first-touch effects.
11.2 Implementation
- Results are stable across runs.
- Placement is verified.
- JSON output matches schema.
11.3 Growth
- I can explain these results to a performance engineer.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Pointer-chasing benchmark runs and prints cycles.
- Supports node binding for CPU and memory.
Full Completion:
- Reports cycles and nanoseconds with calibration.
- JSON output and deterministic seed mode.
Excellence (Going Above & Beyond):
- p95/p99 latency distribution and counter integration.