Project 2: Memory Latency Microbenchmark
Build a NUMA-aware microbenchmark that measures local vs remote memory latency using pointer chasing and cycle-accurate timing.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Assembly |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 1. The “Resume Gold” |
| Prerequisites | C pointers, CPU affinity basics, Linux CLI tools |
| Key Topics | pointer chasing, MLP, rdtsc/rdtscp timing, NUMA placement |
1. Learning Objectives
By completing this project, you will:
- Explain why pointer chasing reveals true memory latency.
- Implement a reproducible pointer-chasing benchmark that avoids prefetching.
- Use rdtscp to measure cycles with serialization guarantees.
- Bind threads and memory to specific NUMA nodes.
- Convert cycles to nanoseconds using calibrated CPU frequency.
- Produce stable, deterministic benchmark output for comparison.
- Interpret latency differences between local and remote memory.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Pointer Chasing and Memory-Level Parallelism (MLP)
Fundamentals
Modern CPUs can overlap multiple memory accesses, hiding latency with memory-level parallelism (MLP). If you measure a simple loop over an array, the CPU issues multiple loads in flight, so the observed time reflects bandwidth rather than latency. Pointer chasing forces a dependency chain: each load depends on the previous load’s result, so the CPU cannot overlap them. This serializes memory access and exposes true latency. A good latency benchmark therefore uses dependent loads, randomizes the access order to defeat prefetching, and uses a working set large enough to exceed cache capacity. Without these controls, you will measure cache effects or bandwidth, not raw DRAM latency.
Deep Dive into the Concept
MLP is a critical optimization in modern out-of-order cores. Each core can have dozens of outstanding cache misses; hardware prefetchers detect streaming patterns and pre-load cache lines; speculative execution may access memory ahead of the instruction pointer. These mechanisms mean that a naive loop such as sum += a[i] can hide most of the latency because multiple cache misses are in flight and the pipeline continues to execute independent instructions. The observed time is a function of throughput (bandwidth) rather than the latency of a single access.
Pointer chasing builds a strict dependency chain: the address of the next load is stored in the data of the current load. That means the CPU cannot issue the next load until the current load completes, eliminating MLP. This turns the benchmark into a serial measurement of latency. To make it effective, you must also defeat hardware prefetchers. Prefetchers can still learn simple patterns, so you randomize the pointer chain so that each cache line points to a seemingly random next line. You must also ensure the chain walks across a working set larger than the last-level cache (LLC), otherwise you will measure cache latency, not DRAM. Many practitioners choose 2-4x the LLC size and align the array to cache line boundaries.
Another subtlety: TLB behavior. If the pointer chain crosses many pages, TLB misses can add latency. You can either accept this as part of “real” latency or control it by using huge pages or by restricting the chain to a few pages. For this project, you should make the TLB effect explicit: provide a flag that enables huge pages or prints the effective page working set.
Finally, pointer chasing creates a single load stream. That is ideal for latency measurement but not for measuring bandwidth. Therefore the benchmark should explicitly state its goal and avoid misleading output. If a user wants bandwidth, they should use a STREAM-style test (Project 3). The key is to communicate this clearly in your output and documentation.
How This Fits the Project
This concept is the heart of the latency benchmark. You will build the pointer chain generator, choose the working set size, and ensure that timing reflects serialized loads rather than throughput effects.
Definitions & Key Terms
- Pointer chasing -> A data-dependent load chain that forces serialized memory access.
- MLP (Memory-Level Parallelism) -> Overlapping multiple memory misses to hide latency.
- Prefetcher -> Hardware that predicts and loads future cache lines.
- Working set -> The set of memory locations actively used by a program.
- Dependent load -> A load whose address depends on a prior load’s value.
Mental Model Diagram (ASCII)
A -> C -> F -> B -> E -> D
^    ^    ^    ^    ^    ^
|    |    |    |    |    |
each load depends on the previous load's result
MLP-free chain = true latency measurement
How It Works (Step-by-Step)
- Allocate a large array of pointers aligned to cache lines.
- Randomly permute indices and build a linked cycle.
- Touch each element once to fault pages in (first-touch policy).
- Run a tight loop that follows the chain for many iterations.
- Measure total cycles and divide by iterations to get latency.
Invariants: Each step depends on the previous pointer, so the CPU cannot issue parallel loads.
Failure modes: Sequential patterns enable prefetching; working set fits in cache; TLB misses dominate.
Minimal Concrete Example
```c
// Build a randomized pointer cycle over N cache lines (64 B each).
// On a 64-bit system one cache line holds 8 uintptr_t slots; only the
// first slot of each line stores the address of the next line.
size_t lines = N;
uintptr_t *buf = aligned_alloc(64, lines * 64);
int *order = shuffle(lines);            // random permutation of 0..lines-1
for (size_t i = 0; i < lines; i++) {
    size_t cur  = order[i];
    size_t next = order[(i + 1) % lines];
    buf[cur * 8] = (uintptr_t)&buf[next * 8];
}

// Chase the pointers: each load's address is the previous load's data.
uintptr_t *p = (uintptr_t *)buf[order[0] * 8];
for (size_t i = 0; i < ITERS; i++) {
    p = (uintptr_t *)*p;
}
// Consume p afterwards (e.g. store it to a volatile) so the compiler
// cannot discard the loop as dead code.
```
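The `shuffle` helper above is left undefined; here is a minimal Fisher-Yates sketch that returns a random permutation of `0..n-1` (a sketch, not the project's required implementation — seed the PRNG with `srand` to get the deterministic mode described later):

```c
#include <stdlib.h>

// Fisher-Yates shuffle: allocate and return indices 0..n-1 in random
// order. Calling srand with a fixed seed first makes the permutation,
// and therefore the pointer chain, reproducible across runs.
int *shuffle(size_t n) {
    int *order = malloc(n * sizeof(int));
    if (!order) return NULL;
    for (size_t i = 0; i < n; i++)
        order[i] = (int)i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);  // slight modulo bias; fine here
        int tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
    return order;
}
```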
Common Misconceptions
- “Any loop over an array measures latency” -> It often measures bandwidth.
- “Random access always defeats prefetchers” -> Some prefetchers detect patterns across pages.
- “Short tests are fine” -> Short runs are noisy; you need enough iterations to stabilize.
Check-Your-Understanding Questions
- Why does pointer chasing eliminate memory-level parallelism?
- What happens if the working set fits in L3 cache?
- How could TLB misses distort a latency measurement?
- Why might a sequential access pattern still appear to have low latency?
Check-Your-Understanding Answers
- Each load depends on the previous load’s result, preventing overlap.
- You measure cache latency, which is much lower than DRAM latency.
- TLB misses add extra latency per access, inflating the measured value.
- Prefetchers fetch future lines, hiding the real miss penalty.
Real-World Applications
- HPC codes diagnosing unexpected memory latency on large nodes.
- Database tuning where query latency is dominated by random memory access.
- Cloud providers evaluating NUMA penalties for VM placement.
Where You’ll Apply It
- In this project: Sec. 3.1, Sec. 4.4, and Sec. 5.10.
- Also used in: P03-memory-bandwidth-benchmark, P04-false-sharing-detector.
References
- “Computer Systems: A Programmer’s Perspective” (Bryant & O’Hallaron) – Ch. 5-6
- “Computer Architecture” (Hennessy & Patterson) – Ch. 5
Key Insights
Latency is only visible when you prevent the CPU from overlapping memory requests.
Summary
Pointer chasing builds a dependency chain that forces the CPU to wait for each memory access. By randomizing the access pattern and using a large working set, you eliminate prefetching and cache effects, turning the benchmark into a true latency measurement tool.
Homework/Exercises to Practice the Concept
- Build a pointer chain over 1 MB and 1 GB, compare measured latency.
- Replace random order with sequential order and observe the change.
- Add a flag to optionally use huge pages and compare results.
Solutions to the Homework/Exercises
- The 1 GB chain will show higher latency once it exceeds LLC.
- Sequential order will often show lower latency due to prefetching.
- Huge pages reduce TLB misses; latency becomes more stable.
2.2 High-Resolution Timing and NUMA Placement Control
Fundamentals
Accurate latency measurement requires high-resolution timing. The rdtsc instruction reads the CPU’s time-stamp counter, but it is not serialized; out-of-order execution can move loads across the timing calls. rdtscp provides serialization and is preferred for microbenchmarks. You must also control where the benchmark runs and where memory is allocated: pin the thread to a CPU and bind memory to a NUMA node using numactl, sched_setaffinity, or libnuma. Without placement control, the OS may migrate the thread or allocate memory on a different node, invalidating your results. Correct timing and placement together make the benchmark trustworthy.
Deep Dive into the Concept
The time-stamp counter (TSC) is a per-core counter that increments at a constant rate on modern systems ("constant TSC"). `rdtsc` reads this counter, but the CPU can reorder instructions around it. `rdtscp` is partially serializing: it waits until all previous instructions have completed before reading the counter, but it does not stop later instructions from starting early. This makes it suitable for microbenchmarks, although you should still pair it with `cpuid` or `lfence` when you need a full barrier. A typical pattern is `lfence; rdtsc` before and `rdtscp; lfence` after the measured code.
Converting cycles to time requires knowing the TSC frequency. If the CPU runs at a fixed frequency, you can read it from /proc/cpuinfo or use clock_gettime to calibrate cycles over a known interval. Turbo and power-saving states can skew results, so you should either disable scaling (e.g., set the governor to performance) or report both cycles and nanoseconds to avoid misleading users.
Placement is equally important. Linux scheduling can migrate threads across CPUs, and first-touch policy can place memory on the node where the page is first accessed. If you allocate memory on node 0 but your thread migrates to node 1, your “local” measurement becomes remote. You need to pin the thread with sched_setaffinity or pthread_setaffinity_np and allocate memory with numa_alloc_onnode or numactl --membind. You should also verify placement using numastat or /proc/<pid>/numa_maps to ensure the intended node is used.
A robust benchmark therefore includes: (1) explicit affinity setting, (2) memory binding, (3) optional verification, and (4) a safety warning when the system lacks NUMA. It also offers a deterministic mode by fixing the random seed used for the pointer chain. This makes the benchmark output stable across runs and supports comparisons over time.
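The affinity-plus-verification half of this checklist can be sketched as below; the memory-binding half (`numa_alloc_onnode`) is omitted because it requires libnuma, and `pin_and_verify` is an illustrative name, not a project API:

```c
#define _GNU_SOURCE     // for sched_setaffinity and sched_getcpu
#include <sched.h>
#include <stdio.h>

// Pin the calling thread to `cpu`, then report where we actually run.
// Returns the current CPU id, or -1 if pinning failed.
int pin_and_verify(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    // After a successful call the kernel runs us only on the target CPU,
    // so sched_getcpu() doubles as a cheap placement check.
    return sched_getcpu();
}
```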
How This Fits the Project
This concept ensures your benchmark is trustworthy. Without proper timing and placement control, you will measure noise rather than actual NUMA latency.
Definitions & Key Terms
- TSC -> Time-stamp counter, read via `rdtsc`/`rdtscp`.
- Serialization -> Ensuring instruction order around a timing boundary.
- Affinity -> Binding a thread to specific CPUs.
- First-touch -> Memory placement determined by the first access to a page.
- membind -> Policy that forces memory allocation on a specific node.
Mental Model Diagram (ASCII)
[Thread pinned to CPU 0] ---> [Memory bound to Node 0]
| |
+-- rdtscp timing -- dependent loads -+
If thread migrates or memory is not bound,
"local" becomes "remote" and numbers lie.
How It Works (Step-by-Step)
- Pin the benchmark thread to a chosen CPU (node-local).
- Allocate memory using `numa_alloc_onnode` or `--membind`.
- Warm up the pointer chain to fault pages in.
- Use `rdtscp` around the chase loop to measure cycles.
- Convert cycles to nanoseconds using calibrated frequency.
Invariants: Thread stays on the chosen CPU; memory remains bound to intended node.
Failure modes: CPU frequency scaling, thread migration, incorrect membind policy.
Minimal Concrete Example
```c
#define _GNU_SOURCE     // for sched_setaffinity
#include <sched.h>
#include <stdint.h>

static inline uint64_t rdtscp(void) {
    uint32_t lo, hi;
    // rdtscp writes the TSC to EDX:EAX and the core ID to ECX (clobbered).
    __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi) :: "rcx");
    return ((uint64_t)hi << 32) | lo;
}

// Pin to CPU 0 so all readings come from one core's TSC.
cpu_set_t set; CPU_ZERO(&set); CPU_SET(0, &set);
sched_setaffinity(0, sizeof(set), &set);

uint64_t start = rdtscp();
for (size_t i = 0; i < ITERS; i++) p = (uintptr_t *)*p;
uint64_t end = rdtscp();
```
Common Misconceptions
- “rdtsc is always accurate” -> It can be reordered without fences.
- “numactl guarantees placement” -> It only sets policy; first-touch still matters.
- “Cycles equal time” -> You must convert using a stable frequency.
Check-Your-Understanding Questions
- Why is `rdtscp` preferred over `rdtsc` for microbenchmarks?
- What happens if you allocate memory before binding the thread?
- Why might frequency scaling distort results?
Check-Your-Understanding Answers
- `rdtscp` provides serialization, reducing instruction reordering noise.
- First-touch may place pages on a different node than intended.
- The TSC-to-time conversion becomes invalid if frequency changes.
Real-World Applications
- Benchmarking DRAM latency for capacity planning.
- Comparing hardware generations in data centers.
- Validating NUMA penalties after BIOS upgrades.
Where You’ll Apply It
- In this project: Sec. 3.7, Sec. 5.1, Sec. 5.10.
- Also used in: P03-memory-bandwidth-benchmark, P09-numa-aware-thread-pool.
References
- “Computer Systems: A Programmer’s Perspective” (Bryant & O’Hallaron) – Ch. 5
- “Low-Level Programming” (Zhirkov) – Ch. 9
Key Insights
Timing without placement control is just noise; timing with placement reveals the NUMA truth.
Summary
Accurate latency measurement requires serialized timing and strict control over where threads and memory live. Use rdtscp with fences, pin the thread, bind memory, and report both cycles and nanoseconds to make results trustworthy and reproducible.
Homework/Exercises to Practice the Concept
- Measure latency with and without `sched_setaffinity` and compare variance.
- Calibrate TSC frequency using `clock_gettime` over 1 second.
- Add a fixed random seed and verify repeated runs are identical.
Solutions to the Homework/Exercises
- Without affinity, results vary as the thread migrates; with affinity, variance drops.
- Measure cycles over a known interval and divide to get cycles/sec.
- A fixed seed yields the same pointer chain and stable latency numbers.
3. Project Specification
3.1 What You Will Build
A CLI tool latency-bench that measures memory latency between NUMA nodes using pointer chasing. It supports binding CPU and memory to chosen nodes, reports cycles and nanoseconds, and outputs a stable JSON summary.
Included: pointer-chasing generator, rdtscp timing, node binding, JSON output. Excluded: bandwidth measurement, cache benchmarking, or GPU memory latency.
3.2 Functional Requirements
- Pointer Chain Generator: Build a randomized cycle across cache-line-sized elements.
- Placement Control: Bind thread and memory to selected nodes.
- Cycle Timing: Measure cycles with rdtscp and fences.
- Frequency Calibration: Convert cycles to nanoseconds.
- Warm-up Phase: Ensure pages are faulted in on the correct node.
- Repeat Runs: Run multiple trials and report median.
- JSON Output: Emit stable schema with per-node-pair latency.
- Deterministic Mode: Fixed random seed for reproducible results.
- Validation: Confirm placement via `numastat` or `/proc/<pid>/numa_maps`.
3.3 Non-Functional Requirements
- Performance: Low overhead outside the measured loop.
- Reliability: Stable results across runs (variance under 5%).
- Usability: Clear output with both cycles and nanoseconds.
3.4 Example Usage / Output
$ numactl --cpunodebind=0 --membind=0 ./latency-bench --iters=10000000
Node 0 -> Node 0: 240 cycles (80 ns)
Node 0 -> Node 1: 390 cycles (130 ns)
3.5 Data Formats / Schemas / Protocols
JSON output (--json):
```json
{
  "node_pairs": [
    {"from": 0, "to": 0, "cycles": 240, "ns": 80},
    {"from": 0, "to": 1, "cycles": 390, "ns": 130}
  ],
  "iters": 10000000,
  "seed": 42,
  "freq_ghz": 3.0
}
```
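One way to emit this schema without a JSON library is plain `snprintf`; the sketch below is illustrative (the struct layout and `render_json` name are assumptions; only the field names come from the schema above):

```c
#include <stdio.h>
#include <string.h>

typedef struct { int from, to; unsigned long cycles; int ns; } Pair;

// Render the result schema into buf. Assumes cap is ample for the
// output; a production version would check for truncation.
int render_json(char *buf, size_t cap, const Pair *p, int n,
                long iters, int seed, double ghz) {
    int off = snprintf(buf, cap, "{\"node_pairs\":[");
    for (int i = 0; i < n; i++)
        off += snprintf(buf + off, cap - (size_t)off,
                        "{\"from\":%d,\"to\":%d,\"cycles\":%lu,\"ns\":%d}%s",
                        p[i].from, p[i].to, p[i].cycles, p[i].ns,
                        i + 1 < n ? "," : "");
    off += snprintf(buf + off, cap - (size_t)off,
                    "],\"iters\":%ld,\"seed\":%d,\"freq_ghz\":%.1f}",
                    iters, seed, ghz);
    return off;
}
```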
3.6 Edge Cases
- System without NUMA (single node).
- CPU frequency scaling enabled (numbers unstable).
- Pointer chain small enough to fit in cache.
- Memory policy not honored due to cpusets.
3.7 Real World Outcome
A developer can run latency-bench and see a clear, reproducible local vs remote latency difference, validating the NUMA penalty on their machine.
3.7.1 How to Run (Copy/Paste)
cc -O2 -Wall -o latency-bench src/main.c -lnuma
numactl --cpunodebind=0 --membind=0 ./latency-bench --iters=10000000 --seed=42
3.7.2 Golden Path Demo (Deterministic)
$ ./latency-bench --cpunode=0 --memnode=1 --iters=5000000 --seed=42
Node 0 -> Node 1: 395 cycles (132 ns)
3.7.3 If CLI: Exact Terminal Transcript
$ ./latency-bench --cpunode=0 --memnode=0 --iters=5000000 --seed=42
Node 0 -> Node 0: 238 cycles (79 ns)
$ ./latency-bench --cpunode=0 --memnode=1 --iters=5000000 --seed=42
Node 0 -> Node 1: 392 cycles (131 ns)
3.7.4 Failure Demo (Deterministic)
$ ./latency-bench --cpunode=9 --memnode=0
ERROR: invalid node id 9 (max node = 1)
EXIT: 1
Exit Codes:
| Code | Meaning |
|---|---|
| 0 | success |
| 1 | invalid arguments |
| 2 | placement failure |
| 3 | timing error |
4. Solution Architecture
4.1 High-Level Design
+-------------+ +--------------+ +------------------+
| Chain Gen |-->| Benchmark Loop|-->| Report Renderer |
+-------------+ +--------------+ +------------------+
^ ^ ^
| | |
+-------------+ +--------------+ +---------------+
| Placement |----->| Timing (TSC) | | JSON Encoder |
+-------------+ +--------------+ +---------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Chain Generator | Build randomized pointer cycle | Use fixed seed for determinism |
| Placement Layer | Pin thread and bind memory | libnuma vs numactl support |
| Timer | Measure cycles and convert to ns | rdtscp + fences |
| Reporter | Aggregate trials and emit results | median vs average |
4.3 Data Structures (No Full Code)
```c
typedef struct {
    int from_node;
    int to_node;
    uint64_t cycles;
    double ns;
} Result;
```
4.4 Algorithm Overview
Key Algorithm: Latency Measurement
- Build pointer chain across N cache lines.
- Pin thread and bind memory to selected nodes.
- Warm up by traversing the chain once.
- Measure cycles over many iterations.
- Repeat trials and compute median latency.
Complexity Analysis:
- Time: O(iters) per trial
- Space: O(N) for chain storage
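The median-of-trials step can be sketched with `qsort`; the helper names are illustrative:

```c
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b) {
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x > y) - (x < y);
}

// Median of per-trial cycle counts: robust to the occasional slow
// trial (interrupt, migration) that would drag up a mean.
// Note: sorts the input array in place.
unsigned long long median_cycles(unsigned long long *trials, size_t n) {
    qsort(trials, n, sizeof(trials[0]), cmp_u64);
    return (n % 2) ? trials[n / 2]
                   : (trials[n / 2 - 1] + trials[n / 2]) / 2;
}
```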
5. Implementation Guide
5.1 Development Environment Setup
sudo apt-get install -y build-essential libnuma-dev
5.2 Project Structure
latency-bench/
|-- src/
| |-- main.c
| |-- chain.c
| |-- timing.c
| |-- placement.c
| +-- report.c
|-- tests/
| +-- fixtures/
|-- Makefile
+-- README.md
5.3 The Core Question You’re Answering
“How much slower is remote memory on my machine, and how do I measure it correctly?”
5.4 Concepts You Must Understand First
- Pointer chasing – why dependent loads reveal latency.
- rdtscp timing – how to avoid reordering noise.
- NUMA placement – how to bind threads and memory.
5.5 Questions to Guide Your Design
- How will you ensure the chain exceeds LLC size?
- How will you verify memory placement?
- How many trials are needed for stable results?
5.6 Thinking Exercise
If you increase the chain size from 32 MB to 1 GB, what happens to latency and why?
5.7 The Interview Questions They’ll Ask
- “Why is pointer chasing used for latency measurement?”
- “What does rdtscp provide that rdtsc does not?”
- “How does first-touch affect latency measurements?”
5.8 Hints in Layers
Hint 1: Build a fixed-seed shuffle of indices.
Hint 2: Use sched_setaffinity and numa_alloc_onnode.
Hint 3: Report cycles and nanoseconds for clarity.
Hint 4: Validate placement with /proc/<pid>/numa_maps.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Memory hierarchy | “Computer Systems: A Programmer’s Perspective” | Ch. 5-6 |
| OS placement policies | “Operating Systems: Three Easy Pieces” | Ch. 13-21 |
| Low-level timing | “Low-Level Programming” | Ch. 9 |
5.10 Implementation Phases
Phase 1: Chain + Timing (3-4 days)
Goals: build pointer chain and rdtscp timing.
Tasks:
- Implement chain generator with fixed seed.
- Implement rdtscp timing helpers.
Checkpoint: cycles per access are stable within 5%.
Phase 2: Placement + Validation (3-4 days)
Goals: bind threads and memory to nodes.
Tasks:
- Add CPU affinity binding.
- Add memory binding and verification.
Checkpoint: local vs remote latency differs measurably.
Phase 3: Reporting + JSON (2-3 days)
Goals: produce deterministic output and JSON.
Tasks:
- Add median aggregation across trials.
- Implement JSON output schema.
Checkpoint: repeated runs produce identical output with fixed seed.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Timing | rdtsc vs rdtscp | rdtscp | reduces reordering noise |
| Chain size | LLC size vs 4x LLC | 4x LLC | ensure DRAM access |
| Aggregation | mean vs median | median | robust to outliers |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate chain generator | fixed seed produces known cycle |
| Integration Tests | Validate placement | memory bound to target node |
| Edge Case Tests | Handle invalid nodes | expect exit code 1 |
6.2 Critical Test Cases
- Fixed seed chain produces identical order across runs.
- Invalid node returns exit code 1 with clear message.
- Local vs remote yields measurable latency difference.
6.3 Test Data
- seed=42, iters=1000000
- expected_cycles_per_access ~ 240 (local)
- expected_cycles_per_access ~ 390 (remote)
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Working set too small | Latency unusually low | Increase size beyond LLC |
| Thread migration | High variance | Pin thread to CPU |
| No serialization | Unstable cycles | Use rdtscp + fences |
7.2 Debugging Strategies
- Compare results under `numactl --membind`.
- Use `perf stat` to see cache-miss rates.
- Print the TSC frequency and ensure it matches expectations.
7.3 Performance Traps
- Measuring too few iterations introduces timer overhead.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add `--csv` output.
- Add a `--warmup-iters` flag.
8.2 Intermediate Extensions
- Add huge page mode to reduce TLB effects.
- Add per-node pair heatmap output.
8.3 Advanced Extensions
- Measure latency distribution (p50/p95/p99).
- Integrate hardware counters to track cache misses.
9. Real-World Connections
9.1 Industry Applications
- Memory latency benchmarking for hardware procurement.
- Cloud performance regression detection after BIOS updates.
9.2 Related Open Source Projects
- lmbench – classic latency microbenchmarks.
- STREAM – complementary bandwidth benchmark.
9.3 Interview Relevance
- Explaining MLP vs latency vs bandwidth.
- Showing how to control placement in Linux.
10. Resources
10.1 Essential Reading
- “Computer Systems: A Programmer’s Perspective” – Ch. 5-6
- “Operating Systems: Three Easy Pieces” – Ch. 13-21
10.2 Video Resources
- CPU microbenchmarking lectures.
- NUMA memory access talks.
10.3 Tools & Documentation
- numactl – set memory policy and affinity.
- perf – verify cache-miss rates.
10.4 Related Projects in This Series
- P01: NUMA Topology Explorer – pick local/remote node pairs.
- P03: Memory Bandwidth Benchmark – contrast bandwidth vs latency.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why pointer chasing reveals latency.
- I can describe how to serialize rdtscp timing.
- I can explain first-touch effects.
11.2 Implementation
- Results are stable across runs.
- Placement is verified.
- JSON output matches schema.
11.3 Growth
- I can explain these results to a performance engineer.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Pointer-chasing benchmark runs and prints cycles.
- Supports node binding for CPU and memory.
Full Completion:
- Reports cycles and nanoseconds with calibration.
- JSON output and deterministic seed mode.
Excellence (Going Above & Beyond):
- p95/p99 latency distribution and counter integration.