Project 3: Memory Bandwidth Benchmark

Build a STREAM-style benchmark that measures copy/scale/add/triad bandwidth under different NUMA policies and thread placements.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1-2 weeks |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, C++ |
| Coolness Level | Level 3: Genuinely Useful |
| Business Potential | 2. The “Internal Tool” |
| Prerequisites | C loops, timing basics, NUMA placement concepts |
| Key Topics | STREAM kernels, cache effects, NUMA interleaving, thread scaling |

1. Learning Objectives

By completing this project, you will:

  1. Explain why bandwidth benchmarks differ from latency benchmarks.
  2. Implement STREAM-like kernels and measure GB/s accurately.
  3. Choose working-set sizes that bypass caches.
  4. Control NUMA placement with bind and interleave policies.
  5. Evaluate single-thread vs multi-thread bandwidth scaling.
  6. Produce deterministic benchmark output for comparisons.
  7. Interpret why interleaving can increase bandwidth but harm latency.

2. All Theory Needed (Per-Concept Breakdown)

2.1 STREAM Kernels and Bandwidth vs Latency

Fundamentals

Bandwidth measures how much data can be moved per second, while latency measures the delay for a single access. STREAM kernels (copy, scale, add, triad) are designed to saturate memory bandwidth by performing simple operations on large arrays. Because the kernels are predictable and vectorizable, they stress the memory subsystem rather than the CPU’s computation. If the arrays fit in cache, you measure cache bandwidth instead of DRAM bandwidth, so working set size must exceed the LLC. STREAM-style benchmarks are the standard way to compare memory bandwidth across systems, and they are ideal for measuring how NUMA placement affects throughput.

Deep Dive into the Concept

The classic STREAM benchmark uses four kernels: copy (c = a), scale (b = scalar * c), add (c = a + b), and triad (a = b + scalar * c). Each kernel has a known ratio of memory operations to floating point operations (bytes per FLOP). This makes it possible to predict the theoretical bandwidth and to identify whether the system is compute-bound or memory-bound. In practice, modern CPUs are usually memory-bound for these kernels because the arithmetic is trivial and the limiting factor is DRAM throughput.

A correct bandwidth benchmark must avoid several pitfalls. First, the arrays must be larger than the last-level cache; otherwise the benchmark measures cache bandwidth, which can be 5-10x higher than DRAM. A common approach is to use a working set of 4x LLC size. Second, you must prevent the compiler from optimizing the loops away. This is typically done by using volatile pointers or by verifying that the results are consumed. Third, you must ensure proper alignment to cache lines and memory pages to avoid skewed results.
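
The checksum approach mentioned above can be sketched as follows; `copy_with_checksum` is an illustrative name, not part of the project API:

```c
#include <stddef.h>

/* Run the COPY kernel, then fold the result into a checksum that is
 * returned to the caller. Because the returned value depends on every
 * element of c, the compiler cannot eliminate the copy loop. */
double copy_with_checksum(double *c, const double *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i];
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += c[i];               /* consuming c keeps the loop alive */
    return sum;                    /* print or compare this value */
}
```

Printing (or asserting on) the returned checksum at the end of the run is usually enough; `volatile` pointers also work but tend to inhibit vectorization, which distorts the measurement.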

Bandwidth measurement is also affected by write-allocate policies. For example, a simple copy c = a may read c into cache before writing, unless you use non-temporal stores. This can inflate memory traffic. STREAM’s triad kernel includes both read and write streams and is often considered the most representative. But for your project, you should implement all four kernels and report each. This allows users to compare how different patterns behave under NUMA policies.
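
A sketch of a COPY variant using non-temporal stores, assuming x86 SSE2, 16-byte-aligned arrays, and an even element count; other platforms fall back to a plain copy:

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* COPY with non-temporal stores: writes bypass the cache, so the
 * destination lines are not read in first (no write-allocate traffic).
 * Assumes c and a are 16-byte aligned and n is even. */
void copy_nontemporal(double *c, const double *a, size_t n) {
#if defined(__SSE2__)
    for (size_t i = 0; i < n; i += 2)
        _mm_stream_pd(&c[i], _mm_load_pd(&a[i]));
    _mm_sfence();   /* make the streamed stores globally visible */
#else
    for (size_t i = 0; i < n; i++)   /* portable fallback */
        c[i] = a[i];
#endif
}
```

Comparing this against the plain COPY loop makes the write-allocate overhead directly visible in the reported GB/s.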

Another important factor is multi-threading. Memory bandwidth scales with the number of memory channels and controllers. With one thread pinned to one node, you may only use that node’s channels. With interleaving across nodes, you can use multiple channels and increase aggregate bandwidth. However, this can also increase latency for each thread, so the bandwidth numbers must be interpreted in context. The benchmark should include both single-thread and multi-thread modes, and it should report per-thread and aggregate bandwidth.

Finally, you must account for timing accuracy. Use clock_gettime(CLOCK_MONOTONIC_RAW) or rdtscp-based timing, but ensure that the measured interval is long enough to dominate timer overhead. A few hundred milliseconds per kernel is usually sufficient. With these controls, the benchmark becomes a reliable tool for comparing NUMA policies and understanding memory throughput behavior.
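
A minimal timing helper along these lines, assuming a POSIX system; CLOCK_MONOTONIC_RAW is Linux-specific, so the sketch falls back to CLOCK_MONOTONIC where it is absent:

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Monotonic wall-clock time in seconds. CLOCK_MONOTONIC_RAW avoids
 * NTP rate adjustments; CLOCK_MONOTONIC is the portable fallback. */
static double now_sec(void) {
#ifdef CLOCK_MONOTONIC_RAW
    const clockid_t id = CLOCK_MONOTONIC_RAW;
#else
    const clockid_t id = CLOCK_MONOTONIC;
#endif
    struct timespec ts;
    clock_gettime(id, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
```

Bracket each kernel with two `now_sec()` calls and keep the measured interval in the hundreds of milliseconds so the ~tens-of-nanoseconds call overhead is negligible.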

How This Fits the Project

This concept defines the kernels and measurement strategy for your benchmark. Without STREAM-style kernels and correct working-set sizing, you cannot claim to measure DRAM bandwidth.

Definitions & Key Terms

  • Bandwidth -> Data transferred per second, often GB/s.
  • STREAM kernel -> Simple vector operations used to stress memory throughput.
  • Working set -> Data size actively touched by the benchmark.
  • Write-allocate -> Policy that loads a cache line before writing to it.
  • Non-temporal store -> Store instruction that bypasses caches.

Mental Model Diagram (ASCII)

Memory Channels -> IMC -> LLC -> L2 -> L1 -> Core
   |-------------------------------------------|
   |      STREAM kernels saturate DRAM         |

How It Works (Step-by-Step)

  1. Allocate three large arrays aligned to cache lines.
  2. Initialize arrays to avoid lazy allocation.
  3. Run each kernel for a fixed time or iteration count.
  4. Measure elapsed time and compute GB/s.
  5. Repeat for multiple trials and take median.

Invariants: Arrays exceed LLC; results are used to prevent elimination.

Failure modes: Arrays too small, compiler optimizations, timer overhead dominates.
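
Steps 1-2 above can be sketched as below; `alloc_stream_array` is an illustrative helper, and `n` is assumed to be a multiple of 8 so the allocation size is a multiple of the 64-byte alignment (a C11 `aligned_alloc` requirement):

```c
#include <stdlib.h>
#include <stddef.h>

/* Allocate a double array aligned to a 64-byte cache line and touch
 * every element so the backing pages are faulted in before timing
 * starts (avoids measuring page-fault cost inside the kernels). */
double *alloc_stream_array(size_t n, double fill) {
    double *p = aligned_alloc(64, n * sizeof(double));
    if (!p) return NULL;
    for (size_t i = 0; i < n; i++)
        p[i] = fill;               /* first touch faults each page in */
    return p;
}
```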

Minimal Concrete Example

for (size_t i = 0; i < N; i++) c[i] = a[i];          // COPY
for (size_t i = 0; i < N; i++) b[i] = scalar * c[i]; // SCALE
for (size_t i = 0; i < N; i++) c[i] = a[i] + b[i];   // ADD
for (size_t i = 0; i < N; i++) a[i] = b[i] + scalar * c[i]; // TRIAD
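
Turning elapsed time into GB/s only requires counting the explicit accesses per element; a sketch (write-allocate traffic, discussed earlier, is deliberately not counted, matching STREAM's convention):

```c
#include <stddef.h>

/* Bytes moved per iteration, counting explicit reads and writes only:
 *   COPY, SCALE -> 2 accesses/element; ADD, TRIAD -> 3 accesses/element */
double stream_gbs(int accesses_per_elem, size_t n, double seconds) {
    double bytes = (double)accesses_per_elem * (double)n * sizeof(double);
    return bytes / seconds / 1e9;   /* decimal GB: 1e9 bytes */
}
```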

Common Misconceptions

  • “Bandwidth and latency are the same” -> They measure different properties.
  • “Small arrays are fine” -> They measure cache bandwidth, not DRAM.
  • “One thread equals peak bandwidth” -> Many systems require multiple threads.

Check-Your-Understanding Questions

  1. Why must arrays be larger than the LLC for a DRAM bandwidth test?
  2. What is the difference between COPY and TRIAD in terms of memory traffic?
  3. How does write-allocate affect measured bandwidth?

Check-Your-Understanding Answers

  1. Otherwise the data stays in cache and you measure cache bandwidth.
  2. TRIAD has more read/write streams, stressing memory more than COPY.
  3. It causes extra reads for writes, inflating memory traffic.

Real-World Applications

  • HPC system qualification and procurement.
  • Tuning memory-intensive services like analytics engines.
  • Evaluating NUMA interleaving policies for throughput.

Where You’ll Apply It

References

  • “Computer Architecture: A Quantitative Approach, 5th Edition” (Hennessy & Patterson) – Ch. 2
  • “Computer Systems: A Programmer’s Perspective” (Bryant & O’Hallaron) – Ch. 5

Key Insights

Bandwidth is only visible when you feed memory controllers with large, predictable streams.

Summary

STREAM kernels provide a standard way to measure memory throughput. With large working sets, careful timing, and compiler-safe loops, you can reliably quantify bandwidth and compare NUMA placement strategies.

Homework/Exercises to Practice the Concept

  1. Run COPY with a 1 MB array and a 1 GB array and compare GB/s.
  2. Implement non-temporal stores and compare TRIAD performance.
  3. Try different iteration counts and observe timing stability.

Solutions to the Homework/Exercises

  1. The 1 GB array will show lower bandwidth because it hits DRAM.
  2. Non-temporal stores often improve write-heavy kernels by avoiding write-allocate.
  3. Longer runs reduce timing noise and stabilize results.

2.2 NUMA Placement and Multi-Thread Scaling

Fundamentals

NUMA placement determines where memory pages are allocated and which CPUs access them. For bandwidth benchmarks, this matters because each NUMA node has its own memory channels. Binding memory to a single node limits the bandwidth to that node’s channels, while interleaving across nodes can increase aggregate bandwidth by using more channels. Thread placement also matters: to fully utilize interleaved memory, threads should be distributed across nodes. Understanding these effects allows you to interpret why interleaving may boost GB/s in multi-threaded runs but can degrade single-thread latency.

Deep Dive into the Concept

Linux offers several NUMA policies: bind (allocate only on specified nodes), preferred (try a node but fall back), and interleave (round-robin across nodes). The choice of policy directly affects bandwidth measurements. If you bind memory to node 0 and run a single thread on node 0, you measure the bandwidth of that node’s memory controller. If you run multiple threads on node 0 but still bind memory to node 0, bandwidth may saturate after a few threads because you are limited by a single node’s channels. If you interleave memory across nodes but keep all threads on node 0, each thread must traverse the interconnect for remote pages, which can raise latency and sometimes reduce overall bandwidth.

The highest aggregate bandwidth usually comes from distributing both memory and threads across nodes. A common strategy is to run one thread per core, with cores distributed across sockets, while memory is interleaved across nodes. This allows each memory controller to contribute. But the exact scaling depends on interconnect bandwidth and the number of channels. On some systems, a single node already saturates memory with a small number of threads. Your benchmark should therefore report not only aggregate GB/s but also per-thread bandwidth and scaling efficiency.

First-touch policy adds another layer. If you allocate memory without explicit binding, the pages are placed on the node where they are first accessed. This means initialization order can determine placement. A robust benchmark should either explicitly bind pages with numa_alloc_onnode/mbind or deliberately initialize pages on specific nodes to model first-touch. Providing a --policy flag that supports bind, interleave, and first-touch is ideal.
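
First-touch behavior can be modeled without libnuma: let each worker initialize its own slice so the kernel places those pages on the worker's node. A pthreads sketch with illustrative names; for deterministic placement each worker would additionally be pinned to its node before touching its slice:

```c
#include <pthread.h>
#include <stddef.h>

struct chunk { double *base; size_t begin, end; };

/* Each thread writes its own slice first, so under first-touch the
 * backing pages land on the node where that thread runs. */
static void *touch_chunk(void *arg) {
    struct chunk *c = arg;
    for (size_t i = c->begin; i < c->end; i++)
        c->base[i] = 0.0;
    return NULL;
}

/* Split [0, n) across nthreads workers and let each fault in its part. */
void first_touch_init(double *a, size_t n, int nthreads) {
    pthread_t tid[64];
    struct chunk ck[64];
    if (nthreads < 1 || nthreads > 64) return;   /* sketch-only limit */
    size_t per = n / (size_t)nthreads;
    for (int t = 0; t < nthreads; t++) {
        ck[t].base  = a;
        ck[t].begin = per * (size_t)t;
        ck[t].end   = (t == nthreads - 1) ? n : per * (size_t)(t + 1);
        pthread_create(&tid[t], NULL, touch_chunk, &ck[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```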

Finally, the benchmark must respect cpusets and containers. If the process is restricted to a subset of CPUs or nodes, interleave may not use all nodes. This is not a bug, but the benchmark should surface the effective node set in its output. This ensures users do not misinterpret limited bandwidth as a hardware issue.

How This Fits the Project

This concept guides how you run your benchmark across nodes and threads, and how you interpret the results. It also informs your CLI flags and reporting.

Definitions & Key Terms

  • Bind policy -> Allocate memory only on specified nodes.
  • Interleave policy -> Round-robin memory allocation across nodes.
  • Preferred policy -> Try a node but allow fallback.
  • First-touch -> Allocation on the node that first accesses a page.
  • Scaling efficiency -> Bandwidth per thread relative to ideal linear scaling.

Mental Model Diagram (ASCII)

Thread placement:
Node 0: CPU0 CPU1 CPU2 CPU3
Node 1: CPU8 CPU9 CPU10 CPU11

Memory policy:
Bind -> all pages on Node 0
Interleave -> pages round-robin Node0, Node1

Bandwidth outcome:
Bind + local threads -> limited by Node0 channels
Interleave + distributed threads -> higher aggregate GB/s

How It Works (Step-by-Step)

  1. Choose a NUMA policy (bind, interleave, first-touch).
  2. Allocate and initialize arrays according to policy.
  3. Pin threads to CPUs across nodes (or to a single node).
  4. Run STREAM kernels and measure GB/s.
  5. Report per-node and aggregate bandwidth.

Invariants: Policy must be applied before touching pages.

Failure modes: Unpinned threads, incorrect initialization, cpuset restrictions.

Minimal Concrete Example

// Bind memory allocations to node 0 (strict bind; numa_set_preferred(0)
// would be the softer "preferred" policy with fallback)
struct bitmask *nodes = numa_allocate_nodemask();
numa_bitmask_setbit(nodes, 0);
numa_set_membind(nodes);

// Pin the calling thread to CPU 0
cpu_set_t set; CPU_ZERO(&set); CPU_SET(0, &set);
sched_setaffinity(0, sizeof(set), &set);
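
Distributing workers across nodes comes down to pinning each thread to a chosen CPU; a Linux-specific sketch (`pin_thread_to_cpu` is an illustrative name, and the round-robin CPU order in the comment assumes the two-node topology drawn above):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the given thread to a single CPU. Assigning worker threads
 * round-robin over CPUs from different nodes (e.g. 0, 8, 1, 9, ...)
 * lets every memory controller contribute to aggregate bandwidth. */
int pin_thread_to_cpu(pthread_t t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(t, sizeof(set), &set);
}
```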

Common Misconceptions

  • “Interleave always faster” -> For single-threaded runs, interleave can hurt.
  • “First-touch is deterministic” -> It depends on initialization order and thread placement.
  • “All cores equal” -> Remote cores see higher access cost.

Check-Your-Understanding Questions

  1. Why might interleave increase bandwidth but worsen latency?
  2. What happens if you initialize arrays on node 1 but run threads on node 0?
  3. How do cpusets affect interleave policies?

Check-Your-Understanding Answers

  1. Interleave uses more channels but introduces remote accesses.
  2. Memory pages will be remote to the threads, reducing performance.
  3. Interleave only uses nodes visible to the process.

Real-World Applications

  • Memory tuning for analytic workloads.
  • Evaluating hardware upgrades for bandwidth-bound workloads.

Where You’ll Apply It

References

  • “Operating Systems: Three Easy Pieces” – Ch. 13-21
  • “Computer Architecture” – Ch. 2

Key Insights

Bandwidth is a system-wide resource; placement decides which memory controllers you actually use.

Summary

NUMA placement policies determine how memory is distributed across nodes, and thread placement determines who consumes that memory. For bandwidth measurement, the best results usually come from distributing both threads and memory across nodes and reporting scaling efficiency rather than only aggregate GB/s.

Homework/Exercises to Practice the Concept

  1. Measure bandwidth with bind policy and with interleave policy.
  2. Run one thread per node and compare to all threads on node 0.
  3. Simulate first-touch by initializing arrays with different thread placements.

Solutions to the Homework/Exercises

  1. Interleave should increase aggregate GB/s if multiple nodes are used.
  2. Distributing threads usually yields higher bandwidth.
  3. First-touch will place pages on the initializing thread’s node, changing results.

3. Project Specification

3.1 What You Will Build

A CLI benchmark bandwidth-bench that implements STREAM kernels (copy, scale, add, triad) and reports bandwidth under different NUMA placement and threading configurations.

Included: STREAM kernels, placement control, thread scaling, JSON output. Excluded: GPU bandwidth, disk I/O throughput.

3.2 Functional Requirements

  1. Implement COPY, SCALE, ADD, TRIAD kernels.
  2. Allocate arrays larger than LLC (configurable).
  3. Support bind/interleave/first-touch policies.
  4. Pin threads to specific CPUs or nodes.
  5. Measure runtime and compute GB/s.
  6. Run multiple trials and report median.
  7. Provide JSON output with per-kernel results.
  8. Support deterministic seed for initialization order.

3.3 Non-Functional Requirements

  • Performance: Overhead < 5% of kernel time.
  • Reliability: Results stable across runs (variance < 5%).
  • Usability: Clear output and explanations of policy/placement.

3.4 Example Usage / Output

$ numactl --cpunodebind=0 --membind=0 ./bandwidth-bench
COPY:  38.2 GB/s
SCALE: 36.9 GB/s
ADD:   35.1 GB/s
TRIAD: 34.8 GB/s

3.5 Data Formats / Schemas / Protocols

JSON output:

{
  "policy": "bind",
  "node": 0,
  "threads": 1,
  "kernels": {
    "copy": 38.2,
    "scale": 36.9,
    "add": 35.1,
    "triad": 34.8
  },
  "units": "GB/s"
}
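
The schema above is flat enough to render with `snprintf`; a sketch with an illustrative function name:

```c
#include <stdio.h>
#include <stddef.h>

/* Render one result record as JSON into buf; returns the number of
 * characters written (excluding the NUL), or a negative value on error. */
int render_json(char *buf, size_t cap, const char *policy, int node,
                int threads, double copy, double scale, double add,
                double triad) {
    return snprintf(buf, cap,
        "{\"policy\":\"%s\",\"node\":%d,\"threads\":%d,"
        "\"kernels\":{\"copy\":%.1f,\"scale\":%.1f,"
        "\"add\":%.1f,\"triad\":%.1f},\"units\":\"GB/s\"}",
        policy, node, threads, copy, scale, add, triad);
}
```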

3.6 Edge Cases

  • Arrays smaller than LLC (warn user).
  • Single-node UMA system.
  • Thread count exceeds available CPUs.
  • Interleave policy with only one visible node.

3.7 Real World Outcome

You can compare bandwidth under different NUMA policies and thread placements, producing a reproducible GB/s report to guide performance tuning.

3.7.1 How to Run (Copy/Paste)

cc -O2 -Wall -pthread -o bandwidth-bench src/main.c -lnuma
./bandwidth-bench --size-mb=1024 --threads=8 --policy=interleave

3.7.2 Golden Path Demo (Deterministic)

$ ./bandwidth-bench --size-mb=1024 --threads=8 --policy=interleave --seed=42
COPY:  60.4 GB/s
SCALE: 58.7 GB/s
ADD:   56.9 GB/s
TRIAD: 56.1 GB/s

3.7.3 If CLI: Exact Terminal Transcript

$ ./bandwidth-bench --size-mb=1024 --threads=8 --policy=bind --node=0
COPY:  38.2 GB/s
SCALE: 36.9 GB/s
ADD:   35.1 GB/s
TRIAD: 34.8 GB/s

3.7.4 Failure Demo (Deterministic)

$ ./bandwidth-bench --threads=999
ERROR: thread count exceeds available CPUs (max=16)
EXIT: 1

Exit Codes:

  • 0 success
  • 1 invalid arguments
  • 2 allocation failure
  • 3 policy/affinity error
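
A sketch of how argument validation can map onto these exit codes (the enum names are illustrative):

```c
#include <unistd.h>

enum { EXIT_OK = 0, EXIT_BADARGS = 1, EXIT_ALLOC = 2, EXIT_POLICY = 3 };

/* Reject thread counts the machine cannot satisfy (exit code 1). */
int check_thread_count(long requested) {
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);  /* CPUs currently online */
    if (requested < 1 || requested > ncpu)
        return EXIT_BADARGS;
    return EXIT_OK;
}
```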

4. Solution Architecture

4.1 High-Level Design

+-------------+   +--------------+   +------------------+
| Allocator   |-->| Kernel Runner |-->| Report Renderer  |
+-------------+   +--------------+   +------------------+
        ^                 ^                    ^
        |                 |                    |
+-------------+     +--------------+     +---------------+
| Policy Ctrl |---->| Thread Pool  |     | JSON Encoder  |
+-------------+     +--------------+     +---------------+

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| Policy Controller | Apply bind/interleave/first-touch | libnuma vs numactl |
| Kernel Runner | Execute STREAM kernels | fixed iterations vs timed |
| Thread Manager | Pin threads to nodes | static vs dynamic affinity |
| Reporter | Emit GB/s per kernel | median aggregation |

4.3 Data Structures (No Full Code)

typedef struct {
    double copy_gbs;
    double scale_gbs;
    double add_gbs;
    double triad_gbs;
} BandwidthResult;

4.4 Algorithm Overview

Key Algorithm: Bandwidth Measurement

  1. Allocate arrays and initialize them.
  2. Apply policy and pin threads.
  3. Run each kernel for fixed iterations.
  4. Measure time and compute GB/s.
  5. Repeat trials and report median.
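
Step 5's median aggregation can be sketched as:

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

static int cmp_double(const void *x, const void *y) {
    double a = *(const double *)x, b = *(const double *)y;
    return (a > b) - (a < b);
}

/* Median of n trial results; sorts a private copy so the caller's
 * trial order (useful for logging) is preserved. Assumes n <= 64. */
double median(const double *trials, size_t n) {
    double tmp[64];
    memcpy(tmp, trials, n * sizeof(double));
    qsort(tmp, n, sizeof(double), cmp_double);
    return (n % 2) ? tmp[n / 2]
                   : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
}
```

The median is preferred over the mean here because a single trial disturbed by another process would otherwise skew the reported GB/s.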

Complexity Analysis:

  • Time: O(N) per kernel
  • Space: O(N) for arrays

5. Implementation Guide

5.1 Development Environment Setup

sudo apt-get install -y build-essential libnuma-dev

5.2 Project Structure

bandwidth-bench/
|-- src/
|   |-- main.c
|   |-- kernels.c
|   |-- policy.c
|   |-- timing.c
|   +-- report.c
|-- tests/
|   +-- fixtures/
|-- Makefile
+-- README.md

5.3 The Core Question You’re Answering

“How does memory placement change achievable bandwidth?”

5.4 Concepts You Must Understand First

  1. STREAM kernels and memory traffic.
  2. Cache sizes and working set selection.
  3. NUMA policies and thread placement.

5.5 Questions to Guide Your Design

  1. How will you ensure arrays exceed LLC size?
  2. Will you time by iterations or by duration?
  3. How will you distribute threads across nodes?

5.6 Thinking Exercise

If you interleave memory but run all threads on node 0, what happens to bandwidth and why?

5.7 The Interview Questions They’ll Ask

  1. “What does the STREAM benchmark measure?”
  2. “Why do you need large arrays for bandwidth measurement?”
  3. “How does interleaving affect throughput?”

5.8 Hints in Layers

Hint 1: Start with a single-thread COPY kernel.

Hint 2: Add multi-threading and pin threads per node.

Hint 3: Add interleave policy and compare results.

Hint 4: Report variance across trials to show stability.

5.9 Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Memory hierarchy | “Computer Architecture” | Ch. 2 |
| Performance measurement | “Computer Systems: A Programmer’s Perspective” | Ch. 5 |
| OS policies | “Operating Systems: Three Easy Pieces” | Ch. 13-21 |

5.10 Implementation Phases

Phase 1: Kernels + Timing (3-4 days)

Goals: Implement STREAM kernels with accurate timing.

Tasks:

  1. Implement copy/scale/add/triad loops.
  2. Build timing helpers.

Checkpoint: Single-thread bandwidth results are stable.

Phase 2: Placement + Threads (3-4 days)

Goals: Add NUMA policy support and thread pinning.

Tasks:

  1. Implement bind/interleave policy support.
  2. Add thread manager with affinity.

Checkpoint: Multi-thread runs show bandwidth scaling.

Phase 3: Reporting + JSON (2-3 days)

Goals: Provide deterministic output and JSON.

Tasks:

  1. Add JSON schema and output.
  2. Add seed-based deterministic initialization.

Checkpoint: Golden demo output is reproducible.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Timing | fixed iterations vs timed | fixed iterations | deterministic output |
| Policy | bind vs interleave | support both | compare throughput effects |
| Thread placement | local-only vs distributed | distributed | maximize bandwidth |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate kernel correctness | small arrays with known output |
| Integration Tests | Validate policy application | bind vs interleave differences |
| Edge Case Tests | Handle small arrays | warn when < LLC |

6.2 Critical Test Cases

  1. Arrays smaller than LLC trigger a warning.
  2. Invalid thread count exits with code 1.
  3. Interleave policy on multi-node system increases aggregate bandwidth.

6.3 Test Data

size_mb=1024, threads=8, seed=42
expected_bandwidth > 50 GB/s (example)

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---|---|---|
| Arrays too small | Unusually high GB/s | Increase size beyond LLC |
| Unpinned threads | Unstable results | Pin threads |
| Compiler optimizations | Loops optimized away | Use volatile or checksum |

7.2 Debugging Strategies

  • Use perf stat to confirm cache misses are high.
  • Compare results with known STREAM numbers for similar hardware.

7.3 Performance Traps

  • Over-threading can saturate memory and reduce per-thread performance.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add CSV output for plotting.
  • Add --warmup option.

8.2 Intermediate Extensions

  • Add non-temporal store option.
  • Add per-node bandwidth breakdown.

8.3 Advanced Extensions

  • Measure bandwidth scaling curve from 1..N threads.
  • Integrate hardware counters for memory traffic.

9. Real-World Connections

9.1 Industry Applications

  • Capacity planning for memory-bound services.
  • NUMA tuning for analytics engines.

9.2 Related Tools

  • STREAM – the canonical bandwidth benchmark.
  • likwid-bench – microbenchmarking suite.

9.3 Interview Relevance

  • Explaining why interleaving can boost throughput but hurt latency.

10. Resources

10.1 Essential Reading

  • “Computer Architecture” – Ch. 2
  • “Computer Systems: A Programmer’s Perspective” – Ch. 5

10.2 Video Resources

  • Memory bandwidth tuning talks.

10.3 Tools & Documentation

  • numactl – policy control.
  • lscpu – topology overview.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain bandwidth vs latency.
  • I can explain why arrays must exceed LLC.
  • I can explain interleave vs bind.

11.2 Implementation

  • Kernels produce correct results.
  • JSON output matches schema.
  • Results are stable across runs.

11.3 Growth

  • I can explain these results to a performance engineer.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • STREAM kernels run and report GB/s.
  • Single-node and multi-node modes work.

Full Completion:

  • Supports bind/interleave/first-touch.
  • Deterministic output with fixed seed.

Excellence (Going Above & Beyond):

  • Non-temporal store optimization and scaling curves.