Project 4: Cache Locality Visualizer
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1-2 weeks |
| Primary Language | C |
| Alternative Languages | C++, Rust, Zig |
| Knowledge Area | CPU Caches and Memory Access |
| Tools Required | perf, cachegrind (optional) |
| Primary Reference | "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron |
Learning Objectives
By completing this project, you will be able to:
- Explain cache line behavior including how data is fetched, evicted, and replaced
- Predict cache miss patterns for different data layouts and access sequences
- Measure cache performance using hardware counters (L1, L2, L3 hits/misses)
- Design cache-friendly data structures that maximize spatial and temporal locality
- Quantify the performance impact of cache misses in real workloads
- Apply cache optimization techniques like blocking, prefetching, and structure splitting
Deep Theoretical Foundation
The Memory Wall Problem
Modern CPUs can execute billions of instructions per second, but a latency-bound stream of DRAM accesses delivers only tens of millions of cache lines per second. This gap of several orders of magnitude, the "memory wall," makes cache behavior the dominant factor in many programs' performance.
Speed Comparison (2025 approximate):
| Component | Latency | Bandwidth |
|---|---|---|
| CPU Register | 0 cycles | 100+ GB/s |
| L1 Cache (32 KB) | 4 cycles | 1000+ GB/s |
| L2 Cache (256 KB) | 12 cycles | 400+ GB/s |
| L3 Cache (8-32 MB) | 40 cycles | 200+ GB/s |
| DRAM | 100+ cycles | 50-100 GB/s |
| NVMe SSD | 10,000+ cycles | 5-10 GB/s |
A cache miss to DRAM costs 100+ cycles, enough time to execute 100+ instructions. If 10% of your memory accesses go all the way to DRAM, you can lose half your performance to waiting.
Cache Organization
Cache Lines
Memory is not accessed byte-by-byte. Caches fetch data in fixed-size chunks called cache lines, typically 64 bytes on x86-64:
Memory Address: 0x00007FFF12345678
Cache Line:     0x00007FFF12345640   (the address with its low 6 bits cleared)
When you access address 0x...678, the CPU fetches:
0x...640, 0x...648, 0x...650, 0x...658, 0x...660, 0x...668, 0x...670, 0x...678
(8 x 8-byte words = 64 bytes)
Implication: Accessing one byte loads 64 bytes. If you use them all, efficient. If you use one and move on, you wasted 63 bytes of bandwidth.
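To make the line granularity concrete, here is a minimal sketch (assuming 64-byte lines, the common case on x86-64) that computes which cache line an address belongs to:

// Sketch: find the cache line base and the byte offset within it (64-byte lines assumed).
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uintptr_t addr = 0x00007FFF12345678u;
    uintptr_t line = addr & ~(uintptr_t)63;   // clear the low 6 bits -> line base address
    uintptr_t off  = addr &  (uintptr_t)63;   // byte offset within the line
    printf("address 0x%llx -> line 0x%llx, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)line, (unsigned long long)off);
    return 0;
}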
Set-Associative Structure
Caches are organized as sets of ways:
8-way Set-Associative L1 Cache (32 KB):
- 64 sets
- 8 ways per set
- Each way holds one 64-byte line
- Total: 64 × 8 × 64 = 32,768 bytes
Address → Set Index → Check 8 ways for match
The set is determined by address bits. Multiple addresses can map to the same set, competing for limited ways. This causes conflict misses.
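For the 32 KB, 8-way, 64-byte-line cache described above, the set index is just bits 6-11 of the address; a minimal sketch:

// Sketch: set index for a 32 KB, 8-way L1 with 64-byte lines (64 sets).
// Bits [5:0] pick the byte within the line; bits [11:6] pick the set.
#include <stdint.h>

static unsigned l1_set_index(uintptr_t addr) {
    return (unsigned)((addr >> 6) & 63);   // (addr / 64) % 64
}
// Addresses that differ by a multiple of 4096 bytes (64 sets x 64 bytes) land in the
// same set and compete for its 8 ways -- the source of conflict misses.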
Cache Coherence (Multi-Core)
When multiple cores access shared data, caches must stay synchronized. The MESI protocol defines states:
- Modified: This cache has the only valid copy, modified
- Exclusive: This cache has the only valid copy, clean
- Shared: Multiple caches have valid copies
- Invalid: This cache line is stale
False sharing occurs when different cores write to different variables that happen to sit on the same cache line, forcing expensive coherence traffic even though there is no logical sharing.
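A common remedy is to give each thread's hot data its own cache line; a minimal C11 sketch (assuming 64-byte lines):

// Sketch: pad/align per-thread counters so they never share a cache line (C11).
#include <stdalign.h>
#include <stdint.h>

struct counter_packed { uint64_t value; };                 // adjacent counters share a line
struct counter_padded { alignas(64) uint64_t value; };     // each counter starts a new line

struct counter_packed packed[8];   // 8 counters squeezed into a single 64-byte line
struct counter_padded padded[8];   // 8 counters, one cache line apart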
Types of Locality
Spatial Locality: If you access address X, you'll likely soon access X+1, X+2, etc.
Examples of good spatial locality:
// Good: Sequential array access
for (int i = 0; i < n; i++) {
sum += array[i]; // Accesses array[0], array[1], array[2], ...
}
// Each 64-byte cache line serves 8 consecutive elements (assuming 8-byte elements)
Examples of poor spatial locality:
// Bad: Strided access
for (int i = 0; i < n; i += 64) {
sum += array[i]; // Accesses array[0], array[64], array[128], ...
}
// Each fetched cache line serves only 1 element; with 8-byte elements, 7/8 of it is wasted
Temporal Locality: If you access address X, you'll likely access X again soon.
Examples of good temporal locality:
// Good: Reusing data while in cache
for (int iter = 0; iter < 1000; iter++) {
for (int i = 0; i < 100; i++) {
sum += array[i]; // 100-element array fits in L1
}
}
Examples of poor temporal locality:
// Bad: Data accessed once then evicted
for (int i = 0; i < 10000000; i++) {
sum += huge_array[i]; // Array much larger than cache
}
// By the time we loop back, data is evicted
The Row-Major vs Column-Major Problem
C arrays are row-major: array[row][col] stores rows contiguously.
Logical view (3 x 4 matrix):
  [0][0] [0][1] [0][2] [0][3]
  [1][0] [1][1] [1][2] [1][3]
  [2][0] [2][1] [2][2] [2][3]

Memory layout (rows laid out back to back):
  [0][0] [0][1] [0][2] [0][3] [1][0] [1][1] [1][2] [1][3] [2][0] [2][1] [2][2] [2][3] ...
Row-major traversal (good):
for (int row = 0; row < N; row++) {
for (int col = 0; col < M; col++) {
sum += matrix[row][col]; // Sequential in memory
}
}
Column-major traversal (bad):
for (int col = 0; col < M; col++) {
for (int row = 0; row < N; row++) {
sum += matrix[row][col]; // Jumps by M elements each access
}
}
For a 1000×1000 matrix of 8-byte doubles:
- Row-major: 1 cache miss per 8 elements = 125,000 misses
- Column-major: 1 cache miss per element = 1,000,000 misses
- 8x difference from loop order alone
Complete Project Specification
What You're Building
An experimental toolkit called cache_lab that:
- Generates controlled workloads with different access patterns
- Measures cache behavior using hardware performance counters
- Visualizes cache access patterns showing hits and misses
- Compares data layouts quantifying performance differences
- Produces reports connecting patterns to performance
Functional Requirements
cache_lab run --pattern <name> --size <MB> --iters <n>
cache_lab compare --patterns <list> --sizes <list>
cache_lab visualize --pattern <name> --output <html>
cache_lab report --output <report.md>
Access Patterns to Implement
- sequential: Access array elements 0, 1, 2, 3, ...
- reverse: Access array elements n-1, n-2, n-3, ...
- stride-N: Access every Nth element
- random: Access elements in random order
- row-major: 2D array, row-by-row
- column-major: 2D array, column-by-column
- blocked: 2D array with cache-friendly blocking
Example Output
Cache Locality Report
---------------------------------------------------
Data size: 64 MB (exceeds L3)
Iterations: 10
CPU: Intel Core i7-12700K
Pattern Comparison:
---------------------------------------------------
Pattern        Time       L1 Miss%   L3 Miss%   Bandwidth
sequential     120 ms     0.4%       0.1%       533 MB/s
stride-64      845 ms     99.2%      12.1%      76 MB/s
random         1823 ms    98.7%      34.2%      35 MB/s
row-major      125 ms     0.5%       0.2%       512 MB/s
column-major   982 ms     94.1%      18.4%      65 MB/s
blocked        142 ms     1.2%       0.3%       451 MB/s
Analysis:
---------------------------------------------------
- Sequential is 15x faster than random due to spatial locality
- Row-major is 7.8x faster than column-major (same algorithm!)
- Stride-64 lands on a new cache line every access, so essentially every access misses L1
- Blocked access trades some L1 efficiency for better L3 utilization
Visualization saved: cache_patterns.html
Solution Architecture
Component Design
+---------------------------------------------------------------+
|                         CLI Interface                         |
|             Pattern selection, size configuration             |
+-------------------------------+-------------------------------+
                                |
              +-----------------+-----------------+
              |                 |                 |
              v                 v                 v
      +--------------+   +--------------+   +--------------+
      |   Pattern    |   |   Counter    |   |    Timer     |
      |  Generator   |   |  Collector   |   |    Engine    |
      |              |   |    (perf)    |   |              |
      +------+-------+   +------+-------+   +------+-------+
             |                  |                  |
             +------------------+------------------+
                                |
                                v
               +-------------------------------+
               |       Measurement Loop        |
               |  Generate indices -> Access   |
               |      -> Collect counters      |
               +---------------+---------------+
                               |
                               v
               +-------------------------------+
               |       Analysis & Report       |
               |  Compare patterns, visualize  |
               +-------------------------------+
Key Data Structures
// Access pattern generator
typedef void (*index_generator_t)(size_t *indices, size_t count, size_t stride);
typedef struct {
const char *name;
index_generator_t generator;
const char *description;
} pattern_t;
// Cache measurement result
typedef struct {
double wall_time_ms;
uint64_t l1_loads;
uint64_t l1_misses;
uint64_t l2_loads;
uint64_t l2_misses;
uint64_t llc_loads; // Last-level cache (L3)
uint64_t llc_misses;
} cache_result_t;
// Experiment configuration
typedef struct {
const char *pattern_name;
size_t data_size_bytes;
size_t element_size;
int iterations;
int warmup_iterations;
} experiment_config_t;
Pattern Generator Examples
// Sequential access
void generate_sequential(size_t *indices, size_t count, size_t stride) {
for (size_t i = 0; i < count; i++) {
indices[i] = i;
}
}
// Strided access (every Nth element)
void generate_strided(size_t *indices, size_t count, size_t stride) {
size_t idx = 0;
for (size_t i = 0; i < count; i++) {
indices[i] = idx;
idx = (idx + stride) % count;
}
}
// Random access (Fisher-Yates shuffle of a sequential ordering; rand() needs <stdlib.h> and srand() seeding)
void generate_random(size_t *indices, size_t count, size_t stride) {
generate_sequential(indices, count, 0);
for (size_t i = count - 1; i > 0; i--) {
size_t j = rand() % (i + 1);
size_t tmp = indices[i];
indices[i] = indices[j];
indices[j] = tmp;
}
}
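These generators can then be registered in the pattern_t table from the previous section; a small sketch (the exact entries are illustrative, not a fixed requirement):

// Sketch: wiring the generators into a pattern registry (names/descriptions illustrative).
static const pattern_t patterns[] = {
    { "sequential", generate_sequential, "Access elements 0, 1, 2, ..."      },
    { "stride-N",   generate_strided,    "Access every Nth element"          },
    { "random",     generate_random,     "Access elements in shuffled order" },
};
static const size_t pattern_count = sizeof(patterns) / sizeof(patterns[0]);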
Phased Implementation Guide
Phase 1: Basic Access Pattern Experiments (Days 1-3)
Goal: Measure timing difference between sequential and random access.
Steps:
- Allocate a large array (larger than L3 cache)
- Implement sequential and random index generators
- Measure wall-clock time for each pattern
- Calculate and report bandwidth (MB/s)
Validation: Random should be 10-20x slower than sequential.
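A minimal timing sketch for this phase (CLOCK_MONOTONIC via clock_gettime; allocation of the data and index arrays is assumed to happen elsewhere):

// Sketch: time one pass over the data through an index array; the caller derives bandwidth.
#include <stddef.h>
#include <stdint.h>
#include <time.h>

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

uint64_t run_pass(const uint64_t *data, const size_t *indices, size_t count, double *out_ms) {
    uint64_t sum = 0;                 // keep a live result so the reads are not optimized away
    double t0 = now_ms();
    for (size_t i = 0; i < count; i++)
        sum += data[indices[i]];
    *out_ms = now_ms() - t0;
    return sum;
}
// Bandwidth (MB/s) = (count * sizeof(uint64_t)) / (elapsed_ms / 1000.0) / 1e6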
Phase 2: Hardware Counter Integration (Days 4-6)
Goal: Add L1, L2, L3 cache miss measurements.
Steps:
- Use perf_event_open for cache counters (see the sketch after this phase):
  - PERF_COUNT_HW_CACHE_L1D with read/miss
  - PERF_COUNT_HW_CACHE_LL with read/miss
- Correlate miss rates with timing
- Verify sequential has ~0% miss rate
Validation: Stride-64 should have ~100% L1 miss rate.
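A minimal sketch of opening one such counter with perf_event_open (Linux-only; error handling omitted; this one counts L1D read misses):

// Sketch: count L1D read misses around a region of code (Linux, perf_event_open).
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_l1d_read_miss_counter(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_L1D
                | (PERF_COUNT_HW_CACHE_OP_READ << 8)
                | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    // pid = 0 (this process), cpu = -1 (any CPU), no group, no flags
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

// Usage around the measurement loop:
//   int fd = open_l1d_read_miss_counter();
//   ioctl(fd, PERF_EVENT_IOC_RESET, 0);
//   ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
//   ... run the access pattern ...
//   ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
//   uint64_t misses;
//   read(fd, &misses, sizeof(misses));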
Phase 3: 2D Array Experiments (Days 7-9)
Goal: Demonstrate row-major vs column-major impact.
Steps:
- Create 2D matrix traversal workloads
- Implement row-major and column-major loops
- Implement blocked traversal for comparison
- Measure and compare all three
Validation: Column-major should be 5-10x slower than row-major.
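A sketch of the blocked traversal for this phase (the tile size is a tunable assumption; for a pure sum it should roughly match row-major, and the payoff grows for kernels that revisit tiles):

// Sketch: blocked (tiled) traversal of an N x M matrix of doubles, stored row-major.
#include <stddef.h>

#define TILE 64   // tile edge in elements; tune so the working set of tiles stays cache-resident

double sum_blocked(size_t N, size_t M, const double *matrix) {
    double sum = 0.0;
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t jj = 0; jj < M; jj += TILE)
            for (size_t i = ii; i < N && i < ii + TILE; i++)       // walk one tile
                for (size_t j = jj; j < M && j < jj + TILE; j++)   // row-major inside the tile
                    sum += matrix[i * M + j];
    return sum;
}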
Phase 4: Size Scaling Study (Days 10-12)
Goal: Show performance cliffs as data exceeds cache levels.
Steps:
- Run sequential pattern across data sizes:
- 16 KB (fits L1)
- 128 KB (fits L2)
- 4 MB (fits L3)
- 64 MB (exceeds L3)
- Plot bandwidth vs size
- Identify cache level boundaries
Validation: Should see 3 distinct performance plateaus.
Phase 5: Visualization and Reporting (Days 13-14)
Goal: Create visual representation of access patterns.
Steps:
- Generate access pattern diagrams
- Create HTML/SVG visualization of cache line access
- Produce summary report with all experiments
- Add recommendations based on patterns
Validation: Visualization clearly shows good vs bad locality.
Testing Strategy
Micro-Benchmarks
- Cache line verification: Access every 64th byte, verify ~100% L1 miss
- L1 size boundary: Find exact size where L1 miss rate jumps
- Prefetch effectiveness: Compare with/without software prefetch hints
Validation Experiments
- Cross-validate with Cachegrind: Compare miss counts
- perf stat comparison: Verify counters match perf output
- Known algorithms: Test matrix multiply with different loop orders
Edge Cases
- Huge pages: Do 2MB pages change results?
- NUMA effects: What about cross-socket access?
- TLB misses: Separate TLB from cache effects
Common Pitfalls and Debugging
Pitfall 1: Compiler Optimizes Away Access
Symptom: All patterns run at same speed (fast).
Cause: Compiler sees unused reads and eliminates them.
Solution:
// Bad: Result unused
for (int i = 0; i < n; i++) {
data[indices[i]]; // Compiler may eliminate
}
// Good: Accumulate and use result
volatile uint64_t sum = 0;
for (int i = 0; i < n; i++) {
sum += data[indices[i]];
}
return sum;
Pitfall 2: Prefetcher Masks Randomness
Symptom: "Random" access isn't as slow as expected.
Cause: Modern CPUs have stream prefetchers that can detect some patterns.
Solutions:
- Use truly random order (Fisher-Yates shuffle)
- Use pointer chasing, where each element stores the index of the next (see the sketch after this list)
- Disable the hardware prefetcher (Intel-specific MSR; requires root):
  wrmsr 0x1a4 0xf
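A sketch of the pointer-chasing idea from the list above: build a random cycle of indices, then follow it, so every load address depends on the previous load and the prefetcher has nothing to predict:

// Sketch: pointer chasing through a random cycle; each load depends on the previous one.
#include <stdlib.h>

// Build next[] so that starting at index 0 and following it visits every slot exactly once.
void build_chase_cycle(size_t *next, size_t count) {
    size_t *order = malloc(count * sizeof(*order));
    for (size_t i = 0; i < count; i++) order[i] = i;
    for (size_t i = count - 1; i > 0; i--) {               // Fisher-Yates shuffle
        size_t j = rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (size_t i = 0; i < count; i++)                     // link the permutation into one cycle
        next[order[i]] = order[(i + 1) % count];
    free(order);
}

size_t chase(const size_t *next, size_t count) {
    size_t idx = 0;
    for (size_t i = 0; i < count; i++)
        idx = next[idx];     // serialized loads: latency, not bandwidth, is what gets measured
    return idx;
}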
Pitfall 3: NUMA Effects Dominate
Symptom: Results vary wildly between runs.
Cause: On multi-socket systems, memory may be on remote node.
Solution:
# Pin to single NUMA node
numactl --membind=0 --cpunodebind=0 ./cache_lab
Pitfall 4: Page Faults in Measurement
Symptom: First iteration is 100x slower.
Cause: Memory pages not allocated until first touch.
Solution:
// Pre-fault all pages before measurement
memset(data, 0, data_size); // Forces page allocation
Extensions and Challenges
Extension 1: Cache Blocking for Matrix Operations
Implement blocked matrix multiply:
// Naive: O(n³) but cache-hostile
for (i) for (j) for (k) C[i][j] += A[i][k] * B[k][j];
// Blocked: Same O(n³) but cache-friendly
for (ii, block) for (jj, block) for (kk, block)
for (i in block) for (j in block) for (k in block)
C[i][j] += A[i][k] * B[k][j];
Demonstrate 2-10x speedup from blocking alone.
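A concrete sketch of the blocked version (square row-major matrices; the block size is a tunable assumption, chosen here so three tiles of doubles fit in a typical 32 KB L1; C must be zeroed by the caller):

// Sketch: blocked matrix multiply, C += A * B, with n x n row-major matrices of doubles.
#include <stddef.h>

#define MM_BLOCK 32   // three 32x32 tiles of doubles ~ 24 KB, comfortably inside a 32 KB L1

void matmul_blocked(size_t n, const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += MM_BLOCK)
        for (size_t kk = 0; kk < n; kk += MM_BLOCK)
            for (size_t jj = 0; jj < n; jj += MM_BLOCK)
                // Multiply the (ii,kk) tile of A by the (kk,jj) tile of B into the (ii,jj) tile of C.
                for (size_t i = ii; i < n && i < ii + MM_BLOCK; i++)
                    for (size_t k = kk; k < n && k < kk + MM_BLOCK; k++) {
                        double a_ik = A[i * n + k];
                        for (size_t j = jj; j < n && j < jj + MM_BLOCK; j++)
                            C[i * n + j] += a_ik * B[k * n + j];
                    }
}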
Extension 2: False Sharing Detector
Create multi-threaded workload where:
- Each thread writes to its own variable
- Variables are on same cache line (false sharing)
- Move to separate cache lines (padding)
- Measure speedup from eliminating false sharing
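A minimal sketch of the two workloads (pthreads; the padded variant reuses the alignas(64) idea from the false-sharing discussion; thread and iteration counts are arbitrary choices):

// Sketch: identical increment loops, with and without false sharing (C11 + pthreads).
#include <pthread.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define NTHREADS 4
#define ITERS    100000000UL

static volatile uint64_t packed_counters[NTHREADS];                            // likely one shared line
static struct { alignas(64) volatile uint64_t v; } padded_counters[NTHREADS];  // one line per thread

static void *bump_packed(void *arg) {
    size_t id = (size_t)arg;
    for (unsigned long i = 0; i < ITERS; i++)
        packed_counters[id]++;     // neighbouring threads keep invalidating this line
    return NULL;
}

static void *bump_padded(void *arg) {
    size_t id = (size_t)arg;
    for (unsigned long i = 0; i < ITERS; i++)
        padded_counters[id].v++;   // no other thread ever touches this line
    return NULL;
}

// Launch NTHREADS threads on each variant, e.g.
//   pthread_create(&tid[i], NULL, bump_packed, (void *)(uintptr_t)i);
// and time both; eliminating false sharing typically gives a several-fold speedup.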
Extension 3: Software Prefetch Experiments
Test explicit prefetch instructions:
for (int i = 0; i < n; i++) {
__builtin_prefetch(&data[i + 64], 0, 3); // Prefetch 64 elements ahead (read, high temporal locality)
sum += data[i];
}
Measure when prefetching helps vs hurts.
Challenge: Cache-Oblivious Algorithm
Implement a cache-oblivious algorithm (like funnel sort or recursive matrix multiply) and demonstrate it achieves good performance across different cache sizes without tuning.
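As one possible starting point (not the only valid choice), here is a sketch of a cache-oblivious out-of-place transpose: it recursively halves the larger dimension, so sub-problems eventually fit in every cache level without knowing any cache size:

// Sketch: cache-oblivious transpose, B = A^T, by recursive subdivision.
#include <stddef.h>

#define CUTOFF 16   // small base case; below this the tile fits in any realistic cache

// A is rows x cols with row stride lda; B is cols x rows with row stride ldb.
void transpose_rec(size_t rows, size_t cols,
                   const double *A, size_t lda, double *B, size_t ldb) {
    if (rows <= CUTOFF && cols <= CUTOFF) {
        for (size_t i = 0; i < rows; i++)
            for (size_t j = 0; j < cols; j++)
                B[j * ldb + i] = A[i * lda + j];
    } else if (rows >= cols) {                 // split the taller dimension
        size_t half = rows / 2;
        transpose_rec(half, cols, A, lda, B, ldb);
        transpose_rec(rows - half, cols, A + half * lda, lda, B + half, ldb);
    } else {                                   // split the wider dimension
        size_t half = cols / 2;
        transpose_rec(rows, half, A, lda, B, ldb);
        transpose_rec(rows, cols - half, A + half, lda, B + half * ldb, ldb);
    }
}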
Real-World Connections
Where Cache Locality Matters
- Database Query Engines: Column stores vs row stores, a choice of cache-friendly layout
- Game Engines: Entity-Component-System design for cache-friendly iteration
- Scientific Computing: Loop tiling in BLAS/LAPACK
- Machine Learning: Tensor memory layouts in PyTorch/TensorFlow
Industry Examples
- Google Dense Hash Map: Cache-optimized hash table layout
- Facebook Folly F14: Hash table whose chunked layout is designed around cache lines
- Intel MKL: Auto-tuned cache blocking for matrix operations
- Redis: Cache-line-aligned data structures
Self-Assessment Checklist
Before considering this project complete, verify:
- You can explain what a cache line is and why it matters
- You can predict miss rates for different access patterns
- You measured row-major vs column-major difference (should be 5-10x)
- You can identify cache level boundaries from size scaling experiments
- Your measurements match expected cache behavior
- You can explain when and why blocking helps
- You understand false sharing and how to prevent it
Resources
Essential Reading
- "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron, Chapter 6
- "What Every Programmer Should Know About Memory" by Ulrich Drepper
- "Optimizing Software in C++" by Agner Fog, Chapter 5
Reference Documentation
- Intel 64 and IA-32 Optimization Reference Manual
- AMD64 Architecture Programmer's Manual, Volume 2
- Linux perf wiki: Hardware cache events
Tools
- perf stat: Cache counter measurement
- Cachegrind (Valgrind): Cache simulation
- Intel VTune: Detailed cache analysis
- likwid: Performance counter suite