Project 4: Cache Locality Visualizer
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1-2 weeks |
| Primary Language | C |
| Alternative Languages | C++, Rust, Zig |
| Knowledge Area | CPU Caches and Memory Access |
| Tools Required | perf, cachegrind (optional) |
| Primary Reference | "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron |
Learning Objectives
By completing this project, you will be able to:
- Explain cache line behavior including how data is fetched, evicted, and replaced
- Predict cache miss patterns for different data layouts and access sequences
- Measure cache performance using hardware counters (L1, L2, L3 hits/misses)
- Design cache-friendly data structures that maximize spatial and temporal locality
- Quantify the performance impact of cache misses in real workloads
- Apply cache optimization techniques like blocking, prefetching, and structure splitting
Deep Theoretical Foundation
The Memory Wall Problem
Modern CPUs can execute billions of instructions per second, but a latency-bound stream of DRAM accesses delivers only tens of millions of cache lines per second. This gap of several orders of magnitude, the "memory wall," makes cache behavior the dominant factor in many programs' performance.
Speed Comparison (2025 approximate):
| Component | Latency | Bandwidth |
|---|---|---|
| CPU Register | 0 cycles | 100+ GB/s |
| L1 Cache (32 KB) | 4 cycles | 1000+ GB/s |
| L2 Cache (256 KB) | 12 cycles | 400+ GB/s |
| L3 Cache (8-32 MB) | 40 cycles | 200+ GB/s |
| DRAM | 100+ cycles | 50-100 GB/s |
| NVMe SSD | 10,000+ cycles | 5-10 GB/s |
A cache miss to DRAM costs 100+ cycles, enough time to execute 100+ instructions. If 10% of your memory accesses go all the way to DRAM, you can lose half your performance to waiting.
Cache Organization
Cache Lines
Memory is not accessed byte-by-byte. Caches fetch data in fixed-size chunks called cache lines, typically 64 bytes on x86-64:
Memory Address: 0x00007FFF12345678
Cache Line:     0x00007FFF12345640   (the address with its low 6 bits cleared)
When you access address 0x...678, the CPU fetches:
0x...640, 0x...648, 0x...650, 0x...658, 0x...660, 0x...668, 0x...670, 0x...678
(8 x 8-byte words = 64 bytes)
Implication: Accessing one byte loads 64 bytes. If you use them all, efficient. If you use one and move on, you wasted 63 bytes of bandwidth.
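To make the line granularity concrete, here is a minimal sketch (assuming 64-byte lines, the common case on x86-64) that computes which cache line an address belongs to:

// Sketch: find the cache line base and the byte offset within it (64-byte lines assumed).
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uintptr_t addr = 0x00007FFF12345678u;
    uintptr_t line = addr & ~(uintptr_t)63;   // clear the low 6 bits -> line base address
    uintptr_t off  = addr &  (uintptr_t)63;   // byte offset within the line
    printf("address 0x%llx -> line 0x%llx, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)line, (unsigned long long)off);
    return 0;
}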
Set-Associative Structure
Caches are organized as sets of ways:
8-way Set-Associative L1 Cache (32 KB):
- 64 sets
- 8 ways per set
- Each way holds one 64-byte line
- Total: 64 × 8 × 64 = 32,768 bytes
Address → Set Index → Check 8 ways for match
The set is determined by address bits. Multiple addresses can map to the same set, competing for limited ways. This causes conflict misses.
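For the 32 KB, 8-way, 64-byte-line cache described above, the set index is just bits 6-11 of the address; a minimal sketch:

// Sketch: set index for a 32 KB, 8-way L1 with 64-byte lines (64 sets).
// Bits [5:0] pick the byte within the line; bits [11:6] pick the set.
#include <stdint.h>

static unsigned l1_set_index(uintptr_t addr) {
    return (unsigned)((addr >> 6) & 63);   // (addr / 64) % 64
}
// Addresses that differ by a multiple of 4096 bytes (64 sets x 64 bytes) land in the
// same set and compete for its 8 ways -- the source of conflict misses.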
Cache Coherence (Multi-Core)
When multiple cores access shared data, caches must stay synchronized. The MESI protocol defines states:
- Modified: This cache has the only valid copy, modified
- Exclusive: This cache has the only valid copy, clean
- Shared: Multiple caches have valid copies
- Invalid: This cache line is stale
False sharing occurs when different cores write to different variables that happen to sit on the same cache line, forcing expensive coherence traffic even though there is no logical sharing.
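A common remedy is to give each thread's hot data its own cache line; a minimal C11 sketch (assuming 64-byte lines):

// Sketch: pad/align per-thread counters so they never share a cache line (C11).
#include <stdalign.h>
#include <stdint.h>

struct counter_packed { uint64_t value; };                 // adjacent counters share a line
struct counter_padded { alignas(64) uint64_t value; };     // each counter starts a new line

struct counter_packed packed[8];   // 8 counters squeezed into a single 64-byte line
struct counter_padded padded[8];   // 8 counters, one cache line apart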
Types of Locality
Spatial Locality: If you access address X, you'll likely soon access X+1, X+2, etc.
Examples of good spatial locality:
// Good: Sequential array access
for (int i = 0; i < n; i++) {
sum += array[i]; // Accesses array[0], array[1], array[2], ...
}
// Each 64-byte cache line serves 8 consecutive elements (assuming 8-byte elements)
Examples of poor spatial locality:
// Bad: Strided access
for (int i = 0; i < n; i += 64) {
sum += array[i]; // Accesses array[0], array[64], array[128], ...
}
// Each fetched cache line serves only 1 element; with 8-byte elements, 7/8 of it is wasted
Temporal Locality: If you access address X, you'll likely access X again soon.
Examples of good temporal locality:
// Good: Reusing data while in cache
for (int iter = 0; iter < 1000; iter++) {
for (int i = 0; i < 100; i++) {
sum += array[i]; // 100-element array fits in L1
}
}
Examples of poor temporal locality:
// Bad: Data accessed once then evicted
for (int i = 0; i < 10000000; i++) {
sum += huge_array[i]; // Array much larger than cache
}
// By the time we loop back, data is evicted
The Row-Major vs Column-Major Problem
C arrays are row-major: array[row][col] stores rows contiguously.
Logical view (3 x 4 matrix):
  [0][0] [0][1] [0][2] [0][3]
  [1][0] [1][1] [1][2] [1][3]
  [2][0] [2][1] [2][2] [2][3]

Memory layout (rows laid out back to back):
  [0][0] [0][1] [0][2] [0][3] [1][0] [1][1] [1][2] [1][3] [2][0] [2][1] [2][2] [2][3] ...
Row-major traversal (good):
for (int row = 0; row < N; row++) {
for (int col = 0; col < M; col++) {
sum += matrix[row][col]; // Sequential in memory
}
}
Column-major traversal (bad):
for (int col = 0; col < M; col++) {
for (int row = 0; row < N; row++) {
sum += matrix[row][col]; // Jumps by M elements each access
}
}
For a 1000×1000 matrix of 8-byte doubles:
- Row-major: 1 cache miss per 8 elements = 125,000 misses
- Column-major: 1 cache miss per element = 1,000,000 misses
- 8x difference from loop order alone
Complete Project Specification
What You're Building
An experimental toolkit called cache_lab that:
- Generates controlled workloads with different access patterns
- Measures cache behavior using hardware performance counters
- Visualizes cache access patterns showing hits and misses
- Compares data layouts quantifying performance differences
- Produces reports connecting patterns to performance
Functional Requirements
cache_lab run --pattern <name> --size <MB> --iters <n>
cache_lab compare --patterns <list> --sizes <list>
cache_lab visualize --pattern <name> --output <html>
cache_lab report --output <report.md>
Access Patterns to Implement
- sequential: Access array elements 0, 1, 2, 3, ...
- reverse: Access array elements n-1, n-2, n-3, ...
- stride-N: Access every Nth element
- random: Access elements in random order
- row-major: 2D array, row-by-row
- column-major: 2D array, column-by-column
- blocked: 2D array with cache-friendly blocking
Example Output
Cache Locality Report
---------------------------------------------------
Data size: 64 MB (exceeds L3)
Iterations: 10
CPU: Intel Core i7-12700K
Pattern Comparison:
---------------------------------------------------
Pattern        Time       L1 Miss%   L3 Miss%   Bandwidth
sequential     120 ms     0.4%       0.1%       533 MB/s
stride-64      845 ms     99.2%      12.1%      76 MB/s
random         1823 ms    98.7%      34.2%      35 MB/s
row-major      125 ms     0.5%       0.2%       512 MB/s
column-major   982 ms     94.1%      18.4%      65 MB/s
blocked        142 ms     1.2%       0.3%       451 MB/s
Analysis:
---------------------------------------------------
- Sequential is 15x faster than random due to spatial locality
- Row-major is 7.8x faster than column-major (same algorithm!)
- Stride-64 lands on a new cache line every access, so essentially every access misses L1
- Blocked access trades some L1 efficiency for better L3 utilization
Visualization saved: cache_patterns.html
Solution Architecture
Component Design
+---------------------------------------------------------------+
|                         CLI Interface                         |
|             Pattern selection, size configuration             |
+-------------------------------+-------------------------------+
                                |
              +-----------------+-----------------+
              |                 |                 |
              v                 v                 v
      +--------------+   +--------------+   +--------------+
      |   Pattern    |   |   Counter    |   |    Timer     |
      |  Generator   |   |  Collector   |   |    Engine    |
      |              |   |    (perf)    |   |              |
      +------+-------+   +------+-------+   +------+-------+
             |                  |                  |
             +------------------+------------------+
                                |
                                v
               +-------------------------------+
               |       Measurement Loop        |
               |  Generate indices -> Access   |
               |      -> Collect counters      |
               +---------------+---------------+
                               |
                               v
               +-------------------------------+
               |       Analysis & Report       |
               |  Compare patterns, visualize  |
               +-------------------------------+
Key Data Structures
// Access pattern generator
typedef void (*index_generator_t)(size_t *indices, size_t count, size_t stride);
typedef struct {
const char *name;
index_generator_t generator;
const char *description;
} pattern_t;
// Cache measurement result
typedef struct {
double wall_time_ms;
uint64_t l1_loads;
uint64_t l1_misses;
uint64_t l2_loads;
uint64_t l2_misses;
uint64_t llc_loads; // Last-level cache (L3)
uint64_t llc_misses;
} cache_result_t;
// Experiment configuration
typedef struct {
const char *pattern_name;
size_t data_size_bytes;
size_t element_size;
int iterations;
int warmup_iterations;
} experiment_config_t;
Pattern Generator Examples
// Sequential access
void generate_sequential(size_t *indices, size_t count, size_t stride) {
for (size_t i = 0; i < count; i++) {
indices[i] = i;
}
}
// Strided access (every Nth element)
void generate_strided(size_t *indices, size_t count, size_t stride) {
size_t idx = 0;
for (size_t i = 0; i < count; i++) {
indices[i] = idx;
idx = (idx + stride) % count;
}
}
// Random access (Fisher-Yates shuffle of a sequential ordering; rand() needs <stdlib.h> and srand() seeding)
void generate_random(size_t *indices, size_t count, size_t stride) {
generate_sequential(indices, count, 0);
for (size_t i = count - 1; i > 0; i--) {
size_t j = rand() % (i + 1);
size_t tmp = indices[i];
indices[i] = indices[j];
indices[j] = tmp;
}
}
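These generators can then be registered in the pattern_t table from the previous section; a small sketch (the exact entries are illustrative, not a fixed requirement):

// Sketch: wiring the generators into a pattern registry (names/descriptions illustrative).
static const pattern_t patterns[] = {
    { "sequential", generate_sequential, "Access elements 0, 1, 2, ..."      },
    { "stride-N",   generate_strided,    "Access every Nth element"          },
    { "random",     generate_random,     "Access elements in shuffled order" },
};
static const size_t pattern_count = sizeof(patterns) / sizeof(patterns[0]);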
Phased Implementation Guide
Phase 1: Basic Access Pattern Experiments (Days 1-3)
Goal: Measure timing difference between sequential and random access.
Steps:
- Allocate a large array (larger than L3 cache)
- Implement sequential and random index generators
- Measure wall-clock time for each pattern
- Calculate and report bandwidth (MB/s)
Validation: Random should be 10-20x slower than sequential.
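A minimal timing sketch for this phase (CLOCK_MONOTONIC via clock_gettime; allocation of the data and index arrays is assumed to happen elsewhere):

// Sketch: time one pass over the data through an index array; the caller derives bandwidth.
#include <stddef.h>
#include <stdint.h>
#include <time.h>

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

uint64_t run_pass(const uint64_t *data, const size_t *indices, size_t count, double *out_ms) {
    uint64_t sum = 0;                 // keep a live result so the reads are not optimized away
    double t0 = now_ms();
    for (size_t i = 0; i < count; i++)
        sum += data[indices[i]];
    *out_ms = now_ms() - t0;
    return sum;
}
// Bandwidth (MB/s) = (count * sizeof(uint64_t)) / (elapsed_ms / 1000.0) / 1e6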
Phase 2: Hardware Counter Integration (Days 4-6)
Goal: Add L1, L2, L3 cache miss measurements.
Steps:
- Use perf_event_open for cache counters (see the sketch after this phase):
  - PERF_COUNT_HW_CACHE_L1D with read/miss
  - PERF_COUNT_HW_CACHE_LL with read/miss
- Correlate miss rates with timing
- Verify sequential has ~0% miss rate
Validation: Stride-64 should have ~100% L1 miss rate.
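A minimal sketch of opening one such counter with perf_event_open (Linux-only; error handling omitted; this one counts L1D read misses):

// Sketch: count L1D read misses around a region of code (Linux, perf_event_open).
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_l1d_read_miss_counter(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_L1D
                | (PERF_COUNT_HW_CACHE_OP_READ << 8)
                | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    // pid = 0 (this process), cpu = -1 (any CPU), no group, no flags
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

// Usage around the measurement loop:
//   int fd = open_l1d_read_miss_counter();
//   ioctl(fd, PERF_EVENT_IOC_RESET, 0);
//   ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
//   ... run the access pattern ...
//   ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
//   uint64_t misses;
//   read(fd, &misses, sizeof(misses));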
Phase 3: 2D Array Experiments (Days 7-9)
Goal: Demonstrate row-major vs column-major impact.
Steps:
- Create 2D matrix traversal workloads
- Implement row-major and column-major loops
- Implement blocked traversal for comparison
- Measure and compare all three
Validation: Column-major should be 5-10x slower than row-major.
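A sketch of the blocked traversal for this phase (the tile size is a tunable assumption; for a pure sum it should roughly match row-major, and the payoff grows for kernels that revisit tiles):

// Sketch: blocked (tiled) traversal of an N x M matrix of doubles, stored row-major.
#include <stddef.h>

#define TILE 64   // tile edge in elements; tune so the working set of tiles stays cache-resident

double sum_blocked(size_t N, size_t M, const double *matrix) {
    double sum = 0.0;
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t jj = 0; jj < M; jj += TILE)
            for (size_t i = ii; i < N && i < ii + TILE; i++)       // walk one tile
                for (size_t j = jj; j < M && j < jj + TILE; j++)   // row-major inside the tile
                    sum += matrix[i * M + j];
    return sum;
}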
Phase 4: Size Scaling Study (Days 10-12)
Goal: Show performance cliffs as data exceeds cache levels.
Steps:
- Run sequential pattern across data sizes:
- 16 KB (fits L1)
- 128 KB (fits L2)
- 4 MB (fits L3)
- 64 MB (exceeds L3)
- Plot bandwidth vs size
- Identify cache level boundaries
Validation: Should see 3 distinct performance plateaus.
Phase 5: Visualization and Reporting (Days 13-14)
Goal: Create visual representation of access patterns.
Steps:
- Generate access pattern diagrams
- Create HTML/SVG visualization of cache line access
- Produce summary report with all experiments
- Add recommendations based on patterns
Validation: Visualization clearly shows good vs bad locality.
Testing Strategy
Micro-Benchmarks
- Cache line verification: Access every 64th byte, verify ~100% L1 miss
- L1 size boundary: Find exact size where L1 miss rate jumps
- Prefetch effectiveness: Compare with/without software prefetch hints
Validation Experiments
- Cross-validate with Cachegrind: Compare miss counts
- perf stat comparison: Verify counters match perf output
- Known algorithms: Test matrix multiply with different loop orders
Edge Cases
- Huge pages: Do 2MB pages change results?
- NUMA effects: What about cross-socket access?
- TLB misses: Separate TLB from cache effects
Common Pitfalls and Debugging
Pitfall 1: Compiler Optimizes Away Access
Symptom: All patterns run at same speed (fast).
Cause: Compiler sees unused reads and eliminates them.
Solution:
// Bad: Result unused
for (int i = 0; i < n; i++) {
data[indices[i]]; // Compiler may eliminate
}
// Good: Accumulate and use result
volatile uint64_t sum = 0;
for (int i = 0; i < n; i++) {
sum += data[indices[i]];
}
return sum;
Pitfall 2: Prefetcher Masks Randomness
Symptom: "Random" access isn't as slow as expected.
Cause: Modern CPUs have stream prefetchers that can detect some patterns.
Solutions:
- Use truly random order (Fisher-Yates shuffle)
- Use pointer chasing, where each element stores the index of the next (see the sketch after this list)
- Disable the hardware prefetcher (Intel-specific MSR; requires root):
  wrmsr 0x1a4 0xf
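A sketch of the pointer-chasing idea from the list above: build a random cycle of indices, then follow it, so every load address depends on the previous load and the prefetcher has nothing to predict:

// Sketch: pointer chasing through a random cycle; each load depends on the previous one.
#include <stdlib.h>

// Build next[] so that starting at index 0 and following it visits every slot exactly once.
void build_chase_cycle(size_t *next, size_t count) {
    size_t *order = malloc(count * sizeof(*order));
    for (size_t i = 0; i < count; i++) order[i] = i;
    for (size_t i = count - 1; i > 0; i--) {               // Fisher-Yates shuffle
        size_t j = rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (size_t i = 0; i < count; i++)                     // link the permutation into one cycle
        next[order[i]] = order[(i + 1) % count];
    free(order);
}

size_t chase(const size_t *next, size_t count) {
    size_t idx = 0;
    for (size_t i = 0; i < count; i++)
        idx = next[idx];     // serialized loads: latency, not bandwidth, is what gets measured
    return idx;
}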
Pitfall 3: NUMA Effects Dominate
Symptom: Results vary wildly between runs.
Cause: On multi-socket systems, memory may be on remote node.
Solution:
# Pin to single NUMA node
numactl --membind=0 --cpunodebind=0 ./cache_lab
Pitfall 4: Page Faults in Measurement
Symptom: First iteration is 100x slower.
Cause: Memory pages not allocated until first touch.
Solution:
// Pre-fault all pages before measurement
memset(data, 0, data_size); // Forces page allocation
Extensions and Challenges
Extension 1: Cache Blocking for Matrix Operations
Implement blocked matrix multiply:
// Naive: O(n³) but cache-hostile
for (i) for (j) for (k) C[i][j] += A[i][k] * B[k][j];
// Blocked: Same O(n³) but cache-friendly
for (ii, block) for (jj, block) for (kk, block)
for (i in block) for (j in block) for (k in block)
C[i][j] += A[i][k] * B[k][j];
Demonstrate 2-10x speedup from blocking alone.
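A concrete sketch of the blocked version (square row-major matrices; the block size is a tunable assumption, chosen here so three tiles of doubles fit in a typical 32 KB L1; C must be zeroed by the caller):

// Sketch: blocked matrix multiply, C += A * B, with n x n row-major matrices of doubles.
#include <stddef.h>

#define MM_BLOCK 32   // three 32x32 tiles of doubles ~ 24 KB, comfortably inside a 32 KB L1

void matmul_blocked(size_t n, const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += MM_BLOCK)
        for (size_t kk = 0; kk < n; kk += MM_BLOCK)
            for (size_t jj = 0; jj < n; jj += MM_BLOCK)
                // Multiply the (ii,kk) tile of A by the (kk,jj) tile of B into the (ii,jj) tile of C.
                for (size_t i = ii; i < n && i < ii + MM_BLOCK; i++)
                    for (size_t k = kk; k < n && k < kk + MM_BLOCK; k++) {
                        double a_ik = A[i * n + k];
                        for (size_t j = jj; j < n && j < jj + MM_BLOCK; j++)
                            C[i * n + j] += a_ik * B[k * n + j];
                    }
}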
Extension 2: False Sharing Detector
Create multi-threaded workload where:
- Each thread writes to its own variable
- Variables are on same cache line (false sharing)
- Move to separate cache lines (padding)
- Measure speedup from eliminating false sharing
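A minimal sketch of the two workloads (pthreads; the padded variant reuses the alignas(64) idea from the false-sharing discussion; thread and iteration counts are arbitrary choices):

// Sketch: identical increment loops, with and without false sharing (C11 + pthreads).
#include <pthread.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define NTHREADS 4
#define ITERS    100000000UL

static volatile uint64_t packed_counters[NTHREADS];                            // likely one shared line
static struct { alignas(64) volatile uint64_t v; } padded_counters[NTHREADS];  // one line per thread

static void *bump_packed(void *arg) {
    size_t id = (size_t)arg;
    for (unsigned long i = 0; i < ITERS; i++)
        packed_counters[id]++;     // neighbouring threads keep invalidating this line
    return NULL;
}

static void *bump_padded(void *arg) {
    size_t id = (size_t)arg;
    for (unsigned long i = 0; i < ITERS; i++)
        padded_counters[id].v++;   // no other thread ever touches this line
    return NULL;
}

// Launch NTHREADS threads on each variant, e.g.
//   pthread_create(&tid[i], NULL, bump_packed, (void *)(uintptr_t)i);
// and time both; eliminating false sharing typically gives a several-fold speedup.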
Extension 3: Software Prefetch Experiments
Test explicit prefetch instructions:
for (int i = 0; i < n; i++) {
__builtin_prefetch(&data[i + 64], 0, 3); // Prefetch 64 elements ahead (read, high temporal locality)
sum += data[i];
}
Measure when prefetching helps vs hurts.
Challenge: Cache-Oblivious Algorithm
Implement a cache-oblivious algorithm (like funnel sort or recursive matrix multiply) and demonstrate it achieves good performance across different cache sizes without tuning.
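As one possible starting point (not the only valid choice), here is a sketch of a cache-oblivious out-of-place transpose: it recursively halves the larger dimension, so sub-problems eventually fit in every cache level without knowing any cache size:

// Sketch: cache-oblivious transpose, B = A^T, by recursive subdivision.
#include <stddef.h>

#define CUTOFF 16   // small base case; below this the tile fits in any realistic cache

// A is rows x cols with row stride lda; B is cols x rows with row stride ldb.
void transpose_rec(size_t rows, size_t cols,
                   const double *A, size_t lda, double *B, size_t ldb) {
    if (rows <= CUTOFF && cols <= CUTOFF) {
        for (size_t i = 0; i < rows; i++)
            for (size_t j = 0; j < cols; j++)
                B[j * ldb + i] = A[i * lda + j];
    } else if (rows >= cols) {                 // split the taller dimension
        size_t half = rows / 2;
        transpose_rec(half, cols, A, lda, B, ldb);
        transpose_rec(rows - half, cols, A + half * lda, lda, B + half, ldb);
    } else {                                   // split the wider dimension
        size_t half = cols / 2;
        transpose_rec(rows, half, A, lda, B, ldb);
        transpose_rec(rows, cols - half, A + half, lda, B + half * ldb, ldb);
    }
}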
Real-World Connections
Where Cache Locality Matters
- Database Query Engines: Column stores vs row stores, a choice of cache-friendly layout
- Game Engines: Entity-Component-System design for cache-friendly iteration
- Scientific Computing: Loop tiling in BLAS/LAPACK
- Machine Learning: Tensor memory layouts in PyTorch/TensorFlow
Industry Examples
- Google Dense Hash Map: Cache-optimized hash table layout
- Facebook Folly F14: Hash table whose chunked layout is designed around cache lines
- Intel MKL: Auto-tuned cache blocking for matrix operations
- Redis: Cache-line-aligned data structures
Self-Assessment Checklist
Before considering this project complete, verify:
- You can explain what a cache line is and why it matters
- You can predict miss rates for different access patterns
- You measured row-major vs column-major difference (should be 5-10x)
- You can identify cache level boundaries from size scaling experiments
- Your measurements match expected cache behavior
- You can explain when and why blocking helps
- You understand false sharing and how to prevent it
Resources
Essential Reading
- "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron, Chapter 6
- "What Every Programmer Should Know About Memory" by Ulrich Drepper
- "Optimizing Software in C++" by Agner Fog, Chapter 5
Reference Documentation
- Intel 64 and IA-32 Optimization Reference Manual
- AMD64 Architecture Programmer's Manual, Volume 2
- Linux perf wiki: Hardware cache events
Tools
- perf stat: Cache counter measurement
- Cachegrind (Valgrind): Cache simulation
- Intel VTune: Detailed cache analysis
- likwid: Performance counter suite