Project 4: Cache Locality Visualizer

Project Overview

Attribute              Details
─────────────────────  ──────────────────────────────────────────────
Difficulty             Advanced
Time Estimate          1-2 weeks
Primary Language       C
Alternative Languages  C++, Rust, Zig
Knowledge Area         CPU Caches and Memory Access
Tools Required         perf, cachegrind (optional)
Primary Reference      "Computer Systems: A Programmer's Perspective"
                       by Bryant & O'Hallaron

Learning Objectives

By completing this project, you will be able to:

  1. Explain cache line behavior including how data is fetched, evicted, and replaced
  2. Predict cache miss patterns for different data layouts and access sequences
  3. Measure cache performance using hardware counters (L1, L2, L3 hits/misses)
  4. Design cache-friendly data structures that maximize spatial and temporal locality
  5. Quantify the performance impact of cache misses in real workloads
  6. Apply cache optimization techniques like blocking, prefetching, and structure splitting

Deep Theoretical Foundation

The Memory Wall Problem

Modern CPUs can execute billions of instructions per second, but DRAM can only deliver millions of cache lines per second. This 1000x gap, the "memory wall", makes cache behavior the dominant factor in many programs' performance.

Speed Comparison (2025 approximate):
┌────────────────────┬─────────────┬────────────────────┐
│ Component          │ Latency     │ Bandwidth          │
├────────────────────┼─────────────┼────────────────────┤
│ CPU Register       │ 0 cycles    │ 100+ GB/s          │
│ L1 Cache (32 KB)   │ 4 cycles    │ 1000+ GB/s         │
│ L2 Cache (256 KB)  │ 12 cycles   │ 400+ GB/s          │
│ L3 Cache (8-32 MB) │ 40 cycles   │ 200+ GB/s          │
│ DRAM               │ 100+ cycles │ 50-100 GB/s        │
│ NVMe SSD           │ 10,000+ cyc │ 5-10 GB/s          │
└────────────────────┴─────────────┴────────────────────┘

A cache miss to DRAM costs 100+ cycles, time in which the CPU could have executed 100+ instructions. If even 10% of your memory accesses miss the cache, you can easily lose half your performance to waiting.

Cache Organization

Cache Lines

Memory is not accessed byte-by-byte. Caches fetch data in fixed-size chunks called cache lines, typically 64 bytes on x86-64:

Memory Address: 0x00007FFF12345678
                         └──┬──┘
                   Cache Line: 0x00007FFF12345640

When you access address 0x...678, the CPU fetches:
0x...640, 0x...648, 0x...650, 0x...658, 0x...660, 0x...668, 0x...670, 0x...678
(8 x 8-byte words = 64 bytes)

Implication: Accessing one byte loads 64 bytes. If you use them all, the fetch is efficient; if you touch one byte and move on, you have wasted 63 bytes of bandwidth.
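The mapping from address to line is a simple mask; a quick sketch for 64-byte lines:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uintptr_t addr = (uintptr_t)0x00007FFF12345678ull;
    uintptr_t line = addr & ~(uintptr_t)63;  // clear low 6 bits -> 0x...640
    printf("line base = %#lx, offset in line = %lu\n",
           (unsigned long)line, (unsigned long)(addr & 63));  // offset 56
    return 0;
}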

Set-Associative Structure

Caches are organized as sets of ways:

8-way Set-Associative L1 Cache (32 KB):
- 64 sets
- 8 ways per set
- Each way holds one 64-byte line
- Total: 64 × 8 × 64 = 32,768 bytes

Address → Set Index → Check 8 ways for match

The set is determined by address bits. Multiple addresses can map to the same set, competing for limited ways. This causes conflict misses.
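A minimal sketch of that index math, assuming 64-byte lines and the 64-set geometry above:

#include <stdint.h>

// 32 KB, 8-way, 64-byte lines: 6 offset bits, 6 set-index bits, rest tag.
#define LINE_BITS 6
#define NUM_SETS  64

static inline unsigned set_index(uintptr_t addr) {
    return (unsigned)((addr >> LINE_BITS) & (NUM_SETS - 1));
}
// Addresses 64 sets x 64 bytes = 4096 bytes apart share a set, so nine or
// more hot lines spaced 4 KB apart will thrash an 8-way set (conflict misses).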

Cache Coherence (Multi-Core)

When multiple cores access shared data, caches must stay synchronized. The MESI protocol defines states:

  • Modified: This cache has the only valid copy, modified
  • Exclusive: This cache has the only valid copy, clean
  • Shared: Multiple caches have valid copies
  • Invalid: This cache line is stale

False sharing occurs when different cores write to different variables on the same cache line, forcing expensive cache coherence traffic even though there's no logical sharing.
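The standard fix is to give each writer its own line via alignment or padding; a minimal sketch using C11 alignas, assuming 64-byte lines:

#include <stdalign.h>
#include <stdint.h>

// Bad: eight 8-byte counters pack into a single 64-byte line, so writes
// from different cores invalidate each other's copies.
struct counters_shared {
    uint64_t per_thread[8];
};

// Good: alignas(64) pads each counter out to its own line, so writes
// from different cores never touch the same line.
struct counter_padded {
    alignas(64) uint64_t value;  // sizeof(struct counter_padded) == 64
};
struct counters_padded {
    struct counter_padded per_thread[8];
};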

Types of Locality

Spatial Locality: If you access address X, you'll likely soon access X+1, X+2, etc.

Examples of good spatial locality:

// Good: Sequential array access
for (int i = 0; i < n; i++) {
    sum += array[i];  // Accesses array[0], array[1], array[2], ...
}
// Each 64-byte line fetched serves 8 consecutive 8-byte elements

Examples of poor spatial locality:

// Bad: Strided access
for (int i = 0; i < n; i += 64) {
    sum += array[i];  // Accesses array[0], array[64], array[128], ...
}
// With 8-byte elements, each fetched line serves only 1 element (7/8 of its bytes wasted)

Temporal Locality: If you access address X, you'll likely access X again soon.

Examples of good temporal locality:

// Good: Reusing data while in cache
for (int iter = 0; iter < 1000; iter++) {
    for (int i = 0; i < 100; i++) {
        sum += array[i];  // 100-element array fits in L1
    }
}

Examples of poor temporal locality:

// Bad: Data accessed once then evicted
for (int i = 0; i < 10000000; i++) {  // 10 million elements
    sum += huge_array[i];  // Array much larger than cache
}
// By the time we loop back, data is evicted

The Row-Major vs Column-Major Problem

C arrays are row-major: array[row][col] stores rows contiguously.

Logical view (3×4):               Memory layout (rows laid end-to-end):
┌─────────────────────────────┐
│ [0][0] [0][1] [0][2] [0][3] │   [0][0] [0][1] [0][2] [0][3]
│ [1][0] [1][1] [1][2] [1][3] │   [1][0] [1][1] [1][2] [1][3]
│ [2][0] [2][1] [2][2] [2][3] │   [2][0] [2][1] [2][2] [2][3] ...
└─────────────────────────────┘

Row-major traversal (good):

for (int row = 0; row < N; row++) {
    for (int col = 0; col < M; col++) {
        sum += matrix[row][col];  // Sequential in memory
    }
}

Column-major traversal (bad):

for (int col = 0; col < M; col++) {
    for (int row = 0; row < N; row++) {
        sum += matrix[row][col];  // Jumps by M elements each access
    }
}

For a 1000×1000 matrix of 8-byte doubles:

  • Row-major: 1 cache miss per 8 elements = 125,000 misses
  • Column-major: 1 cache miss per element = 1,000,000 misses
  • 8x difference from loop order alone

Complete Project Specification

What Youโ€™re Building

An experimental toolkit called cache_lab that:

  1. Generates controlled workloads with different access patterns
  2. Measures cache behavior using hardware performance counters
  3. Visualizes cache access patterns showing hits and misses
  4. Compares data layouts quantifying performance differences
  5. Produces reports connecting patterns to performance

Functional Requirements

cache_lab run --pattern <name> --size <MB> --iters <n>
cache_lab compare --patterns <list> --sizes <list>
cache_lab visualize --pattern <name> --output <html>
cache_lab report --output <report.md>

Access Patterns to Implement

  1. sequential: Access array elements 0, 1, 2, 3, ...
  2. reverse: Access array elements n-1, n-2, n-3, ...
  3. stride-N: Access every Nth element
  4. random: Access elements in random order
  5. row-major: 2D array, row-by-row
  6. column-major: 2D array, column-by-column
  7. blocked: 2D array with cache-friendly blocking

Example Output

Cache Locality Report
═══════════════════════════════════════════════════
Data size: 64 MB (exceeds L3)
Iterations: 10
CPU: Intel Core i7-12700K

Pattern Comparison:
─────────────────────────────────────────────────────
Pattern          Time     L1 Miss%   L3 Miss%   Bandwidth
sequential       120 ms   0.4%       0.1%       533 MB/s
stride-64        845 ms   99.2%      12.1%      76 MB/s
random           1823 ms  98.7%      34.2%      35 MB/s
row-major        125 ms   0.5%       0.2%       512 MB/s
column-major     982 ms   94.1%      18.4%      65 MB/s
blocked          142 ms   1.2%       0.3%       451 MB/s

Analysis:
─────────────────────────────────────────────────────
• Sequential is 15x faster than random due to spatial locality
• Row-major is 7.8x faster than column-major (same algorithm!)
• Stride-64 matches the cache line size (64 bytes), guaranteeing 1 miss/access
• Blocked access trades some L1 efficiency for better L3 utilization

Visualization saved: cache_patterns.html

Solution Architecture

Component Design

┌──────────────────────────────────────────────────────────────┐
│                    CLI Interface                             │
│   Pattern selection, size configuration                      │
└──────────────────────────────┬───────────────────────────────┘
                               │
             ┌─────────────────┼─────────────────┐
             │                 │                 │
             ▼                 ▼                 ▼
      ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
      │ Pattern     │   │ Counter     │   │ Timer       │
      │ Generator   │   │ Collector   │   │ Engine      │
      │             │   │ (perf)      │   │             │
      └──────┬──────┘   └──────┬──────┘   └──────┬──────┘
             │                 │                 │
             └─────────────────┼─────────────────┘
                               │
                               ▼
               ┌───────────────────────────────┐
               │      Measurement Loop         │
               │  Generate indices → Access    │
               │  → Collect counters           │
               └───────────────┬───────────────┘
                               │
                               ▼
               ┌───────────────────────────────┐
               │      Analysis & Report        │
               │  Compare patterns, visualize  │
               └───────────────────────────────┘

Key Data Structures

#include <stddef.h>
#include <stdint.h>

// Access pattern generator
typedef void (*index_generator_t)(size_t *indices, size_t count, size_t stride);

typedef struct {
    const char *name;
    index_generator_t generator;
    const char *description;
} pattern_t;

// Cache measurement result
typedef struct {
    double wall_time_ms;
    uint64_t l1_loads;
    uint64_t l1_misses;
    uint64_t l2_loads;
    uint64_t l2_misses;
    uint64_t llc_loads;     // Last-level cache (L3)
    uint64_t llc_misses;
} cache_result_t;

// Experiment configuration
typedef struct {
    const char *pattern_name;
    size_t data_size_bytes;
    size_t element_size;
    int iterations;
    int warmup_iterations;
} experiment_config_t;

Pattern Generator Examples

// Sequential access
void generate_sequential(size_t *indices, size_t count, size_t stride) {
    (void)stride;  // unused for this pattern
    for (size_t i = 0; i < count; i++) {
        indices[i] = i;
    }
}

// Strided access (every Nth element)
void generate_strided(size_t *indices, size_t count, size_t stride) {
    size_t idx = 0;
    for (size_t i = 0; i < count; i++) {
        indices[i] = idx;
        idx = (idx + stride) % count;
    }
}

// Random access (Fisher-Yates shuffle of sequential)
void generate_random(size_t *indices, size_t count, size_t stride) {
    (void)stride;  // unused; the randomness comes from the shuffle
    generate_sequential(indices, count, 0);
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);  // rand() needs <stdlib.h>
        size_t tmp = indices[i];
        indices[i] = indices[j];
        indices[j] = tmp;
    }
}
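One way to tie the generators to the measurement loop; the registry table and helper names here are illustrative, not mandated by the spec:

// Illustrative pattern registry built on pattern_t.
static const pattern_t PATTERNS[] = {
    {"sequential", generate_sequential, "0, 1, 2, ..."},
    {"strided",    generate_strided,    "0, N, 2N, ... (mod count)"},
    {"random",     generate_random,     "Fisher-Yates permutation"},
};

// Core access loop: stream through data[] in the order given by indices[].
static uint64_t run_accesses(const uint64_t *data, const size_t *indices,
                             size_t count, int iterations) {
    uint64_t sum = 0;
    for (int it = 0; it < iterations; it++)
        for (size_t i = 0; i < count; i++)
            sum += data[indices[i]];
    return sum;  // the caller must consume this (see Pitfall 1)
}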

Phased Implementation Guide

Phase 1: Basic Access Pattern Experiments (Days 1-3)

Goal: Measure timing difference between sequential and random access.

Steps:

  1. Allocate a large array (larger than L3 cache)
  2. Implement sequential and random index generators
  3. Measure wall-clock time for each pattern
  4. Calculate and report bandwidth (MB/s)

Validation: Random should be 10-20x slower than sequential.
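A minimal sketch of the Phase 1 harness, assuming CLOCK_MONOTONIC on Linux and the illustrative run_accesses() helper sketched under Key Data Structures:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

// Time one pattern and report bandwidth in MB/s.
static void time_pattern(const uint64_t *data, const size_t *idx,
                         size_t count, int iters) {
    double t0 = now_ms();
    volatile uint64_t sink = run_accesses(data, idx, count, iters);
    double t1 = now_ms();
    (void)sink;  // keep the result observable so the loop survives -O2
    double mb = (double)count * sizeof(uint64_t) * iters / (1 << 20);
    printf("%8.1f ms  %8.1f MB/s\n", t1 - t0, mb / ((t1 - t0) / 1e3));
}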

Phase 2: Hardware Counter Integration (Days 4-6)

Goal: Add L1, L2, L3 cache miss measurements.

Steps:

  1. Use perf_event_open for cache counters (see the sketch below):
    • PERF_COUNT_HW_CACHE_L1D with read/miss
    • PERF_COUNT_HW_CACHE_LL with read/miss
  2. Correlate miss rates with timing
  3. Verify sequential has ~0% miss rate

Validation: Stride-64 should have ~100% L1 miss rate.
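A minimal sketch of the counter plumbing, assuming Linux and trimming error handling; the config encoding (cache id | op << 8 | result << 16) follows the perf_event_open(2) man page, and the helper names are illustrative:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

// Open one hardware cache counter for the calling thread.
static int open_cache_counter(uint64_t cache, uint64_t op, uint64_t result) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = cache | (op << 8) | (result << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    // pid = 0, cpu = -1: measure this thread on any CPU
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

// Count L1D read misses around a workload function.
uint64_t count_l1d_read_misses(void (*workload)(void *), void *arg) {
    int fd = open_cache_counter(PERF_COUNT_HW_CACHE_L1D,
                                PERF_COUNT_HW_CACHE_OP_READ,
                                PERF_COUNT_HW_CACHE_RESULT_MISS);
    if (fd < 0) return 0;  // counter unavailable (permissions, VM, ...)
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    workload(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    close(fd);
    return count;
}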

Phase 3: 2D Array Experiments (Days 7-9)

Goal: Demonstrate row-major vs column-major impact.

Steps:

  1. Create 2D matrix traversal workloads
  2. Implement row-major and column-major loops
  3. Implement blocked traversal for comparison (see the sketch below)
  4. Measure and compare all three

Validation: Column-major should be 5-10x slower than row-major.
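For the blocked traversal in step 3, one sketch; TILE is an illustrative tuning knob (pick it so a tile of doubles fits in L1 or L2), and N, M are assumed divisible by TILE:

#include <stddef.h>

#define TILE 64  // 64x64 doubles = 32 KB, about one L1 cache

// Visit the matrix tile by tile so each tile's cache lines are fully
// consumed before moving on.
double sum_blocked(size_t N, size_t M, const double matrix[N][M]) {
    double sum = 0.0;
    for (size_t rr = 0; rr < N; rr += TILE)
        for (size_t cc = 0; cc < M; cc += TILE)
            for (size_t r = rr; r < rr + TILE; r++)
                for (size_t c = cc; c < cc + TILE; c++)
                    sum += matrix[r][c];
    return sum;
}

For a pure reduction like this, blocked performs about the same as row-major (as the example report shows); the payoff appears once the traversal has reuse, e.g. transposition or multiplication.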

Phase 4: Size Scaling Study (Days 10-12)

Goal: Show performance cliffs as data exceeds cache levels.

Steps:

  1. Run sequential pattern across data sizes:
    • 16 KB (fits L1)
    • 128 KB (fits L2)
    • 4 MB (fits L3)
    • 64 MB (exceeds L3)
  2. Plot bandwidth vs size
  3. Identify cache level boundaries

Validation: Should see 3 distinct performance plateaus.
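A sketch of the sweep driver; the sizes mirror the list above and the measurement body would reuse the Phase 1 harness:

#include <stdint.h>
#include <stdio.h>

// Sizes straddle typical L1/L2/L3 boundaries; expect one bandwidth
// plateau per cache level when plotting the results.
static const size_t SWEEP_BYTES[] = {
    16u << 10, 128u << 10, 4u << 20, 64u << 20,
};

void run_size_sweep(void) {
    for (size_t i = 0; i < sizeof(SWEEP_BYTES) / sizeof(SWEEP_BYTES[0]); i++) {
        size_t count = SWEEP_BYTES[i] / sizeof(uint64_t);
        printf("%zu bytes (%zu elements)\n", SWEEP_BYTES[i], count);
        // allocate + pre-fault count elements, then time the sequential
        // pattern and record MB/s as in Phase 1
    }
}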

Phase 5: Visualization and Reporting (Days 13-14)

Goal: Create visual representation of access patterns.

Steps:

  1. Generate access pattern diagrams
  2. Create HTML/SVG visualization of cache line access
  3. Produce summary report with all experiments
  4. Add recommendations based on patterns

Validation: Visualization clearly shows good vs bad locality.


Testing Strategy

Micro-Benchmarks

  1. Cache line verification: Access every 64th byte, verify ~100% L1 miss
  2. L1 size boundary: Find exact size where L1 miss rate jumps
  3. Prefetch effectiveness: Compare with/without software prefetch hints

Validation Experiments

  1. Cross-validate with Cachegrind: Compare miss counts
  2. perf stat comparison: Verify counters match perf output
  3. Known algorithms: Test matrix multiply with different loop orders

Edge Cases

  1. Huge pages: Do 2MB pages change results?
  2. NUMA effects: What about cross-socket access?
  3. TLB misses: Separate TLB from cache effects

Common Pitfalls and Debugging

Pitfall 1: Compiler Optimizes Away Access

Symptom: All patterns run at same speed (fast).

Cause: Compiler sees unused reads and eliminates them.

Solution:

// Bad: Result unused
for (int i = 0; i < n; i++) {
    data[indices[i]];  // Compiler may eliminate the whole loop
}

// Good: Accumulate, then publish through a volatile sink.
// (A volatile accumulator also works, but forces a store every
// iteration and distorts the measurement.)
uint64_t sum = 0;
for (int i = 0; i < n; i++) {
    sum += data[indices[i]];
}
volatile uint64_t sink = sum;
(void)sink;
return sum;

Pitfall 2: Prefetcher Masks Randomness

Symptom: "Random" access isn't as slow as expected.

Cause: Modern CPUs have stream prefetchers that can detect some patterns.

Solutions:

  1. Use truly random order (Fisher-Yates shuffle)
  2. Use pointer chasing, where each element stores the location of the next (see the sketch below)
  3. Disable the hardware prefetcher: wrmsr 0x1a4 0xf (Intel-specific MSR; requires root and msr-tools)
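Pointer chasing (option 2) defeats the prefetcher because each load's address depends on the previous load's result; a minimal sketch using indices rather than raw pointers:

#include <stddef.h>

// next[] holds a random cycle over [0, count), e.g. built by linking the
// slots in Fisher-Yates-shuffled order. Each access must wait for the
// previous load to complete, exposing the full miss latency.
size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t i = start;
    for (size_t s = 0; s < steps; s++)
        i = next[i];
    return i;  // use the result so the chain is not optimized away
}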

Pitfall 3: NUMA Effects Dominate

Symptom: Results vary wildly between runs.

Cause: On multi-socket systems, memory may be on remote node.

Solution:

# Pin to single NUMA node
numactl --membind=0 --cpunodebind=0 ./cache_lab

Pitfall 4: Page Faults in Measurement

Symptom: First iteration is 100x slower.

Cause: Memory pages not allocated until first touch.

Solution:

// Pre-fault all pages before measurement
memset(data, 0, data_size);  // Forces page allocation

Extensions and Challenges

Extension 1: Cache Blocking for Matrix Operations

Implement blocked matrix multiply:

// Naive: O(n³) but cache-hostile (B is walked down columns)
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++)
    for (int k = 0; k < n; k++)
      C[i][j] += A[i][k] * B[k][j];

// Blocked: same O(n³), but each BS×BS tile stays cache-resident
// (BS = block size; assumes n % BS == 0)
for (int ii = 0; ii < n; ii += BS) for (int jj = 0; jj < n; jj += BS) for (int kk = 0; kk < n; kk += BS)
  for (int i = ii; i < ii + BS; i++)
    for (int j = jj; j < jj + BS; j++)
      for (int k = kk; k < kk + BS; k++)
        C[i][j] += A[i][k] * B[k][j];

Demonstrate 2-10x speedup from blocking alone.

Extension 2: False Sharing Detector

Create multi-threaded workload where:

  1. Each thread writes to its own variable
  2. Variables are on same cache line (false sharing)
  3. Move to separate cache lines (padding)
  4. Measure speedup from eliminating false sharing
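A minimal sketch of that workload, assuming pthreads; timing (step 4) can reuse the Phase 1 harness, and the names here are illustrative:

#include <pthread.h>
#include <stdalign.h>
#include <stdint.h>

#define NTHREADS 4
#define ITERS    100000000UL

// Layout A: adjacent 8-byte counters share cache lines (false sharing).
static uint64_t shared_counters[NTHREADS];
// Layout B: each counter padded out to its own 64-byte line.
static struct { alignas(64) uint64_t v; } padded_counters[NTHREADS];

static void *bump(void *arg) {
    size_t id = (size_t)(uintptr_t)arg;
    for (uint64_t i = 0; i < ITERS; i++)
        padded_counters[id].v++;  // swap in shared_counters[id]++ for layout A
    return NULL;
}

int main(void) {
    (void)shared_counters;
    pthread_t t[NTHREADS];
    for (uintptr_t i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, bump, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;  // time each layout externally and compare
}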

Extension 3: Software Prefetch Experiments

Test explicit prefetch instructions:

for (int i = 0; i < n; i++) {
    __builtin_prefetch(&data[i + 64], 0, 3);  // read prefetch, 64 elements ahead
    sum += data[i];
}

Measure when prefetching helps vs hurts.

Challenge: Cache-Oblivious Algorithm

Implement a cache-oblivious algorithm (like funnel sort or recursive matrix multiply) and demonstrate it achieves good performance across different cache sizes without tuning.


Real-World Connections

Where Cache Locality Matters

  1. Database Query Engines: Column stores vs row stores, cache-friendly layouts
  2. Game Engines: Entity-Component-System design for cache-friendly iteration
  3. Scientific Computing: Loop tiling in BLAS/LAPACK
  4. Machine Learning: Tensor memory layouts in PyTorch/TensorFlow

Industry Examples

  • Google Dense Hash Map: Cache-optimized hash table layout
  • Facebook Folly F14: False-sharing-aware concurrent hash map
  • Intel MKL: Auto-tuned cache blocking for matrix operations
  • Redis: Cache-line-aligned data structures

Self-Assessment Checklist

Before considering this project complete, verify:

  • You can explain what a cache line is and why it matters
  • You can predict miss rates for different access patterns
  • You measured the row-major vs column-major difference (should be 5-10x)
  • You can identify cache level boundaries from size scaling experiments
  • Your measurements match expected cache behavior
  • You can explain when and why blocking helps
  • You understand false sharing and how to prevent it

Resources

Essential Reading

  • "Computer Systems: A Programmer's Perspective" by Bryant & O'Hallaron, Chapter 6
  • "What Every Programmer Should Know About Memory" by Ulrich Drepper
  • "Optimizing Software in C++" by Agner Fog, Chapter 5

Reference Documentation

  • Intel 64 and IA-32 Architectures Optimization Reference Manual
  • AMD64 Architecture Programmer's Manual, Volume 2
  • Linux perf wiki: Hardware cache events

Tools

  • perf stat: Cache counter measurement
  • Cachegrind (Valgrind): Cache simulation
  • Intel VTune: Detailed cache analysis
  • likwid: Performance counter suite