Project 9: Cache Lab++ – Cache Simulator + Locality Visualizer
Build a set-associative cache simulator and an ASCII locality visualizer that reveals hit/miss patterns, enabling you to optimize code through informed data layout decisions.
Quick Reference
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 2–3 weeks |
| Chapters | 6, 5 |
| Coolness | ★★★★☆ Hardcore Tech Flex |
| Portfolio Value | Resume Gold |
Learning Objectives
By completing this project, you will:
- Understand cache organization: Master direct-mapped, set-associative, and fully-associative cache structures
- Implement address decomposition: Parse addresses into tag, index, and offset components
- Classify cache misses: Distinguish compulsory, capacity, and conflict misses through simulation
- Apply replacement policies: Implement LRU, FIFO, and random replacement strategies
- Visualize memory access patterns: Create ASCII visualizations showing cache behavior over time
- Optimize for locality: Design data layouts and access patterns that exploit cache behavior
- Measure and analyze: Use your simulator to quantify cache performance and validate optimizations
Deep Theoretical Foundation
The Memory Hierarchy: Why Caches Exist
Modern processors operate orders of magnitude faster than main memory. Without caches, the CPU would spend most of its time waiting for data:
┌─────────────────────────────────────────────────────────────────────────────┐
│ MEMORY HIERARCHY OVERVIEW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Component Size Access Time Bandwidth │
│ ────────────────────────────────────────────────────────────────────── │
│ CPU Registers ~1 KB ~0.25 ns 100s GB/s │
│ L1 Cache 32-64 KB ~1 ns ~500 GB/s │
│ L2 Cache 256 KB-1 MB ~3-4 ns ~200 GB/s │
│ L3 Cache 8-64 MB ~10-20 ns ~100 GB/s │
│ Main Memory 16-256 GB ~50-100 ns ~50 GB/s │
│ SSD 256 GB-4 TB ~100 us ~5 GB/s │
│ HDD 1-20 TB ~10 ms ~200 MB/s │
│ │
│ Speed │
│ ▲ │
│ │ Registers │
│ │ ● │
│ │ L1 Cache │
│ │ ● │
│ │ L2 Cache │
│ │ ● │
│ │ L3 Cache │
│ │ ● │
│ │ Main Memory │
│ │ ● │
│ │ │
│ └─────────────────────────────────────► Size │
│ │
│ Key insight: Smaller = Faster (physics + economics) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Why this hierarchy works: Programs exhibit locality of reference:
- Temporal locality: Recently accessed items are likely to be accessed again soon
- Spatial locality: Items near recently accessed items are likely to be accessed soon
Caches exploit both: they keep recently-used data (temporal) in cache lines that hold multiple adjacent bytes (spatial).
Cache Organization Fundamentals
A cache is organized as a collection of sets, each containing one or more lines (also called blocks):
┌─────────────────────────────────────────────────────────────────────────────┐
│ CACHE ORGANIZATION ANATOMY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Cache Parameters: │
│ ───────────────── │
│ S = Number of sets (S = 2^s, where s = index bits) │
│ E = Lines per set (associativity) (E = 1: direct-mapped) │
│ B = Bytes per line (block size) (B = 2^b, where b = offset bits) │
│ │
│ Total cache size: C = S × E × B bytes │
│ │
│ │
│ Example: 4 sets, 2 lines/set, 4 bytes/line (32-byte 2-way set-assoc) │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ CACHE │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ Set 0 │ Line 0: [V][Tag: 0x1A][B0][B1][B2][B3] │ │
│ │ │ Line 1: [V][Tag: 0x2F][B0][B1][B2][B3] │ │
│ ├───────┼─────────────────────────────────────────────────────────────┤ │
│ │ Set 1 │ Line 0: [V][Tag: 0x05][B0][B1][B2][B3] │ │
│ │ │ Line 1: [ ][Tag: ---][--][--][--][--] (invalid) │ │
│ ├───────┼─────────────────────────────────────────────────────────────┤ │
│ │ Set 2 │ Line 0: [V][Tag: 0x12][B0][B1][B2][B3] │ │
│ │ │ Line 1: [V][Tag: 0x77][B0][B1][B2][B3] │ │
│ ├───────┼─────────────────────────────────────────────────────────────┤ │
│ │ Set 3 │ Line 0: [V][Tag: 0x00][B0][B1][B2][B3] │ │
│ │ │ Line 1: [V][Tag: 0x33][B0][B1][B2][B3] │ │
│ └───────┴─────────────────────────────────────────────────────────────┘ │
│ │
│ Each line contains: │
│ • Valid bit (V): Is this line holding valid data? │
│ • Tag: High bits of address to identify which block is stored │
│ • Data block: B bytes of actual cached data │
│ • (Optional) Dirty bit: Has this block been modified? (for write-back) │
│ • (Optional) LRU bits: Track access recency for replacement │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Cache Types by Associativity
┌─────────────────────────────────────────────────────────────────────────────┐
│ ASSOCIATIVITY SPECTRUM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ DIRECT-MAPPED (E = 1) │
│ ───────────────────── │
│ Each memory block maps to exactly ONE cache line. │
│ │
│ Pros: Simple, fast lookup (no searching) │
│ Cons: High conflict miss rate │
│ │
│ Memory: [A][B][C][D][E][F][G][H][I][J][K][L]... │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ Cache: [0] [1] [2] [3] ← A, E, I all compete for line 0! │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ SET-ASSOCIATIVE (1 < E < C/B) │
│ ──────────────────────────────── │
│ Each block maps to a SET, can go in any line within that set. │
│ │
│ Pros: Reduces conflict misses; reasonable lookup speed │
│ Cons: More complex; requires replacement policy │
│ │
│ 2-way set-associative example: │
│ Memory: [A][B][C][D][E][F][G][H]... │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ Set 0: [_][_] ← A and E can BOTH live here now! │
│ Set 1: [_][_] ← B and F can both live here │
│ │
│ Common configurations: 2-way, 4-way, 8-way │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ FULLY-ASSOCIATIVE (E = C/B, S = 1) │
│ ────────────────────────────────── │
│ Any block can go in ANY cache line. One giant set. │
│ │
│ Pros: No conflict misses (only compulsory and capacity) │
│ Cons: Slow lookup (must search all lines); complex hardware │
│ │
│ Memory: [A][B][C][D][E][F][G][H]... │
│ │ │ │ │ │ │ │ │ │
│ └──┴──┴──┴──┼──┴──┴──┘ │
│ ▼ │
│ Cache: [_][_][_][_][_][_][_][_] ← Any block can go anywhere │
│ │
│ Used mainly for TLBs and very small caches │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Address Decomposition: Tag, Index, Offset
Every memory address is decomposed into three fields that determine cache behavior:
┌─────────────────────────────────────────────────────────────────────────────┐
│ ADDRESS DECOMPOSITION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ For a cache with S = 2^s sets and B = 2^b bytes per block: │
│ │
│ m-bit address: │
│ ┌────────────────────────┬───────────────┬──────────────┐ │
│ │ Tag │ Index │ Offset │ │
│ │ (t bits) │ (s bits) │ (b bits) │ │
│ └────────────────────────┴───────────────┴──────────────┘ │
│ │
│ Where: t = m - s - b │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ EXAMPLE: 32-bit address, 16 sets, 64 bytes/block │
│ │
│ S = 16 = 2^4 → s = 4 bits for index │
│ B = 64 = 2^6 → b = 6 bits for offset │
│ t = 32 - 4 - 6 = 22 bits for tag │
│ │
│ Address 0x12345678: │
│ Binary: 0001 0010 0011 0100 0101 0110 0111 1000 │
│ │
│ ┌──────────────────────┬──────┬────────┐ │
│ │ 0001001000110100010101 │ 1001 │ 111000 │ │
│ │ Tag (22 bits) │Index │ Offset │ │
│ │ = 0x048D15 │ = 9 │ = 56 │ │
│ └──────────────────────┴──────┴────────┘ │
│ │
│ Interpretation: │
│ • Look in Set 9 │
│ • Compare tag 0x048D15 against all lines in set 9 │
│ • If match AND valid bit set: HIT! Return byte at offset 56 │
│ • If no match: MISS. Fetch block, store with this tag │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ SPECIAL CASES: │
│ │
│ Direct-mapped (E=1, S=C/B): │
│ • Maximum index bits │
│ • Block maps to exactly one place │
│ │
│ Fully-associative (S=1): │
│ • No index bits (s=0) │
│ • Tag is almost the entire address │
│ • Must compare tag against ALL lines │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
The Cache Line (Block) Concept
A cache line is the fundamental unit of transfer between cache and memory:
┌─────────────────────────────────────────────────────────────────────────────┐
│ CACHE LINE CONCEPT │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ When CPU requests 1 byte, the cache fetches an ENTIRE BLOCK: │
│ │
│ CPU requests address 0x1004: │
│ │
│ Memory: │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ ... │0x1000│0x1001│0x1002│0x1003│0x1004│0x1005│0x1006│0x1007│ ... │ │
│ │ │ A │ B │ C │ D │ [E] │ F │ G │ H │ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────┘ │
│ ▼ Entire 8-byte block fetched │
│ Cache Line: │
│ ┌───────────────────────────────────────────────────┐ │
│ │ V=1 │ Tag │ A │ B │ C │ D │ E │ F │ G │ H │ │ │
│ └───────────────────────────────────────────────────┘ │
│ ↑ │
│ Requested byte │
│ │
│ WHY BLOCKS? │
│ ──────────── │
│ 1. DRAM is optimized for sequential access (row buffers) │
│ 2. Exploits SPATIAL LOCALITY: if you access E, you probably │
│ will soon access F, G, H... they're now already cached! │
│ 3. Amortizes the cost of the miss over multiple bytes │
│ │
│ BLOCK SIZE TRADEOFFS: │
│ ───────────────────── │
│ Larger blocks: │
│ + Better spatial locality exploitation │
│ + Fewer tags (less overhead) │
│ - Longer miss penalty (more bytes to transfer) │
│ - Fewer blocks for given cache size → more conflict misses │
│ - May fetch unused data (if locality is poor) │
│ │
│ Typical sizes: 32, 64, or 128 bytes │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Miss Types: The Three Cs
Understanding WHY a miss occurred is crucial for optimization:
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE THREE Cs OF CACHE MISSES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. COMPULSORY MISSES (Cold Misses) │
│ ────────────────────────────────── │
│ First access to a block that has never been in cache. │
│ │
│ Cause: Block simply hasn't been loaded yet │
│ Fix: Cannot be eliminated! (Must load data at least once) │
│ Can be reduced with prefetching │
│ │
│ Example: │
│ for (i = 0; i < N; i++) │
│ sum += A[i]; // First access to each cache line is compulsory │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ 2. CAPACITY MISSES │
│ ────────────────── │
│ Working set is larger than the cache. │
│ │
│ Cause: Too much data for cache to hold │
│ Fix: Reduce working set size (blocking/tiling) │
│ Or get a bigger cache (hardware solution) │
│ │
│ Example (cache holds 4 blocks): │
│ // Working set = 8 blocks (A[0..7]) │
│ for (pass = 0; pass < 100; pass++) │
│ for (i = 0; i < 8; i++) │
│ sum += A[i]; // Every pass evicts earlier blocks │
│ │
│ Access pattern: A0,A1,A2,A3,A4,A5,A6,A7, A0(miss!),A1(miss!)... │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ 3. CONFLICT MISSES │
│ ────────────────── │
│ Block evicted due to other blocks mapping to same set. │
│ (Would NOT have occurred in a fully-associative cache of same size) │
│ │
│ Cause: Limited associativity + unfortunate address mapping │
│ Fix: Increase associativity; change data layout; padding │
│ │
│ Example (direct-mapped, 4 sets): │
│ // A[0] and A[4] map to same set! │
│ for (i = 0; i < 100; i++) { │
│ sum += A[0]; // miss, load A[0] │
│ sum += A[4]; // miss, evicts A[0], load A[4] │
│ sum += A[0]; // miss! A[0] was evicted by A[4] │
│ sum += A[4]; // miss! A[4] was evicted by A[0] │
│ } │
│ // Every access is a miss! This is "cache thrashing" │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ IDENTIFYING MISS TYPE IN SIMULATION: │
│ │
│ if (block never seen before): │
│ → COMPULSORY │
│ else if (would hit in fully-associative cache of same size): │
│ → CONFLICT (evicted due to set collision) │
│ else: │
│ → CAPACITY (evicted due to overall cache full) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
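One way to implement that classification is to run two shadow models alongside the real cache: a "seen before" set of block numbers (for compulsory misses) and a fully-associative LRU cache with the same total number of lines as the real cache (to split conflict from capacity). A simplified sketch; all names and the capacity constant are illustrative, and in a real simulator `fa_cap` would be set to the real cache's line count:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BLOCKS 4096   /* assumed footprint limit for this sketch */

typedef enum { MISS_COMPULSORY, MISS_CONFLICT, MISS_CAPACITY } miss_type_t;

static unsigned long seen[MAX_BLOCKS];   /* every block ever referenced */
static int nseen = 0;

static unsigned long fa[64];             /* fully-assoc shadow, LRU order: */
static int nfa = 0, fa_cap = 4;          /* fa[0] is most recently used    */

static bool seen_before(unsigned long blk) {
    for (int i = 0; i < nseen; i++)
        if (seen[i] == blk) return true;
    return false;
}

/* Touch the fully-associative shadow cache; returns true on hit. */
static bool fa_access(unsigned long blk) {
    int i;
    for (i = 0; i < nfa; i++)
        if (fa[i] == blk) break;
    bool hit = (i < nfa);
    if (!hit && nfa < fa_cap) nfa++;     /* grow until full          */
    if (i >= nfa) i = nfa - 1;           /* miss when full: evict LRU */
    memmove(&fa[1], &fa[0], i * sizeof fa[0]);  /* move-to-front */
    fa[0] = blk;
    return hit;
}

/* Call on EVERY access so the shadow LRU stays in sync; when the
 * real cache missed, the return value classifies that miss. */
miss_type_t classify_access(unsigned long blk) {
    bool first = !seen_before(blk);
    if (first) seen[nseen++] = blk;
    bool hit_fa = fa_access(blk);
    if (first) return MISS_COMPULSORY;
    return hit_fa ? MISS_CONFLICT : MISS_CAPACITY;
}
```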
Replacement Policies
When a set is full and we need to load a new block, which victim do we evict?
┌─────────────────────────────────────────────────────────────────────────────┐
│ REPLACEMENT POLICIES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. LRU (Least Recently Used) │
│ ──────────────────────────── │
│ Evict the block that hasn't been accessed for the longest time. │
│ │
│ Rationale: Temporal locality — recently used items likely used again │
│ Implementation: Track access order per set │
│ │
│ Example (2-way set, LRU policy): │
│ Access sequence: A, B, A, C │
│ │
│ After A: [A*, _] (* = most recent) │
│ After B: [A, B*] │
│ After A: [A*, B] (A touched, now most recent) │
│ After C: evict B (least recently used); C replaces it │
│ [A, C*] │
│ │
│ Pros: Often best hit rate; matches locality assumption │
│ Cons: Hardware complexity for high associativity; │
│ O(E) state per set; update on every access │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ 2. FIFO (First In, First Out) │
│ ───────────────────────────── │
│ Evict the block that was loaded earliest. │
│ │
│ Rationale: Simple approximation; older data less likely needed │
│ Implementation: Circular pointer per set │
│ │
│ Example (2-way set, FIFO policy): │
│ Access sequence: A, B, A, C │
│ │
│ After A: [A, _] pointer→1 │
│ After B: [A, B] pointer→0 │
│ After A: [A, B] (hit, pointer unchanged) │
│ After C: [C, B] evict A (oldest), pointer→1 │
│ │
│ Note: A was evicted even though it was just accessed! │
│ FIFO ignores recency. (FIFO is also prone to Belady's │
│ anomaly: a larger cache can sometimes miss MORE.) │
│ │
│ Pros: Simple to implement; O(1) state per set │
│ Cons: Ignores recency; can have poor hit rate │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ 3. RANDOM │
│ ──────── │
│ Pick a random victim from the set. │
│ │
│ Rationale: Unpredictable; avoids pathological worst-cases │
│ Implementation: Random number generator │
│ │
│ Pros: Simplest hardware; no state needed; avoids worst cases │
│ Cons: Unpredictable; may evict hot blocks │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ 4. LFU (Least Frequently Used) │
│ ────────────────────────────── │
│ Evict the block with fewest accesses. │
│ │
│ Pros: Good for some access patterns │
│ Cons: Doesn't adapt to changing access patterns; │
│ blocks with high counts early on may never be evicted │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ COMPARISON (typical hit rates): │
│ │
│ LRU ≥ Pseudo-LRU (PLRU) > FIFO > Random │
│ │
│ In practice: Modern CPUs use variants like Pseudo-LRU or │
│ tree-based approximations for hardware efficiency. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Write Policies
What happens when the CPU writes to a cached block?
┌─────────────────────────────────────────────────────────────────────────────┐
│ WRITE POLICIES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TWO INDEPENDENT DECISIONS: │
│ │
│ 1. WRITE HIT: What to do when we write to a block already in cache? │
│ │
│ WRITE-THROUGH WRITE-BACK │
│ ───────────── ────────── │
│ Write to cache AND memory Write to cache only │
│ immediately Mark block "dirty" │
│ Write to memory only on eviction │
│ │
│ + Memory always consistent + Fewer memory writes │
│ + Simple + Better performance │
│ - High memory traffic - Complex (track dirty bits) │
│ - Slow (wait for memory) - Memory inconsistent until evict │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ 2. WRITE MISS: What to do when we write to a block NOT in cache? │
│ │
│ WRITE-ALLOCATE NO-WRITE-ALLOCATE │
│ ───────────── ───────────────── │
│ Load block into cache, Write directly to memory │
│ then write to cache Don't load into cache │
│ │
│ + Good if writing then reading + Saves cache space for reads │
│ + Exploits spatial locality - No locality benefit │
│ - Extra memory read on miss │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ COMMON COMBINATIONS: │
│ │
│ Write-Through + No-Write-Allocate │
│ • Writes always go to memory; reads populate cache │
│ • Used in some L1 caches │
│ │
│ Write-Back + Write-Allocate (most common today) │
│ • Writes stay in cache until eviction │
│ • Exploits temporal locality on writes │
│ • Used in most L2/L3 caches │
│ │
│ FOR THIS PROJECT: │
│ We focus on READ behavior. Write policies are a natural extension. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Spatial and Temporal Locality
These are the principles that make caching effective:
┌─────────────────────────────────────────────────────────────────────────────┐
│ LOCALITY PRINCIPLES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TEMPORAL LOCALITY │
│ ───────────────── │
│ "If you accessed it recently, you'll probably access it again soon." │
│ │
│ Good temporal locality: Poor temporal locality: │
│ ──────────────────────── ───────────────────────── │
│ sum = 0; // Process each element once │
│ for (i = 0; i < 1000; i++) { for (i = 0; i < N; i++) │
│ sum += A[i % 10]; // Reuse! process(A[i]); // No reuse │
│ } │
│ │
│ Access pattern: 0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,... │
│ Each element accessed 100 times → 99% reuse from cache! │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ SPATIAL LOCALITY │
│ ──────────────── │
│ "If you accessed location X, you'll probably access nearby locations." │
│ │
│ Good spatial locality: Poor spatial locality: │
│ ────────────────────── ───────────────────────── │
│ // Sequential access // Stride access │
│ for (i = 0; i < N; i++) for (i = 0; i < N; i += 64) │
│ sum += A[i]; sum += A[i]; │
│ │
│ Access pattern: 0,1,2,3,4,5,6,7... Access pattern: 0,64,128,192... │
│ Each cache line fully utilized Each cache line: 1 useful byte │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ QUANTIFYING LOCALITY: │
│ │
│ Miss Rate = Misses / Total Accesses │
│ │
│ For sequential 1-byte accesses with B-byte blocks: │
│ Miss rate ≈ 1/B (one miss loads a block → B-1 hits follow) │
│ │
│ For stride-K (bytes) accesses with B-byte blocks: │
│ If K < B: Miss rate ≈ K/B (partial spatial locality) │
│ If K ≥ B: Miss rate ≈ 100% (every access touches a new block) │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ MATRIX ACCESS PATTERNS: │
│ │
│ Row-major storage: A[i][j] at address base + i*N + j │
│ │
│ Row-wise access (good): Column-wise access (bad): │
│ for (i...) for (j...) for (j...) for (i...) │
│ A[i][j] A[i][j] │
│ │
│ Memory: [0,0][0,1][0,2][0,3][1,0][1,1][1,2][1,3]... │
│ Access: 1 2 3 4 5 6 7 8 (row-wise) │
│ Access: 1 5 9 13 2 6 10 14 (col-wise = strided!) │
│ │
│ Column-wise in row-major = stride of N elements! │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Blocking (Tiling) for Cache Optimization
The key optimization technique for improving cache performance:
┌─────────────────────────────────────────────────────────────────────────────┐
│ BLOCKING / TILING TECHNIQUE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PROBLEM: Matrix multiply naive version has poor locality │
│ │
│ // C = A × B (N × N matrices) │
│ for (i = 0; i < N; i++) │
│ for (j = 0; j < N; j++) │
│ for (k = 0; k < N; k++) │
│ C[i][j] += A[i][k] * B[k][j]; │
│ │
│ Access analysis (row-major storage): │
│ • A[i][k]: Row-wise access (good) │
│ • B[k][j]: Column-wise access (BAD - stride N!) │
│ • C[i][j]: Repeated access (good temporal locality) │
│ │
│ For large N, B causes cache thrashing. │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ SOLUTION: Process in cache-sized blocks │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ A B C │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │███│ │ │ │███│ │ │ │███│ │ │ │ │
│ │ ├───┼───┼───┤ × ├───┼───┼───┤ = ├───┼───┼───┤ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ ├───┼───┼───┤ ├───┼───┼───┤ ├───┼───┼───┤ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ │ │
│ │ │ │
│ │ Process one block at a time: │ │
│ │ C[block] += A[row_block] × B[col_block] │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ BLOCKED MATRIX MULTIPLY: │
│ │
│ // BLOCK chosen so three BLOCK×BLOCK tiles fit in cache │
│ for (ii = 0; ii < N; ii += BLOCK) │
│ for (jj = 0; jj < N; jj += BLOCK) │
│ for (kk = 0; kk < N; kk += BLOCK) │
│ // Mini matrix multiply on blocks │
│ for (i = ii; i < min(ii+BLOCK, N); i++) │
│ for (j = jj; j < min(jj+BLOCK, N); j++) │
│ for (k = kk; k < min(kk+BLOCK, N); k++) │
│ C[i][j] += A[i][k] * B[k][j]; │
│ │
│ CHOOSING BLOCK SIZE: │
│ ──────────────────── │
│ • Need 3 blocks in cache: A_block, B_block, C_block │
│ • Each block is BLOCK × BLOCK elements │
│ • Total: 3 × BLOCK² × sizeof(element) < Cache Size │
│ • BLOCK ≈ sqrt(CacheSize / (3 × sizeof(element))) │
│ │
│ Example: 32KB L1 cache, double (8 bytes) │
│ BLOCK ≈ sqrt(32768 / 24) ≈ 37 │
│ Use BLOCK = 32 (power of 2 for alignment) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Cache-Oblivious Algorithms
An advanced concept: algorithms that are efficient without knowing cache size:
┌─────────────────────────────────────────────────────────────────────────────┐
│ CACHE-OBLIVIOUS ALGORITHMS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ CACHE-AWARE (blocking): │
│ • You choose block size based on known cache size │
│ • Good: Optimal for that cache │
│ • Bad: Need to retune for different caches/hierarchy levels │
│ │
│ CACHE-OBLIVIOUS: │
│ • Algorithm is designed to work well for ANY cache size │
│ • Uses recursive divide-and-conquer │
│ • Automatically adapts to all levels of hierarchy │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ EXAMPLE: Cache-oblivious matrix multiply │
│ │
│ void matmul(A, B, C, n) { │
│ if (n <= BASE_CASE) { │
│ // Direct multiply (fits in registers) │
│ base_case_multiply(A, B, C); │
│ } else { │
│ // Divide each matrix into quadrants │
│ // ┌────┬────┐ │
│ // │ A11│ A12│ │
│ // ├────┼────┤ │
│ // │ A21│ A22│ │
│ // └────┴────┘ │
│ // │
│ // C11 = A11×B11 + A12×B21 │
│ // C12 = A11×B12 + A12×B22 │
│ // C21 = A21×B11 + A22×B21 │
│ // C22 = A21×B12 + A22×B22 │
│ │
│ // 8 recursive multiplies, 4 additions │
│ matmul(A11, B11, T1, n/2); │
│ matmul(A12, B21, T2, n/2); │
│ add(T1, T2, C11, n/2); │
│ // ... etc. │
│ } │
│ } │
│ │
│ WHY IT WORKS: │
│ ───────────── │
│ At some recursion level, subproblems fit in cache. │
│ From then on, all accesses are cache hits. │
│ This happens automatically for any cache size! │
│ │
│ THEORETICAL GUARANTEE: │
│ Cache misses = O(N³ / (B × sqrt(M))) │
│ Where: N = matrix size, B = block size, M = cache size │
│ This is optimal for any M and B (under the tall-cache assumption)! │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Project Specification
What You Will Build
A two-part system:
- Cache Simulator: A configurable set-associative cache simulator that processes memory access traces and reports hit/miss statistics with miss classification.
- Locality Visualizer: An ASCII-art visualization tool that shows cache behavior over time, making hit/miss patterns visible and intuitive.
Functional Requirements
Cache Simulator (csim):
- Configuration (-s, -E, -b flags):
  - Accept cache parameters: s (index bits), E (associativity), b (block bits)
  - Support cache sizes from tiny (s=1, E=1, b=2) to realistic (s=10, E=8, b=6)
- Trace Processing:
  - Read Valgrind-format memory traces:
      I 0400d7d4,8  (instruction fetch - optional to handle)
      L 7ff0005c8,8 (data load)
      S 7ff0005d0,4 (data store)
      M 0421c7f0,4  (modify = load then store)
  - Process each access and update cache state
- Statistics Reporting:
- Count: hits, misses, evictions
- Classify misses: compulsory, capacity, conflict
- Support verbose mode showing each access result
- Replacement Policies:
- Implement LRU (required)
- FIFO and Random (stretch goals)
Locality Visualizer (lvis):
- Visual Output:
- Display cache state as ASCII art after each access (or at intervals)
- Show which sets are being accessed
- Highlight hits (green/+) vs misses (red/-)
- Pattern Recognition:
- Identify and annotate access patterns (sequential, strided, random)
- Calculate and display running miss rate
- Code Path Mode:
- Accept annotated traces that include source location
- Group statistics by code region
Non-Functional Requirements
- Correctness: Match reference simulator output exactly
- Performance: Handle traces with millions of accesses in reasonable time
- Clarity: Output must be educational, not just numerical
- Modularity: Clean separation between simulator core and visualization
Example Usage and Output
Basic Simulation:
$ ./csim -s 4 -E 1 -b 4 -t traces/yi.trace
hits:4 misses:5 evictions:3
$ ./csim -v -s 4 -E 1 -b 4 -t traces/yi.trace
L 10,1 miss
M 20,1 miss hit
L 22,1 hit
S 18,1 hit
L 110,1 miss eviction
L 210,1 miss eviction
M 12,1 miss eviction hit
hits:4 misses:5 evictions:3
Miss Classification:
$ ./csim -s 4 -E 1 -b 4 -t traces/yi.trace --classify
Cache Configuration:
Sets: 16 (s=4)
Lines per set: 1 (E=1, direct-mapped)
Block size: 16 bytes (b=4)
Total size: 256 bytes
Results:
Hits: 4 (44.4%)
Misses: 5 (55.6%)
Compulsory: 4 (80.0% of misses)
Capacity: 0 (0.0% of misses)
Conflict: 1 (20.0% of misses)
Evictions: 3
Locality Visualizer Output:
$ ./lvis -s 2 -E 2 -b 2 -t traces/simple.trace
=== CACHE LOCALITY VISUALIZER ===
Cache: 4 sets × 2 ways × 4 bytes = 32 bytes
Access #1: L 0x00 [MISS - Compulsory]
┌──────────────────────────────────┐
│ Set 0: [████ tag=0x00] [ ] │ ← ACCESSED (miss)
│ Set 1: [ ] [ ] │
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└──────────────────────────────────┘
Running: hits=0 misses=1 (0.0% hit rate)
Access #2: L 0x04 [MISS - Compulsory]
┌──────────────────────────────────┐
│ Set 0: [████ tag=0x00] [ ] │
│ Set 1: [████ tag=0x00] [ ] │ ← ACCESSED (miss)
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└──────────────────────────────────┘
Running: hits=0 misses=2 (0.0% hit rate)
Access #3: L 0x00 [HIT]
┌──────────────────────────────────┐
│ Set 0: [████ tag=0x00] [ ] │ ← ACCESSED (HIT!)
│ Set 1: [████ tag=0x00] [ ] │
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└──────────────────────────────────┘
Running: hits=1 misses=2 (33.3% hit rate)
Access #4: L 0x10 [MISS - Compulsory]
┌──────────────────────────────────┐
│ Set 0: [████ tag=0x01] [████ tag=0x00] │ ← ACCESSED (miss, filled 2nd way)
│ Set 1: [████ tag=0x00] [ ] │
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└──────────────────────────────────┘
Running: hits=1 misses=3 (25.0% hit rate)
=== PATTERN ANALYSIS ===
Stride detected: 4 bytes (matches block size)
Spatial locality: GOOD (sequential within blocks)
Temporal locality: MODERATE (some reuse observed)
Compulsory misses: 3 (100% of misses) - cold-start cost, unavoidable
Real World Outcome
When you complete this project, you will have a fully functional cache simulator and locality visualizer. Here is exactly what running your tools will look like:
Cache Simulator Basic Usage
$ ./csim -s 4 -E 1 -b 4 -t traces/yi.trace
hits:4 misses:5 evictions:3
Verbose Mode with Full Trace
$ ./csim -v -s 4 -E 1 -b 4 -t traces/yi.trace
Cache Configuration:
Sets (S): 16 (s=4 bits)
Lines per set (E): 1 (direct-mapped)
Block size (B): 16 bytes (b=4 bits)
Cache size: 256 bytes
Processing trace: traces/yi.trace
─────────────────────────────────────────────────────────────
L 10,1 set=1 tag=0x000000 miss
M 20,1 set=2 tag=0x000000 miss hit
L 22,1 set=2 tag=0x000000 hit
S 18,1 set=1 tag=0x000000 hit
L 110,1 set=1 tag=0x000001 miss eviction
L 210,1 set=1 tag=0x000002 miss eviction
M 12,1 set=1 tag=0x000000 miss eviction hit
─────────────────────────────────────────────────────────────
Summary:
hits:4 misses:5 evictions:3
Hit rate: 44.4%
Miss rate: 55.6%
Miss Classification Analysis
$ ./csim --classify -s 4 -E 2 -b 4 -t traces/matrix_multiply.trace
Cache Configuration:
Sets: 16 (s=4)
Lines per set: 2 (E=2)
Block size: 16 bytes (b=4)
Total size: 512 bytes
Access Summary:
Total accesses: 10,000
Hits: 7,234 (72.3%)
Misses: 2,766 (27.7%)
Miss Classification:
┌─────────────────────────────────────────────────────┐
│ Compulsory: 1,024 (37.0% of misses) █████████ │
│ Capacity: 892 (32.3% of misses) ████████ │
│ Conflict: 850 (30.7% of misses) ███████ │
└─────────────────────────────────────────────────────┘
Evictions: 1,742
Recommendations:
- High conflict misses: Consider 4-way associativity
- Capacity misses indicate working set > cache size
- Use blocking with block size ~16 elements
Locality Visualizer Output
$ ./lvis -s 2 -E 2 -b 2 -t traces/simple.trace --animate
╔══════════════════════════════════════════════════════════════╗
║ CACHE LOCALITY VISUALIZER ║
║ Cache: 4 sets × 2 ways × 4 bytes = 32 bytes ║
╚══════════════════════════════════════════════════════════════╝
Access #1: L 0x00 (4 bytes)
┌────────────────────────────────────────────┐
│ Set 0: [████ t=0x00] [ ] │ ← MISS (Compulsory)
│ Set 1: [ ] [ ] │
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└────────────────────────────────────────────┘
Stats: hits=0 misses=1 rate=0.0%
Access #2: L 0x04 (4 bytes)
┌────────────────────────────────────────────┐
│ Set 0: [████ t=0x00] [ ] │
│ Set 1: [████ t=0x00] [ ] │ ← MISS (Compulsory)
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└────────────────────────────────────────────┘
Stats: hits=0 misses=2 rate=0.0%
Access #3: L 0x00 (4 bytes)
┌────────────────────────────────────────────┐
│ Set 0: [████ t=0x00] [ ] │ ← HIT!
│ Set 1: [████ t=0x00] [ ] │
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└────────────────────────────────────────────┘
Stats: hits=1 misses=2 rate=33.3%
Access #4: L 0x10 (4 bytes)
┌────────────────────────────────────────────┐
│ Set 0: [████ t=0x01] [████ t=0x00] │ ← MISS (Compulsory, new tag)
│ Set 1: [████ t=0x00] [ ] │
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└────────────────────────────────────────────┘
Stats: hits=1 misses=3 rate=25.0%
═══════════════════════════════════════════════════════════════
PATTERN ANALYSIS
═══════════════════════════════════════════════════════════════
Detected stride: 4 bytes
Spatial locality: GOOD (sequential within cache lines)
Temporal locality: MODERATE (50% reuse of cached data)
Recommendation: Current cache configuration is adequate
═══════════════════════════════════════════════════════════════
Matrix Multiply Optimization Analysis
$ ./csim --compare -s 5 -E 4 -b 6 \
-t traces/matmul_naive.trace \
-t traces/matmul_blocked.trace
Comparing cache behavior between two traces:
Cache: 32 sets, 4-way, 64-byte blocks (8 KB)
┌──────────────────────────────────────────────────────────────┐
│ COMPARISON RESULTS │
├──────────────────┬────────────────────┬──────────────────────┤
│ │ matmul_naive │ matmul_blocked │
├──────────────────┼────────────────────┼──────────────────────┤
│ Total accesses │ 2,097,152 │ 2,097,152 │
│ Hits │ 524,288 (25.0%) │ 1,835,008 (87.5%) │
│ Misses │ 1,572,864 (75.0%) │ 262,144 (12.5%) │
│ Evictions │ 1,572,832 │ 262,112 │
├──────────────────┼────────────────────┼──────────────────────┤
│ Est. cycles │ 157,810,688 │ 28,049,408 │
│ Speedup │ 1.0x (baseline) │ 5.6x faster │
└──────────────────┴────────────────────┴──────────────────────┘
Analysis: Blocking reduced misses by 83.3%
Estimated 5.6x performance improvement
Solution Architecture
High-Level Design
┌─────────────────────────────────────────────────────────────────────────────┐
│ SYSTEM ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────────────────────────────────────┐ │
│ │ │ │ CACHE SIMULATOR │ │
│ │ Trace │ │ ┌─────────────────────────────────────────┐│ │
│ │ Parser │──────▶│ │ Cache Controller ││ │
│ │ │ │ │ ┌───────────┐ ┌───────────────────┐ ││ │
│ └─────────────┘ │ │ │ Address │ │ Cache State │ ││ │
│ ▲ │ │ │ Decomposer│ │ (sets, lines) │ ││ │
│ │ │ │ └───────────┘ └───────────────────┘ ││ │
│ ┌─────────────┐ │ │ ┌───────────┐ ┌───────────────────┐ ││ │
│ │ Trace │ │ │ │ Hit/Miss │ │ Replacement │ ││ │
│ │ Files │ │ │ │ Logic │ │ Policy (LRU/FIFO) │ ││ │
│ │ (valgrind) │ │ │ └───────────┘ └───────────────────┘ ││ │
│ └─────────────┘ │ └─────────────────────────────────────────┘│ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────┐│ │
│ │ │ Statistics Collector ││ │
│ │ │ hits, misses, evictions, miss types ││ │
│ │ └─────────────────────────────────────────┘│ │
│ └─────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┴────────────────────┐ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Text Reporter │ │ ASCII │ │
│ │ (stats output) │ │ Visualizer │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Data Structures
/* Cache line structure */
typedef struct {
int valid; /* Valid bit */
unsigned long tag; /* Tag bits */
int lru_counter; /* For LRU replacement (higher = more recent) */
int load_time; /* For FIFO replacement */
int dirty; /* Dirty bit (for write-back, extension) */
} cache_line_t;
/* Cache set structure */
typedef struct {
cache_line_t *lines; /* Array of E lines */
int num_lines; /* Associativity (E) */
} cache_set_t;
/* Cache structure */
typedef struct {
cache_set_t *sets; /* Array of S sets */
int num_sets; /* S = 2^s */
int lines_per_set; /* E */
int block_size; /* B = 2^b */
/* Bit counts for address decomposition */
int s; /* Index bits */
int b; /* Offset bits */
int t; /* Tag bits = m - s - b */
/* Statistics */
int hits;
int misses;
int evictions;
/* Miss classification */
int compulsory_misses;
int capacity_misses;
int conflict_misses;
/* Global LRU counter */
int lru_clock;
/* For miss classification: track all unique blocks ever seen */
/* (Using a set/hash table in real implementation) */
} cache_t;
/* Memory access structure */
typedef struct {
char op; /* 'L' (load), 'S' (store), 'M' (modify) */
unsigned long addr; /* Memory address */
int size; /* Size of access in bytes */
} mem_access_t;
/* Access result for visualization */
typedef struct {
int hit; /* 1 if hit, 0 if miss */
int eviction; /* 1 if eviction occurred */
int set_index; /* Which set was accessed */
int line_index; /* Which line in set */
unsigned long tag; /* Tag of accessed block */
char miss_type; /* 'C' = compulsory, 'A' = capacity, 'O' = conflict */
} access_result_t;
Module Structure
cache-simulator/
├── src/
│ ├── main.c # Entry point, CLI parsing
│ ├── cache.c # Cache simulation logic
│ ├── cache.h # Cache data structures and API
│ ├── trace.c # Trace file parsing
│ ├── trace.h # Trace structures
│ ├── lru.c # LRU replacement policy
│ ├── replacement.h # Replacement policy interface
│ ├── classify.c # Miss classification logic
│ ├── stats.c # Statistics collection
│ ├── visualizer.c # ASCII visualization
│ └── visualizer.h # Visualizer API
├── tests/
│ ├── traces/ # Test trace files
│ │ ├── yi.trace
│ │ ├── dave.trace
│ │ └── trans.trace
│ ├── test_cache.c # Unit tests for cache logic
│ ├── test_lru.c # Unit tests for LRU
│ └── expected/ # Expected outputs
├── tools/
│ ├── gen_trace.py # Generate custom traces
│ └── analyze_pattern.py # Pattern analysis helper
├── Makefile
└── README.md
Core Algorithm: Cache Access
┌─────────────────────────────────────────────────────────────────────────────┐
│ CACHE ACCESS ALGORITHM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ access_cache(cache, address): │
│ ───────────────────────────── │
│ │
│ 1. DECOMPOSE ADDRESS │
│ ┌─────────────────────────────────────────┐ │
│ │ tag = address >> (s + b) │ │
│ │ set_index = (address >> b) & ((1 << s) - 1) │
│ │ offset = address & ((1 << b) - 1) │ │
│ └─────────────────────────────────────────┘ │
│ │
│ 2. SEARCH SET FOR MATCHING TAG │
│ ┌─────────────────────────────────────────┐ │
│ │ set = cache.sets[set_index] │ │
│ │ for each line in set: │ │
│ │ if line.valid AND line.tag == tag: │ │
│ │ → HIT! Update LRU, return │ │
│ └─────────────────────────────────────────┘ │
│ │
│ 3. CACHE MISS - NEED TO LOAD BLOCK │
│ ┌─────────────────────────────────────────┐ │
│ │ // First check for empty line │ │
│ │ for each line in set: │ │
│ │ if NOT line.valid: │ │
│ │ → Use this empty line │ │
│ │ → No eviction needed │ │
│ │ break │ │
│ │ │ │
│ │ // No empty line - must evict │ │
│ │ if all lines valid: │ │
│ │ victim = select_victim(set) // LRU │ │
│ │ eviction = true │ │
│ └─────────────────────────────────────────┘ │
│ │
│ 4. INSTALL NEW BLOCK │
│ ┌─────────────────────────────────────────┐ │
│ │ line.valid = true │ │
│ │ line.tag = tag │ │
│ │ line.lru_counter = cache.lru_clock++ │ │
│ └─────────────────────────────────────────┘ │
│ │
│ 5. UPDATE STATISTICS │
│ ┌─────────────────────────────────────────┐ │
│ │ if hit: cache.hits++ │ │
│ │ else: cache.misses++ │ │
│ │ if eviction: cache.evictions++ │ │
│ │ classify_miss(...) │ │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Implementation Guide
Development Environment Setup
# Required tools
sudo apt-get install gcc gdb valgrind build-essential
# Optional: valgrind for generating traces
# (Traces are provided, but you can generate your own)
# Create project structure
mkdir -p cache-simulator/{src,tests/traces,tools}
cd cache-simulator
Project Structure
cache-simulator/
├── src/
│ ├── main.c
│ ├── cache.c / cache.h
│ ├── trace.c / trace.h
│ ├── visualizer.c / visualizer.h
│ └── stats.c / stats.h
├── tests/
│ ├── traces/
│ └── test_*.c
├── Makefile
└── README.md
Implementation Phases
Phase 1: Address Decomposition (Days 1-2)
Goals:
- Implement address parsing into tag, index, offset
- Validate decomposition with known examples
Tasks:
- Create `cache.h` with basic structures
- Implement `decompose_address(addr, s, b, &tag, &index, &offset)`
- Write unit tests with hand-calculated examples
- Test edge cases (addresses at block boundaries)
Test Cases:
// Cache: s=4, b=4 (16 sets, 16-byte blocks)
// Address: 0x12345678
// Expected decomposition:
// offset = 0x8 (4 bits)
// index = 0x7 (4 bits)
// tag = 0x123456 (remaining 56 bits)
void test_decompose() {
int s = 4, b = 4;
unsigned long addr = 0x12345678;
unsigned long tag, index, offset;
decompose_address(addr, s, b, &tag, &index, &offset);
assert(offset == 0x8);
assert(index == 0x7);
assert(tag == 0x123456);
printf("Address decomposition: PASSED\n");
}
Checkpoint: Address decomposition passes all unit tests.
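One way to implement `decompose_address` so the test above passes; the shifts and masks mirror the formulas in the Cache Access Algorithm box:

```c
/* Split a 64-bit address into tag, set index, and block offset.
   s = number of index bits, b = number of offset bits. */
void decompose_address(unsigned long addr, int s, int b,
                       unsigned long *tag, unsigned long *index,
                       unsigned long *offset)
{
    *offset = addr & ((1UL << b) - 1);        /* low b bits  */
    *index  = (addr >> b) & ((1UL << s) - 1); /* next s bits */
    *tag    = addr >> (s + b);                /* the rest    */
}
```

Note the `1UL` literals: with a plain `1`, shifts of 32 or more would overflow a 32-bit `int`.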
Phase 2: Cache Data Structures (Days 3-4)
Goals:
- Implement cache initialization
- Create set and line data structures
- Implement cache state printing (for debugging)
Tasks:
- Implement `cache_t* cache_init(int s, int E, int b)`
- Implement `cache_free(cache_t *cache)`
- Implement `cache_print_state(cache_t *cache)` for debugging
- Validate memory allocation/deallocation
Implementation:
cache_t* cache_init(int s, int E, int b) {
cache_t *cache = malloc(sizeof(cache_t));
cache->s = s;
cache->b = b;
cache->num_sets = 1 << s; // 2^s
cache->lines_per_set = E;
cache->block_size = 1 << b; // 2^b
cache->t = 64 - s - b; // Assuming 64-bit addresses
// Allocate sets
cache->sets = malloc(cache->num_sets * sizeof(cache_set_t));
for (int i = 0; i < cache->num_sets; i++) {
cache->sets[i].lines = calloc(E, sizeof(cache_line_t));
cache->sets[i].num_lines = E;
}
// Initialize statistics
cache->hits = cache->misses = cache->evictions = 0;
cache->compulsory_misses = cache->capacity_misses = cache->conflict_misses = 0;
cache->lru_clock = 0;
return cache;
}
Checkpoint: Cache initializes correctly; memory tools show no leaks.
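To satisfy the leak-free checkpoint, `cache_free` should release memory in the reverse order of allocation. A minimal sketch, with the struct definitions abridged so the snippet stands alone:

```c
#include <stdlib.h>

/* Abridged versions of the structures from Key Data Structures. */
typedef struct { int valid; unsigned long tag; int lru_counter; } cache_line_t;
typedef struct { cache_line_t *lines; int num_lines; } cache_set_t;
typedef struct { cache_set_t *sets; int num_sets; } cache_t;

/* Free the per-set line arrays first, then the set array, then the
   cache struct itself -- the reverse of cache_init's allocations. */
void cache_free(cache_t *cache)
{
    if (!cache)
        return;
    for (int i = 0; i < cache->num_sets; i++)
        free(cache->sets[i].lines);
    free(cache->sets);
    free(cache);
}
```

Run the test suite under a leak checker (e.g. `valgrind --leak-check=full`) to confirm every allocation is paired with a free.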
Phase 3: Cache Access Logic (Days 5-8)
Goals:
- Implement cache lookup
- Implement cache update on miss
- Track hits, misses, evictions
Tasks:
- Implement `access_result_t cache_access(cache_t *cache, unsigned long addr)`
- Implement hit detection (tag match + valid)
- Implement miss handling (find empty line or evict)
- Update statistics on each access
- Test with simple traces
Core Implementation:
access_result_t cache_access(cache_t *cache, unsigned long addr) {
access_result_t result = {0};
unsigned long tag, set_index, offset;
decompose_address(addr, cache->s, cache->b, &tag, &set_index, &offset);
cache_set_t *set = &cache->sets[set_index];
result.set_index = set_index;
result.tag = tag;
// Check for hit
for (int i = 0; i < set->num_lines; i++) {
if (set->lines[i].valid && set->lines[i].tag == tag) {
// HIT
result.hit = 1;
result.line_index = i;
cache->hits++;
set->lines[i].lru_counter = cache->lru_clock++;
return result;
}
}
// MISS
cache->misses++;
result.hit = 0;
// Find empty line or evict
int target_line = -1;
for (int i = 0; i < set->num_lines; i++) {
if (!set->lines[i].valid) {
target_line = i;
break;
}
}
if (target_line == -1) {
// All lines full - evict LRU
target_line = find_lru_victim(set);
result.eviction = 1;
cache->evictions++;
}
// Install new block
set->lines[target_line].valid = 1;
set->lines[target_line].tag = tag;
set->lines[target_line].lru_counter = cache->lru_clock++;
result.line_index = target_line;
return result;
}
Checkpoint: Simple hand-traced examples produce correct hit/miss counts.
Phase 4: LRU Replacement (Days 9-10)
Goals:
- Implement correct LRU replacement
- Validate against known LRU behavior
Tasks:
- Implement `find_lru_victim(cache_set_t *set)`
- Ensure LRU counters update on every access (hit or miss)
- Test LRU ordering with specific sequences
LRU Test Case:
// 2-way set associative, accessing same set
// Sequence: A, B, A, C
// Expected: A, B both cached; access A updates recency;
// C evicts B (not A, because A more recent)
void test_lru() {
cache_t *cache = cache_init(1, 2, 4); // 2 sets, 2-way, 16B blocks
// All map to set 0
unsigned long A = 0x00; // tag 0
unsigned long B = 0x20; // tag 1 (same set, different tag)
unsigned long C = 0x40; // tag 2 (same set)
cache_access(cache, A); // Miss, load A
cache_access(cache, B); // Miss, load B. Set: [A, B]
cache_access(cache, A); // Hit on A. A now more recent
access_result_t r = cache_access(cache, C); // Miss, evict B (LRU)
assert(r.eviction == 1);
// Verify B was evicted, not A
assert(cache->sets[0].lines[0].tag == 0 ||
cache->sets[0].lines[1].tag == 0); // A still present
assert(cache->sets[0].lines[0].tag == 2 ||
cache->sets[0].lines[1].tag == 2); // C now present
printf("LRU replacement: PASSED\n");
cache_free(cache);
}
Checkpoint: LRU test cases pass; replacement order is correct.
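With the counter scheme used in `cache_access`, `find_lru_victim` is just a scan for the smallest `lru_counter`. A minimal sketch (struct definitions repeated from Key Data Structures so the snippet stands alone; the caller only invokes this when every line in the set is valid):

```c
/* Repeated from Key Data Structures for self-containment. */
typedef struct {
    int valid;
    unsigned long tag;
    int lru_counter;   /* higher = more recently used */
    int load_time;
    int dirty;
} cache_line_t;

typedef struct {
    cache_line_t *lines;
    int num_lines;
} cache_set_t;

/* Return the index of the least-recently-used line in the set. */
int find_lru_victim(cache_set_t *set)
{
    int victim = 0;
    for (int i = 1; i < set->num_lines; i++) {
        if (set->lines[i].lru_counter < set->lines[victim].lru_counter)
            victim = i;
    }
    return victim;
}
```

This is O(E) per eviction, which is fine for typical associativities (E ≤ 16); a doubly-linked list gives O(1) but is rarely worth the complexity here.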
Phase 5: Trace Parser (Days 11-12)
Goals:
- Parse Valgrind trace format
- Handle all operation types (L, S, M)
Tasks:
- Implement `trace_open(filename)`
- Implement `trace_next_access(trace, &access)`
- Handle 'M' as two accesses (load then store)
- Validate against provided traces
Trace Format:
I 0400d7d4,8 <- Instruction fetch (ignore or optional)
L 7ff0005c8,8 <- Data load (note leading space)
S 7ff0005d0,4 <- Data store
M 0421c7f0,4 <- Modify (load + store)
Implementation:
int trace_parse_line(char *line, mem_access_t *access) {
if (line[0] == 'I') {
return 0; // Skip instruction fetches
}
char op;
unsigned long addr;
int size;
if (sscanf(line, " %c %lx,%d", &op, &addr, &size) == 3) {
access->op = op;
access->addr = addr;
access->size = size;
return 1;
}
return 0;
}
Checkpoint: Trace parser correctly reads provided test traces.
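A streaming driver can pair the parser with `fgets` so traces of any size are processed one line at a time, never held fully in memory. A sketch (`trace_parse_line` is repeated from above for self-containment; `replay_trace` is an illustrative name, not a required API):

```c
#include <stdio.h>

typedef struct {
    char op;            /* 'L', 'S', or 'M' */
    unsigned long addr;
    int size;
} mem_access_t;

/* Returns 1 on a data access, 0 otherwise (instruction fetches
   and malformed lines are skipped). Repeated from the phase text. */
int trace_parse_line(char *line, mem_access_t *access)
{
    if (line[0] == 'I')
        return 0;
    char op;
    unsigned long addr;
    int size;
    if (sscanf(line, " %c %lx,%d", &op, &addr, &size) == 3) {
        access->op = op;
        access->addr = addr;
        access->size = size;
        return 1;
    }
    return 0;
}

/* Stream a trace file line by line. */
void replay_trace(const char *filename)
{
    FILE *fp = fopen(filename, "r");
    if (!fp) { perror(filename); return; }
    char line[256];
    mem_access_t acc;
    while (fgets(line, sizeof line, fp)) {
        if (!trace_parse_line(line, &acc))
            continue;
        /* cache_access(cache, acc.addr) would go here;
           for 'M', access twice: load, then store. */
    }
    fclose(fp);
}
```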
Phase 6: Statistics and Verbose Mode (Days 13-14)
Goals:
- Output matches reference simulator format
- Verbose mode shows per-access results
Tasks:
- Implement summary output: `hits:X misses:Y evictions:Z`
- Implement verbose output for each access
- Compare against reference output
- Fix any discrepancies
Verbose Output Format:
void print_access_verbose(mem_access_t *access, access_result_t *result) {
printf("%c %lx,%d ", access->op, access->addr, access->size);
if (result->hit) {
printf("hit ");
} else {
printf("miss ");
if (result->eviction) {
printf("eviction ");
}
}
printf("\n");
}
Checkpoint: Output matches reference for all provided traces.
Phase 7: Miss Classification (Days 15-17)
Goals:
- Classify each miss as compulsory, capacity, or conflict
- Track block history for classification
Tasks:
- Track all unique blocks ever accessed (for compulsory detection)
- Simulate a fully-associative cache of same size (for conflict detection)
- Classify: compulsory if first access; conflict if FA would hit; else capacity
Implementation Strategy:
// Maintain a fully-associative "oracle" cache for classification
typedef struct {
unsigned long *blocks; // Array of block addresses
int capacity; // Max blocks (same as main cache)
int count; // Current count
int lru_clock;
} oracle_cache_t;
char classify_miss(cache_t *cache, unsigned long block_addr,
oracle_cache_t *oracle, hash_set_t *seen_blocks) {
// Check if block was ever seen before
if (!hash_set_contains(seen_blocks, block_addr)) {
hash_set_add(seen_blocks, block_addr);
return 'C'; // Compulsory
}
// Check if fully-associative cache would have hit
if (oracle_access(oracle, block_addr)) {
return 'O'; // Conflict (would have hit with full associativity)
}
return 'A'; // Capacity (even FA cache missed)
}
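A sketch of `oracle_access` itself. Note one assumption: it adds a `last_used` array, which the struct shown above omits, so the oracle can evict its own LRU block when full:

```c
/* Fully-associative "oracle" cache for miss classification.
   Extends the struct above with a last_used array (an assumption;
   some recency store is needed for LRU eviction). */
typedef struct {
    unsigned long *blocks;   /* resident block addresses          */
    int *last_used;          /* LRU timestamps, parallel to blocks */
    int capacity;            /* total lines in the real cache (S*E) */
    int count;
    int lru_clock;
} oracle_cache_t;

/* Returns 1 if block_addr hits; on a miss, installs the block,
   evicting the least-recently-used entry if the oracle is full. */
int oracle_access(oracle_cache_t *o, unsigned long block_addr)
{
    o->lru_clock++;
    for (int i = 0; i < o->count; i++) {
        if (o->blocks[i] == block_addr) {       /* hit */
            o->last_used[i] = o->lru_clock;
            return 1;
        }
    }
    if (o->count < o->capacity) {               /* room left: fill */
        o->blocks[o->count] = block_addr;
        o->last_used[o->count] = o->lru_clock;
        o->count++;
        return 0;
    }
    int victim = 0;                             /* full: evict LRU */
    for (int i = 1; i < o->capacity; i++)
        if (o->last_used[i] < o->last_used[victim])
            victim = i;
    o->blocks[victim] = block_addr;
    o->last_used[victim] = o->lru_clock;
    return 0;
}
```

The linear scans are O(capacity) per access; for large caches, replace them with a hash map keyed by block address.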
Checkpoint: Miss classification matches expected distribution for test traces.
Phase 8: ASCII Visualizer (Days 18-21)
Goals:
- Display cache state visually
- Show access patterns and miss types
Tasks:
- Design ASCII cache representation
- Implement `visualize_state(cache, access, result)`
- Add pattern detection (stride, sequential)
- Implement running statistics display
Visualization Implementation:
void visualize_state(cache_t *cache, int access_num,
mem_access_t *access, access_result_t *result) {
printf("\nAccess #%d: %c 0x%lx [%s%s]\n",
access_num, access->op, access->addr,
result->hit ? "HIT" : "MISS",
result->hit ? "" : get_miss_type_str(result->miss_type)); /* e.g. " (compulsory)" */
printf("┌");
for (int i = 0; i < 40; i++) printf("─");
printf("┐\n");
for (int s = 0; s < cache->num_sets; s++) {
printf("│ Set %d: ", s);
for (int l = 0; l < cache->lines_per_set; l++) {
cache_line_t *line = &cache->sets[s].lines[l];
if (line->valid) {
printf("[%04lx] ", line->tag & 0xFFFF);
} else {
printf("[ ] ");
}
}
if (s == result->set_index) {
printf(" ← %s", result->hit ? "HIT" : "accessed");
}
printf("\n");
}
printf("└");
for (int i = 0; i < 40; i++) printf("─");
printf("┘\n");
float hit_rate = (float)cache->hits / (cache->hits + cache->misses) * 100;
printf("Running: hits=%d misses=%d (%.1f%% hit rate)\n",
cache->hits, cache->misses, hit_rate);
}
Checkpoint: Visualizer shows correct cache state and highlights accessed sets.
Testing Strategy
Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Test individual components | Address decomposition, LRU logic |
| Trace Tests | Validate against known traces | Match reference simulator output |
| Edge Cases | Handle unusual inputs | Empty trace, single access, full cache |
| Stress Tests | Performance and correctness at scale | Million-access traces |
Test Traces
Create traces for specific scenarios:
trace1_sequential.trace - Sequential access, good spatial locality:
L 0,4
L 4,4
L 8,4
L c,4
L 10,4
L 14,4
trace2_strided.trace - Strided access, poor spatial locality:
L 0,4
L 40,4
L 80,4
L c0,4
L 100,4
trace3_conflict.trace - Designed to cause conflict misses:
# For direct-mapped cache with 4 sets (s=2), 16-byte blocks (b=4)
# These all map to set 0
L 0,4 # Set 0, tag 0
L 40,4 # Set 0, tag 1 (evicts previous)
L 0,4 # Set 0, tag 0 (conflict miss!)
L 40,4 # Set 0, tag 1 (conflict miss!)
trace4_reuse.trace - Temporal locality test:
L 0,4
L 10,4
L 0,4 # Should be hit
L 10,4 # Should be hit
L 0,4 # Should be hit
Reference Validator
#!/bin/bash
# validate.sh - Compare against reference simulator
TRACES="yi dave trans simple"
CONFIGS="1-1-1 4-2-4 2-4-3" # s-E-b configurations
for trace in $TRACES; do
for config in $CONFIGS; do
s=$(echo $config | cut -d- -f1)
E=$(echo $config | cut -d- -f2)
b=$(echo $config | cut -d- -f3)
expected=$(./csim-ref -s $s -E $E -b $b -t traces/${trace}.trace)
actual=$(./csim -s $s -E $E -b $b -t traces/${trace}.trace)
if [ "$expected" == "$actual" ]; then
echo "PASS: $trace with s=$s E=$E b=$b"
else
echo "FAIL: $trace with s=$s E=$E b=$b"
echo " Expected: $expected"
echo " Actual: $actual"
fi
done
done
Critical Test Cases
- Empty cache hit detection: First access is always a miss
- LRU correctness: Verify least-recently-used block is evicted
- Set isolation: Accesses to different sets don’t affect each other
- Block boundary: Adjacent addresses in same block share cache line
- Tag uniqueness: Same index, different tags are different blocks
- Modify operation: ‘M’ counts as one potential miss but two potential hits
Common Pitfalls
Implementation Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Off-by-one in bit extraction | Wrong set/tag | Use (1 << n) - 1 for n-bit masks |
| Forgetting to update LRU on hit | Wrong eviction | Always update LRU counter on any access |
| Not handling ‘M’ operations | Miscount hits | ‘M’ = load + store; can hit twice |
| Signed vs unsigned addresses | Negative set indices | Use unsigned long for addresses |
| Wrong mask calculation | Tag/index overlap | Verify: tag_bits + index_bits + offset_bits = address_bits |
| Evicting wrong line | Incorrect cache state | Verify LRU logic with small examples |
Debugging Strategies
// Debug mode: print every decomposition
#ifdef DEBUG
printf("addr=0x%lx -> tag=0x%lx set=%lu offset=%lu\n",
addr, tag, set_index, offset);
#endif
// Sanity checks
assert(set_index < cache->num_sets);
assert(offset < cache->block_size);
assert(tag < (1UL << cache->t));
Performance Traps
- Large traces: Don’t store all accesses in memory; process streaming
- Hash table for seen blocks: Don’t use linear search for millions of blocks
- Visualization overhead: Only visualize in debug mode or with sampling
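For the "hash table for seen blocks" trap above, even a minimal open-addressing set beats linear search. A sketch with deliberate simplifications (fixed power-of-two capacity, linear probing, no deletion; a real implementation should grow the table as it fills):

```c
#include <stdlib.h>

typedef struct {
    unsigned long *keys;
    char *occupied;
    unsigned long mask;      /* capacity - 1 */
} hash_set_t;

hash_set_t *hash_set_new(unsigned long capacity_pow2)
{
    hash_set_t *h = malloc(sizeof(*h));
    h->keys = calloc(capacity_pow2, sizeof(unsigned long));
    h->occupied = calloc(capacity_pow2, 1);
    h->mask = capacity_pow2 - 1;
    return h;
}

/* Mix the bits so nearby block addresses spread across buckets. */
static unsigned long hash_mix(unsigned long k)
{
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdUL;
    k ^= k >> 33;
    return k;
}

int hash_set_contains(hash_set_t *h, unsigned long key)
{
    unsigned long i = hash_mix(key) & h->mask;
    while (h->occupied[i]) {
        if (h->keys[i] == key) return 1;
        i = (i + 1) & h->mask;   /* linear probing */
    }
    return 0;
}

void hash_set_add(hash_set_t *h, unsigned long key)
{
    unsigned long i = hash_mix(key) & h->mask;
    while (h->occupied[i]) {
        if (h->keys[i] == key) return;   /* already present */
        i = (i + 1) & h->mask;
    }
    h->occupied[i] = 1;
    h->keys[i] = key;
}
```

These are the `hash_set_contains`/`hash_set_add` calls the miss-classification sketch assumes.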
Extensions
Beginner Extensions
- FIFO replacement policy: Simpler than LRU
- Random replacement: Even simpler, good baseline
- Write operations: Track stores separately from loads
- Command-line help: Proper usage message with `-h`
Intermediate Extensions
- Multi-level cache: Simulate L1 + L2; misses in L1 go to L2
- Inclusive vs exclusive: Different multi-level policies
- Write-back simulation: Track dirty bits, count writebacks
- Trace generation: Create traces from actual program runs
- Block-level statistics: Per-block hit/miss counts
Advanced Extensions
- Prefetching simulation: Next-line, stride, or correlation prefetch
- TLB simulation: Combine with page table simulation
- Cache coherence: Simulate MESI protocol for multi-core
- Real code optimization: Use simulator to guide actual code changes
- GUI visualizer: Web-based or ncurses interface
Multi-Level Cache Example
┌─────────────────────────────────────────────────────────────────────────────┐
│ MULTI-LEVEL CACHE SIMULATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Typical modern hierarchy: │
│ │
│ ┌─────────────┐ │
│ │ CPU │ │
│ └──────┬──────┘ │
│ ▼ │
│ ┌─────────────┐ ~1ns, 32KB, 8-way │
│ │ L1 Cache │──────────────────────────┐ │
│ └──────┬──────┘ │ L1 miss │
│ ▼ ▼ │
│ ┌─────────────┐ ~4ns, 256KB, 8-way │
│ │ L2 Cache │──────────────────────────┐ │
│ └──────┬──────┘ │ L2 miss │
│ ▼ ▼ │
│ ┌─────────────┐ ~20ns, 8MB, 16-way │
│ │ L3 Cache │──────────────────────────┐ │
│ └──────┬──────┘ │ L3 miss │
│ ▼ ▼ │
│ ┌─────────────┐ ~100ns │
│ │ DRAM │ │
│ └─────────────┘ │
│ │
│ Simulation approach: │
│ ────────────────── │
│ 1. Try L1: if hit, done │
│ 2. If L1 miss, try L2: if hit, install in L1 │
│ 3. If L2 miss, try L3: if hit, install in L1 and L2 │
│ 4. If L3 miss, fetch from memory, install everywhere │
│ │
│ // Pseudocode │
│ result = access_cache(L1, addr); │
│ if (!result.hit) { │
│ result = access_cache(L2, addr); │
│ if (!result.hit) { │
│ result = access_cache(L3, addr); │
│ if (!result.hit) { │
│ // Fetch from memory │
│ } │
│ install_block(L2, addr); │
│ } │
│ install_block(L1, addr); │
│ } │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
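The per-level latencies above plug directly into the average memory access time formula, AMAT = hit_time + miss_rate x miss_penalty, applied recursively down the hierarchy. A sketch for two levels (the figures in the test are illustrative, not measured):

```c
/* AMAT for a two-level hierarchy. Latencies in cycles; miss rates
   in [0,1]. The L2 term is itself the miss penalty seen by L1. */
double amat_two_level(double l1_hit, double l1_miss_rate,
                      double l2_hit, double l2_miss_rate,
                      double mem_latency)
{
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_latency);
}
```

Extending to three levels just nests one more term; a multi-level simulator can report AMAT directly from its per-level hit/miss counters.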
Real-World Connections
Performance Engineering
Your cache simulator skills directly apply to:
- Database systems: Query planners consider cache behavior
- Game engines: Data-oriented design for cache efficiency
- Scientific computing: Blocking for matrix operations
- Embedded systems: Understanding tight memory constraints
Tools That Use These Concepts
| Tool | Purpose | Cache Relevance |
|---|---|---|
| Valgrind/Cachegrind | Cache profiling | Uses simulation like yours |
| perf | Hardware counters | Measures real cache events |
| Intel VTune | Performance analysis | Cache miss attribution |
| Compiler (-O3) | Optimization | Loop blocking, prefetching |
Interview Applications
Common questions your project prepares you for:
- “Why is row-major traversal faster than column-major for matrices?”
- “How would you optimize a function with poor cache behavior?”
- “Explain the tradeoffs in cache associativity.”
- “What’s the difference between L1 and L3 cache?”
- “How do you measure cache performance?”
Modern CPU Cache Parameters
For reference, typical modern CPU cache configurations:
| Level | Size | Associativity | Block Size | Latency |
|---|---|---|---|---|
| L1D | 32-48 KB | 8-12 way | 64 bytes | 4-5 cycles |
| L1I | 32 KB | 8 way | 64 bytes | 4-5 cycles |
| L2 | 256-512 KB | 4-8 way | 64 bytes | 12-14 cycles |
| L3 | 8-64 MB | 12-16 way | 64 bytes | 40-75 cycles |
Resources
Essential Reading
- CS:APP Chapter 6: “The Memory Hierarchy” - Core material
- CS:APP Chapter 5: “Optimizing Program Performance” - Measurement discipline
- “What Every Programmer Should Know About Memory” by Ulrich Drepper - Deep dive
Tools
- Valgrind/Cachegrind: `valgrind --tool=cachegrind ./program`
- perf: `perf stat -e cache-references,cache-misses ./program`
- PAPI: Performance API for hardware counters
Trace Generation
# Generate trace with Valgrind lackey
valgrind --tool=lackey --trace-mem=yes ./your_program 2>&1 | \
  grep -E '^ [LSM]' > trace.trace   # keep loads, stores, AND modifies
Related Projects
- Previous: P8 (Performance Clinic) for measurement methodology
- Next: P13 (VM Map Visualizer) extends memory hierarchy understanding
- Related: P14 (Malloc) uses locality principles
Online Resources
- Cache Simulator Visualization
- Interactive Cache Demo
- CS:APP Cache Lab materials at CMU
Self-Assessment Checklist
Understanding
- I can explain address decomposition (tag/index/offset) without notes
- I can draw a set-associative cache and trace an access through it
- I understand why spatial locality works (blocks, not bytes)
- I can classify a miss as compulsory, capacity, or conflict
- I can explain LRU replacement step-by-step
- I understand why blocking improves matrix multiply performance
Implementation
- Simulator correctly handles all trace operations (L, S, M)
- Hit/miss/eviction counts match reference for all test traces
- LRU replacement is implemented correctly
- Address decomposition is verified with manual calculations
- Visualizer shows clear cache state representation
Application
- I can predict relative cache performance of two code variants
- I used the simulator to measure a real optimization’s effect
- I can explain a miss-rate reduction in terms of locality principles
- I can choose appropriate cache parameters for a simulation
Growth
- I understand how to extend to multi-level caches
- I know how real tools (Cachegrind, perf) relate to my simulator
- I can apply cache knowledge to optimize real programs
Completion Criteria
Minimum Viable Completion
- Direct-mapped and set-associative simulation works
- Address decomposition is correct
- Hit/miss/eviction counting matches reference
- Processes provided trace files
- Basic text output works
Full Completion
- LRU replacement is correct
- All trace formats handled (L, S, M)
- Miss classification (compulsory/capacity/conflict)
- Verbose mode shows per-access details
- ASCII visualizer displays cache state
- All reference traces match expected output
Excellence (Going Above & Beyond)
- Multi-level cache simulation
- Multiple replacement policies (LRU, FIFO, Random)
- Prefetching simulation
- Real program optimization demonstrated with simulator
- Interactive visualization or web interface
The Core Question You’re Answering
“How does the memory hierarchy create the illusion of a large, fast memory, and how can I write code that exploits this illusion?”
This project forces you to deeply understand that CPU performance is dominated by memory access patterns. A cache miss can cost 100+ CPU cycles while a hit costs only 1-4 cycles. By building a simulator, you internalize how your data layout and access patterns directly translate to cache behavior - knowledge that separates average programmers from performance-conscious systems engineers.
Concepts You Must Understand First
Before starting this project, ensure you have a solid grasp of these foundational concepts:
| Concept | Where to Learn | Why It Matters |
|---|---|---|
| Binary/Hexadecimal representation | CS:APP 2.1 | Address decomposition requires bit manipulation |
| Bitwise operations (AND, OR, shifts) | CS:APP 2.1.4 | Extracting tag/index/offset from addresses |
| Memory addressing | CS:APP 6.1 | Understanding how addresses map to physical memory |
| Locality of reference | CS:APP 6.2 | Why caches work - temporal and spatial locality |
| C pointers and dynamic allocation | CS:APP 2.1, K&R Ch. 5 | Building cache data structures |
| Structs and arrays | K&R Ch. 6 | Representing cache organization |
| File I/O in C | K&R Ch. 7 | Parsing trace files |
| Command-line argument parsing | K&R 5.10 | Handling -s, -E, -b flags |
Questions to Guide Your Design
Work through these questions before writing any code:
1. Address Decomposition: Given a 64-bit address, s=4, and b=6, write out the formula to extract tag, set index, and block offset. Try it with address 0x7FF000A3C8.
2. Data Structure Choice: How will you represent a cache line? A set? The entire cache? Why might a 2D array work for some configurations but not others?
3. LRU Tracking: How will you track access recency for LRU replacement? Counter per line? Linked list? What's the time complexity of each approach?
4. Miss Classification: To classify a miss as conflict vs capacity, you need to know if a fully-associative cache of the same size would have hit. How will you simulate this efficiently?
5. Modify Operations: A trace line `M 7ff000a3c8,4` represents a load followed by a store. How do you handle this? Can both the load and store be hits?
6. Block Boundaries: If a load accesses bytes 60-67, straddling a 64-byte block boundary, what happens? Do you need to track byte-level access or just block-level?
Thinking Exercise
Before coding, hand-trace this sequence through a cache with s=1 (2 sets), E=2 (2-way), b=2 (4-byte blocks):
L 0x00 # Load from address 0
L 0x04 # Load from address 4
L 0x08 # Load from address 8
L 0x00 # Load from address 0 again
L 0x10 # Load from address 16
L 0x00 # Load from address 0 again
For each access, determine:
- Which set is accessed?
- Is it a hit or miss?
- If miss, is it compulsory, capacity, or conflict?
- Which line is evicted (if any)?
- What is the final cache state?
Show your work before looking at the solution. Draw the cache state after each access.
Solution (click to expand)
Address decomposition for this cache:
- Block offset: bits [1:0] (2 bits)
- Set index: bit [2] (1 bit, so 2 sets)
- Tag: bits [63:3]
| Access | Address | Set | Tag | Result | Cache State After |
|---|---|---|---|---|---|
| L 0x00 | 0b000 | 0 | 0 | MISS (compulsory) | Set0: [tag=0, _] |
| L 0x04 | 0b100 | 1 | 0 | MISS (compulsory) | Set0: [tag=0, _] Set1: [tag=0, _] |
| L 0x08 | 0b1000 | 0 | 1 | MISS (compulsory) | Set0: [tag=0, tag=1] Set1: [tag=0, _] |
| L 0x00 | 0b000 | 0 | 0 | HIT | Set0: [tag=0, tag=1] |
| L 0x10 | 0b10000 | 0 | 2 | MISS (compulsory) | Set0: [tag=0, tag=2] (evicted tag=1) |
| L 0x00 | 0b000 | 0 | 0 | HIT | Set0: [tag=0, tag=2] |
Final: hits=2, misses=4, evictions=1
Hints in Layers
Use these hints progressively if you get stuck. Try to solve problems yourself before advancing to the next hint level.
Hint Layer 1: Getting Started
- Start with address decomposition. Write a function that takes an address and prints tag/index/offset. Test it extensively before moving on.
- For cache state, a 2D array `cache_line_t cache[S][E]` works well for small caches.
- Parse traces with `fscanf(fp, " %c %lx,%d", &op, &addr, &size)`.
Hint Layer 2: Core Logic
- Hit detection: For each line in `set[index]`, check `valid && (tag == line.tag)`.
- For LRU, a simple counter works: increment a global counter on each access and store it in the accessed line. The victim is the line with the smallest counter.
- Handle ‘M’ as two separate accesses: a load (which may miss) followed by a store (which always hits if the load brought in the block).
Hint Layer 3: Miss Classification
- Maintain a hash set of all block addresses ever seen. First access to a block is compulsory.
- Simulate a fully-associative cache of the same total size in parallel. If the FA cache hits but your cache misses, it’s a conflict miss.
- Everything else is a capacity miss.
Hint Layer 4: Common Bugs
- Off-by-one in bit masks: use `(1UL << n) - 1`, not `(1 << n) - 1` (the `UL` matters for 64-bit).
- Forgetting to update the LRU counter on hits (not just misses).
- Using `int` instead of `unsigned long` for addresses (signed arithmetic bugs).
- Not handling the case where a set has empty lines (no eviction needed).
The Interview Questions They’ll Ask
After completing this project, you should be able to confidently answer these questions:
- “Explain the difference between direct-mapped, set-associative, and fully-associative caches. What are the tradeoffs?”
- Expect follow-ups about hardware cost, hit latency, and miss rate
- “Why does row-major traversal of a matrix outperform column-major traversal in C?”
- Must connect to cache line size and spatial locality
- “What is cache thrashing and how would you detect and fix it in code?”
- Discuss conflict misses, power-of-2 strides, and padding
- “Walk me through exactly what happens when the CPU accesses memory at address X.”
- Cover TLB, address decomposition, tag comparison, miss handling
- “How does blocking (loop tiling) improve matrix multiplication performance?”
- Quantify: 3 blocks must fit in cache, calculate optimal block size
- “You have a program with 90% cache miss rate. How do you diagnose and fix it?”
- Discuss Cachegrind, perf, miss classification, data layout changes
Books That Will Help
| Topic | Book | Specific Chapters |
|---|---|---|
| Cache organization fundamentals | CS:APP (3rd ed.) | Chapter 6.1-6.4 |
| Locality and optimization | CS:APP (3rd ed.) | Chapter 6.5-6.6 |
| Performance measurement | CS:APP (3rd ed.) | Chapter 5.1-5.7 |
| Memory hierarchy design | Patterson & Hennessy: Computer Architecture | Chapter 5 |
| Advanced caching | Ulrich Drepper: “What Every Programmer Should Know About Memory” | Sections 3.1-3.3 |
| C bit manipulation | K&R: The C Programming Language | Chapter 2.9 |
| File I/O in C | K&R: The C Programming Language | Chapter 7.5-7.7 |
| Cache-oblivious algorithms | Introduction to Algorithms (CLRS) | Chapter 27 |
This guide was expanded from CSAPP_3E_DEEP_LEARNING_PROJECTS.md. For the complete learning path, see the project index.