Project 9: Cache Lab++ -- Cache Simulator + Locality Visualizer
Build a set-associative cache simulator and an ASCII locality visualizer that reveals hit/miss patterns, enabling you to optimize code through informed data layout decisions.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 2-3 weeks |
| Language | C (Alternatives: Rust, Zig, C++) |
| Prerequisites | Projects 2 and 8 recommended |
| Key Topics | Cache organization, locality, miss classification, replacement policies |
| CS:APP Chapters | 6 (primary), 5 (secondary) |
Table of Contents
- Learning Objectives
- Deep Theoretical Foundation
- Project Specification
- Solution Architecture
- Implementation Guide
- Testing Strategy
- Common Pitfalls
- Extensions
- Real-World Connections
- Resources
- Self-Assessment Checklist
1. Learning Objectives
By completing this project, you will:
- Understand cache organization: Master direct-mapped, set-associative, and fully-associative cache structures
- Implement address decomposition: Parse addresses into tag, index, and offset components
- Classify cache misses: Distinguish compulsory, capacity, and conflict misses through simulation
- Apply replacement policies: Implement LRU, FIFO, and random replacement strategies
- Visualize memory access patterns: Create ASCII visualizations showing cache behavior over time
- Optimize for locality: Design data layouts and access patterns that exploit cache behavior
- Measure and analyze: Use your simulator to quantify cache performance and validate optimizations
2. Deep Theoretical Foundation
2.1 The Memory Hierarchy: Why Caches Exist
Modern processors operate orders of magnitude faster than main memory. Without caches, the CPU would spend most of its time waiting for data:
MEMORY HIERARCHY OVERVIEW

  Component        Size          Access Time   Bandwidth
  ------------------------------------------------------
  CPU Registers    ~1 KB         ~0.25 ns      100s of GB/s
  L1 Cache         32-64 KB      ~1 ns         ~500 GB/s
  L2 Cache         256 KB-1 MB   ~3-4 ns       ~200 GB/s
  L3 Cache         8-64 MB       ~10-20 ns     ~100 GB/s
  Main Memory      16-256 GB     ~50-100 ns    ~50 GB/s
  SSD              256 GB-4 TB   ~100 us       ~5 GB/s
  HDD              1-20 TB       ~10 ms        ~200 MB/s

  The levels form a pyramid: registers at the top (fastest, smallest),
  then the L1, L2, and L3 caches, main memory, and disk at the bottom
  (slowest, largest).

  Key insight: smaller = faster (physics + economics).
Why this hierarchy works: Programs exhibit locality of reference:
- Temporal locality: Recently accessed items are likely to be accessed again soon
- Spatial locality: Items near recently accessed items are likely to be accessed soon
Caches exploit both: they keep recently-used data (temporal) in cache lines that hold multiple adjacent bytes (spatial).
2.2 Cache Organization Fundamentals
A cache is organized as a collection of sets, each containing one or more lines (also called blocks):
CACHE ORGANIZATION ANATOMY

  Cache Parameters:
    S = number of sets                 (S = 2^s, where s = index bits)
    E = lines per set (associativity)  (E = 1: direct-mapped)
    B = bytes per line (block size)    (B = 2^b, where b = offset bits)

    Total cache size: C = S x E x B bytes

  Example: 4 sets, 2 lines/set, 4 bytes/line (32-byte, 2-way set-assoc)

    Set 0 | Line 0: [V][Tag: 0x1A][B0][B1][B2][B3]
          | Line 1: [V][Tag: 0x2F][B0][B1][B2][B3]
    Set 1 | Line 0: [V][Tag: 0x05][B0][B1][B2][B3]
          | Line 1: [ ][Tag: ---][--][--][--][--]   (invalid)
    Set 2 | Line 0: [V][Tag: 0x12][B0][B1][B2][B3]
          | Line 1: [V][Tag: 0x77][B0][B1][B2][B3]
    Set 3 | Line 0: [V][Tag: 0x00][B0][B1][B2][B3]
          | Line 1: [V][Tag: 0x33][B0][B1][B2][B3]

  Each line contains:
    - Valid bit (V): is this line holding valid data?
    - Tag: high bits of the address identifying which block is stored
    - Data block: B bytes of actual cached data
    - (Optional) Dirty bit: has this block been modified? (for write-back)
    - (Optional) LRU bits: track access recency for replacement
2.3 Cache Types by Associativity
ASSOCIATIVITY SPECTRUM

  DIRECT-MAPPED (E = 1)
    Each memory block maps to exactly ONE cache line.

    Pros: simple, fast lookup (no searching)
    Cons: high conflict miss rate

    Memory: [A][B][C][D][E][F][G][H][I][J][K][L]...
    Cache:  [0] [1] [2] [3]   <- A, E, I all compete for line 0!

  SET-ASSOCIATIVE (1 < E < C/B)
    Each block maps to a SET and can go in any line within that set.

    Pros: reduces conflict misses; reasonable lookup speed
    Cons: more complex; requires a replacement policy

    2-way set-associative example:
      Set 0: [_][_]   <- A and E can BOTH live here now!
      Set 1: [_][_]   <- B and F can both live here

    Common configurations: 2-way, 4-way, 8-way

  FULLY-ASSOCIATIVE (E = C/B, S = 1)
    Any block can go in ANY cache line. One giant set.

    Pros: no conflict misses (only compulsory and capacity)
    Cons: slow lookup (must search all lines); complex hardware

    Cache: [_][_][_][_][_][_][_][_]   <- any block can go anywhere

    Used mainly for TLBs and very small caches
2.4 Address Decomposition: Tag, Index, Offset
Every memory address is decomposed into three fields that determine cache behavior:
ADDRESS DECOMPOSITION

  For a cache with S = 2^s sets and B = 2^b bytes per block, an m-bit
  address splits into three fields:

    |        Tag        |  Index   |  Offset  |
    |     (t bits)      | (s bits) | (b bits) |

    where t = m - s - b

  EXAMPLE: 32-bit address, 16 sets, 64 bytes/block

    S = 16 = 2^4  ->  s = 4 index bits
    B = 64 = 2^6  ->  b = 6 offset bits
    t = 32 - 4 - 6 = 22 tag bits

    Address 0x12345678:
      Binary: 0001 0010 0011 0100 0101 0110 0111 1000

      | 0001001000110100010101 | 1001 | 111000 |
      |     Tag (22 bits)      |Index | Offset |
      |      = 0x048D15        | = 9  |  = 56  |

    Interpretation:
      - Look in set 9
      - Compare tag 0x048D15 against all lines in set 9
      - If a tag matches AND the valid bit is set: HIT! Return the byte
        at offset 56
      - If no match: MISS. Fetch the block, store it with this tag

  SPECIAL CASES:

    Direct-mapped (E=1, S=C/B):
      - Maximum index bits
      - Each block maps to exactly one place

    Fully-associative (S=1):
      - No index bits (s=0)
      - Tag is almost the entire address
      - Must compare the tag against ALL lines
2.5 The Cache Line (Block) Concept
A cache line is the fundamental unit of transfer between cache and memory:
CACHE LINE CONCEPT

  When the CPU requests 1 byte, the cache fetches an ENTIRE BLOCK.

  CPU requests address 0x1004:

    Memory: ...|0x1000|0x1001|0x1002|0x1003|0x1004|0x1005|0x1006|0x1007|...
               |  A   |  B   |  C   |  D   | [E]  |  F   |  G   |  H   |

    The entire 8-byte block is fetched into a cache line:

    Cache line: | V=1 | Tag | A | B | C | D | E | F | G | H |
                                              ^
                                      requested byte

  WHY BLOCKS?
    1. DRAM is optimized for sequential access (row buffers)
    2. Exploits SPATIAL LOCALITY: if you access E, you will probably
       soon access F, G, H... and they are now already cached!
    3. Amortizes the cost of the miss over multiple bytes

  BLOCK SIZE TRADEOFFS:
    Larger blocks:
    + Better spatial locality exploitation
    + Fewer tags (less overhead)
    - Longer miss penalty (more bytes to transfer)
    - Fewer blocks for a given cache size -> more conflict misses
    - May fetch unused data (if locality is poor)

    Typical sizes: 32, 64, or 128 bytes
2.6 Miss Types: The Three Cs
Understanding WHY a miss occurred is crucial for optimization:
THE THREE Cs OF CACHE MISSES

  1. COMPULSORY MISSES (cold misses)
     First access to a block that has never been in the cache.

     Cause: the block simply hasn't been loaded yet
     Fix:   cannot be eliminated! (data must be loaded at least once);
            can be reduced with prefetching

     Example:
       for (i = 0; i < N; i++)
           sum += A[i];   // first access to each cache line is compulsory

  2. CAPACITY MISSES
     The working set is larger than the cache.

     Cause: too much data for the cache to hold
     Fix:   reduce the working set size (blocking/tiling),
            or get a bigger cache (hardware solution)

     Example (cache holds 4 blocks):
       // working set = 8 blocks (A[0..7])
       for (pass = 0; pass < 100; pass++)
           for (i = 0; i < 8; i++)
               sum += A[i];   // every pass evicts earlier blocks

     Access pattern: A0,A1,A2,A3,A4,A5,A6,A7, A0(miss!),A1(miss!)...

  3. CONFLICT MISSES
     A block is evicted because other blocks map to the same set.
     (The miss would NOT have occurred in a fully-associative cache
     of the same size.)

     Cause: limited associativity + unfortunate address mapping
     Fix:   increase associativity; change the data layout; padding

     Example (direct-mapped, 4 sets; A[0] and A[4] map to the same set):
       for (i = 0; i < 100; i++) {
           sum += A[0];   // miss, load A[0]
           sum += A[4];   // miss, evicts A[0], load A[4]
           sum += A[0];   // miss! A[0] was evicted by A[4]
           sum += A[4];   // miss! A[4] was evicted by A[0]
       }
       // every access is a miss! This is "cache thrashing"

  IDENTIFYING THE MISS TYPE IN SIMULATION:

    if (block never seen before):
        -> COMPULSORY
    else if (would hit in a fully-associative cache of the same size):
        -> CONFLICT (evicted due to a set collision)
    else:
        -> CAPACITY (evicted because the overall cache was full)
2.7 Replacement Policies
When a set is full and we need to load a new block, which victim do we evict?
REPLACEMENT POLICIES

  1. LRU (Least Recently Used)
     Evict the block that hasn't been accessed for the longest time.

     Rationale: temporal locality - recently used items are likely
                to be used again
     Implementation: track access order per set

     Example (2-way set, LRU policy), access sequence A, B, A, C:
       After A: [A*, _ ]   (* = most recent)
       After B: [A , B*]
       After A: [A*, B ]   (A touched, now most recent)
       After C: [A , C*]   (evict B, the LRU line; C takes its place)

     Pros: often the best hit rate; matches the locality assumption
     Cons: hardware complexity at high associativity;
           O(E) state per set; must update on every access

  2. FIFO (First In, First Out)
     Evict the block that was loaded earliest.

     Rationale: simple approximation; older data less likely needed
     Implementation: circular pointer per set

     Example (2-way set, FIFO policy), access sequence A, B, A, C:
       After A: [A, _]   pointer->0
       After B: [A, B]   pointer->0
       After A: [A, B]   (hit, pointer unchanged)
       After C: [C, B]   evict A (oldest), pointer->1

     Note: A was evicted even though it was just accessed - FIFO
     ignores recency. (FIFO is also famously subject to Belady's
     anomaly: enlarging the cache can actually increase misses.)

     Pros: simple to implement; O(1) state per set
     Cons: ignores recency; can have a poor hit rate

  3. RANDOM
     Pick a random victim from the set.

     Rationale: unpredictable; avoids pathological worst cases
     Implementation: random number generator

     Pros: simplest hardware; no state needed; avoids worst cases
     Cons: unpredictable; may evict hot blocks

  4. LFU (Least Frequently Used)
     Evict the block with the fewest accesses.

     Pros: good for some access patterns
     Cons: doesn't adapt to changing access patterns;
           blocks that were hot early can linger long after they go cold

  COMPARISON (typical hit rates):

    LRU >= Pseudo-LRU > FIFO > Random

  In practice, modern CPUs use tree-based Pseudo-LRU approximations
  of LRU for hardware efficiency.
2.8 Write Policies
What happens when the CPU writes to a cached block?
WRITE POLICIES

  TWO INDEPENDENT DECISIONS:

  1. WRITE HIT: what to do when we write to a block already in cache?

     WRITE-THROUGH                     WRITE-BACK
     Write to cache AND memory         Write to cache only;
     immediately                       mark the block "dirty";
                                       write to memory only on eviction

     + Memory always consistent       + Fewer memory writes
     + Simple                         + Better performance
     - High memory traffic            - Complex (track dirty bits)
     - Slow (wait for memory)         - Memory inconsistent until evict

  2. WRITE MISS: what to do when we write to a block NOT in cache?

     WRITE-ALLOCATE                    NO-WRITE-ALLOCATE
     Load the block into the cache,    Write directly to memory;
     then write to the cache           don't load into the cache

     + Good if writing then reading   + Saves cache space for reads
     + Exploits spatial locality      - No locality benefit
     - Extra memory read on a miss

  COMMON COMBINATIONS:

    Write-Through + No-Write-Allocate
      - Writes always go to memory; reads populate the cache
      - Used in some L1 caches

    Write-Back + Write-Allocate (most common today)
      - Writes stay in the cache until eviction
      - Exploits temporal locality on writes
      - Used in most L2/L3 caches

  FOR THIS PROJECT:
    We focus on READ behavior. Write policies are a natural extension.
2.9 Spatial and Temporal Locality
These are the principles that make caching effective:
LOCALITY PRINCIPLES

  TEMPORAL LOCALITY
    "If you accessed it recently, you'll probably access it again soon."

    Good temporal locality:            Poor temporal locality:
      sum = 0;                           // process each element once
      for (i = 0; i < 1000; i++) {       for (i = 0; i < N; i++)
          sum += A[i % 10];  // reuse!       process(A[i]);  // no reuse
      }

    Access pattern: 0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,...
    Each of the 10 elements is accessed 100 times -> ~99% of accesses
    are served from the cache!

  SPATIAL LOCALITY
    "If you accessed location X, you'll probably access nearby locations."

    Good spatial locality:             Poor spatial locality:
      // sequential access               // strided access
      for (i = 0; i < N; i++)            for (i = 0; i < N; i += 64)
          sum += A[i];                       sum += A[i];

    Access pattern: 0,1,2,3,4,5,...    Access pattern: 0,64,128,192,...
    Each cache line fully utilized     Each cache line: 1 useful element

  QUANTIFYING LOCALITY:

    Miss Rate = Misses / Total Accesses

    For sequential byte access with B-byte blocks:
      Miss rate ~ 1/B (one miss per block, then B-1 hits)

    For stride-K byte access with B-byte blocks:
      If K < B:  miss rate ~ K/B (some spatial locality remains)
      If K >= B: miss rate = 100% (no spatial locality at all!)

  MATRIX ACCESS PATTERNS:

    Row-major storage: A[i][j] lives at base + (i*N + j) * elem_size

    Row-wise access (good):            Column-wise access (bad):
      for (i...) for (j...)              for (j...) for (i...)
          ... A[i][j] ...                    ... A[i][j] ...

    Memory: [0,0][0,1][0,2][0,3][1,0][1,1][1,2][1,3]...
    Row-wise visits these cells in storage order: 1,2,3,4,5,6,7,8.
    Column-wise jumps: [0,0],[1,0],[2,0],... - a stride of N elements!

    Column-wise traversal of a row-major matrix = stride-N access.
2.10 Blocking (Tiling) for Cache Optimization
The key optimization technique for improving cache performance:
BLOCKING / TILING TECHNIQUE

  PROBLEM: naive matrix multiply has poor locality

    // C = A x B (N x N matrices)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];

    Access analysis (row-major storage):
      - A[i][k]: row-wise access (good)
      - B[k][j]: column-wise access (BAD - stride N!)
      - C[i][j]: repeated access (good temporal locality)

    For large N, the accesses to B cause cache thrashing.

  SOLUTION: process the matrices in cache-sized tiles, computing
  C[block] += A[row_block] x B[col_block] one tile at a time, so that
  the three active tiles stay resident in the cache.

  BLOCKED MATRIX MULTIPLY:

    // block size BLOCK chosen so 3 tiles fit in cache
    for (ii = 0; ii < N; ii += BLOCK)
        for (jj = 0; jj < N; jj += BLOCK)
            for (kk = 0; kk < N; kk += BLOCK)
                // mini matrix multiply on tiles
                for (i = ii; i < min(ii+BLOCK, N); i++)
                    for (j = jj; j < min(jj+BLOCK, N); j++)
                        for (k = kk; k < min(kk+BLOCK, N); k++)
                            C[i][j] += A[i][k] * B[k][j];

  CHOOSING THE BLOCK SIZE:
    - Need 3 tiles in cache at once: A_block, B_block, C_block
    - Each tile is BLOCK x BLOCK elements
    - Total: 3 x BLOCK^2 x sizeof(element) < cache size
    - BLOCK ~ sqrt(CacheSize / (3 x sizeof(element)))

    Example: 32 KB L1 cache, double (8 bytes)
      BLOCK ~ sqrt(32768 / 24) ~ 36
      Use BLOCK = 32 (power of 2 for alignment)
2.11 Cache-Oblivious Algorithms
An advanced concept: algorithms that are efficient without knowing cache size:
CACHE-OBLIVIOUS ALGORITHMS

  CACHE-AWARE (blocking):
    - You choose the block size based on a known cache size
    - Good: optimal for that cache
    - Bad: must be retuned for different caches / hierarchy levels

  CACHE-OBLIVIOUS:
    - The algorithm is designed to work well for ANY cache size
    - Uses recursive divide-and-conquer
    - Automatically adapts to all levels of the hierarchy

  EXAMPLE: cache-oblivious matrix multiply

    void matmul(A, B, C, n) {
        if (n <= BASE_CASE) {
            // direct multiply (fits in registers)
            base_case_multiply(A, B, C);
        } else {
            // divide each matrix into quadrants:
            //   | A11  A12 |
            //   | A21  A22 |
            //
            // C11 = A11*B11 + A12*B21
            // C12 = A11*B12 + A12*B22
            // C21 = A21*B11 + A22*B21
            // C22 = A21*B12 + A22*B22

            // 8 recursive multiplies, 4 additions
            matmul(A11, B11, T1, n/2);
            matmul(A12, B21, T2, n/2);
            add(T1, T2, C11, n/2);
            // ... etc.
        }
    }

  WHY IT WORKS:
    At some recursion depth, the subproblems fit in the cache.
    From then on, all accesses within a subproblem are cache hits.
    This happens automatically for any cache size!

  THEORETICAL GUARANTEE:
    Cache misses = O(N^3 / (B * sqrt(M)))
    where N = matrix dimension, B = block size, M = cache size.
    This is optimal for any M and B!
3. Project Specification
3.1 What You Will Build
A two-part system:
1. Cache Simulator: A configurable set-associative cache simulator that processes memory access traces and reports hit/miss statistics with miss classification.
2. Locality Visualizer: An ASCII-art visualization tool that shows cache behavior over time, making hit/miss patterns visible and intuitive.
3.2 Functional Requirements
Cache Simulator (csim):
- Configuration (-s, -E, -b flags):
  - Accept cache parameters: s (index bits), E (associativity), b (block bits)
  - Support cache sizes from tiny (s=1, E=1, b=2) to realistic (s=10, E=8, b=6)
- Trace Processing:
  - Read Valgrind-format memory traces:
      I 0400d7d4,8   (instruction fetch - optional to handle)
      L 7ff0005c8,8  (data load)
      S 7ff0005d0,4  (data store)
      M 0421c7f0,4   (modify = load then store)
  - Process each access and update cache state
- Statistics Reporting:
- Count: hits, misses, evictions
- Classify misses: compulsory, capacity, conflict
- Support verbose mode showing each access result
- Replacement Policies:
- Implement LRU (required)
- FIFO and Random (stretch goals)
Locality Visualizer (lvis):
- Visual Output:
- Display cache state as ASCII art after each access (or at intervals)
- Show which sets are being accessed
- Highlight hits (green/+) vs misses (red/-)
- Pattern Recognition:
- Identify and annotate access patterns (sequential, strided, random)
- Calculate and display running miss rate
- Code Path Mode:
- Accept annotated traces that include source location
- Group statistics by code region
3.3 Non-Functional Requirements
- Correctness: Match reference simulator output exactly
- Performance: Handle traces with millions of accesses in reasonable time
- Clarity: Output must be educational, not just numerical
- Modularity: Clean separation between simulator core and visualization
3.4 Example Usage and Output
Basic Simulation:
$ ./csim -s 4 -E 2 -b 4 -t traces/yi.trace
hits:4 misses:5 evictions:3
$ ./csim -v -s 4 -E 2 -b 4 -t traces/yi.trace
L 10,1 miss
M 20,1 miss hit
L 22,1 hit
S 18,1 hit
L 110,1 miss eviction
L 210,1 miss eviction
M 12,1 miss eviction hit
hits:4 misses:5 evictions:3
Miss Classification:
$ ./csim -s 4 -E 2 -b 4 -t traces/yi.trace --classify
Cache Configuration:
Sets: 16 (s=4)
Lines per set: 2 (E=2)
Block size: 16 bytes (b=4)
Total size: 512 bytes
Results:
Hits: 4 (44.4%)
Misses: 5 (55.6%)
Compulsory: 4 (80.0% of misses)
Capacity: 0 (0.0% of misses)
Conflict: 1 (20.0% of misses)
Evictions: 3
Locality Visualizer Output:
$ ./lvis -s 2 -E 1 -b 2 -t traces/simple.trace
=== CACHE LOCALITY VISUALIZER ===
Cache: 4 sets x 1 way x 4 bytes = 16 bytes (direct-mapped)
Access #1: L 0x00 [MISS - Compulsory]
  Set 0: [#### tag=0x00]   <- ACCESSED (miss)
  Set 1: [              ]
  Set 2: [              ]
  Set 3: [              ]
Running: hits=0 misses=1 (0.0% hit rate)
Access #2: L 0x04 [MISS - Compulsory]
  Set 0: [#### tag=0x00]
  Set 1: [#### tag=0x00]   <- ACCESSED (miss)
  Set 2: [              ]
  Set 3: [              ]
Running: hits=0 misses=2 (0.0% hit rate)
Access #3: L 0x00 [HIT]
  Set 0: [#### tag=0x00]   <- ACCESSED (HIT!)
  Set 1: [#### tag=0x00]
  Set 2: [              ]
  Set 3: [              ]
Running: hits=1 misses=2 (33.3% hit rate)
Access #4: L 0x10 [MISS - Compulsory]
  Set 0: [#### tag=0x01]   <- ACCESSED (miss, evicted tag 0x00)
  Set 1: [#### tag=0x00]
  Set 2: [              ]
  Set 3: [              ]
Running: hits=1 misses=3 (25.0% hit rate)
Access #5: L 0x00 [MISS - Conflict]
  Set 0: [#### tag=0x00]   <- ACCESSED (miss, evicted tag 0x01)
  Set 1: [#### tag=0x00]
  Set 2: [              ]
  Set 3: [              ]
Running: hits=1 misses=4 (20.0% hit rate)
=== PATTERN ANALYSIS ===
Stride detected: 4 bytes (matches block size)
Spatial locality: GOOD (sequential within blocks)
Temporal locality: MODERATE (some reuse observed)
Conflict misses: 1 (25% of misses) - consider increasing associativity
4. Solution Architecture
4.1 High-Level Design
  +--------------+       +---------------------------------------------+
  | Trace Files  |       |               CACHE SIMULATOR               |
  | (valgrind)   |       |  Cache Controller                           |
  +------+-------+       |   +--------------+  +---------------------+ |
         |               |   | Address      |  | Cache State         | |
         v               |   | Decomposer   |  | (sets, lines)       | |
  +--------------+       |   +--------------+  +---------------------+ |
  | Trace Parser |------>|   +--------------+  +---------------------+ |
  +--------------+       |   | Hit/Miss     |  | Replacement Policy  | |
                         |   | Logic        |  | (LRU/FIFO)          | |
                         |   +--------------+  +---------------------+ |
                         +----------------------+----------------------+
                                                |
                                                v
                         +---------------------------------------------+
                         |            Statistics Collector             |
                         |    hits, misses, evictions, miss types      |
                         +----------------------+----------------------+
                                                |
                              +-----------------+----------------+
                              v                                  v
                     +-----------------+              +-----------------+
                     |  Text Reporter  |              |      ASCII      |
                     |  (stats output) |              |   Visualizer    |
                     +-----------------+              +-----------------+
4.2 Key Data Structures
/* Cache line structure */
typedef struct {
int valid; /* Valid bit */
unsigned long tag; /* Tag bits */
int lru_counter; /* For LRU replacement (higher = more recent) */
int load_time; /* For FIFO replacement */
int dirty; /* Dirty bit (for write-back, extension) */
} cache_line_t;
/* Cache set structure */
typedef struct {
cache_line_t *lines; /* Array of E lines */
int num_lines; /* Associativity (E) */
} cache_set_t;
/* Cache structure */
typedef struct {
cache_set_t *sets; /* Array of S sets */
int num_sets; /* S = 2^s */
int lines_per_set; /* E */
int block_size; /* B = 2^b */
/* Bit counts for address decomposition */
int s; /* Index bits */
int b; /* Offset bits */
int t; /* Tag bits = m - s - b */
/* Statistics */
int hits;
int misses;
int evictions;
/* Miss classification */
int compulsory_misses;
int capacity_misses;
int conflict_misses;
/* Global LRU counter */
int lru_clock;
/* For miss classification: track all unique blocks ever seen */
/* (Using a set/hash table in real implementation) */
} cache_t;
/* Memory access structure */
typedef struct {
char op; /* 'L' (load), 'S' (store), 'M' (modify) */
unsigned long addr; /* Memory address */
int size; /* Size of access in bytes */
} mem_access_t;
/* Access result for visualization */
typedef struct {
int hit; /* 1 if hit, 0 if miss */
int eviction; /* 1 if eviction occurred */
int set_index; /* Which set was accessed */
int line_index; /* Which line in set */
unsigned long tag; /* Tag of accessed block */
char miss_type; /* 'C' = compulsory, 'A' = capacity, 'O' = conflict */
} access_result_t;
4.3 Module Structure
cache-simulator/
├── src/
│   ├── main.c            # Entry point, CLI parsing
│   ├── cache.c           # Cache simulation logic
│   ├── cache.h           # Cache data structures and API
│   ├── trace.c           # Trace file parsing
│   ├── trace.h           # Trace structures
│   ├── lru.c             # LRU replacement policy
│   ├── replacement.h     # Replacement policy interface
│   ├── classify.c        # Miss classification logic
│   ├── stats.c           # Statistics collection
│   ├── visualizer.c      # ASCII visualization
│   └── visualizer.h      # Visualizer API
├── tests/
│   ├── traces/           # Test trace files
│   │   ├── yi.trace
│   │   ├── dave.trace
│   │   └── trans.trace
│   ├── test_cache.c      # Unit tests for cache logic
│   ├── test_lru.c        # Unit tests for LRU
│   └── expected/         # Expected outputs
├── tools/
│   ├── gen_trace.py      # Generate custom traces
│   └── analyze_pattern.py  # Pattern analysis helper
├── Makefile
└── README.md
4.4 Core Algorithm: Cache Access
───────────────────────────────────────────────────────────────
                    CACHE ACCESS ALGORITHM
───────────────────────────────────────────────────────────────

access_cache(cache, address):

1. DECOMPOSE ADDRESS
   tag       = address >> (s + b)
   set_index = (address >> b) & ((1 << s) - 1)
   offset    = address & ((1 << b) - 1)

2. SEARCH SET FOR MATCHING TAG
   set = cache.sets[set_index]
   for each line in set:
       if line.valid AND line.tag == tag:
           HIT! Update LRU, return

3. CACHE MISS - NEED TO LOAD BLOCK
   // First check for an empty line
   for each line in set:
       if NOT line.valid:
           use this empty line   // no eviction needed
           break

   // No empty line - must evict
   if all lines valid:
       victim = select_victim(set)   // LRU
       eviction = true

4. INSTALL NEW BLOCK
   line.valid = true
   line.tag = tag
   line.lru_counter = cache.lru_clock++

5. UPDATE STATISTICS
   if hit:      cache.hits++
   else:        cache.misses++
   if eviction: cache.evictions++
   classify_miss(...)
───────────────────────────────────────────────────────────────
5. Implementation Guide
5.1 Development Environment Setup
# Required tools
sudo apt-get install gcc gdb valgrind build-essential
# Optional: valgrind for generating traces
# (Traces are provided, but you can generate your own)
# Create project structure
mkdir -p cache-simulator/{src,tests/traces,tools}
cd cache-simulator
5.2 Project Structure
cache-simulator/
├── src/
│   ├── main.c
│   ├── cache.c / cache.h
│   ├── trace.c / trace.h
│   ├── visualizer.c / visualizer.h
│   └── stats.c / stats.h
├── tests/
│   ├── traces/
│   └── test_*.c
├── Makefile
└── README.md
5.3 Implementation Phases
Phase 1: Address Decomposition (Days 1-2)
Goals:
- Implement address parsing into tag, index, offset
- Validate decomposition with known examples
Tasks:
- Create `cache.h` with basic structures
- Implement `decompose_address(addr, s, b, &tag, &index, &offset)`
- Write unit tests with hand-calculated examples
- Test edge cases (addresses at block boundaries)
Test Cases:
// Cache: s=4, b=4 (16 sets, 16-byte blocks)
// Address: 0x12345678
// Expected decomposition:
//   offset = 0x8      (low 4 bits)
//   index  = 0x7      (next 4 bits)
//   tag    = 0x123456 (remaining high bits: 0x12345678 >> 8)
void test_decompose() {
    int s = 4, b = 4;
    unsigned long addr = 0x12345678;
    unsigned long tag, index, offset;
    decompose_address(addr, s, b, &tag, &index, &offset);
    assert(offset == 0x8);
    assert(index == 0x7);
    assert(tag == 0x123456);
    printf("Address decomposition: PASSED\n");
}
Checkpoint: Address decomposition passes all unit tests.
Phase 2: Cache Data Structures (Days 3-4)
Goals:
- Implement cache initialization
- Create set and line data structures
- Implement cache state printing (for debugging)
Tasks:
- Implement `cache_t* cache_init(int s, int E, int b)`
- Implement `cache_free(cache_t *cache)`
- Implement `cache_print_state(cache_t *cache)` for debugging
- Validate memory allocation/deallocation
Implementation:
cache_t* cache_init(int s, int E, int b) {
cache_t *cache = malloc(sizeof(cache_t));
cache->s = s;
cache->b = b;
cache->num_sets = 1 << s; // 2^s
cache->lines_per_set = E;
cache->block_size = 1 << b; // 2^b
cache->t = 64 - s - b; // Assuming 64-bit addresses
// Allocate sets
cache->sets = malloc(cache->num_sets * sizeof(cache_set_t));
for (int i = 0; i < cache->num_sets; i++) {
cache->sets[i].lines = calloc(E, sizeof(cache_line_t));
cache->sets[i].num_lines = E;
}
    // Initialize statistics (including the classification counters,
    // which would otherwise hold garbage from malloc)
    cache->hits = cache->misses = cache->evictions = 0;
    cache->compulsory_misses = cache->capacity_misses = cache->conflict_misses = 0;
    cache->lru_clock = 0;
return cache;
}
Checkpoint: Cache initializes correctly; memory tools show no leaks.
Phase 3: Cache Access Logic (Days 5-8)
Goals:
- Implement cache lookup
- Implement cache update on miss
- Track hits, misses, evictions
Tasks:
- Implement `access_result_t cache_access(cache_t *cache, unsigned long addr)`
- Implement hit detection (tag match + valid)
- Implement miss handling (find empty line or evict)
- Update statistics on each access
- Test with simple traces
Core Implementation:
access_result_t cache_access(cache_t *cache, unsigned long addr) {
access_result_t result = {0};
unsigned long tag, set_index, offset;
decompose_address(addr, cache->s, cache->b, &tag, &set_index, &offset);
cache_set_t *set = &cache->sets[set_index];
result.set_index = set_index;
result.tag = tag;
// Check for hit
for (int i = 0; i < set->num_lines; i++) {
if (set->lines[i].valid && set->lines[i].tag == tag) {
// HIT
result.hit = 1;
result.line_index = i;
cache->hits++;
set->lines[i].lru_counter = cache->lru_clock++;
return result;
}
}
// MISS
cache->misses++;
result.hit = 0;
// Find empty line or evict
int target_line = -1;
for (int i = 0; i < set->num_lines; i++) {
if (!set->lines[i].valid) {
target_line = i;
break;
}
}
if (target_line == -1) {
// All lines full - evict LRU
target_line = find_lru_victim(set);
result.eviction = 1;
cache->evictions++;
}
// Install new block
set->lines[target_line].valid = 1;
set->lines[target_line].tag = tag;
set->lines[target_line].lru_counter = cache->lru_clock++;
result.line_index = target_line;
return result;
}
Checkpoint: Simple hand-traced examples produce correct hit/miss counts.
Phase 4: LRU Replacement (Days 9-10)
Goals:
- Implement correct LRU replacement
- Validate against known LRU behavior
Tasks:
- Implement `find_lru_victim(cache_set_t *set)`
- Ensure LRU counters update on every access (hit or miss)
- Test LRU ordering with specific sequences
LRU Test Case:
// 2-way set associative, accessing same set
// Sequence: A, B, A, C
// Expected: A, B both cached; access A updates recency;
// C evicts B (not A, because A more recent)
void test_lru() {
cache_t *cache = cache_init(1, 2, 4); // 2 sets, 2-way, 16B blocks
// All map to set 0
unsigned long A = 0x00; // tag 0
unsigned long B = 0x20; // tag 1 (same set, different tag)
unsigned long C = 0x40; // tag 2 (same set)
cache_access(cache, A); // Miss, load A
cache_access(cache, B); // Miss, load B. Set: [A, B]
cache_access(cache, A); // Hit on A. A now more recent
access_result_t r = cache_access(cache, C); // Miss, evict B (LRU)
assert(r.eviction == 1);
// Verify B was evicted, not A
assert(cache->sets[0].lines[0].tag == 0 ||
cache->sets[0].lines[1].tag == 0); // A still present
assert(cache->sets[0].lines[0].tag == 2 ||
cache->sets[0].lines[1].tag == 2); // C now present
printf("LRU replacement: PASSED\n");
cache_free(cache);
}
Checkpoint: LRU test cases pass; replacement order is correct.
Phase 5: Trace Parser (Days 11-12)
Goals:
- Parse Valgrind trace format
- Handle all operation types (L, S, M)
Tasks:
- Implement `trace_open(filename)`
- Implement `trace_next_access(trace, &access)`
- Handle 'M' as two accesses (load then store)
- Validate against provided traces
Trace Format:
I 0400d7d4,8 <- Instruction fetch (ignore or optional)
L 7ff0005c8,8 <- Data load (note leading space)
S 7ff0005d0,4 <- Data store
M 0421c7f0,4 <- Modify (load + store)
Implementation:
int trace_parse_line(char *line, mem_access_t *access) {
if (line[0] == 'I') {
return 0; // Skip instruction fetches
}
char op;
unsigned long addr;
int size;
if (sscanf(line, " %c %lx,%d", &op, &addr, &size) == 3) {
access->op = op;
access->addr = addr;
access->size = size;
return 1;
}
return 0;
}
Checkpoint: Trace parser correctly reads provided test traces.
Phase 6: Statistics and Verbose Mode (Days 13-14)
Goals:
- Output matches reference simulator format
- Verbose mode shows per-access results
Tasks:
- Implement summary output: `hits:X misses:Y evictions:Z`
- Implement verbose output for each access
- Compare against reference output
- Fix any discrepancies
Verbose Output Format:
void print_access_verbose(mem_access_t *access, access_result_t *result) {
printf("%c %lx,%d ", access->op, access->addr, access->size);
if (result->hit) {
printf("hit ");
} else {
printf("miss ");
if (result->eviction) {
printf("eviction ");
}
}
printf("\n");
}
Checkpoint: Output matches reference for all provided traces.
Phase 7: Miss Classification (Days 15-17)
Goals:
- Classify each miss as compulsory, capacity, or conflict
- Track block history for classification
Tasks:
- Track all unique blocks ever accessed (for compulsory detection)
- Simulate a fully-associative cache of same size (for conflict detection)
- Classify: compulsory if first access; conflict if FA would hit; else capacity
Implementation Strategy:
// Maintain a fully-associative "oracle" cache for classification
typedef struct {
unsigned long *blocks; // Array of block addresses
int capacity; // Max blocks (same as main cache)
int count; // Current count
int lru_clock;
} oracle_cache_t;
char classify_miss(cache_t *cache, unsigned long block_addr,
oracle_cache_t *oracle, hash_set_t *seen_blocks) {
// Check if block was ever seen before
if (!hash_set_contains(seen_blocks, block_addr)) {
hash_set_add(seen_blocks, block_addr);
return 'C'; // Compulsory
}
// Check if fully-associative cache would have hit
if (oracle_access(oracle, block_addr)) {
return 'O'; // Conflict (would have hit with full associativity)
}
return 'A'; // Capacity (even FA cache missed)
}
Checkpoint: Miss classification matches expected distribution for test traces.
Phase 8: ASCII Visualizer (Days 18-21)
Goals:
- Display cache state visually
- Show access patterns and miss types
Tasks:
- Design ASCII cache representation
- Implement `visualize_state(cache, access, result)`
- Add pattern detection (stride, sequential)
- Implement running statistics display
Visualization Implementation:
void visualize_state(cache_t *cache, int access_num,
                     mem_access_t *access, access_result_t *result) {
    printf("\nAccess #%d: %c 0x%lx [%s%s]\n",
           access_num, access->op, access->addr,
           result->hit ? "HIT" : "MISS",
           result->hit ? "" : get_miss_type_str(result->miss_type));
    printf("┌");
    for (int i = 0; i < 40; i++) printf("─");
    printf("┐\n");
    for (int s = 0; s < cache->num_sets; s++) {
        printf("│ Set %d: ", s);
        for (int l = 0; l < cache->lines_per_set; l++) {
            cache_line_t *line = &cache->sets[s].lines[l];
            if (line->valid) {
                printf("[%04lx] ", line->tag & 0xFFFF);
            } else {
                printf("[    ] ");
            }
        }
        if (s == result->set_index) {
            printf(" <- %s", result->hit ? "HIT" : "accessed");
        }
        printf("\n");
    }
    printf("└");
    for (int i = 0; i < 40; i++) printf("─");
    printf("┘\n");
    /* Guard against division by zero before any accesses are recorded */
    int total = cache->hits + cache->misses;
    float hit_rate = total ? (float)cache->hits / total * 100 : 0.0f;
    printf("Running: hits=%d misses=%d (%.1f%% hit rate)\n",
           cache->hits, cache->misses, hit_rate);
}
Checkpoint: Visualizer shows correct cache state and highlights accessed sets.
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Test individual components | Address decomposition, LRU logic |
| Trace Tests | Validate against known traces | Match reference simulator output |
| Edge Cases | Handle unusual inputs | Empty trace, single access, full cache |
| Stress Tests | Performance and correctness at scale | Million-access traces |
6.2 Test Traces
Create traces for specific scenarios:
trace1_sequential.trace - Sequential access, good spatial locality:
L 0,4
L 4,4
L 8,4
L c,4
L 10,4
L 14,4
trace2_strided.trace - Strided access, poor spatial locality:
L 0,4
L 40,4
L 80,4
L c0,4
L 100,4
trace3_conflict.trace - Designed to cause conflict misses:
# For direct-mapped cache with 4 sets (s=2), 16-byte blocks (b=4)
# These all map to set 0
L 0,4 # Set 0, tag 0
L 40,4 # Set 0, tag 1 (evicts previous)
L 0,4 # Set 0, tag 0 (conflict miss!)
L 40,4 # Set 0, tag 1 (conflict miss!)
trace4_reuse.trace - Temporal locality test:
L 0,4
L 10,4
L 0,4 # Should be hit
L 10,4 # Should be hit
L 0,4 # Should be hit
6.3 Reference Validator
#!/bin/bash
# validate.sh - Compare against reference simulator
TRACES="yi dave trans simple"
CONFIGS="1-1-1 4-2-4 2-4-3" # s-E-b configurations
for trace in $TRACES; do
for config in $CONFIGS; do
s=$(echo $config | cut -d- -f1)
E=$(echo $config | cut -d- -f2)
b=$(echo $config | cut -d- -f3)
expected=$(./csim-ref -s $s -E $E -b $b -t traces/${trace}.trace)
actual=$(./csim -s $s -E $E -b $b -t traces/${trace}.trace)
if [ "$expected" == "$actual" ]; then
echo "PASS: $trace with s=$s E=$E b=$b"
else
echo "FAIL: $trace with s=$s E=$E b=$b"
echo " Expected: $expected"
echo " Actual: $actual"
fi
done
done
6.4 Critical Test Cases
- Empty cache hit detection: First access is always a miss
- LRU correctness: Verify least-recently-used block is evicted
- Set isolation: Accesses to different sets don't affect each other
- Block boundary: Adjacent addresses in same block share cache line
- Tag uniqueness: Same index, different tags are different blocks
- Modify operation: 'M' counts as one potential miss but two potential hits
7. Common Pitfalls
7.1 Implementation Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Off-by-one in bit extraction | Wrong set/tag | Use `(1UL << n) - 1` for n-bit masks |
| Forgetting to update LRU on hit | Wrong eviction | Always update LRU counter on any access |
| Not handling 'M' operations | Miscount hits | 'M' = load + store; can hit twice |
| Signed vs unsigned addresses | Negative set indices | Use unsigned long for addresses |
| Wrong mask calculation | Tag/index overlap | Verify: tag_bits + index_bits + offset_bits = address_bits |
| Evicting wrong line | Incorrect cache state | Verify LRU logic with small examples |
7.2 Debugging Strategies
// Debug mode: print every decomposition
#ifdef DEBUG
printf("addr=0x%lx -> tag=0x%lx set=%lu offset=%lu\n",
addr, tag, set_index, offset);
#endif
// Sanity checks
assert(set_index < cache->num_sets);
assert(offset < cache->block_size);
assert(tag < (1UL << cache->t));
7.3 Performance Traps
- Large traces: Don't store all accesses in memory; process them as a stream
- Hash table for seen blocks: Don't use linear search for millions of blocks
- Visualization overhead: Only visualize in debug mode or with sampling
8. Extensions
8.1 Beginner Extensions
- FIFO replacement policy: Simpler than LRU
- Random replacement: Even simpler, good baseline
- Write operations: Track stores separately from loads
- Command-line help: Proper usage message with `-h`
8.2 Intermediate Extensions
- Multi-level cache: Simulate L1 + L2; misses in L1 go to L2
- Inclusive vs exclusive: Different multi-level policies
- Write-back simulation: Track dirty bits, count writebacks
- Trace generation: Create traces from actual program runs
- Block-level statistics: Per-block hit/miss counts
8.3 Advanced Extensions
- Prefetching simulation: Next-line, stride, or correlation prefetch
- TLB simulation: Combine with page table simulation
- Cache coherence: Simulate MESI protocol for multi-core
- Real code optimization: Use simulator to guide actual code changes
- GUI visualizer: Web-based or ncurses interface
8.4 Multi-Level Cache Example
───────────────────────────────────────────────────────────────
                 MULTI-LEVEL CACHE SIMULATION
───────────────────────────────────────────────────────────────

Typical modern hierarchy:

        ┌─────────────┐
        │     CPU     │
        └──────┬──────┘
               ▼
        ┌─────────────┐   ~1ns, 32KB, 8-way
        │  L1 Cache   │
        └──────┬──────┘   L1 miss
               ▼
        ┌─────────────┐   ~4ns, 256KB, 8-way
        │  L2 Cache   │
        └──────┬──────┘   L2 miss
               ▼
        ┌─────────────┐   ~20ns, 8MB, 16-way
        │  L3 Cache   │
        └──────┬──────┘   L3 miss
               ▼
        ┌─────────────┐   ~100ns
        │    DRAM     │
        └─────────────┘

Simulation approach:
1. Try L1: if hit, done
2. If L1 miss, try L2: if hit, install in L1
3. If L2 miss, try L3: if hit, install in L1 and L2
4. If L3 miss, fetch from memory, install everywhere

// Pseudocode
result = access_cache(L1, addr);
if (!result.hit) {
    result = access_cache(L2, addr);
    if (!result.hit) {
        result = access_cache(L3, addr);
        if (!result.hit) {
            // Fetch from memory
        }
        install_block(L2, addr);
    }
    install_block(L1, addr);
}
───────────────────────────────────────────────────────────────
9. Real-World Connections
9.1 Performance Engineering
Your cache simulator skills directly apply to:
- Database systems: Query planners consider cache behavior
- Game engines: Data-oriented design for cache efficiency
- Scientific computing: Blocking for matrix operations
- Embedded systems: Understanding tight memory constraints
9.2 Tools That Use These Concepts
| Tool | Purpose | Cache Relevance |
|---|---|---|
| Valgrind/Cachegrind | Cache profiling | Uses simulation like yours |
| perf | Hardware counters | Measures real cache events |
| Intel VTune | Performance analysis | Cache miss attribution |
| Compiler (-O3) | Optimization | Loop blocking, prefetching |
9.3 Interview Applications
Common questions your project prepares you for:
- "Why is row-major traversal faster than column-major for matrices?"
- "How would you optimize a function with poor cache behavior?"
- "Explain the tradeoffs in cache associativity."
- "What's the difference between L1 and L3 cache?"
- "How do you measure cache performance?"
9.4 Modern CPU Cache Parameters
For reference, typical modern CPU cache configurations:
| Level | Size | Associativity | Block Size | Latency |
|---|---|---|---|---|
| L1D | 32-48 KB | 8-12 way | 64 bytes | 4-5 cycles |
| L1I | 32 KB | 8 way | 64 bytes | 4-5 cycles |
| L2 | 256-512 KB | 4-8 way | 64 bytes | 12-14 cycles |
| L3 | 8-64 MB | 12-16 way | 64 bytes | 40-75 cycles |
10. Resources
10.1 Essential Reading
- CS:APP Chapter 6: "The Memory Hierarchy" - Core material
- CS:APP Chapter 5: "Optimizing Program Performance" - Measurement discipline
- "What Every Programmer Should Know About Memory" by Ulrich Drepper - Deep dive
10.2 Tools
- Valgrind/Cachegrind: `valgrind --tool=cachegrind ./program`
- perf: `perf stat -e cache-references,cache-misses ./program`
- PAPI: Performance API for hardware counters
10.3 Trace Generation
# Generate trace with Valgrind lackey
valgrind --tool=lackey --trace-mem=yes ./your_program 2>&1 | \
    grep -E '^ [LSM]' > trace.trace
10.4 Related Projects
- Previous: P8 (Performance Clinic) for measurement methodology
- Next: P13 (VM Map Visualizer) extends memory hierarchy understanding
- Related: P14 (Malloc) uses locality principles
10.5 Online Resources
- Cache Simulator Visualization
- Interactive Cache Demo
- CS:APP Cache Lab materials at CMU
11. Self-Assessment Checklist
Understanding
- I can explain address decomposition (tag/index/offset) without notes
- I can draw a set-associative cache and trace an access through it
- I understand why spatial locality works (blocks, not bytes)
- I can classify a miss as compulsory, capacity, or conflict
- I can explain LRU replacement step-by-step
- I understand why blocking improves matrix multiply performance
Implementation
- Simulator correctly handles all trace operations (L, S, M)
- Hit/miss/eviction counts match reference for all test traces
- LRU replacement is implemented correctly
- Address decomposition is verified with manual calculations
- Visualizer shows clear cache state representation
Application
- I can predict relative cache performance of two code variants
- I used the simulator to measure a real optimizationโs effect
- I can explain a miss-rate reduction in terms of locality principles
- I can choose appropriate cache parameters for a simulation
Growth
- I understand how to extend to multi-level caches
- I know how real tools (Cachegrind, perf) relate to my simulator
- I can apply cache knowledge to optimize real programs
12. Completion Criteria
Minimum Viable Completion
- Direct-mapped and set-associative simulation works
- Address decomposition is correct
- Hit/miss/eviction counting matches reference
- Processes provided trace files
- Basic text output works
Full Completion
- LRU replacement is correct
- All trace formats handled (L, S, M)
- Miss classification (compulsory/capacity/conflict)
- Verbose mode shows per-access details
- ASCII visualizer displays cache state
- All reference traces match expected output
Excellence (Going Above & Beyond)
- Multi-level cache simulation
- Multiple replacement policies (LRU, FIFO, Random)
- Prefetching simulation
- Real program optimization demonstrated with simulator
- Interactive visualization or web interface
13. Real World Outcome
When you complete this project, you will have a fully functional cache simulator and locality visualizer. Here is a representative sample of what running your tools will look like:
Cache Simulator Basic Usage
$ ./csim -s 4 -E 1 -b 4 -t traces/yi.trace
hits:4 misses:5 evictions:3
Verbose Mode with Full Trace
$ ./csim -v -s 4 -E 1 -b 4 -t traces/yi.trace
Cache Configuration:
Sets (S): 16 (s=4 bits)
Lines per set (E): 1 (direct-mapped)
Block size (B): 16 bytes (b=4 bits)
Cache size: 256 bytes
Processing trace: traces/yi.trace
─────────────────────────────────────────────────────────────
L 10,1   set=1  tag=0x000000  miss
M 20,1   set=2  tag=0x000000  miss hit
L 22,1   set=2  tag=0x000000  hit
S 18,1   set=1  tag=0x000000  hit
L 110,1  set=1  tag=0x000001  miss eviction
L 210,1  set=1  tag=0x000002  miss eviction
M 12,1   set=1  tag=0x000000  miss eviction hit
─────────────────────────────────────────────────────────────
Summary:
hits:4 misses:5 evictions:3
Hit rate: 44.4%
Miss rate: 55.6%
Miss Classification Analysis
$ ./csim --classify -s 4 -E 2 -b 4 -t traces/matrix_multiply.trace
Cache Configuration:
Sets: 16 (s=4)
Lines per set: 2 (E=2)
Block size: 16 bytes (b=4)
Total size: 512 bytes
Access Summary:
Total accesses: 10,000
Hits: 7,234 (72.3%)
Misses: 2,766 (27.7%)
Miss Classification:
┌───────────────────────────────────────────────────────┐
│ Compulsory: 1,024 (37.0% of misses)  █████████        │
│ Capacity:     892 (32.3% of misses)  ████████         │
│ Conflict:     850 (30.7% of misses)  ███████          │
└───────────────────────────────────────────────────────┘
Evictions: 1,742
Recommendations:
- High conflict misses: Consider 4-way associativity
- Capacity misses indicate working set > cache size
- Use blocking with block size ~16 elements
Locality Visualizer Output
$ ./lvis -s 2 -E 2 -b 2 -t traces/simple.trace --animate
┌──────────────────────────────────────────────────────────────┐
│                  CACHE LOCALITY VISUALIZER                   │
│         Cache: 4 sets × 2 ways × 4 bytes = 32 bytes          │
└──────────────────────────────────────────────────────────────┘

Access #1: L 0x00 (4 bytes)
┌────────────────────────────────────────────┐
│ Set 0: [████ t=0x00] [          ]          │ <- MISS (Compulsory)
│ Set 1: [          ] [          ]           │
│ Set 2: [          ] [          ]           │
│ Set 3: [          ] [          ]           │
└────────────────────────────────────────────┘
Stats: hits=0 misses=1 rate=0.0%

Access #2: L 0x04 (4 bytes)
┌────────────────────────────────────────────┐
│ Set 0: [████ t=0x00] [          ]          │
│ Set 1: [████ t=0x00] [          ]          │ <- MISS (Compulsory)
│ Set 2: [          ] [          ]           │
│ Set 3: [          ] [          ]           │
└────────────────────────────────────────────┘
Stats: hits=0 misses=2 rate=0.0%

Access #3: L 0x00 (4 bytes)
┌────────────────────────────────────────────┐
│ Set 0: [████ t=0x00] [          ]          │ <- HIT!
│ Set 1: [████ t=0x00] [          ]          │
│ Set 2: [          ] [          ]           │
│ Set 3: [          ] [          ]           │
└────────────────────────────────────────────┘
Stats: hits=1 misses=2 rate=33.3%

Access #4: L 0x10 (4 bytes)
┌────────────────────────────────────────────┐
│ Set 0: [████ t=0x01] [████ t=0x00]         │ <- MISS (loaded new tag)
│ Set 1: [████ t=0x00] [          ]          │
│ Set 2: [          ] [          ]           │
│ Set 3: [          ] [          ]           │
└────────────────────────────────────────────┘
Stats: hits=1 misses=3 rate=25.0%

───────────────────────────────────────────────────────────────
                       PATTERN ANALYSIS
───────────────────────────────────────────────────────────────
Detected stride:    4 bytes
Spatial locality:   GOOD (sequential within cache lines)
Temporal locality:  MODERATE (50% reuse of cached data)
Recommendation:     Current cache configuration is adequate
───────────────────────────────────────────────────────────────
Matrix Multiply Optimization Analysis
$ ./csim --compare -s 5 -E 4 -b 6 \
-t traces/matmul_naive.trace \
-t traces/matmul_blocked.trace
Comparing cache behavior between two traces:
Cache: 32 sets, 4-way, 64-byte blocks (8 KB)
┌───────────────────────────────────────────────────────────────┐
│                      COMPARISON RESULTS                       │
├──────────────────┬─────────────────────┬──────────────────────┤
│                  │ matmul_naive        │ matmul_blocked       │
├──────────────────┼─────────────────────┼──────────────────────┤
│ Total accesses   │ 2,097,152           │ 2,097,152            │
│ Hits             │ 524,288 (25.0%)     │ 1,835,008 (87.5%)    │
│ Misses           │ 1,572,864 (75.0%)   │ 262,144 (12.5%)      │
│ Evictions        │ 1,572,832           │ 262,112              │
├──────────────────┼─────────────────────┼──────────────────────┤
│ Est. cycles      │ 157,810,688         │ 28,311,552           │
│ Speedup          │ 1.0x (baseline)     │ 5.6x faster          │
└──────────────────┴─────────────────────┴──────────────────────┘
Analysis: Blocking reduced misses by 83.3%
Estimated 5.6x performance improvement
14. The Core Question Youโre Answering
"How does the memory hierarchy create the illusion of a large, fast memory, and how can I write code that exploits this illusion?"
This project forces you to deeply understand that CPU performance is dominated by memory access patterns. A cache miss can cost 100+ CPU cycles, while a hit costs only 1-4 cycles. By building a simulator, you internalize how your data layout and access patterns translate directly into cache behavior: knowledge that separates average programmers from performance-conscious systems engineers.
15. Concepts You Must Understand First
Before starting this project, ensure you have a solid grasp of these foundational concepts:
| Concept | Where to Learn | Why It Matters |
|---|---|---|
| Binary/Hexadecimal representation | CS:APP 2.1 | Address decomposition requires bit manipulation |
| Bitwise operations (AND, OR, shifts) | CS:APP 2.1.4 | Extracting tag/index/offset from addresses |
| Memory addressing | CS:APP 6.1 | Understanding how addresses map to physical memory |
| Locality of reference | CS:APP 6.2 | Why caches work - temporal and spatial locality |
| C pointers and dynamic allocation | CS:APP 2.1, K&R Ch. 5 | Building cache data structures |
| Structs and arrays | K&R Ch. 6 | Representing cache organization |
| File I/O in C | K&R Ch. 7 | Parsing trace files |
| Command-line argument parsing | K&R 5.10 | Handling -s, -E, -b flags |
16. Questions to Guide Your Design
Work through these questions before writing any code:
- Address Decomposition: Given a 64-bit address, s=4, and b=6, write out the formula to extract tag, set index, and block offset. Try it with address 0x7FF000A3C8.
- Data Structure Choice: How will you represent a cache line? A set? The entire cache? Why might a 2D array work for some configurations but not others?
- LRU Tracking: How will you track access recency for LRU replacement? Counter per line? Linked list? What's the time complexity of each approach?
- Miss Classification: To classify a miss as conflict vs capacity, you need to know if a fully-associative cache of the same size would have hit. How will you simulate this efficiently?
- Modify Operations: A trace line 'M 7ff000a3c8,4' represents a load followed by a store. How do you handle this? Can both the load and store be hits?
- Block Boundaries: If a load accesses bytes 60-67 of a 64-byte block, what happens? Do you need to track byte-level access or just block-level?
17. Thinking Exercise
Before coding, hand-trace this sequence through a cache with s=1 (2 sets), E=2 (2-way), b=2 (4-byte blocks):
L 0x00 # Load from address 0
L 0x04 # Load from address 4
L 0x08 # Load from address 8
L 0x00 # Load from address 0 again
L 0x10 # Load from address 16
L 0x00 # Load from address 0 again
For each access, determine:
- Which set is accessed?
- Is it a hit or miss?
- If miss, is it compulsory, capacity, or conflict?
- Which line is evicted (if any)?
- What is the final cache state?
Show your work before looking at the solution. Draw the cache state after each access.
Solution
Address decomposition for this cache:
- Block offset: bits [1:0] (2 bits)
- Set index: bit [2] (1 bit, so 2 sets)
- Tag: bits [63:3]
| Access | Address | Set | Tag | Result | Cache State After |
|---|---|---|---|---|---|
| L 0x00 | 0b000 | 0 | 0 | MISS (compulsory) | Set0: [tag=0, _] |
| L 0x04 | 0b100 | 1 | 0 | MISS (compulsory) | Set0: [tag=0, _] Set1: [tag=0, _] |
| L 0x08 | 0b1000 | 0 | 1 | MISS (compulsory) | Set0: [tag=0, tag=1] Set1: [tag=0, _] |
| L 0x00 | 0b000 | 0 | 0 | HIT | Set0: [tag=0, tag=1] |
| L 0x10 | 0b10000 | 0 | 2 | MISS (compulsory: first access to this block) | Set0: [tag=0, tag=2] (evicted tag=1) |
| L 0x00 | 0b000 | 0 | 0 | HIT | Set0: [tag=0, tag=2] |
Final: hits=2, misses=4, evictions=1
18. The Interview Questions Theyโll Ask
After completing this project, you should be able to confidently answer these questions:
- "Explain the difference between direct-mapped, set-associative, and fully-associative caches. What are the tradeoffs?"
- Expect follow-ups about hardware cost, hit latency, and miss rate
- "Why does row-major traversal of a matrix outperform column-major traversal in C?"
- Must connect to cache line size and spatial locality
- "What is cache thrashing and how would you detect and fix it in code?"
- Discuss conflict misses, power-of-2 strides, and padding
- "Walk me through exactly what happens when the CPU accesses memory at address X."
- Cover TLB, address decomposition, tag comparison, miss handling
- "How does blocking (loop tiling) improve matrix multiplication performance?"
- Quantify: 3 blocks must fit in cache, calculate optimal block size
- "You have a program with 90% cache miss rate. How do you diagnose and fix it?"
- Discuss Cachegrind, perf, miss classification, data layout changes
19. Hints in Layers
Use these hints progressively if you get stuck. Try to solve problems yourself before advancing to the next hint level.
Hint Layer 1: Getting Started
- Start with address decomposition. Write a function that takes an address and prints tag/index/offset. Test it extensively before moving on.
- For cache state, a 2D array `cache_line_t cache[S][E]` works well for small caches.
- Parse traces with `fscanf(fp, " %c %lx,%d", &op, &addr, &size)`.
Hint Layer 2: Core Logic
- Hit detection: for each line in `set[index]`, check `valid && (tag == line.tag)`.
- For LRU, a simple counter works: increment a global counter on each access and store it in the accessed line. The victim is the line with the smallest counter.
- Handle 'M' as two separate accesses: a load (which may miss) followed by a store (which always hits if the load brought in the block).
Hint Layer 3: Miss Classification
- Maintain a hash set of all block addresses ever seen. First access to a block is compulsory.
- Simulate a fully-associative cache of the same total size in parallel. If the FA cache hits but your cache misses, itโs a conflict miss.
- Everything else is a capacity miss.
Hint Layer 4: Common Bugs
- Off-by-one in bit masks: use `(1UL << n) - 1`, not `(1 << n) - 1` (the `UL` matters for 64-bit).
- Forgetting to update the LRU counter on hits (not just misses).
- Using `int` instead of `unsigned long` for addresses (signed arithmetic bugs).
- Not handling the case where a set has empty lines (no eviction needed).
20. Books That Will Help
| Topic | Book | Specific Chapters |
|---|---|---|
| Cache organization fundamentals | CS:APP (3rd ed.) | Chapter 6.1-6.4 |
| Locality and optimization | CS:APP (3rd ed.) | Chapter 6.5-6.6 |
| Performance measurement | CS:APP (3rd ed.) | Chapter 5.1-5.7 |
| Memory hierarchy design | Patterson & Hennessy: Computer Architecture | Chapter 5 |
| Advanced caching | Ulrich Drepper: "What Every Programmer Should Know About Memory" | Sections 3.1-3.3 |
| C bit manipulation | K&R: The C Programming Language | Chapter 2.9 |
| File I/O in C | K&R: The C Programming Language | Chapter 7.5-7.7 |
| Cache-oblivious algorithms | Introduction to Algorithms (CLRS) | Chapter 27 |
This guide was expanded from CSAPP_3E_DEEP_LEARNING_PROJECTS.md. For the complete learning path, see the project index.