Project 9: Cache Lab++ – Cache Simulator + Locality Visualizer
Build a set-associative cache simulator and an ASCII locality visualizer that reveals hit/miss patterns, enabling you to optimize code through informed data layout decisions.
Quick Reference
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 2–3 weeks |
| Chapters | 6, 5 |
| Coolness | ★★★★☆ Hardcore Tech Flex |
| Portfolio Value | Resume Gold |
Learning Objectives
By completing this project, you will:
- Understand cache organization: Master direct-mapped, set-associative, and fully-associative cache structures
- Implement address decomposition: Parse addresses into tag, index, and offset components
- Classify cache misses: Distinguish compulsory, capacity, and conflict misses through simulation
- Apply replacement policies: Implement LRU, FIFO, and random replacement strategies
- Visualize memory access patterns: Create ASCII visualizations showing cache behavior over time
- Optimize for locality: Design data layouts and access patterns that exploit cache behavior
- Measure and analyze: Use your simulator to quantify cache performance and validate optimizations
Deep Theoretical Foundation
The Memory Hierarchy: Why Caches Exist
Modern processors operate orders of magnitude faster than main memory. Without caches, the CPU would spend most of its time waiting for data:
┌─────────────────────────────────────────────────────────────────────────────┐
│ MEMORY HIERARCHY OVERVIEW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Component Size Access Time Bandwidth │
│ ────────────────────────────────────────────────────────────────────── │
│ CPU Registers ~1 KB ~0.25 ns 100s GB/s │
│ L1 Cache 32-64 KB ~1 ns ~500 GB/s │
│ L2 Cache 256 KB-1 MB ~3-4 ns ~200 GB/s │
│ L3 Cache 8-64 MB ~10-20 ns ~100 GB/s │
│ Main Memory 16-256 GB ~50-100 ns ~50 GB/s │
│ SSD 256 GB-4 TB ~100 us ~5 GB/s │
│ HDD 1-20 TB ~10 ms ~200 MB/s │
│ │
│ Speed │
│ ▲ │
│ │ Registers │
│ │ ● │
│ │ L1 Cache │
│ │ ● │
│ │ L2 Cache │
│ │ ● │
│ │ L3 Cache │
│ │ ● │
│ │ Main Memory │
│ │ ● │
│ │ │
│ └─────────────────────────────────────► Size │
│ │
│ Key insight: Smaller = Faster (physics + economics) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Why this hierarchy works: Programs exhibit locality of reference:
- Temporal locality: Recently accessed items are likely to be accessed again soon
- Spatial locality: Items near recently accessed items are likely to be accessed soon
Caches exploit both: they keep recently-used data (temporal) in cache lines that hold multiple adjacent bytes (spatial).
Cache Organization Fundamentals
A cache is organized as a collection of sets, each containing one or more lines (also called blocks):
┌─────────────────────────────────────────────────────────────────────────────┐
│ CACHE ORGANIZATION ANATOMY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Cache Parameters: │
│ ───────────────── │
│ S = Number of sets (S = 2^s, where s = index bits) │
│ E = Lines per set (associativity) (E = 1: direct-mapped) │
│ B = Bytes per line (block size) (B = 2^b, where b = offset bits) │
│ │
│ Total cache size: C = S × E × B bytes │
│ │
│ │
│ Example: 4 sets, 2 lines/set, 4 bytes/line (32-byte 2-way set-assoc) │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ CACHE │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ Set 0 │ Line 0: [V][Tag: 0x1A][B0][B1][B2][B3] │ │
│ │ │ Line 1: [V][Tag: 0x2F][B0][B1][B2][B3] │ │
│ ├───────┼─────────────────────────────────────────────────────────────┤ │
│ │ Set 1 │ Line 0: [V][Tag: 0x05][B0][B1][B2][B3] │ │
│ │ │ Line 1: [ ][Tag: ---][--][--][--][--] (invalid) │ │
│ ├───────┼─────────────────────────────────────────────────────────────┤ │
│ │ Set 2 │ Line 0: [V][Tag: 0x12][B0][B1][B2][B3] │ │
│ │ │ Line 1: [V][Tag: 0x77][B0][B1][B2][B3] │ │
│ ├───────┼─────────────────────────────────────────────────────────────┤ │
│ │ Set 3 │ Line 0: [V][Tag: 0x00][B0][B1][B2][B3] │ │
│ │ │ Line 1: [V][Tag: 0x33][B0][B1][B2][B3] │ │
│ └───────┴─────────────────────────────────────────────────────────────┘ │
│ │
│ Each line contains: │
│ • Valid bit (V): Is this line holding valid data? │
│ • Tag: High bits of address to identify which block is stored │
│ • Data block: B bytes of actual cached data │
│ • (Optional) Dirty bit: Has this block been modified? (for write-back) │
│ • (Optional) LRU bits: Track access recency for replacement │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Cache Types by Associativity
┌─────────────────────────────────────────────────────────────────────────────┐
│ ASSOCIATIVITY SPECTRUM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ DIRECT-MAPPED (E = 1) │
│ ───────────────────── │
│ Each memory block maps to exactly ONE cache line. │
│ │
│ Pros: Simple, fast lookup (no searching) │
│ Cons: High conflict miss rate │
│ │
│ Memory: [A][B][C][D][E][F][G][H][I][J][K][L]... │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ Cache: [0] [1] [2] [3] ← A, E, I all compete for line 0! │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ SET-ASSOCIATIVE (1 < E < C/B) │
│ ──────────────────────────────── │
│ Each block maps to a SET, can go in any line within that set. │
│ │
│ Pros: Reduces conflict misses; reasonable lookup speed │
│ Cons: More complex; requires replacement policy │
│ │
│ 2-way set-associative example: │
│ Memory: [A][B][C][D][E][F][G][H]... │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ Set 0: [_][_] ← A and E can BOTH live here now! │
│ Set 1: [_][_] ← B and F can both live here │
│ │
│ Common configurations: 2-way, 4-way, 8-way │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ FULLY-ASSOCIATIVE (E = C/B, S = 1) │
│ ────────────────────────────────── │
│ Any block can go in ANY cache line. One giant set. │
│ │
│ Pros: No conflict misses (only compulsory and capacity) │
│ Cons: Slow lookup (must search all lines); complex hardware │
│ │
│ Memory: [A][B][C][D][E][F][G][H]... │
│ │ │ │ │ │ │ │ │ │
│ └──┴──┴──┴──┼──┴──┴──┘ │
│ ▼ │
│ Cache: [_][_][_][_][_][_][_][_] ← Any block can go anywhere │
│ │
│ Used mainly for TLBs and very small caches │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Address Decomposition: Tag, Index, Offset
Every memory address is decomposed into three fields that determine cache behavior:
┌─────────────────────────────────────────────────────────────────────────────┐
│ ADDRESS DECOMPOSITION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ For a cache with S = 2^s sets and B = 2^b bytes per block: │
│ │
│ m-bit address: │
│ ┌────────────────────────┬───────────────┬──────────────┐ │
│ │ Tag │ Index │ Offset │ │
│ │ (t bits) │ (s bits) │ (b bits) │ │
│ └────────────────────────┴───────────────┴──────────────┘ │
│ │
│ Where: t = m - s - b │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ EXAMPLE: 32-bit address, 16 sets, 64 bytes/block │
│ │
│ S = 16 = 2^4 → s = 4 bits for index │
│ B = 64 = 2^6 → b = 6 bits for offset │
│ t = 32 - 4 - 6 = 22 bits for tag │
│ │
│ Address 0x12345678: │
│ Binary: 0001 0010 0011 0100 0101 0110 0111 1000 │
│ │
│ ┌──────────────────────┬──────┬────────┐ │
│ │ 0001001000110100010101 │ 1001 │ 111000 │ │
│ │ Tag (22 bits) │Index │ Offset │ │
│ │ = 0x048D15 │ = 9 │ = 56 │ │
│ └──────────────────────┴──────┴────────┘ │
│ │
│ Interpretation: │
│ • Look in Set 9 │
│ • Compare tag 0x048D15 against all lines in set 9 │
│ • If match AND valid bit set: HIT! Return byte at offset 56 │
│ • If no match: MISS. Fetch block, store with this tag │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ SPECIAL CASES: │
│ │
│ Direct-mapped (E=1, S=C/B): │
│ • Maximum index bits │
│ • Block maps to exactly one place │
│ │
│ Fully-associative (S=1): │
│ • No index bits (s=0) │
│ • Tag is almost the entire address │
│ • Must compare tag against ALL lines │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
The Cache Line (Block) Concept
A cache line is the fundamental unit of transfer between cache and memory:
┌─────────────────────────────────────────────────────────────────────────────┐
│ CACHE LINE CONCEPT │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ When CPU requests 1 byte, the cache fetches an ENTIRE BLOCK: │
│ │
│ CPU requests address 0x1004: │
│ │
│ Memory: │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ ... │0x1000│0x1001│0x1002│0x1003│0x1004│0x1005│0x1006│0x1007│ ... │ │
│ │ │ A │ B │ C │ D │ [E] │ F │ G │ H │ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────┘ │
│ ▼ Entire 8-byte block fetched │
│ Cache Line: │
│ ┌───────────────────────────────────────────────────┐ │
│ │ V=1 │ Tag │ A │ B │ C │ D │ E │ F │ G │ H │ │ │
│ └───────────────────────────────────────────────────┘ │
│ ↑ │
│ Requested byte │
│ │
│ WHY BLOCKS? │
│ ──────────── │
│ 1. DRAM is optimized for sequential access (row buffers) │
│ 2. Exploits SPATIAL LOCALITY: if you access E, you probably │
│ will soon access F, G, H... they're now already cached! │
│ 3. Amortizes the cost of the miss over multiple bytes │
│ │
│ BLOCK SIZE TRADEOFFS: │
│ ───────────────────── │
│ Larger blocks: │
│ + Better spatial locality exploitation │
│ + Fewer tags (less overhead) │
│ - Longer miss penalty (more bytes to transfer) │
│ - Fewer blocks for given cache size → more conflict misses │
│ - May fetch unused data (if locality is poor) │
│ │
│ Typical sizes: 32, 64, or 128 bytes │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Miss Types: The Three Cs
Understanding WHY a miss occurred is crucial for optimization:
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE THREE Cs OF CACHE MISSES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. COMPULSORY MISSES (Cold Misses) │
│ ────────────────────────────────── │
│ First access to a block that has never been in cache. │
│ │
│ Cause: Block simply hasn't been loaded yet │
│ Fix: Cannot be eliminated! (Must load data at least once) │
│ Can be reduced with prefetching │
│ │
│ Example: │
│ for (i = 0; i < N; i++) │
│ sum += A[i]; // First access to each cache line is compulsory │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ 2. CAPACITY MISSES │
│ ────────────────── │
│ Working set is larger than the cache. │
│ │
│ Cause: Too much data for cache to hold │
│ Fix: Reduce working set size (blocking/tiling) │
│ Or get a bigger cache (hardware solution) │
│ │
│ Example (cache holds 4 blocks): │
│ // Working set = 8 blocks (A[0..7]) │
│ for (pass = 0; pass < 100; pass++) │
│ for (i = 0; i < 8; i++) │
│ sum += A[i]; // Every pass evicts earlier blocks │
│ │
│ Access pattern: A0,A1,A2,A3,A4,A5,A6,A7, A0(miss!),A1(miss!)... │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ 3. CONFLICT MISSES │
│ ────────────────── │
│ Block evicted due to other blocks mapping to same set. │
│ (Would NOT have occurred in a fully-associative cache of same size) │
│ │
│ Cause: Limited associativity + unfortunate address mapping │
│ Fix: Increase associativity; change data layout; padding │
│ │
│ Example (direct-mapped, 4 sets): │
│ // A[0] and A[4] map to same set! │
│ for (i = 0; i < 100; i++) { │
│ sum += A[0]; // miss, load A[0] │
│ sum += A[4]; // miss, evicts A[0], load A[4] │
│ sum += A[0]; // miss! A[0] was evicted by A[4] │
│ sum += A[4]; // miss! A[4] was evicted by A[0] │
│ } │
│ // Every access is a miss! This is "cache thrashing" │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ IDENTIFYING MISS TYPE IN SIMULATION: │
│ │
│ if (block never seen before): │
│ → COMPULSORY │
│ else if (would hit in fully-associative cache of same size): │
│ → CONFLICT (evicted due to set collision) │
│ else: │
│ → CAPACITY (evicted due to overall cache full) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
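One way to implement that classification is to run two shadow models alongside the real cache: a "seen before" set of block numbers (for compulsory misses) and a fully-associative LRU cache with the same total number of lines as the real cache (to split conflict from capacity). A simplified sketch; all names and the capacity constant are illustrative, and in a real simulator `fa_cap` would be set to the real cache's line count:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BLOCKS 4096   /* assumed footprint limit for this sketch */

typedef enum { MISS_COMPULSORY, MISS_CONFLICT, MISS_CAPACITY } miss_type_t;

static unsigned long seen[MAX_BLOCKS];   /* every block ever referenced */
static int nseen = 0;

static unsigned long fa[64];             /* fully-assoc shadow, LRU order: */
static int nfa = 0, fa_cap = 4;          /* fa[0] is most recently used    */

static bool seen_before(unsigned long blk) {
    for (int i = 0; i < nseen; i++)
        if (seen[i] == blk) return true;
    return false;
}

/* Touch the fully-associative shadow cache; returns true on hit. */
static bool fa_access(unsigned long blk) {
    int i;
    for (i = 0; i < nfa; i++)
        if (fa[i] == blk) break;
    bool hit = (i < nfa);
    if (!hit && nfa < fa_cap) nfa++;     /* grow until full          */
    if (i >= nfa) i = nfa - 1;           /* miss when full: evict LRU */
    memmove(&fa[1], &fa[0], i * sizeof fa[0]);  /* move-to-front */
    fa[0] = blk;
    return hit;
}

/* Call on EVERY access so the shadow LRU stays in sync; when the
 * real cache missed, the return value classifies that miss. */
miss_type_t classify_access(unsigned long blk) {
    bool first = !seen_before(blk);
    if (first) seen[nseen++] = blk;
    bool hit_fa = fa_access(blk);
    if (first) return MISS_COMPULSORY;
    return hit_fa ? MISS_CONFLICT : MISS_CAPACITY;
}
```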
Replacement Policies
When a set is full and we need to load a new block, which victim do we evict?
┌─────────────────────────────────────────────────────────────────────────────┐
│ REPLACEMENT POLICIES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. LRU (Least Recently Used) │
│ ──────────────────────────── │
│ Evict the block that hasn't been accessed for the longest time. │
│ │
│ Rationale: Temporal locality — recently used items likely used again │
│ Implementation: Track access order per set │
│ │
│ Example (2-way set, LRU policy): │
│ Access sequence: A, B, A, C │
│ │
│ After A: [A*, _] (* = most recent) │
│ After B: [A, B*] │
│ After A: [A*, B] (A touched, now most recent) │
│ After C: evict B (least recently used); C replaces it │
│ [A, C*] │
│ │
│ Pros: Often best hit rate; matches locality assumption │
│ Cons: Hardware complexity for high associativity; │
│ O(E) state per set; update on every access │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ 2. FIFO (First In, First Out) │
│ ───────────────────────────── │
│ Evict the block that was loaded earliest. │
│ │
│ Rationale: Simple approximation; older data less likely needed │
│ Implementation: Circular pointer per set │
│ │
│ Example (2-way set, FIFO policy): │
│ Access sequence: A, B, A, C │
│ │
│ After A: [A, _] pointer→1 │
│ After B: [A, B] pointer→0 │
│ After A: [A, B] (hit, pointer unchanged) │
│ After C: [C, B] evict A (oldest), pointer→1 │
│ │
│ Note: A was evicted even though it was just accessed! │
│ FIFO ignores recency. (FIFO is also prone to Belady's │
│ anomaly: a larger cache can sometimes miss MORE.) │
│ │
│ Pros: Simple to implement; O(1) state per set │
│ Cons: Ignores recency; can have poor hit rate │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ 3. RANDOM │
│ ──────── │
│ Pick a random victim from the set. │
│ │
│ Rationale: Unpredictable; avoids pathological worst-cases │
│ Implementation: Random number generator │
│ │
│ Pros: Simplest hardware; no state needed; avoids worst cases │
│ Cons: Unpredictable; may evict hot blocks │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ 4. LFU (Least Frequently Used) │
│ ────────────────────────────── │
│ Evict the block with fewest accesses. │
│ │
│ Pros: Good for some access patterns │
│ Cons: Doesn't adapt to changing access patterns; │
│ blocks with high counts early on may never be evicted │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ COMPARISON (typical hit rates): │
│ │
│ LRU ≥ Pseudo-LRU (PLRU) > FIFO > Random │
│ │
│ In practice: Modern CPUs use variants like Pseudo-LRU or │
│ tree-based approximations for hardware efficiency. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Write Policies
What happens when the CPU writes to a cached block?
┌─────────────────────────────────────────────────────────────────────────────┐
│ WRITE POLICIES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TWO INDEPENDENT DECISIONS: │
│ │
│ 1. WRITE HIT: What to do when we write to a block already in cache? │
│ │
│ WRITE-THROUGH WRITE-BACK │
│ ───────────── ────────── │
│ Write to cache AND memory Write to cache only │
│ immediately Mark block "dirty" │
│ Write to memory only on eviction │
│ │
│ + Memory always consistent + Fewer memory writes │
│ + Simple + Better performance │
│ - High memory traffic - Complex (track dirty bits) │
│ - Slow (wait for memory) - Memory inconsistent until evict │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ 2. WRITE MISS: What to do when we write to a block NOT in cache? │
│ │
│ WRITE-ALLOCATE NO-WRITE-ALLOCATE │
│ ───────────── ───────────────── │
│ Load block into cache, Write directly to memory │
│ then write to cache Don't load into cache │
│ │
│ + Good if writing then reading + Saves cache space for reads │
│ + Exploits spatial locality - No locality benefit │
│ - Extra memory read on miss │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ COMMON COMBINATIONS: │
│ │
│ Write-Through + No-Write-Allocate │
│ • Writes always go to memory; reads populate cache │
│ • Used in some L1 caches │
│ │
│ Write-Back + Write-Allocate (most common today) │
│ • Writes stay in cache until eviction │
│ • Exploits temporal locality on writes │
│ • Used in most L2/L3 caches │
│ │
│ FOR THIS PROJECT: │
│ We focus on READ behavior. Write policies are a natural extension. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Spatial and Temporal Locality
These are the principles that make caching effective:
┌─────────────────────────────────────────────────────────────────────────────┐
│ LOCALITY PRINCIPLES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TEMPORAL LOCALITY │
│ ───────────────── │
│ "If you accessed it recently, you'll probably access it again soon." │
│ │
│ Good temporal locality: Poor temporal locality: │
│ ──────────────────────── ───────────────────────── │
│ sum = 0; // Process each element once │
│ for (i = 0; i < 1000; i++) { for (i = 0; i < N; i++) │
│ sum += A[i % 10]; // Reuse! process(A[i]); // No reuse │
│ } │
│ │
│ Access pattern: 0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,... │
│ Each element accessed 100 times → 99% reuse from cache! │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ SPATIAL LOCALITY │
│ ──────────────── │
│ "If you accessed location X, you'll probably access nearby locations." │
│ │
│ Good spatial locality: Poor spatial locality: │
│ ────────────────────── ───────────────────────── │
│ // Sequential access // Stride access │
│ for (i = 0; i < N; i++) for (i = 0; i < N; i += 64) │
│ sum += A[i]; sum += A[i]; │
│ │
│ Access pattern: 0,1,2,3,4,5,6,7... Access pattern: 0,64,128,192... │
│ Each cache line fully utilized Each cache line: 1 useful byte │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ QUANTIFYING LOCALITY: │
│ │
│ Miss Rate = Misses / Total Accesses │
│ │
│ For sequential 1-byte accesses with B-byte blocks: │
│ Miss rate ≈ 1/B (one miss loads a block → B-1 hits follow) │
│ │
│ For stride-K (bytes) accesses with B-byte blocks: │
│ If K < B: Miss rate ≈ K/B (partial spatial locality) │
│ If K ≥ B: Miss rate ≈ 100% (every access touches a new block) │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ MATRIX ACCESS PATTERNS: │
│ │
│ Row-major storage: A[i][j] at address base + i*N + j │
│ │
│ Row-wise access (good): Column-wise access (bad): │
│ for (i...) for (j...) for (j...) for (i...) │
│ A[i][j] A[i][j] │
│ │
│ Memory: [0,0][0,1][0,2][0,3][1,0][1,1][1,2][1,3]... │
│ Access: 1 2 3 4 5 6 7 8 (row-wise) │
│ Access: 1 5 9 13 2 6 10 14 (col-wise = strided!) │
│ │
│ Column-wise in row-major = stride of N elements! │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Blocking (Tiling) for Cache Optimization
The key optimization technique for improving cache performance:
┌─────────────────────────────────────────────────────────────────────────────┐
│ BLOCKING / TILING TECHNIQUE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PROBLEM: Matrix multiply naive version has poor locality │
│ │
│ // C = A × B (N × N matrices) │
│ for (i = 0; i < N; i++) │
│ for (j = 0; j < N; j++) │
│ for (k = 0; k < N; k++) │
│ C[i][j] += A[i][k] * B[k][j]; │
│ │
│ Access analysis (row-major storage): │
│ • A[i][k]: Row-wise access (good) │
│ • B[k][j]: Column-wise access (BAD - stride N!) │
│ • C[i][j]: Repeated access (good temporal locality) │
│ │
│ For large N, B causes cache thrashing. │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ SOLUTION: Process in cache-sized blocks │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ A B C │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │███│ │ │ │███│ │ │ │███│ │ │ │ │
│ │ ├───┼───┼───┤ × ├───┼───┼───┤ = ├───┼───┼───┤ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ ├───┼───┼───┤ ├───┼───┼───┤ ├───┼───┼───┤ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ │ │
│ │ │ │
│ │ Process one block at a time: │ │
│ │ C[block] += A[row_block] × B[col_block] │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ BLOCKED MATRIX MULTIPLY: │
│ │
│ // BLOCK chosen so three BLOCK×BLOCK tiles fit in cache │
│ for (ii = 0; ii < N; ii += BLOCK) │
│ for (jj = 0; jj < N; jj += BLOCK) │
│ for (kk = 0; kk < N; kk += BLOCK) │
│ // Mini matrix multiply on blocks │
│ for (i = ii; i < min(ii+BLOCK, N); i++) │
│ for (j = jj; j < min(jj+BLOCK, N); j++) │
│ for (k = kk; k < min(kk+BLOCK, N); k++) │
│ C[i][j] += A[i][k] * B[k][j]; │
│ │
│ CHOOSING BLOCK SIZE: │
│ ──────────────────── │
│ • Need 3 blocks in cache: A_block, B_block, C_block │
│ • Each block is BLOCK × BLOCK elements │
│ • Total: 3 × BLOCK² × sizeof(element) < Cache Size │
│ • BLOCK ≈ sqrt(CacheSize / (3 × sizeof(element))) │
│ │
│ Example: 32KB L1 cache, double (8 bytes) │
│ BLOCK ≈ sqrt(32768 / 24) ≈ 37 │
│ Use BLOCK = 32 (power of 2 for alignment) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Cache-Oblivious Algorithms
An advanced concept: algorithms that are efficient without knowing cache size:
┌─────────────────────────────────────────────────────────────────────────────┐
│ CACHE-OBLIVIOUS ALGORITHMS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ CACHE-AWARE (blocking): │
│ • You choose block size based on known cache size │
│ • Good: Optimal for that cache │
│ • Bad: Need to retune for different caches/hierarchy levels │
│ │
│ CACHE-OBLIVIOUS: │
│ • Algorithm is designed to work well for ANY cache size │
│ • Uses recursive divide-and-conquer │
│ • Automatically adapts to all levels of hierarchy │
│ │
│ ─────────────────────────────────────────────────────────────────────── │
│ │
│ EXAMPLE: Cache-oblivious matrix multiply │
│ │
│ void matmul(A, B, C, n) { │
│ if (n <= BASE_CASE) { │
│ // Direct multiply (fits in registers) │
│ base_case_multiply(A, B, C); │
│ } else { │
│ // Divide each matrix into quadrants │
│ // ┌────┬────┐ │
│ // │ A11│ A12│ │
│ // ├────┼────┤ │
│ // │ A21│ A22│ │
│ // └────┴────┘ │
│ // │
│ // C11 = A11×B11 + A12×B21 │
│ // C12 = A11×B12 + A12×B22 │
│ // C21 = A21×B11 + A22×B21 │
│ // C22 = A21×B12 + A22×B22 │
│ │
│ // 8 recursive multiplies, 4 additions │
│ matmul(A11, B11, T1, n/2); │
│ matmul(A12, B21, T2, n/2); │
│ add(T1, T2, C11, n/2); │
│ // ... etc. │
│ } │
│ } │
│ │
│ WHY IT WORKS: │
│ ───────────── │
│ At some recursion level, subproblems fit in cache. │
│ From then on, all accesses are cache hits. │
│ This happens automatically for any cache size! │
│ │
│ THEORETICAL GUARANTEE: │
│ Cache misses = O(N³ / (B × sqrt(M))) │
│ Where: N = matrix size, B = block size, M = cache size │
│ This is optimal for any M and B (under the tall-cache assumption)! │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Project Specification
What You Will Build
A two-part system:
- Cache Simulator: A configurable set-associative cache simulator that processes memory access traces and reports hit/miss statistics with miss classification.
- Locality Visualizer: An ASCII-art visualization tool that shows cache behavior over time, making hit/miss patterns visible and intuitive.
Functional Requirements
Cache Simulator (csim):
- Configuration (-s, -E, -b flags):
  - Accept cache parameters: s (index bits), E (associativity), b (block bits)
  - Support cache sizes from tiny (s=1, E=1, b=2) to realistic (s=10, E=8, b=6)
- Trace Processing:
  - Read Valgrind-format memory traces:
      I 0400d7d4,8  (instruction fetch - optional to handle)
      L 7ff0005c8,8 (data load)
      S 7ff0005d0,4 (data store)
      M 0421c7f0,4  (modify = load then store)
  - Process each access and update cache state
- Statistics Reporting:
- Count: hits, misses, evictions
- Classify misses: compulsory, capacity, conflict
- Support verbose mode showing each access result
- Replacement Policies:
- Implement LRU (required)
- FIFO and Random (stretch goals)
Locality Visualizer (lvis):
- Visual Output:
- Display cache state as ASCII art after each access (or at intervals)
- Show which sets are being accessed
- Highlight hits (green/+) vs misses (red/-)
- Pattern Recognition:
- Identify and annotate access patterns (sequential, strided, random)
- Calculate and display running miss rate
- Code Path Mode:
- Accept annotated traces that include source location
- Group statistics by code region
Non-Functional Requirements
- Correctness: Match reference simulator output exactly
- Performance: Handle traces with millions of accesses in reasonable time
- Clarity: Output must be educational, not just numerical
- Modularity: Clean separation between simulator core and visualization
Example Usage and Output
Basic Simulation:
$ ./csim -s 4 -E 1 -b 4 -t traces/yi.trace
hits:4 misses:5 evictions:3
$ ./csim -v -s 4 -E 1 -b 4 -t traces/yi.trace
L 10,1 miss
M 20,1 miss hit
L 22,1 hit
S 18,1 hit
L 110,1 miss eviction
L 210,1 miss eviction
M 12,1 miss eviction hit
hits:4 misses:5 evictions:3
Miss Classification:
$ ./csim -s 4 -E 1 -b 4 -t traces/yi.trace --classify
Cache Configuration:
Sets: 16 (s=4)
Lines per set: 1 (E=1, direct-mapped)
Block size: 16 bytes (b=4)
Total size: 256 bytes
Results:
Hits: 4 (44.4%)
Misses: 5 (55.6%)
Compulsory: 4 (80.0% of misses)
Capacity: 0 (0.0% of misses)
Conflict: 1 (20.0% of misses)
Evictions: 3
Locality Visualizer Output:
$ ./lvis -s 2 -E 2 -b 2 -t traces/simple.trace
=== CACHE LOCALITY VISUALIZER ===
Cache: 4 sets × 2 ways × 4 bytes = 32 bytes
Access #1: L 0x00 [MISS - Compulsory]
┌──────────────────────────────────┐
│ Set 0: [████ tag=0x00] [ ] │ ← ACCESSED (miss)
│ Set 1: [ ] [ ] │
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└──────────────────────────────────┘
Running: hits=0 misses=1 (0.0% hit rate)
Access #2: L 0x04 [MISS - Compulsory]
┌──────────────────────────────────┐
│ Set 0: [████ tag=0x00] [ ] │
│ Set 1: [████ tag=0x00] [ ] │ ← ACCESSED (miss)
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└──────────────────────────────────┘
Running: hits=0 misses=2 (0.0% hit rate)
Access #3: L 0x00 [HIT]
┌──────────────────────────────────┐
│ Set 0: [████ tag=0x00] [ ] │ ← ACCESSED (HIT!)
│ Set 1: [████ tag=0x00] [ ] │
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└──────────────────────────────────┘
Running: hits=1 misses=2 (33.3% hit rate)
Access #4: L 0x10 [MISS - Compulsory]
┌──────────────────────────────────┐
│ Set 0: [████ tag=0x01] [████ tag=0x00] │ ← ACCESSED (miss, filled 2nd way)
│ Set 1: [████ tag=0x00] [ ] │
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└──────────────────────────────────┘
Running: hits=1 misses=3 (25.0% hit rate)
=== PATTERN ANALYSIS ===
Stride detected: 4 bytes (matches block size)
Spatial locality: GOOD (sequential within blocks)
Temporal locality: MODERATE (some reuse observed)
Compulsory misses: 3 (100% of misses) - cold-start cost, unavoidable
Real World Outcome
When you complete this project, you will have a fully functional cache simulator and locality visualizer. Here is exactly what running your tools will look like:
Cache Simulator Basic Usage
$ ./csim -s 4 -E 1 -b 4 -t traces/yi.trace
hits:4 misses:5 evictions:3
Verbose Mode with Full Trace
$ ./csim -v -s 4 -E 1 -b 4 -t traces/yi.trace
Cache Configuration:
Sets (S): 16 (s=4 bits)
Lines per set (E): 1 (direct-mapped)
Block size (B): 16 bytes (b=4 bits)
Cache size: 256 bytes
Processing trace: traces/yi.trace
─────────────────────────────────────────────────────────────
L 10,1 set=1 tag=0x000000 miss
M 20,1 set=2 tag=0x000000 miss hit
L 22,1 set=2 tag=0x000000 hit
S 18,1 set=1 tag=0x000000 hit
L 110,1 set=1 tag=0x000001 miss eviction
L 210,1 set=1 tag=0x000002 miss eviction
M 12,1 set=1 tag=0x000000 miss eviction hit
─────────────────────────────────────────────────────────────
Summary:
hits:4 misses:5 evictions:3
Hit rate: 44.4%
Miss rate: 55.6%
Miss Classification Analysis
$ ./csim --classify -s 4 -E 2 -b 4 -t traces/matrix_multiply.trace
Cache Configuration:
Sets: 16 (s=4)
Lines per set: 2 (E=2)
Block size: 16 bytes (b=4)
Total size: 512 bytes
Access Summary:
Total accesses: 10,000
Hits: 7,234 (72.3%)
Misses: 2,766 (27.7%)
Miss Classification:
┌─────────────────────────────────────────────────────┐
│ Compulsory: 1,024 (37.0% of misses) █████████ │
│ Capacity: 892 (32.3% of misses) ████████ │
│ Conflict: 850 (30.7% of misses) ███████ │
└─────────────────────────────────────────────────────┘
Evictions: 1,742
Recommendations:
- High conflict misses: Consider 4-way associativity
- Capacity misses indicate working set > cache size
- Use blocking with block size ~16 elements
Locality Visualizer Output
$ ./lvis -s 2 -E 2 -b 2 -t traces/simple.trace --animate
╔══════════════════════════════════════════════════════════════╗
║ CACHE LOCALITY VISUALIZER ║
║ Cache: 4 sets × 2 ways × 4 bytes = 32 bytes ║
╚══════════════════════════════════════════════════════════════╝
Access #1: L 0x00 (4 bytes)
┌────────────────────────────────────────────┐
│ Set 0: [████ t=0x00] [ ] │ ← MISS (Compulsory)
│ Set 1: [ ] [ ] │
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└────────────────────────────────────────────┘
Stats: hits=0 misses=1 rate=0.0%
Access #2: L 0x04 (4 bytes)
┌────────────────────────────────────────────┐
│ Set 0: [████ t=0x00] [ ] │
│ Set 1: [████ t=0x00] [ ] │ ← MISS (Compulsory)
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└────────────────────────────────────────────┘
Stats: hits=0 misses=2 rate=0.0%
Access #3: L 0x00 (4 bytes)
┌────────────────────────────────────────────┐
│ Set 0: [████ t=0x00] [ ] │ ← HIT!
│ Set 1: [████ t=0x00] [ ] │
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└────────────────────────────────────────────┘
Stats: hits=1 misses=2 rate=33.3%
Access #4: L 0x10 (4 bytes)
┌────────────────────────────────────────────┐
│ Set 0: [████ t=0x01] [████ t=0x00] │ ← MISS (Compulsory, new tag)
│ Set 1: [████ t=0x00] [ ] │
│ Set 2: [ ] [ ] │
│ Set 3: [ ] [ ] │
└────────────────────────────────────────────┘
Stats: hits=1 misses=3 rate=25.0%
═══════════════════════════════════════════════════════════════
PATTERN ANALYSIS
═══════════════════════════════════════════════════════════════
Detected stride: 4 bytes
Spatial locality: GOOD (sequential within cache lines)
Temporal locality: MODERATE (50% reuse of cached data)
Recommendation: Current cache configuration is adequate
═══════════════════════════════════════════════════════════════
Matrix Multiply Optimization Analysis
$ ./csim --compare -s 5 -E 4 -b 6 \
-t traces/matmul_naive.trace \
-t traces/matmul_blocked.trace
Comparing cache behavior between two traces:
Cache: 32 sets, 4-way, 64-byte blocks (8 KB)
┌──────────────────────────────────────────────────────────────┐
│ COMPARISON RESULTS │
├──────────────────┬────────────────────┬──────────────────────┤
│ │ matmul_naive │ matmul_blocked │
├──────────────────┼────────────────────┼──────────────────────┤
│ Total accesses │ 2,097,152 │ 2,097,152 │
│ Hits │ 524,288 (25.0%) │ 1,835,008 (87.5%) │
│ Misses │ 1,572,864 (75.0%) │ 262,144 (12.5%) │
│ Evictions │ 1,572,832 │ 262,112 │
├──────────────────┼────────────────────┼──────────────────────┤
│ Est. cycles │ 157,810,688 │ 28,049,408 │
│ Speedup │ 1.0x (baseline) │ 5.6x faster │
└──────────────────┴────────────────────┴──────────────────────┘
Analysis: Blocking reduced misses by 83.3%
Estimated 5.6x performance improvement
Solution Architecture
High-Level Design
┌─────────────────────────────────────────────────────────────────────────────┐
│ SYSTEM ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────────────────────────────────────┐ │
│ │ │ │ CACHE SIMULATOR │ │
│ │ Trace │ │ ┌─────────────────────────────────────────┐│ │
│ │ Parser │──────▶│ │ Cache Controller ││ │
│ │ │ │ │ ┌───────────┐ ┌───────────────────┐ ││ │
│ └─────────────┘ │ │ │ Address │ │ Cache State │ ││ │
│ ▲ │ │ │ Decomposer│ │ (sets, lines) │ ││ │
│ │ │ │ └───────────┘ └───────────────────┘ ││ │
│ ┌─────────────┐ │ │ ┌───────────┐ ┌───────────────────┐ ││ │
│ │ Trace │ │ │ │ Hit/Miss │ │ Replacement │ ││ │
│ │ Files │ │ │ │ Logic │ │ Policy (LRU/FIFO) │ ││ │
│ │ (valgrind) │ │ │ └───────────┘ └───────────────────┘ ││ │
│ └─────────────┘ │ └─────────────────────────────────────────┘│ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────┐│ │
│ │ │ Statistics Collector ││ │
│ │ │ hits, misses, evictions, miss types ││ │
│ │ └─────────────────────────────────────────┘│ │
│ └─────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┴────────────────────┐ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Text Reporter │ │ ASCII │ │
│ │ (stats output) │ │ Visualizer │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Data Structures
/* Cache line structure */
typedef struct {
int valid; /* Valid bit */
unsigned long tag; /* Tag bits */
int lru_counter; /* For LRU replacement (higher = more recent) */
int load_time; /* For FIFO replacement */
int dirty; /* Dirty bit (for write-back, extension) */
} cache_line_t;
/* Cache set structure */
typedef struct {
cache_line_t *lines; /* Array of E lines */
int num_lines; /* Associativity (E) */
} cache_set_t;
/* Cache structure */
typedef struct {
cache_set_t *sets; /* Array of S sets */
int num_sets; /* S = 2^s */
int lines_per_set; /* E */
int block_size; /* B = 2^b */
/* Bit counts for address decomposition */
int s; /* Index bits */
int b; /* Offset bits */
int t; /* Tag bits = m - s - b */
/* Statistics */
int hits;
int misses;
int evictions;
/* Miss classification */
int compulsory_misses;
int capacity_misses;
int conflict_misses;
/* Global LRU counter */
int lru_clock;
/* For miss classification: track all unique blocks ever seen */
/* (Using a set/hash table in real implementation) */
} cache_t;
/* Memory access structure */
typedef struct {
char op; /* 'L' (load), 'S' (store), 'M' (modify) */
unsigned long addr; /* Memory address */
int size; /* Size of access in bytes */
} mem_access_t;
/* Access result for visualization */
typedef struct {
int hit; /* 1 if hit, 0 if miss */
int eviction; /* 1 if eviction occurred */
int set_index; /* Which set was accessed */
int line_index; /* Which line in set */
unsigned long tag; /* Tag of accessed block */
char miss_type; /* 'C' = compulsory, 'A' = capacity, 'O' = conflict */
} access_result_t;
Module Structure
cache-simulator/
├── src/
│ ├── main.c # Entry point, CLI parsing
│ ├── cache.c # Cache simulation logic
│ ├── cache.h # Cache data structures and API
│ ├── trace.c # Trace file parsing
│ ├── trace.h # Trace structures
│ ├── lru.c # LRU replacement policy
│ ├── replacement.h # Replacement policy interface
│ ├── classify.c # Miss classification logic
│ ├── stats.c # Statistics collection
│ ├── visualizer.c # ASCII visualization
│ └── visualizer.h # Visualizer API
├── tests/
│ ├── traces/ # Test trace files
│ │ ├── yi.trace
│ │ ├── dave.trace
│ │ └── trans.trace
│ ├── test_cache.c # Unit tests for cache logic
│ ├── test_lru.c # Unit tests for LRU
│ └── expected/ # Expected outputs
├── tools/
│ ├── gen_trace.py # Generate custom traces
│ └── analyze_pattern.py # Pattern analysis helper
├── Makefile
└── README.md
Core Algorithm: Cache Access
┌─────────────────────────────────────────────────────────────────────────────┐
│ CACHE ACCESS ALGORITHM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ access_cache(cache, address): │
│ ───────────────────────────── │
│ │
│ 1. DECOMPOSE ADDRESS │
│ ┌─────────────────────────────────────────┐ │
│ │ tag = address >> (s + b) │ │
│ │ set_index = (address >> b) & ((1 << s) - 1) │
│ │ offset = address & ((1 << b) - 1) │ │
│ └─────────────────────────────────────────┘ │
│ │
│ 2. SEARCH SET FOR MATCHING TAG │
│ ┌─────────────────────────────────────────┐ │
│ │ set = cache.sets[set_index] │ │
│ │ for each line in set: │ │
│ │ if line.valid AND line.tag == tag: │ │
│ │ → HIT! Update LRU, return │ │
│ └─────────────────────────────────────────┘ │
│ │
│ 3. CACHE MISS - NEED TO LOAD BLOCK │
│ ┌─────────────────────────────────────────┐ │
│ │ // First check for empty line │ │
│ │ for each line in set: │ │
│ │ if NOT line.valid: │ │
│ │ → Use this empty line │ │
│ │ → No eviction needed │ │
│ │ break │ │
│ │ │ │
│ │ // No empty line - must evict │ │
│ │ if all lines valid: │ │
│ │ victim = select_victim(set) // LRU │ │
│ │ eviction = true │ │
│ └─────────────────────────────────────────┘ │
│ │
│ 4. INSTALL NEW BLOCK │
│ ┌─────────────────────────────────────────┐ │
│ │ line.valid = true │ │
│ │ line.tag = tag │ │
│ │ line.lru_counter = cache.lru_clock++ │ │
│ └─────────────────────────────────────────┘ │
│ │
│ 5. UPDATE STATISTICS │
│ ┌─────────────────────────────────────────┐ │
│ │ if hit: cache.hits++ │ │
│ │ else: cache.misses++ │ │
│ │ if eviction: cache.evictions++ │ │
│ │ classify_miss(...) │ │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Implementation Guide
Development Environment Setup
# Required tools
sudo apt-get install gcc gdb valgrind build-essential
# Optional: valgrind for generating traces
# (Traces are provided, but you can generate your own)
# Create project structure
mkdir -p cache-simulator/{src,tests/traces,tools}
cd cache-simulator
Project Structure
cache-simulator/
├── src/
│ ├── main.c
│ ├── cache.c / cache.h
│ ├── trace.c / trace.h
│ ├── visualizer.c / visualizer.h
│ └── stats.c / stats.h
├── tests/
│ ├── traces/
│ └── test_*.c
├── Makefile
└── README.md
Implementation Phases
Phase 1: Address Decomposition (Days 1-2)
Goals:
- Implement address parsing into tag, index, offset
- Validate decomposition with known examples
Tasks:
- Create `cache.h` with basic structures
- Implement `decompose_address(addr, s, b, &tag, &index, &offset)`
- Write unit tests with hand-calculated examples
- Test edge cases (addresses at block boundaries)
Test Cases:
// Cache: s=4, b=4 (16 sets, 16-byte blocks)
// Address: 0x12345678
// Expected decomposition:
// offset = 0x8 (4 bits)
// index = 0x7 (4 bits)
// tag = 0x123456 (remaining 56 bits)
void test_decompose() {
int s = 4, b = 4;
unsigned long addr = 0x12345678;
unsigned long tag, index, offset;
decompose_address(addr, s, b, &tag, &index, &offset);
assert(offset == 0x8);
assert(index == 0x7);
assert(tag == 0x123456);
printf("Address decomposition: PASSED\n");
}
Checkpoint: Address decomposition passes all unit tests.
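One way to implement `decompose_address` so the test above passes; the shifts and masks mirror the formulas in the Cache Access Algorithm box:

```c
/* Split a 64-bit address into tag, set index, and block offset.
   s = number of index bits, b = number of offset bits. */
void decompose_address(unsigned long addr, int s, int b,
                       unsigned long *tag, unsigned long *index,
                       unsigned long *offset)
{
    *offset = addr & ((1UL << b) - 1);        /* low b bits  */
    *index  = (addr >> b) & ((1UL << s) - 1); /* next s bits */
    *tag    = addr >> (s + b);                /* the rest    */
}
```

Note the `1UL` literals: with a plain `1`, shifts of 32 or more would overflow a 32-bit `int`.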
Phase 2: Cache Data Structures (Days 3-4)
Goals:
- Implement cache initialization
- Create set and line data structures
- Implement cache state printing (for debugging)
Tasks:
- Implement `cache_t* cache_init(int s, int E, int b)`
- Implement `cache_free(cache_t *cache)`
- Implement `cache_print_state(cache_t *cache)` for debugging
- Validate memory allocation/deallocation
Implementation:
cache_t* cache_init(int s, int E, int b) {
cache_t *cache = malloc(sizeof(cache_t));
cache->s = s;
cache->b = b;
cache->num_sets = 1 << s; // 2^s
cache->lines_per_set = E;
cache->block_size = 1 << b; // 2^b
cache->t = 64 - s - b; // Assuming 64-bit addresses
// Allocate sets
cache->sets = malloc(cache->num_sets * sizeof(cache_set_t));
for (int i = 0; i < cache->num_sets; i++) {
cache->sets[i].lines = calloc(E, sizeof(cache_line_t));
cache->sets[i].num_lines = E;
}
// Initialize statistics
cache->hits = cache->misses = cache->evictions = 0;
cache->compulsory_misses = cache->capacity_misses = cache->conflict_misses = 0;
cache->lru_clock = 0;
return cache;
}
Checkpoint: Cache initializes correctly; memory tools show no leaks.
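To satisfy the leak-free checkpoint, `cache_free` should release memory in the reverse order of allocation. A minimal sketch, with the struct definitions abridged so the snippet stands alone:

```c
#include <stdlib.h>

/* Abridged versions of the structures from Key Data Structures. */
typedef struct { int valid; unsigned long tag; int lru_counter; } cache_line_t;
typedef struct { cache_line_t *lines; int num_lines; } cache_set_t;
typedef struct { cache_set_t *sets; int num_sets; } cache_t;

/* Free the per-set line arrays first, then the set array, then the
   cache struct itself -- the reverse of cache_init's allocations. */
void cache_free(cache_t *cache)
{
    if (!cache)
        return;
    for (int i = 0; i < cache->num_sets; i++)
        free(cache->sets[i].lines);
    free(cache->sets);
    free(cache);
}
```

Run the test suite under a leak checker (e.g. `valgrind --leak-check=full`) to confirm every allocation is paired with a free.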
Phase 3: Cache Access Logic (Days 5-8)
Goals:
- Implement cache lookup
- Implement cache update on miss
- Track hits, misses, evictions
Tasks:
- Implement `access_result_t cache_access(cache_t *cache, unsigned long addr)`
- Implement hit detection (tag match + valid)
- Implement miss handling (find empty line or evict)
- Update statistics on each access
- Test with simple traces
Core Implementation:
access_result_t cache_access(cache_t *cache, unsigned long addr) {
access_result_t result = {0};
unsigned long tag, set_index, offset;
decompose_address(addr, cache->s, cache->b, &tag, &set_index, &offset);
cache_set_t *set = &cache->sets[set_index];
result.set_index = set_index;
result.tag = tag;
// Check for hit
for (int i = 0; i < set->num_lines; i++) {
if (set->lines[i].valid && set->lines[i].tag == tag) {
// HIT
result.hit = 1;
result.line_index = i;
cache->hits++;
set->lines[i].lru_counter = cache->lru_clock++;
return result;
}
}
// MISS
cache->misses++;
result.hit = 0;
// Find empty line or evict
int target_line = -1;
for (int i = 0; i < set->num_lines; i++) {
if (!set->lines[i].valid) {
target_line = i;
break;
}
}
if (target_line == -1) {
// All lines full - evict LRU
target_line = find_lru_victim(set);
result.eviction = 1;
cache->evictions++;
}
// Install new block
set->lines[target_line].valid = 1;
set->lines[target_line].tag = tag;
set->lines[target_line].lru_counter = cache->lru_clock++;
result.line_index = target_line;
return result;
}
Checkpoint: Simple hand-traced examples produce correct hit/miss counts.
Phase 4: LRU Replacement (Days 9-10)
Goals:
- Implement correct LRU replacement
- Validate against known LRU behavior
Tasks:
- Implement `find_lru_victim(cache_set_t *set)`
- Ensure LRU counters update on every access (hit or miss)
- Test LRU ordering with specific sequences
LRU Test Case:
// 2-way set associative, accessing same set
// Sequence: A, B, A, C
// Expected: A, B both cached; access A updates recency;
// C evicts B (not A, because A more recent)
void test_lru() {
cache_t *cache = cache_init(1, 2, 4); // 2 sets, 2-way, 16B blocks
// All map to set 0
unsigned long A = 0x00; // tag 0
unsigned long B = 0x20; // tag 1 (same set, different tag)
unsigned long C = 0x40; // tag 2 (same set)
cache_access(cache, A); // Miss, load A
cache_access(cache, B); // Miss, load B. Set: [A, B]
cache_access(cache, A); // Hit on A. A now more recent
access_result_t r = cache_access(cache, C); // Miss, evict B (LRU)
assert(r.eviction == 1);
// Verify B was evicted, not A
assert(cache->sets[0].lines[0].tag == 0 ||
cache->sets[0].lines[1].tag == 0); // A still present
assert(cache->sets[0].lines[0].tag == 2 ||
cache->sets[0].lines[1].tag == 2); // C now present
printf("LRU replacement: PASSED\n");
cache_free(cache);
}
Checkpoint: LRU test cases pass; replacement order is correct.
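With the counter scheme used in `cache_access`, `find_lru_victim` is just a scan for the smallest `lru_counter`. A minimal sketch (struct definitions repeated from Key Data Structures so the snippet stands alone; the caller only invokes this when every line in the set is valid):

```c
/* Repeated from Key Data Structures for self-containment. */
typedef struct {
    int valid;
    unsigned long tag;
    int lru_counter;   /* higher = more recently used */
    int load_time;
    int dirty;
} cache_line_t;

typedef struct {
    cache_line_t *lines;
    int num_lines;
} cache_set_t;

/* Return the index of the least-recently-used line in the set. */
int find_lru_victim(cache_set_t *set)
{
    int victim = 0;
    for (int i = 1; i < set->num_lines; i++) {
        if (set->lines[i].lru_counter < set->lines[victim].lru_counter)
            victim = i;
    }
    return victim;
}
```

This is O(E) per eviction, which is fine for typical associativities (E ≤ 16); a doubly-linked list gives O(1) but is rarely worth the complexity here.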
Phase 5: Trace Parser (Days 11-12)
Goals:
- Parse Valgrind trace format
- Handle all operation types (L, S, M)
Tasks:
- Implement `trace_open(filename)`
- Implement `trace_next_access(trace, &access)`
- Handle 'M' as two accesses (load then store)
- Validate against provided traces
Trace Format:
I 0400d7d4,8 <- Instruction fetch (ignore or optional)
L 7ff0005c8,8 <- Data load (note leading space)
S 7ff0005d0,4 <- Data store
M 0421c7f0,4 <- Modify (load + store)
Implementation:
int trace_parse_line(char *line, mem_access_t *access) {
if (line[0] == 'I') {
return 0; // Skip instruction fetches
}
char op;
unsigned long addr;
int size;
if (sscanf(line, " %c %lx,%d", &op, &addr, &size) == 3) {
access->op = op;
access->addr = addr;
access->size = size;
return 1;
}
return 0;
}
Checkpoint: Trace parser correctly reads provided test traces.
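A streaming driver can pair the parser with `fgets` so traces of any size are processed one line at a time, never held fully in memory. A sketch (`trace_parse_line` is repeated from above for self-containment; `replay_trace` is an illustrative name, not a required API):

```c
#include <stdio.h>

typedef struct {
    char op;            /* 'L', 'S', or 'M' */
    unsigned long addr;
    int size;
} mem_access_t;

/* Returns 1 on a data access, 0 otherwise (instruction fetches
   and malformed lines are skipped). Repeated from the phase text. */
int trace_parse_line(char *line, mem_access_t *access)
{
    if (line[0] == 'I')
        return 0;
    char op;
    unsigned long addr;
    int size;
    if (sscanf(line, " %c %lx,%d", &op, &addr, &size) == 3) {
        access->op = op;
        access->addr = addr;
        access->size = size;
        return 1;
    }
    return 0;
}

/* Stream a trace file line by line. */
void replay_trace(const char *filename)
{
    FILE *fp = fopen(filename, "r");
    if (!fp) { perror(filename); return; }
    char line[256];
    mem_access_t acc;
    while (fgets(line, sizeof line, fp)) {
        if (!trace_parse_line(line, &acc))
            continue;
        /* cache_access(cache, acc.addr) would go here;
           for 'M', access twice: load, then store. */
    }
    fclose(fp);
}
```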
Phase 6: Statistics and Verbose Mode (Days 13-14)
Goals:
- Output matches reference simulator format
- Verbose mode shows per-access results
Tasks:
- Implement summary output: `hits:X misses:Y evictions:Z`
- Implement verbose output for each access
- Compare against reference output
- Fix any discrepancies
Verbose Output Format:
void print_access_verbose(mem_access_t *access, access_result_t *result) {
printf("%c %lx,%d ", access->op, access->addr, access->size);
if (result->hit) {
printf("hit ");
} else {
printf("miss ");
if (result->eviction) {
printf("eviction ");
}
}
printf("\n");
}
Checkpoint: Output matches reference for all provided traces.
Phase 7: Miss Classification (Days 15-17)
Goals:
- Classify each miss as compulsory, capacity, or conflict
- Track block history for classification
Tasks:
- Track all unique blocks ever accessed (for compulsory detection)
- Simulate a fully-associative cache of same size (for conflict detection)
- Classify: compulsory if first access; conflict if FA would hit; else capacity
Implementation Strategy:
// Maintain a fully-associative "oracle" cache for classification
typedef struct {
unsigned long *blocks; // Array of block addresses
int capacity; // Max blocks (same as main cache)
int count; // Current count
int lru_clock;
} oracle_cache_t;
char classify_miss(cache_t *cache, unsigned long block_addr,
oracle_cache_t *oracle, hash_set_t *seen_blocks) {
// Check if block was ever seen before
if (!hash_set_contains(seen_blocks, block_addr)) {
hash_set_add(seen_blocks, block_addr);
return 'C'; // Compulsory
}
// Check if fully-associative cache would have hit
if (oracle_access(oracle, block_addr)) {
return 'O'; // Conflict (would have hit with full associativity)
}
return 'A'; // Capacity (even FA cache missed)
}
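A sketch of `oracle_access` itself. Note one assumption: it adds a `last_used` array, which the struct shown above omits, so the oracle can evict its own LRU block when full:

```c
/* Fully-associative "oracle" cache for miss classification.
   Extends the struct above with a last_used array (an assumption;
   some recency store is needed for LRU eviction). */
typedef struct {
    unsigned long *blocks;   /* resident block addresses          */
    int *last_used;          /* LRU timestamps, parallel to blocks */
    int capacity;            /* total lines in the real cache (S*E) */
    int count;
    int lru_clock;
} oracle_cache_t;

/* Returns 1 if block_addr hits; on a miss, installs the block,
   evicting the least-recently-used entry if the oracle is full. */
int oracle_access(oracle_cache_t *o, unsigned long block_addr)
{
    o->lru_clock++;
    for (int i = 0; i < o->count; i++) {
        if (o->blocks[i] == block_addr) {       /* hit */
            o->last_used[i] = o->lru_clock;
            return 1;
        }
    }
    if (o->count < o->capacity) {               /* room left: fill */
        o->blocks[o->count] = block_addr;
        o->last_used[o->count] = o->lru_clock;
        o->count++;
        return 0;
    }
    int victim = 0;                             /* full: evict LRU */
    for (int i = 1; i < o->capacity; i++)
        if (o->last_used[i] < o->last_used[victim])
            victim = i;
    o->blocks[victim] = block_addr;
    o->last_used[victim] = o->lru_clock;
    return 0;
}
```

The linear scans are O(capacity) per access; for large caches, replace them with a hash map keyed by block address.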
Checkpoint: Miss classification matches expected distribution for test traces.
Phase 8: ASCII Visualizer (Days 18-21)
Goals:
- Display cache state visually
- Show access patterns and miss types
Tasks:
- Design ASCII cache representation
- Implement `visualize_state(cache, access, result)`
- Add pattern detection (stride, sequential)
- Implement running statistics display
Visualization Implementation:
void visualize_state(cache_t *cache, int access_num,
mem_access_t *access, access_result_t *result) {
printf("\nAccess #%d: %c 0x%lx [%s%s]\n",
access_num, access->op, access->addr,
result->hit ? "HIT" : "MISS",
result->hit ? "" : get_miss_type_str(result->miss_type)); /* e.g. " (compulsory)" */
printf("┌");
for (int i = 0; i < 40; i++) printf("─");
printf("┐\n");
for (int s = 0; s < cache->num_sets; s++) {
printf("│ Set %d: ", s);
for (int l = 0; l < cache->lines_per_set; l++) {
cache_line_t *line = &cache->sets[s].lines[l];
if (line->valid) {
printf("[%04lx] ", line->tag & 0xFFFF);
} else {
printf("[ ] ");
}
}
if (s == result->set_index) {
printf(" ← %s", result->hit ? "HIT" : "accessed");
}
printf("\n");
}
printf("└");
for (int i = 0; i < 40; i++) printf("─");
printf("┘\n");
float hit_rate = (float)cache->hits / (cache->hits + cache->misses) * 100;
printf("Running: hits=%d misses=%d (%.1f%% hit rate)\n",
cache->hits, cache->misses, hit_rate);
}
Checkpoint: Visualizer shows correct cache state and highlights accessed sets.
Testing Strategy
Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Test individual components | Address decomposition, LRU logic |
| Trace Tests | Validate against known traces | Match reference simulator output |
| Edge Cases | Handle unusual inputs | Empty trace, single access, full cache |
| Stress Tests | Performance and correctness at scale | Million-access traces |
Test Traces
Create traces for specific scenarios:
trace1_sequential.trace - Sequential access, good spatial locality:
L 0,4
L 4,4
L 8,4
L c,4
L 10,4
L 14,4
trace2_strided.trace - Strided access, poor spatial locality:
L 0,4
L 40,4
L 80,4
L c0,4
L 100,4
trace3_conflict.trace - Designed to cause conflict misses:
# For direct-mapped cache with 4 sets (s=2), 16-byte blocks (b=4)
# These all map to set 0
L 0,4 # Set 0, tag 0
L 40,4 # Set 0, tag 1 (evicts previous)
L 0,4 # Set 0, tag 0 (conflict miss!)
L 40,4 # Set 0, tag 1 (conflict miss!)
trace4_reuse.trace - Temporal locality test:
L 0,4
L 10,4
L 0,4 # Should be hit
L 10,4 # Should be hit
L 0,4 # Should be hit
Reference Validator
#!/bin/bash
# validate.sh - Compare against reference simulator
TRACES="yi dave trans simple"
CONFIGS="1-1-1 4-2-4 2-4-3" # s-E-b configurations
for trace in $TRACES; do
for config in $CONFIGS; do
s=$(echo $config | cut -d- -f1)
E=$(echo $config | cut -d- -f2)
b=$(echo $config | cut -d- -f3)
expected=$(./csim-ref -s $s -E $E -b $b -t traces/${trace}.trace)
actual=$(./csim -s $s -E $E -b $b -t traces/${trace}.trace)
if [ "$expected" == "$actual" ]; then
echo "PASS: $trace with s=$s E=$E b=$b"
else
echo "FAIL: $trace with s=$s E=$E b=$b"
echo " Expected: $expected"
echo " Actual: $actual"
fi
done
done
Critical Test Cases
- Empty cache hit detection: First access is always a miss
- LRU correctness: Verify least-recently-used block is evicted
- Set isolation: Accesses to different sets don’t affect each other
- Block boundary: Adjacent addresses in same block share cache line
- Tag uniqueness: Same index, different tags are different blocks
- Modify operation: ‘M’ counts as one potential miss but two potential hits
Common Pitfalls
Implementation Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Off-by-one in bit extraction | Wrong set/tag | Use (1 << n) - 1 for n-bit masks |
| Forgetting to update LRU on hit | Wrong eviction | Always update LRU counter on any access |
| Not handling ‘M’ operations | Miscount hits | ‘M’ = load + store; can hit twice |
| Signed vs unsigned addresses | Negative set indices | Use unsigned long for addresses |
| Wrong mask calculation | Tag/index overlap | Verify: tag_bits + index_bits + offset_bits = address_bits |
| Evicting wrong line | Incorrect cache state | Verify LRU logic with small examples |
Debugging Strategies
// Debug mode: print every decomposition
#ifdef DEBUG
printf("addr=0x%lx -> tag=0x%lx set=%lu offset=%lu\n",
addr, tag, set_index, offset);
#endif
// Sanity checks
assert(set_index < cache->num_sets);
assert(offset < cache->block_size);
assert(tag < (1UL << cache->t));
Performance Traps
- Large traces: Don’t store all accesses in memory; process streaming
- Hash table for seen blocks: Don’t use linear search for millions of blocks
- Visualization overhead: Only visualize in debug mode or with sampling
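For the "hash table for seen blocks" trap above, even a minimal open-addressing set beats linear search. A sketch with deliberate simplifications (fixed power-of-two capacity, linear probing, no deletion; a real implementation should grow the table as it fills):

```c
#include <stdlib.h>

typedef struct {
    unsigned long *keys;
    char *occupied;
    unsigned long mask;      /* capacity - 1 */
} hash_set_t;

hash_set_t *hash_set_new(unsigned long capacity_pow2)
{
    hash_set_t *h = malloc(sizeof(*h));
    h->keys = calloc(capacity_pow2, sizeof(unsigned long));
    h->occupied = calloc(capacity_pow2, 1);
    h->mask = capacity_pow2 - 1;
    return h;
}

/* Mix the bits so nearby block addresses spread across buckets. */
static unsigned long hash_mix(unsigned long k)
{
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdUL;
    k ^= k >> 33;
    return k;
}

int hash_set_contains(hash_set_t *h, unsigned long key)
{
    unsigned long i = hash_mix(key) & h->mask;
    while (h->occupied[i]) {
        if (h->keys[i] == key) return 1;
        i = (i + 1) & h->mask;   /* linear probing */
    }
    return 0;
}

void hash_set_add(hash_set_t *h, unsigned long key)
{
    unsigned long i = hash_mix(key) & h->mask;
    while (h->occupied[i]) {
        if (h->keys[i] == key) return;   /* already present */
        i = (i + 1) & h->mask;
    }
    h->occupied[i] = 1;
    h->keys[i] = key;
}
```

These are the `hash_set_contains`/`hash_set_add` calls the miss-classification sketch assumes.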
Extensions
Beginner Extensions
- FIFO replacement policy: Simpler than LRU
- Random replacement: Even simpler, good baseline
- Write operations: Track stores separately from loads
- Command-line help: Proper usage message with `-h`
Intermediate Extensions
- Multi-level cache: Simulate L1 + L2; misses in L1 go to L2
- Inclusive vs exclusive: Different multi-level policies
- Write-back simulation: Track dirty bits, count writebacks
- Trace generation: Create traces from actual program runs
- Block-level statistics: Per-block hit/miss counts
Advanced Extensions
- Prefetching simulation: Next-line, stride, or correlation prefetch
- TLB simulation: Combine with page table simulation
- Cache coherence: Simulate MESI protocol for multi-core
- Real code optimization: Use simulator to guide actual code changes
- GUI visualizer: Web-based or ncurses interface
Multi-Level Cache Example
┌─────────────────────────────────────────────────────────────────────────────┐
│ MULTI-LEVEL CACHE SIMULATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Typical modern hierarchy: │
│ │
│ ┌─────────────┐ │
│ │ CPU │ │
│ └──────┬──────┘ │
│ ▼ │
│ ┌─────────────┐ ~1ns, 32KB, 8-way │
│ │ L1 Cache │──────────────────────────┐ │
│ └──────┬──────┘ │ L1 miss │
│ ▼ ▼ │
│ ┌─────────────┐ ~4ns, 256KB, 8-way │
│ │ L2 Cache │──────────────────────────┐ │
│ └──────┬──────┘ │ L2 miss │
│ ▼ ▼ │
│ ┌─────────────┐ ~20ns, 8MB, 16-way │
│ │ L3 Cache │──────────────────────────┐ │
│ └──────┬──────┘ │ L3 miss │
│ ▼ ▼ │
│ ┌─────────────┐ ~100ns │
│ │ DRAM │ │
│ └─────────────┘ │
│ │
│ Simulation approach: │
│ ────────────────── │
│ 1. Try L1: if hit, done │
│ 2. If L1 miss, try L2: if hit, install in L1 │
│ 3. If L2 miss, try L3: if hit, install in L1 and L2 │
│ 4. If L3 miss, fetch from memory, install everywhere │
│ │
│ // Pseudocode │
│ result = access_cache(L1, addr); │
│ if (!result.hit) { │
│ result = access_cache(L2, addr); │
│ if (!result.hit) { │
│ result = access_cache(L3, addr); │
│ if (!result.hit) { │
│ // Fetch from memory │
│ } │
│ install_block(L2, addr); │
│ } │
│ install_block(L1, addr); │
│ } │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
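The per-level latencies above plug directly into the average memory access time formula, AMAT = hit_time + miss_rate x miss_penalty, applied recursively down the hierarchy. A sketch for two levels (the figures in the test are illustrative, not measured):

```c
/* AMAT for a two-level hierarchy. Latencies in cycles; miss rates
   in [0,1]. The L2 term is itself the miss penalty seen by L1. */
double amat_two_level(double l1_hit, double l1_miss_rate,
                      double l2_hit, double l2_miss_rate,
                      double mem_latency)
{
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_latency);
}
```

Extending to three levels just nests one more term; a multi-level simulator can report AMAT directly from its per-level hit/miss counters.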
Real-World Connections
Performance Engineering
Your cache simulator skills directly apply to:
- Database systems: Query planners consider cache behavior
- Game engines: Data-oriented design for cache efficiency
- Scientific computing: Blocking for matrix operations
- Embedded systems: Understanding tight memory constraints
Tools That Use These Concepts
| Tool | Purpose | Cache Relevance |
|---|---|---|
| Valgrind/Cachegrind | Cache profiling | Uses simulation like yours |
| perf | Hardware counters | Measures real cache events |
| Intel VTune | Performance analysis | Cache miss attribution |
| Compiler (-O3) | Optimization | Loop blocking, prefetching |
Interview Applications
Common questions your project prepares you for:
- “Why is row-major traversal faster than column-major for matrices?”
- “How would you optimize a function with poor cache behavior?”
- “Explain the tradeoffs in cache associativity.”
- “What’s the difference between L1 and L3 cache?”
- “How do you measure cache performance?”
Modern CPU Cache Parameters
For reference, typical modern CPU cache configurations:
| Level | Size | Associativity | Block Size | Latency |
|---|---|---|---|---|
| L1D | 32-48 KB | 8-12 way | 64 bytes | 4-5 cycles |
| L1I | 32 KB | 8 way | 64 bytes | 4-5 cycles |
| L2 | 256-512 KB | 4-8 way | 64 bytes | 12-14 cycles |
| L3 | 8-64 MB | 12-16 way | 64 bytes | 40-75 cycles |
Resources
Essential Reading
- CS:APP Chapter 6: “The Memory Hierarchy” - Core material
- CS:APP Chapter 5: “Optimizing Program Performance” - Measurement discipline
- “What Every Programmer Should Know About Memory” by Ulrich Drepper - Deep dive
Tools
- Valgrind/Cachegrind: `valgrind --tool=cachegrind ./program`
- perf: `perf stat -e cache-references,cache-misses ./program`
- PAPI: Performance API for hardware counters
Trace Generation
# Generate trace with Valgrind lackey
valgrind --tool=lackey --trace-mem=yes ./your_program 2>&1 | \
  grep -E '^ [LSM]' > trace.trace   # keep loads, stores, AND modifies
Related Projects
- Previous: P8 (Performance Clinic) for measurement methodology
- Next: P13 (VM Map Visualizer) extends memory hierarchy understanding
- Related: P14 (Malloc) uses locality principles
Online Resources
- Cache Simulator Visualization
- Interactive Cache Demo
- CS:APP Cache Lab materials at CMU
Self-Assessment Checklist
Understanding
- I can explain address decomposition (tag/index/offset) without notes
- I can draw a set-associative cache and trace an access through it
- I understand why spatial locality works (blocks, not bytes)
- I can classify a miss as compulsory, capacity, or conflict
- I can explain LRU replacement step-by-step
- I understand why blocking improves matrix multiply performance
Implementation
- Simulator correctly handles all trace operations (L, S, M)
- Hit/miss/eviction counts match reference for all test traces
- LRU replacement is implemented correctly
- Address decomposition is verified with manual calculations
- Visualizer shows clear cache state representation
Application
- I can predict relative cache performance of two code variants
- I used the simulator to measure a real optimization’s effect
- I can explain a miss-rate reduction in terms of locality principles
- I can choose appropriate cache parameters for a simulation
Growth
- I understand how to extend to multi-level caches
- I know how real tools (Cachegrind, perf) relate to my simulator
- I can apply cache knowledge to optimize real programs
Completion Criteria
Minimum Viable Completion
- Direct-mapped and set-associative simulation works
- Address decomposition is correct
- Hit/miss/eviction counting matches reference
- Processes provided trace files
- Basic text output works
Full Completion
- LRU replacement is correct
- All trace formats handled (L, S, M)
- Miss classification (compulsory/capacity/conflict)
- Verbose mode shows per-access details
- ASCII visualizer displays cache state
- All reference traces match expected output
Excellence (Going Above & Beyond)
- Multi-level cache simulation
- Multiple replacement policies (LRU, FIFO, Random)
- Prefetching simulation
- Real program optimization demonstrated with simulator
- Interactive visualization or web interface
The Core Question You’re Answering
“How does the memory hierarchy create the illusion of a large, fast memory, and how can I write code that exploits this illusion?”
This project forces you to deeply understand that CPU performance is dominated by memory access patterns. A cache miss can cost 100+ CPU cycles while a hit costs only 1-4 cycles. By building a simulator, you internalize how your data layout and access patterns directly translate to cache behavior - knowledge that separates average programmers from performance-conscious systems engineers.
Concepts You Must Understand First
Before starting this project, ensure you have a solid grasp of these foundational concepts:
| Concept | Where to Learn | Why It Matters |
|---|---|---|
| Binary/Hexadecimal representation | CS:APP 2.1 | Address decomposition requires bit manipulation |
| Bitwise operations (AND, OR, shifts) | CS:APP 2.1.4 | Extracting tag/index/offset from addresses |
| Memory addressing | CS:APP 6.1 | Understanding how addresses map to physical memory |
| Locality of reference | CS:APP 6.2 | Why caches work - temporal and spatial locality |
| C pointers and dynamic allocation | CS:APP 2.1, K&R Ch. 5 | Building cache data structures |
| Structs and arrays | K&R Ch. 6 | Representing cache organization |
| File I/O in C | K&R Ch. 7 | Parsing trace files |
| Command-line argument parsing | K&R 5.10 | Handling -s, -E, -b flags |
Questions to Guide Your Design
Work through these questions before writing any code:
1. Address Decomposition: Given a 64-bit address, s=4, and b=6, write out the formula to extract tag, set index, and block offset. Try it with address 0x7FF000A3C8.
2. Data Structure Choice: How will you represent a cache line? A set? The entire cache? Why might a 2D array work for some configurations but not others?
3. LRU Tracking: How will you track access recency for LRU replacement? Counter per line? Linked list? What's the time complexity of each approach?
4. Miss Classification: To classify a miss as conflict vs capacity, you need to know if a fully-associative cache of the same size would have hit. How will you simulate this efficiently?
5. Modify Operations: A trace line `M 7ff000a3c8,4` represents a load followed by a store. How do you handle this? Can both the load and store be hits?
6. Block Boundaries: If a load accesses bytes 60-67, straddling a 64-byte block boundary, what happens? Do you need to track byte-level access or just block-level?
Thinking Exercise
Before coding, hand-trace this sequence through a cache with s=1 (2 sets), E=2 (2-way), b=2 (4-byte blocks):
L 0x00 # Load from address 0
L 0x04 # Load from address 4
L 0x08 # Load from address 8
L 0x00 # Load from address 0 again
L 0x10 # Load from address 16
L 0x00 # Load from address 0 again
For each access, determine:
- Which set is accessed?
- Is it a hit or miss?
- If miss, is it compulsory, capacity, or conflict?
- Which line is evicted (if any)?
- What is the final cache state?
Show your work before looking at the solution. Draw the cache state after each access.
Solution (click to expand)
Address decomposition for this cache:
- Block offset: bits [1:0] (2 bits)
- Set index: bit [2] (1 bit, so 2 sets)
- Tag: bits [63:3]
| Access | Address | Set | Tag | Result | Cache State After |
|---|---|---|---|---|---|
| L 0x00 | 0b000 | 0 | 0 | MISS (compulsory) | Set0: [tag=0, _] |
| L 0x04 | 0b100 | 1 | 0 | MISS (compulsory) | Set0: [tag=0, _] Set1: [tag=0, _] |
| L 0x08 | 0b1000 | 0 | 1 | MISS (compulsory) | Set0: [tag=0, tag=1] Set1: [tag=0, _] |
| L 0x00 | 0b000 | 0 | 0 | HIT | Set0: [tag=0, tag=1] |
| L 0x10 | 0b10000 | 0 | 2 | MISS (compulsory) | Set0: [tag=0, tag=2] (evicted tag=1) |
| L 0x00 | 0b000 | 0 | 0 | HIT | Set0: [tag=0, tag=2] |
Final: hits=2, misses=4, evictions=1
Hints in Layers
Use these hints progressively if you get stuck. Try to solve problems yourself before advancing to the next hint level.
Hint Layer 1: Getting Started
- Start with address decomposition. Write a function that takes an address and prints tag/index/offset. Test it extensively before moving on.
- For cache state, a 2D array `cache_line_t cache[S][E]` works well for small caches.
- Parse traces with `fscanf(fp, " %c %lx,%d", &op, &addr, &size)`.
Hint Layer 2: Core Logic
- Hit detection: For each line in `set[index]`, check `valid && (tag == line.tag)`.
- For LRU, a simple counter works: increment a global counter on each access and store it in the accessed line. The victim is the line with the smallest counter.
- Handle ‘M’ as two separate accesses: a load (which may miss) followed by a store (which always hits if the load brought in the block).
Hint Layer 3: Miss Classification
- Maintain a hash set of all block addresses ever seen. First access to a block is compulsory.
- Simulate a fully-associative cache of the same total size in parallel. If the FA cache hits but your cache misses, it’s a conflict miss.
- Everything else is a capacity miss.
Hint Layer 4: Common Bugs
- Off-by-one in bit masks: use `(1UL << n) - 1`, not `(1 << n) - 1` (the `UL` matters for 64-bit).
- Forgetting to update the LRU counter on hits (not just misses).
- Using `int` instead of `unsigned long` for addresses (signed arithmetic bugs).
- Not handling the case where a set has empty lines (no eviction needed).
The Interview Questions They’ll Ask
After completing this project, you should be able to confidently answer these questions:
- “Explain the difference between direct-mapped, set-associative, and fully-associative caches. What are the tradeoffs?”
- Expect follow-ups about hardware cost, hit latency, and miss rate
- “Why does row-major traversal of a matrix outperform column-major traversal in C?”
- Must connect to cache line size and spatial locality
- “What is cache thrashing and how would you detect and fix it in code?”
- Discuss conflict misses, power-of-2 strides, and padding
- “Walk me through exactly what happens when the CPU accesses memory at address X.”
- Cover TLB, address decomposition, tag comparison, miss handling
- “How does blocking (loop tiling) improve matrix multiplication performance?”
- Quantify: 3 blocks must fit in cache, calculate optimal block size
- “You have a program with 90% cache miss rate. How do you diagnose and fix it?”
- Discuss Cachegrind, perf, miss classification, data layout changes
Books That Will Help
| Topic | Book | Specific Chapters |
|---|---|---|
| Cache organization fundamentals | CS:APP (3rd ed.) | Chapter 6.1-6.4 |
| Locality and optimization | CS:APP (3rd ed.) | Chapter 6.5-6.6 |
| Performance measurement | CS:APP (3rd ed.) | Chapter 5.1-5.7 |
| Memory hierarchy design | Patterson & Hennessy: Computer Architecture | Chapter 5 |
| Advanced caching | Ulrich Drepper: “What Every Programmer Should Know About Memory” | Sections 3.1-3.3 |
| C bit manipulation | K&R: The C Programming Language | Chapter 2.9 |
| File I/O in C | K&R: The C Programming Language | Chapter 7.5-7.7 |
| Cache-oblivious algorithms | Introduction to Algorithms (CLRS) | Chapter 27 |
This guide was expanded from CSAPP_3E_DEEP_LEARNING_PROJECTS.md. For the complete learning path, see the project index.