Project 8: Performance Clinic

Build a benchmark suite of small kernels plus a written optimization report explaining changes in terms of ILP, branch prediction, and locality.


Quick Reference

Attribute        Value
──────────────────────────────────────────────────────────────────────────────
Difficulty       Advanced
Time Estimate    1-2 weeks
Language         C (Alternatives: Rust, Zig, C++)
Prerequisites    Project 1 (Toolchain Explorer)
Key Topics       Loop optimization, ILP, branch prediction, memory hierarchy, profiling
CS:APP Chapters  5 (primary), 6, 1

1. Learning Objectives

By completing this project, you will:

  1. Measure performance reliably: Obtain stable, repeatable timing measurements that resist noise and variance
  2. Understand CPU execution pipelines: Reason about instruction-level parallelism, functional units, and throughput vs latency
  3. Master loop transformations: Apply unrolling, reassociation, and accumulator parallelism systematically
  4. Predict branch behavior: Understand branch prediction mechanics and identify misprediction-prone patterns
  5. Reason about memory hierarchy: Connect cache behavior (Chapter 6) to observed performance
  6. Apply Amdahl's Law: Quantify optimization limits and prioritize improvement efforts
  7. Avoid "faster by accident": Distinguish genuine improvements from measurement artifacts

2. Deep Theoretical Foundation

2.1 Amdahl's Law: The Speed Limit of Optimization

Every optimization has limits. Gene Amdahl formalized this in 1967:

                    1
Speedup(S, p) = โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
                (1 - p) + (p / S)

Where:
- p = fraction of execution time in the optimizable portion
- S = speedup of that portion
- (1 - p) = fraction that cannot be optimized

Example: Your program spends 80% of time in one function. You optimize that function to run 10x faster.

p = 0.80 (80% in optimizable portion)
S = 10   (10x speedup)

Speedup = 1 / (0.20 + 0.80/10)
        = 1 / (0.20 + 0.08)
        = 1 / 0.28
        = 3.57x overall speedup

Even with infinite speedup (S = infinity):
Speedup_max = 1 / (1 - 0.80) = 1 / 0.20 = 5x maximum

You can never exceed 5x no matter how fast you make that 80%!

Key Implications:

  1. Profile before optimizing: Find the p-dominant portions
  2. The sequential portion is the ceiling: Reducing (1-p) may matter more than increasing S
  3. Diminishing returns: Each additional speedup of the hot portion yields less overall benefit
  4. Always calculate the bound: Before spending a week optimizing, compute the maximum possible gain

Amdahl's Law Visualization:

Overall speedup when portion p is sped up by factor S:

p = 50%:    S=2  -> 1.33x    S=10 -> 1.82x    S=inf -> 2.00x
p = 80%:    S=2  -> 1.67x    S=10 -> 3.57x    S=inf -> 5.00x
p = 90%:    S=2  -> 1.82x    S=10 -> 5.26x    S=inf -> 10.00x
p = 95%:    S=2  -> 1.90x    S=10 -> 6.90x    S=inf -> 20.00x
p = 99%:    S=2  -> 1.98x    S=10 -> 9.17x    S=inf -> 100.00x

The larger p is, the more room for improvement.
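
The table above can be regenerated in a few lines of C; a minimal sketch:

#include <stdio.h>

/* Overall speedup when fraction p of runtime is sped up by factor s. */
static double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void) {
    const double fractions[] = {0.50, 0.80, 0.90, 0.95, 0.99};
    for (int i = 0; i < 5; i++) {
        double p = fractions[i];
        printf("p = %2.0f%%:  S=2 -> %.2fx   S=10 -> %.2fx   S=inf -> %.2fx\n",
               p * 100.0, amdahl(p, 2), amdahl(p, 10), 1.0 / (1.0 - p));
    }
    return 0;
}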

2.2 Understanding Modern CPU Pipelines

Modern processors execute instructions through a multi-stage pipeline:

Classic 5-Stage Pipeline:

         Cycle:   1    2    3    4    5    6    7    8
Inst 1:        [IF] [ID] [EX] [MEM][WB]
Inst 2:             [IF] [ID] [EX] [MEM][WB]
Inst 3:                  [IF] [ID] [EX] [MEM][WB]
Inst 4:                       [IF] [ID] [EX] [MEM][WB]

IF  = Instruction Fetch
ID  = Instruction Decode
EX  = Execute (ALU operation)
MEM = Memory access
WB  = Write Back (to register)

Ideally: One instruction completes per cycle (throughput = 1 IPC)

Modern Superscalar Processors:

Real processors are far more complex:

Intel Skylake-class Core (simplified):

     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
     โ”‚                        Front End                               โ”‚
     โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”        โ”‚
     โ”‚  โ”‚   Fetch     โ”‚โ”€โ”€โ”€โ–ถโ”‚   Decode    โ”‚โ”€โ”€โ”€โ–ถโ”‚   Rename/   โ”‚        โ”‚
     โ”‚  โ”‚ (16B/cycle) โ”‚    โ”‚ (4 inst/cyc)โ”‚    โ”‚   Allocate  โ”‚        โ”‚
     โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜        โ”‚
     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                    โ”‚
                                    โ–ผ
     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
     โ”‚              Reservation Station (Scheduler)                   โ”‚
     โ”‚                    ~100 entries                                โ”‚
     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                    โ”‚
       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
       โ”‚            โ”‚               โ”‚               โ”‚            โ”‚
       โ–ผ            โ–ผ               โ–ผ               โ–ผ            โ–ผ
   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
   โ”‚ INT 0 โ”‚   โ”‚ INT 1 โ”‚       โ”‚ INT 2 โ”‚       โ”‚ FP 0  โ”‚   โ”‚ FP 1  โ”‚
   โ”‚ ALU   โ”‚   โ”‚ ALU   โ”‚       โ”‚ ALU   โ”‚       โ”‚ FMA   โ”‚   โ”‚ FMA   โ”‚
   โ”‚ BRANCHโ”‚   โ”‚ BRANCHโ”‚       โ”‚ MUL   โ”‚       โ”‚ ADD   โ”‚   โ”‚ ADD   โ”‚
   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
       โ”‚            โ”‚               โ”‚               โ”‚            โ”‚
       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                    โ”‚
                                    โ–ผ
     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
     โ”‚                    Reorder Buffer                              โ”‚
     โ”‚               (preserves sequential semantics)                 โ”‚
     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key insight: Multiple instructions execute simultaneously across different functional units. Your code should provide enough independent work to keep these units busy.


2.3 Instruction-Level Parallelism (ILP)

ILP is the amount of independent work available at the instruction level.

Data Dependencies Limit ILP:

// Sequential chain - NO parallelism
// Each instruction depends on the previous result
x = a + b;      // cycle 1-4 (assume 4-cycle latency)
y = x + c;      // cycle 5-8 (must wait for x)
z = y + d;      // cycle 9-12 (must wait for y)
// Total: 12 cycles for 3 operations

// Independent operations - FULL parallelism
// No dependencies between operations
x = a + b;      // cycles 1-4 โ”€โ”ฌโ”€ all execute
y = c + d;      // cycles 1-4 โ”€โ”ค  simultaneously
z = e + f;      // cycles 1-4 โ”€โ”˜
// Total: 4 cycles for 3 operations (3x speedup)

Dependency Types:

Type  Also Called        Description                           Can Be Eliminated?
──────────────────────────────────────────────────────────────────────────────────
RAW   True Dependency    Read After Write:  a = 1; b = a;      No (fundamental)
WAR   Anti-Dependency    Write After Read:  a = b; b = 2;      Yes (renaming)
WAW   Output Dependency  Write After Write: a = 1; a = 2;      Yes (renaming)

The processor's register renaming eliminates WAR and WAW dependencies, but RAW dependencies are fundamental and limit ILP.


2.4 Latency vs Throughput

Critical distinction for understanding performance:

Latency: Time for a single operation to complete (cycles)

Throughput: Rate at which operations can be initiated (operations per cycle)

Example: Floating-point multiplication on modern Intel

Latency:    4 cycles (one multiply takes 4 cycles start to finish)
Throughput: 0.5 cycles (can start a new multiply every 0.5 cycles)

This means the multiplier is pipelined:

Cycle:     1     2     3     4     5     6     7     8
Mult 1:  [S1]  [S2]  [S3]  [S4] done
Mult 2:        [S1]  [S2]  [S3]  [S4] done
Mult 3:              [S1]  [S2]  [S3]  [S4] done
Mult 4:                    [S1]  [S2]  [S3]  [S4] done

With 8 independent multiplies, throughput bound is:
  8 operations / (8 * 0.5 cycles) = 8 / 4 = 2 ops/cycle

But with a dependency chain:
  8 operations / (8 * 4 cycles) = 8 / 32 = 0.25 ops/cycle

  That's 8x slower!

The Performance Equation:

Time = max(Latency_bound, Throughput_bound)

Latency_bound:    Time = n * Latency
                  (serial chain of n operations)

Throughput_bound: Time = n * Issue_interval
                  (n independent operations; the issue interval is the
                   reciprocal throughput, e.g. 0.5 cycles above)

Goal: Make your code throughput-bound, not latency-bound!
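
You can see this gap on your own machine. A minimal sketch, assuming POSIX clock_gettime; compile with -O2 but not -ffast-math so the compiler preserves the chains. Both loops do N multiplies, but the second spreads them across two independent chains; the constants are arbitrary values near 1.0 chosen to avoid overflow:

#include <stdio.h>
#include <time.h>

#define N 100000000L

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    volatile double sink;                 /* defeats dead-code elimination */
    double a = 1.0000001, b = 0.9999999;

    /* One dependent chain: every multiply waits on the previous result. */
    double t0 = now_sec();
    double x = 1.0;
    for (long i = 0; i < N; i++) x *= a;
    double t_dep = now_sec() - t0;
    sink = x;

    /* Two independent chains: the pipelined multiplier overlaps them. */
    t0 = now_sec();
    double y0 = 1.0, y1 = 1.0;
    for (long i = 0; i < N / 2; i++) { y0 *= a; y1 *= b; }
    double t_indep = now_sec() - t0;
    sink = y0 * y1;
    (void)sink;

    printf("dependent chain:    %.3f s\n", t_dep);
    printf("independent chains: %.3f s\n", t_indep);
    return 0;
}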

2.5 CPE: Cycles Per Element

The CPE metric from CS:APP measures efficiency for array/loop computations:

CPE = (Total cycles to process array) / (Number of elements)

Example: Sum an array of 1000 elements

Version A: 4000 cycles  -> CPE = 4.0
Version B: 1500 cycles  -> CPE = 1.5
Version C: 1000 cycles  -> CPE = 1.0

Lower CPE = faster code

Theoretical CPE Limits:

For a loop that performs one floating-point add per element:

for (int i = 0; i < n; i++) {
    sum += data[i];  // one FP add per iteration
}
Bottleneck                               Bound       Reason
─────────────────────────────────────────────────────────────────────────────
Latency (3-cycle add, dependency chain)  CPE >= 3.0  Each add waits for previous
Throughput (1 add/cycle possible)        CPE >= 1.0  Adder can do 1 per cycle
Memory (assuming cached)                 CPE >= 0.5  2 loads/cycle possible

If your measured CPE is 3.0, you're latency-bound. Optimization means breaking the dependency chain.
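
A minimal single-shot CPE measurement sketch (x86-only, using the __rdtsc intrinsic introduced in section 2.12; real runs should repeat trials as described later):

#include <x86intrin.h>   /* __rdtsc */

/* Cycles per element for a kernel over an n-element array. */
double measure_cpe(double (*kernel)(double *, size_t),
                   double *data, size_t n) {
    unsigned long long start = __rdtsc();
    volatile double sink = kernel(data, n);   /* keep the call alive */
    unsigned long long cycles = __rdtsc() - start;
    (void)sink;
    return (double)cycles / (double)n;
}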


2.6 Loop Unrolling

Loop unrolling reduces loop overhead and exposes more work per iteration:

// Original loop
for (int i = 0; i < n; i++) {
    sum += data[i];
}

// 2x unrolled
for (int i = 0; i < n; i += 2) {
    sum += data[i];
    sum += data[i+1];
}

// 4x unrolled
for (int i = 0; i < n; i += 4) {
    sum += data[i];
    sum += data[i+1];
    sum += data[i+2];
    sum += data[i+3];
}

Benefits:

  1. Fewer loop counter increments and comparisons
  2. More work visible to the out-of-order engine
  3. Better instruction cache utilization (fewer branches)

But still latency-bound! All adds still chain through sum. Unrolling alone may not help.


2.7 Reassociation and Multiple Accumulators

The key to breaking latency bounds is parallelizing the accumulation:

// Original: all adds chain through one accumulator
// CPE ~= latency of addition (3-4 cycles)
double sum = 0;
for (int i = 0; i < n; i++) {
    sum += data[i];  // sum = sum + data[i]
}

// 2x parallelism: two independent accumulator chains
// CPE ~= latency / 2
double sum0 = 0, sum1 = 0;
for (int i = 0; i < n; i += 2) {
    sum0 += data[i];      // chain A
    sum1 += data[i+1];    // chain B (independent!)
}
double sum = sum0 + sum1;

// 4x parallelism: four independent chains
// CPE ~= latency / 4
double sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
for (int i = 0; i < n; i += 4) {
    sum0 += data[i];
    sum1 += data[i+1];
    sum2 += data[i+2];
    sum3 += data[i+3];
}
double sum = (sum0 + sum1) + (sum2 + sum3);

Visualization:

Original (serial):

data[0] โ”€โ”€addโ”€โ”€โ–ถ sum โ”€โ”€addโ”€โ”€โ–ถ sum โ”€โ”€addโ”€โ”€โ–ถ sum โ”€โ”€addโ”€โ”€โ–ถ ...
              โ”‚ data[1]   โ”‚ data[2]   โ”‚ data[3]
              wait        wait        wait

Cycle: 0โ”€โ”€โ”€3โ”€โ”€โ”€6โ”€โ”€โ”€9โ”€โ”€โ”€12โ”€โ”€โ”€...  (CPE = 3.0)


4x Parallel:

data[0] โ”€โ”€addโ”€โ”€โ–ถ sum0
data[1] โ”€โ”€addโ”€โ”€โ–ถ sum1      All four chains run
data[2] โ”€โ”€addโ”€โ”€โ–ถ sum2      simultaneously!
data[3] โ”€โ”€addโ”€โ”€โ–ถ sum3
         โ”‚
data[4] โ”€โ”€addโ”€โ”€โ–ถ sum0      Next iteration starts
...                        after 3 cycles, not 12

Cycle: 0โ”€โ”€โ”€3โ”€โ”€โ”€6โ”€โ”€โ”€9โ”€โ”€โ”€...  (CPE = 0.75)

2.8 Reassociation Transformations

Reassociation changes the order of operations to expose parallelism:

// Original: Left-to-right evaluation
// acc = ((acc * x[i]) * x[i+1]) * x[i+2]
// Each multiply depends on the previous

for (int i = 0; i < n; i += 2) {
    acc = (acc * data[i]) * data[i+1];  // Serial!
}

// Reassociated: Pair independent operations
// acc = acc * (x[i] * x[i+1])
// The (x[i] * x[i+1]) can proceed in parallel with the chain

for (int i = 0; i < n; i += 2) {
    acc = acc * (data[i] * data[i+1]);  // Partial parallel!
}

Warning: Reassociation changes results for floating-point due to rounding!

Floating-point is NOT associative:

a = 1.0e20
b = -1.0e20
c = 1.0

(a + b) + c = 0 + 1 = 1.0
a + (b + c) = 1.0e20 + (-1.0e20 + 1.0) = 1.0e20 + (-1.0e20) = 0.0

Different results! Compiler won't reassociate unless you explicitly allow it.

Use -ffast-math to permit reassociation, or do it manually (accepting the precision trade-off).
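
You can check the example above on your own machine; a minimal sketch:

#include <stdio.h>

int main(void) {
    double a = 1.0e20, b = -1.0e20, c = 1.0;
    printf("(a + b) + c = %g\n", (a + b) + c);   /* prints 1 */
    printf("a + (b + c) = %g\n", a + (b + c));   /* prints 0 */
    return 0;
}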


2.9 Branch Prediction Mechanics

Modern CPUs predict branch outcomes to avoid pipeline stalls:

Pipeline with branch at cycle 3:

         Cycle:   1    2    3    4    5    6    7    8
Fetch:         [A]  [B]  [br] [?]  [?]  [?]  [?]  [?]
                              โ”‚
                              โ”œโ”€โ”€ Predict TAKEN: fetch from target
                              โ””โ”€โ”€ Predict NOT-TAKEN: fetch next instruction

If prediction is WRONG:
- Pipeline must be flushed
- Restart from correct address
- Penalty: 15-20 cycles on modern CPUs!

Branch Prediction Strategies:

Strategy                      How It Works                                          Accuracy
─────────────────────────────────────────────────────────────────────────────────────────────
Static                        Backward branches predicted taken, forward not-taken  ~60-70%
One-bit                       Remember last outcome                                 ~80%
Two-bit (saturating counter)  Change prediction after 2 wrong                       ~85-90%
Correlating                   Consider history of other branches                    ~90-95%
Neural                        Pattern recognition on branch history                 ~95%+

Branch-Unfriendly Patterns:

// Random condition - terrible for prediction
if (rand() % 2) { ... }  // 50% misprediction rate!

// Data-dependent unpredictable branch
for (int i = 0; i < n; i++) {
    if (data[i] > threshold) {  // depends on data pattern
        result += data[i];
    }
}

Branch-Friendly Patterns:

// Predictable loop
for (int i = 0; i < 1000; i++) { ... }
// Predict: i < 1000 is TRUE for 999 iterations

// Sorted data makes branches predictable
sort(data);
for (int i = 0; i < n; i++) {
    if (data[i] > threshold) {  // After some point, always true
        result += data[i];
    }
}

2.10 Avoiding Branches: Conditional Moves

Replace branches with branchless computation:

// Branch version - misprediction possible
if (a > b) {
    max = a;
} else {
    max = b;
}

// Branchless version using conditional move
max = (a > b) ? a : b;  // May compile to CMOV instruction

// Manual branchless (works always)
int mask = -(a > b);    // All 1s if true, all 0s if false
max = (a & mask) | (b & ~mask);

When to use branchless:

  • Unpredictable conditions (near 50/50 split)
  • Inner loops where branch overhead matters
  • When the computation is simple (conditional move overhead is ~2 cycles)

When branches are better:

  • Predictable patterns (strongly biased one way)
  • Complex alternatives (one path is expensive)
  • The work saved by skipping outweighs misprediction cost

2.11 Memory Hierarchy Impact

Performance depends heavily on memory access patterns (Chapter 6):

Memory Hierarchy Latencies (approximate, Intel Skylake):

Level           Size        Latency (cycles)    Bandwidth
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
L1 Cache        32 KB       4 cycles           ~200 GB/s
L2 Cache        256 KB      12 cycles          ~80 GB/s
L3 Cache        8+ MB       40 cycles          ~40 GB/s
Main Memory     GBs         200+ cycles        ~20 GB/s
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

A miss that goes all the way to DRAM costs ~50x more than an L1 hit (200+ cycles vs 4)!

Locality Principles:

// Temporal locality: recently accessed items accessed again
for (int iter = 0; iter < 100; iter++) {
    for (int i = 0; i < N; i++) {
        sum += data[i];  // Same data[i] accessed 100 times
    }                    // Stays in cache after first pass
}

// Spatial locality: access nearby memory locations
for (int i = 0; i < N; i++) {
    sum += data[i];      // data[i] and data[i+1] are adjacent
}                        // Hardware prefetching of cache lines helps

// Poor locality (stride pattern)
for (int i = 0; i < N; i += 64) {
    sum += data[i];      // Skip cache lines, waste prefetch
}

// Poor locality (random access)
for (int i = 0; i < N; i++) {
    sum += data[indices[i]];  // Random access - cache useless
}

Performance Impact:

Sequential access of 1 million doubles:

L1 hit:   1M * 4 cycles = 4M cycles
L3 hit:   1M * 40 cycles = 40M cycles (10x slower)
DRAM:     1M * 200 cycles = 200M cycles (50x slower)

Locality is CRITICAL for performance!

2.12 Profiling Methodologies

Before optimizing, measure:

1. Time Measurement:

#include <time.h>

// CPU time (clock() measures processor time, not wall-clock time)
clock_t start = clock();
// ... code to measure ...
clock_t end = clock();
double seconds = (double)(end - start) / CLOCKS_PER_SEC;

// More precise: POSIX high-resolution timer (clock_gettime, also in <time.h>)
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
// ... code to measure ...
clock_gettime(CLOCK_MONOTONIC, &end);
double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);

2. Cycle Counting (x86):

#include <x86intrin.h>

unsigned long long start = __rdtsc();
// ... code to measure ...
unsigned long long end = __rdtsc();
unsigned long long cycles = end - start;

3. Hardware Performance Counters (Linux perf):

# Overall stats
perf stat ./my_program

# Detailed cache behavior
perf stat -e cache-misses,cache-references ./my_program

# Branch prediction
perf stat -e branches,branch-misses ./my_program

# Profile and identify hotspots
perf record ./my_program
perf report

2.13 Microbenchmarking Pitfalls

Getting stable measurements is harder than it seems:

Pitfall 1: Dead Code Elimination

// BAD: Compiler may optimize away!
double result = 0;
for (int i = 0; i < N; i++) {
    result += expensive_computation(data[i]);
}
// result is never used - compiler removes the loop!

// GOOD: Prevent elimination
volatile double result = 0;
// or
printf("Result: %f\n", result);  // Force result to be computed

Pitfall 2: CPU Frequency Scaling

// CPU starts at low frequency, ramps up during benchmark
// First measurements are slower!

// SOLUTION 1: Warmup iterations
for (int i = 0; i < 1000; i++) {
    test_function(data);  // Warmup - don't measure
}
// Now measure
start = rdtsc();
for (int i = 0; i < TRIALS; i++) {
    test_function(data);
}
end = rdtsc();

// SOLUTION 2: Disable frequency scaling
// sudo cpupower frequency-set --governor performance

Pitfall 3: Measurement Overhead

// BAD: Timing overhead dominates for fast operations
for (int i = 0; i < N; i++) {
    start = rdtsc();
    x = x + 1;        // 1 cycle
    end = rdtsc();    // ~50 cycles overhead!
    total += (end - start);
}

// GOOD: Batch many operations
start = rdtsc();
for (int i = 0; i < 1000000; i++) {
    x = x + 1;
}
end = rdtsc();
cycles_per_op = (end - start) / 1000000.0;

Pitfall 4: Cache State Variability

// First access: cold cache (slow)
// Second access: warm cache (fast)

// SOLUTION: Consistent cache state
// Option A: Flush cache before each trial (cold)
// Option B: Warmup loop before measuring (warm)
// Option C: Large enough data to exceed cache (always miss)

// Document which state you're measuring!

Pitfall 5: Variance and Outliers

// Run multiple trials, report minimum or median
double times[TRIALS];
for (int t = 0; t < TRIALS; t++) {
    start = now();
    test_function(data);
    times[t] = now() - start;
}
// Use minimum (represents "best case" without interference)
// Or median (more robust to outliers)
double min_time = minimum(times, TRIALS);
double median_time = median(times, TRIALS);

// Report both and explain methodology!

2.14 Compiler Optimizations Overview

Know what the compiler can do so you donโ€™t duplicate its work:

Common Compiler Optimizations:

Optimization            What It Does                         Flag
──────────────────────────────────────────────────────────────────────────────
Inlining                Replace function call with body      -O1+
Loop unrolling          Replicate loop body                  -O2+ or -funroll-loops
Constant folding        Compute constants at compile time    -O1+
Dead code elimination   Remove unused code                   -O1+
Common subexpression    Reuse computed values                -O1+
Strength reduction      Replace expensive ops (a*2 -> a<<1)  -O1+
Vectorization           Use SIMD instructions                -O3 or -ftree-vectorize
Instruction scheduling  Reorder for pipeline efficiency      -O2+

Compiler Explorer Strategy:

  1. Write your original code
  2. Compile with -O0 and -O3
  3. Examine assembly (godbolt.org is excellent)
  4. See what the compiler already does
  5. Only manually optimize what the compiler misses

Critical Compiler Flags:

# Standard optimization levels
gcc -O0 ...  # No optimization (debugging)
gcc -O1 ...  # Basic optimizations
gcc -O2 ...  # Standard optimizations (recommended for most cases)
gcc -O3 ...  # Aggressive optimizations (may increase code size)
gcc -Os ...  # Optimize for size

# Architecture-specific
gcc -march=native ...  # Use all instructions available on this CPU
gcc -mtune=native ...  # Tune for this CPU (without requiring features)

# Floating-point
gcc -ffast-math ...    # Allows reassociation, breaks IEEE compliance
gcc -ffp-contract=fast # Allow FMA contraction

# Vectorization
gcc -ftree-vectorize ... # Enable auto-vectorization
gcc -fopt-info-vec ...   # Report what was vectorized

3. Project Specification

3.1 What You Will Build

A benchmark suite consisting of:

  1. Small computational kernels: Array operations that stress different aspects of the CPU
  2. Measurement infrastructure: Reliable timing with statistical analysis
  3. Optimization laboratory: Before/after versions of each kernel
  4. Written report: Explanation of each optimization in architectural terms

3.2 Core Benchmark Kernels

Implement and optimize these kernels (minimum 4):

Kernel 1: Sum Reduction

double sum_naive(double *data, size_t n) {
    double sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

Target: Transform from latency-bound to throughput-bound.

Kernel 2: Polynomial Evaluation (Horner's Rule)

double poly_naive(double *coeffs, size_t degree, double x) {
    double result = coeffs[degree];
    for (size_t i = degree; i > 0; i--) {
        result = result * x + coeffs[i-1];
    }
    return result;
}

Target: Explore dependency chain limitations.
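
One way to loosen that chain, as a sketch (like any reassociation, this can change low-order floating-point bits): split even- and odd-indexed coefficients into two independent Horner chains in x^2, then combine. The name poly_split2 is illustrative:

double poly_split2(double *coeffs, size_t degree, double x) {
    double x2 = x * x;
    double even = 0.0, odd = 0.0;
    /* Walk coefficients from highest to lowest; each parity forms
       its own independent multiply-add chain in x^2. */
    for (size_t k = degree + 1; k-- > 0; ) {
        if (k & 1) odd  = odd  * x2 + coeffs[k];
        else       even = even * x2 + coeffs[k];
    }
    return even + x * odd;   /* P(x) = E(x^2) + x * O(x^2) */
}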

Kernel 3: Conditional Accumulation

double conditional_sum_naive(double *data, size_t n, double threshold) {
    double sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] > threshold) {
            sum += data[i];
        }
    }
    return sum;
}

Target: Explore branch prediction effects, try branchless.

Kernel 4: Matrix-Vector Multiply

void matvec_naive(double *A, double *x, double *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        y[i] = 0;
        for (size_t j = 0; j < n; j++) {
            y[i] += A[i*n + j] * x[j];
        }
    }
}

Target: Explore memory access patterns and locality.

Kernel 5: Inner Product

double inner_product_naive(double *a, double *b, size_t n) {
    double sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

Target: Two data streams, multiply-add dependency chain.
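
A first optimized variant to aim for, sketched with two accumulators (the cleanup step handles odd n):

double inner_product_2acc(double *a, double *b, size_t n) {
    double s0 = 0, s1 = 0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {    /* i+1 < n avoids overrun */
        s0 += a[i]   * b[i];
        s1 += a[i+1] * b[i+1];
    }
    if (i < n)                           /* odd element count */
        s0 += a[i] * b[i];
    return s0 + s1;
}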

3.3 Functional Requirements

  1. Measurement System:
    • Multiple trials with statistical reporting
    • CPE calculation for each kernel
    • Cycle-accurate timing (rdtsc or similar)
    • Warmup iterations before measurement
  2. Kernel Variants:
    • At least 3 optimization levels per kernel
    • Each variant must produce correct results
    • Document the transformation applied
  3. Report Requirements:
    • CPE table for all variants
    • Speedup calculation relative to naive
    • Explanation of why each optimization works (or doesn't)
    • Discussion of architectural bottleneck shifted

3.4 Example Output

$ ./perf-clinic --kernel sum --trials 100

=== PERFORMANCE CLINIC: Sum Reduction ===

Array size: 1,000,000 doubles
Trials: 100 (reporting median)

Variant                 CPE      Speedup   Bottleneck
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
naive                   3.02     1.00x     Latency (FP add chain)
unroll_2x               3.01     1.00x     Still latency-bound
unroll_4x               2.98     1.01x     Minimal improvement
parallel_2acc           1.52     1.99x     Latency / 2
parallel_4acc           0.78     3.87x     Approaching throughput
parallel_8acc           0.52     5.81x     Near throughput bound
vectorized              0.13     23.23x    SIMD (4 ops/instr)
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

Analysis:
- Naive version is limited by 3-cycle FP add latency
- Unrolling alone doesn't help (still one dependency chain)
- Multiple accumulators break the dependency chain
- 8 accumulators approach the throughput bound (~0.5 CPE)
- SIMD achieves 4x further by processing 4 elements per instruction

4. Solution Architecture

4.1 High-Level Design

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                        perf-clinic                                  โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                     โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”            โ”‚
โ”‚  โ”‚   Timing     โ”‚   โ”‚   Kernel     โ”‚   โ”‚   Analysis   โ”‚            โ”‚
โ”‚  โ”‚   Harness    โ”‚   โ”‚   Library    โ”‚   โ”‚   Engine     โ”‚            โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜            โ”‚
โ”‚         โ”‚                  โ”‚                  โ”‚                     โ”‚
โ”‚         โ–ผ                  โ–ผ                  โ–ผ                     โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚                    Benchmark Runner                           โ”‚  โ”‚
โ”‚  โ”‚  - Load data                                                  โ”‚  โ”‚
โ”‚  โ”‚  - Run warmup                                                 โ”‚  โ”‚
โ”‚  โ”‚  - Execute trials                                             โ”‚  โ”‚
โ”‚  โ”‚  - Collect statistics                                         โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚                              โ”‚                                      โ”‚
โ”‚                              โ–ผ                                      โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚                     Report Generator                          โ”‚  โ”‚
โ”‚  โ”‚  - CPE calculations                                           โ”‚  โ”‚
โ”‚  โ”‚  - Speedup ratios                                             โ”‚  โ”‚
โ”‚  โ”‚  - Statistical summary                                        โ”‚  โ”‚
โ”‚  โ”‚  - Bottleneck identification                                  โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚                              โ”‚                                      โ”‚
โ”‚                              โ–ผ                                      โ”‚
โ”‚                     [Console / File Output]                         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

4.2 Key Data Structures

// Timing result for one trial
typedef struct {
    uint64_t cycles;
    double   seconds;
    size_t   elements;
} trial_result_t;

// Statistics across all trials
typedef struct {
    double min_cycles;
    double median_cycles;
    double mean_cycles;
    double stddev_cycles;
    double cpe;          // cycles per element
    size_t num_trials;
} benchmark_stats_t;

// Kernel variant
typedef struct {
    const char *name;
    const char *description;
    void (*setup)(void *ctx);            // Optional setup
    void (*kernel)(void *ctx);           // The actual work
    void (*verify)(void *ctx);           // Correctness check
    void (*teardown)(void *ctx);         // Cleanup
} kernel_variant_t;

// Benchmark configuration
typedef struct {
    const char *kernel_name;
    kernel_variant_t *variants;
    size_t num_variants;
    size_t element_count;
    size_t num_trials;
    size_t warmup_iters;
    bool verify_results;
} benchmark_config_t;

// Result for one kernel variant
typedef struct {
    const char *variant_name;
    benchmark_stats_t stats;
    bool correct;
    double speedup;  // relative to baseline
} variant_result_t;
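
A sketch of how variants might be wired into these structures; every function name below is an illustrative placeholder, and the types come from the definitions above:

#include <stdbool.h>

/* Placeholder kernels/verifier with the void(*)(void*) interface. */
static void sum_naive_kernel(void *ctx)     { (void)ctx; /* ... */ }
static void sum_parallel4_kernel(void *ctx) { (void)ctx; /* ... */ }
static void verify_sum(void *ctx)           { (void)ctx; /* ... */ }

static kernel_variant_t sum_variants[] = {
    { .name = "naive",         .description = "single accumulator",
      .kernel = sum_naive_kernel,     .verify = verify_sum },
    { .name = "parallel_4acc", .description = "four independent accumulators",
      .kernel = sum_parallel4_kernel, .verify = verify_sum },
};

static benchmark_config_t sum_config = {
    .kernel_name    = "sum",
    .variants       = sum_variants,
    .num_variants   = 2,
    .element_count  = 1000000,
    .num_trials     = 100,
    .warmup_iters   = 10,
    .verify_results = true,
};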

4.3 Module Structure

perf-clinic/
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ main.c                  # CLI and orchestration
โ”‚   โ”œโ”€โ”€ timing.c                # Cycle counting and timing
โ”‚   โ”œโ”€โ”€ timing.h
โ”‚   โ”œโ”€โ”€ stats.c                 # Statistical analysis
โ”‚   โ”œโ”€โ”€ stats.h
โ”‚   โ”œโ”€โ”€ runner.c                # Benchmark execution
โ”‚   โ”œโ”€โ”€ runner.h
โ”‚   โ”œโ”€โ”€ report.c                # Output formatting
โ”‚   โ”œโ”€โ”€ report.h
โ”‚   โ””โ”€โ”€ kernels/
โ”‚       โ”œโ”€โ”€ sum_reduce.c        # Sum variants
โ”‚       โ”œโ”€โ”€ poly_eval.c         # Polynomial variants
โ”‚       โ”œโ”€โ”€ conditional.c       # Conditional sum variants
โ”‚       โ”œโ”€โ”€ matvec.c            # Matrix-vector variants
โ”‚       โ””โ”€โ”€ inner_product.c     # Inner product variants
โ”œโ”€โ”€ include/
โ”‚   โ”œโ”€โ”€ kernels.h               # Kernel declarations
โ”‚   โ””โ”€โ”€ common.h                # Shared definitions
โ”œโ”€โ”€ tests/
โ”‚   โ”œโ”€โ”€ test_correctness.c      # Verify all variants match
โ”‚   โ””โ”€โ”€ test_timing.c           # Validate timing infrastructure
โ”œโ”€โ”€ data/
โ”‚   โ””โ”€โ”€ generate_data.c         # Test data generation
โ”œโ”€โ”€ Makefile
โ””โ”€โ”€ README.md

5. Implementation Guide

Phase 1: Measurement Infrastructure (Days 1-3)

Goals:

  • Reliable cycle-accurate timing
  • Statistical analysis of multiple trials
  • Avoid measurement pitfalls

Tasks:

  1. Implement cycle counter access:
// timing.h
#include <stdint.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__ (
        "rdtsc" : "=a"(lo), "=d"(hi)
    );
    return ((uint64_t)hi << 32) | lo;
}

// More precise: use rdtscp (serializing)
static inline uint64_t rdtscp(void) {
    uint32_t lo, hi, aux;
    __asm__ __volatile__ (
        "rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux)
    );
    return ((uint64_t)hi << 32) | lo;
}
  2. Implement statistical functions (a sketch follows this task list):
// stats.c
double compute_median(double *values, size_t n);
double compute_mean(double *values, size_t n);
double compute_stddev(double *values, size_t n);
double compute_min(double *values, size_t n);
  3. Implement benchmark harness (sketch also below):
// runner.c
benchmark_stats_t run_benchmark(
    void (*kernel)(void *ctx),
    void *ctx,
    size_t elements,
    size_t trials,
    size_t warmup
);
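
Possible shapes for tasks 2 and 3, as sketches: the stats helpers assume n > 0 and sort a copy for the median; the harness reuses the rdtscp helper above and derives CPE from the median.

#include <math.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

double compute_mean(double *values, size_t n) {
    double sum = 0;
    for (size_t i = 0; i < n; i++) sum += values[i];
    return sum / n;
}

double compute_median(double *values, size_t n) {
    double *tmp = malloc(n * sizeof *tmp);     /* sort a copy, not the input */
    memcpy(tmp, values, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_double);
    double m = (n % 2) ? tmp[n / 2]
                       : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
    free(tmp);
    return m;
}

double compute_stddev(double *values, size_t n) {
    double mean = compute_mean(values, n), ss = 0;
    for (size_t i = 0; i < n; i++) {
        double d = values[i] - mean;
        ss += d * d;
    }
    return sqrt(ss / n);
}

double compute_min(double *values, size_t n) {
    double m = values[0];
    for (size_t i = 1; i < n; i++)
        if (values[i] < m) m = values[i];
    return m;
}

/* Harness: warm up, then time each trial with the serializing rdtscp. */
benchmark_stats_t run_benchmark(void (*kernel)(void *ctx), void *ctx,
                                size_t elements, size_t trials,
                                size_t warmup) {
    double *cycles = malloc(trials * sizeof *cycles);

    for (size_t i = 0; i < warmup; i++)
        kernel(ctx);                           /* warm caches, ramp clock */

    for (size_t t = 0; t < trials; t++) {
        uint64_t start = rdtscp();
        kernel(ctx);
        cycles[t] = (double)(rdtscp() - start);
    }

    benchmark_stats_t s;
    s.min_cycles    = compute_min(cycles, trials);
    s.median_cycles = compute_median(cycles, trials);
    s.mean_cycles   = compute_mean(cycles, trials);
    s.stddev_cycles = compute_stddev(cycles, trials);
    s.num_trials    = trials;
    s.cpe           = s.median_cycles / (double)elements;
    free(cycles);
    return s;
}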

Checkpoint: Can measure a simple loop and get consistent cycle counts across runs.

Phase 2: Baseline Kernels (Days 4-5)

Goals:

  • Implement naive versions of all kernels
  • Verify correctness
  • Establish baseline CPE

Tasks:

  1. Implement each kernel's naive version
  2. Create verification functions (compare against known results)
  3. Generate test data with different patterns:
    • Sequential (0, 1, 2, ...)
    • Random uniform
    • Random with patterns (sorted, partially sorted)
  4. Measure baseline CPE:
// Note: run_benchmark takes a void(*)(void*) kernel, so each kernel
// needs a small adapter.  Sketch:
static double g_sum_result;          /* sink prevents dead-code elimination */

static void sum_naive_adapter(void *ctx) {
    g_sum_result = sum_naive((double *)ctx, N);
}

void measure_baselines(void) {
    double data[N];
    fill_random(data, N);

    benchmark_stats_t sum_stats = run_benchmark(
        sum_naive_adapter, data, N, 100, 10
    );

    printf("sum_naive: CPE = %.2f\n", sum_stats.cpe);
}

Checkpoint: All naive kernels produce correct results. Baseline CPE recorded.

Phase 3: Loop Optimizations (Days 6-8)

Goals:

  • Implement unrolled variants
  • Implement parallel accumulator variants
  • Understand dependency effects

Tasks:

  1. Implement 2x, 4x, 8x unrolling:
double sum_unroll4(double *data, size_t n) {
    double sum = 0;
    size_t i;
    for (i = 0; i + 3 < n; i += 4) {   /* i+3 < n avoids size_t underflow when n < 4 */
        sum += data[i];
        sum += data[i+1];
        sum += data[i+2];
        sum += data[i+3];
    }
    for (; i < n; i++) {
        sum += data[i];
    }
    return sum;
}
  2. Implement parallel accumulator variants:
double sum_parallel4(double *data, size_t n) {
    double sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    size_t i;
    for (i = 0; i + 3 < n; i += 4) {   /* i+3 < n avoids size_t underflow when n < 4 */
        sum0 += data[i];
        sum1 += data[i+1];
        sum2 += data[i+2];
        sum3 += data[i+3];
    }
    for (; i < n; i++) {
        sum0 += data[i];
    }
    return (sum0 + sum1) + (sum2 + sum3);
}
  3. Measure and compare CPE for all variants

Checkpoint: Parallel accumulator shows significant speedup. Can explain why unrolling alone doesn't help.

Phase 4: Branch Optimizations (Days 9-10)

Goals:

  • Measure branch misprediction effects
  • Implement branchless alternatives
  • Compare with different data patterns

Tasks:

  1. Implement branchless conditional:
double conditional_sum_branchless(double *data, size_t n, double threshold) {
    double sum = 0;
    for (size_t i = 0; i < n; i++) {
        // weight is 1.0 or 0.0 - typically compiles to compare + select,
        // so there is no branch to mispredict
        double weight = (data[i] > threshold) ? 1.0 : 0.0;
        sum += data[i] * weight;
    }
    return sum;
}
  2. Test with different data distributions:
    • Random (50% pass threshold) - unpredictable
    • Sorted (all pass or all fail) - predictable
    • Biased (90% one way) - mostly predictable
  3. Measure with perf stat -e branches,branch-misses:
# High misprediction rate expected for random data
perf stat -e branches,branch-misses ./perf-clinic --kernel conditional --data random

# Low misprediction rate for sorted data
perf stat -e branches,branch-misses ./perf-clinic --kernel conditional --data sorted

Checkpoint: Can demonstrate branch misprediction costs. Branchless version wins for random data.

Phase 5: Memory Optimizations (Days 11-12)

Goals:

  • Explore locality effects
  • Measure cache miss impact
  • Optimize memory access patterns

Tasks:

  1. Implement matrix-vector with different access patterns (a cache-blocked sketch follows this task list):
// Row-major (good cache behavior for row-major storage)
void matvec_rowmajor(double *A, double *x, double *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double sum = 0;
        for (size_t j = 0; j < n; j++) {
            sum += A[i*n + j] * x[j];  // A accessed sequentially
        }
        y[i] = sum;
    }
}

// Column-major (bad cache behavior for row-major storage)
void matvec_colmajor(double *A, double *x, double *y, size_t n) {
    for (size_t i = 0; i < n; i++) y[i] = 0;
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < n; i++) {
            y[i] += A[i*n + j] * x[j];  // A accessed with stride n
        }
    }
}
  2. Measure cache misses:
perf stat -e cache-references,cache-misses ./perf-clinic --kernel matvec --variant rowmajor
perf stat -e cache-references,cache-misses ./perf-clinic --kernel matvec --variant colmajor
  3. Test with matrices of different sizes (L1, L2, L3, larger than cache)
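
The sample output in section 13 lists a blocked_32 variant; a minimal cache-blocking sketch (the block size 32 is an assumption to tune; y accumulates across column blocks so the active slice of x stays cache-resident):

#define BLOCK 32

void matvec_blocked(double *A, double *x, double *y, size_t n) {
    for (size_t i = 0; i < n; i++) y[i] = 0;
    /* Process columns in blocks so the active slice of x stays in L1. */
    for (size_t jb = 0; jb < n; jb += BLOCK) {
        size_t jend = (jb + BLOCK < n) ? jb + BLOCK : n;
        for (size_t i = 0; i < n; i++) {
            double sum = y[i];
            for (size_t j = jb; j < jend; j++)
                sum += A[i*n + j] * x[j];
            y[i] = sum;
        }
    }
}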

Checkpoint: Row-major significantly faster. Cache miss rate correlates with slowdown.

Phase 6: Report and Analysis (Days 13-14)

Goals:

  • Document all results
  • Explain each optimization architecturally
  • Identify when optimizations backfire

Tasks:

  1. Create comprehensive CPE table
  2. Write analysis for each kernel:
    • What was the bottleneck?
    • How did each transformation address it?
    • What is the limiting factor now?
  3. Document cases where optimizations didn't help:
    • Unrolling without multiple accumulators
    • Branchless for predictable data
    • Over-unrolling that causes register pressure
  4. Produce final report document

Checkpoint: Portfolio-quality report with before/after results and strong architectural narrative.


6. Testing Strategy

6.1 Correctness Testing

Every optimized variant must produce the same result as the naive version:

void test_sum_correctness(void) {
    double data[N];
    fill_random(data, N);

    double expected = sum_naive(data, N);

    assert_close(sum_unroll4(data, N), expected, 1e-10);
    assert_close(sum_parallel4(data, N), expected, 1e-10);
    assert_close(sum_parallel8(data, N), expected, 1e-10);
    // Note: floating-point may differ slightly due to reassociation
}

void assert_close(double actual, double expected, double tolerance) {
    double diff = fabs(actual - expected);
    double rel = diff / (fabs(expected) + 1e-10);
    if (rel > tolerance) {
        fprintf(stderr, "FAIL: expected %f, got %f (rel diff: %e)\n",
                expected, actual, rel);
        exit(1);
    }
}

6.2 Measurement Validation

Verify timing infrastructure is reliable:

void test_timing_stability(void) {
    // Measure a known-duration operation
    double times[100];
    for (int i = 0; i < 100; i++) {
        uint64_t start = rdtsc();
        busy_loop(10000);  // Calibrated delay
        times[i] = rdtsc() - start;
    }

    double stddev = compute_stddev(times, 100);
    double mean = compute_mean(times, 100);
    double cv = stddev / mean;  // Coefficient of variation

    // CV should be < 5% for stable measurements
    assert(cv < 0.05);
}

6.3 Compiler Interference Check

Ensure the compiler isn't removing your benchmark code:

void test_not_optimized_away(void) {
    volatile double result;

    uint64_t start = rdtsc();
    result = sum_naive(data, N);
    uint64_t cycles = rdtsc() - start;

    // Should take at least ~N/2 cycles (at most two loads per cycle)
    assert(cycles > N / 2);

    // Use result to prevent elimination
    assert(result != 12345.6789);  // Some value it won't equal
}

6.4 Edge Case Testing

void test_edge_cases(void) {
    double data[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    // Test with n not divisible by unroll factor
    assert_close(sum_parallel4(data, 1), 1.0, 1e-10);
    assert_close(sum_parallel4(data, 2), 3.0, 1e-10);
    assert_close(sum_parallel4(data, 5), 15.0, 1e-10);
    assert_close(sum_parallel4(data, 7), 28.0, 1e-10);

    // Test with empty array
    assert_close(sum_parallel4(data, 0), 0.0, 1e-10);

    // Test with special values
    data[0] = INFINITY;
    assert(isinf(sum_parallel4(data, 8)));

    data[0] = NAN;
    assert(isnan(sum_parallel4(data, 8)));
}

7. Common Pitfalls and Debugging

7.1 Measurement Errors

Pitfall                    Symptom                          Solution
───────────────────────────────────────────────────────────────────────────────────
Dead code elimination      CPE near 0                       Use volatile or print result
Frequency scaling          First runs slower                Warmup iterations, disable scaling
Turbo boost inconsistency  High variance                    Disable turbo or use more trials
Timer overflow             Negative times                   Use 64-bit counters
Memory prefetching         Unrealistic times on small data  Use arrays > L3 cache

7.2 CPU Frequency Scaling

# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Set to performance (requires root)
sudo cpupower frequency-set --governor performance

# Or for all CPUs:
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee $cpu
done

# Disable turbo boost (Intel)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

7.3 Compiler Optimization Interference

# Check what the compiler is doing
gcc -O3 -S -fverbose-asm kernel.c -o kernel.s

# Use Compiler Explorer (https://godbolt.org)
# to visualize assembly

# Prevent specific optimizations
__attribute__((optimize("no-unroll-loops")))
double sum_my_unroll(double *data, size_t n) { ... }

7.4 Memory Alignment Issues

// Ensure proper alignment for SIMD
double *data = aligned_alloc(32, n * sizeof(double));  // C11: size should be a multiple of 32

// Or use compiler attributes
double data[N] __attribute__((aligned(32)));

// Check alignment at runtime
assert(((uintptr_t)data % 32) == 0);

7.5 False Sharing in Parallel Code

// BAD: Accumulators may share cache line
double sums[4];  // Could be on same 64-byte line

// GOOD: Pad to separate cache lines
struct {
    double value;
    char padding[56];  // 64 - 8 = 56 bytes padding
} sums[4];
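
An equivalent sketch using C11 alignas instead of manual padding:

#include <stdalign.h>

/* alignas(64) forces each element onto its own 64-byte cache line. */
typedef struct { alignas(64) double value; } padded_double;

padded_double sums[4];   /* access as sums[i].value */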

8. Extensions

8.1 Beginner Extensions

  • Add CSV output: Export results for spreadsheet analysis
  • Implement warm/cold cache modes: Measure both scenarios
  • Add progress indicator: Show which benchmark is running
  • Create comparison plots: Visualize CPE across variants

8.2 Intermediate Extensions

  • SIMD implementation: Use intrinsics for vectorized kernels
  • Profile-guided optimization: Use hardware counters to guide changes
  • Multi-threaded variants: Parallelize across cores (be careful with reduction!)
  • Different data types: Compare float vs double vs int performance

8.3 Advanced Extensions

  • Auto-tuning framework: Automatically search for best unroll factor
  • Roofline model: Plot achieved performance vs hardware limits
  • Assembly kernels: Hand-write inner loops in assembly
  • Cross-platform support: ARM NEON, AMD specific counters

9. Real-World Connections

9.1 Industry Applications

High-Frequency Trading: Every nanosecond matters. The techniques in this project are exactly what HFT firms use to optimize their systems.

Game Engines: Physics simulations, graphics rendering, and audio processing all depend on tight inner loops.

Machine Learning: Matrix operations in neural networks benefit from the same optimization strategies.

Scientific Computing: Weather modeling, molecular dynamics, and other simulations spend most time in computational kernels.

9.2 Professional Tools

  • perf: Linux performance analysis tool
  • Intel VTune: Commercial profiler with deep microarchitecture insight
  • PAPI: Portable interface to hardware performance counters
  • likwid: Lightweight performance tools for Linux
  • BLAS/LAPACK: Highly optimized linear algebra libraries

9.3 Interview Relevance

This project prepares you for questions like:

  • "How would you optimize this code?" (Can demonstrate systematic approach)
  • "What limits the performance of this loop?" (Can identify bottleneck)
  • "Why is this version faster?" (Can explain in terms of architecture)
  • "How would you measure performance?" (Know pitfalls and methodology)
  • "Explain Amdahl's Law" (Can calculate and apply limits)

10. Resources

10.1 Essential Reading

  • CS:APP Chapter 5: "Optimizing Program Performance" - The foundation for this project
  • CS:APP Chapter 6: "The Memory Hierarchy" - Understanding cache effects
  • CS:APP Chapter 1.9: "Amdahl's Law" - Optimization limits
  • Agner Fog's Optimization Manuals: agner.org/optimize - CPU architecture details

10.2 Tools

  • perf: man perf - Linux performance analysis
  • valgrind --tool=cachegrind: Cache simulation
  • Intel Intrinsics Guide: searchable reference for SIMD intrinsics
  • Compiler Explorer: godbolt.org - See generated assembly

10.3 Video Resources

  • MIT 6.172: Performance Engineering of Software Systems (MIT OpenCourseWare)
  • What Every Programmer Should Know About Memory: Ulrich Drepper (LWN.net)
  • CppCon talks: Search for โ€œoptimizationโ€, โ€œperformanceโ€

10.4 Related Projects

  • Previous: P1 (Toolchain Explorer) - understanding compilation
  • Next: P9 (Cache Lab++) - deep dive into cache behavior
  • Related: P7 (Y86-64 Simulator) - CPU pipeline internals

11. Self-Assessment Checklist

Understanding

  • I can explain Amdahl's Law and calculate speedup limits
  • I understand the difference between latency and throughput
  • I can identify when code is latency-bound vs throughput-bound
  • I can explain why multiple accumulators help performance
  • I understand how branch misprediction affects performance
  • I can reason about memory access patterns and cache behavior

Implementation

  • My timing infrastructure produces stable, repeatable results
  • All optimized variants produce correct results
  • I have measured CPE for at least 4 kernels
  • Each kernel has at least 3 variants with different optimizations
  • I can demonstrate cases where optimizations backfire

Analysis

  • My report explains each optimization in architectural terms
  • I can predict when an optimization will help or hurt
  • I have identified the bottleneck for each kernel variant
  • I understand why unrolling alone doesn't help sum reduction
  • I can explain the performance difference with different data patterns

Growth

  • I used hardware performance counters (perf) to validate understanding
  • I examined compiler-generated assembly to understand transformations
  • I can now read a loop and estimate its CPE
  • I trust measurements over intuition

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Measurement infrastructure works reliably
  • At least 4 kernels with naive and 2+ optimized variants each
  • Correctness tests pass
  • Basic CPE table with results

Full Completion:

  • All 5 kernels with 4+ variants each
  • Statistical analysis (mean, median, stddev)
  • Branch and cache measurements using perf
  • Written report explaining each optimization
  • Cases documented where optimization didn't help

Excellence (Going Above & Beyond):

  • SIMD implementations using intrinsics
  • Roofline model analysis
  • Auto-tuning for optimal parameters
  • Cross-architecture comparison (Intel vs AMD vs ARM)
  • Assembly-level kernels for maximum performance

13. Real World Outcome

When you complete this project, here is exactly what you will see when running your performance clinic:

$ cd perf-clinic
$ make
gcc -O2 -march=native -Wall -g -c src/main.c -o build/main.o
gcc -O2 -march=native -Wall -g -c src/timing.c -o build/timing.o
gcc -O2 -march=native -Wall -g -c src/kernels/sum_reduce.c -o build/sum_reduce.o
...
gcc -o perf-clinic build/*.o -lm

$ ./perf-clinic --kernel sum --trials 100 --size 1000000

=== PERFORMANCE CLINIC: Sum Reduction ===

Array size: 1,000,000 doubles (8 MB)
Trials: 100 (reporting minimum, median, mean)
CPU: Intel Core i7-10700 @ 3.80GHz

Variant                 Min CPE   Med CPE   Mean CPE   Speedup   Bottleneck
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
naive                    3.01      3.04       3.05      1.00x    Latency (FP add)
unroll_2x                3.00      3.02       3.03      1.01x    Still latency-bound
unroll_4x                2.98      3.01       3.02      1.01x    Unroll doesn't help alone
parallel_2acc            1.51      1.53       1.54      1.99x    Latency / 2
parallel_4acc            0.76      0.78       0.79      3.90x    Near throughput
parallel_8acc            0.51      0.53       0.54      5.70x    Approaching limit
parallel_16acc           0.50      0.52       0.53      5.83x    Diminishing returns
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

Theoretical Bounds:
  Latency bound (3-cycle add):  CPE >= 3.0
  Throughput bound (1 add/cyc): CPE >= 1.0
  Memory bound (L1 cached):     CPE >= 0.5

Analysis:
  naive: Limited by 3-cycle FP add latency (dependency chain)
  unroll_*: Unrolling alone cannot break dependency chain
  parallel_*: Multiple accumulators create independent chains
  Best practical: 8 accumulators achieves ~6x speedup

$ ./perf-clinic --kernel conditional --trials 100 --data random

=== PERFORMANCE CLINIC: Conditional Sum ===

Testing branch prediction effects with random data (50% threshold)

Variant                 CPE       Mispredicts/1K   Speedup
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
branch_naive            8.42         498.2          1.00x
branch_sorted           2.15           1.3          3.92x
branchless              3.12           0.0          2.70x
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

Analysis:
  Random data: ~50% misprediction rate (498/1000 branches)
  Sorted data: Highly predictable, near-perfect prediction
  Branchless: No branches to mispredict, constant time

$ perf stat -e cycles,instructions,cache-misses ./perf-clinic --kernel matvec --size 1024

=== PERFORMANCE CLINIC: Matrix-Vector Multiply ===

Matrix: 1024x1024 doubles (8 MB)

 Performance counter stats for './perf-clinic':
     1,234,567,890      cycles
       987,654,321      instructions    #    0.80  insn per cycle
            12,345      cache-misses

Variant                 CPE       L1 Miss Rate    Speedup
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
row_major               1.23         0.1%         1.00x
col_major               8.45        12.4%         0.15x
blocked_32              0.95         0.2%         1.29x
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

Analysis:
  col_major: Strided access pattern causes cache thrashing
  row_major: Sequential access, good spatial locality
  blocked: Even better cache utilization through blocking

$ cat reports/optimization_report.md
# Performance Optimization Report

## Executive Summary
Achieved 5.7x speedup on sum reduction through loop parallelism.
Branch optimization yielded 2.7-3.9x speedup depending on data.
Memory access pattern optimization: 6.5x improvement on matrix ops.

## Key Findings
1. Loop unrolling WITHOUT multiple accumulators provides minimal benefit
2. Dependency chains are the primary bottleneck for reduction operations
3. Branch misprediction costs ~15 cycles on this architecture
4. Cache miss penalty is ~100 cycles for L3, ~200 for DRAM
...

14. The Core Question You're Answering

"What prevents code from running at hardware speed, and how do I systematically identify and remove performance bottlenecks - while proving that my optimizations actually work?"

This question is at the heart of performance engineering. You're learning:

  • How to measure performance reliably (avoiding measurement pitfalls)
  • How to identify whether code is latency-bound, throughput-bound, or memory-bound
  • How compiler optimizations and CPU features interact with your code
  • The discipline of hypothesis-driven optimization: measure, hypothesize, transform, verify

15. Concepts You Must Understand First

Before starting this project, ensure you understand these concepts:

Concept                 Where to Learn  Why It's Needed
──────────────────────────────────────────────────────────────────────────────
Amdahl's Law            CS:APP 1.9.1    Calculate maximum speedup before optimizing
Instruction Pipelining  CS:APP 4.4      Understand latency vs throughput
Data Dependencies       CS:APP 5.5      Identify what limits instruction-level parallelism
Loop Unrolling          CS:APP 5.8      Know when and why unrolling helps
Multiple Accumulators   CS:APP 5.9      Break dependency chains
Branch Prediction       CS:APP 5.12     Understand misprediction costs
Cache Hierarchy         CS:APP 6.1-6.4  Reason about memory access patterns
Condition Codes         CS:APP 3.6.1    Understand what affects branch behavior

16. Questions to Guide Your Design

As you develop your benchmark suite, ask yourself:

  1. Am I measuring what I think I'm measuring? Is the compiler optimizing away my code? Is the timer granular enough?
  2. What is the theoretical bound? Given latency and throughput, what's the best possible CPE?
  3. Which bound am I hitting? Am I latency-bound (dependency chain) or throughput-bound (functional units)?
  4. Is my data realistic? Cache-cold vs cache-hot? Random vs sequential access?
  5. Can I explain the speedup? For each optimization, can I say exactly why it helped?
  6. What didn't work? Understanding failures is as important as successes.
  7. Is this reproducible? Can someone else get the same results?

17. Thinking Exercise

Before you start coding, work through this analysis on paper:

Consider this reduction loop:

double sum_naive(double *data, int n) {
    double sum = 0;
    for (int i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

Given:

  • FP add latency: 3 cycles
  • FP add throughput: 1 per cycle
  • Load throughput: 2 per cycle
  • Array is cached in L1

Questions to answer:

  1. Draw the dependency graph for iterations i=0, 1, 2, 3. What depends on what?
  2. What is the CPE if we're latency-bound?
  3. What is the CPE if we're throughput-bound?
  4. This loop is latency-bound. Why? (What creates the dependency chain?)
  5. If we use 4 accumulators (sum0, sum1, sum2, sum3), what is the new CPE? Why?
  6. How many accumulators do we need to become throughput-bound?

Expected answers:

  1. Each sum += depends on the previous: sum_0 -> sum_1 -> sum_2 -> sum_3 (serial chain)
  2. Latency-bound: CPE = 3.0 (one add every 3 cycles due to dependency)
  3. Throughput-bound: CPE = 1.0 (one add per cycle if no dependencies)
  4. Each addition waits for the previous sum value - this is a RAW dependency chain
  5. With 4 accumulators: 4 independent chains, CPE = 3/4 = 0.75 (parallel execution)
  6. To hit throughput bound of 1 add/cycle with 3-cycle latency: need 3 accumulators minimum

18. The Interview Questions They'll Ask

After completing this project, you should be able to answer these interview questions:

  1. "How would you optimize a slow function?"
    • First: Profile to find where time is spent (don't guess)
    • Calculate Amdahl's Law limit - is optimization worthwhile?
    • Identify the bottleneck: latency? throughput? memory? branches?
    • Apply targeted transformation: unroll+multiple accumulators for latency, cache blocking for memory
    • Measure again to verify improvement
  2. "What is the difference between latency and throughput?"
    • Latency: Time for one operation to complete (start to finish)
    • Throughput: Rate at which operations can be started
    • Pipelined units have high throughput (a new operation can start every cycle or two) despite multi-cycle latency
    • Dependency chains are latency-bound; independent work is throughput-bound
  3. "Explain Amdahl's Law and give an example."
    • Speedup = 1 / ((1-p) + p/S) where p = parallelizable fraction, S = speedup of that fraction
    • Example: 90% of time in function, optimize 10x: Speedup = 1/(0.1 + 0.9/10) = 1/0.19 = 5.26x
    • Key insight: Serial portion limits maximum speedup (1/(1-p))
  4. "Why does loop unrolling help performance?"
    • Reduces loop overhead (fewer increments, compares, branches)
    • More work visible to out-of-order engine
    • BUT: doesn't break dependency chains alone
    • Need multiple accumulators to truly parallelize work
  5. "How does branch misprediction affect performance?"
    • CPU speculatively executes predicted path
    • On mispredict: flush pipeline, restart from correct path
    • Cost: 15-20 cycles on modern CPUs
    • Random branches = ~50% mispredict rate = significant overhead
  6. "How do you ensure benchmark results are valid?"
    • Use volatile or print results to prevent dead code elimination
    • Warm up: run several iterations before measuring
    • Multiple trials: report minimum, median, and standard deviation
    • Consistent state: control cache (cold vs warm), CPU frequency (disable turbo)
    • Verify correctness: optimized code must produce same results

19. Hints in Layers

If you get stuck, reveal hints one at a time. Try each level before moving to the next.

Hint 1 - Reliable Timing

Use rdtsc for cycle counting or clock_gettime(CLOCK_MONOTONIC) for wall time. Always run multiple trials and take the minimum (represents best case without interference).

Hint 2 - Preventing Dead Code Elimination

Declare your result as volatile, or print it at the end. Check the assembly (use -S or godbolt.org) to verify your loop wasn't optimized away.

Hint 3 - Understanding the Bottleneck

If unrolling doesn't help, you're probably latency-bound. Count the operations in your dependency chain. Add accumulators equal to the latency/throughput ratio.

Hint 4 - When Branch Optimization Matters

Only optimize branches if: (1) data is unpredictable, (2) branch is in hot loop, (3) both paths are cheap. Use perf stat -e branch-misses to measure.


20. Books That Will Help

Topic                         Book                                               Chapter/Section
─────────────────────────────────────────────────────────────────────────────────────────────────
Performance Fundamentals      CS:APP 3e                                          Chapter 5 (entire chapter)
Amdahl's Law                  CS:APP 3e                                          Section 1.9.1
Loop Unrolling & Parallelism  CS:APP 3e                                          Sections 5.8-5.9
Branch Prediction             CS:APP 3e                                          Section 5.12
Memory Hierarchy              CS:APP 3e                                          Chapter 6
CPU Microarchitecture         Agner Fog's Manuals                                Microarchitecture guide
Performance Analysis          "Systems Performance" by Brendan Gregg             Chapters 6-7
Low-Level Optimization        "What Every Programmer Should Know About Memory"   Sections 3-6

This guide was expanded from CSAPP_3E_DEEP_LEARNING_PROJECTS.md. For the complete learning path, see the project index.