Project 8: Performance Clinic
Build a benchmark suite of small kernels plus a written optimization report explaining changes in terms of ILP, branch prediction, and locality.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1-2 weeks |
| Language | C (Alternatives: Rust, Zig, C++) |
| Prerequisites | Project 1 (Toolchain Explorer) |
| Key Topics | Loop optimization, ILP, branch prediction, memory hierarchy, profiling |
| CS:APP Chapters | 5 (primary), 6, 1 |
1. Learning Objectives
By completing this project, you will:
- Measure performance reliably: Obtain stable, repeatable timing measurements that resist noise and variance
- Understand CPU execution pipelines: Reason about instruction-level parallelism, functional units, and throughput vs latency
- Master loop transformations: Apply unrolling, reassociation, and accumulator parallelism systematically
- Predict branch behavior: Understand branch prediction mechanics and identify misprediction-prone patterns
- Reason about memory hierarchy: Connect cache behavior (Chapter 6) to observed performance
- Apply Amdahl's Law: Quantify optimization limits and prioritize improvement efforts
- Avoid "faster by accident": Distinguish genuine improvements from measurement artifacts
2. Deep Theoretical Foundation
2.1 Amdahl's Law: The Speed Limit of Optimization
Every optimization has limits. Gene Amdahl formalized this in 1967:
Speedup(S, p) = 1 / ((1 - p) + p / S)
Where:
- p = fraction of execution time in the optimizable portion
- S = speedup of that portion
- (1 - p) = fraction that cannot be optimized
Example: Your program spends 80% of time in one function. You optimize that function to run 10x faster.
p = 0.80 (80% in optimizable portion)
S = 10 (10x speedup)
Speedup = 1 / (0.20 + 0.80/10)
= 1 / (0.20 + 0.08)
= 1 / 0.28
= 3.57x overall speedup
Even with infinite speedup (S = infinity):
Speedup_max = 1 / (1 - 0.80) = 1 / 0.20 = 5x maximum
You can never exceed 5x no matter how fast you make that 80%!
Key Implications:
- Profile before optimizing: Find the p-dominant portions
- The sequential portion is the ceiling: Reducing (1-p) may matter more than increasing S
- Diminishing returns: Each additional speedup of the hot portion yields less overall benefit
- Always calculate the bound: Before spending a week optimizing, compute the maximum possible gain
Amdahl's Law Visualization:
Overall speedup when portion p is sped up by factor S:
p = 50%: S=2 -> 1.33x S=10 -> 1.82x S=inf -> 2.00x
p = 80%: S=2 -> 1.67x S=10 -> 3.57x S=inf -> 5.00x
p = 90%: S=2 -> 1.82x S=10 -> 5.26x S=inf -> 10.00x
p = 95%: S=2 -> 1.90x S=10 -> 6.90x S=inf -> 20.00x
p = 99%: S=2 -> 1.98x S=10 -> 9.17x S=inf -> 100.00x
The larger p is, the more room for improvement.
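To make the bound concrete before committing to an optimization, a few lines of C suffice. This is a minimal sketch; the function names are illustrative and not part of the project skeleton.
#include <stdio.h>
/* Overall speedup when a fraction p of runtime is sped up by factor s. */
static double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
/* Upper bound as s -> infinity: only the serial fraction remains. */
static double amdahl_limit(double p) {
    return 1.0 / (1.0 - p);
}
int main(void) {
    printf("p=0.80, S=10  -> %.2fx overall\n", amdahl_speedup(0.80, 10.0)); /* 3.57x */
    printf("p=0.80, S=inf -> %.2fx ceiling\n", amdahl_limit(0.80));         /* 5.00x */
    return 0;
}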
2.2 Understanding Modern CPU Pipelines
Modern processors execute instructions through a multi-stage pipeline:
Classic 5-Stage Pipeline:
Cycle: 1 2 3 4 5 6 7 8
Inst 1: [IF] [ID] [EX] [MEM][WB]
Inst 2: [IF] [ID] [EX] [MEM][WB]
Inst 3: [IF] [ID] [EX] [MEM][WB]
Inst 4: [IF] [ID] [EX] [MEM][WB]
IF = Instruction Fetch
ID = Instruction Decode
EX = Execute (ALU operation)
MEM = Memory access
WB = Write Back (to register)
Ideally: One instruction completes per cycle (throughput = 1 IPC)
Modern Superscalar Processors:
Real processors are far more complex:
Intel Skylake-class Core (simplified):
+--------------------------------------------------------------------+
|                             Front End                              |
|   +--------------+     +---------------+     +--------------+      |
|   |    Fetch     | --> |    Decode     | --> |   Rename/    |      |
|   | (16B/cycle)  |     | (4 inst/cyc)  |     |   Allocate   |      |
|   +--------------+     +---------------+     +--------------+      |
+--------------------------------------------------------------------+
                                  |
                                  v
+--------------------------------------------------------------------+
|                  Reservation Station (Scheduler)                   |
|                           ~100 entries                             |
+--------------------------------------------------------------------+
       |          |          |          |          |
       v          v          v          v          v
   +-------+  +-------+  +-------+  +-------+  +-------+
   | INT 0 |  | INT 1 |  | INT 2 |  | FP 0  |  | FP 1  |
   |  ALU  |  |  ALU  |  |  ALU  |  |  FMA  |  |  FMA  |
   | BRANCH|  | BRANCH|  |  MUL  |  |  ADD  |  |  ADD  |
   +-------+  +-------+  +-------+  +-------+  +-------+
       |          |          |          |          |
       +----------+----------+----------+----------+
                                  |
                                  v
+--------------------------------------------------------------------+
|                           Reorder Buffer                           |
|                 (preserves sequential semantics)                    |
+--------------------------------------------------------------------+
Key insight: Multiple instructions execute simultaneously across different functional units. Your code should provide enough independent work to keep these units busy.
2.3 Instruction-Level Parallelism (ILP)
ILP is the amount of independent work available at the instruction level.
Data Dependencies Limit ILP:
// Sequential chain - NO parallelism
// Each instruction depends on the previous result
x = a + b; // cycle 1-4 (assume 4-cycle latency)
y = x + c; // cycle 5-8 (must wait for x)
z = y + d; // cycle 9-12 (must wait for y)
// Total: 12 cycles for 3 operations
// Independent operations - FULL parallelism
// No dependencies between operations
x = a + b; // cycles 1-4  --+  all execute
y = c + d; // cycles 1-4    |  simultaneously
z = e + f; // cycles 1-4  --+
// Total: 4 cycles for 3 operations (3x speedup)
Dependency Types:
| Type | Also Called | Description | Can Be Eliminated? |
|---|---|---|---|
| RAW | True Dependency | Read After Write: a = 1; b = a; | No (fundamental) |
| WAR | Anti-Dependency | Write After Read: a = b; b = 2; | Yes (renaming) |
| WAW | Output Dependency | Write After Write: a = 1; a = 2; | Yes (renaming) |
The processor's register renaming eliminates WAR and WAW dependencies, but RAW dependencies are fundamental and limit ILP.
2.4 Latency vs Throughput
Critical distinction for understanding performance:
Latency: Time for a single operation to complete (cycles)
Throughput: Rate at which operations can be initiated (ops per cycle); often quoted as its reciprocal, the issue interval in cycles
Example: Floating-point multiplication on modern Intel
Latency: 4 cycles (one multiply takes 4 cycles start to finish)
Issue interval: 0.5 cycles (a new multiply can start every 0.5 cycles, i.e., throughput of 2 per cycle)
This means the multiplier is pipelined:
Cycle: 1 2 3 4 5 6 7 8
Mult 1: [S1] [S2] [S3] [S4] done
Mult 2: [S1] [S2] [S3] [S4] done
Mult 3: [S1] [S2] [S3] [S4] done
Mult 4: [S1] [S2] [S3] [S4] done
With 8 independent multiplies, throughput bound is:
8 operations / (8 * 0.5 cycles) = 8 / 4 = 2 ops/cycle
But with a dependency chain:
8 operations / (8 * 4 cycles) = 8 / 32 = 0.25 ops/cycle
That's 8x slower!
The Performance Equation:
Time >= max(Latency_bound, Throughput_bound)
Latency_bound: Time = n * latency
(serial chain of n operations)
Throughput_bound: Time = n * issue_interval
(n independent operations; issue_interval = 1 / throughput)
Goal: Make your code throughput-bound, not latency-bound!
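To feel the difference, the two loops below perform the same number of multiplies. Timing them side by side exposes the latency bound versus the throughput bound; this is a sketch with illustrative names, not part of the required kernel set.
#include <stddef.h>
/* Serial chain: every multiply waits for the previous result (latency-bound). */
double mul_chain(const double *x, size_t n) {
    double acc = 1.0;
    for (size_t i = 0; i < n; i++)
        acc *= x[i];
    return acc;
}
/* Four independent chains: the pipelined multiplier stays busy (throughput-bound). */
double mul_parallel4(const double *x, size_t n) {
    double a0 = 1.0, a1 = 1.0, a2 = 1.0, a3 = 1.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        a0 *= x[i];
        a1 *= x[i + 1];
        a2 *= x[i + 2];
        a3 *= x[i + 3];
    }
    for (; i < n; i++)
        a0 *= x[i];
    return (a0 * a1) * (a2 * a3);
}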
2.5 CPE: Cycles Per Element
The CPE metric from CS:APP measures efficiency for array/loop computations:
CPE = (Total cycles to process array) / (Number of elements)
Example: Sum an array of 1000 elements
Version A: 4000 cycles -> CPE = 4.0
Version B: 1500 cycles -> CPE = 1.5
Version C: 1000 cycles -> CPE = 1.0
Lower CPE = faster code
Theoretical CPE Limits:
For a loop that performs one floating-point add per element:
for (int i = 0; i < n; i++) {
sum += data[i]; // one FP add per iteration
}
| Bottleneck | Bound | Reason |
|---|---|---|
| Latency (3-cycle add, dependency chain) | CPE >= 3.0 | Each add waits for previous |
| Throughput (1 add/cycle possible) | CPE >= 1.0 | Adder can do 1 per cycle |
| Memory (assuming cached) | CPE >= 0.5 | 2 loads/cycle possible |
If your measured CPE is 3.0, you're latency-bound. Optimization means breaking the dependency chain.
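In practice CPE falls straight out of a cycle counter. A minimal sketch, assuming the rdtsc() wrapper defined later in the implementation guide and a kernel with this signature:
#include <stdint.h>
#include <stddef.h>
/* Cycles per element for one run of a kernel over n elements.
   rdtsc() is the cycle-counter wrapper from the timing module (Section 5, Phase 1). */
double measure_cpe(double (*kernel)(double *, size_t), double *data, size_t n) {
    uint64_t start = rdtsc();
    volatile double result = kernel(data, n);   /* volatile: keep the call alive */
    uint64_t cycles = rdtsc() - start;
    (void)result;
    return (double)cycles / (double)n;
}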
2.6 Loop Unrolling
Loop unrolling reduces loop overhead and exposes more work per iteration:
// Original loop
for (int i = 0; i < n; i++) {
sum += data[i];
}
// 2x unrolled
for (int i = 0; i < n; i += 2) {
sum += data[i];
sum += data[i+1];
}
// 4x unrolled
for (int i = 0; i < n; i += 4) {
sum += data[i];
sum += data[i+1];
sum += data[i+2];
sum += data[i+3];
}
Benefits:
- Fewer loop counter increments and comparisons
- More work visible to the out-of-order engine
- Better instruction cache utilization (fewer branches)
But still latency-bound! All adds still chain through sum. Unrolling alone may not help.
2.7 Reassociation and Multiple Accumulators
The key to breaking latency bounds is parallelizing the accumulation:
// Original: all adds chain through one accumulator
// CPE ~= latency of addition (3-4 cycles)
double sum = 0;
for (int i = 0; i < n; i++) {
sum += data[i]; // sum = sum + data[i]
}
// 2x parallelism: two independent accumulator chains
// CPE ~= latency / 2
double sum0 = 0, sum1 = 0;
for (int i = 0; i < n; i += 2) {
sum0 += data[i]; // chain A
sum1 += data[i+1]; // chain B (independent!)
}
double sum = sum0 + sum1;
// 4x parallelism: four independent chains
// CPE ~= latency / 4
double sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
for (int i = 0; i < n; i += 4) {
sum0 += data[i];
sum1 += data[i+1];
sum2 += data[i+2];
sum3 += data[i+3];
}
double sum = (sum0 + sum1) + (sum2 + sum3);
Visualization:
Original (serial):
data[0] --add--> sum --add--> sum --add--> sum --add--> ...
                       ^            ^            ^
                   data[1]      data[2]      data[3]
                   (wait)       (wait)       (wait)
Cycle: 0---3---6---9---12--- ...  (CPE = 3.0)
4x Parallel:
data[0] --add--> sum0
data[1] --add--> sum1    All four chains run
data[2] --add--> sum2    simultaneously!
data[3] --add--> sum3
data[4] --add--> sum0    Next iteration starts
...                      after 3 cycles, not 12
Cycle: 0---3---6---9--- ...  (CPE = 0.75)
2.8 Reassociation Transformations
Reassociation changes the order of operations to expose parallelism:
// Original: Left-to-right evaluation
// acc = ((acc * x[i]) * x[i+1]) * x[i+2]
// Each multiply depends on the previous
for (int i = 0; i < n; i += 2) {
acc = (acc * data[i]) * data[i+1]; // Serial!
}
// Reassociated: Pair independent operations
// acc = acc * (x[i] * x[i+1])
// The (x[i] * x[i+1]) can proceed in parallel with the chain
for (int i = 0; i < n; i += 2) {
acc = acc * (data[i] * data[i+1]); // Partial parallel!
}
Warning: Reassociation changes results for floating-point due to rounding!
Floating-point is NOT associative:
a = 1.0e20
b = -1.0e20
c = 1.0
(a + b) + c = 0 + 1 = 1.0
a + (b + c) = 1.0e20 + (-1.0e20 + 1.0) = 1.0e20 + (-1.0e20) = 0.0
Different results! Compiler won't reassociate unless you explicitly allow it.
Use -ffast-math to permit reassociation, or do it manually (accepting the precision trade-off).
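The rounding effect is easy to see on any IEEE-754 machine; a minimal demonstration of the example above:
#include <stdio.h>
int main(void) {
    double a = 1.0e20, b = -1.0e20, c = 1.0;
    printf("(a + b) + c = %g\n", (a + b) + c);   /* 1 : a and b cancel first     */
    printf("a + (b + c) = %g\n", a + (b + c));   /* 0 : c is lost in rounding    */
    return 0;
}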
2.9 Branch Prediction Mechanics
Modern CPUs predict branch outcomes to avoid pipeline stalls:
Pipeline with branch at cycle 3:
Cycle:  1    2    3    4    5    6    7    8
Fetch: [A]  [B]  [br] [?]  [?]  [?]  [?]  [?]
                  |
                  +-- Predict TAKEN:     fetch from target
                  +-- Predict NOT-TAKEN: fetch next instruction
If prediction is WRONG:
- Pipeline must be flushed
- Restart from correct address
- Penalty: 15-20 cycles on modern CPUs!
Branch Prediction Strategies:
| Strategy | How it Works | Accuracy |
|---|---|---|
| Static | Predict backward branches taken, forward branches not-taken | ~60-70% |
| One-bit | Remember last outcome | ~80% |
| Two-bit (saturating counter) | Change prediction after 2 wrong | ~85-90% |
| Correlating | Consider history of other branches | ~90-95% |
| Neural | Pattern recognition on branch history | ~95%+ |
Branch-Unfriendly Patterns:
// Random condition - terrible for prediction
if (rand() % 2) { ... } // 50% misprediction rate!
// Data-dependent unpredictable branch
for (int i = 0; i < n; i++) {
if (data[i] > threshold) { // depends on data pattern
result += data[i];
}
}
Branch-Friendly Patterns:
// Predictable loop
for (int i = 0; i < 1000; i++) { ... }
// Predict: i < 1000 is TRUE for 999 iterations
// Sorted data makes branches predictable
sort(data);
for (int i = 0; i < n; i++) {
if (data[i] > threshold) { // After some point, always true
result += data[i];
}
}
2.10 Avoiding Branches: Conditional Moves
Replace branches with branchless computation:
// Branch version - misprediction possible
if (a > b) {
max = a;
} else {
max = b;
}
// Branchless version using conditional move
max = (a > b) ? a : b; // May compile to CMOV instruction
// Manual branchless (works always)
int mask = -(a > b); // All 1s if true, all 0s if false
max = (a & mask) | (b & ~mask);
When to use branchless:
- Unpredictable conditions (near 50/50 split)
- Inner loops where branch overhead matters
- When the computation is simple (conditional move overhead is ~2 cycles)
When branches are better:
- Predictable patterns (strongly biased one way)
- Complex alternatives (one path is expensive)
- The work saved by skipping outweighs misprediction cost
2.11 Memory Hierarchy Impact
Performance depends heavily on memory access patterns (Chapter 6):
Memory Hierarchy Latencies (approximate, Intel Skylake):
Level          Size       Latency (cycles)   Bandwidth
-------------------------------------------------------
L1 Cache       32 KB      4 cycles           ~200 GB/s
L2 Cache       256 KB     12 cycles          ~80 GB/s
L3 Cache       8+ MB      40 cycles          ~40 GB/s
Main Memory    GBs        200+ cycles        ~20 GB/s
-------------------------------------------------------
A cache miss costs 50-200x more than a hit!
Locality Principles:
// Temporal locality: recently accessed items accessed again
for (int iter = 0; iter < 100; iter++) {
for (int i = 0; i < N; i++) {
sum += data[i]; // Same data[i] accessed 100 times
} // Stays in cache after first pass
}
// Spatial locality: access nearby memory locations
for (int i = 0; i < N; i++) {
sum += data[i]; // data[i] and data[i+1] are adjacent
} // hardware prefetch of adjacent cache lines helps
// Poor locality (stride pattern)
for (int i = 0; i < N; i += 64) {
sum += data[i]; // Skip cache lines, waste prefetch
}
// Poor locality (random access)
for (int i = 0; i < N; i++) {
sum += data[indices[i]]; // Random access - cache useless
}
Performance Impact:
Sequential access of 1 million doubles:
L1 hit: 1M * 4 cycles = 4M cycles
L3 hit: 1M * 40 cycles = 40M cycles (10x slower)
DRAM: 1M * 200 cycles = 200M cycles (50x slower)
Locality is CRITICAL for performance!
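One direct way to see these effects is to sum the same buffer with increasing stride and watch CPE climb as spatial locality disappears. A rough sketch; the strides to sweep (1, 2, 4, ..., 64) and buffer size are arbitrary choices for the experiment:
#include <stddef.h>
/* Touch every stride-th element; larger strides waste most of each
   64-byte cache line and defeat the hardware prefetcher. */
double strided_sum(const double *data, size_t n, size_t stride) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i += stride)
        sum += data[i];
    return sum;
}
Report CPE per element actually touched so the comparison across strides is fair.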
2.12 Profiling Methodologies
Before optimizing, measure:
1. Time Measurement:
#include <time.h>
// Wall-clock time
clock_t start = clock();
// ... code to measure ...
clock_t end = clock();
double seconds = (double)(end - start) / CLOCKS_PER_SEC;
// More precise: POSIX high-resolution timer
#include <time.h>
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
// ... code to measure ...
clock_gettime(CLOCK_MONOTONIC, &end);
double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
2. Cycle Counting (x86):
#include <x86intrin.h>
unsigned long long start = __rdtsc();
// ... code to measure ...
unsigned long long end = __rdtsc();
unsigned long long cycles = end - start;
3. Hardware Performance Counters (Linux perf):
# Overall stats
perf stat ./my_program
# Detailed cache behavior
perf stat -e cache-misses,cache-references ./my_program
# Branch prediction
perf stat -e branches,branch-misses ./my_program
# Profile and identify hotspots
perf record ./my_program
perf report
2.13 Microbenchmarking Pitfalls
Getting stable measurements is harder than it seems:
Pitfall 1: Dead Code Elimination
// BAD: Compiler may optimize away!
double result = 0;
for (int i = 0; i < N; i++) {
result += expensive_computation(data[i]);
}
// result is never used - compiler removes the loop!
// GOOD: Prevent elimination
volatile double result = 0;
// or
printf("Result: %f\n", result); // Force result to be computed
Pitfall 2: CPU Frequency Scaling
// CPU starts at low frequency, ramps up during benchmark
// First measurements are slower!
// SOLUTION 1: Warmup iterations
for (int i = 0; i < 1000; i++) {
test_function(data); // Warmup - don't measure
}
// Now measure
start = rdtsc();
for (int i = 0; i < TRIALS; i++) {
test_function(data);
}
end = rdtsc();
// SOLUTION 2: Disable frequency scaling
// sudo cpupower frequency-set --governor performance
Pitfall 3: Measurement Overhead
// BAD: Timing overhead dominates for fast operations
for (int i = 0; i < N; i++) {
start = rdtsc();
x = x + 1; // 1 cycle
end = rdtsc(); // ~50 cycles overhead!
total += (end - start);
}
// GOOD: Batch many operations
start = rdtsc();
for (int i = 0; i < 1000000; i++) {
x = x + 1;
}
end = rdtsc();
cycles_per_op = (end - start) / 1000000.0;
Pitfall 4: Cache State Variability
// First access: cold cache (slow)
// Second access: warm cache (fast)
// SOLUTION: Consistent cache state
// Option A: Flush cache before each trial (cold)
// Option B: Warmup loop before measuring (warm)
// Option C: Large enough data to exceed cache (always miss)
// Document which state you're measuring!
Pitfall 5: Variance and Outliers
// Run multiple trials, report minimum or median
double times[TRIALS];
for (int t = 0; t < TRIALS; t++) {
start = now();
test_function(data);
times[t] = now() - start;
}
// Use minimum (represents "best case" without interference)
// Or median (more robust to outliers)
double min_time = minimum(times, TRIALS);
double median_time = median(times, TRIALS);
// Report both and explain methodology!
2.14 Compiler Optimizations Overview
Know what the compiler can do so you don't duplicate its work:
Common Compiler Optimizations:
| Optimization | What It Does | Flag |
|---|---|---|
| Inlining | Replace function call with body | -O1+ |
| Loop unrolling | Replicate loop body | -O2+ or -funroll-loops |
| Constant folding | Compute constants at compile time | -O1+ |
| Dead code elimination | Remove unused code | -O1+ |
| Common subexpression | Reuse computed values | -O1+ |
| Strength reduction | Replace expensive ops (a*2 -> a<<1) | -O1+ |
| Vectorization | Use SIMD instructions | -O3 or -ftree-vectorize |
| Instruction scheduling | Reorder for pipeline efficiency | -O2+ |
Compiler Explorer Strategy:
- Write your original code
- Compile with -O0 and -O3
- Examine assembly (godbolt.org is excellent)
- See what the compiler already does
- Only manually optimize what the compiler misses
Critical Compiler Flags:
# Standard optimization levels
gcc -O0 ... # No optimization (debugging)
gcc -O1 ... # Basic optimizations
gcc -O2 ... # Standard optimizations (recommended for most cases)
gcc -O3 ... # Aggressive optimizations (may increase code size)
gcc -Os ... # Optimize for size
# Architecture-specific
gcc -march=native ... # Use all instructions available on this CPU
gcc -mtune=native ... # Tune for this CPU (without requiring features)
# Floating-point
gcc -ffast-math ... # Allows reassociation, breaks IEEE compliance
gcc -ffp-contract=fast # Allow FMA contraction
# Vectorization
gcc -ftree-vectorize ... # Enable auto-vectorization
gcc -fopt-info-vec ... # Report what was vectorized
3. Project Specification
3.1 What You Will Build
A benchmark suite consisting of:
- Small computational kernels: Array operations that stress different aspects of the CPU
- Measurement infrastructure: Reliable timing with statistical analysis
- Optimization laboratory: Before/after versions of each kernel
- Written report: Explanation of each optimization in architectural terms
3.2 Core Benchmark Kernels
Implement and optimize these kernels (minimum 4):
Kernel 1: Sum Reduction
double sum_naive(double *data, size_t n) {
double sum = 0;
for (size_t i = 0; i < n; i++) {
sum += data[i];
}
return sum;
}
Target: Transform from latency-bound to throughput-bound.
Kernel 2: Polynomial Evaluation (Horner's Rule)
double poly_naive(double *coeffs, size_t degree, double x) {
double result = coeffs[degree];
for (size_t i = degree; i > 0; i--) {
result = result * x + coeffs[i-1];
}
return result;
}
Target: Explore dependency chain limitations.
Kernel 3: Conditional Accumulation
double conditional_sum_naive(double *data, size_t n, double threshold) {
double sum = 0;
for (size_t i = 0; i < n; i++) {
if (data[i] > threshold) {
sum += data[i];
}
}
return sum;
}
Target: Explore branch prediction effects, try branchless.
Kernel 4: Matrix-Vector Multiply
void matvec_naive(double *A, double *x, double *y, size_t n) {
for (size_t i = 0; i < n; i++) {
y[i] = 0;
for (size_t j = 0; j < n; j++) {
y[i] += A[i*n + j] * x[j];
}
}
}
Target: Explore memory access patterns and locality.
Kernel 5: Inner Product
double inner_product_naive(double *a, double *b, size_t n) {
double sum = 0;
for (size_t i = 0; i < n; i++) {
sum += a[i] * b[i];
}
return sum;
}
Target: Two data streams, multiply-add dependency chain.
3.3 Functional Requirements
- Measurement System:
- Multiple trials with statistical reporting
- CPE calculation for each kernel
- Cycle-accurate timing (rdtsc or similar)
- Warmup iterations before measurement
- Kernel Variants:
- At least 3 optimization levels per kernel
- Each variant must produce correct results
- Document the transformation applied
- Report Requirements:
- CPE table for all variants
- Speedup calculation relative to naive
- Explanation of why each optimization works (or doesn't)
- Discussion of how the architectural bottleneck shifted
3.4 Example Output
$ ./perf-clinic --kernel sum --trials 100
=== PERFORMANCE CLINIC: Sum Reduction ===
Array size: 1,000,000 doubles
Trials: 100 (reporting median)
Variant CPE Speedup Bottleneck
-------------------------------------------------------
naive 3.02 1.00x Latency (FP add chain)
unroll_2x 3.01 1.00x Still latency-bound
unroll_4x 2.98 1.01x Minimal improvement
parallel_2acc 1.52 1.99x Latency / 2
parallel_4acc 0.78 3.87x Approaching throughput
parallel_8acc 0.52 5.81x Near throughput bound
vectorized 0.13 23.23x SIMD (4 ops/instr)
-------------------------------------------------------
Analysis:
- Naive version is limited by 3-cycle FP add latency
- Unrolling alone doesn't help (still one dependency chain)
- Multiple accumulators break the dependency chain
- 8 accumulators approach the throughput bound (~0.5 CPE)
- SIMD achieves 4x further by processing 4 elements per instruction
4. Solution Architecture
4.1 High-Level Design
+----------------------------------------------------------------------+
|                              perf-clinic                              |
+----------------------------------------------------------------------+
|                                                                       |
|   +--------------+     +--------------+     +--------------+          |
|   |    Timing    |     |    Kernel    |     |   Analysis   |          |
|   |   Harness    |     |   Library    |     |    Engine    |          |
|   +--------------+     +--------------+     +--------------+          |
|          |                     |                    |                 |
|          v                     v                    v                 |
|   +---------------------------------------------------------------+   |
|   |                       Benchmark Runner                        |   |
|   |   - Load data                                                 |   |
|   |   - Run warmup                                                |   |
|   |   - Execute trials                                            |   |
|   |   - Collect statistics                                        |   |
|   +---------------------------------------------------------------+   |
|                                 |                                      |
|                                 v                                      |
|   +---------------------------------------------------------------+   |
|   |                       Report Generator                        |   |
|   |   - CPE calculations                                          |   |
|   |   - Speedup ratios                                            |   |
|   |   - Statistical summary                                       |   |
|   |   - Bottleneck identification                                 |   |
|   +---------------------------------------------------------------+   |
|                                 |                                      |
|                                 v                                      |
|                       [Console / File Output]                          |
+----------------------------------------------------------------------+
4.2 Key Data Structures
// Timing result for one trial
typedef struct {
uint64_t cycles;
double seconds;
size_t elements;
} trial_result_t;
// Statistics across all trials
typedef struct {
double min_cycles;
double median_cycles;
double mean_cycles;
double stddev_cycles;
double cpe; // cycles per element
size_t num_trials;
} benchmark_stats_t;
// Kernel variant
typedef struct {
const char *name;
const char *description;
void (*setup)(void *ctx); // Optional setup
void (*kernel)(void *ctx); // The actual work
void (*verify)(void *ctx); // Correctness check
void (*teardown)(void *ctx); // Cleanup
} kernel_variant_t;
// Benchmark configuration
typedef struct {
const char *kernel_name;
kernel_variant_t *variants;
size_t num_variants;
size_t element_count;
size_t num_trials;
size_t warmup_iters;
bool verify_results;
} benchmark_config_t;
// Result for one kernel variant
typedef struct {
const char *variant_name;
benchmark_stats_t stats;
bool correct;
double speedup; // relative to baseline
} variant_result_t;
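A sketch of how the variant table might be wired up for the sum kernel. The wrapper names and the context struct are assumptions, not a prescribed API; they only illustrate adapting the kernels to the void(*)(void*) interface.
#include <stddef.h>
/* Context passed to every sum variant. */
typedef struct {
    double *data;
    size_t  n;
    double  result;
} sum_ctx_t;
/* Thin wrappers adapt the double(double*, size_t) kernels to void(void*). */
static void run_sum_naive(void *p)     { sum_ctx_t *c = p; c->result = sum_naive(c->data, c->n); }
static void run_sum_parallel4(void *p) { sum_ctx_t *c = p; c->result = sum_parallel4(c->data, c->n); }
static kernel_variant_t sum_variants[] = {
    { .name = "naive",         .description = "single accumulator", .kernel = run_sum_naive },
    { .name = "parallel_4acc", .description = "4 accumulators",     .kernel = run_sum_parallel4 },
};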
4.3 Module Structure
perf-clinic/
+-- src/
|   +-- main.c              # CLI and orchestration
|   +-- timing.c            # Cycle counting and timing
|   +-- timing.h
|   +-- stats.c             # Statistical analysis
|   +-- stats.h
|   +-- runner.c            # Benchmark execution
|   +-- runner.h
|   +-- report.c            # Output formatting
|   +-- report.h
|   +-- kernels/
|       +-- sum_reduce.c    # Sum variants
|       +-- poly_eval.c     # Polynomial variants
|       +-- conditional.c   # Conditional sum variants
|       +-- matvec.c        # Matrix-vector variants
|       +-- inner_product.c # Inner product variants
+-- include/
|   +-- kernels.h           # Kernel declarations
|   +-- common.h            # Shared definitions
+-- tests/
|   +-- test_correctness.c  # Verify all variants match
|   +-- test_timing.c       # Validate timing infrastructure
+-- data/
|   +-- generate_data.c     # Test data generation
+-- Makefile
+-- README.md
5. Implementation Guide
Phase 1: Measurement Infrastructure (Days 1-3)
Goals:
- Reliable cycle-accurate timing
- Statistical analysis of multiple trials
- Avoid measurement pitfalls
Tasks:
- Implement cycle counter access:
// timing.h
#include <stdint.h>
static inline uint64_t rdtsc(void) {
uint32_t lo, hi;
__asm__ __volatile__ (
"rdtsc" : "=a"(lo), "=d"(hi)
);
return ((uint64_t)hi << 32) | lo;
}
// More robust: rdtscp waits for earlier instructions to complete before reading the counter
static inline uint64_t rdtscp(void) {
uint32_t lo, hi, aux;
__asm__ __volatile__ (
"rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux)
);
return ((uint64_t)hi << 32) | lo;
}
- Implement statistical functions:
// stats.c
double compute_median(double *values, size_t n);
double compute_mean(double *values, size_t n);
double compute_stddev(double *values, size_t n);
double compute_min(double *values, size_t n);
- Implement benchmark harness (declaration below; a sketch of one possible implementation follows the task list):
// runner.c
benchmark_stats_t run_benchmark(
void (*kernel)(void *ctx),
void *ctx,
size_t elements,
size_t trials,
size_t warmup
);
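A minimal run_benchmark sketch, assuming the rdtsc() and stats helpers above and the benchmark_stats_t from Section 4.2. The median is computed by sorting a copy of the trial data; error handling is omitted.
#include <stdint.h>
#include <stdlib.h>
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}
benchmark_stats_t run_benchmark(void (*kernel)(void *), void *ctx,
                                size_t elements, size_t trials, size_t warmup) {
    double *cycles = malloc(trials * sizeof *cycles);
    for (size_t i = 0; i < warmup; i++)          /* warm caches, let the clock ramp up */
        kernel(ctx);
    for (size_t t = 0; t < trials; t++) {
        uint64_t start = rdtsc();
        kernel(ctx);
        cycles[t] = (double)(rdtsc() - start);
    }
    qsort(cycles, trials, sizeof *cycles, cmp_double);
    benchmark_stats_t s = {0};
    s.min_cycles    = cycles[0];
    s.median_cycles = cycles[trials / 2];
    s.mean_cycles   = compute_mean(cycles, trials);
    s.stddev_cycles = compute_stddev(cycles, trials);
    s.cpe           = s.median_cycles / (double)elements;
    s.num_trials    = trials;
    free(cycles);
    return s;
}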
Checkpoint: Can measure a simple loop and get consistent cycle counts across runs.
Phase 2: Baseline Kernels (Days 4-5)
Goals:
- Implement naive versions of all kernels
- Verify correctness
- Establish baseline CPE
Tasks:
- Implement each kernel's naive version
- Create verification functions (compare against known results)
- Generate test data with different patterns (a generator sketch follows the tasks below):
- Sequential (0, 1, 2, โฆ)
- Random uniform
- Random with patterns (sorted, partially sorted)
- Measure baseline CPE:
void measure_baselines(void) {
double data[N];
fill_random(data, N);
benchmark_stats_t sum_stats = run_benchmark(
sum_naive, data, N, 100, 10
);
printf("sum_naive: CPE = %.2f\n", sum_stats.cpe);
}
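The data-pattern generators referenced above can stay small. A sketch; the function names are placeholders rather than a required API:
#include <stdlib.h>
#include <stddef.h>
void fill_sequential(double *data, size_t n) {
    for (size_t i = 0; i < n; i++)
        data[i] = (double)i;
}
void fill_random(double *data, size_t n) {
    for (size_t i = 0; i < n; i++)
        data[i] = (double)rand() / (double)RAND_MAX;   /* uniform in [0, 1] */
}
static int cmp_double_asc(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}
void fill_sorted(double *data, size_t n) {
    fill_random(data, n);
    qsort(data, n, sizeof *data, cmp_double_asc);      /* predictable branch pattern */
}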
Checkpoint: All naive kernels produce correct results. Baseline CPE recorded.
Phase 3: Loop Optimizations (Days 6-8)
Goals:
- Implement unrolled variants
- Implement parallel accumulator variants
- Understand dependency effects
Tasks:
- Implement 2x, 4x, 8x unrolling:
double sum_unroll4(double *data, size_t n) {
double sum = 0;
size_t i;
for (i = 0; i + 4 <= n; i += 4) {   // i + 4 <= n avoids size_t underflow when n < 4
sum += data[i];
sum += data[i+1];
sum += data[i+2];
sum += data[i+3];
}
for (; i < n; i++) {
sum += data[i];
}
return sum;
}
- Implement parallel accumulator variants:
double sum_parallel4(double *data, size_t n) {
double sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
size_t i;
for (i = 0; i + 4 <= n; i += 4) {   // avoid size_t underflow when n < 4
sum0 += data[i];
sum1 += data[i+1];
sum2 += data[i+2];
sum3 += data[i+3];
}
for (; i < n; i++) {
sum0 += data[i];
}
return (sum0 + sum1) + (sum2 + sum3);
}
- Measure and compare CPE for all variants
Checkpoint: Parallel accumulator shows significant speedup. Can explain why unrolling alone doesn't help.
Phase 4: Branch Optimizations (Days 9-10)
Goals:
- Measure branch misprediction effects
- Implement branchless alternatives
- Compare with different data patterns
Tasks:
- Implement branchless conditional:
double conditional_sum_branchless(double *data, size_t n, double threshold) {
double sum = 0;
for (size_t i = 0; i < n; i++) {
// mask is 1.0 or 0.0 depending on the condition; the compiler can
// lower this to a conditional move instead of a branch
double mask = (data[i] > threshold) ? 1.0 : 0.0;
sum += data[i] * mask;
}
return sum;
}
- Test with different data distributions:
- Random (50% pass threshold) - unpredictable
- Sorted (all pass or all fail) - predictable
- Biased (90% one way) - mostly predictable
- Measure with perf stat -e branches,branch-misses:
# High misprediction rate expected for random data
perf stat -e branches,branch-misses ./perf-clinic --kernel conditional --data random
# Low misprediction rate for sorted data
perf stat -e branches,branch-misses ./perf-clinic --kernel conditional --data sorted
Checkpoint: Can demonstrate branch misprediction costs. Branchless version wins for random data.
Phase 5: Memory Optimizations (Days 11-12)
Goals:
- Explore locality effects
- Measure cache miss impact
- Optimize memory access patterns
Tasks:
- Implement matrix-vector with different access patterns:
// Row-major (good cache behavior for row-major storage)
void matvec_rowmajor(double *A, double *x, double *y, size_t n) {
for (size_t i = 0; i < n; i++) {
double sum = 0;
for (size_t j = 0; j < n; j++) {
sum += A[i*n + j] * x[j]; // A accessed sequentially
}
y[i] = sum;
}
}
// Column-major (bad cache behavior for row-major storage)
void matvec_colmajor(double *A, double *x, double *y, size_t n) {
for (size_t i = 0; i < n; i++) y[i] = 0;
for (size_t j = 0; j < n; j++) {
for (size_t i = 0; i < n; i++) {
y[i] += A[i*n + j] * x[j]; // A accessed with stride n
}
}
}
- Measure cache misses:
perf stat -e cache-references,cache-misses ./perf-clinic --kernel matvec --variant rowmajor
perf stat -e cache-references,cache-misses ./perf-clinic --kernel matvec --variant colmajor
- Test with matrices of different sizes (L1, L2, L3, larger than cache)
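The example output in Section 13 also lists a blocked variant. A cache-blocking sketch; the block size of 32 matches that output but is a tuning parameter, not a prescribed value:
#include <stddef.h>
/* Block over columns so a chunk of x stays hot in L1 while many rows reuse it. */
void matvec_blocked(double *A, double *x, double *y, size_t n) {
    const size_t B = 32;                           /* tuning parameter */
    for (size_t i = 0; i < n; i++)
        y[i] = 0;
    for (size_t jj = 0; jj < n; jj += B) {
        size_t jend = (jj + B < n) ? jj + B : n;
        for (size_t i = 0; i < n; i++) {
            double sum = y[i];
            for (size_t j = jj; j < jend; j++)
                sum += A[i*n + j] * x[j];
            y[i] = sum;
        }
    }
}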
Checkpoint: Row-major significantly faster. Cache miss rate correlates with slowdown.
Phase 6: Report and Analysis (Days 13-14)
Goals:
- Document all results
- Explain each optimization architecturally
- Identify when optimizations backfire
Tasks:
- Create comprehensive CPE table
- Write analysis for each kernel:
- What was the bottleneck?
- How did each transformation address it?
- What is the limiting factor now?
- Document cases where optimizations didn't help:
- Unrolling without multiple accumulators
- Branchless for predictable data
- Over-unrolling that causes register pressure
- Produce final report document
Checkpoint: Portfolio-quality report with before/after results and strong architectural narrative.
6. Testing Strategy
6.1 Correctness Testing
Every optimized variant must produce the same result as the naive version:
void test_sum_correctness(void) {
double data[N];
fill_random(data, N);
double expected = sum_naive(data, N);
assert_close(sum_unroll4(data, N), expected, 1e-10);
assert_close(sum_parallel4(data, N), expected, 1e-10);
assert_close(sum_parallel8(data, N), expected, 1e-10);
// Note: floating-point may differ slightly due to reassociation
}
void assert_close(double actual, double expected, double tolerance) {
double diff = fabs(actual - expected);
double rel = diff / (fabs(expected) + 1e-10);
if (rel > tolerance) {
fprintf(stderr, "FAIL: expected %f, got %f (rel diff: %e)\n",
expected, actual, rel);
exit(1);
}
}
6.2 Measurement Validation
Verify timing infrastructure is reliable:
void test_timing_stability(void) {
// Measure a known-duration operation
double times[100];
for (int i = 0; i < 100; i++) {
uint64_t start = rdtsc();
busy_loop(10000); // Calibrated delay
times[i] = rdtsc() - start;
}
double stddev = compute_stddev(times, 100);
double mean = compute_mean(times, 100);
double cv = stddev / mean; // Coefficient of variation
// CV should be < 5% for stable measurements
assert(cv < 0.05);
}
6.3 Compiler Interference Check
Ensure the compiler isn't removing your benchmark code:
void test_not_optimized_away(void) {
volatile double result;
uint64_t start = rdtsc();
result = sum_naive(data, N);
uint64_t cycles = rdtsc() - start;
// A real sum over N elements cannot be nearly free; use a loose lower bound
assert(cycles > N / 2);
// Use result to prevent elimination
assert(result != 12345.6789); // Some value it won't equal
}
6.4 Edge Case Testing
void test_edge_cases(void) {
double data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
// Test with n not divisible by unroll factor
assert_close(sum_parallel4(data, 1), 1.0, 1e-10);
assert_close(sum_parallel4(data, 2), 3.0, 1e-10);
assert_close(sum_parallel4(data, 5), 15.0, 1e-10);
assert_close(sum_parallel4(data, 7), 28.0, 1e-10);
// Test with empty array
assert_close(sum_parallel4(data, 0), 0.0, 1e-10);
// Test with special values
data[0] = INFINITY;
assert(isinf(sum_parallel4(data, 8)));
data[0] = NAN;
assert(isnan(sum_parallel4(data, 8)));
}
7. Common Pitfalls and Debugging
7.1 Measurement Errors
| Pitfall | Symptom | Solution |
|---|---|---|
| Dead code elimination | CPE near 0 | Use volatile or print result |
| Frequency scaling | First runs slower | Warmup iterations, disable scaling |
| Turbo boost inconsistency | High variance | Disable turbo or use more trials |
| Timer overflow | Negative times | Use 64-bit counters |
| Memory prefetching | Arrays too small | Use arrays > L3 cache |
7.2 CPU Frequency Scaling
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Set to performance (requires root)
sudo cpupower frequency-set --governor performance
# Or for all CPUs:
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance | sudo tee $cpu
done
# Disable turbo boost (Intel)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
7.3 Compiler Optimization Interference
# Check what the compiler is doing
gcc -O3 -S -fverbose-asm kernel.c -o kernel.s
# Use Compiler Explorer (https://godbolt.org)
# to visualize assembly
# Prevent specific optimizations
__attribute__((optimize("no-unroll-loops")))
double sum_my_unroll(double *data, size_t n) { ... }
7.4 Memory Alignment Issues
// Ensure proper alignment for SIMD
double *data = aligned_alloc(32, n * sizeof(double));
// Or use compiler attributes
double data[N] __attribute__((aligned(32)));
// Check alignment at runtime
assert(((uintptr_t)data % 32) == 0);
7.5 False Sharing in Parallel Code
// BAD: Accumulators may share cache line
double sums[4]; // Could be on same 64-byte line
// GOOD: Pad to separate cache lines
struct {
double value;
char padding[56]; // 64 - 8 = 56 bytes padding
} sums[4];
8. Extensions
8.1 Beginner Extensions
- Add CSV output: Export results for spreadsheet analysis
- Implement warm/cold cache modes: Measure both scenarios
- Add progress indicator: Show which benchmark is running
- Create comparison plots: Visualize CPE across variants
8.2 Intermediate Extensions
- SIMD implementation: Use intrinsics for vectorized kernels (see the sketch after this list)
- Profile-guided optimization: Use hardware counters to guide changes
- Multi-threaded variants: Parallelize across cores (be careful with reduction!)
- Different data types: Compare float vs double vs int performance
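As a starting point for the SIMD extension, a minimal AVX sum sketch with two vector accumulators (requires -mavx or -march=native on a machine that supports it). Results will differ slightly from the naive sum because of reassociation.
#include <immintrin.h>
#include <stddef.h>
/* 4 doubles per vector, 2 vector accumulators to hide FP add latency. */
double sum_avx(const double *data, size_t n) {
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    size_t i;
    for (i = 0; i + 8 <= n; i += 8) {
        acc0 = _mm256_add_pd(acc0, _mm256_loadu_pd(data + i));
        acc1 = _mm256_add_pd(acc1, _mm256_loadu_pd(data + i + 4));
    }
    double lanes[4];
    _mm256_storeu_pd(lanes, _mm256_add_pd(acc0, acc1));
    double sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)                 /* scalar cleanup for the tail */
        sum += data[i];
    return sum;
}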
8.3 Advanced Extensions
- Auto-tuning framework: Automatically search for best unroll factor
- Roofline model: Plot achieved performance vs hardware limits
- Assembly kernels: Hand-write inner loops in assembly
- Cross-platform support: ARM NEON, AMD specific counters
9. Real-World Connections
9.1 Industry Applications
High-Frequency Trading: Every nanosecond matters. The techniques in this project are exactly what HFT firms use to optimize their systems.
Game Engines: Physics simulations, graphics rendering, and audio processing all depend on tight inner loops.
Machine Learning: Matrix operations in neural networks benefit from the same optimization strategies.
Scientific Computing: Weather modeling, molecular dynamics, and other simulations spend most time in computational kernels.
9.2 Related Tools and Libraries
- perf: Linux performance analysis tool
- Intel VTune: Commercial profiler with deep microarchitecture insight
- PAPI: Portable interface to hardware performance counters
- likwid: Lightweight performance tools for Linux
- BLAS/LAPACK: Highly optimized linear algebra libraries
9.3 Interview Relevance
This project prepares you for questions like:
- "How would you optimize this code?" (Can demonstrate systematic approach)
- "What limits the performance of this loop?" (Can identify bottleneck)
- "Why is this version faster?" (Can explain in terms of architecture)
- "How would you measure performance?" (Know pitfalls and methodology)
- "Explain Amdahl's Law" (Can calculate and apply limits)
10. Resources
10.1 Essential Reading
- CS:APP Chapter 5: "Optimizing Program Performance" - The foundation for this project
- CS:APP Chapter 6: "The Memory Hierarchy" - Understanding cache effects
- CS:APP Chapter 1.9: "Amdahl's Law" - Optimization limits
- Agner Fog's Optimization Manuals: agner.org/optimize - CPU architecture details
10.2 Tools
- perf: man perf - Linux performance analysis
- valgrind --tool=cachegrind: Cache simulation
- Intel Intrinsics Guide: SIMD intrinsics reference
- Compiler Explorer: godbolt.org - See generated assembly
10.3 Video Resources
- MIT 6.172: Performance Engineering of Software Systems (MIT OpenCourseWare)
- What Every Programmer Should Know About Memory: Ulrich Drepper (LWN.net)
- CppCon talks: Search for โoptimizationโ, โperformanceโ
10.4 Related Projects in This Series
- Previous: P1 (Toolchain Explorer) - understanding compilation
- Next: P9 (Cache Lab++) - deep dive into cache behavior
- Related: P7 (Y86-64 Simulator) - CPU pipeline internals
11. Self-Assessment Checklist
Understanding
- I can explain Amdahl's Law and calculate speedup limits
- I understand the difference between latency and throughput
- I can identify when code is latency-bound vs throughput-bound
- I can explain why multiple accumulators help performance
- I understand how branch misprediction affects performance
- I can reason about memory access patterns and cache behavior
Implementation
- My timing infrastructure produces stable, repeatable results
- All optimized variants produce correct results
- I have measured CPE for at least 4 kernels
- Each kernel has at least 3 variants with different optimizations
- I can demonstrate cases where optimizations backfire
Analysis
- My report explains each optimization in architectural terms
- I can predict when an optimization will help or hurt
- I have identified the bottleneck for each kernel variant
- I understand why unrolling alone doesn't help sum reduction
- I can explain the performance difference with different data patterns
Growth
- I used hardware performance counters (perf) to validate understanding
- I examined compiler-generated assembly to understand transformations
- I can now read a loop and estimate its CPE
- I trust measurements over intuition
12. Submission / Completion Criteria
Minimum Viable Completion:
- Measurement infrastructure works reliably
- At least 4 kernels with naive and 2+ optimized variants each
- Correctness tests pass
- Basic CPE table with results
Full Completion:
- All 5 kernels with 4+ variants each
- Statistical analysis (mean, median, stddev)
- Branch and cache measurements using perf
- Written report explaining each optimization
- Cases documented where optimization didn't help
Excellence (Going Above & Beyond):
- SIMD implementations using intrinsics
- Roofline model analysis
- Auto-tuning for optimal parameters
- Cross-architecture comparison (Intel vs AMD vs ARM)
- Assembly-level kernels for maximum performance
13. Real World Outcome
When you complete this project, here is exactly what you will see when running your performance clinic:
$ cd perf-clinic
$ make
gcc -O2 -march=native -Wall -g -c src/main.c -o build/main.o
gcc -O2 -march=native -Wall -g -c src/timing.c -o build/timing.o
gcc -O2 -march=native -Wall -g -c src/kernels/sum_reduce.c -o build/sum_reduce.o
...
gcc -o perf-clinic build/*.o -lm
$ ./perf-clinic --kernel sum --trials 100 --size 1000000
=== PERFORMANCE CLINIC: Sum Reduction ===
Array size: 1,000,000 doubles (8 MB)
Trials: 100 (reporting minimum, median, mean)
CPU: Intel Core i7-10700 @ 3.80GHz
Variant Min CPE Med CPE Mean CPE Speedup Bottleneck
-----------------------------------------------------------------------------
naive 3.01 3.04 3.05 1.00x Latency (FP add)
unroll_2x 3.00 3.02 3.03 1.01x Still latency-bound
unroll_4x 2.98 3.01 3.02 1.01x Unroll doesn't help alone
parallel_2acc 1.51 1.53 1.54 1.99x Latency / 2
parallel_4acc 0.76 0.78 0.79 3.90x Near throughput
parallel_8acc 0.51 0.53 0.54 5.70x Approaching limit
parallel_16acc 0.50 0.52 0.53 5.83x Diminishing returns
-----------------------------------------------------------------------------
Theoretical Bounds:
Latency bound (3-cycle add): CPE >= 3.0
Throughput bound (1 add/cyc): CPE >= 1.0
Memory bound (L1 cached): CPE >= 0.5
Analysis:
naive: Limited by 3-cycle FP add latency (dependency chain)
unroll_*: Unrolling alone cannot break dependency chain
parallel_*: Multiple accumulators create independent chains
Best practical: 8 accumulators achieves ~6x speedup
$ ./perf-clinic --kernel conditional --trials 100 --data random
=== PERFORMANCE CLINIC: Conditional Sum ===
Testing branch prediction effects with random data (50% threshold)
Variant CPE Mispredicts/1K Speedup
------------------------------------------------------------
branch_naive 8.42 498.2 1.00x
branch_sorted 2.15 1.3 3.92x
branchless 3.12 0.0 2.70x
------------------------------------------------------------
Analysis:
Random data: ~50% misprediction rate (498/1000 branches)
Sorted data: Highly predictable, near-perfect prediction
Branchless: No branches to mispredict, constant time
$ perf stat -e cycles,instructions,cache-misses ./perf-clinic --kernel matvec --size 1024
=== PERFORMANCE CLINIC: Matrix-Vector Multiply ===
Matrix: 1024x1024 doubles (8 MB)
Performance counter stats for './perf-clinic':
1,234,567,890 cycles
987,654,321 instructions # 0.80 insn per cycle
12,345 cache-misses
Variant CPE L1 Miss Rate Speedup
------------------------------------------------------------
row_major 1.23 0.1% 1.00x
col_major 8.45 12.4% 0.15x
blocked_32 0.95 0.2% 1.29x
------------------------------------------------------------
Analysis:
col_major: Strided access pattern causes cache thrashing
row_major: Sequential access, good spatial locality
blocked: Even better cache utilization through blocking
$ cat reports/optimization_report.md
# Performance Optimization Report
## Executive Summary
Achieved 5.7x speedup on sum reduction through loop parallelism.
Branch optimization yielded 2.7-3.9x speedup depending on data.
Memory access pattern optimization: 6.5x improvement on matrix ops.
## Key Findings
1. Loop unrolling WITHOUT multiple accumulators provides minimal benefit
2. Dependency chains are the primary bottleneck for reduction operations
3. Branch misprediction costs ~15 cycles on this architecture
4. Cache miss penalty is ~100 cycles for L3, ~200 for DRAM
...
14. The Core Question You're Answering
"What prevents code from running at hardware speed, and how do I systematically identify and remove performance bottlenecks - while proving that my optimizations actually work?"
This question is at the heart of performance engineering. You're learning:
- How to measure performance reliably (avoiding measurement pitfalls)
- How to identify whether code is latency-bound, throughput-bound, or memory-bound
- How compiler optimizations and CPU features interact with your code
- The discipline of hypothesis-driven optimization: measure, hypothesize, transform, verify
15. Concepts You Must Understand First
Before starting this project, ensure you understand these concepts:
| Concept | Where to Learn | Why It's Needed |
|---|---|---|
| Amdahl's Law | CS:APP 1.9.1 | Calculate maximum speedup before optimizing |
| Instruction Pipelining | CS:APP 4.4 | Understand latency vs throughput |
| Data Dependencies | CS:APP 5.5 | Identify what limits instruction-level parallelism |
| Loop Unrolling | CS:APP 5.8 | Know when and why unrolling helps |
| Multiple Accumulators | CS:APP 5.9 | Break dependency chains |
| Branch Prediction | CS:APP 5.12 | Understand misprediction costs |
| Cache Hierarchy | CS:APP 6.1-6.4 | Reason about memory access patterns |
| Condition Codes | CS:APP 3.6.1 | Understand what affects branch behavior |
16. Questions to Guide Your Design
As you develop your benchmark suite, ask yourself:
- Am I measuring what I think I'm measuring? Is the compiler optimizing away my code? Is the timer granular enough?
- What is the theoretical bound? Given latency and throughput, what's the best possible CPE?
- Which bound am I hitting? Am I latency-bound (dependency chain) or throughput-bound (functional units)?
- Is my data realistic? Cache-cold vs cache-hot? Random vs sequential access?
- Can I explain the speedup? For each optimization, can I say exactly why it helped?
- What didn't work? Understanding failures is as important as successes.
- Is this reproducible? Can someone else get the same results?
17. Thinking Exercise
Before you start coding, work through this analysis on paper:
Consider this reduction loop:
double sum_naive(double *data, int n) {
double sum = 0;
for (int i = 0; i < n; i++) {
sum += data[i];
}
return sum;
}
Given:
- FP add latency: 3 cycles
- FP add throughput: 1 per cycle
- Load throughput: 2 per cycle
- Array is cached in L1
Questions to answer:
- Draw the dependency graph for iterations i=0, 1, 2, 3. What depends on what?
- What is the CPE if we're latency-bound?
- What is the CPE if we're throughput-bound?
- This loop is latency-bound. Why? (What creates the dependency chain?)
- If we use 4 accumulators (sum0, sum1, sum2, sum3), what is the new CPE? Why?
- How many accumulators do we need to become throughput-bound?
Expected answers:
- Each sum += depends on the previous: sum_0 -> sum_1 -> sum_2 -> sum_3 (serial chain)
- Throughput-bound: CPE = 1.0 (one add per cycle if no dependencies)
- Each addition waits for the previous sum value - this is a RAW dependency chain
- With 4 accumulators the latency bound drops to 3/4 = 0.75, but the stated throughput bound of 1 add/cycle now dominates, so CPE ~= 1.0
- To hit throughput bound of 1 add/cycle with 3-cycle latency: need 3 accumulators minimum
18. The Interview Questions They'll Ask
After completing this project, you should be able to answer these interview questions:
- "How would you optimize a slow function?"
- First: Profile to find where time is spent (don't guess)
- Calculate Amdahl's Law limit - is optimization worthwhile?
- Identify the bottleneck: latency? throughput? memory? branches?
- Apply targeted transformation: unroll+multiple accumulators for latency, cache blocking for memory
- Measure again to verify improvement
- "What is the difference between latency and throughput?"
- Latency: Time for one operation to complete (start to finish)
- Throughput: Rate at which operations can be started
- Pipelined units have high throughput (a new operation can start every cycle or two) yet still have multi-cycle latency (each individual operation takes time)
- Dependency chains are latency-bound; independent work is throughput-bound
- "Explain Amdahl's Law and give an example."
- Speedup = 1 / ((1-p) + p/S) where p = parallelizable fraction, S = speedup of that fraction
- Example: 90% of time in function, optimize 10x: Speedup = 1/(0.1 + 0.9/10) = 1/0.19 = 5.26x
- Key insight: Serial portion limits maximum speedup (1/(1-p))
- "Why does loop unrolling help performance?"
- Reduces loop overhead (fewer increments, compares, branches)
- More work visible to out-of-order engine
- BUT: doesn't break dependency chains alone
- Need multiple accumulators to truly parallelize work
- "How does branch misprediction affect performance?"
- CPU speculatively executes predicted path
- On mispredict: flush pipeline, restart from correct path
- Cost: 15-20 cycles on modern CPUs
- Random branches = ~50% mispredict rate = significant overhead
- "How do you ensure benchmark results are valid?"
- Use volatile or print results to prevent dead code elimination
- Warm up: run several iterations before measuring
- Multiple trials: report minimum, median, and standard deviation
- Consistent state: control cache (cold vs warm), CPU frequency (disable turbo)
- Verify correctness: optimized code must produce same results
19. Hints in Layers
If you get stuck, reveal hints one at a time. Try each level before moving to the next.
Hint 1 - Reliable Timing
Use rdtsc for cycle counting or clock_gettime(CLOCK_MONOTONIC) for wall time. Always run multiple trials and take the minimum (represents best case without interference).
Hint 2 - Preventing Dead Code Elimination
Declare your result as volatile, or print it at the end. Check the assembly (use -S or godbolt.org) to verify your loop wasn't optimized away.
Hint 3 - Understanding the Bottleneck
If unrolling doesn't help, you're probably latency-bound. Count the operations in your dependency chain. Add accumulators equal to the latency/throughput ratio.
Hint 4 - When Branch Optimization Matters
Only optimize branches if: (1) data is unpredictable, (2) branch is in hot loop, (3) both paths are cheap. Use perf stat -e branch-misses to measure.
20. Books That Will Help
| Topic | Book | Chapter/Section |
|---|---|---|
| Performance Fundamentals | CS:APP 3e | Chapter 5 (entire chapter) |
| Amdahl's Law | CS:APP 3e | Section 1.9.1 |
| Loop Unrolling & Parallelism | CS:APP 3e | Sections 5.8-5.9 |
| Branch Prediction | CS:APP 3e | Section 5.12 |
| Memory Hierarchy | CS:APP 3e | Chapter 6 |
| CPU Microarchitecture | Agner Fog's Manuals | Microarchitecture guide |
| Performance Analysis | "Systems Performance" by Brendan Gregg | Chapters 6-7 |
| Low-Level Optimization | "What Every Programmer Should Know About Memory" | Sections 3-6 |
This guide was expanded from CSAPP_3E_DEEP_LEARNING_PROJECTS.md. For the complete learning path, see the project index.