Project 8: Performance Clinic
Build a benchmark suite of small kernels plus a written optimization report explaining changes in terms of ILP, branch prediction, and locality.
Quick Reference
| Attribute | Value |
|---|---|
| Language | C (alt: Rust, Zig, C++) |
| Difficulty | Advanced |
| Time | 1–2 weeks |
| Chapters | 5, 6, 1 |
| Coolness | ★★★☆☆ Genuinely Clever |
| Portfolio Value | Micro-SaaS/Pro Tool |
Learning Objectives
By completing this project, you will:
- Measure performance reliably: Obtain stable, repeatable timing measurements that resist noise and variance
- Understand CPU execution pipelines: Reason about instruction-level parallelism, functional units, and throughput vs latency
- Master loop transformations: Apply unrolling, reassociation, and accumulator parallelism systematically
- Predict branch behavior: Understand branch prediction mechanics and identify misprediction-prone patterns
- Reason about memory hierarchy: Connect cache behavior (Chapter 6) to observed performance
- Apply Amdahl’s Law: Quantify optimization limits and prioritize improvement efforts
- Avoid “faster by accident”: Distinguish genuine improvements from measurement artifacts
Deep Theoretical Foundation
Amdahl’s Law: The Speed Limit of Optimization
Every optimization has limits. Gene Amdahl formalized this in 1967:
Speedup(S, p) = 1 / ((1 - p) + (p / S))
Where:
- p = fraction of execution time in the optimizable portion
- S = speedup of that portion
- (1 - p) = fraction that cannot be optimized
Example: Your program spends 80% of time in one function. You optimize that function to run 10x faster.
p = 0.80 (80% in optimizable portion)
S = 10 (10x speedup)
Speedup = 1 / (0.20 + 0.80/10)
= 1 / (0.20 + 0.08)
= 1 / 0.28
= 3.57x overall speedup
Even with infinite speedup (S = infinity):
Speedup_max = 1 / (1 - 0.80) = 1 / 0.20 = 5x maximum
You can never exceed 5x no matter how fast you make that 80%!
Key Implications:
- Profile before optimizing: Find the p-dominant portions
- The sequential portion is the ceiling: Reducing (1-p) may matter more than increasing S
- Diminishing returns: Each additional speedup of the hot portion yields less overall benefit
- Always calculate the bound: Before spending a week optimizing, compute the maximum possible gain
Amdahl's Law Visualization:
Overall speedup when portion p is sped up by factor S:
p = 50%: S=2 -> 1.33x S=10 -> 1.82x S=inf -> 2.00x
p = 80%: S=2 -> 1.67x S=10 -> 3.57x S=inf -> 5.00x
p = 90%: S=2 -> 1.82x S=10 -> 5.26x S=inf -> 10.00x
p = 95%: S=2 -> 1.90x S=10 -> 6.90x S=inf -> 20.00x
p = 99%: S=2 -> 1.98x S=10 -> 9.17x S=inf -> 100.00x
The larger p is, the more room for improvement.
Understanding Modern CPU Pipelines
Modern processors execute instructions through a multi-stage pipeline:
Classic 5-Stage Pipeline:
Cycle: 1 2 3 4 5 6 7 8
Inst 1: [IF] [ID] [EX] [MEM][WB]
Inst 2: [IF] [ID] [EX] [MEM][WB]
Inst 3: [IF] [ID] [EX] [MEM][WB]
Inst 4: [IF] [ID] [EX] [MEM][WB]
IF = Instruction Fetch
ID = Instruction Decode
EX = Execute (ALU operation)
MEM = Memory access
WB = Write Back (to register)
Ideally: One instruction completes per cycle (throughput = 1 IPC)
Modern Superscalar Processors:
Real processors are far more complex:
Intel Skylake-class Core (simplified):
┌────────────────────────────────────────────────────────────────┐
│ Front End │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Fetch │───▶│ Decode │───▶│ Rename/ │ │
│ │ (16B/cycle) │ │ (4 inst/cyc)│ │ Allocate │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Reservation Station (Scheduler) │
│ ~100 entries │
└────────────────────────────────────────────────────────────────┘
│
┌────────────────────────────┼────────────────────────────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│ INT 0 │ │ INT 1 │ │ INT 2 │ │ FP 0 │ │ FP 1 │
│ ALU │ │ ALU │ │ ALU │ │ FMA │ │ FMA │
│ BRANCH│ │ BRANCH│ │ MUL │ │ ADD │ │ ADD │
└───────┘ └───────┘ └───────┘ └───────┘ └───────┘
│ │ │ │ │
└────────────┴───────────────┴───────────────┴────────────┘
│
▼
┌────────────────────────────────────────────────────────────────┐
│ Reorder Buffer │
│ (preserves sequential semantics) │
└────────────────────────────────────────────────────────────────┘
Key insight: Multiple instructions execute simultaneously across different functional units. Your code should provide enough independent work to keep these units busy.
Instruction-Level Parallelism (ILP)
ILP is the amount of independent work available at the instruction level.
Data Dependencies Limit ILP:
// Sequential chain - NO parallelism
// Each instruction depends on the previous result
x = a + b; // cycle 1-4 (assume 4-cycle latency)
y = x + c; // cycle 5-8 (must wait for x)
z = y + d; // cycle 9-12 (must wait for y)
// Total: 12 cycles for 3 operations
// Independent operations - FULL parallelism
// No dependencies between operations
x = a + b; // cycles 1-4 ─┬─ all execute
y = c + d; // cycles 1-4 ─┤ simultaneously
z = e + f; // cycles 1-4 ─┘
// Total: 4 cycles for 3 operations (3x speedup)
Dependency Types:
| Type | Also Called | Description | Can Be Eliminated? |
|---|---|---|---|
| RAW | True Dependency | Read After Write: a = 1; b = a; | No (fundamental) |
| WAR | Anti-Dependency | Write After Read: a = b; b = 2; | Yes (register renaming) |
| WAW | Output Dependency | Write After Write: a = 1; a = 2; | Yes (register renaming) |
The processor’s register renaming eliminates WAR and WAW dependencies, but RAW dependencies are fundamental and limit ILP.
Latency vs Throughput
Critical distinction for understanding performance:
Latency: Time for a single operation to complete (cycles)
Throughput: Rate at which new operations can be initiated; commonly quoted as reciprocal throughput, i.e. cycles between issues
Example: Floating-point multiplication on modern Intel
Latency: 4 cycles (one multiply takes 4 cycles start to finish)
Reciprocal throughput: 0.5 cycles (a new multiply can start every 0.5 cycles, i.e. 2 per cycle)
This means the multiplier is pipelined:
Cycle: 1 2 3 4 5 6 7 8
Mult 1: [S1] [S2] [S3] [S4] done
Mult 2: [S1] [S2] [S3] [S4] done
Mult 3: [S1] [S2] [S3] [S4] done
Mult 4: [S1] [S2] [S3] [S4] done
With 8 independent multiplies, throughput bound is:
8 operations / (8 * 0.5 cycles) = 8 / 4 = 2 ops/cycle
But with a dependency chain:
8 operations / (8 * 4 cycles) = 8 / 32 = 0.25 ops/cycle
That's 8x slower!
The Performance Equation:
Time = max(Latency_bound, Throughput_bound)
Latency_bound: Time = n * Latency
(serial chain of n dependent operations)
Throughput_bound: Time = n * ReciprocalThroughput
(n independent operations)
Actual running time is limited by whichever bound is larger.
Goal: Make your code throughput-bound, not latency-bound!
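The two bounds are just arithmetic; a tiny sketch (helper names are mine) reproduces the 8-multiply example, using the convention above that "throughput" figures are quoted in cycles between issues:

```c
/* Cycles for n operations forming one dependency chain. */
double latency_bound_cycles(double n, double latency) {
    return n * latency;
}

/* Cycles for n independent operations; recip_thr is the reciprocal
   throughput in cycles between issues (0.5 = 2 ops per cycle). */
double throughput_bound_cycles(double n, double recip_thr) {
    return n * recip_thr;
}

/* 8 chained multiplies:     latency_bound_cycles(8, 4.0)    -> 32 cycles
   8 independent multiplies: throughput_bound_cycles(8, 0.5) ->  4 cycles
   The chained version is 8x slower, matching the example above. */
```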
CPE: Cycles Per Element
The CPE metric from CS:APP measures efficiency for array/loop computations:
CPE = (Total cycles to process array) / (Number of elements)
Example: Sum an array of 1000 elements
Version A: 4000 cycles -> CPE = 4.0
Version B: 1500 cycles -> CPE = 1.5
Version C: 1000 cycles -> CPE = 1.0
Lower CPE = faster code
Theoretical CPE Limits:
For a loop that performs one floating-point add per element:
for (int i = 0; i < n; i++) {
sum += data[i]; // one FP add per iteration
}
| Bottleneck | Bound | Reason |
|---|---|---|
| Latency (3-cycle add, dependency chain) | CPE >= 3.0 | Each add waits for previous |
| Throughput (1 add/cycle possible) | CPE >= 1.0 | Adder can do 1 per cycle |
| Memory (assuming cached) | CPE >= 0.5 | 2 loads/cycle possible |
If your measured CPE is 3.0, you’re latency-bound. Optimization means breaking the dependency chain.
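Once you have a measured CPE, comparing it against the table's bounds tells you what to fix next. A crude classifier for this specific one-add-per-element loop (the thresholds and return codes are illustrative, not part of the project spec):

```c
/* Map a measured CPE for the FP-add loop onto the table's bounds.
   Returns: "latency-bound"       (>= 3.0, add-latency chain dominates)
            "throughput-bound"    (>= 1.0, adder issue rate dominates)
            "memory-bound"        (>= 0.5, load-port limit)
            "suspect measurement" (below every known bound)           */
const char *diagnose_cpe(double cpe) {
    if (cpe >= 3.0) return "latency-bound";
    if (cpe >= 1.0) return "throughput-bound";
    if (cpe >= 0.5) return "memory-bound";
    return "suspect measurement";
}
```

A CPE below every architectural bound usually means a benchmarking bug (dead-code elimination, mis-scaled element count), not a miracle.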
Loop Unrolling
Loop unrolling reduces loop overhead and exposes more work per iteration:
// Original loop
for (int i = 0; i < n; i++) {
sum += data[i];
}
// 2x unrolled
for (int i = 0; i < n; i += 2) {
sum += data[i];
sum += data[i+1];
}
// 4x unrolled
for (int i = 0; i < n; i += 4) {
sum += data[i];
sum += data[i+1];
sum += data[i+2];
sum += data[i+3];
}
Benefits:
- Fewer loop counter increments and comparisons
- More work visible to the out-of-order engine
- Better instruction cache utilization (fewer branches)
But still latency-bound! All adds still chain through sum. Unrolling alone may not help.
Reassociation and Multiple Accumulators
The key to breaking latency bounds is parallelizing the accumulation:
// Original: all adds chain through one accumulator
// CPE ~= latency of addition (3-4 cycles)
double sum = 0;
for (int i = 0; i < n; i++) {
sum += data[i]; // sum = sum + data[i]
}
// 2x parallelism: two independent accumulator chains
// CPE ~= latency / 2
double sum0 = 0, sum1 = 0;
for (int i = 0; i < n; i += 2) {
sum0 += data[i]; // chain A
sum1 += data[i+1]; // chain B (independent!)
}
double sum = sum0 + sum1;
// 4x parallelism: four independent chains
// CPE ~= latency / 4
double sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
for (int i = 0; i < n; i += 4) {
sum0 += data[i];
sum1 += data[i+1];
sum2 += data[i+2];
sum3 += data[i+3];
}
double sum = (sum0 + sum1) + (sum2 + sum3);
Visualization:
Original (serial):
data[0] ──add──▶ sum ──add──▶ sum ──add──▶ sum ──add──▶ ...
│ data[1] │ data[2] │ data[3]
wait wait wait
Cycle: 0───3───6───9───12───... (CPE = 3.0)
4x Parallel:
data[0] ──add──▶ sum0
data[1] ──add──▶ sum1 All four chains run
data[2] ──add──▶ sum2 simultaneously!
data[3] ──add──▶ sum3
│
data[4] ──add──▶ sum0 Next iteration starts
... after 3 cycles, not 12
Cycle: 0───3───6───9───... (CPE = 0.75)
Reassociation Transformations
Reassociation changes the order of operations to expose parallelism:
// Original: Left-to-right evaluation
// acc = ((acc * x[i]) * x[i+1]) * x[i+2]
// Each multiply depends on the previous
for (int i = 0; i < n; i += 2) {
acc = (acc * data[i]) * data[i+1]; // Serial!
}
// Reassociated: Pair independent operations
// acc = acc * (x[i] * x[i+1])
// The (x[i] * x[i+1]) can proceed in parallel with the chain
for (int i = 0; i < n; i += 2) {
acc = acc * (data[i] * data[i+1]); // Partial parallel!
}
Warning: Reassociation changes results for floating-point due to rounding!
Floating-point is NOT associative:
a = 1.0e20
b = -1.0e20
c = 1.0
(a + b) + c = 0 + 1 = 1.0
a + (b + c) = 1.0e20 + (-1.0e20 + 1.0) = 1.0e20 + (-1.0e20) = 0.0
Different results! Compiler won't reassociate unless you explicitly allow it.
Use -ffast-math to permit reassociation, or do it manually (accepting the precision trade-off).
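The non-associativity example above is fully deterministic and easy to check on any IEEE 754 machine; the helper names below are mine:

```c
/* Two associations of the same three-term sum. */
double sum_left(double a, double b, double c)  { return (a + b) + c; }
double sum_right(double a, double b, double c) { return a + (b + c); }

/* With a = 1e20, b = -1e20, c = 1.0:
   sum_left:  a + b cancels exactly to 0.0, then + 1.0 gives 1.0
   sum_right: b + c rounds back to b, since 1.0 is far below the ulp
              of 1e20 (~16384), so the final sum cancels to 0.0     */
```

This is exactly the rounding difference the compiler refuses to introduce without `-ffast-math`.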
Branch Prediction Mechanics
Modern CPUs predict branch outcomes to avoid pipeline stalls:
Pipeline with branch at cycle 3:
Cycle: 1 2 3 4 5 6 7 8
Fetch: [A] [B] [br] [?] [?] [?] [?] [?]
│
├── Predict TAKEN: fetch from target
└── Predict NOT-TAKEN: fetch next instruction
If prediction is WRONG:
- Pipeline must be flushed
- Restart from correct address
- Penalty: 15-20 cycles on modern CPUs!
Branch Prediction Strategies:
| Strategy | How it Works | Accuracy |
|---|---|---|
| Static (taken) | Always predict backward branches taken, forward not-taken | ~60-70% |
| One-bit | Remember last outcome | ~80% |
| Two-bit (saturating counter) | Change prediction after 2 wrong | ~85-90% |
| Correlating | Consider history of other branches | ~90-95% |
| Neural | Pattern recognition on branch history | ~95%+ |
Branch-Unfriendly Patterns:
// Random condition - terrible for prediction
if (rand() % 2) { ... } // 50% misprediction rate!
// Data-dependent unpredictable branch
for (int i = 0; i < n; i++) {
if (data[i] > threshold) { // depends on data pattern
result += data[i];
}
}
Branch-Friendly Patterns:
// Predictable loop
for (int i = 0; i < 1000; i++) { ... }
// Predict: i < 1000 is TRUE for 999 iterations
// Sorted data makes branches predictable
sort(data);
for (int i = 0; i < n; i++) {
if (data[i] > threshold) { // After some point, always true
result += data[i];
}
}
Avoiding Branches: Conditional Moves
Replace branches with branchless computation:
// Branch version - misprediction possible
if (a > b) {
max = a;
} else {
max = b;
}
// Branchless version using conditional move
max = (a > b) ? a : b; // May compile to CMOV instruction
// Manual branchless (works always)
int mask = -(a > b); // All 1s if true, all 0s if false
max = (a & mask) | (b & ~mask);
When to use branchless:
- Unpredictable conditions (near 50/50 split)
- Inner loops where branch overhead matters
- When the computation is simple (conditional move overhead is ~2 cycles)
When branches are better:
- Predictable patterns (strongly biased one way)
- Complex alternatives (one path is expensive)
- The work saved by skipping outweighs misprediction cost
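The mask trick from above is worth a standalone self-check; this sketch assumes two's-complement int (which every mainstream ABI uses, and C23 mandates):

```c
/* Branchless max: mask is all-1s iff a > b, all-0s otherwise. */
int branchless_max(int a, int b) {
    int mask = -(a > b);              /* (a > b) is 0 or 1, so mask is 0 or -1 */
    return (a & mask) | (b & ~mask);  /* selects a when mask is all-1s, else b */
}
```

With `-O2` the ternary version usually compiles to the same CMOV anyway; the manual form is mainly useful when you need to guarantee branch-free code.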
Memory Hierarchy Impact
Performance depends heavily on memory access patterns (Chapter 6):
Memory Hierarchy Latencies (approximate, Intel Skylake):
Level Size Latency (cycles) Bandwidth
─────────────────────────────────────────────────────────
L1 Cache 32 KB 4 cycles ~200 GB/s
L2 Cache 256 KB 12 cycles ~80 GB/s
L3 Cache 8+ MB 40 cycles ~40 GB/s
Main Memory GBs 200+ cycles ~20 GB/s
─────────────────────────────────────────────────────────
A cache miss costs 50-200x more than a hit!
Locality Principles:
// Temporal locality: recently accessed items accessed again
for (int iter = 0; iter < 100; iter++) {
for (int i = 0; i < N; i++) {
sum += data[i]; // Same data[i] accessed 100 times
} // Stays in cache after first pass
}
// Spatial locality: access nearby memory locations
for (int i = 0; i < N; i++) {
sum += data[i]; // data[i] and data[i+1] are adjacent
} // Cache line prefetched helps
// Poor locality (stride pattern)
for (int i = 0; i < N; i += 64) {
sum += data[i]; // Skip cache lines, waste prefetch
}
// Poor locality (random access)
for (int i = 0; i < N; i++) {
sum += data[indices[i]]; // Random access - cache useless
}
Performance Impact:
Sequential access of 1 million doubles:
L1 hit: 1M * 4 cycles = 4M cycles
L3 hit: 1M * 40 cycles = 40M cycles (10x slower)
DRAM: 1M * 200 cycles = 200M cycles (50x slower)
Locality is CRITICAL for performance!
Profiling Methodologies
Before optimizing, measure:
1. Time Measurement:
#include <time.h>
// CPU time (note: clock() measures processor time, not wall-clock time)
clock_t start = clock();
// ... code to measure ...
clock_t end = clock();
double seconds = (double)(end - start) / CLOCKS_PER_SEC;
// More precise: POSIX high-resolution timer
#include <time.h>
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
// ... code to measure ...
clock_gettime(CLOCK_MONOTONIC, &end);
double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
2. Cycle Counting (x86):
#include <x86intrin.h>
unsigned long long start = __rdtsc();
// ... code to measure ...
unsigned long long end = __rdtsc();
unsigned long long cycles = end - start;
3. Hardware Performance Counters (Linux perf):
# Overall stats
perf stat ./my_program
# Detailed cache behavior
perf stat -e cache-misses,cache-references ./my_program
# Branch prediction
perf stat -e branches,branch-misses ./my_program
# Profile and identify hotspots
perf record ./my_program
perf report
Microbenchmarking Pitfalls
Getting stable measurements is harder than it seems:
Pitfall 1: Dead Code Elimination
// BAD: Compiler may optimize away!
double result = 0;
for (int i = 0; i < N; i++) {
result += expensive_computation(data[i]);
}
// result is never used - compiler removes the loop!
// GOOD: Prevent elimination
volatile double result = 0;
// or
printf("Result: %f\n", result); // Force result to be computed
Pitfall 2: CPU Frequency Scaling
// CPU starts at low frequency, ramps up during benchmark
// First measurements are slower!
// SOLUTION 1: Warmup iterations
for (int i = 0; i < 1000; i++) {
test_function(data); // Warmup - don't measure
}
// Now measure
start = rdtsc();
for (int i = 0; i < TRIALS; i++) {
test_function(data);
}
end = rdtsc();
// SOLUTION 2: Disable frequency scaling
// sudo cpupower frequency-set --governor performance
Pitfall 3: Measurement Overhead
// BAD: Timing overhead dominates for fast operations
for (int i = 0; i < N; i++) {
start = rdtsc();
x = x + 1; // 1 cycle
end = rdtsc(); // ~50 cycles overhead!
total += (end - start);
}
// GOOD: Batch many operations
start = rdtsc();
for (int i = 0; i < 1000000; i++) {
x = x + 1;
}
end = rdtsc();
cycles_per_op = (end - start) / 1000000.0;
Pitfall 4: Cache State Variability
// First access: cold cache (slow)
// Second access: warm cache (fast)
// SOLUTION: Consistent cache state
// Option A: Flush cache before each trial (cold)
// Option B: Warmup loop before measuring (warm)
// Option C: Large enough data to exceed cache (always miss)
// Document which state you're measuring!
Pitfall 5: Variance and Outliers
// Run multiple trials, report minimum or median
double times[TRIALS];
for (int t = 0; t < TRIALS; t++) {
start = now();
test_function(data);
times[t] = now() - start;
}
// Use minimum (represents "best case" without interference)
// Or median (more robust to outliers)
double min_time = minimum(times, TRIALS);
double median_time = median(times, TRIALS);
// Report both and explain methodology!
Compiler Optimizations Overview
Know what the compiler can do so you don’t duplicate its work:
Common Compiler Optimizations:
| Optimization | What It Does | Flag |
|---|---|---|
| Inlining | Replace function call with body | -O1+ |
| Loop unrolling | Replicate loop body | -O2+ or -funroll-loops |
| Constant folding | Compute constants at compile time | -O1+ |
| Dead code elimination | Remove unused code | -O1+ |
| Common subexpression | Reuse computed values | -O1+ |
| Strength reduction | Replace expensive ops (a*2 -> a<<1) | -O1+ |
| Vectorization | Use SIMD instructions | -O3 or -ftree-vectorize |
| Instruction scheduling | Reorder for pipeline efficiency | -O2+ |
Compiler Explorer Strategy:
- Write your original code
- Compile with -O0 and -O3
- Examine assembly (godbolt.org is excellent)
- See what the compiler already does
- Only manually optimize what the compiler misses
Critical Compiler Flags:
# Standard optimization levels
gcc -O0 ... # No optimization (debugging)
gcc -O1 ... # Basic optimizations
gcc -O2 ... # Standard optimizations (recommended for most cases)
gcc -O3 ... # Aggressive optimizations (may increase code size)
gcc -Os ... # Optimize for size
# Architecture-specific
gcc -march=native ... # Use all instructions available on this CPU
gcc -mtune=native ... # Tune for this CPU (without requiring features)
# Floating-point
gcc -ffast-math ... # Allows reassociation, breaks IEEE compliance
gcc -ffp-contract=fast # Allow FMA contraction
# Vectorization
gcc -ftree-vectorize ... # Enable auto-vectorization
gcc -fopt-info-vec ... # Report what was vectorized
Project Specification
What You Will Build
A benchmark suite consisting of:
- Small computational kernels: Array operations that stress different aspects of the CPU
- Measurement infrastructure: Reliable timing with statistical analysis
- Optimization laboratory: Before/after versions of each kernel
- Written report: Explanation of each optimization in architectural terms
Core Benchmark Kernels
Implement and optimize these kernels (minimum 4):
Kernel 1: Sum Reduction
double sum_naive(double *data, size_t n) {
double sum = 0;
for (size_t i = 0; i < n; i++) {
sum += data[i];
}
return sum;
}
Target: Transform from latency-bound to throughput-bound.
Kernel 2: Polynomial Evaluation (Horner’s Rule)
double poly_naive(double *coeffs, size_t degree, double x) {
double result = coeffs[degree];
for (size_t i = degree; i > 0; i--) {
result = result * x + coeffs[i-1];
}
return result;
}
Target: Explore dependency chain limitations.
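One way to attack Horner's serial chain, sketched here as an illustration (the function name and this particular split are mine, not a required variant): evaluate even- and odd-indexed coefficients as two independent Horner chains in x², then combine. This halves the chain length at the cost of one extra multiply and add, and changes rounding the same way any reassociation does.

```c
#include <stddef.h>

/* p(x) = E(x^2) + x * O(x^2): two independent Horner chains. */
double poly_split2(const double *coeffs, size_t degree, double x) {
    double x2 = x * x;
    double even = 0.0, odd = 0.0;
    /* Walk from the top coefficient down. Each chain multiplies by x^2
       exactly once per step, because consecutive even (or odd) indices
       are two apart. */
    for (size_t i = degree; ; i--) {
        if (i % 2 == 0) even = even * x2 + coeffs[i];
        else            odd  = odd  * x2 + coeffs[i];
        if (i == 0) break;
    }
    return even + x * odd;
}
```

For example, coeffs {1, 2, 3, 4} (i.e. 1 + 2x + 3x² + 4x³) at x = 2 evaluates to 49 either way.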
Kernel 3: Conditional Accumulation
double conditional_sum_naive(double *data, size_t n, double threshold) {
double sum = 0;
for (size_t i = 0; i < n; i++) {
if (data[i] > threshold) {
sum += data[i];
}
}
return sum;
}
Target: Explore branch prediction effects, try branchless.
Kernel 4: Matrix-Vector Multiply
void matvec_naive(double *A, double *x, double *y, size_t n) {
for (size_t i = 0; i < n; i++) {
y[i] = 0;
for (size_t j = 0; j < n; j++) {
y[i] += A[i*n + j] * x[j];
}
}
}
Target: Explore memory access patterns and locality.
Kernel 5: Inner Product
double inner_product_naive(double *a, double *b, size_t n) {
double sum = 0;
for (size_t i = 0; i < n; i++) {
sum += a[i] * b[i];
}
return sum;
}
Target: Two data streams, multiply-add dependency chain.
Functional Requirements
- Measurement System:
- Multiple trials with statistical reporting
- CPE calculation for each kernel
- Cycle-accurate timing (rdtsc or similar)
- Warmup iterations before measurement
- Kernel Variants:
- At least 3 optimization levels per kernel
- Each variant must produce correct results
- Document the transformation applied
- Report Requirements:
- CPE table for all variants
- Speedup calculation relative to naive
- Explanation of why each optimization works (or doesn’t)
- Discussion of architectural bottleneck shifted
Example Output
$ ./perf-clinic --kernel sum --trials 100
=== PERFORMANCE CLINIC: Sum Reduction ===
Array size: 1,000,000 doubles
Trials: 100 (reporting median)
Variant CPE Speedup Bottleneck
───────────────────────────────────────────────────────
naive 3.02 1.00x Latency (FP add chain)
unroll_2x 3.01 1.00x Still latency-bound
unroll_4x 2.98 1.01x Minimal improvement
parallel_2acc 1.52 1.99x Latency / 2
parallel_4acc 0.78 3.87x Approaching throughput
parallel_8acc 0.52 5.81x Near throughput bound
vectorized 0.13 23.23x SIMD (4 ops/instr)
───────────────────────────────────────────────────────
Analysis:
- Naive version is limited by 3-cycle FP add latency
- Unrolling alone doesn't help (still one dependency chain)
- Multiple accumulators break the dependency chain
- 8 accumulators approach the throughput bound (~0.5 CPE)
- SIMD achieves 4x further by processing 4 elements per instruction
Real World Outcome
When you complete this project, here is exactly what you will see when running your performance clinic:
$ cd perf-clinic
$ make
gcc -O2 -march=native -Wall -g -c src/main.c -o build/main.o
gcc -O2 -march=native -Wall -g -c src/timing.c -o build/timing.o
gcc -O2 -march=native -Wall -g -c src/kernels/sum_reduce.c -o build/sum_reduce.o
...
gcc -o perf-clinic build/*.o -lm
$ ./perf-clinic --kernel sum --trials 100 --size 1000000
=== PERFORMANCE CLINIC: Sum Reduction ===
Array size: 1,000,000 doubles (8 MB)
Trials: 100 (reporting minimum, median, mean)
CPU: Intel Core i7-10700 @ 3.80GHz
Variant Min CPE Med CPE Mean CPE Speedup Bottleneck
─────────────────────────────────────────────────────────────────────────────
naive 3.01 3.04 3.05 1.00x Latency (FP add)
unroll_2x 3.00 3.02 3.03 1.01x Still latency-bound
unroll_4x 2.98 3.01 3.02 1.01x Unroll doesn't help alone
parallel_2acc 1.51 1.53 1.54 1.99x Latency / 2
parallel_4acc 0.76 0.78 0.79 3.90x Near throughput
parallel_8acc 0.51 0.53 0.54 5.70x Approaching limit
parallel_16acc 0.50 0.52 0.53 5.83x Diminishing returns
─────────────────────────────────────────────────────────────────────────────
Theoretical Bounds:
Latency bound (3-cycle add): CPE >= 3.0
Throughput bound (1 add/cyc): CPE >= 1.0
Memory bound (L1 cached): CPE >= 0.5
Analysis:
naive: Limited by 3-cycle FP add latency (dependency chain)
unroll_*: Unrolling alone cannot break dependency chain
parallel_*: Multiple accumulators create independent chains
Best practical: 8 accumulators achieves ~6x speedup
$ ./perf-clinic --kernel conditional --trials 100 --data random
=== PERFORMANCE CLINIC: Conditional Sum ===
Testing branch prediction effects with random data (50% threshold)
Variant CPE Mispredicts/1K Speedup
────────────────────────────────────────────────────────────
branch_naive 8.42 498.2 1.00x
branch_sorted 2.15 1.3 3.92x
branchless 3.12 0.0 2.70x
────────────────────────────────────────────────────────────
Analysis:
Random data: ~50% misprediction rate (498/1000 branches)
Sorted data: Highly predictable, near-perfect prediction
Branchless: No branches to mispredict, constant time
$ perf stat -e cycles,instructions,cache-misses ./perf-clinic --kernel matvec --size 1024
=== PERFORMANCE CLINIC: Matrix-Vector Multiply ===
Matrix: 1024x1024 doubles (8 MB)
Performance counter stats for './perf-clinic':
1,234,567,890 cycles
987,654,321 instructions # 0.80 insn per cycle
12,345 cache-misses
Variant CPE L1 Miss Rate Speedup
────────────────────────────────────────────────────────────
row_major 1.23 0.1% 1.00x
col_major 8.45 12.4% 0.15x
blocked_32 0.95 0.2% 1.29x
────────────────────────────────────────────────────────────
Analysis:
col_major: Strided access pattern causes cache thrashing
row_major: Sequential access, good spatial locality
blocked: Even better cache utilization through blocking
$ cat reports/optimization_report.md
# Performance Optimization Report
---
## Solution Architecture
### High-Level Design
                     perf-clinic

┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│    Timing    │   │    Kernel    │   │   Analysis   │
│   Harness    │   │   Library    │   │    Engine    │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          ▼
┌─────────────────────────────────────────────────────┐
│                  Benchmark Runner                   │
│  - Load data                                        │
│  - Run warmup                                       │
│  - Execute trials                                   │
│  - Collect statistics                               │
└─────────────────────────┬───────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────┐
│                  Report Generator                   │
│  - CPE calculations                                 │
│  - Speedup ratios                                   │
│  - Statistical summary                              │
│  - Bottleneck identification                        │
└─────────────────────────┬───────────────────────────┘
                          ▼
               [Console / File Output]
### Key Data Structures
```c
// Timing result for one trial
typedef struct {
uint64_t cycles;
double seconds;
size_t elements;
} trial_result_t;
// Statistics across all trials
typedef struct {
double min_cycles;
double median_cycles;
double mean_cycles;
double stddev_cycles;
double cpe; // cycles per element
size_t num_trials;
} benchmark_stats_t;
// Kernel variant
typedef struct {
const char *name;
const char *description;
void (*setup)(void *ctx); // Optional setup
void (*kernel)(void *ctx); // The actual work
void (*verify)(void *ctx); // Correctness check
void (*teardown)(void *ctx); // Cleanup
} kernel_variant_t;
// Benchmark configuration
typedef struct {
const char *kernel_name;
kernel_variant_t *variants;
size_t num_variants;
size_t element_count;
size_t num_trials;
size_t warmup_iters;
bool verify_results;
} benchmark_config_t;
// Result for one kernel variant
typedef struct {
const char *variant_name;
benchmark_stats_t stats;
bool correct;
double speedup; // relative to baseline
} variant_result_t;
```
Module Structure
perf-clinic/
├── src/
│ ├── main.c # CLI and orchestration
│ ├── timing.c # Cycle counting and timing
│ ├── timing.h
│ ├── stats.c # Statistical analysis
│ ├── stats.h
│ ├── runner.c # Benchmark execution
│ ├── runner.h
│ ├── report.c # Output formatting
│ ├── report.h
│ └── kernels/
│ ├── sum_reduce.c # Sum variants
│ ├── poly_eval.c # Polynomial variants
│ ├── conditional.c # Conditional sum variants
│ ├── matvec.c # Matrix-vector variants
│ └── inner_product.c # Inner product variants
├── include/
│ ├── kernels.h # Kernel declarations
│ └── common.h # Shared definitions
├── tests/
│ ├── test_correctness.c # Verify all variants match
│ └── test_timing.c # Validate timing infrastructure
├── data/
│ └── generate_data.c # Test data generation
├── Makefile
└── README.md
Implementation Guide
Phase 1: Measurement Infrastructure (Days 1-3)
Goals:
- Reliable cycle-accurate timing
- Statistical analysis of multiple trials
- Avoid measurement pitfalls
Tasks:
- Implement cycle counter access:
// timing.h
#include <stdint.h>
static inline uint64_t rdtsc(void) {
uint32_t lo, hi;
__asm__ __volatile__ (
"rdtsc" : "=a"(lo), "=d"(hi)
);
return ((uint64_t)hi << 32) | lo;
}
// More precise: rdtscp waits for earlier instructions to complete
// before reading the counter (partially serializing)
static inline uint64_t rdtscp(void) {
uint32_t lo, hi, aux;
__asm__ __volatile__ (
"rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux)
);
return ((uint64_t)hi << 32) | lo;
}
- Implement statistical functions:
// stats.c
double compute_median(double *values, size_t n);
double compute_mean(double *values, size_t n);
double compute_stddev(double *values, size_t n);
double compute_min(double *values, size_t n);
- Implement benchmark harness:
// runner.c
benchmark_stats_t run_benchmark(
void (*kernel)(void *ctx),
void *ctx,
size_t elements,
size_t trials,
size_t warmup
);
Checkpoint: Can measure a simple loop and get consistent cycle counts across runs.
Phase 2: Baseline Kernels (Days 4-5)
Goals:
- Implement naive versions of all kernels
- Verify correctness
- Establish baseline CPE
Tasks:
- Implement each kernel’s naive version
- Create verification functions (compare against known results)
- Generate test data with different patterns:
- Sequential (0, 1, 2, …)
- Random uniform
- Random with patterns (sorted, partially sorted)
- Measure baseline CPE:
void measure_baselines(void) {
double data[N];
fill_random(data, N);
benchmark_stats_t sum_stats = run_benchmark(
sum_naive, data, N, 100, 10
);
printf("sum_naive: CPE = %.2f\n", sum_stats.cpe);
}
Checkpoint: All naive kernels produce correct results. Baseline CPE recorded.
Phase 3: Loop Optimizations (Days 6-8)
Goals:
- Implement unrolled variants
- Implement parallel accumulator variants
- Understand dependency effects
Tasks:
- Implement 2x, 4x, 8x unrolling:
double sum_unroll4(double *data, size_t n) {
double sum = 0;
size_t i;
for (i = 0; i + 4 <= n; i += 4) {   /* note: i < n - 3 underflows for n < 3 (size_t) */
sum += data[i];
sum += data[i+1];
sum += data[i+2];
sum += data[i+3];
}
for (; i < n; i++) {
sum += data[i];
}
return sum;
}
- Implement parallel accumulator variants:
double sum_parallel4(double *data, size_t n) {
double sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
size_t i;
for (i = 0; i + 4 <= n; i += 4) {   /* note: i < n - 3 underflows for n < 3 (size_t) */
sum0 += data[i];
sum1 += data[i+1];
sum2 += data[i+2];
sum3 += data[i+3];
}
for (; i < n; i++) {
sum0 += data[i];
}
return (sum0 + sum1) + (sum2 + sum3);
}
- Measure and compare CPE for all variants
Checkpoint: Parallel accumulator shows significant speedup. Can explain why unrolling alone doesn’t help.
Phase 4: Branch Optimizations (Days 9-10)
Goals:
- Measure branch misprediction effects
- Implement branchless alternatives
- Compare with different data patterns
Tasks:
- Implement branchless conditional:
double conditional_sum_branchless(double *data, size_t n, double threshold) {
double sum = 0;
for (size_t i = 0; i < n; i++) {
// 1.0 or 0.0 depending on the condition; this typically compiles
// to a compare plus conditional move rather than a branch
double mask = (data[i] > threshold) ? 1.0 : 0.0;
sum += data[i] * mask;
}
return sum;
}
- Test with different data distributions:
- Random (50% pass threshold) - unpredictable
- Sorted (all pass or all fail) - predictable
- Biased (90% one way) - mostly predictable
- Measure with
perf stat -e branches,branch-misses:
# High misprediction rate expected for random data
perf stat -e branches,branch-misses ./perf-clinic --kernel conditional --data random
# Low misprediction rate for sorted data
perf stat -e branches,branch-misses ./perf-clinic --kernel conditional --data sorted
Checkpoint: Can demonstrate branch misprediction costs. Branchless version wins for random data.
Phase 5: Memory Optimizations (Days 11-12)
Goals:
- Explore locality effects
- Measure cache miss impact
- Optimize memory access patterns
Tasks:
- Implement matrix-vector with different access patterns:
// Row-major (good cache behavior for row-major storage)
void matvec_rowmajor(double *A, double *x, double *y, size_t n) {
for (size_t i = 0; i < n; i++) {
double sum = 0;
for (size_t j = 0; j < n; j++) {
sum += A[i*n + j] * x[j]; // A accessed sequentially
}
y[i] = sum;
}
}
// Column-major (bad cache behavior for row-major storage)
void matvec_colmajor(double *A, double *x, double *y, size_t n) {
for (size_t i = 0; i < n; i++) y[i] = 0;
for (size_t j = 0; j < n; j++) {
for (size_t i = 0; i < n; i++) {
y[i] += A[i*n + j] * x[j]; // A accessed with stride n
}
}
}
- Measure cache misses:
perf stat -e cache-references,cache-misses ./perf-clinic --kernel matvec --variant rowmajor
perf stat -e cache-references,cache-misses ./perf-clinic --kernel matvec --variant colmajor
- Test with matrices of different sizes (L1, L2, L3, larger than cache)
Checkpoint: Row-major significantly faster. Cache miss rate correlates with slowdown.
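For the size sweep in the last task, one way to pick matrix dimensions is to compute the working set and target each cache level. The cache sizes in the comments are illustrative; check your machine with `lscpu` or `getconf -a | grep CACHE`:

```c
#include <stdio.h>
#include <stddef.h>

/* Working-set size in KiB for an n x n matrix of doubles. */
double working_set_kib(size_t n) {
    return (double)n * n * sizeof(double) / 1024.0;
}

/* Sweep sizes that land roughly in L1, L2, L3, and DRAM on a typical
   desktop part (32 KiB L1d / 512 KiB L2 / 8 MiB L3 -- illustrative). */
void sweep_sizes(void) {
    size_t sizes[] = {64, 256, 1024, 4096}; /* 32 KiB, 512 KiB, 8 MiB, 128 MiB */
    for (size_t s = 0; s < sizeof sizes / sizeof sizes[0]; s++) {
        printf("n=%zu  working set = %.0f KiB\n",
               sizes[s], working_set_kib(sizes[s]));
        /* ... time matvec_rowmajor / matvec_colmajor at this size ... */
    }
}
```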
Phase 6: Report and Analysis (Days 13-14)
Goals:
- Document all results
- Explain each optimization architecturally
- Identify when optimizations backfire
Tasks:
- Create comprehensive CPE table
- Write analysis for each kernel:
- What was the bottleneck?
- How did each transformation address it?
- What is the limiting factor now?
- Document cases where optimizations didn’t help:
- Unrolling without multiple accumulators
- Branchless for predictable data
- Over-unrolling that causes register pressure
- Produce final report document
Checkpoint: Portfolio-quality report with before/after results and strong architectural narrative.
Testing Strategy
Correctness Testing
Every optimized variant must produce the same result as the naive version:
void test_sum_correctness(void) {
    double data[N];
    fill_random(data, N);
    double expected = sum_naive(data, N);
    assert_close(sum_unroll4(data, N), expected, 1e-10);
    assert_close(sum_parallel4(data, N), expected, 1e-10);
    assert_close(sum_parallel8(data, N), expected, 1e-10);
    // Note: floating-point may differ slightly due to reassociation
}

void assert_close(double actual, double expected, double tolerance) {
    double diff = fabs(actual - expected);
    double rel = diff / (fabs(expected) + 1e-10);
    if (rel > tolerance) {
        fprintf(stderr, "FAIL: expected %f, got %f (rel diff: %e)\n",
                expected, actual, rel);
        exit(1);
    }
}
Measurement Validation
Verify timing infrastructure is reliable:
void test_timing_stability(void) {
    // Measure a known-duration operation
    double times[100];
    for (int i = 0; i < 100; i++) {
        uint64_t start = rdtsc();
        busy_loop(10000); // Calibrated delay
        times[i] = rdtsc() - start;
    }
    double stddev = compute_stddev(times, 100);
    double mean = compute_mean(times, 100);
    double cv = stddev / mean; // Coefficient of variation
    // CV should be < 5% for stable measurements
    assert(cv < 0.05);
}
Compiler Interference Check
Ensure the compiler isn’t removing your benchmark code:
void test_not_optimized_away(void) {
    volatile double result;
    uint64_t start = rdtsc();
    result = sum_naive(data, N);
    uint64_t cycles = rdtsc() - start;
    // With at most two loads per cycle, summing N elements
    // should take at least ~N/2 cycles
    assert(cycles > N / 2);
    // Use result to prevent elimination
    assert(result != 12345.6789); // Some value it won't equal
}
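These tests assume an `rdtsc()` helper, which the project never shows. One possible sketch uses x86 inline assembly with a `clock_gettime` fallback for other architectures; note the fallback returns nanoseconds rather than cycles, so keep units consistent within a run:

```c
#include <stdint.h>
#include <time.h>

static inline uint64_t rdtsc(void) {
#if defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    /* Read the time-stamp counter; edx:eax form the 64-bit cycle count. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    /* Fallback: monotonic nanoseconds (not cycles). */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}
```

For serious measurement you may want a serializing variant (e.g. pairing with `cpuid` or using `rdtscp`), since plain `rdtsc` can be reordered by the CPU.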
Edge Case Testing
void test_edge_cases(void) {
    double data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    // Test with n not divisible by unroll factor
    assert_close(sum_parallel4(data, 1), 1.0, 1e-10);
    assert_close(sum_parallel4(data, 2), 3.0, 1e-10);
    assert_close(sum_parallel4(data, 5), 15.0, 1e-10);
    assert_close(sum_parallel4(data, 7), 28.0, 1e-10);
    // Test with empty array
    assert_close(sum_parallel4(data, 0), 0.0, 1e-10);
    // Test with special values
    data[0] = INFINITY;
    assert(isinf(sum_parallel4(data, 8)));
    data[0] = NAN;
    assert(isnan(sum_parallel4(data, 8)));
}
Common Pitfalls and Debugging
Measurement Errors
| Pitfall | Symptom | Solution |
|---|---|---|
| Dead code elimination | CPE near 0 | Use volatile or print result |
| Frequency scaling | First runs slower | Warmup iterations, disable scaling |
| Turbo boost inconsistency | High variance | Disable turbo or use more trials |
| Timer overflow | Negative times | Use 64-bit counters |
| Hardware prefetching | Memory access looks unrealistically fast | Use arrays larger than L3 cache |
CPU Frequency Scaling
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Set to performance (requires root)
sudo cpupower frequency-set --governor performance
# Or for all CPUs:
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance | sudo tee $cpu
done
# Disable turbo boost (Intel)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
Compiler Optimization Interference
# Check what the compiler is doing
gcc -O3 -S -fverbose-asm kernel.c -o kernel.s
# Use Compiler Explorer (https://godbolt.org)
# to visualize assembly
# Prevent specific optimizations
__attribute__((optimize("no-unroll-loops")))
double sum_my_unroll(double *data, size_t n) { ... }
Memory Alignment Issues
// Ensure proper alignment for SIMD
double *data = aligned_alloc(32, n * sizeof(double));
// Or use compiler attributes
double data[N] __attribute__((aligned(32)));
// Check alignment at runtime
assert(((uintptr_t)data % 32) == 0);
False Sharing in Parallel Code
// BAD: Accumulators may share cache line
double sums[4]; // Could be on same 64-byte line
// GOOD: Pad to separate cache lines
struct {
    double value;
    char padding[56]; // 64 - 8 = 56 bytes padding
} sums[4];
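An alternative to manual padding, assuming C11 is available, is to over-align each accumulator; the names here are illustrative:

```c
#include <stdalign.h>

/* alignas(64) forces each element's alignment -- and hence the struct's
   size -- up to a full cache line, so no two accumulators share a line. */
struct padded_sum {
    alignas(64) double value;
};

struct padded_sum padded_sums[4];
```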
Extensions
Beginner Extensions
- Add CSV output: Export results for spreadsheet analysis
- Implement warm/cold cache modes: Measure both scenarios
- Add progress indicator: Show which benchmark is running
- Create comparison plots: Visualize CPE across variants
Intermediate Extensions
- SIMD implementation: Use intrinsics for vectorized kernels
- Profile-guided optimization: Use hardware counters to guide changes
- Multi-threaded variants: Parallelize across cores (be careful with reduction!)
- Different data types: Compare float vs double vs int performance
Advanced Extensions
- Auto-tuning framework: Automatically search for best unroll factor
- Roofline model: Plot achieved performance vs hardware limits
- Assembly kernels: Hand-write inner loops in assembly
- Cross-platform support: ARM NEON, AMD-specific counters
Real-World Connections
Industry Applications
High-Frequency Trading: Every nanosecond matters. The techniques in this project are exactly what HFT firms use to optimize their systems.
Game Engines: Physics simulations, graphics rendering, and audio processing all depend on tight inner loops.
Machine Learning: Matrix operations in neural networks benefit from the same optimization strategies.
Scientific Computing: Weather modeling, molecular dynamics, and other simulations spend most time in computational kernels.
Related Tools and Libraries
- perf: Linux performance analysis tool
- Intel VTune: Commercial profiler with deep microarchitecture insight
- PAPI: Portable interface to hardware performance counters
- likwid: Lightweight performance tools for Linux
- BLAS/LAPACK: Highly optimized linear algebra libraries
Interview Relevance
This project prepares you for questions like:
- “How would you optimize this code?” (Can demonstrate systematic approach)
- “What limits the performance of this loop?” (Can identify bottleneck)
- “Why is this version faster?” (Can explain in terms of architecture)
- “How would you measure performance?” (Know pitfalls and methodology)
- “Explain Amdahl’s Law” (Can calculate and apply limits)
Resources
Essential Reading
- CS:APP Chapter 5: “Optimizing Program Performance” - The foundation for this project
- CS:APP Chapter 6: “The Memory Hierarchy” - Understanding cache effects
- CS:APP Chapter 1.9: “Amdahl’s Law” - Optimization limits
- Agner Fog’s Optimization Manuals: agner.org/optimize - CPU architecture details
Tools
- perf: man perf - Linux performance analysis
- valgrind --tool=cachegrind: Cache simulation
- Intel Intrinsics Guide: intrinsics
- Compiler Explorer: godbolt.org - See generated assembly
Video Resources
- MIT 6.172: Performance Engineering of Software Systems (MIT OpenCourseWare)
- What Every Programmer Should Know About Memory: Ulrich Drepper (LWN.net)
- CppCon talks: Search for “optimization”, “performance”
Related Projects in This Series
- Previous: P1 (Toolchain Explorer) - understanding compilation
- Next: P9 (Cache Lab++) - deep dive into cache behavior
- Related: P7 (Y86-64 Simulator) - CPU pipeline internals
Self-Assessment Checklist
Understanding
- I can explain Amdahl’s Law and calculate speedup limits
- I understand the difference between latency and throughput
- I can identify when code is latency-bound vs throughput-bound
- I can explain why multiple accumulators help performance
- I understand how branch misprediction affects performance
- I can reason about memory access patterns and cache behavior
Implementation
- My timing infrastructure produces stable, repeatable results
- All optimized variants produce correct results
- I have measured CPE for at least 4 kernels
- Each kernel has at least 3 variants with different optimizations
- I can demonstrate cases where optimizations backfire
Analysis
- My report explains each optimization in architectural terms
- I can predict when an optimization will help or hurt
- I have identified the bottleneck for each kernel variant
- I understand why unrolling alone doesn’t help sum reduction
- I can explain the performance difference with different data patterns
Growth
- I used hardware performance counters (perf) to validate understanding
- I examined compiler-generated assembly to understand transformations
- I can now read a loop and estimate its CPE
- I trust measurements over intuition
Submission / Completion Criteria
Minimum Viable Completion:
- Measurement infrastructure works reliably
- At least 4 kernels with naive and 2+ optimized variants each
- Correctness tests pass
- Basic CPE table with results
Full Completion:
- All 5 kernels with 4+ variants each
- Statistical analysis (mean, median, stddev)
- Branch and cache measurements using perf
- Written report explaining each optimization
- Cases documented where optimization didn’t help
Excellence (Going Above & Beyond):
- SIMD implementations using intrinsics
- Roofline model analysis
- Auto-tuning for optimal parameters
- Cross-architecture comparison (Intel vs AMD vs ARM)
- Assembly-level kernels for maximum performance
Executive Summary
Achieved 5.7x speedup on sum reduction through loop parallelism. Branch optimization yielded 2.7-3.9x speedup depending on data. Memory access pattern optimization: 6.5x improvement on matrix ops.
Key Findings
- Loop unrolling WITHOUT multiple accumulators provides minimal benefit
- Dependency chains are the primary bottleneck for reduction operations
- Branch misprediction costs ~15 cycles on this architecture
- Cache miss penalty is ~100 cycles for L3, ~200 for DRAM …
The Core Question You’re Answering
“What prevents code from running at hardware speed, and how do I systematically identify and remove performance bottlenecks - while proving that my optimizations actually work?”
This question is at the heart of performance engineering. You’re learning:
- How to measure performance reliably (avoiding measurement pitfalls)
- How to identify whether code is latency-bound, throughput-bound, or memory-bound
- How compiler optimizations and CPU features interact with your code
- The discipline of hypothesis-driven optimization: measure, hypothesize, transform, verify
Concepts You Must Understand First
Before starting this project, ensure you understand these concepts:
| Concept | Where to Learn | Why It’s Needed |
|---|---|---|
| Amdahl’s Law | CS:APP 1.9.1 | Calculate maximum speedup before optimizing |
| Instruction Pipelining | CS:APP 4.4 | Understand latency vs throughput |
| Data Dependencies | CS:APP 5.5 | Identify what limits instruction-level parallelism |
| Loop Unrolling | CS:APP 5.8 | Know when and why unrolling helps |
| Multiple Accumulators | CS:APP 5.9 | Break dependency chains |
| Branch Prediction | CS:APP 5.12 | Understand misprediction costs |
| Cache Hierarchy | CS:APP 6.1-6.4 | Reason about memory access patterns |
| Condition Codes | CS:APP 3.6.1 | Understand what affects branch behavior |
Questions to Guide Your Design
As you develop your benchmark suite, ask yourself:
- Am I measuring what I think I’m measuring? Is the compiler optimizing away my code? Is the timer granular enough?
- What is the theoretical bound? Given latency and throughput, what’s the best possible CPE?
- Which bound am I hitting? Am I latency-bound (dependency chain) or throughput-bound (functional units)?
- Is my data realistic? Cache-cold vs cache-hot? Random vs sequential access?
- Can I explain the speedup? For each optimization, can I say exactly why it helped?
- What didn’t work? Understanding failures is as important as successes.
- Is this reproducible? Can someone else get the same results?
Thinking Exercise
Before you start coding, work through this analysis on paper:
Consider this reduction loop:
double sum_naive(double *data, int n) {
    double sum = 0;
    for (int i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}
Given:
- FP add latency: 3 cycles
- FP add throughput: 1 per cycle
- Load throughput: 2 per cycle
- Array is cached in L1
Questions to answer:
- Draw the dependency graph for iterations i=0, 1, 2, 3. What depends on what?
- What is the CPE if we’re latency-bound?
- What is the CPE if we’re throughput-bound?
- This loop is latency-bound. Why? (What creates the dependency chain?)
- If we use 4 accumulators (sum0, sum1, sum2, sum3), what is the new CPE? Why?
- How many accumulators do we need to become throughput-bound?
Expected answers:
- Each sum += depends on the previous: sum_0 -> sum_1 -> sum_2 -> sum_3 (serial chain)
- Latency-bound: CPE = 3.0 (one add every 3 cycles due to dependency)
- Throughput-bound: CPE = 1.0 (one add per cycle if no dependencies)
- Each addition waits for the previous sum value - this is a RAW dependency chain
- With 4 accumulators: four independent chains, so the latency bound drops to 3/4 = 0.75 CPE; but the throughput bound of 1.0 now dominates, so CPE ≈ 1.0
- To hit throughput bound of 1 add/cycle with 3-cycle latency: need 3 accumulators minimum
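The `sum_parallel4` referenced throughout this guide can be sketched as follows; note the cleanup loop for n not divisible by 4:

```c
#include <stddef.h>

/* Four independent accumulators break the single RAW dependency chain,
   letting the pipelined FP adder keep four sums in flight at once. */
double sum_parallel4(const double *data, size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; i++)  /* cleanup for leftover elements */
        s0 += data[i];
    return (s0 + s1) + (s2 + s3);
}
```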
Hints in Layers
If you get stuck, reveal hints one at a time. Try each level before moving to the next.
Hint 1 - Reliable Timing
Use rdtsc for cycle counting or clock_gettime(CLOCK_MONOTONIC) for wall time. Always run multiple trials and take the minimum (represents best case without interference).
Hint 2 - Preventing Dead Code Elimination
Declare your result as volatile, or print it at the end. Check the assembly (use -S or godbolt.org) to verify your loop wasn’t optimized away.
Hint 3 - Understanding the Bottleneck
If unrolling doesn’t help, you’re probably latency-bound. Count the operations in your dependency chain. Add accumulators equal to latency/throughput ratio.
Hint 4 - When Branch Optimization Matters
Only optimize branches if: (1) data is unpredictable, (2) branch is in hot loop, (3) both paths are cheap. Use perf stat -e branch-misses to measure.
The Interview Questions They’ll Ask
After completing this project, you should be able to answer these interview questions:
- “How would you optimize a slow function?”
- First: Profile to find where time is spent (don’t guess)
- Calculate Amdahl’s Law limit - is optimization worthwhile?
- Identify the bottleneck: latency? throughput? memory? branches?
- Apply targeted transformation: unroll+multiple accumulators for latency, cache blocking for memory
- Measure again to verify improvement
- “What is the difference between latency and throughput?”
- Latency: Time for one operation to complete (start to finish)
- Throughput: Rate at which operations can be started
- Pipelined units have high throughput (a new operation can start every cycle) even though each individual operation still has multi-cycle latency
- Dependency chains are latency-bound; independent work is throughput-bound
- “Explain Amdahl’s Law and give an example.”
- Speedup = 1 / ((1-p) + p/S) where p = parallelizable fraction, S = speedup of that fraction
- Example: 90% of time in function, optimize 10x: Speedup = 1/(0.1 + 0.9/10) = 1/0.19 = 5.26x
- Key insight: Serial portion limits maximum speedup (1/(1-p))
- “Why does loop unrolling help performance?”
- Reduces loop overhead (fewer increments, compares, branches)
- More work visible to out-of-order engine
- BUT: doesn’t break dependency chains alone
- Need multiple accumulators to truly parallelize work
- “How does branch misprediction affect performance?”
- CPU speculatively executes predicted path
- On mispredict: flush pipeline, restart from correct path
- Cost: 15-20 cycles on modern CPUs
- Random branches = ~50% mispredict rate = significant overhead
- “How do you ensure benchmark results are valid?”
- Use volatile or print results to prevent dead code elimination
- Warm up: run several iterations before measuring
- Multiple trials: report minimum, median, and standard deviation
- Consistent state: control cache (cold vs warm), CPU frequency (disable turbo)
- Verify correctness: optimized code must produce same results
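The Amdahl arithmetic in the answer above is easy to sanity-check with a small helper:

```c
/* Overall speedup when fraction p of runtime is sped up by factor s
   (Amdahl's Law): 1 / ((1 - p) + p/s). */
double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

For the interview example, `amdahl(0.9, 10.0)` gives about 5.26x, and as s grows without bound the result approaches the serial limit 1/(1 - p) = 10x.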
Books That Will Help
| Topic | Book | Chapter/Section |
|---|---|---|
| Performance Fundamentals | CS:APP 3e | Chapter 5 (entire chapter) |
| Amdahl’s Law | CS:APP 3e | Section 1.9.1 |
| Loop Unrolling & Parallelism | CS:APP 3e | Sections 5.8-5.9 |
| Branch Prediction | CS:APP 3e | Section 5.12 |
| Memory Hierarchy | CS:APP 3e | Chapter 6 |
| CPU Microarchitecture | Agner Fog’s Manuals | Microarchitecture guide |
| Performance Analysis | “Systems Performance” by Brendan Gregg | Chapters 6-7 |
| Low-Level Optimization | “What Every Programmer Should Know About Memory” | Sections 3-6 |
This guide was expanded from CSAPP_3E_DEEP_LEARNING_PROJECTS.md. For the complete learning path, see the project index.