Project 8: Performance Clinic
Build a benchmark suite of small kernels plus a written optimization report explaining changes in terms of ILP, branch prediction, and locality.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1-2 weeks |
| Language | C (Alternatives: Rust, Zig, C++) |
| Prerequisites | Project 1 (Toolchain Explorer) |
| Key Topics | Loop optimization, ILP, branch prediction, memory hierarchy, profiling |
| CS:APP Chapters | 5 (primary), 6, 1 |
1. Learning Objectives
By completing this project, you will:
- Measure performance reliably: Obtain stable, repeatable timing measurements that resist noise and variance
- Understand CPU execution pipelines: Reason about instruction-level parallelism, functional units, and throughput vs latency
- Master loop transformations: Apply unrolling, reassociation, and accumulator parallelism systematically
- Predict branch behavior: Understand branch prediction mechanics and identify misprediction-prone patterns
- Reason about memory hierarchy: Connect cache behavior (Chapter 6) to observed performance
- Apply Amdahl's Law: Quantify optimization limits and prioritize improvement efforts
- Avoid "faster by accident": Distinguish genuine improvements from measurement artifacts
2. Deep Theoretical Foundation
2.1 Amdahl's Law: The Speed Limit of Optimization
Every optimization has limits. Gene Amdahl formalized this in 1967:
Speedup(S, p) = 1 / ((1 - p) + p / S)
Where:
- p = fraction of execution time in the optimizable portion
- S = speedup of that portion
- (1 - p) = fraction that cannot be optimized
Example: Your program spends 80% of time in one function. You optimize that function to run 10x faster.
p = 0.80 (80% in optimizable portion)
S = 10 (10x speedup)
Speedup = 1 / (0.20 + 0.80/10)
= 1 / (0.20 + 0.08)
= 1 / 0.28
= 3.57x overall speedup
Even with infinite speedup (S = infinity):
Speedup_max = 1 / (1 - 0.80) = 1 / 0.20 = 5x maximum
You can never exceed 5x no matter how fast you make that 80%!
Key Implications:
- Profile before optimizing: Find the p-dominant portions
- The sequential portion is the ceiling: Reducing (1-p) may matter more than increasing S
- Diminishing returns: Each additional speedup of the hot portion yields less overall benefit
- Always calculate the bound: Before spending a week optimizing, compute the maximum possible gain
Amdahl's Law Visualization:
Overall speedup when portion p is sped up by factor S:
p = 50%: S=2 -> 1.33x S=10 -> 1.82x S=inf -> 2.00x
p = 80%: S=2 -> 1.67x S=10 -> 3.57x S=inf -> 5.00x
p = 90%: S=2 -> 1.82x S=10 -> 5.26x S=inf -> 10.00x
p = 95%: S=2 -> 1.90x S=10 -> 6.90x S=inf -> 20.00x
p = 99%: S=2 -> 1.98x S=10 -> 9.17x S=inf -> 100.00x
The larger p is, the more room for improvement.
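To make the bound concrete before committing to an optimization, a few lines of C suffice. This is a minimal sketch; the function names are illustrative and not part of the project skeleton.
#include <stdio.h>
/* Overall speedup when a fraction p of runtime is sped up by factor s. */
static double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
/* Upper bound as s -> infinity: only the serial fraction remains. */
static double amdahl_limit(double p) {
    return 1.0 / (1.0 - p);
}
int main(void) {
    printf("p=0.80, S=10  -> %.2fx overall\n", amdahl_speedup(0.80, 10.0)); /* 3.57x */
    printf("p=0.80, S=inf -> %.2fx ceiling\n", amdahl_limit(0.80));         /* 5.00x */
    return 0;
}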
2.2 Understanding Modern CPU Pipelines
Modern processors execute instructions through a multi-stage pipeline:
Classic 5-Stage Pipeline:
Cycle: 1 2 3 4 5 6 7 8
Inst 1: [IF] [ID] [EX] [MEM][WB]
Inst 2: [IF] [ID] [EX] [MEM][WB]
Inst 3: [IF] [ID] [EX] [MEM][WB]
Inst 4: [IF] [ID] [EX] [MEM][WB]
IF = Instruction Fetch
ID = Instruction Decode
EX = Execute (ALU operation)
MEM = Memory access
WB = Write Back (to register)
Ideally: One instruction completes per cycle (throughput = 1 IPC)
Modern Superscalar Processors:
Real processors are far more complex:
Intel Skylake-class Core (simplified):
+--------------------------------------------------------------------+
|                             Front End                              |
|   +--------------+     +---------------+     +--------------+      |
|   |    Fetch     | --> |    Decode     | --> |   Rename/    |      |
|   | (16B/cycle)  |     | (4 inst/cyc)  |     |   Allocate   |      |
|   +--------------+     +---------------+     +--------------+      |
+--------------------------------------------------------------------+
                                  |
                                  v
+--------------------------------------------------------------------+
|                  Reservation Station (Scheduler)                   |
|                           ~100 entries                             |
+--------------------------------------------------------------------+
       |          |          |          |          |
       v          v          v          v          v
   +-------+  +-------+  +-------+  +-------+  +-------+
   | INT 0 |  | INT 1 |  | INT 2 |  | FP 0  |  | FP 1  |
   |  ALU  |  |  ALU  |  |  ALU  |  |  FMA  |  |  FMA  |
   | BRANCH|  | BRANCH|  |  MUL  |  |  ADD  |  |  ADD  |
   +-------+  +-------+  +-------+  +-------+  +-------+
       |          |          |          |          |
       +----------+----------+----------+----------+
                                  |
                                  v
+--------------------------------------------------------------------+
|                           Reorder Buffer                           |
|                 (preserves sequential semantics)                    |
+--------------------------------------------------------------------+
Key insight: Multiple instructions execute simultaneously across different functional units. Your code should provide enough independent work to keep these units busy.
2.3 Instruction-Level Parallelism (ILP)
ILP is the amount of independent work available at the instruction level.
Data Dependencies Limit ILP:
// Sequential chain - NO parallelism
// Each instruction depends on the previous result
x = a + b; // cycle 1-4 (assume 4-cycle latency)
y = x + c; // cycle 5-8 (must wait for x)
z = y + d; // cycle 9-12 (must wait for y)
// Total: 12 cycles for 3 operations
// Independent operations - FULL parallelism
// No dependencies between operations
x = a + b; // cycles 1-4  --+  all execute
y = c + d; // cycles 1-4    |  simultaneously
z = e + f; // cycles 1-4  --+
// Total: 4 cycles for 3 operations (3x speedup)
Dependency Types:
| Type | Also Called | Description | Can Be Eliminated? |
|---|---|---|---|
| RAW | True Dependency | Read After Write: a = 1; b = a; | No (fundamental) |
| WAR | Anti-Dependency | Write After Read: a = b; b = 2; | Yes (renaming) |
| WAW | Output Dependency | Write After Write: a = 1; a = 2; | Yes (renaming) |
The processor's register renaming eliminates WAR and WAW dependencies, but RAW dependencies are fundamental and limit ILP.
2.4 Latency vs Throughput
Critical distinction for understanding performance:
Latency: Time for a single operation to complete (cycles)
Throughput: Rate at which operations can be initiated (ops per cycle); often quoted as its reciprocal, the issue interval in cycles
Example: Floating-point multiplication on modern Intel
Latency: 4 cycles (one multiply takes 4 cycles start to finish)
Issue interval: 0.5 cycles (a new multiply can start every 0.5 cycles, i.e., throughput of 2 per cycle)
This means the multiplier is pipelined:
Cycle: 1 2 3 4 5 6 7 8
Mult 1: [S1] [S2] [S3] [S4] done
Mult 2: [S1] [S2] [S3] [S4] done
Mult 3: [S1] [S2] [S3] [S4] done
Mult 4: [S1] [S2] [S3] [S4] done
With 8 independent multiplies, throughput bound is:
8 operations / (8 * 0.5 cycles) = 8 / 4 = 2 ops/cycle
But with a dependency chain:
8 operations / (8 * 4 cycles) = 8 / 32 = 0.25 ops/cycle
That's 8x slower!
The Performance Equation:
Time >= max(Latency_bound, Throughput_bound)
Latency_bound: Time = n * latency
(serial chain of n operations)
Throughput_bound: Time = n * issue_interval
(n independent operations; issue_interval = 1 / throughput)
Goal: Make your code throughput-bound, not latency-bound!
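To feel the difference, the two loops below perform the same number of multiplies. Timing them side by side exposes the latency bound versus the throughput bound; this is a sketch with illustrative names, not part of the required kernel set.
#include <stddef.h>
/* Serial chain: every multiply waits for the previous result (latency-bound). */
double mul_chain(const double *x, size_t n) {
    double acc = 1.0;
    for (size_t i = 0; i < n; i++)
        acc *= x[i];
    return acc;
}
/* Four independent chains: the pipelined multiplier stays busy (throughput-bound). */
double mul_parallel4(const double *x, size_t n) {
    double a0 = 1.0, a1 = 1.0, a2 = 1.0, a3 = 1.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        a0 *= x[i];
        a1 *= x[i + 1];
        a2 *= x[i + 2];
        a3 *= x[i + 3];
    }
    for (; i < n; i++)
        a0 *= x[i];
    return (a0 * a1) * (a2 * a3);
}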
2.5 CPE: Cycles Per Element
The CPE metric from CS:APP measures efficiency for array/loop computations:
CPE = (Total cycles to process array) / (Number of elements)
Example: Sum an array of 1000 elements
Version A: 4000 cycles -> CPE = 4.0
Version B: 1500 cycles -> CPE = 1.5
Version C: 1000 cycles -> CPE = 1.0
Lower CPE = faster code
Theoretical CPE Limits:
For a loop that performs one floating-point add per element:
for (int i = 0; i < n; i++) {
sum += data[i]; // one FP add per iteration
}
| Bottleneck | Bound | Reason |
|---|---|---|
| Latency (3-cycle add, dependency chain) | CPE >= 3.0 | Each add waits for previous |
| Throughput (1 add/cycle possible) | CPE >= 1.0 | Adder can do 1 per cycle |
| Memory (assuming cached) | CPE >= 0.5 | 2 loads/cycle possible |
If your measured CPE is 3.0, you're latency-bound. Optimization means breaking the dependency chain.
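In practice CPE falls straight out of a cycle counter. A minimal sketch, assuming the rdtsc() wrapper defined later in the implementation guide and a kernel with this signature:
#include <stdint.h>
#include <stddef.h>
/* Cycles per element for one run of a kernel over n elements.
   rdtsc() is the cycle-counter wrapper from the timing module (Section 5, Phase 1). */
double measure_cpe(double (*kernel)(double *, size_t), double *data, size_t n) {
    uint64_t start = rdtsc();
    volatile double result = kernel(data, n);   /* volatile: keep the call alive */
    uint64_t cycles = rdtsc() - start;
    (void)result;
    return (double)cycles / (double)n;
}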
2.6 Loop Unrolling
Loop unrolling reduces loop overhead and exposes more work per iteration:
// Original loop
for (int i = 0; i < n; i++) {
sum += data[i];
}
// 2x unrolled
for (int i = 0; i < n; i += 2) {
sum += data[i];
sum += data[i+1];
}
// 4x unrolled
for (int i = 0; i < n; i += 4) {
sum += data[i];
sum += data[i+1];
sum += data[i+2];
sum += data[i+3];
}
Benefits:
- Fewer loop counter increments and comparisons
- More work visible to the out-of-order engine
- Better instruction cache utilization (fewer branches)
But still latency-bound! All adds still chain through sum. Unrolling alone may not help.
2.7 Reassociation and Multiple Accumulators
The key to breaking latency bounds is parallelizing the accumulation:
// Original: all adds chain through one accumulator
// CPE ~= latency of addition (3-4 cycles)
double sum = 0;
for (int i = 0; i < n; i++) {
sum += data[i]; // sum = sum + data[i]
}
// 2x parallelism: two independent accumulator chains
// CPE ~= latency / 2
double sum0 = 0, sum1 = 0;
for (int i = 0; i < n; i += 2) {
sum0 += data[i]; // chain A
sum1 += data[i+1]; // chain B (independent!)
}
double sum = sum0 + sum1;
// 4x parallelism: four independent chains
// CPE ~= latency / 4
double sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
for (int i = 0; i < n; i += 4) {
sum0 += data[i];
sum1 += data[i+1];
sum2 += data[i+2];
sum3 += data[i+3];
}
double sum = (sum0 + sum1) + (sum2 + sum3);
Visualization:
Original (serial):
data[0] --add--> sum --add--> sum --add--> sum --add--> ...
                       ^            ^            ^
                   data[1]      data[2]      data[3]
                   (wait)       (wait)       (wait)
Cycle: 0---3---6---9---12--- ...  (CPE = 3.0)
4x Parallel:
data[0] --add--> sum0
data[1] --add--> sum1    All four chains run
data[2] --add--> sum2    simultaneously!
data[3] --add--> sum3
data[4] --add--> sum0    Next iteration starts
...                      after 3 cycles, not 12
Cycle: 0---3---6---9--- ...  (CPE = 0.75)
2.8 Reassociation Transformations
Reassociation changes the order of operations to expose parallelism:
// Original: Left-to-right evaluation
// acc = ((acc * x[i]) * x[i+1]) * x[i+2]
// Each multiply depends on the previous
for (int i = 0; i < n; i += 2) {
acc = (acc * data[i]) * data[i+1]; // Serial!
}
// Reassociated: Pair independent operations
// acc = acc * (x[i] * x[i+1])
// The (x[i] * x[i+1]) can proceed in parallel with the chain
for (int i = 0; i < n; i += 2) {
acc = acc * (data[i] * data[i+1]); // Partial parallel!
}
Warning: Reassociation changes results for floating-point due to rounding!
Floating-point is NOT associative:
a = 1.0e20
b = -1.0e20
c = 1.0
(a + b) + c = 0 + 1 = 1.0
a + (b + c) = 1.0e20 + (-1.0e20 + 1.0) = 1.0e20 + (-1.0e20) = 0.0
Different results! Compiler won't reassociate unless you explicitly allow it.
Use -ffast-math to permit reassociation, or do it manually (accepting the precision trade-off).
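The rounding effect is easy to see on any IEEE-754 machine; a minimal demonstration of the example above:
#include <stdio.h>
int main(void) {
    double a = 1.0e20, b = -1.0e20, c = 1.0;
    printf("(a + b) + c = %g\n", (a + b) + c);   /* 1 : a and b cancel first     */
    printf("a + (b + c) = %g\n", a + (b + c));   /* 0 : c is lost in rounding    */
    return 0;
}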
2.9 Branch Prediction Mechanics
Modern CPUs predict branch outcomes to avoid pipeline stalls:
Pipeline with branch at cycle 3:
Cycle:  1    2    3    4    5    6    7    8
Fetch: [A]  [B]  [br] [?]  [?]  [?]  [?]  [?]
                  |
                  +-- Predict TAKEN:     fetch from target
                  +-- Predict NOT-TAKEN: fetch next instruction
If prediction is WRONG:
- Pipeline must be flushed
- Restart from correct address
- Penalty: 15-20 cycles on modern CPUs!
Branch Prediction Strategies:
| Strategy | How it Works | Accuracy |
|---|---|---|
| Static | Predict backward branches taken, forward branches not-taken | ~60-70% |
| One-bit | Remember last outcome | ~80% |
| Two-bit (saturating counter) | Change prediction after 2 wrong | ~85-90% |
| Correlating | Consider history of other branches | ~90-95% |
| Neural | Pattern recognition on branch history | ~95%+ |
Branch-Unfriendly Patterns:
// Random condition - terrible for prediction
if (rand() % 2) { ... } // 50% misprediction rate!
// Data-dependent unpredictable branch
for (int i = 0; i < n; i++) {
if (data[i] > threshold) { // depends on data pattern
result += data[i];
}
}
Branch-Friendly Patterns:
// Predictable loop
for (int i = 0; i < 1000; i++) { ... }
// Predict: i < 1000 is TRUE for 999 iterations
// Sorted data makes branches predictable
sort(data);
for (int i = 0; i < n; i++) {
if (data[i] > threshold) { // After some point, always true
result += data[i];
}
}
2.10 Avoiding Branches: Conditional Moves
Replace branches with branchless computation:
// Branch version - misprediction possible
if (a > b) {
max = a;
} else {
max = b;
}
// Branchless version using conditional move
max = (a > b) ? a : b; // May compile to CMOV instruction
// Manual branchless (works always)
int mask = -(a > b); // All 1s if true, all 0s if false
max = (a & mask) | (b & ~mask);
When to use branchless:
- Unpredictable conditions (near 50/50 split)
- Inner loops where branch overhead matters
- When the computation is simple (conditional move overhead is ~2 cycles)
When branches are better:
- Predictable patterns (strongly biased one way)
- Complex alternatives (one path is expensive)
- The work saved by skipping outweighs misprediction cost
2.11 Memory Hierarchy Impact
Performance depends heavily on memory access patterns (Chapter 6):
Memory Hierarchy Latencies (approximate, Intel Skylake):
Level          Size       Latency (cycles)   Bandwidth
-------------------------------------------------------
L1 Cache       32 KB      4 cycles           ~200 GB/s
L2 Cache       256 KB     12 cycles          ~80 GB/s
L3 Cache       8+ MB      40 cycles          ~40 GB/s
Main Memory    GBs        200+ cycles        ~20 GB/s
-------------------------------------------------------
A cache miss costs 50-200x more than a hit!
Locality Principles:
// Temporal locality: recently accessed items accessed again
for (int iter = 0; iter < 100; iter++) {
for (int i = 0; i < N; i++) {
sum += data[i]; // Same data[i] accessed 100 times
} // Stays in cache after first pass
}
// Spatial locality: access nearby memory locations
for (int i = 0; i < N; i++) {
sum += data[i]; // data[i] and data[i+1] are adjacent
} // hardware prefetch of adjacent cache lines helps
// Poor locality (stride pattern)
for (int i = 0; i < N; i += 64) {
sum += data[i]; // Skip cache lines, waste prefetch
}
// Poor locality (random access)
for (int i = 0; i < N; i++) {
sum += data[indices[i]]; // Random access - cache useless
}
Performance Impact:
Sequential access of 1 million doubles:
L1 hit: 1M * 4 cycles = 4M cycles
L3 hit: 1M * 40 cycles = 40M cycles (10x slower)
DRAM: 1M * 200 cycles = 200M cycles (50x slower)
Locality is CRITICAL for performance!
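One direct way to see these effects is to sum the same buffer with increasing stride and watch CPE climb as spatial locality disappears. A rough sketch; the strides to sweep (1, 2, 4, ..., 64) and buffer size are arbitrary choices for the experiment:
#include <stddef.h>
/* Touch every stride-th element; larger strides waste most of each
   64-byte cache line and defeat the hardware prefetcher. */
double strided_sum(const double *data, size_t n, size_t stride) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i += stride)
        sum += data[i];
    return sum;
}
Report CPE per element actually touched so the comparison across strides is fair.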
2.12 Profiling Methodologies
Before optimizing, measure:
1. Time Measurement:
#include <time.h>
// Wall-clock time
clock_t start = clock();
// ... code to measure ...
clock_t end = clock();
double seconds = (double)(end - start) / CLOCKS_PER_SEC;
// More precise: POSIX high-resolution timer
#include <time.h>
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
// ... code to measure ...
clock_gettime(CLOCK_MONOTONIC, &end);
double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
2. Cycle Counting (x86):
#include <x86intrin.h>
unsigned long long start = __rdtsc();
// ... code to measure ...
unsigned long long end = __rdtsc();
unsigned long long cycles = end - start;
3. Hardware Performance Counters (Linux perf):
# Overall stats
perf stat ./my_program
# Detailed cache behavior
perf stat -e cache-misses,cache-references ./my_program
# Branch prediction
perf stat -e branches,branch-misses ./my_program
# Profile and identify hotspots
perf record ./my_program
perf report
2.13 Microbenchmarking Pitfalls
Getting stable measurements is harder than it seems:
Pitfall 1: Dead Code Elimination
// BAD: Compiler may optimize away!
double result = 0;
for (int i = 0; i < N; i++) {
result += expensive_computation(data[i]);
}
// result is never used - compiler removes the loop!
// GOOD: Prevent elimination
volatile double result = 0;
// or
printf("Result: %f\n", result); // Force result to be computed
Pitfall 2: CPU Frequency Scaling
// CPU starts at low frequency, ramps up during benchmark
// First measurements are slower!
// SOLUTION 1: Warmup iterations
for (int i = 0; i < 1000; i++) {
test_function(data); // Warmup - don't measure
}
// Now measure
start = rdtsc();
for (int i = 0; i < TRIALS; i++) {
test_function(data);
}
end = rdtsc();
// SOLUTION 2: Disable frequency scaling
// sudo cpupower frequency-set --governor performance
Pitfall 3: Measurement Overhead
// BAD: Timing overhead dominates for fast operations
for (int i = 0; i < N; i++) {
start = rdtsc();
x = x + 1; // 1 cycle
end = rdtsc(); // ~50 cycles overhead!
total += (end - start);
}
// GOOD: Batch many operations
start = rdtsc();
for (int i = 0; i < 1000000; i++) {
x = x + 1;
}
end = rdtsc();
cycles_per_op = (end - start) / 1000000.0;
Pitfall 4: Cache State Variability
// First access: cold cache (slow)
// Second access: warm cache (fast)
// SOLUTION: Consistent cache state
// Option A: Flush cache before each trial (cold)
// Option B: Warmup loop before measuring (warm)
// Option C: Large enough data to exceed cache (always miss)
// Document which state you're measuring!
Pitfall 5: Variance and Outliers
// Run multiple trials, report minimum or median
double times[TRIALS];
for (int t = 0; t < TRIALS; t++) {
start = now();
test_function(data);
times[t] = now() - start;
}
// Use minimum (represents "best case" without interference)
// Or median (more robust to outliers)
double min_time = minimum(times, TRIALS);
double median_time = median(times, TRIALS);
// Report both and explain methodology!
2.14 Compiler Optimizations Overview
Know what the compiler can do so you don't duplicate its work:
Common Compiler Optimizations:
| Optimization | What It Does | Flag |
|---|---|---|
| Inlining | Replace function call with body | -O1+ |
| Loop unrolling | Replicate loop body | -O2+ or -funroll-loops |
| Constant folding | Compute constants at compile time | -O1+ |
| Dead code elimination | Remove unused code | -O1+ |
| Common subexpression | Reuse computed values | -O1+ |
| Strength reduction | Replace expensive ops (a*2 -> a<<1) | -O1+ |
| Vectorization | Use SIMD instructions | -O3 or -ftree-vectorize |
| Instruction scheduling | Reorder for pipeline efficiency | -O2+ |
Compiler Explorer Strategy:
- Write your original code
- Compile with -O0 and -O3
- Examine assembly (godbolt.org is excellent)
- See what the compiler already does
- Only manually optimize what the compiler misses
Critical Compiler Flags:
# Standard optimization levels
gcc -O0 ... # No optimization (debugging)
gcc -O1 ... # Basic optimizations
gcc -O2 ... # Standard optimizations (recommended for most cases)
gcc -O3 ... # Aggressive optimizations (may increase code size)
gcc -Os ... # Optimize for size
# Architecture-specific
gcc -march=native ... # Use all instructions available on this CPU
gcc -mtune=native ... # Tune for this CPU (without requiring features)
# Floating-point
gcc -ffast-math ... # Allows reassociation, breaks IEEE compliance
gcc -ffp-contract=fast # Allow FMA contraction
# Vectorization
gcc -ftree-vectorize ... # Enable auto-vectorization
gcc -fopt-info-vec ... # Report what was vectorized
3. Project Specification
3.1 What You Will Build
A benchmark suite consisting of:
- Small computational kernels: Array operations that stress different aspects of the CPU
- Measurement infrastructure: Reliable timing with statistical analysis
- Optimization laboratory: Before/after versions of each kernel
- Written report: Explanation of each optimization in architectural terms
3.2 Core Benchmark Kernels
Implement and optimize these kernels (minimum 4):
Kernel 1: Sum Reduction
double sum_naive(double *data, size_t n) {
double sum = 0;
for (size_t i = 0; i < n; i++) {
sum += data[i];
}
return sum;
}
Target: Transform from latency-bound to throughput-bound.
Kernel 2: Polynomial Evaluation (Horner's Rule)
double poly_naive(double *coeffs, size_t degree, double x) {
double result = coeffs[degree];
for (size_t i = degree; i > 0; i--) {
result = result * x + coeffs[i-1];
}
return result;
}
Target: Explore dependency chain limitations.
Kernel 3: Conditional Accumulation
double conditional_sum_naive(double *data, size_t n, double threshold) {
double sum = 0;
for (size_t i = 0; i < n; i++) {
if (data[i] > threshold) {
sum += data[i];
}
}
return sum;
}
Target: Explore branch prediction effects, try branchless.
Kernel 4: Matrix-Vector Multiply
void matvec_naive(double *A, double *x, double *y, size_t n) {
for (size_t i = 0; i < n; i++) {
y[i] = 0;
for (size_t j = 0; j < n; j++) {
y[i] += A[i*n + j] * x[j];
}
}
}
Target: Explore memory access patterns and locality.
Kernel 5: Inner Product
double inner_product_naive(double *a, double *b, size_t n) {
double sum = 0;
for (size_t i = 0; i < n; i++) {
sum += a[i] * b[i];
}
return sum;
}
Target: Two data streams, multiply-add dependency chain.
3.3 Functional Requirements
- Measurement System:
- Multiple trials with statistical reporting
- CPE calculation for each kernel
- Cycle-accurate timing (rdtsc or similar)
- Warmup iterations before measurement
- Kernel Variants:
- At least 3 optimization levels per kernel
- Each variant must produce correct results
- Document the transformation applied
- Report Requirements:
- CPE table for all variants
- Speedup calculation relative to naive
- Explanation of why each optimization works (or doesn't)
- Discussion of how the architectural bottleneck shifted
3.4 Example Output
$ ./perf-clinic --kernel sum --trials 100
=== PERFORMANCE CLINIC: Sum Reduction ===
Array size: 1,000,000 doubles
Trials: 100 (reporting median)
Variant CPE Speedup Bottleneck
-------------------------------------------------------
naive 3.02 1.00x Latency (FP add chain)
unroll_2x 3.01 1.00x Still latency-bound
unroll_4x 2.98 1.01x Minimal improvement
parallel_2acc 1.52 1.99x Latency / 2
parallel_4acc 0.78 3.87x Approaching throughput
parallel_8acc 0.52 5.81x Near throughput bound
vectorized 0.13 23.23x SIMD (4 ops/instr)
-------------------------------------------------------
Analysis:
- Naive version is limited by 3-cycle FP add latency
- Unrolling alone doesn't help (still one dependency chain)
- Multiple accumulators break the dependency chain
- 8 accumulators approach the throughput bound (~0.5 CPE)
- SIMD achieves 4x further by processing 4 elements per instruction
4. Solution Architecture
4.1 High-Level Design
+----------------------------------------------------------------------+
|                              perf-clinic                              |
+----------------------------------------------------------------------+
|                                                                       |
|   +--------------+     +--------------+     +--------------+          |
|   |    Timing    |     |    Kernel    |     |   Analysis   |          |
|   |   Harness    |     |   Library    |     |    Engine    |          |
|   +--------------+     +--------------+     +--------------+          |
|          |                     |                    |                 |
|          v                     v                    v                 |
|   +---------------------------------------------------------------+   |
|   |                       Benchmark Runner                        |   |
|   |   - Load data                                                 |   |
|   |   - Run warmup                                                |   |
|   |   - Execute trials                                            |   |
|   |   - Collect statistics                                        |   |
|   +---------------------------------------------------------------+   |
|                                 |                                      |
|                                 v                                      |
|   +---------------------------------------------------------------+   |
|   |                       Report Generator                        |   |
|   |   - CPE calculations                                          |   |
|   |   - Speedup ratios                                            |   |
|   |   - Statistical summary                                       |   |
|   |   - Bottleneck identification                                 |   |
|   +---------------------------------------------------------------+   |
|                                 |                                      |
|                                 v                                      |
|                       [Console / File Output]                          |
+----------------------------------------------------------------------+
4.2 Key Data Structures
// Timing result for one trial
typedef struct {
uint64_t cycles;
double seconds;
size_t elements;
} trial_result_t;
// Statistics across all trials
typedef struct {
double min_cycles;
double median_cycles;
double mean_cycles;
double stddev_cycles;
double cpe; // cycles per element
size_t num_trials;
} benchmark_stats_t;
// Kernel variant
typedef struct {
const char *name;
const char *description;
void (*setup)(void *ctx); // Optional setup
void (*kernel)(void *ctx); // The actual work
void (*verify)(void *ctx); // Correctness check
void (*teardown)(void *ctx); // Cleanup
} kernel_variant_t;
// Benchmark configuration
typedef struct {
const char *kernel_name;
kernel_variant_t *variants;
size_t num_variants;
size_t element_count;
size_t num_trials;
size_t warmup_iters;
bool verify_results;
} benchmark_config_t;
// Result for one kernel variant
typedef struct {
const char *variant_name;
benchmark_stats_t stats;
bool correct;
double speedup; // relative to baseline
} variant_result_t;
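A sketch of how the variant table might be wired up for the sum kernel. The wrapper names and the context struct are assumptions, not a prescribed API; they only illustrate adapting the kernels to the void(*)(void*) interface.
#include <stddef.h>
/* Context passed to every sum variant. */
typedef struct {
    double *data;
    size_t  n;
    double  result;
} sum_ctx_t;
/* Thin wrappers adapt the double(double*, size_t) kernels to void(void*). */
static void run_sum_naive(void *p)     { sum_ctx_t *c = p; c->result = sum_naive(c->data, c->n); }
static void run_sum_parallel4(void *p) { sum_ctx_t *c = p; c->result = sum_parallel4(c->data, c->n); }
static kernel_variant_t sum_variants[] = {
    { .name = "naive",         .description = "single accumulator", .kernel = run_sum_naive },
    { .name = "parallel_4acc", .description = "4 accumulators",     .kernel = run_sum_parallel4 },
};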
4.3 Module Structure
perf-clinic/
+-- src/
|   +-- main.c              # CLI and orchestration
|   +-- timing.c            # Cycle counting and timing
|   +-- timing.h
|   +-- stats.c             # Statistical analysis
|   +-- stats.h
|   +-- runner.c            # Benchmark execution
|   +-- runner.h
|   +-- report.c            # Output formatting
|   +-- report.h
|   +-- kernels/
|       +-- sum_reduce.c    # Sum variants
|       +-- poly_eval.c     # Polynomial variants
|       +-- conditional.c   # Conditional sum variants
|       +-- matvec.c        # Matrix-vector variants
|       +-- inner_product.c # Inner product variants
+-- include/
|   +-- kernels.h           # Kernel declarations
|   +-- common.h            # Shared definitions
+-- tests/
|   +-- test_correctness.c  # Verify all variants match
|   +-- test_timing.c       # Validate timing infrastructure
+-- data/
|   +-- generate_data.c     # Test data generation
+-- Makefile
+-- README.md
5. Implementation Guide
Phase 1: Measurement Infrastructure (Days 1-3)
Goals:
- Reliable cycle-accurate timing
- Statistical analysis of multiple trials
- Avoid measurement pitfalls
Tasks:
- Implement cycle counter access:
// timing.h
#include <stdint.h>
static inline uint64_t rdtsc(void) {
uint32_t lo, hi;
__asm__ __volatile__ (
"rdtsc" : "=a"(lo), "=d"(hi)
);
return ((uint64_t)hi << 32) | lo;
}
// More robust: rdtscp waits for earlier instructions to complete before reading the counter
static inline uint64_t rdtscp(void) {
uint32_t lo, hi, aux;
__asm__ __volatile__ (
"rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux)
);
return ((uint64_t)hi << 32) | lo;
}
- Implement statistical functions:
// stats.c
double compute_median(double *values, size_t n);
double compute_mean(double *values, size_t n);
double compute_stddev(double *values, size_t n);
double compute_min(double *values, size_t n);
- Implement benchmark harness (declaration below; a sketch of one possible implementation follows the task list):
// runner.c
benchmark_stats_t run_benchmark(
void (*kernel)(void *ctx),
void *ctx,
size_t elements,
size_t trials,
size_t warmup
);
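A minimal run_benchmark sketch, assuming the rdtsc() and stats helpers above and the benchmark_stats_t from Section 4.2. The median is computed by sorting a copy of the trial data; error handling is omitted.
#include <stdint.h>
#include <stdlib.h>
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}
benchmark_stats_t run_benchmark(void (*kernel)(void *), void *ctx,
                                size_t elements, size_t trials, size_t warmup) {
    double *cycles = malloc(trials * sizeof *cycles);
    for (size_t i = 0; i < warmup; i++)          /* warm caches, let the clock ramp up */
        kernel(ctx);
    for (size_t t = 0; t < trials; t++) {
        uint64_t start = rdtsc();
        kernel(ctx);
        cycles[t] = (double)(rdtsc() - start);
    }
    qsort(cycles, trials, sizeof *cycles, cmp_double);
    benchmark_stats_t s = {0};
    s.min_cycles    = cycles[0];
    s.median_cycles = cycles[trials / 2];
    s.mean_cycles   = compute_mean(cycles, trials);
    s.stddev_cycles = compute_stddev(cycles, trials);
    s.cpe           = s.median_cycles / (double)elements;
    s.num_trials    = trials;
    free(cycles);
    return s;
}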
Checkpoint: Can measure a simple loop and get consistent cycle counts across runs.
Phase 2: Baseline Kernels (Days 4-5)
Goals:
- Implement naive versions of all kernels
- Verify correctness
- Establish baseline CPE
Tasks:
- Implement each kernel's naive version
- Create verification functions (compare against known results)
- Generate test data with different patterns (a generator sketch follows the tasks below):
- Sequential (0, 1, 2, โฆ)
- Random uniform
- Random with patterns (sorted, partially sorted)
- Measure baseline CPE:
void measure_baselines(void) {
double data[N];
fill_random(data, N);
benchmark_stats_t sum_stats = run_benchmark(
sum_naive, data, N, 100, 10
);
printf("sum_naive: CPE = %.2f\n", sum_stats.cpe);
}
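The data-pattern generators referenced above can stay small. A sketch; the function names are placeholders rather than a required API:
#include <stdlib.h>
#include <stddef.h>
void fill_sequential(double *data, size_t n) {
    for (size_t i = 0; i < n; i++)
        data[i] = (double)i;
}
void fill_random(double *data, size_t n) {
    for (size_t i = 0; i < n; i++)
        data[i] = (double)rand() / (double)RAND_MAX;   /* uniform in [0, 1] */
}
static int cmp_double_asc(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}
void fill_sorted(double *data, size_t n) {
    fill_random(data, n);
    qsort(data, n, sizeof *data, cmp_double_asc);      /* predictable branch pattern */
}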
Checkpoint: All naive kernels produce correct results. Baseline CPE recorded.
Phase 3: Loop Optimizations (Days 6-8)
Goals:
- Implement unrolled variants
- Implement parallel accumulator variants
- Understand dependency effects
Tasks:
- Implement 2x, 4x, 8x unrolling:
double sum_unroll4(double *data, size_t n) {
double sum = 0;
size_t i;
for (i = 0; i + 4 <= n; i += 4) {   // i + 4 <= n avoids size_t underflow when n < 4
sum += data[i];
sum += data[i+1];
sum += data[i+2];
sum += data[i+3];
}
for (; i < n; i++) {
sum += data[i];
}
return sum;
}
- Implement parallel accumulator variants:
double sum_parallel4(double *data, size_t n) {
double sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
size_t i;
for (i = 0; i + 4 <= n; i += 4) {   // avoid size_t underflow when n < 4
sum0 += data[i];
sum1 += data[i+1];
sum2 += data[i+2];
sum3 += data[i+3];
}
for (; i < n; i++) {
sum0 += data[i];
}
return (sum0 + sum1) + (sum2 + sum3);
}
- Measure and compare CPE for all variants
Checkpoint: Parallel accumulator shows significant speedup. Can explain why unrolling alone doesn't help.
Phase 4: Branch Optimizations (Days 9-10)
Goals:
- Measure branch misprediction effects
- Implement branchless alternatives
- Compare with different data patterns
Tasks:
- Implement branchless conditional:
double conditional_sum_branchless(double *data, size_t n, double threshold) {
double sum = 0;
for (size_t i = 0; i < n; i++) {
// mask is 1.0 or 0.0 depending on the condition; the compiler can
// lower this to a conditional move instead of a branch
double mask = (data[i] > threshold) ? 1.0 : 0.0;
sum += data[i] * mask;
}
return sum;
}
- Test with different data distributions:
- Random (50% pass threshold) - unpredictable
- Sorted (all pass or all fail) - predictable
- Biased (90% one way) - mostly predictable
- Measure with perf stat -e branches,branch-misses:
# High misprediction rate expected for random data
perf stat -e branches,branch-misses ./perf-clinic --kernel conditional --data random
# Low misprediction rate for sorted data
perf stat -e branches,branch-misses ./perf-clinic --kernel conditional --data sorted
Checkpoint: Can demonstrate branch misprediction costs. Branchless version wins for random data.
Phase 5: Memory Optimizations (Days 11-12)
Goals:
- Explore locality effects
- Measure cache miss impact
- Optimize memory access patterns
Tasks:
- Implement matrix-vector with different access patterns:
// Row-major (good cache behavior for row-major storage)
void matvec_rowmajor(double *A, double *x, double *y, size_t n) {
for (size_t i = 0; i < n; i++) {
double sum = 0;
for (size_t j = 0; j < n; j++) {
sum += A[i*n + j] * x[j]; // A accessed sequentially
}
y[i] = sum;
}
}
// Column-major (bad cache behavior for row-major storage)
void matvec_colmajor(double *A, double *x, double *y, size_t n) {
for (size_t i = 0; i < n; i++) y[i] = 0;
for (size_t j = 0; j < n; j++) {
for (size_t i = 0; i < n; i++) {
y[i] += A[i*n + j] * x[j]; // A accessed with stride n
}
}
}
- Measure cache misses:
perf stat -e cache-references,cache-misses ./perf-clinic --kernel matvec --variant rowmajor
perf stat -e cache-references,cache-misses ./perf-clinic --kernel matvec --variant colmajor
- Test with matrices of different sizes (L1, L2, L3, larger than cache)
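The example output in Section 13 also lists a blocked variant. A cache-blocking sketch; the block size of 32 matches that output but is a tuning parameter, not a prescribed value:
#include <stddef.h>
/* Block over columns so a chunk of x stays hot in L1 while many rows reuse it. */
void matvec_blocked(double *A, double *x, double *y, size_t n) {
    const size_t B = 32;                           /* tuning parameter */
    for (size_t i = 0; i < n; i++)
        y[i] = 0;
    for (size_t jj = 0; jj < n; jj += B) {
        size_t jend = (jj + B < n) ? jj + B : n;
        for (size_t i = 0; i < n; i++) {
            double sum = y[i];
            for (size_t j = jj; j < jend; j++)
                sum += A[i*n + j] * x[j];
            y[i] = sum;
        }
    }
}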
Checkpoint: Row-major significantly faster. Cache miss rate correlates with slowdown.
Phase 6: Report and Analysis (Days 13-14)
Goals:
- Document all results
- Explain each optimization architecturally
- Identify when optimizations backfire
Tasks:
- Create comprehensive CPE table
- Write analysis for each kernel:
- What was the bottleneck?
- How did each transformation address it?
- What is the limiting factor now?
- Document cases where optimizations didn't help:
- Unrolling without multiple accumulators
- Branchless for predictable data
- Over-unrolling that causes register pressure
- Produce final report document
Checkpoint: Portfolio-quality report with before/after results and strong architectural narrative.
6. Testing Strategy
6.1 Correctness Testing
Every optimized variant must produce the same result as the naive version:
void test_sum_correctness(void) {
double data[N];
fill_random(data, N);
double expected = sum_naive(data, N);
assert_close(sum_unroll4(data, N), expected, 1e-10);
assert_close(sum_parallel4(data, N), expected, 1e-10);
assert_close(sum_parallel8(data, N), expected, 1e-10);
// Note: floating-point may differ slightly due to reassociation
}
void assert_close(double actual, double expected, double tolerance) {
double diff = fabs(actual - expected);
double rel = diff / (fabs(expected) + 1e-10);
if (rel > tolerance) {
fprintf(stderr, "FAIL: expected %f, got %f (rel diff: %e)\n",
expected, actual, rel);
exit(1);
}
}
6.2 Measurement Validation
Verify timing infrastructure is reliable:
void test_timing_stability(void) {
// Measure a known-duration operation
double times[100];
for (int i = 0; i < 100; i++) {
uint64_t start = rdtsc();
busy_loop(10000); // Calibrated delay
times[i] = rdtsc() - start;
}
double stddev = compute_stddev(times, 100);
double mean = compute_mean(times, 100);
double cv = stddev / mean; // Coefficient of variation
// CV should be < 5% for stable measurements
assert(cv < 0.05);
}
6.3 Compiler Interference Check
Ensure the compiler isn't removing your benchmark code:
void test_not_optimized_away(void) {
volatile double result;
uint64_t start = rdtsc();
result = sum_naive(data, N);
uint64_t cycles = rdtsc() - start;
// A real sum over N elements cannot be nearly free; use a loose lower bound
assert(cycles > N / 2);
// Use result to prevent elimination
assert(result != 12345.6789); // Some value it won't equal
}
6.4 Edge Case Testing
void test_edge_cases(void) {
double data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
// Test with n not divisible by unroll factor
assert_close(sum_parallel4(data, 1), 1.0, 1e-10);
assert_close(sum_parallel4(data, 2), 3.0, 1e-10);
assert_close(sum_parallel4(data, 5), 15.0, 1e-10);
assert_close(sum_parallel4(data, 7), 28.0, 1e-10);
// Test with empty array
assert_close(sum_parallel4(data, 0), 0.0, 1e-10);
// Test with special values
data[0] = INFINITY;
assert(isinf(sum_parallel4(data, 8)));
data[0] = NAN;
assert(isnan(sum_parallel4(data, 8)));
}
7. Common Pitfalls and Debugging
7.1 Measurement Errors
| Pitfall | Symptom | Solution |
|---|---|---|
| Dead code elimination | CPE near 0 | Use volatile or print result |
| Frequency scaling | First runs slower | Warmup iterations, disable scaling |
| Turbo boost inconsistency | High variance | Disable turbo or use more trials |
| Timer overflow | Negative times | Use 64-bit counters |
| Memory prefetching | Arrays too small | Use arrays > L3 cache |
7.2 CPU Frequency Scaling
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Set to performance (requires root)
sudo cpupower frequency-set --governor performance
# Or for all CPUs:
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance | sudo tee $cpu
done
# Disable turbo boost (Intel)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
7.3 Compiler Optimization Interference
# Check what the compiler is doing
gcc -O3 -S -fverbose-asm kernel.c -o kernel.s
# Use Compiler Explorer (https://godbolt.org)
# to visualize assembly
# Prevent specific optimizations
__attribute__((optimize("no-unroll-loops")))
double sum_my_unroll(double *data, size_t n) { ... }
7.4 Memory Alignment Issues
// Ensure proper alignment for SIMD
double *data = aligned_alloc(32, n * sizeof(double));
// Or use compiler attributes
double data[N] __attribute__((aligned(32)));
// Check alignment at runtime
assert(((uintptr_t)data % 32) == 0);
7.5 False Sharing in Parallel Code
// BAD: Accumulators may share cache line
double sums[4]; // Could be on same 64-byte line
// GOOD: Pad to separate cache lines
struct {
double value;
char padding[56]; // 64 - 8 = 56 bytes padding
} sums[4];
8. Extensions
8.1 Beginner Extensions
- Add CSV output: Export results for spreadsheet analysis
- Implement warm/cold cache modes: Measure both scenarios
- Add progress indicator: Show which benchmark is running
- Create comparison plots: Visualize CPE across variants
8.2 Intermediate Extensions
- SIMD implementation: Use intrinsics for vectorized kernels (see the sketch after this list)
- Profile-guided optimization: Use hardware counters to guide changes
- Multi-threaded variants: Parallelize across cores (be careful with reduction!)
- Different data types: Compare float vs double vs int performance
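As a starting point for the SIMD extension, a minimal AVX sum sketch with two vector accumulators (requires -mavx or -march=native on a machine that supports it). Results will differ slightly from the naive sum because of reassociation.
#include <immintrin.h>
#include <stddef.h>
/* 4 doubles per vector, 2 vector accumulators to hide FP add latency. */
double sum_avx(const double *data, size_t n) {
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    size_t i;
    for (i = 0; i + 8 <= n; i += 8) {
        acc0 = _mm256_add_pd(acc0, _mm256_loadu_pd(data + i));
        acc1 = _mm256_add_pd(acc1, _mm256_loadu_pd(data + i + 4));
    }
    double lanes[4];
    _mm256_storeu_pd(lanes, _mm256_add_pd(acc0, acc1));
    double sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)                 /* scalar cleanup for the tail */
        sum += data[i];
    return sum;
}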
8.3 Advanced Extensions
- Auto-tuning framework: Automatically search for best unroll factor
- Roofline model: Plot achieved performance vs hardware limits
- Assembly kernels: Hand-write inner loops in assembly
- Cross-platform support: ARM NEON, AMD specific counters
9. Real-World Connections
9.1 Industry Applications
High-Frequency Trading: Every nanosecond matters. The techniques in this project are exactly what HFT firms use to optimize their systems.
Game Engines: Physics simulations, graphics rendering, and audio processing all depend on tight inner loops.
Machine Learning: Matrix operations in neural networks benefit from the same optimization strategies.
Scientific Computing: Weather modeling, molecular dynamics, and other simulations spend most time in computational kernels.
9.2 Related Tools and Libraries
- perf: Linux performance analysis tool
- Intel VTune: Commercial profiler with deep microarchitecture insight
- PAPI: Portable interface to hardware performance counters
- likwid: Lightweight performance tools for Linux
- BLAS/LAPACK: Highly optimized linear algebra libraries
9.3 Interview Relevance
This project prepares you for questions like:
- "How would you optimize this code?" (Can demonstrate systematic approach)
- "What limits the performance of this loop?" (Can identify bottleneck)
- "Why is this version faster?" (Can explain in terms of architecture)
- "How would you measure performance?" (Know pitfalls and methodology)
- "Explain Amdahl's Law" (Can calculate and apply limits)
10. Resources
10.1 Essential Reading
- CS:APP Chapter 5: "Optimizing Program Performance" - The foundation for this project
- CS:APP Chapter 6: "The Memory Hierarchy" - Understanding cache effects
- CS:APP Chapter 1.9: "Amdahl's Law" - Optimization limits
- Agner Fog's Optimization Manuals: agner.org/optimize - CPU architecture details
10.2 Tools
- perf: man perf - Linux performance analysis
- valgrind --tool=cachegrind: Cache simulation
- Intel Intrinsics Guide: SIMD intrinsics reference
- Compiler Explorer: godbolt.org - See generated assembly
10.3 Video Resources
- MIT 6.172: Performance Engineering of Software Systems (MIT OpenCourseWare)
- What Every Programmer Should Know About Memory: Ulrich Drepper (LWN.net)
- CppCon talks: Search for โoptimizationโ, โperformanceโ
10.4 Related Projects in This Series
- Previous: P1 (Toolchain Explorer) - understanding compilation
- Next: P9 (Cache Lab++) - deep dive into cache behavior
- Related: P7 (Y86-64 Simulator) - CPU pipeline internals
11. Self-Assessment Checklist
Understanding
- I can explain Amdahl's Law and calculate speedup limits
- I understand the difference between latency and throughput
- I can identify when code is latency-bound vs throughput-bound
- I can explain why multiple accumulators help performance
- I understand how branch misprediction affects performance
- I can reason about memory access patterns and cache behavior
Implementation
- My timing infrastructure produces stable, repeatable results
- All optimized variants produce correct results
- I have measured CPE for at least 4 kernels
- Each kernel has at least 3 variants with different optimizations
- I can demonstrate cases where optimizations backfire
Analysis
- My report explains each optimization in architectural terms
- I can predict when an optimization will help or hurt
- I have identified the bottleneck for each kernel variant
- I understand why unrolling alone doesn't help sum reduction
- I can explain the performance difference with different data patterns
Growth
- I used hardware performance counters (perf) to validate understanding
- I examined compiler-generated assembly to understand transformations
- I can now read a loop and estimate its CPE
- I trust measurements over intuition
12. Submission / Completion Criteria
Minimum Viable Completion:
- Measurement infrastructure works reliably
- At least 4 kernels with naive and 2+ optimized variants each
- Correctness tests pass
- Basic CPE table with results
Full Completion:
- All 5 kernels with 4+ variants each
- Statistical analysis (mean, median, stddev)
- Branch and cache measurements using perf
- Written report explaining each optimization
- Cases documented where optimization didn't help
Excellence (Going Above & Beyond):
- SIMD implementations using intrinsics
- Roofline model analysis
- Auto-tuning for optimal parameters
- Cross-architecture comparison (Intel vs AMD vs ARM)
- Assembly-level kernels for maximum performance
13. Real World Outcome
When you complete this project, here is exactly what you will see when running your performance clinic:
$ cd perf-clinic
$ make
gcc -O2 -march=native -Wall -g -c src/main.c -o build/main.o
gcc -O2 -march=native -Wall -g -c src/timing.c -o build/timing.o
gcc -O2 -march=native -Wall -g -c src/kernels/sum_reduce.c -o build/sum_reduce.o
...
gcc -o perf-clinic build/*.o -lm
$ ./perf-clinic --kernel sum --trials 100 --size 1000000
=== PERFORMANCE CLINIC: Sum Reduction ===
Array size: 1,000,000 doubles (8 MB)
Trials: 100 (reporting minimum, median, mean)
CPU: Intel Core i7-10700 @ 3.80GHz
Variant Min CPE Med CPE Mean CPE Speedup Bottleneck
-----------------------------------------------------------------------------
naive 3.01 3.04 3.05 1.00x Latency (FP add)
unroll_2x 3.00 3.02 3.03 1.01x Still latency-bound
unroll_4x 2.98 3.01 3.02 1.01x Unroll doesn't help alone
parallel_2acc 1.51 1.53 1.54 1.99x Latency / 2
parallel_4acc 0.76 0.78 0.79 3.90x Near throughput
parallel_8acc 0.51 0.53 0.54 5.70x Approaching limit
parallel_16acc 0.50 0.52 0.53 5.83x Diminishing returns
-----------------------------------------------------------------------------
Theoretical Bounds:
Latency bound (3-cycle add): CPE >= 3.0
Throughput bound (1 add/cyc): CPE >= 1.0
Memory bound (L1 cached): CPE >= 0.5
Analysis:
naive: Limited by 3-cycle FP add latency (dependency chain)
unroll_*: Unrolling alone cannot break dependency chain
parallel_*: Multiple accumulators create independent chains
Best practical: 8 accumulators achieves ~6x speedup
$ ./perf-clinic --kernel conditional --trials 100 --data random
=== PERFORMANCE CLINIC: Conditional Sum ===
Testing branch prediction effects with random data (50% threshold)
Variant CPE Mispredicts/1K Speedup
------------------------------------------------------------
branch_naive 8.42 498.2 1.00x
branch_sorted 2.15 1.3 3.92x
branchless 3.12 0.0 2.70x
------------------------------------------------------------
Analysis:
Random data: ~50% misprediction rate (498/1000 branches)
Sorted data: Highly predictable, near-perfect prediction
Branchless: No branches to mispredict, constant time
$ perf stat -e cycles,instructions,cache-misses ./perf-clinic --kernel matvec --size 1024
=== PERFORMANCE CLINIC: Matrix-Vector Multiply ===
Matrix: 1024x1024 doubles (8 MB)
Performance counter stats for './perf-clinic':
1,234,567,890 cycles
987,654,321 instructions # 0.80 insn per cycle
12,345 cache-misses
Variant CPE L1 Miss Rate Speedup
------------------------------------------------------------
row_major 1.23 0.1% 1.00x
col_major 8.45 12.4% 0.15x
blocked_32 0.95 0.2% 1.29x
------------------------------------------------------------
Analysis:
col_major: Strided access pattern causes cache thrashing
row_major: Sequential access, good spatial locality
blocked: Even better cache utilization through blocking
$ cat reports/optimization_report.md
# Performance Optimization Report
## Executive Summary
Achieved 5.7x speedup on sum reduction through loop parallelism.
Branch optimization yielded 2.7-3.9x speedup depending on data.
Memory access pattern optimization: 6.5x improvement on matrix ops.
## Key Findings
1. Loop unrolling WITHOUT multiple accumulators provides minimal benefit
2. Dependency chains are the primary bottleneck for reduction operations
3. Branch misprediction costs ~15 cycles on this architecture
4. Cache miss penalty is ~100 cycles for L3, ~200 for DRAM
...
14. The Core Question You're Answering
"What prevents code from running at hardware speed, and how do I systematically identify and remove performance bottlenecks - while proving that my optimizations actually work?"
This question is at the heart of performance engineering. You're learning:
- How to measure performance reliably (avoiding measurement pitfalls)
- How to identify whether code is latency-bound, throughput-bound, or memory-bound
- How compiler optimizations and CPU features interact with your code
- The discipline of hypothesis-driven optimization: measure, hypothesize, transform, verify
15. Concepts You Must Understand First
Before starting this project, ensure you understand these concepts:
| Concept | Where to Learn | Why It's Needed |
|---|---|---|
| Amdahl's Law | CS:APP 1.9.1 | Calculate maximum speedup before optimizing |
| Instruction Pipelining | CS:APP 4.4 | Understand latency vs throughput |
| Data Dependencies | CS:APP 5.5 | Identify what limits instruction-level parallelism |
| Loop Unrolling | CS:APP 5.8 | Know when and why unrolling helps |
| Multiple Accumulators | CS:APP 5.9 | Break dependency chains |
| Branch Prediction | CS:APP 5.12 | Understand misprediction costs |
| Cache Hierarchy | CS:APP 6.1-6.4 | Reason about memory access patterns |
| Condition Codes | CS:APP 3.6.1 | Understand what affects branch behavior |
16. Questions to Guide Your Design
As you develop your benchmark suite, ask yourself:
- Am I measuring what I think I'm measuring? Is the compiler optimizing away my code? Is the timer granular enough?
- What is the theoretical bound? Given latency and throughput, what's the best possible CPE?
- Which bound am I hitting? Am I latency-bound (dependency chain) or throughput-bound (functional units)?
- Is my data realistic? Cache-cold vs cache-hot? Random vs sequential access?
- Can I explain the speedup? For each optimization, can I say exactly why it helped?
- What didn't work? Understanding failures is as important as successes.
- Is this reproducible? Can someone else get the same results?
17. Thinking Exercise
Before you start coding, work through this analysis on paper:
Consider this reduction loop:
double sum_naive(double *data, int n) {
double sum = 0;
for (int i = 0; i < n; i++) {
sum += data[i];
}
return sum;
}
Given:
- FP add latency: 3 cycles
- FP add throughput: 1 per cycle
- Load throughput: 2 per cycle
- Array is cached in L1
Questions to answer:
- Draw the dependency graph for iterations i=0, 1, 2, 3. What depends on what?
- What is the CPE if we're latency-bound?
- What is the CPE if we're throughput-bound?
- This loop is latency-bound. Why? (What creates the dependency chain?)
- If we use 4 accumulators (sum0, sum1, sum2, sum3), what is the new CPE? Why?
- How many accumulators do we need to become throughput-bound?
Expected answers:
- Each sum += depends on the previous: sum_0 -> sum_1 -> sum_2 -> sum_3 (serial chain)
- Throughput-bound: CPE = 1.0 (one add per cycle if no dependencies)
- Each addition waits for the previous sum value - this is a RAW dependency chain
- With 4 accumulators the latency bound drops to 3/4 = 0.75, but the stated throughput bound of 1 add/cycle now dominates, so CPE ~= 1.0
- To hit throughput bound of 1 add/cycle with 3-cycle latency: need 3 accumulators minimum
18. The Interview Questions They'll Ask
After completing this project, you should be able to answer these interview questions:
- "How would you optimize a slow function?"
- First: Profile to find where time is spent (don't guess)
- Calculate Amdahl's Law limit - is optimization worthwhile?
- Identify the bottleneck: latency? throughput? memory? branches?
- Apply targeted transformation: unroll+multiple accumulators for latency, cache blocking for memory
- Measure again to verify improvement
- "What is the difference between latency and throughput?"
- Latency: Time for one operation to complete (start to finish)
- Throughput: Rate at which operations can be started
- Pipelined units have high throughput (a new operation can start every cycle or two) yet still have multi-cycle latency (each individual operation takes time)
- Dependency chains are latency-bound; independent work is throughput-bound
- "Explain Amdahl's Law and give an example."
- Speedup = 1 / ((1-p) + p/S) where p = parallelizable fraction, S = speedup of that fraction
- Example: 90% of time in function, optimize 10x: Speedup = 1/(0.1 + 0.9/10) = 1/0.19 = 5.26x
- Key insight: Serial portion limits maximum speedup (1/(1-p))
- "Why does loop unrolling help performance?"
- Reduces loop overhead (fewer increments, compares, branches)
- More work visible to out-of-order engine
- BUT: doesn't break dependency chains alone
- Need multiple accumulators to truly parallelize work
- "How does branch misprediction affect performance?"
- CPU speculatively executes predicted path
- On mispredict: flush pipeline, restart from correct path
- Cost: 15-20 cycles on modern CPUs
- Random branches = ~50% mispredict rate = significant overhead
- "How do you ensure benchmark results are valid?"
- Use volatile or print results to prevent dead code elimination
- Warm up: run several iterations before measuring
- Multiple trials: report minimum, median, and standard deviation
- Consistent state: control cache (cold vs warm), CPU frequency (disable turbo)
- Verify correctness: optimized code must produce same results
19. Hints in Layers
If you get stuck, reveal hints one at a time. Try each level before moving to the next.
Hint 1 - Reliable Timing
Use rdtsc for cycle counting or clock_gettime(CLOCK_MONOTONIC) for wall time. Always run multiple trials and take the minimum (represents best case without interference).
Hint 2 - Preventing Dead Code Elimination
Declare your result as volatile, or print it at the end. Check the assembly (use -S or godbolt.org) to verify your loop wasn't optimized away.
Hint 3 - Understanding the Bottleneck
If unrolling doesn't help, you're probably latency-bound. Count the operations in your dependency chain. Add accumulators equal to the latency/throughput ratio.
Hint 4 - When Branch Optimization Matters
Only optimize branches if: (1) data is unpredictable, (2) branch is in hot loop, (3) both paths are cheap. Use perf stat -e branch-misses to measure.
20. Books That Will Help
| Topic | Book | Chapter/Section |
|---|---|---|
| Performance Fundamentals | CS:APP 3e | Chapter 5 (entire chapter) |
| Amdahl's Law | CS:APP 3e | Section 1.9.1 |
| Loop Unrolling & Parallelism | CS:APP 3e | Sections 5.8-5.9 |
| Branch Prediction | CS:APP 3e | Section 5.12 |
| Memory Hierarchy | CS:APP 3e | Chapter 6 |
| CPU Microarchitecture | Agner Fog's Manuals | Microarchitecture guide |
| Performance Analysis | "Systems Performance" by Brendan Gregg | Chapters 6-7 |
| Low-Level Optimization | "What Every Programmer Should Know About Memory" | Sections 3-6 |
This guide was expanded from CSAPP_3E_DEEP_LEARNING_PROJECTS.md. For the complete learning path, see the project index.