Learn Performance Engineering: From Zero to Systems Wizard

Goal: Deeply understand software performance engineering—from bit-level optimizations to parallel algorithms, cache-efficient data structures, and building systems that squeeze every cycle out of modern hardware.

Based on: MIT 6.172 Performance Engineering of Software Systems (Fall 2018)


Why Performance Engineering Matters

In an era of “just throw more hardware at it,” true performance engineering is a lost art. Yet the difference between a naive implementation and an optimized one can be 100x or more—that’s the difference between a query taking 10 seconds and 100 milliseconds, between needing 100 servers and needing 1.

Modern CPUs are marvels of engineering: multiple cores, deep pipelines, branch predictors, multiple cache levels, SIMD units, and out-of-order execution. But to exploit this power, you need to understand:

  • How the machine actually works (not the abstraction)
  • Where time goes (measurement, not guessing)
  • What the compiler can and cannot do
  • How to structure data for cache efficiency
  • How to parallelize without destroying performance

After completing these projects, you will:

  • Write code that runs 10-100x faster than naive implementations
  • Understand assembly output and reason about instruction-level performance
  • Build parallel programs that scale with core count
  • Design cache-efficient algorithms and data structures
  • Profile and optimize real systems systematically
  • Understand the engineering tradeoffs in modern systems

Core Concept Analysis

The Performance Engineering Mindset

"Premature optimization is the root of all evil" - Knuth

But the fuller quote continues:

"We should forget about small efficiencies, say about 97% of the time...
Yet we should not pass up our opportunities in that critical 3%." - Knuth

The Memory Hierarchy

┌──────────────────────────────────────────────────────────────┐
│                        Registers                              │ ~1 cycle
│                         (few KB)                              │
├──────────────────────────────────────────────────────────────┤
│                         L1 Cache                              │ ~4 cycles
│                         (32-64 KB)                            │
├──────────────────────────────────────────────────────────────┤
│                         L2 Cache                              │ ~12 cycles
│                        (256 KB - 1 MB)                        │
├──────────────────────────────────────────────────────────────┤
│                         L3 Cache                              │ ~40 cycles
│                         (8-64 MB)                             │
├──────────────────────────────────────────────────────────────┤
│                        Main Memory                            │ ~200 cycles
│                         (GB - TB)                             │
├──────────────────────────────────────────────────────────────┤
│                           SSD                                 │ ~50,000 cycles
│                         (100s GB)                             │
├──────────────────────────────────────────────────────────────┤
│                           HDD                                 │ ~10,000,000 cycles
│                          (TBs)                                │
└──────────────────────────────────────────────────────────────┘

Fundamental Concepts

  1. Work and Span
    • Work: Total operations performed (sequential time)
    • Span: Longest dependency chain (parallel time with infinite processors)
    • Parallelism: Work / Span (how much parallel speedup is possible)
  2. Bentley’s Rules for Optimization
    • Data Structure Augmentation
    • Precomputation
    • Compile-Time Initialization
    • Caching
    • Lazy Evaluation
    • Loop Transformations (unrolling, fusion, fission)
    • Logic Optimizations
  3. The Roofline Model
    • Performance bound by either compute or memory bandwidth
    • Operational Intensity = FLOPs / Bytes moved
    • Know which bound applies to your algorithm
  4. Amdahl’s Law (see the worked example just after this list)
    Speedup = 1 / ((1 - P) + P/S)
    
    Where: P = parallelizable fraction
           S = speedup of the parallel portion
    
  5. Cache Efficiency
    • Temporal locality: reuse data while it’s hot
    • Spatial locality: access data contiguously
    • Cache-oblivious algorithms: work well regardless of cache size
  6. Instruction-Level Parallelism (ILP)
    • Modern CPUs execute multiple instructions per cycle
    • Data dependencies limit ILP
    • Branch mispredictions flush the pipeline (~15-20 cycles)
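
A quick worked example of Amdahl’s Law: if 90% of a program parallelizes perfectly and that portion runs on 8 cores (P = 0.9, S = 8), the overall speedup is 1 / ((1 - 0.9) + 0.9/8) = 1 / 0.2125 ≈ 4.7x — well short of 8x, which is why the serial fraction dominates at high core counts.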

Project 1: Performance Measurement Toolkit

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Profiling / Measurement / Benchmarking
  • Software or Tool: perf, RDTSC, clock_gettime
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A micro-benchmarking framework that accurately measures CPU cycles, cache misses, branch mispredictions, and memory bandwidth for arbitrary code snippets—with statistical rigor (warmup, multiple runs, outlier detection, confidence intervals).

Why it teaches performance engineering: You cannot optimize what you cannot measure. This project teaches you the fundamental skill of accurate performance measurement—understanding why naive timing is often wrong (warmup effects, frequency scaling, context switches), how hardware performance counters work, and how to draw valid conclusions from noisy data.

Core challenges you’ll face:

  • Cycle-accurate timing with RDTSC → maps to understanding CPU timestamp counters and serialization
  • Accessing hardware performance counters → maps to perf_event_open, PMU architecture
  • Eliminating measurement noise → maps to CPU frequency scaling, TurboBoost, process affinity
  • Statistical analysis of results → maps to variance, confidence intervals, outlier detection

Key Concepts:

  • RDTSC and Timing: “What Every Programmer Should Know About Memory” Section 7 - Ulrich Drepper
  • Hardware Performance Counters: “Computer Architecture” Chapter 4 - Hennessy & Patterson
  • Statistical Benchmarking: “Systems Performance” Chapter 2 - Brendan Gregg
  • CPU Frequency Scaling: Linux kernel documentation on cpufreq governors

Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: C programming, basic statistics, familiarity with Linux.

Real world outcome:

$ ./perf_measure -b matrix_mult -n 1000 -s 256
Benchmark: matrix_mult (256x256)
Runs: 1000, Warmup: 100

┌────────────────────────────────────────────────────────────┐
│ Metric              │ Mean      │ Std Dev  │ 95% CI       │
├────────────────────────────────────────────────────────────┤
│ CPU Cycles          │ 45.2M     │ 1.2M     │ ±0.08M       │
│ Instructions        │ 134.2M    │ 0.001M   │ ±0.0001M     │
│ IPC                 │ 2.97      │ 0.08     │ ±0.005       │
│ L1D Cache Misses    │ 2.1M      │ 0.05M    │ ±0.003M      │
│ L3 Cache Misses     │ 0.26M     │ 0.02M    │ ±0.001M      │
│ Branch Mispred      │ 15.2K     │ 0.8K     │ ±0.05K       │
│ Time (ms)           │ 15.2      │ 0.4      │ ±0.03        │
└────────────────────────────────────────────────────────────┘

Memory Bandwidth: 4.3 GB/s
Operational Intensity: 2.1 FLOPs/byte
Bound by: MEMORY (below roofline)

Implementation Hints:

The core of cycle-accurate timing on x86:

Serialize with CPUID (ensures all previous instructions complete)
Read TSC with RDTSC/RDTSCP
Run your code
Serialize again
Read TSC again
Difference = cycles elapsed
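
A minimal sketch of that sequence for GCC/Clang on x86-64 — the helper name measure_cycles is illustrative, and using CPUID for serialization plus RDTSCP for the second read is one common choice, not the only one:

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp */
#include <cpuid.h>       /* __get_cpuid (GCC/Clang): executes a serializing CPUID */

/* Measure elapsed TSC ticks around fn(arg). CPUID drains earlier work before
   the first read; RDTSCP waits for prior instructions before the second read,
   and the trailing CPUID keeps later code out of the measured window. */
static inline uint64_t measure_cycles(void (*fn)(void *), void *arg) {
    unsigned a, b, c, d, aux;
    __get_cpuid(0, &a, &b, &c, &d);      /* serialize before starting the clock */
    uint64_t start = __rdtsc();          /* read timestamp counter */
    fn(arg);                             /* code under test */
    uint64_t end = __rdtscp(&aux);       /* read TSC after prior insns retire */
    __get_cpuid(0, &a, &b, &c, &d);      /* fence off anything that follows */
    return end - start;
}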

For hardware counters, Linux provides perf_event_open():

  • What events are available? Check /sys/devices/cpu/events/
  • You can read multiple counters simultaneously using event groups
  • Consider: How do you handle multiplexing if you need more counters than available PMUs?
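
A minimal perf_event_open() sketch for a single hardware counter (CPU cycles, current thread, user space only); error handling is mostly omitted, and the structure follows the perf_event_open(2) man page rather than any particular framework design:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <string.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

/* Open a counter for user-space CPU cycles on the calling thread, any CPU. */
static long perf_open_cycles(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;                 /* start disabled, enable around the region */
    attr.exclude_kernel = 1;           /* count user-space cycles only */
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int fd = (int)perf_open_cycles();
    if (fd < 0) return 1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... code under test ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t cycles = 0;
    read(fd, &cycles, sizeof(cycles));  /* default read_format: a single u64 */
    printf("cycles: %llu\n", (unsigned long long)cycles);
    close(fd);
    return 0;
}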

For statistical rigor:

  • How many warmup runs eliminate JIT/cache cold effects?
  • How do you detect outliers? (IQR method, Z-score?)
  • What sample size gives you 95% confidence?
  • Should you use mean, median, or minimum?

Questions to ask yourself:

  • Why might back-to-back measurements differ by 1000x?
  • What happens if TurboBoost kicks in mid-measurement?
  • How does hyperthreading affect your measurements?

Learning milestones:

  1. RDTSC timing works consistently → You understand instruction serialization
  2. Hardware counters report correctly → You understand the PMU interface
  3. Results are reproducible across runs → You understand measurement methodology
  4. You can identify noise vs. real variance → You understand performance statistics

Project 2: Matrix Multiplication Optimizer

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust, Fortran
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Numerical Computing / Cache Optimization / SIMD
  • Software or Tool: BLAS comparison, Intel Intrinsics
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron (Chapter 5 & 6)

What you’ll build: A matrix multiplication library that progressively improves from naive O(n³) implementation to a highly optimized version using blocking, SIMD intrinsics, and loop transformations—achieving within 50% of Intel MKL performance.

Why it teaches performance engineering: Matrix multiplication is THE canonical example for performance optimization. The naive algorithm leaves 99% of CPU performance on the table. By optimizing it, you encounter every major concept: cache blocking, loop tiling, SIMD vectorization, instruction-level parallelism, and memory access patterns. The gap between naive (~1 GFLOP) and optimized (~100+ GFLOP) is so dramatic that it permanently changes how you think about code.

Core challenges you’ll face:

  • Cache blocking/tiling → maps to fitting working set in L1/L2 cache
  • Loop ordering (ijk vs ikj vs …) → maps to spatial locality and stride access
  • SIMD vectorization → maps to AVX2/AVX-512 intrinsics for parallel arithmetic
  • Register tiling → maps to keeping data in registers to avoid cache access
  • TLB efficiency → maps to minimizing page faults in large matrices

Key Concepts:

  • Loop Tiling for Cache: “Computer Systems: A Programmer’s Perspective” Chapter 6.6 - Bryant & O’Hallaron
  • SIMD Intrinsics: Intel Intrinsics Guide (online reference)
  • Matrix Multiplication Optimization: MIT 6.172 Lecture 1 - Charles Leiserson
  • Cache-Oblivious Algorithms: “Introduction to Algorithms” Chapter 27 - CLRS

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: C programming, understanding of caches, Project 1 (measurement).

Real world outcome:

$ ./matmul_bench -n 1024
Matrix Multiplication Benchmark (1024x1024, double precision)
Peak theoretical: 150 GFLOPS (assuming 3 GHz, 8-wide AVX, FMA)

┌─────────────────────────────────────────────────────────────┐
│ Implementation          │ Time (ms) │ GFLOPS │ % of Peak   │
├─────────────────────────────────────────────────────────────┤
│ Naive (ijk)             │ 8,542     │ 0.25   │ 0.17%       │
│ Loop interchange (ikj)  │ 1,923     │ 1.11   │ 0.74%       │
│ 32x32 blocking          │ 412       │ 5.19   │ 3.46%       │
│ + Loop unrolling (4x)   │ 198       │ 10.81  │ 7.21%       │
│ + AVX2 SIMD             │ 52        │ 41.12  │ 27.41%      │
│ + Register tiling       │ 31        │ 69.02  │ 46.01%      │
│ + Prefetching           │ 26        │ 82.28  │ 54.85%      │
├─────────────────────────────────────────────────────────────┤
│ Intel MKL (reference)   │ 18        │ 118.82 │ 79.21%      │
└─────────────────────────────────────────────────────────────┘

Final speedup over naive: 328x

Implementation Hints:

Start with the naive implementation and measure it carefully:

for i in 0..n:
    for j in 0..n:
        for k in 0..n:
            C[i][j] += A[i][k] * B[k][j]

The first optimization is free—just change loop order. Why does ikj outperform ijk by 4-8x? Draw the memory access patterns.
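
For reference, a plain ikj version in C (row-major, flat arrays; the function name is illustrative) — note how the innermost loop becomes stride-1 over both B and C while A[i][k] stays in a register:

/* ikj ordering: the inner loop streams one row of B and one row of C
   contiguously, so cache lines and hardware prefetching are fully used. */
void matmul_ikj(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a_ik = A[i * n + k];       /* loop-invariant for inner loop */
            for (int j = 0; j < n; j++)
                C[i * n + j] += a_ik * B[k * n + j];
        }
}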

For cache blocking, think about:

  • What block size fits in L1 cache? (32KB ≈ 4096 doubles ≈ 64x64 block… but you need 3 matrices)
  • The answer is usually 32x32 or 64x64 for doubles
  • Why does blocking help? What’s the new miss rate?
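
A minimal blocking sketch under the simplifying assumption that n is a multiple of the block size; BS = 64 makes each tile 64×64×8 B = 32 KB, so the three-tile working set targets L2 rather than L1 — treat it as a starting point and tune by measuring:

#define BS 64   /* block size: tune for your cache, try 32 and 64 */

/* Blocked matmul sketch (assumes n % BS == 0 for brevity). */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* multiply one BS x BS tile: its working set stays cache-resident */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a_ik = A[i * n + k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i * n + j] += a_ik * B[k * n + j];
                    }
}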

For SIMD:

  • AVX2 gives you 4 doubles (256-bit) or AVX-512 gives 8 doubles (512-bit)
  • How do you load 4 doubles, multiply them, and accumulate?
  • What about when dimensions aren’t multiples of 4?
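
One way the inner loop can look with AVX2 + FMA intrinsics (compile with -mavx2 -mfma; assumes row-major flat arrays and that n is a multiple of 4 — the remainder handling the bullet above asks about is deliberately left out):

#include <immintrin.h>

/* AVX2 sketch: update 4 doubles of one C row per iteration. */
void matmul_ikj_avx2(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            __m256d a = _mm256_set1_pd(A[i * n + k]);    /* broadcast A[i][k] */
            for (int j = 0; j < n; j += 4) {
                __m256d c = _mm256_loadu_pd(&C[i * n + j]);
                __m256d b = _mm256_loadu_pd(&B[k * n + j]);
                c = _mm256_fmadd_pd(a, b, c);            /* c += a * b */
                _mm256_storeu_pd(&C[i * n + j], c);
            }
        }
}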

Register tiling insight:

  • You have 16 YMM registers (AVX2)
  • Can you keep a 4x4 block of C in registers while streaming A and B?
  • This is the key to hitting peak FLOPS

Questions to explore:

  • What happens when matrices don’t fit in L3 cache?
  • How would you parallelize this across cores? (Project 6)
  • Why is Strassen’s algorithm not used in practice for moderate sizes?

Learning milestones:

  1. Loop interchange gives 4x speedup → You understand stride-1 access patterns
  2. Blocking gives 10x over interchange → You understand cache working sets
  3. SIMD gives 4x over blocked → You understand vector parallelism
  4. You achieve 50%+ of MKL → You understand the full optimization stack

Project 3: Bit Manipulation Library

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Bit Manipulation / Low-Level Optimization
  • Software or Tool: None (pure computation)
  • Main Book: “Hacker’s Delight” by Henry S. Warren Jr.

What you’ll build: A library of bit manipulation routines that implement common operations (popcount, leading/trailing zeros, bit reversal, next permutation, etc.) using clever bit hacks—with benchmarks comparing to naive implementations and compiler intrinsics.

Why it teaches performance engineering: Bit hacks demonstrate that the same computation can be done in radically different ways with orders of magnitude performance difference. They teach you to think about data representation, exploit instruction-level parallelism, and understand how computers actually compute. These techniques appear everywhere: hash functions, compression, cryptography, graphics, and systems programming.

Core challenges you’ll face:

  • Implementing popcount without loops → maps to parallel bit counting with shifts and masks
  • Finding leading/trailing zeros → maps to binary search with bits
  • Understanding why these work → maps to bitwise arithmetic and Boolean algebra
  • Comparing with hardware instructions → maps to when to use intrinsics vs. pure C

Key Concepts:

  • Bit Twiddling: “Hacker’s Delight” Chapters 2-5 - Henry Warren
  • Bit Hacks Collection: MIT 6.172 Lecture 3 - Bit Hacks
  • Population Count: “Computer Systems: A Programmer’s Perspective” Practice Problem 2.66
  • POPCNT and LZCNT Instructions: Intel 64 and IA-32 Architectures Software Developer’s Manual

Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: C programming, binary number system, bitwise operators.

Real world outcome:

$ ./bithacks_test
Bit Manipulation Library - Performance Tests

┌────────────────────────────────────────────────────────────┐
│ Operation           │ Naive    │ BitHack  │ Intrinsic    │
├────────────────────────────────────────────────────────────┤
│ popcount(64-bit)    │ 64 ops   │ 12 ops   │ 1 op (POPCNT)│
│ leading zeros       │ 64 ops   │ 5 ops    │ 1 op (LZCNT) │
│ trailing zeros      │ 64 ops   │ 5 ops    │ 1 op (TZCNT) │
│ reverse bits        │ 64 ops   │ 17 ops   │ N/A          │
│ next power of 2     │ 6+ ops   │ 5 ops    │ 2 ops        │
│ is power of 2       │ loop     │ 2 ops    │ 2 ops        │
│ log2 (integer)      │ 64 ops   │ 5 ops    │ 1 op (LZCNT) │
│ abs (branchless)    │ 1 branch │ 3 ops    │ 3 ops        │
│ sign extend         │ 1 branch │ 2 ops    │ 2 ops        │
│ bit interleave      │ 64 ops   │ 12 ops   │ PDEP 1 op    │
└────────────────────────────────────────────────────────────┘

Detailed popcount benchmark (1M random 64-bit values):
  Naive (loop):      45.2 ms
  SWAR (parallel):   12.1 ms (3.7x faster)
  __builtin_popcountll: 2.3 ms (19.6x faster)
  Hardware POPCNT:   1.8 ms (25.1x faster)

Implementation Hints:

The classic popcount algorithm (SWAR - SIMD Within A Register):

Think of a 64-bit integer as 64 1-bit counters
Step 1: Add pairs of bits → 32 2-bit counters
Step 2: Add pairs of 2-bit counters → 16 4-bit counters
Step 3: Add pairs of 4-bit counters → 8 8-bit counters
...continue until one 64-bit counter

The magic constants mask out alternating positions.
What are those constants? (0x5555..., 0x3333..., 0x0F0F..., etc.)
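
Written out as a hedged reference, this is the widely known SWAR formulation, including the final multiply that sums the per-byte counters:

#include <stdint.h>

/* SWAR popcount: each mask keeps alternating 1-, 2-, and 4-bit fields so
   neighbouring counters can be added without carrying into each other. */
static inline int popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);                           /* 2-bit counters */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); /* 4-bit counters */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           /* 8-bit counters */
    return (int)((x * 0x0101010101010101ULL) >> 56);                      /* sum all bytes */
}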

For leading zeros:

Binary search, but with bits!
If the top 32 bits are zero, we know result ≥ 32
If the top 16 of remaining are zero, result ≥ 48
...continue halving

How do you check "top N bits are zero" with one operation?
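
One answer is a shift-and-test cascade; a sketch of the resulting count-leading-zeros routine (returning 64 for zero input by convention):

#include <stdint.h>

/* Binary search on bit groups: each test checks "are the top N bits zero?"
   with a single shift, then discards them by shifting left. */
static inline int clz64(uint64_t x) {
    if (x == 0) return 64;
    int n = 0;
    if ((x >> 32) == 0) { n += 32; x <<= 32; }
    if ((x >> 48) == 0) { n += 16; x <<= 16; }
    if ((x >> 56) == 0) { n += 8;  x <<= 8;  }
    if ((x >> 60) == 0) { n += 4;  x <<= 4;  }
    if ((x >> 62) == 0) { n += 2;  x <<= 2;  }
    if ((x >> 63) == 0) { n += 1; }
    return n;
}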

The “is power of 2” hack: x && !(x & (x-1))

  • Why does this work?
  • What’s special about x - 1 when x is a power of 2?

Questions to explore:

  • Why is branchless abs faster than x < 0 ? -x : x?
  • What’s the relationship between popcount and Hamming distance?
  • How do modern CPUs implement POPCNT in hardware?
  • When should you use intrinsics vs. let the compiler optimize?

Learning milestones:

  1. You can explain why each hack works → You understand bit-level reasoning
  2. You implement popcount with SWAR → You understand parallel computation in registers
  3. You know when to use intrinsics → You understand the performance tradeoffs
  4. You can derive new bit hacks → You’ve internalized the techniques

Project 4: Assembly Language Analyzer

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C (analyzing assembly output)
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Assembly Language / Compiler Output / CPU Architecture
  • Software or Tool: objdump, Compiler Explorer (godbolt.org)
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron (Chapter 3)

What you’ll build: A tool that takes C source code, compiles it at various optimization levels, disassembles the output, and provides annotated analysis—showing instruction count, estimated cycles, register usage, memory accesses, and potential bottlenecks (dependencies, cache misses, branch mispredictions).

Why it teaches performance engineering: To optimize code, you must see what the compiler produces. This project forces you to read assembly fluently and understand the mapping from C to machine code. You’ll learn why some “obvious” optimizations don’t help (compiler already did them), why some code is slow (bad instruction selection), and how to write “compiler-friendly” C.

Core challenges you’ll face:

  • Parsing x86-64 assembly → maps to instruction format, operand encoding
  • Understanding calling conventions → maps to register usage, stack frames
  • Identifying optimization patterns → maps to loop unrolling, vectorization, strength reduction
  • Estimating cycle counts → maps to instruction latency, throughput, dependencies

Key Concepts:

  • x86-64 Assembly: “Computer Systems: A Programmer’s Perspective” Chapter 3 - Bryant & O’Hallaron
  • Compiler Optimizations: MIT 6.172 Lecture 5 - C to Assembly & Lecture 9 - What Compilers Can and Cannot Do
  • Intel Architecture: Intel 64 and IA-32 Optimization Reference Manual
  • Instruction Tables: Agner Fog’s Instruction Tables (online)

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: C programming, basic understanding of CPU architecture, familiarity with assembly concepts.

Real world outcome:

$ ./asm_analyzer sum_array.c -O0 -O2 -O3

Source: sum_array.c
┌─────────────────────────────────────────────────────────────┐
│ long sum_array(long *arr, int n) {                          │
│     long sum = 0;                                           │
│     for (int i = 0; i < n; i++) {                          │
│         sum += arr[i];                                      │
│     }                                                       │
│     return sum;                                             │
│ }                                                           │
└─────────────────────────────────────────────────────────────┘

═══════════════════════════════════════════════════════════════
OPTIMIZATION LEVEL: -O0 (no optimization)
═══════════════════════════════════════════════════════════════
Instructions: 23
Estimated cycles/iteration: 12-15
Issues identified:
  ⚠ Memory spilling: sum stored to stack every iteration
  ⚠ No loop unrolling
  ⚠ Redundant loads of loop variables

Assembly (annotated):
  push   rbp                  ; Function prologue
  mov    rbp, rsp
  mov    QWORD PTR [rbp-24], rdi  ; arr → stack (SPILL!)
  mov    DWORD PTR [rbp-28], esi  ; n → stack (SPILL!)
  mov    QWORD PTR [rbp-8], 0     ; sum = 0
  mov    DWORD PTR [rbp-12], 0    ; i = 0
.L3:                              ; Loop start
  mov    eax, DWORD PTR [rbp-12]  ; Load i (from stack!)
  cmp    eax, DWORD PTR [rbp-28]  ; Compare i < n (from stack!)
  jge    .L2                      ; Exit if >=
  mov    eax, DWORD PTR [rbp-12]  ; Load i AGAIN!
  cdqe
  lea    rdx, [0+rax*8]
  mov    rax, QWORD PTR [rbp-24]  ; Load arr (from stack!)
  add    rax, rdx
  mov    rax, QWORD PTR [rax]     ; Load arr[i]
  add    QWORD PTR [rbp-8], rax   ; sum += arr[i] (to stack!)
  add    DWORD PTR [rbp-12], 1    ; i++
  jmp    .L3
.L2:
  mov    rax, QWORD PTR [rbp-8]   ; Return sum
  pop    rbp
  ret

═══════════════════════════════════════════════════════════════
OPTIMIZATION LEVEL: -O2 (standard optimization)
═══════════════════════════════════════════════════════════════
Instructions: 8
Estimated cycles/iteration: 3-4
Optimizations applied:
  ✓ Register allocation (no spills)
  ✓ Pointer increment instead of indexing
  ✓ Minimal loop overhead

Assembly (annotated):
  test   esi, esi           ; if (n <= 0)
  jle    .L4                 ;   return 0
  mov    eax, esi
  lea    rdx, [rdi+rax*8]    ; end = arr + n
  xor    eax, eax            ; sum = 0
.L3:                         ; Loop (tight!)
  add    rax, QWORD PTR [rdi] ; sum += *arr
  add    rdi, 8              ; arr++
  cmp    rdi, rdx            ; while (arr != end)
  jne    .L3
  ret
.L4:
  xor    eax, eax            ; return 0
  ret

═══════════════════════════════════════════════════════════════
OPTIMIZATION LEVEL: -O3 (aggressive optimization)
═══════════════════════════════════════════════════════════════
Instructions: 24 (but vectorized!)
Estimated cycles/iteration: 0.5-1 (4 elements per iteration)
Optimizations applied:
  ✓ All -O2 optimizations
  ✓ Loop vectorization (SSE/AVX)
  ✓ 4x unrolling with SIMD

Vectorized loop uses VPADDQ (packed 64-bit add) on 4 longs per YMM register!

Implementation Hints:

Structure of the analyzer:

1. Invoke compiler: gcc -S -masm=intel [opts] file.c -o file.s
2. Parse assembly output (or use objdump on .o file)
3. For each instruction:
   - Classify: ALU, memory, branch, etc.
   - Track register def/use chains
   - Estimate latency from instruction tables
4. Identify patterns:
   - Stack spills (mov to [rbp-N])
   - Vectorization (xmm/ymm registers, packed ops)
   - Loop unrolling (duplicated loop bodies)
5. Generate report
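
A deliberately crude sketch of steps 1-3 to get started (the file name sum_array.c and the /tmp output path are just examples; real classification needs a proper tokenizer plus instruction tables):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    /* Step 1: compile to Intel-syntax assembly. */
    if (system("gcc -O2 -S -masm=intel sum_array.c -o /tmp/out.s") != 0) return 1;
    FILE *f = fopen("/tmp/out.s", "r");
    if (!f) return 1;
    char line[256];
    int total = 0, mem = 0, branch = 0, vector = 0;
    /* Steps 2-3: walk the file and classify each instruction line crudely. */
    while (fgets(line, sizeof line, f)) {
        if (line[0] != '\t' || line[1] == '.') continue;      /* skip labels/directives */
        total++;
        if (strstr(line, "PTR")) mem++;                        /* has a memory operand */
        if (line[1] == 'j' || strncmp(line + 1, "call", 4) == 0) branch++;
        if (strstr(line, "xmm") || strstr(line, "ymm") || strstr(line, "zmm")) vector++;
    }
    fclose(f);
    printf("instructions=%d memory=%d branch=%d vector=%d\n", total, mem, branch, vector);
    return 0;
}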

Key patterns to recognize:

  • Strength reduction: x * 8 → x << 3
  • Loop-invariant code motion: computation moved out of loop
  • Induction variable optimization: array indexing → pointer increment
  • Tail call optimization: jmp instead of call; ret

For cycle estimation, use Agner Fog’s tables:

  • Latency: cycles until result is ready
  • Throughput: cycles per instruction (can be < 1 due to superscalar)
  • Consider: Which instructions can execute in parallel?

Questions to explore:

  • Why does -O0 spill everything to the stack?
  • What prevents -O2 from vectorizing this loop?
  • How do you recognize auto-vectorized code?
  • What flags enable/disable specific optimizations?

Learning milestones:

  1. You can read x86-64 assembly fluently → You understand the instruction set
  2. You identify compiler optimizations → You understand what compilers do
  3. You predict what the compiler will produce → You can write optimizer-friendly code
  4. You spot missed optimizations → You can guide the compiler with hints

Project 5: Custom Memory Allocator

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Memory Management / Systems Programming
  • Software or Tool: Building replacement for malloc
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you’ll build: A custom malloc/free implementation with multiple allocation strategies (first-fit, best-fit, segregated free lists, buddy allocator) and performance characteristics tuned for different workloads—benchmarked against glibc malloc and jemalloc.

Why it teaches performance engineering: Memory allocation is a perfect microcosm of performance engineering: you must balance speed, memory overhead, fragmentation, and cache efficiency. Every systems programmer should understand what happens when they call malloc(). This project teaches memory layout, pointer arithmetic, fragmentation strategies, and the trade-offs that real allocators make.

Core challenges you’ll face:

  • Managing the heap with sbrk/mmap → maps to virtual memory, page allocation
  • Free list management → maps to data structure choice, coalescing
  • Minimizing fragmentation → maps to internal vs. external fragmentation
  • Thread-safe allocation → maps to lock-free data structures or per-thread caches

Key Concepts:

  • Dynamic Memory Allocation: “Computer Systems: A Programmer’s Perspective” Chapter 9.9 - Bryant & O’Hallaron
  • Storage Allocation: MIT 6.172 Lectures 11-12 - Storage Allocation
  • Allocator Design: “The Linux Programming Interface” Chapter 7 - Michael Kerrisk
  • Buddy System: Knuth, “The Art of Computer Programming” Volume 1

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: C programming, pointers, virtual memory concepts, Projects 1-2.

Real world outcome:

$ ./malloc_bench -a custom -w mixed
Memory Allocator Benchmark

Workload: Mixed (varying sizes, random order free)
Operations: 10M allocations, 10M frees

┌─────────────────────────────────────────────────────────────┐
│ Allocator        │ Alloc/s │ Free/s  │ Memory │ Fragmentation│
├─────────────────────────────────────────────────────────────┤
│ glibc malloc     │ 15.2M   │ 18.1M   │ 1.00x  │ 12%          │
│ jemalloc         │ 22.1M   │ 25.3M   │ 1.05x  │ 8%           │
│ custom-firstfit  │ 8.3M    │ 10.2M   │ 1.02x  │ 35%          │
│ custom-bestfit   │ 3.1M    │ 10.2M   │ 1.02x  │ 18%          │
│ custom-segregated│ 18.4M   │ 21.2M   │ 1.08x  │ 10%          │
│ custom-buddy     │ 20.1M   │ 22.8M   │ 1.50x  │ 0% (internal)│
└─────────────────────────────────────────────────────────────┘

$ ./malloc_test_suite custom
Testing: custom allocator
  ✓ Basic allocation and free
  ✓ Alignment (16-byte)
  ✓ Realloc preserves data
  ✓ Large allocations (> 128KB use mmap)
  ✓ Coalescing adjacent free blocks
  ✓ No memory leaks
  ✓ Thread safety (4 threads, 1M ops each)

All tests passed!

Implementation Hints:

Basic structure of a block:

┌──────────────────────────────────────────────────────────┐
│ Header (8 bytes): size | flags (allocated/free)          │
├──────────────────────────────────────────────────────────┤
│                                                          │
│                      Payload                             │
│                    (user data)                           │
│                                                          │
├──────────────────────────────────────────────────────────┤
│ Footer (optional): size (for coalescing)                 │
└──────────────────────────────────────────────────────────┘

Strategies to implement:

  1. First-fit: Traverse free list, use first block that fits
  2. Best-fit: Find smallest block that fits (slow but less fragmentation)
  3. Segregated free lists: Separate lists for different size classes
  4. Buddy system: Split/merge powers of 2 (fast but internal fragmentation)
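
A sketch of strategy 1 over an explicit free list — the header layout and the global free_list are illustrative choices, not the only workable design:

#include <stddef.h>

/* Block header sketch: sizes are 16-byte aligned, so the low bit of the size
   word is free to store the allocated flag. next_free is only meaningful
   while the block is on the free list. */
typedef struct block_header {
    size_t size_and_flags;              /* size | 1 when allocated */
    struct block_header *next_free;
} block_header;

#define ALLOCATED     1UL
#define BLOCK_SIZE(h) ((h)->size_and_flags & ~ALLOCATED)
#define IS_FREE(h)    (((h)->size_and_flags & ALLOCATED) == 0)

static block_header *free_list;         /* hypothetical list head */

/* First-fit: return the first free block large enough, or NULL. */
static block_header *find_first_fit(size_t need) {
    for (block_header *b = free_list; b != NULL; b = b->next_free)
        if (BLOCK_SIZE(b) >= need)
            return b;                   /* caller splits if it is much larger */
    return NULL;                        /* no fit: extend the heap (sbrk/mmap) */
}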

Key operations:

  • Splitting: When allocating from a larger free block
  • Coalescing: When freeing, merge with adjacent free blocks
  • Extending heap: When no suitable free block exists

For thread safety:

  • Simple approach: global lock
  • Better: per-size-class locks
  • Advanced: thread-local caches (like tcmalloc/jemalloc)

Questions to consider:

  • What alignment is required? (Usually 16 bytes on x86-64)
  • When should you use mmap vs. sbrk?
  • How do you handle realloc efficiently?
  • What about zero-size allocations?

Learning milestones:

  1. First-fit allocator passes basic tests → You understand heap management
  2. Coalescing works correctly → You understand fragmentation prevention
  3. Segregated lists outperform first-fit → You understand data structure trade-offs
  4. Thread-safe version works → You understand concurrent data structures

Project 6: Parallel Matrix Operations with Cilk

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C (with Cilk/OpenCilk extensions)
  • Alternative Programming Languages: C++ (with TBB), Rust (with Rayon)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Parallel Programming / Work-Stealing / Fork-Join
  • Software or Tool: OpenCilk, Intel TBB, or Rayon
  • Main Book: “Introduction to Algorithms” by CLRS (Chapter 27: Multithreaded Algorithms)

What you’ll build: A parallel linear algebra library implementing matrix multiplication, transpose, and LU decomposition using fork-join parallelism—with work-span analysis demonstrating theoretical speedup and benchmarks showing actual speedup across core counts.

Why it teaches performance engineering: Parallelism is the only way to exploit modern CPUs, but naive parallelism often makes things worse (overhead, false sharing, poor cache usage). This project teaches you to analyze parallel algorithms (work-span), use structured parallelism (fork-join), and understand why practical speedup differs from theoretical speedup.

Core challenges you’ll face:

  • Work-span analysis → maps to understanding parallelism = work/span
  • Granularity control → maps to when parallel overhead exceeds benefit
  • Cache-efficient parallel algorithms → maps to cache-oblivious techniques
  • Measuring parallel efficiency → maps to speedup curves, Amdahl’s law

Key Concepts:

  • Multithreaded Algorithms: MIT 6.172 Lectures 6-8 - Multicore Programming
  • Work-Span Analysis: “Introduction to Algorithms” Chapter 27 - CLRS
  • Cilk Runtime: MIT 6.172 Lecture 13 - The Cilk Runtime System
  • Cache-Oblivious Algorithms: MIT 6.172 Lecture 15

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: C programming, matrix multiplication (Project 2), understanding of parallelism concepts.

Real world outcome:

$ ./parallel_linalg_bench -n 2048 --cores 1,2,4,8,16

Parallel Linear Algebra Benchmark
Matrix size: 2048x2048 (double precision)
System: 16 cores, 32 threads

┌─────────────────────────────────────────────────────────────┐
│ Operation         │ Work (T₁) │ Span (T∞) │ Parallelism     │
├─────────────────────────────────────────────────────────────┤
│ Matrix Multiply   │ O(n³)     │ O(log n)  │ O(n³/log n) ≈ ∞ │
│ Matrix Transpose  │ O(n²)     │ O(log n)  │ O(n²/log n)     │
│ LU Decomposition  │ O(n³)     │ O(n log n)│ O(n²/log n)     │
└─────────────────────────────────────────────────────────────┘

Actual Speedup (Matrix Multiply 2048x2048):
┌─────────────────────────────────────────────────────────────┐
│ Cores │ Time (ms) │ Speedup │ Efficiency │ GFLOPS          │
├─────────────────────────────────────────────────────────────┤
│ 1     │ 4,521     │ 1.0x    │ 100%       │ 3.8             │
│ 2     │ 2,298     │ 1.97x   │ 98%        │ 7.5             │
│ 4     │ 1,178     │ 3.84x   │ 96%        │ 14.6            │
│ 8     │ 612       │ 7.39x   │ 92%        │ 28.1            │
│ 16    │ 342       │ 13.22x  │ 83%        │ 50.3            │
└─────────────────────────────────────────────────────────────┘

Efficiency analysis:
  - Near-linear scaling up to 8 cores (92%)
  - Diminishing returns at 16 cores (memory bandwidth limited)
  - Achieved 50 GFLOPS (33% of peak)

Implementation Hints:

Cilk provides two keywords:

  • cilk_spawn: Fork a task
  • cilk_sync: Wait for spawned children
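
A minimal OpenCilk example of the two keywords on a toy problem (recursive array sum, compiled with the OpenCilk toolchain via -fopencilk); GRAIN is a hypothetical sequential cutoff — exactly the granularity knob discussed below:

#include <cilk/cilk.h>
#include <stddef.h>
#include <stdint.h>

#define GRAIN 4096   /* below this, recursion overhead would dominate */

int64_t sum(const int64_t *a, size_t n) {
    if (n <= GRAIN) {
        int64_t s = 0;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }
    int64_t left = cilk_spawn sum(a, n / 2);        /* fork: left half */
    int64_t right = sum(a + n / 2, n - n / 2);      /* this strand: right half */
    cilk_sync;                                      /* join before combining */
    return left + right;
}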

For matrix multiply, the divide-and-conquer approach:

matmul(C, A, B, n):
    if n <= BASE_CASE:
        sequential_matmul(C, A, B, n)
    else:
        allocate a temporary T (n x n)
        partition A, B, C, T into quadrants
        # 8 recursive products into disjoint quadrants: spawn 7, run the 8th here
        cilk_spawn matmul(C00, A00, B00, n/2)
        cilk_spawn matmul(C01, A00, B01, n/2)
        cilk_spawn matmul(C10, A10, B00, n/2)
        cilk_spawn matmul(C11, A10, B01, n/2)
        cilk_spawn matmul(T00, A01, B10, n/2)
        cilk_spawn matmul(T01, A01, B11, n/2)
        cilk_spawn matmul(T10, A11, B10, n/2)
        matmul(T11, A11, B11, n/2)
        cilk_sync
        parallel_add(C, T, n)   # C += T, elementwise, itself divide-and-conquer

Work-span analysis:

  • Work T₁: Total operations if run sequentially
  • Span T∞: Critical path (longest dependency chain)
  • Parallelism: T₁/T∞ (how many processors can be used effectively)

Granularity control is crucial:

  • Too fine: Overhead dominates (spawn cost ~hundreds of cycles)
  • Too coarse: Not enough parallelism
  • Rule of thumb: Base case should be ~10μs of work

For cache efficiency:

  • Use cache-oblivious recursive decomposition
  • This naturally fits in cache at the right level
  • Avoid false sharing between threads

Questions to explore:

  • What happens when parallelism » cores?
  • Why doesn’t LU decomposition scale as well as matrix multiply?
  • How does the work-stealing scheduler affect cache behavior?
  • What’s the overhead of cilk_spawn vs. sequential call?

Learning milestones:

  1. Basic parallel matmul works → You understand fork-join parallelism
  2. Near-linear speedup up to 4-8 cores → You understand granularity control
  3. Work-span analysis matches empirical results → You understand parallel complexity
  4. You identify bottlenecks (memory, contention) → You understand real-world limits

Project 7: Race Detector

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Concurrency / Race Detection / Program Analysis
  • Software or Tool: Building a tool similar to ThreadSanitizer
  • Main Book: “The Art of Multiprocessor Programming” by Herlihy & Shavit

What you’ll build: A race detector that instruments parallel C programs to detect data races at runtime—tracking memory accesses, maintaining vector clocks, and reporting conflicting accesses with source locations and stack traces.

Why it teaches performance engineering: Data races are the bane of parallel programming. They cause heisenbugs that appear and disappear unpredictably. Building a race detector teaches you how parallel programs actually execute, what constitutes a race (formally), and how to instrument programs to track memory accesses—skills essential for debugging and optimizing parallel code.

Core challenges you’ll face:

  • Tracking happens-before relationships → maps to vector clocks, Lamport timestamps
  • Instrumenting memory accesses → maps to compiler instrumentation or binary instrumentation
  • Efficient shadow memory → maps to mapping application memory to metadata
  • Reducing false positives → maps to understanding synchronization semantics

Key Concepts:

  • Race Detection Theory: “The Art of Multiprocessor Programming” Chapter 8 - Herlihy & Shavit
  • Vector Clocks: Lamport’s paper “Time, Clocks, and the Ordering of Events”
  • Nondeterministic Parallel Programming: MIT 6.172 Lecture 16
  • Shadow Memory: How AddressSanitizer/ThreadSanitizer work (LLVM documentation)

Difficulty: Expert. Time estimate: 4-6 weeks. Prerequisites: Parallel programming (Project 6), understanding of synchronization primitives, familiarity with compiler or binary instrumentation.

Real world outcome:

$ ./race_detector ./buggy_parallel_program

Running with race detection enabled...

═══════════════════════════════════════════════════════════════
WARNING: DATA RACE DETECTED
═══════════════════════════════════════════════════════════════

Write of size 8 at 0x7f8a3c000040 by thread T2:
    #0 update_counter() at counter.c:15
    #1 worker_thread() at main.c:42
    #2 pthread_start()

Previous read of size 8 at 0x7f8a3c000040 by thread T1:
    #0 read_counter() at counter.c:21
    #1 worker_thread() at main.c:42
    #2 pthread_start()

Location is global variable 'shared_counter' declared at:
    counter.c:3

Threads involved:
    T1 (tid=1001): created at main.c:38
    T2 (tid=1002): created at main.c:38

═══════════════════════════════════════════════════════════════
SUMMARY: 1 race detected in 0.523 seconds
         Slowdown: 12x vs uninstrumented
═══════════════════════════════════════════════════════════════

Implementation Hints:

A data race occurs when:

  1. Two threads access the same memory location
  2. At least one access is a write
  3. The accesses are not ordered by synchronization (happens-before)

Vector clocks track happens-before:

Each thread maintains a vector V[T] where V[T][i] = logical time of thread i
On sync: merge vectors (element-wise max)
On access: check if another thread's access could be concurrent
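
A small sketch of the clock representation and the two operations above, assuming a fixed maximum thread count for simplicity (MAX_THREADS and the helper names are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 64

/* V.t[i] = the last logical time of thread i that this thread (or this
   recorded access) is known to happen-after. */
typedef struct { uint64_t t[MAX_THREADS]; } vclock;

/* On a release→acquire synchronization edge, the acquirer merges clocks. */
static void vc_join(vclock *dst, const vclock *src) {
    for (int i = 0; i < MAX_THREADS; i++)
        if (src->t[i] > dst->t[i]) dst->t[i] = src->t[i];   /* element-wise max */
}

/* A previous access by thread `tid` at time `at` happens-before the current
   thread (clock `cur`) iff cur already covers that time. Otherwise the two
   accesses are concurrent — a race if one of them is a write. */
static bool happens_before(const vclock *cur, int tid, uint64_t at) {
    return at <= cur->t[tid];
}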

Shadow memory maps each application byte to metadata:

For each memory location, track:
- Last writer (thread, vector clock)
- Last readers (may need multiple)
- Access size and type

Instrumentation approaches:

  1. Compiler instrumentation: Insert callbacks before every load/store
  2. Binary instrumentation: Use Pin, DynamoRIO, or Valgrind
  3. Hybrid: Compile with hooks, link with runtime library

Performance considerations:

  • Shadow memory lookup must be fast (direct mapping with offset)
  • Vector clock operations are expensive (use epochs to optimize)
  • Only check inter-thread accesses (per-thread fast path)

Questions to explore:

  • What’s the difference between benign races and bugs?
  • How do you handle atomic operations?
  • What about races through signals/interrupts?
  • How do production race detectors achieve reasonable slowdowns?

Learning milestones:

  1. Detect obvious races (no sync) → You understand race definition
  2. Handle pthread_mutex correctly → You understand happens-before
  3. Reasonable overhead (<50x slowdown) → You understand optimization
  4. No false positives on correct programs → You understand sync semantics

Project 8: Cache-Oblivious B-Tree

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Data Structures / Cache Optimization / Databases
  • Software or Tool: Building a high-performance index structure
  • Main Book: “Introduction to Algorithms” by CLRS

What you’ll build: A cache-oblivious B-tree (van Emde Boas layout) that achieves optimal cache behavior without knowing the cache size—benchmarked against standard B-trees and red-black trees across different cache hierarchies.

Why it teaches performance engineering: Modern programs are bottlenecked by memory, not CPU. Cache-oblivious algorithms achieve near-optimal cache performance without tuning for specific cache sizes. Building this data structure teaches you about memory hierarchy, cache analysis, and data layout—skills that apply to any data-intensive code.

Core challenges you’ll face:

  • van Emde Boas layout → maps to recursive layout for optimal cache behavior
  • Implicit pointers → maps to computing child positions without storing pointers
  • Cache complexity analysis → maps to understanding O(log_B N) cache behavior
  • Comparison with cache-aware B-tree → maps to understanding the tradeoffs

Key Concepts:

  • Cache-Oblivious Algorithms: MIT 6.172 Lecture 15
  • B-Trees: “Introduction to Algorithms” Chapter 18 - CLRS
  • van Emde Boas Layout: Prokop’s thesis, “Cache-Oblivious Algorithms”
  • Memory Hierarchy: “Computer Architecture” Chapter 2 - Hennessy & Patterson

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: B-trees, understanding of cache hierarchy, Project 2 (cache optimization experience).

Real world outcome:

$ ./btree_bench -n 10000000 -ops search,insert,range

B-Tree Benchmark: 10M keys, random order

┌─────────────────────────────────────────────────────────────┐
│ Data Structure        │ Search  │ Insert  │ Range(1K)      │
├─────────────────────────────────────────────────────────────┤
│ std::map (RB-tree)    │ 892 ns  │ 1,021 ns│ 45.2 μs        │
│ B-tree (B=64)         │ 423 ns  │ 512 ns  │ 12.3 μs        │
│ B-tree (B=256)        │ 389 ns  │ 498 ns  │ 11.8 μs        │
│ B-tree (B=1024)       │ 412 ns  │ 534 ns  │ 10.9 μs        │
│ Cache-oblivious B-tree│ 356 ns  │ 478 ns  │ 9.2 μs         │
└─────────────────────────────────────────────────────────────┘

Cache behavior (L3 misses per operation):
┌─────────────────────────────────────────────────────────────┐
│ Data Structure        │ Search  │ Insert  │ Notes          │
├─────────────────────────────────────────────────────────────┤
│ std::map (RB-tree)    │ 12.3    │ 15.1    │ pointer-chasing│
│ B-tree (B=64)         │ 4.2     │ 5.8     │ optimal for L1 │
│ B-tree (B=256)        │ 3.1     │ 4.2     │ optimal for L2 │
│ B-tree (B=1024)       │ 3.8     │ 5.1     │ too big for L2 │
│ Cache-oblivious       │ 2.8     │ 3.9     │ self-adapting  │
└─────────────────────────────────────────────────────────────┘

The cache-oblivious B-tree achieves near-optimal cache behavior
across ALL cache levels without tuning!

Implementation Hints:

The key insight: Layout matters more than algorithms for cache performance.

Standard B-tree with pointer-based nodes:

Each access chases a pointer → likely cache miss
log_2(N) accesses → log_2(N) cache misses

Cache-aware B-tree with large nodes (B keys per node):

Choose B to fill a cache line
log_B(N) node accesses → log_B(N) cache misses
But: Must tune B for specific cache!

Cache-oblivious layout (van Emde Boas):

Recursively divide tree:
- Top half of tree stored first
- Then bottom subtrees, each in recursive layout

Why it works:
- At some recursion level, subtree fits in cache
- All accesses within that subtree are cache hits
- Achieves O(log_B N) cache complexity for ANY B!

The layout construction:

1. Conceptually, build a complete binary search tree
2. Apply van Emde Boas layout recursively
3. For position calculations, derive formulas for:
   - Parent index from child index
   - Child index from parent index
   - These can be computed, no pointers stored!
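
A sketch of step 2 for a complete tree stored with 1-based BFS indices (root = 1, children 2i and 2i+1): it emits node indices in van Emde Boas order by splitting the height roughly in half, laying out the top subtree, then each bottom subtree recursively. Calling veb_layout(1, h, out, 0) fills out[] with the storage order for a tree of h levels:

#include <stddef.h>

/* h = number of levels (h == 1 means a single node).
   Returns the next free position in out[]. */
static size_t veb_layout(size_t root, int h, size_t *out, size_t pos) {
    if (h == 1) { out[pos++] = root; return pos; }
    int top_h = h / 2, bot_h = h - top_h;
    /* 1) top subtree of top_h levels, rooted at `root`, in vEB order */
    pos = veb_layout(root, top_h, out, pos);
    /* 2) then each bottom subtree, left to right; their roots sit top_h
          levels below `root`, i.e. BFS indices root*2^top_h ... +2^top_h-1 */
    size_t num_bottom = (size_t)1 << top_h;
    size_t first = root << top_h;
    for (size_t i = 0; i < num_bottom; i++)
        pos = veb_layout(first + i, bot_h, out, pos);
    return pos;
}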

Questions to explore:

  • Why is pointer-chasing so expensive?
  • What’s the relationship between tree depth and cache misses?
  • How do you handle updates (splits) in the packed layout?
  • What’s the practical difference vs. cache-aware tuning?

Learning milestones:

  1. van Emde Boas layout works for static tree → You understand the layout
  2. Search achieves O(log_B N) cache complexity → You understand cache analysis
  3. Updates work correctly → You understand the algorithmic challenges
  4. Outperforms tuned B-tree on varied cache sizes → You’ve proven cache-obliviousness

Project 9: Lock-Free Queue

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Concurrency / Lock-Free Programming / Memory Ordering
  • Software or Tool: Building concurrent data structures
  • Main Book: “The Art of Multiprocessor Programming” by Herlihy & Shavit

What you’ll build: A lock-free MPMC (multi-producer, multi-consumer) queue using compare-and-swap operations, with careful attention to memory ordering and the ABA problem—benchmarked against mutex-based queues under contention.

Why it teaches performance engineering: Lock-free data structures can dramatically outperform lock-based ones under contention, but they’re notoriously difficult to get right. This project teaches you about atomic operations, memory ordering, linearizability, and the subtle bugs that arise in concurrent code—essential knowledge for high-performance systems.

Core challenges you’ll face:

  • Compare-and-swap (CAS) loops → maps to atomic operations and retry logic
  • ABA problem → maps to why simple CAS isn’t enough
  • Memory ordering → maps to acquire/release semantics, memory barriers
  • Linearizability correctness → maps to formal reasoning about concurrent operations

Key Concepts:

  • Lock-Free Data Structures: MIT 6.172 Lecture 17 - Synchronization Without Locks
  • Memory Models: “C++ Concurrency in Action” Chapter 5 - Anthony Williams
  • Michael-Scott Queue: The classic lock-free queue algorithm
  • ABA Problem: “The Art of Multiprocessor Programming” Chapter 10 - Herlihy & Shavit

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Understanding of atomic operations, memory models, parallel programming experience (Project 6).

Real world outcome:

$ ./queue_bench -producers 8 -consumers 8 -ops 10000000

Queue Benchmark: 8 producers, 8 consumers, 10M total ops

┌─────────────────────────────────────────────────────────────┐
│ Implementation        │ Throughput │ Latency (p99)│ Fairness │
├─────────────────────────────────────────────────────────────┤
│ Mutex-based queue     │ 1.2M ops/s │ 45 μs        │ Good     │
│ Spinlock queue        │ 2.1M ops/s │ 380 μs       │ Poor     │
│ Lock-free (simple CAS)│ BROKEN     │ -            │ -        │
│ Lock-free (w/ counter)│ 8.2M ops/s │ 8.2 μs       │ Good     │
│ Lock-free (hazard ptr)│ 7.8M ops/s │ 7.5 μs       │ Good     │
└─────────────────────────────────────────────────────────────┘

Scalability (lock-free w/ counter):
┌─────────────────────────────────────────────────────────────┐
│ Threads (P+C) │ Throughput │ Scaling Efficiency             │
├─────────────────────────────────────────────────────────────┤
│ 2  (1+1)      │ 5.1M ops/s │ 100% (baseline)                │
│ 4  (2+2)      │ 9.2M ops/s │ 90%                            │
│ 8  (4+4)      │ 14.8M ops/s│ 73%                            │
│ 16 (8+8)      │ 18.2M ops/s│ 45% (contention limited)       │
└─────────────────────────────────────────────────────────────┘

Correctness tests:
  ✓ No lost items
  ✓ No duplicate items
  ✓ FIFO order preserved (per-producer)
  ✓ Linearizable operations
  ✓ ABA problem prevented

Implementation Hints:

The Michael-Scott queue structure:

Queue:
  head → Node → Node → Node → ... → tail

Node:
  value: T
  next: atomic<Node*>

Basic enqueue (BROKEN - has ABA problem):

Allocate new_node with value
loop:
    tail = queue.tail
    next = tail.next
    if tail == queue.tail:  # still valid?
        if next == null:
            if CAS(&tail.next, null, new_node):  # try to link
                CAS(&queue.tail, tail, new_node) # try to advance tail
                return
        else:
            CAS(&queue.tail, tail, next)  # help advance tail

The ABA problem:

Thread 1: reads tail = A
Thread 1: preempted
Thread 2: dequeues A, frees it
Thread 2: allocates new node, happens to get same address A!
Thread 2: enqueues new A
Thread 1: resumes, CAS succeeds (still A), but wrong A!

Solutions:

  1. Counter/stamp with pointer: Pack version counter into pointer bits
  2. Hazard pointers: Protect nodes from being freed while accessed
  3. Epoch-based reclamation: Defer frees until no threads in critical section

Memory ordering is crucial:

  • memory_order_acquire on loads that guard following reads
  • memory_order_release on stores that guard preceding writes
  • memory_order_acq_rel on CAS operations
  • Getting this wrong causes subtle bugs!
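
Before tackling the MPMC queue, it can help to see acquire/release pairing in a smaller setting. This sketch is a single-producer/single-consumer ring buffer, not the Michael-Scott queue: the release store publishes the slot write, and the matching acquire load on the other side is what makes that write visible:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SIZE 1024                       /* power of two, so & (SIZE-1) wraps */
static int buf[SIZE];
static _Atomic size_t head, tail;       /* consumer bumps head, producer bumps tail */

bool enqueue(int v) {                   /* called by the single producer */
    size_t t = atomic_load_explicit(&tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&head, memory_order_acquire); /* slot reusable? */
    if (t - h == SIZE) return false;                              /* full */
    buf[t & (SIZE - 1)] = v;                                      /* write the slot... */
    atomic_store_explicit(&tail, t + 1, memory_order_release);    /* ...then publish it */
    return true;
}

bool dequeue(int *out) {                /* called by the single consumer */
    size_t h = atomic_load_explicit(&head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&tail, memory_order_acquire); /* see producer's write */
    if (t == h) return false;                                     /* empty */
    *out = buf[h & (SIZE - 1)];
    atomic_store_explicit(&head, h + 1, memory_order_release);    /* allow slot reuse */
    return true;
}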

Questions to explore:

  • Why is the spinlock version unfair under contention?
  • What happens if you use memory_order_relaxed everywhere?
  • How do you test for ABA bugs? (They’re rare and timing-dependent)
  • What’s the relationship between lock-free and wait-free?

Learning milestones:

  1. Basic lock-free queue works (ignoring ABA) → You understand CAS patterns
  2. ABA solution implemented and tested → You understand subtle concurrency bugs
  3. Correct memory ordering → You understand memory models
  4. Outperforms mutex under contention → You understand when lock-free helps

Project 10: SIMD JSON Parser

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: SIMD / Parsing / Data Processing
  • Software or Tool: Building something similar to simdjson
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A JSON parser that uses SIMD instructions (AVX2/AVX-512) to parse multiple bytes simultaneously, achieving gigabytes-per-second parsing speeds—inspired by the simdjson project.

Why it teaches performance engineering: Parsing is traditionally done byte-by-byte, but SIMD allows processing 32-64 bytes at once. This project teaches you to think about data parallelism at the instruction level, design SIMD-friendly algorithms, and achieve performance that seems impossible. simdjson parses JSON at over 3 GB/s—faster than memcpy on some systems!

Core challenges you’ll face:

  • SIMD-based structural scanning → maps to finding quotes, braces, colons with vector ops
  • Handling escapes and strings → maps to complex state transitions in parallel
  • Branch-free parsing → maps to avoiding branch mispredictions
  • Validating UTF-8 with SIMD → maps to parallel state machines

Key Concepts:

  • SIMD Programming: Intel Intrinsics Guide (online)
  • simdjson Paper: “Parsing Gigabytes of JSON per Second” by Langdale & Lemire
  • SIMD String Processing: “Modern C” Chapter on SIMD - Jens Gustedt
  • Bit Manipulation for Parsing: MIT 6.172 Lecture 3 - Bit Hacks

Difficulty: Expert. Time estimate: 4-6 weeks. Prerequisites: SIMD experience (Project 2), bit manipulation (Project 3), understanding of parsing.

Real world outcome:

$ ./simd_json_bench twitter.json

JSON Parsing Benchmark
File: twitter.json (631 KB, Twitter API response)

┌─────────────────────────────────────────────────────────────┐
│ Parser              │ Speed    │ Cycles/byte │ Time         │
├─────────────────────────────────────────────────────────────┤
│ Naive (byte-by-byte)│ 120 MB/s │ 25.0        │ 5.26 ms      │
│ RapidJSON           │ 380 MB/s │ 7.9         │ 1.66 ms      │
│ cJSON               │ 220 MB/s │ 13.6        │ 2.87 ms      │
│ simdjson            │ 3.2 GB/s │ 0.94        │ 0.20 ms      │
│ our_simd_json       │ 2.1 GB/s │ 1.43        │ 0.30 ms      │
└─────────────────────────────────────────────────────────────┘

Stage breakdown (our_simd_json):
  Stage 1 (structural scan):    0.08 ms  (7.9 GB/s)
  Stage 2 (tape construction):  0.15 ms  (4.2 GB/s)
  Stage 3 (DOM building):       0.07 ms  (9.0 GB/s)

Validation tests:
  ✓ All JSONTestSuite pass cases pass
  ✓ All JSONTestSuite fail cases fail
  ✓ Handles 4GB+ files (streaming mode)
  ✓ Correct UTF-8 validation

Implementation Hints:

Key insight: JSON structure can be found in parallel!

Stage 1: Find structural characters with SIMD

Load 32 bytes into YMM register
Compare against '{'  → bitmask of matches
Compare against '}'  → bitmask of matches
Compare against '['  → bitmask of matches
... for all structural chars

But wait! Chars inside strings don't count!
Need to track "inside string" state across chunks
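
The per-character comparison step can look like this with AVX2 (compile with -mavx2); match_byte is an illustrative helper that returns one bit per byte of a 32-byte chunk:

#include <immintrin.h>
#include <stdint.h>

/* Bitmask of positions in `chunk` (32 bytes) that equal character c. */
static inline uint32_t match_byte(const char *chunk, char c) {
    __m256i data = _mm256_loadu_si256((const __m256i *)chunk);
    __m256i needle = _mm256_set1_epi8(c);            /* broadcast the byte */
    __m256i eq = _mm256_cmpeq_epi8(data, needle);    /* 0xFF where equal */
    return (uint32_t)_mm256_movemask_epi8(eq);       /* one bit per byte */
}

/* Usage sketch: quotes = match_byte(p, '"'); braces = match_byte(p, '{'); ... */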

The quote state problem:

"hello, world"  → inside string between quotes
"hello \" world" → escaped quote doesn't toggle
"hello \\ " → escaped backslash, quote does toggle

Solution: Use carryless multiplication (PCLMULQDQ)
or clever XOR prefix sum to track quote state
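
A sketch of the carryless-multiply trick (requires PCLMULQDQ, e.g. -mpclmul): multiplying the quote bitmask by an all-ones word computes a running XOR, so bit i of the result is set exactly when an odd number of quotes precede position i — i.e., "inside a string". Escaped quotes must be removed from the mask beforehand:

#include <immintrin.h>
#include <stdint.h>

/* Prefix XOR over the 64 bits of quote_mask via carryless multiplication. */
static inline uint64_t prefix_xor(uint64_t quote_mask) {
    __m128i all_ones = _mm_set1_epi8((char)0xFF);
    __m128i product = _mm_clmulepi64_si128(_mm_set_epi64x(0, quote_mask), all_ones, 0);
    return (uint64_t)_mm_cvtsi128_si64(product);     /* low 64 bits = in-string mask */
}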

Structural characters to find:

  • { } [ ] : , - structure
  • " - string delimiters
  • \ - escape (affects following char)
  • Whitespace (to skip)

For each 64-byte chunk:

1. Load into two YMM registers (or one ZMM)
2. Generate bitmask for each character class
3. XOR quote bits to get string regions
4. Mask out structural chars inside strings
5. Append valid structural positions to tape

Branch-free number parsing:

  • Parse all digits with SIMD, multiply-accumulate
  • Handle sign, decimal point, exponent in parallel
  • Only branch on error cases

Questions to explore:

  • Why is simdjson faster than memcpy for some workloads?
  • How do you handle strings longer than SIMD width?
  • What’s the cache behavior of the two-stage approach?
  • How would you parallelize across cores?

Learning milestones:

  1. SIMD character classification works → You understand vector comparisons
  2. Quote/escape handling correct → You understand SIMD state machines
  3. 1 GB/s+ parsing speed → You understand SIMD optimization
  4. Full JSON compliance → You’ve handled all edge cases

Project 11: Profile-Guided Optimizer

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C/C++
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Compilers / Profiling / Optimization
  • Software or Tool: LLVM, gcc, profile data
  • Main Book: “Engineering a Compiler” by Cooper & Torczon

What you’ll build: A tool that uses runtime profiles to guide code optimizations—identifying hot paths, reordering basic blocks to reduce branch mispredictions, and inlining decisions based on actual call frequencies.

Why it teaches performance engineering: Compilers optimize based on static analysis, but real performance depends on runtime behavior. Profile-guided optimization (PGO) bridges this gap. This project teaches you about control flow graphs, profile data collection, and how to make informed optimization decisions—understanding what modern compilers do with -fprofile-use.

Core challenges you’ll face:

  • Instrumenting code for profiling → maps to basic block counters, edge profiling
  • Analyzing profile data → maps to hot paths, call graphs, branch frequencies
  • Basic block reordering → maps to reducing branch mispredictions
  • Inlining heuristics → maps to code size vs. call overhead

Key Concepts:

  • Compiler Optimizations: MIT 6.172 Lecture 9 - What Compilers Can and Cannot Do
  • Profile-Guided Optimization: “Engineering a Compiler” Chapter 8 - Cooper & Torczon
  • Basic Block Layout: “Profile Guided Code Positioning” (Pettis & Hansen paper)
  • LLVM PGO: LLVM documentation on profile-guided optimization

Difficulty: Expert. Time estimate: 4-6 weeks. Prerequisites: Understanding of compiler structure, control flow graphs, assembly (Project 4).

Real world outcome:

$ ./pgo_tool analyze profile_data.txt program.cfg

Profile-Guided Analysis Report
==============================

Hot Functions (>5% of execution time):
┌─────────────────────────────────────────────────────────────┐
│ Function            │ Samples │ %Total │ Avg calls/invoc   │
├─────────────────────────────────────────────────────────────┤
│ process_request     │ 2,341K  │ 34.2%  │ 1 (entry point)   │
│ parse_json          │ 1,892K  │ 27.6%  │ 3.2               │
│ hash_lookup         │ 1,234K  │ 18.0%  │ 48.7              │
│ string_compare      │ 823K    │ 12.0%  │ 156.3             │
│ allocate_buffer     │ 412K    │ 6.0%   │ 8.2               │
└─────────────────────────────────────────────────────────────┘

Branch Analysis:
┌─────────────────────────────────────────────────────────────┐
│ Location            │ Taken % │ Current │ Optimal Layout   │
├─────────────────────────────────────────────────────────────┤
│ parse_json:42       │ 3%      │ taken   │ fall-through ↓   │
│ hash_lookup:78      │ 98%     │ fallthru│ (optimal)        │
│ process:121         │ 15%     │ taken   │ fall-through ↓   │
└─────────────────────────────────────────────────────────────┘

Inlining Recommendations:
  ✓ Inline string_compare into hash_lookup (hot caller, small callee)
  ✓ Inline hash_lookup into parse_json (very hot path)
  ✗ Don't inline allocate_buffer (cold path, large code)

$ ./pgo_tool optimize program.o -o program_optimized.o

Optimizations applied:
  - Reordered 23 basic blocks for fall-through on hot paths
  - Inlined 3 functions on hot paths
  - Moved cold code to .text.cold section

Estimated improvement: 12-18% (branch mispredictions reduced by 40%)

$ ./benchmark program_optimized vs program_baseline
Speedup: 15.3%

Implementation Hints:

Profile data collection (instrumentation approach):

For each basic block, insert:
  __profile_counter[block_id]++;

For each function entry:
  __profile_call_count[func_id]++;

For each branch:
  __profile_branch[branch_id][taken]++;
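
A purely illustrative sketch of the runtime side of that instrumentation in C; the counter array matches the pseudocode above, while the dump format and block IDs are assumptions:

#include <stdio.h>
#include <stdlib.h>

#define MAX_BLOCKS 4096
static unsigned long long __profile_counter[MAX_BLOCKS];

static void __prof_dump(void) {                 /* write counts at program exit */
    FILE *f = fopen("profile_data.txt", "w");
    if (!f) return;
    for (int i = 0; i < MAX_BLOCKS; i++)
        if (__profile_counter[i])
            fprintf(f, "block %d %llu\n", i, __profile_counter[i]);
    fclose(f);
}

static inline void __prof_block(int id) { __profile_counter[id]++; }

/* What instrumented user code looks like after the pass has run: */
int is_even(int x) {
    __prof_block(0);                            /* entry block */
    if (x % 2 == 0) { __prof_block(1); return 1; }
    __prof_block(2);                            /* odd-branch block */
    return 0;
}

int main(void) {
    atexit(__prof_dump);                        /* flush counters on exit */
    for (int i = 0; i < 1000; i++) is_even(i);
    return 0;
}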

Control flow graph analysis:

Build CFG from assembly or intermediate representation
Label each edge with profile count
Identify:
  - Hot loops (high back-edge counts)
  - Hot paths (trace through most-taken branches)
  - Cold code (rarely executed)

Basic block reordering algorithm:

Goal: Maximize fall-through on hot paths
Approach:
  1. Find hottest edge from current block
  2. Place target block next (if not already placed)
  3. Repeat until all blocks placed
  4. Handle cold blocks at end
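
A small C sketch of that greedy chaining, assuming the profiled edge counts for one function fit in a dense matrix; all names and sizes are illustrative:

#define MAX_BB 64                 /* assumes small per-function CFGs */

/* edge_count[i][j] = profiled count of the CFG edge i -> j.
   order_out receives a permutation of the blocks 0..n-1. */
void layout_blocks(const long edge_count[MAX_BB][MAX_BB],
                   int n, int entry, int *order_out) {
    int placed[MAX_BB] = {0};
    int pos = 0, cur = entry;

    while (cur != -1) {                           /* follow the hot chain */
        order_out[pos++] = cur;
        placed[cur] = 1;
        int best = -1;
        long best_count = 0;
        for (int j = 0; j < n; j++)               /* hottest unplaced successor */
            if (!placed[j] && edge_count[cur][j] > best_count) {
                best = j;
                best_count = edge_count[cur][j];
            }
        cur = best;                               /* -1 (no hot successor) ends the chain */
    }
    for (int j = 0; j < n; j++)                   /* cold / unreached blocks go last */
        if (!placed[j]) order_out[pos++] = j;
}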

Inlining heuristics to consider:

  • Caller hotness × callee hotness
  • Callee code size (inlining bloats code, hurts I-cache)
  • Call sites per callee (inline if few call sites)
  • Depth in hot path (inline inner loops first)

Questions to explore:

  • How does PGO interact with branch prediction hardware?
  • What’s the benefit of .text.cold sections?
  • How do you handle profile data from multiple runs?
  • What optimizations can ONLY be done with profile data?

Learning milestones:

  1. Profile data collected accurately → You understand instrumentation
  2. Hot path identified correctly → You understand profile analysis
  3. Block reordering improves performance → You understand branch prediction
  4. Inlining decisions match production compilers → You understand optimizer tradeoffs

Project 12: Parallel Merge Sort with Work-Stealing

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Parallel Algorithms / Sorting / Work-Stealing
  • Software or Tool: OpenCilk or custom work-stealing scheduler
  • Main Book: “Introduction to Algorithms” by CLRS (Chapter 27)

What you’ll build: A parallel merge sort that achieves near-linear speedup through work-stealing, including a parallel merge (not a sequential one!) that does O(n) work with O(log² n) span, keeping the full sort at O(n log n) work with only O(log³ n) span.

Why it teaches performance engineering: Sorting is fundamental, and merge sort’s divide-and-conquer structure is naturally parallel. But the merge step is traditionally sequential, limiting parallelism. The parallel merge algorithm demonstrates how to find parallelism in seemingly sequential operations. This project ties together work-span analysis, cache efficiency, and practical parallelization.

Core challenges you’ll face:

  • Parallel divide phase → maps to trivial parallelism with spawn
  • Parallel merge with binary search → maps to finding parallelism in merge
  • Cache efficiency in parallel context → maps to maintaining locality across cores
  • Work-span analysis → maps to proving O(log² n) span

Key Concepts:

  • Parallel Merge Sort: MIT 6.172 Lecture 8 - Analysis of Multithreaded Algorithms
  • Parallel Merge Algorithm: “Introduction to Algorithms” Chapter 27.3 - CLRS
  • Work-Stealing Schedulers: MIT 6.172 Lecture 13 - The Cilk Runtime System
  • Cache-Efficient Sorting: “Algorithms for Memory Hierarchies” (Vitter survey)

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Sequential merge sort, work-span analysis concepts, Project 6 (parallel programming)

Real world outcome:

$ ./parallel_sort_bench -n 100000000 -cores 1,2,4,8,16

Parallel Merge Sort Benchmark
Array size: 100M integers (400 MB)
Comparison: sequential vs parallel merge

┌────────────────────────────────────────────────────────────────────┐
│ Implementation              │ Work      │ Span      │ Parallelism  │
├────────────────────────────────────────────────────────────────────┤
│ Sequential merge sort       │ O(n lg n) │ O(n lg n) │ 1            │
│ Parallel sort, seq merge    │ O(n lg n) │ O(n)      │ lg n         │
│ Parallel sort, par merge    │ O(n lg n) │ O(lg³ n)  │ n / lg² n    │
└────────────────────────────────────────────────────────────────────┘

Actual Performance:
┌─────────────────────────────────────────────────────────────┐
│ Cores │ Par-sort,seq-merge │ Par-sort,par-merge │ Speedup  │
├─────────────────────────────────────────────────────────────┤
│ 1     │ 12.4s              │ 12.8s              │ 1.0x     │
│ 2     │ 7.2s               │ 6.8s               │ 1.88x    │
│ 4     │ 4.8s               │ 3.6s               │ 3.56x    │
│ 8     │ 4.1s               │ 1.9s               │ 6.74x    │
│ 16    │ 4.0s (bottleneck!) │ 1.1s               │ 11.6x    │
└─────────────────────────────────────────────────────────────┘

Note: Sequential merge becomes the bottleneck at 8+ cores!
Parallel merge achieves near-linear scaling.

Cache behavior:
  Cache misses (16 cores): 180M (par-merge) vs 450M (seq-merge)
  Parallel merge maintains better locality!

Implementation Hints:

Parallel merge algorithm:

merge(A[1..n], B[1..m], C[1..n+m]):
    # Base case
    if n + m <= BASE_SIZE:
        sequential_merge(A, B, C)
        return

    # Ensure A is the longer array
    if n < m:
        swap A, B  (and swap n, m accordingly)

    # Pick middle of A
    mid_a = n/2
    # Binary search to find corresponding position in B
    mid_b = binary_search(B, A[mid_a])

    # A[mid_a] goes to position mid_a + mid_b in C
    C[mid_a + mid_b] = A[mid_a]

    # Recursively merge halves in parallel
    spawn merge(A[1..mid_a-1], B[1..mid_b], C[1..mid_a+mid_b-1])
    spawn merge(A[mid_a+1..n], B[mid_b+1..m], C[mid_a+mid_b+1..n+m])
    sync
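
The same routine as compilable C with OpenCilk keywords (build with clang -fopencilk); indices are 0-based here, the element type is fixed to int, and BASE_SIZE is the granularity cutoff discussed below:

#include <cilk/cilk.h>

#define BASE_SIZE 10000                 /* granularity cutoff, tune per machine */

/* number of elements in b[0..m) that are <= key */
static long split_point(const int *b, long m, int key) {
    long lo = 0, hi = m;
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        if (b[mid] <= key) lo = mid + 1; else hi = mid;
    }
    return lo;
}

static void seq_merge(const int *a, long n, const int *b, long m, int *c) {
    long i = 0, j = 0, k = 0;
    while (i < n && j < m) c[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < n) c[k++] = a[i++];
    while (j < m) c[k++] = b[j++];
}

void par_merge(const int *a, long n, const int *b, long m, int *c) {
    if (n < m) {                                   /* keep a as the longer input */
        const int *ta = a; a = b; b = ta;
        long tn = n; n = m; m = tn;
    }
    if (n + m <= BASE_SIZE) { seq_merge(a, n, b, m, c); return; }

    long ma = n / 2;                               /* midpoint of the longer array */
    long mb = split_point(b, m, a[ma]);            /* matching split point in b    */
    c[ma + mb] = a[ma];                            /* a[ma]'s final position in c  */

    cilk_spawn par_merge(a, ma, b, mb, c);                               /* left halves  */
    par_merge(a + ma + 1, n - ma - 1, b + mb, m - mb, c + ma + mb + 1);  /* right halves */
    cilk_sync;
}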

Work-span analysis:

Work: Still O(n) for the merge (each element is placed exactly once)
Span: T(n) ≤ T(3n/4) + O(log n)  [larger recursive half has at most 3n/4 elements;
                                  binary search + spawn cost O(log n)]
     = O(log² n)

Parallelism = O(n / log² n) — very high!

Why parallel merge improves cache behavior:

  • Each recursive call works on smaller subproblems
  • Different cores work on different cache-friendly chunks
  • vs. sequential merge which streams through all of A and B

Granularity tuning:

  • Too fine-grained: spawn overhead dominates
  • Too coarse: insufficient parallelism for all cores
  • Typical: BASE_SIZE ~ 10,000 elements

Questions to explore:

  • Why does sequential merge become a bottleneck?
  • What’s the memory bandwidth requirement for sorting?
  • How does this compare to parallel quicksort?
  • What about sample sort for distributed memory?

Learning milestones:

  1. Parallel divide works correctly → You understand basic fork-join
  2. Parallel merge produces correct output → You understand the algorithm
  3. Scaling improves with parallel merge → You understand why it helps
  4. You can derive the span analysis → You understand parallel complexity

Project 13: Hardware Performance Counter Dashboard

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust, Python (for UI)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Profiling / Performance Monitoring / Visualization
  • Software or Tool: perf, PAPI, Intel VTune concepts
  • Main Book: “Systems Performance” by Brendan Gregg

What you’ll build: A real-time dashboard that reads hardware performance counters (instructions, cycles, cache misses, branch mispredictions, etc.) and visualizes bottlenecks—showing whether code is CPU-bound, memory-bound, or suffering from specific microarchitectural issues.

Why it teaches performance engineering: Hardware performance counters reveal what’s really happening inside the CPU. This project teaches you to interpret counter data, understand microarchitectural bottlenecks, and diagnose performance issues systematically. It’s like having X-ray vision for your code.

Core challenges you’ll face:

  • Reading performance counters → maps to perf_event_open, counter multiplexing
  • Interpreting counter combinations → maps to derived metrics (CPI, cache hit rate)
  • Real-time sampling → maps to low-overhead continuous monitoring
  • Bottleneck classification → maps to Top-Down Microarchitecture Analysis

Key Concepts:

  • Performance Measurement: MIT 6.172 Lecture 10 - Measurement and Timing
  • Top-Down Analysis: Intel’s Top-Down Microarchitecture Analysis Method
  • Hardware Counters: “Systems Performance” Chapter 6 - Brendan Gregg
  • Roofline Model: “Roofline: An Insightful Visual Performance Model” (Williams et al.)

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Project 1 (basic measurement), understanding of CPU architecture

Real world outcome:

$ ./perf_dashboard -p $(pidof my_app) --refresh 500ms

╔═══════════════════════════════════════════════════════════════════════╗
║             PERFORMANCE DASHBOARD - my_app (PID: 12345)               ║
╠═══════════════════════════════════════════════════════════════════════╣
║  CPU Utilization: ████████████████████░░░░ 82%    Cores: 4/8 active   ║
╠═══════════════════════════════════════════════════════════════════════╣
║                           CORE METRICS                                ║
╠══════════════════╦════════════════╦═══════════════╦═══════════════════╣
║  Instructions    ║  2,341 M/sec   ║  Cycles       ║  3,120 M/sec      ║
║  IPC             ║  0.75 ⚠️       ║  Frequency    ║  3.2 GHz          ║
╠══════════════════╩════════════════╩═══════════════╩═══════════════════╣
║                         MEMORY HIERARCHY                              ║
╠═══════════════════════════════════════════════════════════════════════╣
║  L1D Hit Rate    ║ ████████████████████ 99.2%    ║  OK ✓             ║
║  L2 Hit Rate     ║ ████████████████░░░░ 78.4%    ║  OK ✓             ║
║  L3 Hit Rate     ║ ████████░░░░░░░░░░░░ 42.1%    ║  WARN ⚠️          ║
║  Memory BW       ║ ██████████████░░░░░░ 24.2 GB/s (of 50 GB/s peak)  ║
╠═══════════════════════════════════════════════════════════════════════╣
║                        PIPELINE EFFICIENCY                            ║
╠═══════════════════════════════════════════════════════════════════════╣
║  Branch Mispred  ║ 2.3% ████████████████████ OK ✓                    ║
║  Frontend Bound  ║ 8.2% ████░░░░░░░░░░░░░░░░ OK ✓                    ║
║  Backend Bound   ║ 61.4%████████████████████ HIGH ❌ (Memory)         ║
║  Retiring        ║ 24.1%██████████░░░░░░░░░░                         ║
╠═══════════════════════════════════════════════════════════════════════╣
║                          BOTTLENECK ANALYSIS                          ║
╠═══════════════════════════════════════════════════════════════════════╣
║  Primary:   MEMORY BOUND (L3 misses → DRAM latency)                   ║
║  Secondary: Low ILP (long dependency chains detected)                 ║
║                                                                       ║
║  Recommendations:                                                     ║
║  • Consider cache blocking to improve L3 hit rate                     ║
║  • Prefetch hints may help with predictable access patterns           ║
║  • Check for false sharing if multi-threaded                          ║
╚═══════════════════════════════════════════════════════════════════════╝

Implementation Hints:

Key counters to collect:

Basic:
  - instructions, cycles → IPC
  - cache-references, cache-misses → miss rate
  - branches, branch-misses → misprediction rate

Memory hierarchy:
  - L1-dcache-loads, L1-dcache-load-misses
  - l2_rqsts.references, l2_rqsts.miss
  - LLC-loads, LLC-load-misses

Top-down analysis (Intel specific):
  - UOPS_RETIRED.RETIRE_SLOTS
  - UOPS_ISSUED.ANY
  - IDQ_UOPS_NOT_DELIVERED.CORE
  - Various backend stall counters
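
On Linux these counters are read through perf_event_open(2). A minimal sketch that counts cycles and instructions for a target PID for one second and prints IPC; it needs suitable perf_event_paranoid settings or CAP_PERFMON, and error handling is trimmed:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static int open_counter(pid_t pid, uint64_t config, int group_fd) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = (group_fd == -1);          /* only the group leader starts disabled */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, pid, -1, group_fd, 0);
}

int main(int argc, char **argv) {
    pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : 0;       /* 0 = this process */
    int cyc = open_counter(pid, PERF_COUNT_HW_CPU_CYCLES, -1);
    int ins = open_counter(pid, PERF_COUNT_HW_INSTRUCTIONS, cyc);
    if (cyc < 0 || ins < 0) { perror("perf_event_open"); return 1; }

    ioctl(cyc, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(cyc, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
    sleep(1);                                                /* measurement window */
    ioctl(cyc, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    uint64_t cycles = 0, instructions = 0;
    if (read(cyc, &cycles, sizeof(cycles)) != sizeof(cycles) ||
        read(ins, &instructions, sizeof(instructions)) != sizeof(instructions))
        return 1;
    printf("cycles=%llu instructions=%llu IPC=%.2f\n",
           (unsigned long long)cycles, (unsigned long long)instructions,
           cycles ? (double)instructions / (double)cycles : 0.0);
    return 0;
}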

Top-Down Microarchitecture Analysis:

All cycles are classified into:
1. Retiring (useful work)
2. Bad Speculation (wrong-path work)
3. Frontend Bound (instruction supply issues)
4. Backend Bound (execution issues)
   - Memory Bound
   - Core Bound

This gives a systematic way to identify bottlenecks.
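
The level-1 arithmetic, following the formulas published with Intel's method (Yasin's paper); the slot count assumes a 4-wide pipeline, and INT_MISC.RECOVERY_CYCLES is the extra event needed for the bad-speculation term:

/* Level-1 Top-Down classification from raw counter values.
   Comments give the Intel event names; adapt to your PMU. */
typedef struct {
    double retired_slots;    /* UOPS_RETIRED.RETIRE_SLOTS     */
    double issued_uops;      /* UOPS_ISSUED.ANY               */
    double fe_undelivered;   /* IDQ_UOPS_NOT_DELIVERED.CORE   */
    double recovery_cycles;  /* INT_MISC.RECOVERY_CYCLES      */
    double cycles;           /* CPU_CLK_UNHALTED.THREAD       */
} tma_counters;

typedef struct { double retiring, bad_spec, frontend, backend; } tma_level1;

tma_level1 tma_classify(const tma_counters *c) {
    double slots = 4.0 * c->cycles;               /* 4 issue slots per cycle */
    tma_level1 r;
    r.retiring = c->retired_slots / slots;
    r.frontend = c->fe_undelivered / slots;
    r.bad_spec = (c->issued_uops - c->retired_slots
                  + 4.0 * c->recovery_cycles) / slots;
    r.backend  = 1.0 - (r.retiring + r.frontend + r.bad_spec);
    return r;
}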

Counter multiplexing:

  • PMUs have limited counters (typically 4-8 per core)
  • OS multiplexes if you request more
  • Use groups to ensure counters are read together
  • Handle scaling for multiplexed counters

Real-time considerations:

  • Reading counters has overhead (~hundreds of nanoseconds)
  • Sample at reasonable intervals (100ms-1s)
  • Consider per-CPU vs. per-thread counting

Questions to explore:

  • Why is IPC alone not sufficient to identify bottlenecks?
  • How do you distinguish L3 misses from DRAM bandwidth saturation?
  • What counters indicate false sharing?
  • How does hyperthreading affect counter interpretation?

Learning milestones:

  1. Basic counters display correctly → You understand the PMU interface
  2. Derived metrics computed properly → You understand counter relationships
  3. Top-down analysis matches Intel VTune → You understand the methodology
  4. Recommendations are actionable → You understand optimization strategies

Project 14: Graph Algorithm Optimizer

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Graph Algorithms / Cache Optimization / Parallelization
  • Software or Tool: Building optimized BFS/SSSP/PageRank
  • Main Book: “Introduction to Algorithms” by CLRS

What you’ll build: An optimized graph processing library implementing BFS, SSSP (Dijkstra/Bellman-Ford), and PageRank with cache-efficient data layouts, direction-optimizing traversal, and parallelization—benchmarked on real-world graphs (road networks, social graphs, web graphs).

Why it teaches performance engineering: Graphs are everywhere (social networks, road maps, web crawling), but graph algorithms are notoriously memory-bound due to irregular access patterns. This project teaches you to optimize for the worst case: random memory access. You’ll learn about graph representations, direction-optimizing traversal, and how to squeeze performance from cache-hostile workloads.

Core challenges you’ll face:

  • Graph representation (CSR vs adjacency list) → maps to memory layout tradeoffs
  • Direction-optimizing BFS → maps to switching between push and pull
  • Cache-efficient traversal → maps to vertex ordering, edge partitioning
  • Parallel graph algorithms → maps to synchronization, work distribution

Key Concepts:

  • Graph Optimization: MIT 6.172 Lecture 22 - Graph Optimization
  • Direction-Optimizing BFS: “Direction-Optimizing Breadth-First Search” (Beamer et al.)
  • Graph Representations: “Introduction to Algorithms” Chapter 22 - CLRS
  • Parallel Graph Algorithms: “Ligra: A Lightweight Graph Processing Framework”

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Basic graph algorithms, cache optimization (Project 2), parallel programming (Project 6)

Real world outcome:

$ ./graph_bench -g road_network_usa.mtx -a bfs,sssp,pagerank

Graph: USA Road Network
Vertices: 23,947,347
Edges: 58,333,344 (directed)

BFS Benchmark (from random source):
┌─────────────────────────────────────────────────────────────┐
│ Implementation          │ Time   │ MTEPS  │ Cache Misses   │
├─────────────────────────────────────────────────────────────┤
│ Naive (queue-based)     │ 1.82s  │ 32.1   │ 892M           │
│ CSR representation      │ 0.94s  │ 62.1   │ 412M           │
│ Direction-optimizing    │ 0.31s  │ 188.2  │ 156M           │
│ + Parallel (8 cores)    │ 0.052s │ 1121.8 │ 168M           │
└─────────────────────────────────────────────────────────────┘

PageRank (20 iterations):
┌─────────────────────────────────────────────────────────────┐
│ Implementation          │ Time   │ Edges/s │ Convergence   │
├─────────────────────────────────────────────────────────────┤
│ Pull-based (naive)      │ 28.4s  │ 41.1M   │ 20 iters      │
│ Push-based              │ 12.2s  │ 95.6M   │ 20 iters      │
│ + Vertex reordering     │ 8.4s   │ 138.9M  │ 20 iters      │
│ + Parallel (8 cores)    │ 1.2s   │ 972.2M  │ 20 iters      │
│ + Delta-based           │ 0.34s  │ N/A     │ converged @ 8 │
└─────────────────────────────────────────────────────────────┘

Memory bandwidth utilization: 34.2 GB/s (of 50 GB/s peak)
Graph processing is MEMORY BOUND as expected.

Implementation Hints:

CSR (Compressed Sparse Row) representation:

row_ptr[v] = start index of v's neighbors in col_idx
col_idx[row_ptr[v] .. row_ptr[v+1]-1] = neighbors of v

Memory: O(V + E) — much better than adjacency matrix O(V²)
Access pattern: Sequential for one vertex's neighbors
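
In C the layout is just two arrays plus their sizes. The sketch below (names are illustrative) also shows the characteristic access pattern: one random jump into row_ptr, then a sequential scan of the neighbor list:

#include <stdint.h>

typedef struct {
    int64_t  num_vertices;
    int64_t  num_edges;
    int64_t *row_ptr;   /* length num_vertices + 1 */
    int32_t *col_idx;   /* length num_edges        */
} csr_graph;

/* Sum of out-degrees of v's neighbors: one pointer chase into row_ptr[v],
   then a sequential scan of col_idx, plus one random lookup per neighbor. */
int64_t neighbor_degree_sum(const csr_graph *g, int32_t v) {
    int64_t sum = 0;
    for (int64_t e = g->row_ptr[v]; e < g->row_ptr[v + 1]; e++) {
        int32_t u = g->col_idx[e];                       /* neighbor vertex */
        sum += g->row_ptr[u + 1] - g->row_ptr[u];        /* degree(u)       */
    }
    return sum;
}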

Direction-optimizing BFS:

Top-down (push): Visit frontier, add unvisited neighbors
  Good when frontier is small

Bottom-up (pull): For each unvisited vertex, check if any neighbor is in frontier
  Good when frontier is large (many vertices in frontier → likely to find parent)

Switch based on |frontier| / |unvisited| ratio!
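
Beamer et al. actually make the switch on edge counts rather than raw vertex counts; a hedged sketch of both tests, with alpha and beta as tuning constants:

/* Direction switch tests, a sketch. m_f = edges incident to the current
   frontier, m_u = edges incident to still-unvisited vertices,
   n_f = frontier size, n = total vertices.
   Alpha/beta are tuning knobs (commonly cited defaults: 14 and 24). */
int switch_to_bottom_up(long m_f, long m_u, long alpha) {
    return m_f > m_u / alpha;        /* frontier got heavy: pull instead of push */
}

int switch_to_top_down(long n_f, long n, long beta) {
    return n_f < n / beta;           /* frontier shrank again: go back to push   */
}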

Why graph algorithms are hard to optimize:

  • Irregular access patterns → cache misses on every edge
  • Low arithmetic intensity → memory bandwidth bound
  • Power-law degree distributions → load imbalance

Optimization techniques:

  1. Vertex reordering: Place frequently co-accessed vertices nearby
  2. Edge partitioning: Divide edges for cache-friendly batches
  3. Prefetching: Prefetch neighbor lists while processing current
  4. Delta-stepping (SSSP): Bucket vertices by distance for better locality

Questions to explore:

  • Why does direction-optimizing BFS help so much?
  • How do you parallelize without excessive synchronization?
  • What’s the difference between road networks and social networks?
  • How would you handle graphs that don’t fit in memory?

Learning milestones:

  1. CSR representation works correctly → You understand graph memory layout
  2. Direction-optimizing outperforms naive by 3-5x → You understand the technique
  3. Parallel version scales to 8 cores → You understand parallel graph algorithms
  4. You identify memory bandwidth as the bottleneck → You understand real limits

Project 15: Traveling Salesman Solver

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Optimization / Branch-and-Bound / Bit Manipulation
  • Software or Tool: Building Concorde-inspired solver
  • Main Book: “The Algorithm Design Manual” by Steven Skiena

What you’ll build: A high-performance TSP solver using branch-and-bound with effective bounds (1-tree, Held-Karp), intelligent branching strategies, and parallelization—capable of solving instances with 50-100 cities to optimality.

Why it teaches performance engineering: TSP is NP-hard, so brute force is hopeless. But clever algorithms can solve surprisingly large instances. This project brings together everything: bit manipulation for state representation, cache-efficient data structures, branch prediction optimization, and parallelization of search. It’s where algorithms meet systems performance.

Core challenges you’ll face:

  • State representation with bitmasks → maps to bit manipulation for efficiency
  • Computing tight lower bounds → maps to 1-tree, Held-Karp bounds
  • Branch-and-bound search → maps to pruning, search order
  • Parallel search with load balancing → maps to work distribution, bound sharing

Key Concepts:

  • TSP Tuning: MIT 6.172 Lecture 21 - Tuning a TSP Algorithm
  • Branch and Bound: “The Algorithm Design Manual” Chapter 7 - Skiena
  • Bit Manipulation for Sets: MIT 6.172 Lecture 3 - Bit Hacks
  • TSP Algorithms: “The Traveling Salesman Problem: A Computational Study” (Applegate et al.)

Difficulty: Expert Time estimate: 4-6 weeks Prerequisites: Dynamic programming, graph algorithms (Project 14), bit manipulation (Project 3), parallel programming (Project 6)

Real world outcome:

$ ./tsp_solver -i berlin52.tsp -t 60

TSP Solver - Berlin 52 cities
Optimal tour length: 7542 (known optimal)

Search Progress:
┌─────────────────────────────────────────────────────────────┐
│ Time   │ Best Tour │ Bound    │ Gap    │ Nodes    │ Pruned  │
├─────────────────────────────────────────────────────────────┤
│ 0.1s   │ 8,234     │ 6,891    │ 19.5%  │ 1,024    │ 45%     │
│ 0.5s   │ 7,891     │ 7,102    │ 11.1%  │ 12,892   │ 72%     │
│ 1.0s   │ 7,678     │ 7,289    │ 5.3%   │ 45,123   │ 84%     │
│ 2.0s   │ 7,542 ★   │ 7,421    │ 1.6%   │ 156,789  │ 91%     │
│ 3.2s   │ 7,542 ★   │ 7,542    │ 0.0%   │ 312,456  │ OPTIMAL │
└─────────────────────────────────────────────────────────────┘

Solution verified: OPTIMAL
Time to optimal: 2.0s
Time to prove: 3.2s
Nodes explored: 312,456 (vs ~2^52 = 4.5 × 10^15 brute force)

Parallel scaling (8 cores): 4.8x speedup

$ ./tsp_solver -i att48.tsp -algorithm dp
Dynamic Programming (Held-Karp):
  States: 2^48 × 48 = 13.5 × 10^15 — TOO LARGE
  Branch-and-bound is required for this size.

Implementation Hints:

Bitmask representation:

Visited set: 64-bit integer, bit i = 1 if city i visited
Operations:
  - Add city i: set |= (1ULL << i)
  - Remove city i: set &= ~(1ULL << i)
  - Check city i: (set >> i) & 1
  - Count visited: __builtin_popcountll(set)
  - Iterate unvisited: for each 0 bit...
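
The "for each 0 bit" iteration is the inner loop of the whole search, so it is worth spelling out; a sketch assuming at most 64 cities:

/* Iterate the unvisited cities without scanning every bit position:
   complement the visited mask (limited to n_cities), then peel off one
   set bit per iteration with ctz. */
#include <stdint.h>

void expand_node(uint64_t visited, int n_cities) {
    uint64_t mask = (n_cities == 64) ? ~0ULL : ((1ULL << n_cities) - 1);
    uint64_t unvisited = ~visited & mask;
    while (unvisited) {
        int city = __builtin_ctzll(unvisited);   /* lowest unvisited city   */
        /* ... extend the partial tour with `city`, recurse, backtrack ...  */
        unvisited &= unvisited - 1;              /* clear that bit, move on */
    }
}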

Branch-and-bound structure:

bound = compute_lower_bound(partial_tour)
if bound >= best_known:
    prune!  # This subtree can't improve

for each unvisited city c:
    extend tour with c
    recurse
    backtrack

Lower bounds (in increasing tightness/cost):

  1. Minimum spanning tree: O(n²) to compute
  2. 1-tree bound: MST of the non-start cities plus the two cheapest edges at the start city
  3. Held-Karp bound: Lagrangian ascent over 1-trees (equals the LP subtour relaxation)

Search order matters:

  • Nearest neighbor heuristic first → good initial bound
  • Best-first search → explore promising nodes first
  • Most constrained first → prune early

Parallelization:

  • Distribute subtrees to threads
  • Share bounds across threads (atomic best_known)
  • Load balancing: work stealing when thread finishes early
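
A sketch of the shared incumbent using C11 atomics; relaxed ordering is enough because the bound is only a pruning hint, and a slightly stale read just prunes a little less:

#include <limits.h>
#include <stdatomic.h>

static _Atomic long best_known = LONG_MAX;       /* length of best complete tour */

/* Record a newly found complete tour; keep the global minimum. */
void offer_tour(long length) {
    long cur = atomic_load_explicit(&best_known, memory_order_relaxed);
    while (length < cur &&
           !atomic_compare_exchange_weak_explicit(&best_known, &cur, length,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
        ;   /* a failed CAS reloads cur; retry only while we still improve it */
}

/* Prune test used inside the search. */
int can_prune(long lower_bound) {
    return lower_bound >= atomic_load_explicit(&best_known, memory_order_relaxed);
}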

Questions to explore:

  • Why does pruning percentage increase over time?
  • How much does a good initial heuristic help?
  • What’s the tradeoff between bound quality and compute time?
  • How do you handle memory for large state spaces?

Learning milestones:

  1. Correct exhaustive search → You understand the problem structure
  2. Bounds prune 90%+ of search space → You understand branch-and-bound
  3. Solve 20+ cities in seconds → You understand the optimizations
  4. Parallel version scales well → You understand parallel search

Project Comparison Table

Project                               Difficulty     Time             Depth of Understanding  Fun Factor
1.  Performance Measurement Toolkit   Intermediate   Weekend-1 week   ★★★★☆                   ★★★☆☆
2.  Matrix Multiplication Optimizer   Advanced       2-3 weeks        ★★★★★                   ★★★★☆
3.  Bit Manipulation Library          Intermediate   1 week           ★★★☆☆                   ★★★★☆
4.  Assembly Language Analyzer        Advanced       2-3 weeks        ★★★★★                   ★★★☆☆
5.  Custom Memory Allocator           Expert         3-4 weeks        ★★★★★                   ★★★★☆
6.  Parallel Matrix Operations        Advanced       2-3 weeks        ★★★★☆                   ★★★★☆
7.  Race Detector                     Expert         4-6 weeks        ★★★★★                   ★★★☆☆
8.  Cache-Oblivious B-Tree            Expert         3-4 weeks        ★★★★★                   ★★★☆☆
9.  Lock-Free Queue                   Expert         3-4 weeks        ★★★★★                   ★★★★☆
10. SIMD JSON Parser                  Expert         4-6 weeks        ★★★★★                   ★★★★★
11. Profile-Guided Optimizer          Expert         4-6 weeks        ★★★★★                   ★★★☆☆
12. Parallel Merge Sort               Advanced       2-3 weeks        ★★★★☆                   ★★★★☆
13. HW Perf Counter Dashboard         Advanced       2-3 weeks        ★★★★☆                   ★★★★★
14. Graph Algorithm Optimizer         Advanced       3-4 weeks        ★★★★☆                   ★★★★☆
15. TSP Solver                        Expert         4-6 weeks        ★★★★★                   ★★★★★

Phase 1: Foundations (4-6 weeks)

Build these first to establish measurement and low-level understanding:

  1. Project 1: Performance Measurement Toolkit — You need this to validate all other optimizations
  2. Project 3: Bit Manipulation Library — Quick win, teaches low-level thinking
  3. Project 2: Matrix Multiplication Optimizer — THE canonical performance engineering exercise

Phase 2: Systems Understanding (6-8 weeks)

Deepen your understanding of how systems actually work:

  1. Project 4: Assembly Language Analyzer — Learn to see what the compiler produces
  2. Project 5: Custom Memory Allocator — Understand dynamic memory deeply
  3. Project 13: HW Perf Counter Dashboard — Visual feedback on what’s happening

Phase 3: Parallelism (6-8 weeks)

Master parallel programming:

  1. Project 6: Parallel Matrix Operations — Fork-join parallelism basics
  2. Project 12: Parallel Merge Sort — Work-span analysis in practice
  3. Project 7: Race Detector — Understand concurrency bugs

Phase 4: Advanced Topics (8-12 weeks)

Push into expert territory:

  1. Project 9: Lock-Free Queue — Cutting-edge concurrency
  2. Project 8: Cache-Oblivious B-Tree — Advanced data structure design
  3. Project 10: SIMD JSON Parser — Pure magic performance

Phase 5: Integration (4-8 weeks)

Apply everything together:

  1. Project 14: Graph Algorithm Optimizer — Real-world memory-bound problem
  2. Project 11: Profile-Guided Optimizer — Compiler + runtime = holistic optimization
  3. Project 15: TSP Solver — Algorithms meet systems

Final Capstone Project: High-Performance Key-Value Store

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Distributed Systems / Storage / Full-Stack Performance
  • Software or Tool: Building a Redis/RocksDB competitor
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A high-performance in-memory key-value store that combines everything you’ve learned: custom memory allocator for minimal fragmentation, lock-free hash table for concurrent access, SIMD-accelerated key comparison, parallel request handling, and efficient persistence with write-ahead logging.

Why it teaches performance engineering: This is the ultimate integration project. Key-value stores are the backbone of modern infrastructure, and building one requires mastery of every concept: memory management, concurrency, caching, I/O, and measurement. Companies pay millions for fast key-value stores; you’ll understand why.

Core challenges you’ll face:

  • Lock-free hash table → applies Project 9
  • Custom slab allocator → applies Project 5
  • SIMD key comparison → applies Project 10
  • Parallel request handling → applies Projects 6, 12
  • Efficient persistence → new (WAL, fsync, mmap)
  • Performance measurement at every layer → applies Projects 1, 13

Key Concepts:

  • In-Memory Databases: “Designing Data-Intensive Applications” Chapter 3 - Kleppmann
  • Hash Table Design: All previous projects
  • Write-Ahead Logging: “Database Internals” by Alex Petrov
  • Efficient I/O: io_uring, mmap, direct I/O

Difficulty: Master Time estimate: 2-3 months Prerequisites: All previous projects (or equivalent experience)

Real world outcome:

$ ./kvstore_bench -threads 8 -ops 10000000 -keysize 32 -valuesize 256

High-Performance KV Store Benchmark
Operations: 10M (50% GET, 50% SET)
Keys: 32 bytes, Values: 256 bytes
Threads: 8

┌─────────────────────────────────────────────────────────────┐
│ Metric              │ Our KV Store │ Redis  │ Memcached    │
├─────────────────────────────────────────────────────────────┤
│ Throughput          │ 4.2M ops/s   │ 180K   │ 250K         │
│ Avg Latency         │ 1.9 μs       │ 44 μs  │ 32 μs        │
│ P99 Latency         │ 8.2 μs       │ 120 μs │ 85 μs        │
│ Memory Efficiency   │ 92%          │ 78%    │ 82%          │
└─────────────────────────────────────────────────────────────┘

Why we're faster:
  ✓ Lock-free hash table (no mutex contention)
  ✓ Custom allocator (no malloc overhead, no fragmentation)
  ✓ SIMD key comparison (32-byte keys in one instruction)
  ✓ Batched fsync (amortized persistence cost)
  ✓ Thread-local operation queues (reduced synchronization)

Persistence test:
  Write 1M keys, kill process, restart, verify all keys present: PASS
  WAL replay time: 0.8 seconds

Implementation Hints:

Architecture overview:

┌──────────────────────────────────────────────────────────────┐
│                     Request Handlers                          │
│   (thread-local queues, batch processing)                    │
├──────────────────────────────────────────────────────────────┤
│                    Lock-Free Hash Table                       │
│   (open addressing, SIMD probing, hopscotch/robin hood)      │
├──────────────────────────────────────────────────────────────┤
│                    Slab Allocator                             │
│   (size classes, thread-local caches, no fragmentation)      │
├──────────────────────────────────────────────────────────────┤
│                    Write-Ahead Log                            │
│   (append-only, group commit, background sync)               │
└──────────────────────────────────────────────────────────────┘

Key design decisions:

  1. Hash table: Lock-free with CAS, resize via gradual migration
  2. Memory: Slab allocator with power-of-2 size classes
  3. Key comparison: SIMD for fixed-size keys, optimized memcmp for variable
  4. Persistence: Append to WAL, batch fsync, background compaction
  5. Networking: Event-driven (epoll/kqueue) or io_uring
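
For decision 3, a sketch of the fixed-size case with AVX2 (the 32-byte keys from the benchmark above); non-AVX2 builds would fall back to memcmp:

/* Compile with -mavx2. One 256-bit load per key, one byte-wise compare,
   one movemask: the whole 32-byte key is checked without a loop. */
#include <immintrin.h>
#include <stdbool.h>

bool key32_equal(const void *a, const void *b) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i eq = _mm256_cmpeq_epi8(va, vb);       /* 0xFF in every matching byte */
    return _mm256_movemask_epi8(eq) == -1;        /* -1 means all 32 bytes match */
}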

Questions to explore:

  • How do you resize a lock-free hash table?
  • What’s the tradeoff between immediate and batched fsync?
  • How do you handle hot keys (skewed access)?
  • What’s the latency breakdown (network, compute, storage)?

Learning milestones:

  1. Lock-free hash table works correctly → You’ve mastered concurrency
  2. Throughput exceeds 1M ops/s → You’ve optimized the hot path
  3. Persistence doesn’t hurt performance → You understand I/O optimization
  4. You can explain every microsecond of latency → You’re a performance engineer

Summary

#         Project                                   Main Language
1         Performance Measurement Toolkit           C
2         Matrix Multiplication Optimizer           C
3         Bit Manipulation Library                  C
4         Assembly Language Analyzer                C
5         Custom Memory Allocator                   C
6         Parallel Matrix Operations with Cilk      C (OpenCilk)
7         Race Detector                             C
8         Cache-Oblivious B-Tree                    C
9         Lock-Free Queue                           C
10        SIMD JSON Parser                          C
11        Profile-Guided Optimizer                  C/C++
12        Parallel Merge Sort with Work-Stealing    C
13        Hardware Performance Counter Dashboard    C
14        Graph Algorithm Optimizer                 C
15        Traveling Salesman Solver                 C
Capstone  High-Performance Key-Value Store          C

Resources

Primary Reference

  • MIT 6.172 Performance Engineering of Software Systems (MIT OpenCourseWare): the lectures cited throughout the project descriptions above

Essential Books

  • “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Foundation for everything
  • “Introduction to Algorithms” by CLRS - Algorithms including parallel algorithms
  • “The Art of Multiprocessor Programming” by Herlihy & Shavit - Concurrency theory and practice

Performance-Specific Books

  • “Systems Performance” by Brendan Gregg - Practical performance engineering
  • “Hacker’s Delight” by Henry Warren - Bit manipulation techniques
  • “What Every Programmer Should Know About Memory” by Ulrich Drepper - Memory hierarchy deep dive

“The difference between a bad programmer and a good one is whether they consider their code or their data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.” — Linus Torvalds

“Premature optimization is the root of all evil — but so is premature pessimization.” — This course