Learn Performance Engineering: From Zero to Systems Wizard

Goal: Deeply understand software performance engineering—from bit-level optimizations to parallel algorithms, cache-efficient data structures, and building systems that squeeze every cycle out of modern hardware.

Based on: MIT 6.172 Performance Engineering of Software Systems (Fall 2018)


Why Performance Engineering Matters

In an era of “just throw more hardware at it,” true performance engineering is a lost art. Yet the difference between a naive implementation and an optimized one can be 100x or more—that’s the difference between a query taking 10 seconds and 100 milliseconds, between needing 100 servers and needing 1.

Modern CPUs are marvels of engineering: multiple cores, deep pipelines, branch predictors, multiple cache levels, SIMD units, and out-of-order execution. But to exploit this power, you need to understand:

  • How the machine actually works (not the abstraction)
  • Where time goes (measurement, not guessing)
  • What the compiler can and cannot do
  • How to structure data for cache efficiency
  • How to parallelize without destroying performance

After completing these projects, you will:

  • Write code that runs 10-100x faster than naive implementations
  • Understand assembly output and reason about instruction-level performance
  • Build parallel programs that scale with core count
  • Design cache-efficient algorithms and data structures
  • Profile and optimize real systems systematically
  • Understand the engineering tradeoffs in modern systems

Core Concept Analysis

The Performance Engineering Mindset

"Premature optimization is the root of all evil" - Knuth

But the fuller quote continues:

"We should forget about small efficiencies, say about 97% of the time...
Yet we should not pass up our opportunities in that critical 3%." - Knuth

The Memory Hierarchy

┌──────────────────────────────────────────────────────────────┐
│                        Registers                              │ ~1 cycle
│                         (few KB)                              │
├──────────────────────────────────────────────────────────────┤
│                         L1 Cache                              │ ~4 cycles
│                         (32-64 KB)                            │
├──────────────────────────────────────────────────────────────┤
│                         L2 Cache                              │ ~12 cycles
│                        (256 KB - 1 MB)                        │
├──────────────────────────────────────────────────────────────┤
│                         L3 Cache                              │ ~40 cycles
│                         (8-64 MB)                             │
├──────────────────────────────────────────────────────────────┤
│                        Main Memory                            │ ~200 cycles
│                         (GB - TB)                             │
├──────────────────────────────────────────────────────────────┤
│                           SSD                                 │ ~50,000 cycles
│                         (100s GB)                             │
├──────────────────────────────────────────────────────────────┤
│                           HDD                                 │ ~10,000,000 cycles
│                          (TBs)                                │
└──────────────────────────────────────────────────────────────┘

Fundamental Concepts

  1. Work and Span
    • Work: Total operations performed (sequential time)
    • Span: Longest dependency chain (parallel time with infinite processors)
    • Parallelism: Work / Span (how much parallel speedup is possible)
  2. Bentley’s Rules for Optimization
    • Data Structure Augmentation
    • Precomputation
    • Compile-Time Initialization
    • Caching
    • Lazy Evaluation
    • Loop Transformations (unrolling, fusion, fission)
    • Logic Optimizations
  3. The Roofline Model
    • Performance bound by either compute or memory bandwidth
    • Operational Intensity = FLOPs / Bytes moved
    • Know which bound applies to your algorithm
  4. Amdahl’s Law (see the worked example just after this list)
    Speedup = 1 / ((1 - P) + P/S)
    
    Where: P = parallelizable fraction
           S = speedup of the parallel portion
    
  5. Cache Efficiency
    • Temporal locality: reuse data while it’s hot
    • Spatial locality: access data contiguously
    • Cache-oblivious algorithms: work well regardless of cache size
  6. Instruction-Level Parallelism (ILP)
    • Modern CPUs execute multiple instructions per cycle
    • Data dependencies limit ILP
    • Branch mispredictions flush the pipeline (~15-20 cycles)
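
A quick worked example of Amdahl’s Law: if 90% of a program parallelizes perfectly and that portion runs on 8 cores (P = 0.9, S = 8), the overall speedup is 1 / ((1 - 0.9) + 0.9/8) = 1 / 0.2125 ≈ 4.7x — well short of 8x, which is why the serial fraction dominates at high core counts.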

Project 1: Performance Measurement Toolkit

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Profiling / Measurement / Benchmarking
  • Software or Tool: perf, RDTSC, clock_gettime
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A micro-benchmarking framework that accurately measures CPU cycles, cache misses, branch mispredictions, and memory bandwidth for arbitrary code snippets—with statistical rigor (warmup, multiple runs, outlier detection, confidence intervals).

Why it teaches performance engineering: You cannot optimize what you cannot measure. This project teaches you the fundamental skill of accurate performance measurement—understanding why naive timing is often wrong (warmup effects, frequency scaling, context switches), how hardware performance counters work, and how to draw valid conclusions from noisy data.

Core challenges you’ll face:

  • Cycle-accurate timing with RDTSC → maps to understanding CPU timestamp counters and serialization
  • Accessing hardware performance counters → maps to perf_event_open, PMU architecture
  • Eliminating measurement noise → maps to CPU frequency scaling, TurboBoost, process affinity
  • Statistical analysis of results → maps to variance, confidence intervals, outlier detection

Key Concepts:

  • RDTSC and Timing: “What Every Programmer Should Know About Memory” Section 7 - Ulrich Drepper
  • Hardware Performance Counters: “Computer Architecture” Chapter 4 - Hennessy & Patterson
  • Statistical Benchmarking: “Systems Performance” Chapter 2 - Brendan Gregg
  • CPU Frequency Scaling: Linux kernel documentation on cpufreq governors

Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: C programming, basic statistics, familiarity with Linux.

Real world outcome:

$ ./perf_measure -b matrix_mult -n 1000 -s 256
Benchmark: matrix_mult (256x256)
Runs: 1000, Warmup: 100

┌────────────────────────────────────────────────────────────┐
│ Metric              │ Mean      │ Std Dev  │ 95% CI       │
├────────────────────────────────────────────────────────────┤
│ CPU Cycles          │ 45.2M     │ 1.2M     │ ±0.08M       │
│ Instructions        │ 134.2M    │ 0.001M   │ ±0.0001M     │
│ IPC                 │ 2.97      │ 0.08     │ ±0.005       │
│ L1D Cache Misses    │ 2.1M      │ 0.05M    │ ±0.003M      │
│ L3 Cache Misses     │ 0.26M     │ 0.02M    │ ±0.001M      │
│ Branch Mispred      │ 15.2K     │ 0.8K     │ ±0.05K       │
│ Time (ms)           │ 15.2      │ 0.4      │ ±0.03        │
└────────────────────────────────────────────────────────────┘

Memory Bandwidth: 4.3 GB/s
Operational Intensity: 2.1 FLOPs/byte
Bound by: MEMORY (below roofline)

Implementation Hints:

The core of cycle-accurate timing on x86:

Serialize with CPUID (ensures all previous instructions complete)
Read TSC with RDTSC/RDTSCP
Run your code
Serialize again
Read TSC again
Difference = cycles elapsed
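
A minimal sketch of that sequence for GCC/Clang on x86-64 — the helper name measure_cycles is illustrative, and using CPUID for serialization plus RDTSCP for the second read is one common choice, not the only one:

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp */
#include <cpuid.h>       /* __get_cpuid (GCC/Clang): executes a serializing CPUID */

/* Measure elapsed TSC ticks around fn(arg). CPUID drains earlier work before
   the first read; RDTSCP waits for prior instructions before the second read,
   and the trailing CPUID keeps later code out of the measured window. */
static inline uint64_t measure_cycles(void (*fn)(void *), void *arg) {
    unsigned a, b, c, d, aux;
    __get_cpuid(0, &a, &b, &c, &d);      /* serialize before starting the clock */
    uint64_t start = __rdtsc();          /* read timestamp counter */
    fn(arg);                             /* code under test */
    uint64_t end = __rdtscp(&aux);       /* read TSC after prior insns retire */
    __get_cpuid(0, &a, &b, &c, &d);      /* fence off anything that follows */
    return end - start;
}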

For hardware counters, Linux provides perf_event_open():

  • What events are available? Check /sys/devices/cpu/events/
  • You can read multiple counters simultaneously using event groups
  • Consider: How do you handle multiplexing if you need more counters than available PMUs?
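
A minimal perf_event_open() sketch for a single hardware counter (CPU cycles, current thread, user space only); error handling is mostly omitted, and the structure follows the perf_event_open(2) man page rather than any particular framework design:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <string.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

/* Open a counter for user-space CPU cycles on the calling thread, any CPU. */
static long perf_open_cycles(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;                 /* start disabled, enable around the region */
    attr.exclude_kernel = 1;           /* count user-space cycles only */
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int fd = (int)perf_open_cycles();
    if (fd < 0) return 1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... code under test ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t cycles = 0;
    read(fd, &cycles, sizeof(cycles));  /* default read_format: a single u64 */
    printf("cycles: %llu\n", (unsigned long long)cycles);
    close(fd);
    return 0;
}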

For statistical rigor:

  • How many warmup runs eliminate JIT/cache cold effects?
  • How do you detect outliers? (IQR method, Z-score?)
  • What sample size gives you 95% confidence?
  • Should you use mean, median, or minimum?

Questions to ask yourself:

  • Why might back-to-back measurements differ by 1000x?
  • What happens if TurboBoost kicks in mid-measurement?
  • How does hyperthreading affect your measurements?

Learning milestones:

  1. RDTSC timing works consistently → You understand instruction serialization
  2. Hardware counters report correctly → You understand the PMU interface
  3. Results are reproducible across runs → You understand measurement methodology
  4. You can identify noise vs. real variance → You understand performance statistics

Project 2: Matrix Multiplication Optimizer

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust, Fortran
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Numerical Computing / Cache Optimization / SIMD
  • Software or Tool: BLAS comparison, Intel Intrinsics
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron (Chapter 5 & 6)

What you’ll build: A matrix multiplication library that progressively improves from naive O(n³) implementation to a highly optimized version using blocking, SIMD intrinsics, and loop transformations—achieving within 50% of Intel MKL performance.

Why it teaches performance engineering: Matrix multiplication is THE canonical example for performance optimization. The naive algorithm leaves 99% of CPU performance on the table. By optimizing it, you encounter every major concept: cache blocking, loop tiling, SIMD vectorization, instruction-level parallelism, and memory access patterns. The gap between naive (~1 GFLOP) and optimized (~100+ GFLOP) is so dramatic that it permanently changes how you think about code.

Core challenges you’ll face:

  • Cache blocking/tiling → maps to fitting working set in L1/L2 cache
  • Loop ordering (ijk vs ikj vs …) → maps to spatial locality and stride access
  • SIMD vectorization → maps to AVX2/AVX-512 intrinsics for parallel arithmetic
  • Register tiling → maps to keeping data in registers to avoid cache access
  • TLB efficiency → maps to minimizing page faults in large matrices

Key Concepts:

  • Loop Tiling for Cache: “Computer Systems: A Programmer’s Perspective” Chapter 6.6 - Bryant & O’Hallaron
  • SIMD Intrinsics: Intel Intrinsics Guide (online reference)
  • Matrix Multiplication Optimization: MIT 6.172 Lecture 1 - Charles Leiserson
  • Cache-Oblivious Algorithms: “Introduction to Algorithms” Chapter 27 - CLRS

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: C programming, understanding of caches, Project 1 (measurement).

Real world outcome:

$ ./matmul_bench -n 1024
Matrix Multiplication Benchmark (1024x1024, double precision)
Peak theoretical: 150 GFLOPS (assuming 3 GHz, 8-wide AVX, FMA)

┌─────────────────────────────────────────────────────────────┐
│ Implementation          │ Time (ms) │ GFLOPS │ % of Peak   │
├─────────────────────────────────────────────────────────────┤
│ Naive (ijk)             │ 8,542     │ 0.25   │ 0.17%       │
│ Loop interchange (ikj)  │ 1,923     │ 1.11   │ 0.74%       │
│ 32x32 blocking          │ 412       │ 5.19   │ 3.46%       │
│ + Loop unrolling (4x)   │ 198       │ 10.81  │ 7.21%       │
│ + AVX2 SIMD             │ 52        │ 41.12  │ 27.41%      │
│ + Register tiling       │ 31        │ 69.02  │ 46.01%      │
│ + Prefetching           │ 26        │ 82.28  │ 54.85%      │
├─────────────────────────────────────────────────────────────┤
│ Intel MKL (reference)   │ 18        │ 118.82 │ 79.21%      │
└─────────────────────────────────────────────────────────────┘

Final speedup over naive: 328x

Implementation Hints:

Start with the naive implementation and measure it carefully:

for i in 0..n:
    for j in 0..n:
        for k in 0..n:
            C[i][j] += A[i][k] * B[k][j]

The first optimization is free—just change loop order. Why does ikj outperform ijk by 4-8x? Draw the memory access patterns.
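
For reference, a plain ikj version in C (row-major, flat arrays; the function name is illustrative) — note how the innermost loop becomes stride-1 over both B and C while A[i][k] stays in a register:

/* ikj ordering: the inner loop streams one row of B and one row of C
   contiguously, so cache lines and hardware prefetching are fully used. */
void matmul_ikj(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a_ik = A[i * n + k];       /* loop-invariant for inner loop */
            for (int j = 0; j < n; j++)
                C[i * n + j] += a_ik * B[k * n + j];
        }
}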

For cache blocking, think about:

  • What block size fits in L1 cache? (32KB ≈ 4096 doubles ≈ 64x64 block… but you need 3 matrices)
  • The answer is usually 32x32 or 64x64 for doubles
  • Why does blocking help? What’s the new miss rate?
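
A minimal blocking sketch under the simplifying assumption that n is a multiple of the block size; BS = 64 makes each tile 64×64×8 B = 32 KB, so the three-tile working set targets L2 rather than L1 — treat it as a starting point and tune by measuring:

#define BS 64   /* block size: tune for your cache, try 32 and 64 */

/* Blocked matmul sketch (assumes n % BS == 0 for brevity). */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* multiply one BS x BS tile: its working set stays cache-resident */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a_ik = A[i * n + k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i * n + j] += a_ik * B[k * n + j];
                    }
}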

For SIMD:

  • AVX2 gives you 4 doubles (256-bit) or AVX-512 gives 8 doubles (512-bit)
  • How do you load 4 doubles, multiply them, and accumulate?
  • What about when dimensions aren’t multiples of 4?
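
One way the inner loop can look with AVX2 + FMA intrinsics (compile with -mavx2 -mfma; assumes row-major flat arrays and that n is a multiple of 4 — the remainder handling the bullet above asks about is deliberately left out):

#include <immintrin.h>

/* AVX2 sketch: update 4 doubles of one C row per iteration. */
void matmul_ikj_avx2(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            __m256d a = _mm256_set1_pd(A[i * n + k]);    /* broadcast A[i][k] */
            for (int j = 0; j < n; j += 4) {
                __m256d c = _mm256_loadu_pd(&C[i * n + j]);
                __m256d b = _mm256_loadu_pd(&B[k * n + j]);
                c = _mm256_fmadd_pd(a, b, c);            /* c += a * b */
                _mm256_storeu_pd(&C[i * n + j], c);
            }
        }
}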

Register tiling insight:

  • You have 16 YMM registers (AVX2)
  • Can you keep a 4x4 block of C in registers while streaming A and B?
  • This is the key to hitting peak FLOPS

Questions to explore:

  • What happens when matrices don’t fit in L3 cache?
  • How would you parallelize this across cores? (Project 6)
  • Why is Strassen’s algorithm not used in practice for moderate sizes?

Learning milestones:

  1. Loop interchange gives 4x speedup → You understand stride-1 access patterns
  2. Blocking gives 10x over interchange → You understand cache working sets
  3. SIMD gives 4x over blocked → You understand vector parallelism
  4. You achieve 50%+ of MKL → You understand the full optimization stack

Project 3: Bit Manipulation Library

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Bit Manipulation / Low-Level Optimization
  • Software or Tool: None (pure computation)
  • Main Book: “Hacker’s Delight” by Henry S. Warren Jr.

What you’ll build: A library of bit manipulation routines that implement common operations (popcount, leading/trailing zeros, bit reversal, next permutation, etc.) using clever bit hacks—with benchmarks comparing to naive implementations and compiler intrinsics.

Why it teaches performance engineering: Bit hacks demonstrate that the same computation can be done in radically different ways with orders of magnitude performance difference. They teach you to think about data representation, exploit instruction-level parallelism, and understand how computers actually compute. These techniques appear everywhere: hash functions, compression, cryptography, graphics, and systems programming.

Core challenges you’ll face:

  • Implementing popcount without loops → maps to parallel bit counting with shifts and masks
  • Finding leading/trailing zeros → maps to binary search with bits
  • Understanding why these work → maps to bitwise arithmetic and Boolean algebra
  • Comparing with hardware instructions → maps to when to use intrinsics vs. pure C

Key Concepts:

  • Bit Twiddling: “Hacker’s Delight” Chapters 2-5 - Henry Warren
  • Bit Hacks Collection: MIT 6.172 Lecture 3 - Bit Hacks
  • Population Count: “Computer Systems: A Programmer’s Perspective” Practice Problem 2.66
  • POPCNT and LZCNT Instructions: Intel 64 and IA-32 Architectures Software Developer’s Manual

Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: C programming, binary number system, bitwise operators.

Real world outcome:

$ ./bithacks_test
Bit Manipulation Library - Performance Tests

┌────────────────────────────────────────────────────────────┐
│ Operation           │ Naive    │ BitHack  │ Intrinsic    │
├────────────────────────────────────────────────────────────┤
│ popcount(64-bit)    │ 64 ops   │ 12 ops   │ 1 op (POPCNT)│
│ leading zeros       │ 64 ops   │ 5 ops    │ 1 op (LZCNT) │
│ trailing zeros      │ 64 ops   │ 5 ops    │ 1 op (TZCNT) │
│ reverse bits        │ 64 ops   │ 17 ops   │ N/A          │
│ next power of 2     │ 6+ ops   │ 5 ops    │ 2 ops        │
│ is power of 2       │ loop     │ 2 ops    │ 2 ops        │
│ log2 (integer)      │ 64 ops   │ 5 ops    │ 1 op (LZCNT) │
│ abs (branchless)    │ 1 branch │ 3 ops    │ 3 ops        │
│ sign extend         │ 1 branch │ 2 ops    │ 2 ops        │
│ bit interleave      │ 64 ops   │ 12 ops   │ PDEP 1 op    │
└────────────────────────────────────────────────────────────┘

Detailed popcount benchmark (1M random 64-bit values):
  Naive (loop):      45.2 ms
  SWAR (parallel):   12.1 ms (3.7x faster)
  __builtin_popcountll: 2.3 ms (19.6x faster)
  Hardware POPCNT:   1.8 ms (25.1x faster)

Implementation Hints:

The classic popcount algorithm (SWAR - SIMD Within A Register):

Think of a 64-bit integer as 64 1-bit counters
Step 1: Add pairs of bits → 32 2-bit counters
Step 2: Add pairs of 2-bit counters → 16 4-bit counters
Step 3: Add pairs of 4-bit counters → 8 8-bit counters
...continue until one 64-bit counter

The magic constants mask out alternating positions.
What are those constants? (0x5555..., 0x3333..., 0x0F0F..., etc.)
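
Written out as a hedged reference, this is the widely known SWAR formulation, including the final multiply that sums the per-byte counters:

#include <stdint.h>

/* SWAR popcount: each mask keeps alternating 1-, 2-, and 4-bit fields so
   neighbouring counters can be added without carrying into each other. */
static inline int popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);                           /* 2-bit counters */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); /* 4-bit counters */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           /* 8-bit counters */
    return (int)((x * 0x0101010101010101ULL) >> 56);                      /* sum all bytes */
}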

For leading zeros:

Binary search, but with bits!
If the top 32 bits are zero, we know result ≥ 32
If the top 16 of remaining are zero, result ≥ 48
...continue halving

How do you check "top N bits are zero" with one operation?
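
One answer is a shift-and-test cascade; a sketch of the resulting count-leading-zeros routine (returning 64 for zero input by convention):

#include <stdint.h>

/* Binary search on bit groups: each test checks "are the top N bits zero?"
   with a single shift, then discards them by shifting left. */
static inline int clz64(uint64_t x) {
    if (x == 0) return 64;
    int n = 0;
    if ((x >> 32) == 0) { n += 32; x <<= 32; }
    if ((x >> 48) == 0) { n += 16; x <<= 16; }
    if ((x >> 56) == 0) { n += 8;  x <<= 8;  }
    if ((x >> 60) == 0) { n += 4;  x <<= 4;  }
    if ((x >> 62) == 0) { n += 2;  x <<= 2;  }
    if ((x >> 63) == 0) { n += 1; }
    return n;
}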

The “is power of 2” hack: x && !(x & (x-1))

  • Why does this work?
  • What’s special about x - 1 when x is a power of 2?

Questions to explore:

  • Why is branchless abs faster than x < 0 ? -x : x?
  • What’s the relationship between popcount and Hamming distance?
  • How do modern CPUs implement POPCNT in hardware?
  • When should you use intrinsics vs. let the compiler optimize?

Learning milestones:

  1. You can explain why each hack works → You understand bit-level reasoning
  2. You implement popcount with SWAR → You understand parallel computation in registers
  3. You know when to use intrinsics → You understand the performance tradeoffs
  4. You can derive new bit hacks → You’ve internalized the techniques

Project 4: Assembly Language Analyzer

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C (analyzing assembly output)
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Assembly Language / Compiler Output / CPU Architecture
  • Software or Tool: objdump, Compiler Explorer (godbolt.org)
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron (Chapter 3)

What you’ll build: A tool that takes C source code, compiles it at various optimization levels, disassembles the output, and provides annotated analysis—showing instruction count, estimated cycles, register usage, memory accesses, and potential bottlenecks (dependencies, cache misses, branch mispredictions).

Why it teaches performance engineering: To optimize code, you must see what the compiler produces. This project forces you to read assembly fluently and understand the mapping from C to machine code. You’ll learn why some “obvious” optimizations don’t help (compiler already did them), why some code is slow (bad instruction selection), and how to write “compiler-friendly” C.

Core challenges you’ll face:

  • Parsing x86-64 assembly → maps to instruction format, operand encoding
  • Understanding calling conventions → maps to register usage, stack frames
  • Identifying optimization patterns → maps to loop unrolling, vectorization, strength reduction
  • Estimating cycle counts → maps to instruction latency, throughput, dependencies

Key Concepts:

  • x86-64 Assembly: “Computer Systems: A Programmer’s Perspective” Chapter 3 - Bryant & O’Hallaron
  • Compiler Optimizations: MIT 6.172 Lecture 5 - C to Assembly & Lecture 9 - What Compilers Can and Cannot Do
  • Intel Architecture: Intel 64 and IA-32 Optimization Reference Manual
  • Instruction Tables: Agner Fog’s Instruction Tables (online)

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: C programming, basic understanding of CPU architecture, familiarity with assembly concepts.

Real world outcome:

$ ./asm_analyzer sum_array.c -O0 -O2 -O3

Source: sum_array.c
┌─────────────────────────────────────────────────────────────┐
│ long sum_array(long *arr, int n) {                          │
│     long sum = 0;                                           │
│     for (int i = 0; i < n; i++) {                          │
│         sum += arr[i];                                      │
│     }                                                       │
│     return sum;                                             │
│ }                                                           │
└─────────────────────────────────────────────────────────────┘

═══════════════════════════════════════════════════════════════
OPTIMIZATION LEVEL: -O0 (no optimization)
═══════════════════════════════════════════════════════════════
Instructions: 23
Estimated cycles/iteration: 12-15
Issues identified:
  ⚠ Memory spilling: sum stored to stack every iteration
  ⚠ No loop unrolling
  ⚠ Redundant loads of loop variables

Assembly (annotated):
  push   rbp                  ; Function prologue
  mov    rbp, rsp
  mov    QWORD PTR [rbp-24], rdi  ; arr → stack (SPILL!)
  mov    DWORD PTR [rbp-28], esi  ; n → stack (SPILL!)
  mov    QWORD PTR [rbp-8], 0     ; sum = 0
  mov    DWORD PTR [rbp-12], 0    ; i = 0
.L3:                              ; Loop start
  mov    eax, DWORD PTR [rbp-12]  ; Load i (from stack!)
  cmp    eax, DWORD PTR [rbp-28]  ; Compare i < n (from stack!)
  jge    .L2                      ; Exit if >=
  mov    eax, DWORD PTR [rbp-12]  ; Load i AGAIN!
  cdqe
  lea    rdx, [0+rax*8]
  mov    rax, QWORD PTR [rbp-24]  ; Load arr (from stack!)
  add    rax, rdx
  mov    rax, QWORD PTR [rax]     ; Load arr[i]
  add    QWORD PTR [rbp-8], rax   ; sum += arr[i] (to stack!)
  add    DWORD PTR [rbp-12], 1    ; i++
  jmp    .L3
.L2:
  mov    rax, QWORD PTR [rbp-8]   ; Return sum
  pop    rbp
  ret

═══════════════════════════════════════════════════════════════
OPTIMIZATION LEVEL: -O2 (standard optimization)
═══════════════════════════════════════════════════════════════
Instructions: 8
Estimated cycles/iteration: 3-4
Optimizations applied:
  ✓ Register allocation (no spills)
  ✓ Pointer increment instead of indexing
  ✓ Minimal loop overhead

Assembly (annotated):
  test   esi, esi           ; if (n <= 0)
  jle    .L4                 ;   return 0
  mov    eax, esi
  lea    rdx, [rdi+rax*8]    ; end = arr + n
  xor    eax, eax            ; sum = 0
.L3:                         ; Loop (tight!)
  add    rax, QWORD PTR [rdi] ; sum += *arr
  add    rdi, 8              ; arr++
  cmp    rdi, rdx            ; while (arr != end)
  jne    .L3
  ret
.L4:
  xor    eax, eax            ; return 0
  ret

═══════════════════════════════════════════════════════════════
OPTIMIZATION LEVEL: -O3 (aggressive optimization)
═══════════════════════════════════════════════════════════════
Instructions: 24 (but vectorized!)
Estimated cycles/iteration: 0.5-1 (4 elements per iteration)
Optimizations applied:
  ✓ All -O2 optimizations
  ✓ Loop vectorization (SSE/AVX)
  ✓ 4x unrolling with SIMD

Vectorized loop uses VPADDQ (packed 64-bit add) on 4 longs per YMM register!

Implementation Hints:

Structure of the analyzer:

1. Invoke compiler: gcc -S -masm=intel [opts] file.c -o file.s
2. Parse assembly output (or use objdump on .o file)
3. For each instruction:
   - Classify: ALU, memory, branch, etc.
   - Track register def/use chains
   - Estimate latency from instruction tables
4. Identify patterns:
   - Stack spills (mov to [rbp-N])
   - Vectorization (xmm/ymm registers, packed ops)
   - Loop unrolling (duplicated loop bodies)
5. Generate report
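
A deliberately crude sketch of steps 1-3 to get started (the file name sum_array.c and the /tmp output path are just examples; real classification needs a proper tokenizer plus instruction tables):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    /* Step 1: compile to Intel-syntax assembly. */
    if (system("gcc -O2 -S -masm=intel sum_array.c -o /tmp/out.s") != 0) return 1;
    FILE *f = fopen("/tmp/out.s", "r");
    if (!f) return 1;
    char line[256];
    int total = 0, mem = 0, branch = 0, vector = 0;
    /* Steps 2-3: walk the file and classify each instruction line crudely. */
    while (fgets(line, sizeof line, f)) {
        if (line[0] != '\t' || line[1] == '.') continue;      /* skip labels/directives */
        total++;
        if (strstr(line, "PTR")) mem++;                        /* has a memory operand */
        if (line[1] == 'j' || strncmp(line + 1, "call", 4) == 0) branch++;
        if (strstr(line, "xmm") || strstr(line, "ymm") || strstr(line, "zmm")) vector++;
    }
    fclose(f);
    printf("instructions=%d memory=%d branch=%d vector=%d\n", total, mem, branch, vector);
    return 0;
}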

Key patterns to recognize:

  • Strength reduction: x * 8 → x << 3
  • Loop-invariant code motion: computation moved out of loop
  • Induction variable optimization: array indexing → pointer increment
  • Tail call optimization: jmp instead of call; ret

For cycle estimation, use Agner Fog’s tables:

  • Latency: cycles until result is ready
  • Throughput: cycles per instruction (can be < 1 due to superscalar)
  • Consider: Which instructions can execute in parallel?

Questions to explore:

  • Why does -O0 spill everything to the stack?
  • What prevents -O2 from vectorizing this loop?
  • How do you recognize auto-vectorized code?
  • What flags enable/disable specific optimizations?

Learning milestones:

  1. You can read x86-64 assembly fluently → You understand the instruction set
  2. You identify compiler optimizations → You understand what compilers do
  3. You predict what the compiler will produce → You can write optimizer-friendly code
  4. You spot missed optimizations → You can guide the compiler with hints

Project 5: Custom Memory Allocator

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Memory Management / Systems Programming
  • Software or Tool: Building replacement for malloc
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you’ll build: A custom malloc/free implementation with multiple allocation strategies (first-fit, best-fit, segregated free lists, buddy allocator) and performance characteristics tuned for different workloads—benchmarked against glibc malloc and jemalloc.

Why it teaches performance engineering: Memory allocation is a perfect microcosm of performance engineering: you must balance speed, memory overhead, fragmentation, and cache efficiency. Every systems programmer should understand what happens when they call malloc(). This project teaches memory layout, pointer arithmetic, fragmentation strategies, and the trade-offs that real allocators make.

Core challenges you’ll face:

  • Managing the heap with sbrk/mmap → maps to virtual memory, page allocation
  • Free list management → maps to data structure choice, coalescing
  • Minimizing fragmentation → maps to internal vs. external fragmentation
  • Thread-safe allocation → maps to lock-free data structures or per-thread caches

Key Concepts:

  • Dynamic Memory Allocation: “Computer Systems: A Programmer’s Perspective” Chapter 9.9 - Bryant & O’Hallaron
  • Storage Allocation: MIT 6.172 Lectures 11-12 - Storage Allocation
  • Allocator Design: “The Linux Programming Interface” Chapter 7 - Michael Kerrisk
  • Buddy System: Knuth, “The Art of Computer Programming” Volume 1

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: C programming, pointers, virtual memory concepts, Projects 1-2.

Real world outcome:

$ ./malloc_bench -a custom -w mixed
Memory Allocator Benchmark

Workload: Mixed (varying sizes, random order free)
Operations: 10M allocations, 10M frees

┌─────────────────────────────────────────────────────────────┐
│ Allocator        │ Alloc/s │ Free/s  │ Memory │ Fragmentation│
├─────────────────────────────────────────────────────────────┤
│ glibc malloc     │ 15.2M   │ 18.1M   │ 1.00x  │ 12%          │
│ jemalloc         │ 22.1M   │ 25.3M   │ 1.05x  │ 8%           │
│ custom-firstfit  │ 8.3M    │ 10.2M   │ 1.02x  │ 35%          │
│ custom-bestfit   │ 3.1M    │ 10.2M   │ 1.02x  │ 18%          │
│ custom-segregated│ 18.4M   │ 21.2M   │ 1.08x  │ 10%          │
│ custom-buddy     │ 20.1M   │ 22.8M   │ 1.50x  │ 0% (internal)│
└─────────────────────────────────────────────────────────────┘

$ ./malloc_test_suite custom
Testing: custom allocator
  ✓ Basic allocation and free
  ✓ Alignment (16-byte)
  ✓ Realloc preserves data
  ✓ Large allocations (> 128KB use mmap)
  ✓ Coalescing adjacent free blocks
  ✓ No memory leaks
  ✓ Thread safety (4 threads, 1M ops each)

All tests passed!

Implementation Hints:

Basic structure of a block:

┌──────────────────────────────────────────────────────────┐
│ Header (8 bytes): size | flags (allocated/free)          │
├──────────────────────────────────────────────────────────┤
│                                                          │
│                      Payload                             │
│                    (user data)                           │
│                                                          │
├──────────────────────────────────────────────────────────┤
│ Footer (optional): size (for coalescing)                 │
└──────────────────────────────────────────────────────────┘

Strategies to implement:

  1. First-fit: Traverse free list, use first block that fits
  2. Best-fit: Find smallest block that fits (slow but less fragmentation)
  3. Segregated free lists: Separate lists for different size classes
  4. Buddy system: Split/merge powers of 2 (fast but internal fragmentation)
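
A sketch of strategy 1 over an explicit free list — the header layout and the global free_list are illustrative choices, not the only workable design:

#include <stddef.h>

/* Block header sketch: sizes are 16-byte aligned, so the low bit of the size
   word is free to store the allocated flag. next_free is only meaningful
   while the block is on the free list. */
typedef struct block_header {
    size_t size_and_flags;              /* size | 1 when allocated */
    struct block_header *next_free;
} block_header;

#define ALLOCATED     1UL
#define BLOCK_SIZE(h) ((h)->size_and_flags & ~ALLOCATED)
#define IS_FREE(h)    (((h)->size_and_flags & ALLOCATED) == 0)

static block_header *free_list;         /* hypothetical list head */

/* First-fit: return the first free block large enough, or NULL. */
static block_header *find_first_fit(size_t need) {
    for (block_header *b = free_list; b != NULL; b = b->next_free)
        if (BLOCK_SIZE(b) >= need)
            return b;                   /* caller splits if it is much larger */
    return NULL;                        /* no fit: extend the heap (sbrk/mmap) */
}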

Key operations:

  • Splitting: When allocating from a larger free block
  • Coalescing: When freeing, merge with adjacent free blocks
  • Extending heap: When no suitable free block exists

For thread safety:

  • Simple approach: global lock
  • Better: per-size-class locks
  • Advanced: thread-local caches (like tcmalloc/jemalloc)

Questions to consider:

  • What alignment is required? (Usually 16 bytes on x86-64)
  • When should you use mmap vs. sbrk?
  • How do you handle realloc efficiently?
  • What about zero-size allocations?

Learning milestones:

  1. First-fit allocator passes basic tests → You understand heap management
  2. Coalescing works correctly → You understand fragmentation prevention
  3. Segregated lists outperform first-fit → You understand data structure trade-offs
  4. Thread-safe version works → You understand concurrent data structures

Project 6: Parallel Matrix Operations with Cilk

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C (with Cilk/OpenCilk extensions)
  • Alternative Programming Languages: C++ (with TBB), Rust (with Rayon)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Parallel Programming / Work-Stealing / Fork-Join
  • Software or Tool: OpenCilk, Intel TBB, or Rayon
  • Main Book: “Introduction to Algorithms” by CLRS (Chapter 27: Multithreaded Algorithms)

What you’ll build: A parallel linear algebra library implementing matrix multiplication, transpose, and LU decomposition using fork-join parallelism—with work-span analysis demonstrating theoretical speedup and benchmarks showing actual speedup across core counts.

Why it teaches performance engineering: Parallelism is the only way to exploit modern CPUs, but naive parallelism often makes things worse (overhead, false sharing, poor cache usage). This project teaches you to analyze parallel algorithms (work-span), use structured parallelism (fork-join), and understand why practical speedup differs from theoretical speedup.

Core challenges you’ll face:

  • Work-span analysis → maps to understanding parallelism = work/span
  • Granularity control → maps to when parallel overhead exceeds benefit
  • Cache-efficient parallel algorithms → maps to cache-oblivious techniques
  • Measuring parallel efficiency → maps to speedup curves, Amdahl’s law

Key Concepts:

  • Multithreaded Algorithms: MIT 6.172 Lectures 6-8 - Multicore Programming
  • Work-Span Analysis: “Introduction to Algorithms” Chapter 27 - CLRS
  • Cilk Runtime: MIT 6.172 Lecture 13 - The Cilk Runtime System
  • Cache-Oblivious Algorithms: MIT 6.172 Lecture 15

Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: C programming, matrix multiplication (Project 2), understanding of parallelism concepts.

Real world outcome:

$ ./parallel_linalg_bench -n 2048 --cores 1,2,4,8,16

Parallel Linear Algebra Benchmark
Matrix size: 2048x2048 (double precision)
System: 16 cores, 32 threads

┌─────────────────────────────────────────────────────────────┐
│ Operation         │ Work (T₁) │ Span (T∞) │ Parallelism     │
├─────────────────────────────────────────────────────────────┤
│ Matrix Multiply   │ O(n³)     │ O(log n)  │ O(n³/log n) ≈ ∞ │
│ Matrix Transpose  │ O(n²)     │ O(log n)  │ O(n²/log n)     │
│ LU Decomposition  │ O(n³)     │ O(n log n)│ O(n²/log n)     │
└─────────────────────────────────────────────────────────────┘

Actual Speedup (Matrix Multiply 2048x2048):
┌─────────────────────────────────────────────────────────────┐
│ Cores │ Time (ms) │ Speedup │ Efficiency │ GFLOPS          │
├─────────────────────────────────────────────────────────────┤
│ 1     │ 4,521     │ 1.0x    │ 100%       │ 3.8             │
│ 2     │ 2,298     │ 1.97x   │ 98%        │ 7.5             │
│ 4     │ 1,178     │ 3.84x   │ 96%        │ 14.6            │
│ 8     │ 612       │ 7.39x   │ 92%        │ 28.1            │
│ 16    │ 342       │ 13.22x  │ 83%        │ 50.3            │
└─────────────────────────────────────────────────────────────┘

Efficiency analysis:
  - Near-linear scaling up to 8 cores (92%)
  - Diminishing returns at 16 cores (memory bandwidth limited)
  - Achieved 50 GFLOPS (33% of peak)

Implementation Hints:

Cilk provides two keywords:

  • cilk_spawn: Fork a task
  • cilk_sync: Wait for spawned children
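
A minimal OpenCilk example of the two keywords on a toy problem (recursive array sum, compiled with the OpenCilk toolchain via -fopencilk); GRAIN is a hypothetical sequential cutoff — exactly the granularity knob discussed below:

#include <cilk/cilk.h>
#include <stddef.h>
#include <stdint.h>

#define GRAIN 4096   /* below this, recursion overhead would dominate */

int64_t sum(const int64_t *a, size_t n) {
    if (n <= GRAIN) {
        int64_t s = 0;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }
    int64_t left = cilk_spawn sum(a, n / 2);        /* fork: left half */
    int64_t right = sum(a + n / 2, n - n / 2);      /* this strand: right half */
    cilk_sync;                                      /* join before combining */
    return left + right;
}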

For matrix multiply, the divide-and-conquer approach:

matmul(C, A, B, n):
    if n <= BASE_CASE:
        sequential_matmul(C, A, B, n)
    else:
        allocate a temporary T (n x n)
        partition A, B, C, T into quadrants
        # 8 recursive products into disjoint quadrants: spawn 7, run the 8th here
        cilk_spawn matmul(C00, A00, B00, n/2)
        cilk_spawn matmul(C01, A00, B01, n/2)
        cilk_spawn matmul(C10, A10, B00, n/2)
        cilk_spawn matmul(C11, A10, B01, n/2)
        cilk_spawn matmul(T00, A01, B10, n/2)
        cilk_spawn matmul(T01, A01, B11, n/2)
        cilk_spawn matmul(T10, A11, B10, n/2)
        matmul(T11, A11, B11, n/2)
        cilk_sync
        parallel_add(C, T, n)   # C += T, elementwise, itself divide-and-conquer

Work-span analysis:

  • Work T₁: Total operations if run sequentially
  • Span T∞: Critical path (longest dependency chain)
  • Parallelism: T₁/T∞ (how many processors can be used effectively)

Granularity control is crucial:

  • Too fine: Overhead dominates (spawn cost ~hundreds of cycles)
  • Too coarse: Not enough parallelism
  • Rule of thumb: Base case should be ~10μs of work

For cache efficiency:

  • Use cache-oblivious recursive decomposition
  • This naturally fits in cache at the right level
  • Avoid false sharing between threads

Questions to explore:

  • What happens when parallelism » cores?
  • Why doesn’t LU decomposition scale as well as matrix multiply?
  • How does the work-stealing scheduler affect cache behavior?
  • What’s the overhead of cilk_spawn vs. sequential call?

Learning milestones:

  1. Basic parallel matmul works → You understand fork-join parallelism
  2. Near-linear speedup up to 4-8 cores → You understand granularity control
  3. Work-span analysis matches empirical results → You understand parallel complexity
  4. You identify bottlenecks (memory, contention) → You understand real-world limits

Project 7: Race Detector

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Concurrency / Race Detection / Program Analysis
  • Software or Tool: Building a tool similar to ThreadSanitizer
  • Main Book: “The Art of Multiprocessor Programming” by Herlihy & Shavit

What you’ll build: A race detector that instruments parallel C programs to detect data races at runtime—tracking memory accesses, maintaining vector clocks, and reporting conflicting accesses with source locations and stack traces.

Why it teaches performance engineering: Data races are the bane of parallel programming. They cause heisenbugs that appear and disappear unpredictably. Building a race detector teaches you how parallel programs actually execute, what constitutes a race (formally), and how to instrument programs to track memory accesses—skills essential for debugging and optimizing parallel code.

Core challenges you’ll face:

  • Tracking happens-before relationships → maps to vector clocks, Lamport timestamps
  • Instrumenting memory accesses → maps to compiler instrumentation or binary instrumentation
  • Efficient shadow memory → maps to mapping application memory to metadata
  • Reducing false positives → maps to understanding synchronization semantics

Key Concepts:

  • Race Detection Theory: “The Art of Multiprocessor Programming” Chapter 8 - Herlihy & Shavit
  • Vector Clocks: Lamport’s paper “Time, Clocks, and the Ordering of Events”
  • Nondeterministic Parallel Programming: MIT 6.172 Lecture 16
  • Shadow Memory: How AddressSanitizer/ThreadSanitizer work (LLVM documentation)

Difficulty: Expert. Time estimate: 4-6 weeks. Prerequisites: Parallel programming (Project 6), understanding of synchronization primitives, familiarity with compiler or binary instrumentation.

Real world outcome:

$ ./race_detector ./buggy_parallel_program

Running with race detection enabled...

═══════════════════════════════════════════════════════════════
WARNING: DATA RACE DETECTED
═══════════════════════════════════════════════════════════════

Write of size 8 at 0x7f8a3c000040 by thread T2:
    #0 update_counter() at counter.c:15
    #1 worker_thread() at main.c:42
    #2 pthread_start()

Previous read of size 8 at 0x7f8a3c000040 by thread T1:
    #0 read_counter() at counter.c:21
    #1 worker_thread() at main.c:42
    #2 pthread_start()

Location is global variable 'shared_counter' declared at:
    counter.c:3

Threads involved:
    T1 (tid=1001): created at main.c:38
    T2 (tid=1002): created at main.c:38

═══════════════════════════════════════════════════════════════
SUMMARY: 1 race detected in 0.523 seconds
         Slowdown: 12x vs uninstrumented
═══════════════════════════════════════════════════════════════

Implementation Hints:

A data race occurs when:

  1. Two threads access the same memory location
  2. At least one access is a write
  3. The accesses are not ordered by synchronization (happens-before)

Vector clocks track happens-before:

Each thread maintains a vector V[T] where V[T][i] = logical time of thread i
On sync: merge vectors (element-wise max)
On access: check if another thread's access could be concurrent
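
A small sketch of the clock representation and the two operations above, assuming a fixed maximum thread count for simplicity (MAX_THREADS and the helper names are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 64

/* V.t[i] = the last logical time of thread i that this thread (or this
   recorded access) is known to happen-after. */
typedef struct { uint64_t t[MAX_THREADS]; } vclock;

/* On a release→acquire synchronization edge, the acquirer merges clocks. */
static void vc_join(vclock *dst, const vclock *src) {
    for (int i = 0; i < MAX_THREADS; i++)
        if (src->t[i] > dst->t[i]) dst->t[i] = src->t[i];   /* element-wise max */
}

/* A previous access by thread `tid` at time `at` happens-before the current
   thread (clock `cur`) iff cur already covers that time. Otherwise the two
   accesses are concurrent — a race if one of them is a write. */
static bool happens_before(const vclock *cur, int tid, uint64_t at) {
    return at <= cur->t[tid];
}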

Shadow memory maps each application byte to metadata:

For each memory location, track:
- Last writer (thread, vector clock)
- Last readers (may need multiple)
- Access size and type

Instrumentation approaches:

  1. Compiler instrumentation: Insert callbacks before every load/store
  2. Binary instrumentation: Use Pin, DynamoRIO, or Valgrind
  3. Hybrid: Compile with hooks, link with runtime library

Performance considerations:

  • Shadow memory lookup must be fast (direct mapping with offset)
  • Vector clock operations are expensive (use epochs to optimize)
  • Only check inter-thread accesses (per-thread fast path)

Questions to explore:

  • What’s the difference between benign races and bugs?
  • How do you handle atomic operations?
  • What about races through signals/interrupts?
  • How do production race detectors achieve reasonable slowdowns?

Learning milestones:

  1. Detect obvious races (no sync) → You understand race definition
  2. Handle pthread_mutex correctly → You understand happens-before
  3. Reasonable overhead (<50x slowdown) → You understand optimization
  4. No false positives on correct programs → You understand sync semantics

Project 8: Cache-Oblivious B-Tree

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Data Structures / Cache Optimization / Databases
  • Software or Tool: Building a high-performance index structure
  • Main Book: “Introduction to Algorithms” by CLRS

What you’ll build: A cache-oblivious B-tree (van Emde Boas layout) that achieves optimal cache behavior without knowing the cache size—benchmarked against standard B-trees and red-black trees across different cache hierarchies.

Why it teaches performance engineering: Modern programs are bottlenecked by memory, not CPU. Cache-oblivious algorithms achieve near-optimal cache performance without tuning for specific cache sizes. Building this data structure teaches you about memory hierarchy, cache analysis, and data layout—skills that apply to any data-intensive code.

Core challenges you’ll face:

  • van Emde Boas layout → maps to recursive layout for optimal cache behavior
  • Implicit pointers → maps to computing child positions without storing pointers
  • Cache complexity analysis → maps to understanding O(log_B N) cache behavior
  • Comparison with cache-aware B-tree → maps to understanding the tradeoffs

Key Concepts:

  • Cache-Oblivious Algorithms: MIT 6.172 Lecture 15
  • B-Trees: “Introduction to Algorithms” Chapter 18 - CLRS
  • van Emde Boas Layout: Prokop’s thesis, “Cache-Oblivious Algorithms”
  • Memory Hierarchy: “Computer Architecture” Chapter 2 - Hennessy & Patterson

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: B-trees, understanding of cache hierarchy, Project 2 (cache optimization experience).

Real world outcome:

$ ./btree_bench -n 10000000 -ops search,insert,range

B-Tree Benchmark: 10M keys, random order

┌─────────────────────────────────────────────────────────────┐
│ Data Structure        │ Search  │ Insert  │ Range(1K)      │
├─────────────────────────────────────────────────────────────┤
│ std::map (RB-tree)    │ 892 ns  │ 1,021 ns│ 45.2 μs        │
│ B-tree (B=64)         │ 423 ns  │ 512 ns  │ 12.3 μs        │
│ B-tree (B=256)        │ 389 ns  │ 498 ns  │ 11.8 μs        │
│ B-tree (B=1024)       │ 412 ns  │ 534 ns  │ 10.9 μs        │
│ Cache-oblivious B-tree│ 356 ns  │ 478 ns  │ 9.2 μs         │
└─────────────────────────────────────────────────────────────┘

Cache behavior (L3 misses per operation):
┌─────────────────────────────────────────────────────────────┐
│ Data Structure        │ Search  │ Insert  │ Notes          │
├─────────────────────────────────────────────────────────────┤
│ std::map (RB-tree)    │ 12.3    │ 15.1    │ pointer-chasing│
│ B-tree (B=64)         │ 4.2     │ 5.8     │ optimal for L1 │
│ B-tree (B=256)        │ 3.1     │ 4.2     │ optimal for L2 │
│ B-tree (B=1024)       │ 3.8     │ 5.1     │ too big for L2 │
│ Cache-oblivious       │ 2.8     │ 3.9     │ self-adapting  │
└─────────────────────────────────────────────────────────────┘

The cache-oblivious B-tree achieves near-optimal cache behavior
across ALL cache levels without tuning!

Implementation Hints:

The key insight: Layout matters more than algorithms for cache performance.

Standard B-tree with pointer-based nodes:

Each access chases a pointer → likely cache miss
log_2(N) accesses → log_2(N) cache misses

Cache-aware B-tree with large nodes (B keys per node):

Choose B to fill a cache line
log_B(N) node accesses → log_B(N) cache misses
But: Must tune B for specific cache!

Cache-oblivious layout (van Emde Boas):

Recursively divide tree:
- Top half of tree stored first
- Then bottom subtrees, each in recursive layout

Why it works:
- At some recursion level, subtree fits in cache
- All accesses within that subtree are cache hits
- Achieves O(log_B N) cache complexity for ANY B!

The layout construction:

1. Conceptually, build a complete binary search tree
2. Apply van Emde Boas layout recursively
3. For position calculations, derive formulas for:
   - Parent index from child index
   - Child index from parent index
   - These can be computed, no pointers stored!
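
A sketch of step 2 for a complete tree stored with 1-based BFS indices (root = 1, children 2i and 2i+1): it emits node indices in van Emde Boas order by splitting the height roughly in half, laying out the top subtree, then each bottom subtree recursively. Calling veb_layout(1, h, out, 0) fills out[] with the storage order for a tree of h levels:

#include <stddef.h>

/* h = number of levels (h == 1 means a single node).
   Returns the next free position in out[]. */
static size_t veb_layout(size_t root, int h, size_t *out, size_t pos) {
    if (h == 1) { out[pos++] = root; return pos; }
    int top_h = h / 2, bot_h = h - top_h;
    /* 1) top subtree of top_h levels, rooted at `root`, in vEB order */
    pos = veb_layout(root, top_h, out, pos);
    /* 2) then each bottom subtree, left to right; their roots sit top_h
          levels below `root`, i.e. BFS indices root*2^top_h ... +2^top_h-1 */
    size_t num_bottom = (size_t)1 << top_h;
    size_t first = root << top_h;
    for (size_t i = 0; i < num_bottom; i++)
        pos = veb_layout(first + i, bot_h, out, pos);
    return pos;
}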

Questions to explore:

  • Why is pointer-chasing so expensive?
  • What’s the relationship between tree depth and cache misses?
  • How do you handle updates (splits) in the packed layout?
  • What’s the practical difference vs. cache-aware tuning?

Learning milestones:

  1. van Emde Boas layout works for static tree → You understand the layout
  2. Search achieves O(log_B N) cache complexity → You understand cache analysis
  3. Updates work correctly → You understand the algorithmic challenges
  4. Outperforms tuned B-tree on varied cache sizes → You’ve proven cache-obliviousness

Project 9: Lock-Free Queue

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Concurrency / Lock-Free Programming / Memory Ordering
  • Software or Tool: Building concurrent data structures
  • Main Book: “The Art of Multiprocessor Programming” by Herlihy & Shavit

What you’ll build: A lock-free MPMC (multi-producer, multi-consumer) queue using compare-and-swap operations, with careful attention to memory ordering and the ABA problem—benchmarked against mutex-based queues under contention.

Why it teaches performance engineering: Lock-free data structures can dramatically outperform lock-based ones under contention, but they’re notoriously difficult to get right. This project teaches you about atomic operations, memory ordering, linearizability, and the subtle bugs that arise in concurrent code—essential knowledge for high-performance systems.

Core challenges you’ll face:

  • Compare-and-swap (CAS) loops → maps to atomic operations and retry logic
  • ABA problem → maps to why simple CAS isn’t enough
  • Memory ordering → maps to acquire/release semantics, memory barriers
  • Linearizability correctness → maps to formal reasoning about concurrent operations

Key Concepts:

  • Lock-Free Data Structures: MIT 6.172 Lecture 17 - Synchronization Without Locks
  • Memory Models: “C++ Concurrency in Action” Chapter 5 - Anthony Williams
  • Michael-Scott Queue: The classic lock-free queue algorithm
  • ABA Problem: “The Art of Multiprocessor Programming” Chapter 10 - Herlihy & Shavit

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Understanding of atomic operations, memory models, parallel programming experience (Project 6).

Real world outcome:

$ ./queue_bench -producers 8 -consumers 8 -ops 10000000

Queue Benchmark: 8 producers, 8 consumers, 10M total ops

┌─────────────────────────────────────────────────────────────┐
│ Implementation        │ Throughput │ Latency (p99)│ Fairness │
├─────────────────────────────────────────────────────────────┤
│ Mutex-based queue     │ 1.2M ops/s │ 45 μs        │ Good     │
│ Spinlock queue        │ 2.1M ops/s │ 380 μs       │ Poor     │
│ Lock-free (simple CAS)│ BROKEN     │ -            │ -        │
│ Lock-free (w/ counter)│ 8.2M ops/s │ 8.2 μs       │ Good     │
│ Lock-free (hazard ptr)│ 7.8M ops/s │ 7.5 μs       │ Good     │
└─────────────────────────────────────────────────────────────┘

Scalability (lock-free w/ counter):
┌─────────────────────────────────────────────────────────────┐
│ Threads (P+C) │ Throughput │ Scaling Efficiency             │
├─────────────────────────────────────────────────────────────┤
│ 2  (1+1)      │ 5.1M ops/s │ 100% (baseline)                │
│ 4  (2+2)      │ 9.2M ops/s │ 90%                            │
│ 8  (4+4)      │ 14.8M ops/s│ 73%                            │
│ 16 (8+8)      │ 18.2M ops/s│ 45% (contention limited)       │
└─────────────────────────────────────────────────────────────┘

Correctness tests:
  ✓ No lost items
  ✓ No duplicate items
  ✓ FIFO order preserved (per-producer)
  ✓ Linearizable operations
  ✓ ABA problem prevented

Implementation Hints:

The Michael-Scott queue structure:

Queue:
  head → Node → Node → Node → ... → tail

Node:
  value: T
  next: atomic<Node*>

Basic enqueue (BROKEN - has ABA problem):

Allocate new_node with value
loop:
    tail = queue.tail
    next = tail.next
    if tail == queue.tail:  # still valid?
        if next == null:
            if CAS(&tail.next, null, new_node):  # try to link
                CAS(&queue.tail, tail, new_node) # try to advance tail
                return
        else:
            CAS(&queue.tail, tail, next)  # help advance tail

The ABA problem:

Thread 1: reads tail = A
Thread 1: preempted
Thread 2: dequeues A, frees it
Thread 2: allocates new node, happens to get same address A!
Thread 2: enqueues new A
Thread 1: resumes, CAS succeeds (still A), but wrong A!

Solutions:

  1. Counter/stamp with pointer: Pack version counter into pointer bits
  2. Hazard pointers: Protect nodes from being freed while accessed
  3. Epoch-based reclamation: Defer frees until no threads in critical section

Memory ordering is crucial:

  • memory_order_acquire on loads that guard following reads
  • memory_order_release on stores that guard preceding writes
  • memory_order_acq_rel on CAS operations
  • Getting this wrong causes subtle bugs!
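
Before tackling the MPMC queue, it can help to see acquire/release pairing in a smaller setting. This sketch is a single-producer/single-consumer ring buffer, not the Michael-Scott queue: the release store publishes the slot write, and the matching acquire load on the other side is what makes that write visible:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SIZE 1024                       /* power of two, so & (SIZE-1) wraps */
static int buf[SIZE];
static _Atomic size_t head, tail;       /* consumer bumps head, producer bumps tail */

bool enqueue(int v) {                   /* called by the single producer */
    size_t t = atomic_load_explicit(&tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&head, memory_order_acquire); /* slot reusable? */
    if (t - h == SIZE) return false;                              /* full */
    buf[t & (SIZE - 1)] = v;                                      /* write the slot... */
    atomic_store_explicit(&tail, t + 1, memory_order_release);    /* ...then publish it */
    return true;
}

bool dequeue(int *out) {                /* called by the single consumer */
    size_t h = atomic_load_explicit(&head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&tail, memory_order_acquire); /* see producer's write */
    if (t == h) return false;                                     /* empty */
    *out = buf[h & (SIZE - 1)];
    atomic_store_explicit(&head, h + 1, memory_order_release);    /* allow slot reuse */
    return true;
}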

Questions to explore:

  • Why is the spinlock version unfair under contention?
  • What happens if you use memory_order_relaxed everywhere?
  • How do you test for ABA bugs? (They’re rare and timing-dependent)
  • What’s the relationship between lock-free and wait-free?

Learning milestones:

  1. Basic lock-free queue works (ignoring ABA) → You understand CAS patterns
  2. ABA solution implemented and tested → You understand subtle concurrency bugs
  3. Correct memory ordering → You understand memory models
  4. Outperforms mutex under contention → You understand when lock-free helps

Project 10: SIMD JSON Parser

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: SIMD / Parsing / Data Processing
  • Software or Tool: Building something similar to simdjson
  • Main Book: “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron

What you’ll build: A JSON parser that uses SIMD instructions (AVX2/AVX-512) to parse multiple bytes simultaneously, achieving gigabytes-per-second parsing speeds—inspired by the simdjson project.

Why it teaches performance engineering: Parsing is traditionally done byte-by-byte, but SIMD allows processing 32-64 bytes at once. This project teaches you to think about data parallelism at the instruction level, design SIMD-friendly algorithms, and achieve performance that seems impossible. simdjson parses JSON at over 3 GB/s—faster than memcpy on some systems!

Core challenges you’ll face:

  • SIMD-based structural scanning → maps to finding quotes, braces, colons with vector ops
  • Handling escapes and strings → maps to complex state transitions in parallel
  • Branch-free parsing → maps to avoiding branch mispredictions
  • Validating UTF-8 with SIMD → maps to parallel state machines

Key Concepts:

  • SIMD Programming: Intel Intrinsics Guide (online)
  • simdjson Paper: “Parsing Gigabytes of JSON per Second” by Langdale & Lemire
  • SIMD String Processing: “Modern C” Chapter on SIMD - Jens Gustedt
  • Bit Manipulation for Parsing: MIT 6.172 Lecture 3 - Bit Hacks

Difficulty: Expert. Time estimate: 4-6 weeks. Prerequisites: SIMD experience (Project 2), bit manipulation (Project 3), understanding of parsing.

Real world outcome:

$ ./simd_json_bench twitter.json

JSON Parsing Benchmark
File: twitter.json (631 KB, Twitter API response)

┌─────────────────────────────────────────────────────────────┐
│ Parser              │ Speed    │ Cycles/byte │ Time         │
├─────────────────────────────────────────────────────────────┤
│ Naive (byte-by-byte)│ 120 MB/s │ 25.0        │ 5.26 ms      │
│ RapidJSON           │ 380 MB/s │ 7.9         │ 1.66 ms      │
│ cJSON               │ 220 MB/s │ 13.6        │ 2.87 ms      │
│ simdjson            │ 3.2 GB/s │ 0.94        │ 0.20 ms      │
│ our_simd_json       │ 2.1 GB/s │ 1.43        │ 0.30 ms      │
└─────────────────────────────────────────────────────────────┘

Stage breakdown (our_simd_json):
  Stage 1 (structural scan):    0.08 ms  (7.9 GB/s)
  Stage 2 (tape construction):  0.15 ms  (4.2 GB/s)
  Stage 3 (DOM building):       0.07 ms  (9.0 GB/s)

Validation tests:
  ✓ All JSONTestSuite pass cases pass
  ✓ All JSONTestSuite fail cases fail
  ✓ Handles 4GB+ files (streaming mode)
  ✓ Correct UTF-8 validation

Implementation Hints:

Key insight: JSON structure can be found in parallel!

Stage 1: Find structural characters with SIMD

Load 32 bytes into YMM register
Compare against '{'  → bitmask of matches
Compare against '}'  → bitmask of matches
Compare against '['  → bitmask of matches
... for all structural chars

But wait! Chars inside strings don't count!
Need to track "inside string" state across chunks
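
The per-character comparison step can look like this with AVX2 (compile with -mavx2); match_byte is an illustrative helper that returns one bit per byte of a 32-byte chunk:

#include <immintrin.h>
#include <stdint.h>

/* Bitmask of positions in `chunk` (32 bytes) that equal character c. */
static inline uint32_t match_byte(const char *chunk, char c) {
    __m256i data = _mm256_loadu_si256((const __m256i *)chunk);
    __m256i needle = _mm256_set1_epi8(c);            /* broadcast the byte */
    __m256i eq = _mm256_cmpeq_epi8(data, needle);    /* 0xFF where equal */
    return (uint32_t)_mm256_movemask_epi8(eq);       /* one bit per byte */
}

/* Usage sketch: quotes = match_byte(p, '"'); braces = match_byte(p, '{'); ... */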

The quote state problem:

"hello, world"  → inside string between quotes
"hello \" world" → escaped quote doesn't toggle
"hello \\ " → escaped backslash, quote does toggle

Solution: Use carryless multiplication (PCLMULQDQ)
or clever XOR prefix sum to track quote state
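
A sketch of the carryless-multiply trick (requires PCLMULQDQ, e.g. -mpclmul): multiplying the quote bitmask by an all-ones word computes a running XOR, so bit i of the result is set exactly when an odd number of quotes precede position i — i.e., "inside a string". Escaped quotes must be removed from the mask beforehand:

#include <immintrin.h>
#include <stdint.h>

/* Prefix XOR over the 64 bits of quote_mask via carryless multiplication. */
static inline uint64_t prefix_xor(uint64_t quote_mask) {
    __m128i all_ones = _mm_set1_epi8((char)0xFF);
    __m128i product = _mm_clmulepi64_si128(_mm_set_epi64x(0, quote_mask), all_ones, 0);
    return (uint64_t)_mm_cvtsi128_si64(product);     /* low 64 bits = in-string mask */
}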

Structural characters to find:

  • { } [ ] : , - structure
  • " - string delimiters
  • \ - escape (affects following char)
  • Whitespace (to skip)

For each 64-byte chunk:

1. Load into two YMM registers (or one ZMM)
2. Generate bitmask for each character class
3. XOR quote bits to get string regions
4. Mask out structural chars inside strings
5. Append valid structural positions to tape

Branch-free number parsing:

  • Parse all digits with SIMD, multiply-accumulate
  • Handle sign, decimal point, exponent in parallel
  • Only branch on error cases

Questions to explore:

  • Why is simdjson faster than memcpy for some workloads?
  • How do you handle strings longer than SIMD width?
  • What’s the cache behavior of the two-stage approach?
  • How would you parallelize across cores?

Learning milestones:

  1. SIMD character classification works → You understand vector comparisons
  2. Quote/escape handling correct → You understand SIMD state machines
  3. 1 GB/s+ parsing speed → You understand SIMD optimization
  4. Full JSON compliance → You’ve handled all edge cases

Project 11: Profile-Guided Optimizer

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C/C++
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Compilers / Profiling / Optimization
  • Software or Tool: LLVM, gcc, profile data
  • Main Book: “Engineering a Compiler” by Cooper & Torczon

What you’ll build: A tool that uses runtime profiles to guide code optimizations—identifying hot paths, reordering basic blocks to reduce branch mispredictions, and inlining decisions based on actual call frequencies.

Why it teaches performance engineering: Compilers optimize based on static analysis, but real performance depends on runtime behavior. Profile-guided optimization (PGO) bridges this gap. This project teaches you about control flow graphs, profile data collection, and how to make informed optimization decisions—understanding what modern compilers do with -fprofile-use.

Core challenges you’ll face:

  • Instrumenting code for profiling → maps to basic block counters, edge profiling
  • Analyzing profile data → maps to hot paths, call graphs, branch frequencies
  • Basic block reordering → maps to reducing branch mispredictions
  • Inlining heuristics → maps to code size vs. call overhead

Key Concepts:

  • Compiler Optimizations: MIT 6.172 Lecture 9 - What Compilers Can and Cannot Do
  • Profile-Guided Optimization: “Engineering a Compiler” Chapter 8 - Cooper & Torczon
  • Basic Block Layout: “Profile Guided Code Positioning” (Pettis & Hansen paper)
  • LLVM PGO: LLVM documentation on profile-guided optimization

Difficulty: Expert. Time estimate: 4-6 weeks. Prerequisites: Understanding of compiler structure, control flow graphs, assembly (Project 4).

Real world outcome:

$ ./pgo_tool analyze profile_data.txt program.cfg

Profile-Guided Analysis Report
==============================

Hot Functions (>5% of execution time):
┌─────────────────────────────────────────────────────────────┐
│ Function            │ Samples │ %Total │ Avg calls/invoc   │
├─────────────────────────────────────────────────────────────┤
│ process_request     │ 2,341K  │ 34.2%  │ 1 (entry point)   │
│ parse_json          │ 1,892K  │ 27.6%  │ 3.2               │
│ hash_lookup         │ 1,234K  │ 18.0%  │ 48.7              │
│ string_compare      │ 823K    │ 12.0%  │ 156.3             │
│ allocate_buffer     │ 412K    │ 6.0%   │ 8.2               │
└─────────────────────────────────────────────────────────────┘

Branch Analysis:
┌─────────────────────────────────────────────────────────────┐
│ Location            │ Taken % │ Current │ Optimal Layout   │
├─────────────────────────────────────────────────────────────┤
│ parse_json:42       │ 3%      │ taken   │ fall-through ↓   │
│ hash_lookup:78      │ 98%     │ fallthru│ (optimal)        │
│ process:121         │ 15%     │ taken   │ fall-through ↓   │
└─────────────────────────────────────────────────────────────┘

Inlining Recommendations:
  ✓ Inline string_compare into hash_lookup (hot caller, small callee)
  ✓ Inline hash_lookup into parse_json (very hot path)
  ✗ Don't inline allocate_buffer (cold path, large code)

$ ./pgo_tool optimize program.o -o program_optimized.o

Optimizations applied:
  - Reordered 23 basic blocks for fall-through on hot paths
  - Inlined 3 functions on hot paths
  - Moved cold code to .text.cold section

Estimated improvement: 12-18% (branch mispredictions reduced by 40%)

$ ./benchmark program_optimized vs program_baseline
Speedup: 15.3%

Implementation Hints:

Profile data collection (instrumentation approach):

For each basic block, insert:
  __profile_counter[block_id]++;

For each function entry:
  __profile_call_count[func_id]++;

For each branch:
  __profile_branch[branch_id][taken]++;
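
A purely illustrative sketch of the runtime side of that instrumentation in C; the counter array matches the pseudocode above, while the dump format and block IDs are assumptions:

#include <stdio.h>
#include <stdlib.h>

#define MAX_BLOCKS 4096
static unsigned long long __profile_counter[MAX_BLOCKS];

static void __prof_dump(void) {                 /* write counts at program exit */
    FILE *f = fopen("profile_data.txt", "w");
    if (!f) return;
    for (int i = 0; i < MAX_BLOCKS; i++)
        if (__profile_counter[i])
            fprintf(f, "block %d %llu\n", i, __profile_counter[i]);
    fclose(f);
}

static inline void __prof_block(int id) { __profile_counter[id]++; }

/* What instrumented user code looks like after the pass has run: */
int is_even(int x) {
    __prof_block(0);                            /* entry block */
    if (x % 2 == 0) { __prof_block(1); return 1; }
    __prof_block(2);                            /* odd-branch block */
    return 0;
}

int main(void) {
    atexit(__prof_dump);                        /* flush counters on exit */
    for (int i = 0; i < 1000; i++) is_even(i);
    return 0;
}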

Control flow graph analysis:

Build CFG from assembly or intermediate representation
Label each edge with profile count
Identify:
  - Hot loops (high back-edge counts)
  - Hot paths (trace through most-taken branches)
  - Cold code (rarely executed)

Basic block reordering algorithm:

Goal: Maximize fall-through on hot paths
Approach:
  1. Find hottest edge from current block
  2. Place target block next (if not already placed)
  3. Repeat until all blocks placed
  4. Handle cold blocks at end
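
A small C sketch of that greedy chaining, assuming the profiled edge counts for one function fit in a dense matrix; all names and sizes are illustrative:

#define MAX_BB 64                 /* assumes small per-function CFGs */

/* edge_count[i][j] = profiled count of the CFG edge i -> j.
   order_out receives a permutation of the blocks 0..n-1. */
void layout_blocks(const long edge_count[MAX_BB][MAX_BB],
                   int n, int entry, int *order_out) {
    int placed[MAX_BB] = {0};
    int pos = 0, cur = entry;

    while (cur != -1) {                           /* follow the hot chain */
        order_out[pos++] = cur;
        placed[cur] = 1;
        int best = -1;
        long best_count = 0;
        for (int j = 0; j < n; j++)               /* hottest unplaced successor */
            if (!placed[j] && edge_count[cur][j] > best_count) {
                best = j;
                best_count = edge_count[cur][j];
            }
        cur = best;                               /* -1 (no hot successor) ends the chain */
    }
    for (int j = 0; j < n; j++)                   /* cold / unreached blocks go last */
        if (!placed[j]) order_out[pos++] = j;
}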

Inlining heuristics to consider:

  • Caller hotness × callee hotness
  • Callee code size (inlining bloats code, hurts I-cache)
  • Call sites per callee (inline if few call sites)
  • Depth in hot path (inline inner loops first)

Questions to explore:

  • How does PGO interact with branch prediction hardware?
  • What’s the benefit of .text.cold sections?
  • How do you handle profile data from multiple runs?
  • What optimizations can ONLY be done with profile data?

Learning milestones:

  1. Profile data collected accurately → You understand instrumentation
  2. Hot path identified correctly → You understand profile analysis
  3. Block reordering improves performance → You understand branch prediction
  4. Inlining decisions match production compilers → You understand optimizer tradeoffs

Project 12: Parallel Merge Sort with Work-Stealing

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Parallel Algorithms / Sorting / Work-Stealing
  • Software or Tool: OpenCilk or custom work-stealing scheduler
  • Main Book: “Introduction to Algorithms” by CLRS (Chapter 27)

What you’ll build: A parallel merge sort that achieves near-linear speedup through work-stealing, including a parallel merge (not a sequential one!) that does O(n) work with O(log² n) span, keeping the full sort at O(n log n) work with only O(log³ n) span.

Why it teaches performance engineering: Sorting is fundamental, and merge sort’s divide-and-conquer structure is naturally parallel. But the merge step is traditionally sequential, limiting parallelism. The parallel merge algorithm demonstrates how to find parallelism in seemingly sequential operations. This project ties together work-span analysis, cache efficiency, and practical parallelization.

Core challenges you’ll face:

  • Parallel divide phase → maps to trivial parallelism with spawn
  • Parallel merge with binary search → maps to finding parallelism in merge
  • Cache efficiency in parallel context → maps to maintaining locality across cores
  • Work-span analysis → maps to proving O(log² n) span

Key Concepts:

  • Parallel Merge Sort: MIT 6.172 Lecture 8 - Analysis of Multithreaded Algorithms
  • Parallel Merge Algorithm: “Introduction to Algorithms” Chapter 27.3 - CLRS
  • Work-Stealing Schedulers: MIT 6.172 Lecture 13 - The Cilk Runtime System
  • Cache-Efficient Sorting: “Algorithms for Memory Hierarchies” (Vitter survey)

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Sequential merge sort, work-span analysis concepts, Project 6 (parallel programming)

Real world outcome:

$ ./parallel_sort_bench -n 100000000 -cores 1,2,4,8,16

Parallel Merge Sort Benchmark
Array size: 100M integers (400 MB)
Comparison: sequential vs parallel merge

┌────────────────────────────────────────────────────────────────────┐
│ Implementation              │ Work      │ Span      │ Parallelism  │
├────────────────────────────────────────────────────────────────────┤
│ Sequential merge sort       │ O(n lg n) │ O(n lg n) │ 1            │
│ Parallel sort, seq merge    │ O(n lg n) │ O(n)      │ lg n         │
│ Parallel sort, par merge    │ O(n lg n) │ O(lg³ n)  │ n / lg² n    │
└────────────────────────────────────────────────────────────────────┘

Actual Performance:
┌─────────────────────────────────────────────────────────────┐
│ Cores │ Par-sort,seq-merge │ Par-sort,par-merge │ Speedup  │
├─────────────────────────────────────────────────────────────┤
│ 1     │ 12.4s              │ 12.8s              │ 1.0x     │
│ 2     │ 7.2s               │ 6.8s               │ 1.88x    │
│ 4     │ 4.8s               │ 3.6s               │ 3.56x    │
│ 8     │ 4.1s               │ 1.9s               │ 6.74x    │
│ 16    │ 4.0s (bottleneck!) │ 1.1s               │ 11.6x    │
└─────────────────────────────────────────────────────────────┘

Note: Sequential merge becomes the bottleneck at 8+ cores!
Parallel merge achieves near-linear scaling.

Cache behavior:
  Cache misses (16 cores): 180M (par-merge) vs 450M (seq-merge)
  Parallel merge maintains better locality!

Implementation Hints:

Parallel merge algorithm:

merge(A[1..n], B[1..m], C[1..n+m]):
    # Base case
    if n + m <= BASE_SIZE:
        sequential_merge(A, B, C)
        return

    # Ensure A is the longer array
    if n < m:
        swap A, B  (and swap n, m accordingly)

    # Pick middle of A
    mid_a = n/2
    # Binary search to find corresponding position in B
    mid_b = binary_search(B, A[mid_a])

    # A[mid_a] goes to position mid_a + mid_b in C
    C[mid_a + mid_b] = A[mid_a]

    # Recursively merge halves in parallel
    spawn merge(A[1..mid_a-1], B[1..mid_b], C[1..mid_a+mid_b-1])
    spawn merge(A[mid_a+1..n], B[mid_b+1..m], C[mid_a+mid_b+1..n+m])
    sync
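
The same routine as compilable C with OpenCilk keywords (build with clang -fopencilk); indices are 0-based here, the element type is fixed to int, and BASE_SIZE is the granularity cutoff discussed below:

#include <cilk/cilk.h>

#define BASE_SIZE 10000                 /* granularity cutoff, tune per machine */

/* number of elements in b[0..m) that are <= key */
static long split_point(const int *b, long m, int key) {
    long lo = 0, hi = m;
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        if (b[mid] <= key) lo = mid + 1; else hi = mid;
    }
    return lo;
}

static void seq_merge(const int *a, long n, const int *b, long m, int *c) {
    long i = 0, j = 0, k = 0;
    while (i < n && j < m) c[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < n) c[k++] = a[i++];
    while (j < m) c[k++] = b[j++];
}

void par_merge(const int *a, long n, const int *b, long m, int *c) {
    if (n < m) {                                   /* keep a as the longer input */
        const int *ta = a; a = b; b = ta;
        long tn = n; n = m; m = tn;
    }
    if (n + m <= BASE_SIZE) { seq_merge(a, n, b, m, c); return; }

    long ma = n / 2;                               /* midpoint of the longer array */
    long mb = split_point(b, m, a[ma]);            /* matching split point in b    */
    c[ma + mb] = a[ma];                            /* a[ma]'s final position in c  */

    cilk_spawn par_merge(a, ma, b, mb, c);                               /* left halves  */
    par_merge(a + ma + 1, n - ma - 1, b + mb, m - mb, c + ma + mb + 1);  /* right halves */
    cilk_sync;
}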

Work-span analysis:

Work: Still O(n) for the merge (each element is placed exactly once)
Span: T(n) ≤ T(3n/4) + O(log n)  [larger recursive half has at most 3n/4 elements;
                                  binary search + spawn cost O(log n)]
     = O(log² n)

Parallelism = O(n / log² n) — very high!

Why parallel merge improves cache behavior:

  • Each recursive call works on smaller subproblems
  • Different cores work on different cache-friendly chunks
  • vs. sequential merge which streams through all of A and B

Granularity tuning:

  • Too fine-grained: spawn overhead dominates
  • Too coarse: insufficient parallelism for all cores
  • Typical: BASE_SIZE ~ 10,000 elements

Questions to explore:

  • Why does sequential merge become a bottleneck?
  • What’s the memory bandwidth requirement for sorting?
  • How does this compare to parallel quicksort?
  • What about sample sort for distributed memory?

Learning milestones:

  1. Parallel divide works correctly → You understand basic fork-join
  2. Parallel merge produces correct output → You understand the algorithm
  3. Scaling improves with parallel merge → You understand why it helps
  4. You can derive the span analysis → You understand parallel complexity

Project 13: Hardware Performance Counter Dashboard

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust, Python (for UI)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Profiling / Performance Monitoring / Visualization
  • Software or Tool: perf, PAPI, Intel VTune concepts
  • Main Book: “Systems Performance” by Brendan Gregg

What you’ll build: A real-time dashboard that reads hardware performance counters (instructions, cycles, cache misses, branch mispredictions, etc.) and visualizes bottlenecks—showing whether code is CPU-bound, memory-bound, or suffering from specific microarchitectural issues.

Why it teaches performance engineering: Hardware performance counters reveal what’s really happening inside the CPU. This project teaches you to interpret counter data, understand microarchitectural bottlenecks, and diagnose performance issues systematically. It’s like having X-ray vision for your code.

Core challenges you’ll face:

  • Reading performance counters → maps to perf_event_open, counter multiplexing
  • Interpreting counter combinations → maps to derived metrics (CPI, cache hit rate)
  • Real-time sampling → maps to low-overhead continuous monitoring
  • Bottleneck classification → maps to Top-Down Microarchitecture Analysis

Key Concepts:

  • Performance Measurement: MIT 6.172 Lecture 10 - Measurement and Timing
  • Top-Down Analysis: Intel’s Top-Down Microarchitecture Analysis Method
  • Hardware Counters: “Systems Performance” Chapter 6 - Brendan Gregg
  • Roofline Model: “Roofline: An Insightful Visual Performance Model” (Williams et al.)

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Project 1 (basic measurement), understanding of CPU architecture

Real world outcome:

$ ./perf_dashboard -p $(pidof my_app) --refresh 500ms

╔═══════════════════════════════════════════════════════════════════════╗
║             PERFORMANCE DASHBOARD - my_app (PID: 12345)               ║
╠═══════════════════════════════════════════════════════════════════════╣
║  CPU Utilization: ████████████████████░░░░ 82%    Cores: 4/8 active   ║
╠═══════════════════════════════════════════════════════════════════════╣
║                           CORE METRICS                                ║
╠══════════════════╦════════════════╦═══════════════╦═══════════════════╣
║  Instructions    ║  2,341 M/sec   ║  Cycles       ║  3,120 M/sec      ║
║  IPC             ║  0.75 ⚠️       ║  Frequency    ║  3.2 GHz          ║
╠══════════════════╩════════════════╩═══════════════╩═══════════════════╣
║                         MEMORY HIERARCHY                              ║
╠═══════════════════════════════════════════════════════════════════════╣
║  L1D Hit Rate    ║ ████████████████████ 99.2%    ║  OK ✓             ║
║  L2 Hit Rate     ║ ████████████████░░░░ 78.4%    ║  OK ✓             ║
║  L3 Hit Rate     ║ ████████░░░░░░░░░░░░ 42.1%    ║  WARN ⚠️          ║
║  Memory BW       ║ ██████████████░░░░░░ 24.2 GB/s (of 50 GB/s peak)  ║
╠═══════════════════════════════════════════════════════════════════════╣
║                        PIPELINE EFFICIENCY                            ║
╠═══════════════════════════════════════════════════════════════════════╣
║  Branch Mispred  ║ 2.3% ████████████████████ OK ✓                    ║
║  Frontend Bound  ║ 8.2% ████░░░░░░░░░░░░░░░░ OK ✓                    ║
║  Backend Bound   ║ 61.4%████████████████████ HIGH ❌ (Memory)         ║
║  Retiring        ║ 24.1%██████████░░░░░░░░░░                         ║
╠═══════════════════════════════════════════════════════════════════════╣
║                          BOTTLENECK ANALYSIS                          ║
╠═══════════════════════════════════════════════════════════════════════╣
║  Primary:   MEMORY BOUND (L3 misses → DRAM latency)                   ║
║  Secondary: Low ILP (long dependency chains detected)                 ║
║                                                                       ║
║  Recommendations:                                                     ║
║  • Consider cache blocking to improve L3 hit rate                     ║
║  • Prefetch hints may help with predictable access patterns           ║
║  • Check for false sharing if multi-threaded                          ║
╚═══════════════════════════════════════════════════════════════════════╝

Implementation Hints:

Key counters to collect:

Basic:
  - instructions, cycles → IPC
  - cache-references, cache-misses → miss rate
  - branches, branch-misses → misprediction rate

Memory hierarchy:
  - L1-dcache-loads, L1-dcache-load-misses
  - l2_rqsts.references, l2_rqsts.miss
  - LLC-loads, LLC-load-misses

Top-down analysis (Intel specific):
  - UOPS_RETIRED.RETIRE_SLOTS
  - UOPS_ISSUED.ANY
  - IDQ_UOPS_NOT_DELIVERED.CORE
  - Various backend stall counters
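
On Linux these counters are read through perf_event_open(2). A minimal sketch that counts cycles and instructions for a target PID for one second and prints IPC; it needs suitable perf_event_paranoid settings or CAP_PERFMON, and error handling is trimmed:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static int open_counter(pid_t pid, uint64_t config, int group_fd) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = (group_fd == -1);          /* only the group leader starts disabled */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, pid, -1, group_fd, 0);
}

int main(int argc, char **argv) {
    pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : 0;       /* 0 = this process */
    int cyc = open_counter(pid, PERF_COUNT_HW_CPU_CYCLES, -1);
    int ins = open_counter(pid, PERF_COUNT_HW_INSTRUCTIONS, cyc);
    if (cyc < 0 || ins < 0) { perror("perf_event_open"); return 1; }

    ioctl(cyc, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(cyc, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
    sleep(1);                                                /* measurement window */
    ioctl(cyc, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    uint64_t cycles = 0, instructions = 0;
    if (read(cyc, &cycles, sizeof(cycles)) != sizeof(cycles) ||
        read(ins, &instructions, sizeof(instructions)) != sizeof(instructions))
        return 1;
    printf("cycles=%llu instructions=%llu IPC=%.2f\n",
           (unsigned long long)cycles, (unsigned long long)instructions,
           cycles ? (double)instructions / (double)cycles : 0.0);
    return 0;
}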

Top-Down Microarchitecture Analysis:

All cycles are classified into:
1. Retiring (useful work)
2. Bad Speculation (wrong-path work)
3. Frontend Bound (instruction supply issues)
4. Backend Bound (execution issues)
   - Memory Bound
   - Core Bound

This gives a systematic way to identify bottlenecks.
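
The level-1 arithmetic, following the formulas published with Intel's method (Yasin's paper); the slot count assumes a 4-wide pipeline, and INT_MISC.RECOVERY_CYCLES is the extra event needed for the bad-speculation term:

/* Level-1 Top-Down classification from raw counter values.
   Comments give the Intel event names; adapt to your PMU. */
typedef struct {
    double retired_slots;    /* UOPS_RETIRED.RETIRE_SLOTS     */
    double issued_uops;      /* UOPS_ISSUED.ANY               */
    double fe_undelivered;   /* IDQ_UOPS_NOT_DELIVERED.CORE   */
    double recovery_cycles;  /* INT_MISC.RECOVERY_CYCLES      */
    double cycles;           /* CPU_CLK_UNHALTED.THREAD       */
} tma_counters;

typedef struct { double retiring, bad_spec, frontend, backend; } tma_level1;

tma_level1 tma_classify(const tma_counters *c) {
    double slots = 4.0 * c->cycles;               /* 4 issue slots per cycle */
    tma_level1 r;
    r.retiring = c->retired_slots / slots;
    r.frontend = c->fe_undelivered / slots;
    r.bad_spec = (c->issued_uops - c->retired_slots
                  + 4.0 * c->recovery_cycles) / slots;
    r.backend  = 1.0 - (r.retiring + r.frontend + r.bad_spec);
    return r;
}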

Counter multiplexing:

  • PMUs have limited counters (typically 4-8 per core)
  • OS multiplexes if you request more
  • Use groups to ensure counters are read together
  • Handle scaling for multiplexed counters

Real-time considerations:

  • Reading counters has overhead (~hundreds of nanoseconds)
  • Sample at reasonable intervals (100ms-1s)
  • Consider per-CPU vs. per-thread counting

Questions to explore:

  • Why is IPC alone not sufficient to identify bottlenecks?
  • How do you distinguish L3 misses from DRAM bandwidth saturation?
  • What counters indicate false sharing?
  • How does hyperthreading affect counter interpretation?

Learning milestones:

  1. Basic counters display correctly → You understand the PMU interface
  2. Derived metrics computed properly → You understand counter relationships
  3. Top-down analysis matches Intel VTune → You understand the methodology
  4. Recommendations are actionable → You understand optimization strategies

Project 14: Graph Algorithm Optimizer

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Graph Algorithms / Cache Optimization / Parallelization
  • Software or Tool: Building optimized BFS/SSSP/PageRank
  • Main Book: “Introduction to Algorithms” by CLRS

What you’ll build: An optimized graph processing library implementing BFS, SSSP (Dijkstra/Bellman-Ford), and PageRank with cache-efficient data layouts, direction-optimizing traversal, and parallelization—benchmarked on real-world graphs (road networks, social graphs, web graphs).

Why it teaches performance engineering: Graphs are everywhere (social networks, road maps, web crawling), but graph algorithms are notoriously memory-bound due to irregular access patterns. This project teaches you to optimize for the worst case: random memory access. You’ll learn about graph representations, direction-optimizing traversal, and how to squeeze performance from cache-hostile workloads.

Core challenges you’ll face:

  • Graph representation (CSR vs adjacency list) → maps to memory layout tradeoffs
  • Direction-optimizing BFS → maps to switching between push and pull
  • Cache-efficient traversal → maps to vertex ordering, edge partitioning
  • Parallel graph algorithms → maps to synchronization, work distribution

Key Concepts:

  • Graph Optimization: MIT 6.172 Lecture 22 - Graph Optimization
  • Direction-Optimizing BFS: “Direction-Optimizing Breadth-First Search” (Beamer et al.)
  • Graph Representations: “Introduction to Algorithms” Chapter 22 - CLRS
  • Parallel Graph Algorithms: “Ligra: A Lightweight Graph Processing Framework”

Difficulty: Advanced Time estimate: 3-4 weeks Prerequisites: Basic graph algorithms, cache optimization (Project 2), parallel programming (Project 6)

Real world outcome:

$ ./graph_bench -g road_network_usa.mtx -a bfs,sssp,pagerank

Graph: USA Road Network
Vertices: 23,947,347
Edges: 58,333,344 (directed)

BFS Benchmark (from random source):
┌─────────────────────────────────────────────────────────────┐
│ Implementation          │ Time   │ MTEPS  │ Cache Misses   │
├─────────────────────────────────────────────────────────────┤
│ Naive (queue-based)     │ 1.82s  │ 32.1   │ 892M           │
│ CSR representation      │ 0.94s  │ 62.1   │ 412M           │
│ Direction-optimizing    │ 0.31s  │ 188.2  │ 156M           │
│ + Parallel (8 cores)    │ 0.052s │ 1121.8 │ 168M           │
└─────────────────────────────────────────────────────────────┘

PageRank (20 iterations):
┌─────────────────────────────────────────────────────────────┐
│ Implementation          │ Time   │ Edges/s │ Convergence   │
├─────────────────────────────────────────────────────────────┤
│ Pull-based (naive)      │ 28.4s  │ 41.1M   │ 20 iters      │
│ Push-based              │ 12.2s  │ 95.6M   │ 20 iters      │
│ + Vertex reordering     │ 8.4s   │ 138.9M  │ 20 iters      │
│ + Parallel (8 cores)    │ 1.2s   │ 972.2M  │ 20 iters      │
│ + Delta-based           │ 0.34s  │ N/A     │ converged @ 8 │
└─────────────────────────────────────────────────────────────┘

Memory bandwidth utilization: 34.2 GB/s (of 50 GB/s peak)
Graph processing is MEMORY BOUND as expected.

Implementation Hints:

CSR (Compressed Sparse Row) representation:

row_ptr[v] = start index of v's neighbors in col_idx
col_idx[row_ptr[v] .. row_ptr[v+1]-1] = neighbors of v

Memory: O(V + E) — much better than adjacency matrix O(V²)
Access pattern: Sequential for one vertex's neighbors
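
In C the layout is just two arrays plus their sizes. The sketch below (names are illustrative) also shows the characteristic access pattern: one random jump into row_ptr, then a sequential scan of the neighbor list:

#include <stdint.h>

typedef struct {
    int64_t  num_vertices;
    int64_t  num_edges;
    int64_t *row_ptr;   /* length num_vertices + 1 */
    int32_t *col_idx;   /* length num_edges        */
} csr_graph;

/* Sum of out-degrees of v's neighbors: one pointer chase into row_ptr[v],
   then a sequential scan of col_idx, plus one random lookup per neighbor. */
int64_t neighbor_degree_sum(const csr_graph *g, int32_t v) {
    int64_t sum = 0;
    for (int64_t e = g->row_ptr[v]; e < g->row_ptr[v + 1]; e++) {
        int32_t u = g->col_idx[e];                       /* neighbor vertex */
        sum += g->row_ptr[u + 1] - g->row_ptr[u];        /* degree(u)       */
    }
    return sum;
}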

Direction-optimizing BFS:

Top-down (push): Visit frontier, add unvisited neighbors
  Good when frontier is small

Bottom-up (pull): For each unvisited vertex, check if any neighbor is in frontier
  Good when frontier is large (many vertices in frontier → likely to find parent)

Switch based on |frontier| / |unvisited| ratio!
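
Beamer et al. actually make the switch on edge counts rather than raw vertex counts; a hedged sketch of both tests, with alpha and beta as tuning constants:

/* Direction switch tests, a sketch. m_f = edges incident to the current
   frontier, m_u = edges incident to still-unvisited vertices,
   n_f = frontier size, n = total vertices.
   Alpha/beta are tuning knobs (commonly cited defaults: 14 and 24). */
int switch_to_bottom_up(long m_f, long m_u, long alpha) {
    return m_f > m_u / alpha;        /* frontier got heavy: pull instead of push */
}

int switch_to_top_down(long n_f, long n, long beta) {
    return n_f < n / beta;           /* frontier shrank again: go back to push   */
}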

Why graph algorithms are hard to optimize:

  • Irregular access patterns → cache misses on every edge
  • Low arithmetic intensity → memory bandwidth bound
  • Power-law degree distributions → load imbalance

Optimization techniques:

  1. Vertex reordering: Place frequently co-accessed vertices nearby
  2. Edge partitioning: Divide edges for cache-friendly batches
  3. Prefetching: Prefetch neighbor lists while processing current
  4. Delta-stepping (SSSP): Bucket vertices by distance for better locality

Questions to explore:

  • Why does direction-optimizing BFS help so much?
  • How do you parallelize without excessive synchronization?
  • What’s the difference between road networks and social networks?
  • How would you handle graphs that don’t fit in memory?

Learning milestones:

  1. CSR representation works correctly → You understand graph memory layout
  2. Direction-optimizing outperforms naive by 3-5x → You understand the technique
  3. Parallel version scales to 8 cores → You understand parallel graph algorithms
  4. You identify memory bandwidth as the bottleneck → You understand real limits

Project 15: Traveling Salesman Solver

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Optimization / Branch-and-Bound / Bit Manipulation
  • Software or Tool: Building Concorde-inspired solver
  • Main Book: “The Algorithm Design Manual” by Steven Skiena

What you’ll build: A high-performance TSP solver using branch-and-bound with effective bounds (1-tree, Held-Karp), intelligent branching strategies, and parallelization—capable of solving instances with 50-100 cities to optimality.

Why it teaches performance engineering: TSP is NP-hard, so brute force is hopeless. But clever algorithms can solve surprisingly large instances. This project brings together everything: bit manipulation for state representation, cache-efficient data structures, branch prediction optimization, and parallelization of search. It’s where algorithms meet systems performance.

Core challenges you’ll face:

  • State representation with bitmasks → maps to bit manipulation for efficiency
  • Computing tight lower bounds → maps to 1-tree, Held-Karp bounds
  • Branch-and-bound search → maps to pruning, search order
  • Parallel search with load balancing → maps to work distribution, bound sharing

Key Concepts:

  • TSP Tuning: MIT 6.172 Lecture 21 - Tuning a TSP Algorithm
  • Branch and Bound: “The Algorithm Design Manual” Chapter 7 - Skiena
  • Bit Manipulation for Sets: MIT 6.172 Lecture 3 - Bit Hacks
  • TSP Algorithms: “The Traveling Salesman Problem: A Computational Study” (Applegate et al.)

Difficulty: Expert Time estimate: 4-6 weeks Prerequisites: Dynamic programming, graph algorithms (Project 14), bit manipulation (Project 3), parallel programming (Project 6)

Real world outcome:

$ ./tsp_solver -i berlin52.tsp -t 60

TSP Solver - Berlin 52 cities
Optimal tour length: 7542 (known optimal)

Search Progress:
┌─────────────────────────────────────────────────────────────┐
│ Time   │ Best Tour │ Bound    │ Gap    │ Nodes    │ Pruned  │
├─────────────────────────────────────────────────────────────┤
│ 0.1s   │ 8,234     │ 6,891    │ 19.5%  │ 1,024    │ 45%     │
│ 0.5s   │ 7,891     │ 7,102    │ 11.1%  │ 12,892   │ 72%     │
│ 1.0s   │ 7,678     │ 7,289    │ 5.3%   │ 45,123   │ 84%     │
│ 2.0s   │ 7,542 ★   │ 7,421    │ 1.6%   │ 156,789  │ 91%     │
│ 3.2s   │ 7,542 ★   │ 7,542    │ 0.0%   │ 312,456  │ OPTIMAL │
└─────────────────────────────────────────────────────────────┘

Solution verified: OPTIMAL
Time to optimal: 2.0s
Time to prove: 3.2s
Nodes explored: 312,456 (vs ~2^52 = 4.5 × 10^15 brute force)

Parallel scaling (8 cores): 4.8x speedup

$ ./tsp_solver -i att48.tsp -algorithm dp
Dynamic Programming (Held-Karp):
  States: 2^48 × 48 = 13.5 × 10^15 — TOO LARGE
  Branch-and-bound is required for this size.

Implementation Hints:

Bitmask representation:

Visited set: 64-bit integer, bit i = 1 if city i visited
Operations:
  - Add city i: set |= (1ULL << i)
  - Remove city i: set &= ~(1ULL << i)
  - Check city i: (set >> i) & 1
  - Count visited: __builtin_popcountll(set)
  - Iterate unvisited: for each 0 bit...
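
The "for each 0 bit" iteration is the inner loop of the whole search, so it is worth spelling out; a sketch assuming at most 64 cities:

/* Iterate the unvisited cities without scanning every bit position:
   complement the visited mask (limited to n_cities), then peel off one
   set bit per iteration with ctz. */
#include <stdint.h>

void expand_node(uint64_t visited, int n_cities) {
    uint64_t mask = (n_cities == 64) ? ~0ULL : ((1ULL << n_cities) - 1);
    uint64_t unvisited = ~visited & mask;
    while (unvisited) {
        int city = __builtin_ctzll(unvisited);   /* lowest unvisited city   */
        /* ... extend the partial tour with `city`, recurse, backtrack ...  */
        unvisited &= unvisited - 1;              /* clear that bit, move on */
    }
}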

Branch-and-bound structure:

bound = compute_lower_bound(partial_tour)
if bound >= best_known:
    prune!  # This subtree can't improve

for each unvisited city c:
    extend tour with c
    recurse
    backtrack

Lower bounds (in increasing tightness/cost):

  1. Minimum spanning tree: O(n²) to compute
  2. 1-tree bound: MST of the non-start cities plus the two cheapest edges at the start city
  3. Held-Karp bound: Lagrangian ascent over 1-trees (equals the LP subtour relaxation)

Search order matters:

  • Nearest neighbor heuristic first → good initial bound
  • Best-first search → explore promising nodes first
  • Most constrained first → prune early

Parallelization:

  • Distribute subtrees to threads
  • Share bounds across threads (atomic best_known)
  • Load balancing: work stealing when thread finishes early
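
A sketch of the shared incumbent using C11 atomics; relaxed ordering is enough because the bound is only a pruning hint, and a slightly stale read just prunes a little less:

#include <limits.h>
#include <stdatomic.h>

static _Atomic long best_known = LONG_MAX;       /* length of best complete tour */

/* Record a newly found complete tour; keep the global minimum. */
void offer_tour(long length) {
    long cur = atomic_load_explicit(&best_known, memory_order_relaxed);
    while (length < cur &&
           !atomic_compare_exchange_weak_explicit(&best_known, &cur, length,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
        ;   /* a failed CAS reloads cur; retry only while we still improve it */
}

/* Prune test used inside the search. */
int can_prune(long lower_bound) {
    return lower_bound >= atomic_load_explicit(&best_known, memory_order_relaxed);
}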

Questions to explore:

  • Why does pruning percentage increase over time?
  • How much does a good initial heuristic help?
  • What’s the tradeoff between bound quality and compute time?
  • How do you handle memory for large state spaces?

Learning milestones:

  1. Correct exhaustive search → You understand the problem structure
  2. Bounds prune 90%+ of search space → You understand branch-and-bound
  3. Solve 20+ cities in seconds → You understand the optimizations
  4. Parallel version scales well → You understand parallel search

Project Comparison Table

Project                               Difficulty     Time             Depth of Understanding  Fun Factor
1.  Performance Measurement Toolkit   Intermediate   Weekend-1 week   ★★★★☆                   ★★★☆☆
2.  Matrix Multiplication Optimizer   Advanced       2-3 weeks        ★★★★★                   ★★★★☆
3.  Bit Manipulation Library          Intermediate   1 week           ★★★☆☆                   ★★★★☆
4.  Assembly Language Analyzer        Advanced       2-3 weeks        ★★★★★                   ★★★☆☆
5.  Custom Memory Allocator           Expert         3-4 weeks        ★★★★★                   ★★★★☆
6.  Parallel Matrix Operations        Advanced       2-3 weeks        ★★★★☆                   ★★★★☆
7.  Race Detector                     Expert         4-6 weeks        ★★★★★                   ★★★☆☆
8.  Cache-Oblivious B-Tree            Expert         3-4 weeks        ★★★★★                   ★★★☆☆
9.  Lock-Free Queue                   Expert         3-4 weeks        ★★★★★                   ★★★★☆
10. SIMD JSON Parser                  Expert         4-6 weeks        ★★★★★                   ★★★★★
11. Profile-Guided Optimizer          Expert         4-6 weeks        ★★★★★                   ★★★☆☆
12. Parallel Merge Sort               Advanced       2-3 weeks        ★★★★☆                   ★★★★☆
13. HW Perf Counter Dashboard         Advanced       2-3 weeks        ★★★★☆                   ★★★★★
14. Graph Algorithm Optimizer         Advanced       3-4 weeks        ★★★★☆                   ★★★★☆
15. TSP Solver                        Expert         4-6 weeks        ★★★★★                   ★★★★★

Phase 1: Foundations (4-6 weeks)

Build these first to establish measurement and low-level understanding:

  1. Project 1: Performance Measurement Toolkit — You need this to validate all other optimizations
  2. Project 3: Bit Manipulation Library — Quick win, teaches low-level thinking
  3. Project 2: Matrix Multiplication Optimizer — THE canonical performance engineering exercise

Phase 2: Systems Understanding (6-8 weeks)

Deepen your understanding of how systems actually work:

  1. Project 4: Assembly Language Analyzer — Learn to see what the compiler produces
  2. Project 5: Custom Memory Allocator — Understand dynamic memory deeply
  3. Project 13: HW Perf Counter Dashboard — Visual feedback on what’s happening

Phase 3: Parallelism (6-8 weeks)

Master parallel programming:

  1. Project 6: Parallel Matrix Operations — Fork-join parallelism basics
  2. Project 12: Parallel Merge Sort — Work-span analysis in practice
  3. Project 7: Race Detector — Understand concurrency bugs

Phase 4: Advanced Topics (8-12 weeks)

Push into expert territory:

  1. Project 9: Lock-Free Queue — Cutting-edge concurrency
  2. Project 8: Cache-Oblivious B-Tree — Advanced data structure design
  3. Project 10: SIMD JSON Parser — Pure magic performance

Phase 5: Integration (4-8 weeks)

Apply everything together:

  1. Project 14: Graph Algorithm Optimizer — Real-world memory-bound problem
  2. Project 11: Profile-Guided Optimizer — Compiler + runtime = holistic optimization
  3. Project 15: TSP Solver — Algorithms meet systems

Final Capstone Project: High-Performance Key-Value Store

  • File: LEARN_PERFORMANCE_ENGINEERING.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Distributed Systems / Storage / Full-Stack Performance
  • Software or Tool: Building a Redis/RocksDB competitor
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A high-performance in-memory key-value store that combines everything you’ve learned: custom memory allocator for minimal fragmentation, lock-free hash table for concurrent access, SIMD-accelerated key comparison, parallel request handling, and efficient persistence with write-ahead logging.

Why it teaches performance engineering: This is the ultimate integration project. Key-value stores are the backbone of modern infrastructure, and building one requires mastery of every concept: memory management, concurrency, caching, I/O, and measurement. Companies pay millions for fast key-value stores; you’ll understand why.

Core challenges you’ll face:

  • Lock-free hash table → applies Project 9
  • Custom slab allocator → applies Project 5
  • SIMD key comparison → applies Project 10
  • Parallel request handling → applies Projects 6, 12
  • Efficient persistence → new (WAL, fsync, mmap)
  • Performance measurement at every layer → applies Projects 1, 13

Key Concepts:

  • In-Memory Databases: “Designing Data-Intensive Applications” Chapter 3 - Kleppmann
  • Hash Table Design: All previous projects
  • Write-Ahead Logging: “Database Internals” by Alex Petrov
  • Efficient I/O: io_uring, mmap, direct I/O

Difficulty: Master Time estimate: 2-3 months Prerequisites: All previous projects (or equivalent experience)

Real world outcome:

$ ./kvstore_bench -threads 8 -ops 10000000 -keysize 32 -valuesize 256

High-Performance KV Store Benchmark
Operations: 10M (50% GET, 50% SET)
Keys: 32 bytes, Values: 256 bytes
Threads: 8

┌─────────────────────────────────────────────────────────────┐
│ Metric              │ Our KV Store │ Redis  │ Memcached    │
├─────────────────────────────────────────────────────────────┤
│ Throughput          │ 4.2M ops/s   │ 180K   │ 250K         │
│ Avg Latency         │ 1.9 μs       │ 44 μs  │ 32 μs        │
│ P99 Latency         │ 8.2 μs       │ 120 μs │ 85 μs        │
│ Memory Efficiency   │ 92%          │ 78%    │ 82%          │
└─────────────────────────────────────────────────────────────┘

Why we're faster:
  ✓ Lock-free hash table (no mutex contention)
  ✓ Custom allocator (no malloc overhead, no fragmentation)
  ✓ SIMD key comparison (32-byte keys in one instruction)
  ✓ Batched fsync (amortized persistence cost)
  ✓ Thread-local operation queues (reduced synchronization)

Persistence test:
  Write 1M keys, kill process, restart, verify all keys present: PASS
  WAL replay time: 0.8 seconds

Implementation Hints:

Architecture overview:

┌──────────────────────────────────────────────────────────────┐
│                     Request Handlers                          │
│   (thread-local queues, batch processing)                    │
├──────────────────────────────────────────────────────────────┤
│                    Lock-Free Hash Table                       │
│   (open addressing, SIMD probing, hopscotch/robin hood)      │
├──────────────────────────────────────────────────────────────┤
│                    Slab Allocator                             │
│   (size classes, thread-local caches, no fragmentation)      │
├──────────────────────────────────────────────────────────────┤
│                    Write-Ahead Log                            │
│   (append-only, group commit, background sync)               │
└──────────────────────────────────────────────────────────────┘

Key design decisions:

  1. Hash table: Lock-free with CAS, resize via gradual migration
  2. Memory: Slab allocator with power-of-2 size classes
  3. Key comparison: SIMD for fixed-size keys, optimized memcmp for variable
  4. Persistence: Append to WAL, batch fsync, background compaction
  5. Networking: Event-driven (epoll/kqueue) or io_uring
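
For decision 3, a sketch of the fixed-size case with AVX2 (the 32-byte keys from the benchmark above); non-AVX2 builds would fall back to memcmp:

/* Compile with -mavx2. One 256-bit load per key, one byte-wise compare,
   one movemask: the whole 32-byte key is checked without a loop. */
#include <immintrin.h>
#include <stdbool.h>

bool key32_equal(const void *a, const void *b) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i eq = _mm256_cmpeq_epi8(va, vb);       /* 0xFF in every matching byte */
    return _mm256_movemask_epi8(eq) == -1;        /* -1 means all 32 bytes match */
}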

Questions to explore:

  • How do you resize a lock-free hash table?
  • What’s the tradeoff between immediate and batched fsync?
  • How do you handle hot keys (skewed access)?
  • What’s the latency breakdown (network, compute, storage)?

Learning milestones:

  1. Lock-free hash table works correctly → You’ve mastered concurrency
  2. Throughput exceeds 1M ops/s → You’ve optimized the hot path
  3. Persistence doesn’t hurt performance → You understand I/O optimization
  4. You can explain every microsecond of latency → You’re a performance engineer

Summary

#         Project                                   Main Language
1         Performance Measurement Toolkit           C
2         Matrix Multiplication Optimizer           C
3         Bit Manipulation Library                  C
4         Assembly Language Analyzer                C
5         Custom Memory Allocator                   C
6         Parallel Matrix Operations with Cilk      C (OpenCilk)
7         Race Detector                             C
8         Cache-Oblivious B-Tree                    C
9         Lock-Free Queue                           C
10        SIMD JSON Parser                          C
11        Profile-Guided Optimizer                  C/C++
12        Parallel Merge Sort with Work-Stealing    C
13        Hardware Performance Counter Dashboard    C
14        Graph Algorithm Optimizer                 C
15        Traveling Salesman Solver                 C
Capstone  High-Performance Key-Value Store          C

Resources

Primary Reference

  • MIT 6.172 Performance Engineering of Software Systems (MIT OpenCourseWare): the lectures cited throughout the project descriptions above

Essential Books

  • “Computer Systems: A Programmer’s Perspective” by Bryant & O’Hallaron - Foundation for everything
  • “Introduction to Algorithms” by CLRS - Algorithms including parallel algorithms
  • “The Art of Multiprocessor Programming” by Herlihy & Shavit - Concurrency theory and practice

Performance-Specific Books

  • “Systems Performance” by Brendan Gregg - Practical performance engineering
  • “Hacker’s Delight” by Henry Warren - Bit manipulation techniques
  • “What Every Programmer Should Know About Memory” by Ulrich Drepper - Memory hierarchy deep dive

“The difference between a bad programmer and a good one is whether they consider their code or their data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.” — Linus Torvalds

“Premature optimization is the root of all evil — but so is premature pessimization.” — This course