PERFORMANCE ENGINEERING PROJECTS
Sprint 2: Data & Invariants — Project Recommendations
Goal
After completing these projects, you will understand performance engineering as a full-stack discipline: how programs actually consume CPU cycles, memory bandwidth, cache capacity, and I/O time, and how to turn that understanding into measurable latency and throughput improvements.
You will internalize:
- How to profile reality: sampling vs tracing, perf events, flamegraphs, and how to interpret them without guessing
- How to debug performance bugs: GDB, core dumps, and post-mortem analysis of hot paths and stalls
- How hardware really behaves: cache hierarchies, branch prediction, SIMD lanes, and why micro-optimizations sometimes work and often do not
- How to optimize latency: tail latency, queuing effects, lock contention, and jitter sources
- How to validate improvements: benchmarking discipline, statistical rigor, and regression detection
By the end, you will be able to explain why a system is slow, prove it with data, and implement targeted changes that predictably move real performance metrics.
Foundational Concepts: The Five Pillars
1. Measurement First — You Cannot Optimize What You Cannot Measure
┌─────────────────────────────────────────────────────────┐
│ PERFORMANCE ENGINEERING STARTS WITH MEASUREMENT │
│ │
│ You must know: │
│ • What is slow │
│ • Where time is spent │
│ • Which resource is saturated │
│ • How variability changes your conclusions │
└─────────────────────────────────────────────────────────┘
2. The Performance Triangle — CPU, Memory, I/O
┌───────────┐
│    CPU    │  Compute cycles, pipeline stalls
└─────┬─────┘
      │
┌─────┴─────┐
│  MEMORY   │  Cache misses, bandwidth
└─────┬─────┘
      │
┌─────┴─────┐
│    I/O    │  Disk, network, syscalls
└───────────┘
3. The Latency Stack — Why Small Delays Add Up
Request Latency =
Queueing delay
+ Scheduling delay
+ CPU execution time
+ Cache miss penalties
+ Syscall overhead
+ I/O wait
+ Lock contention
4. Hardware Reality — Caches, Pipelines, SIMD
CPU Core
├─ L1 Cache (tiny, fastest)
├─ L2 Cache (larger, slower)
├─ L3 Cache (shared, much slower)
└─ DRAM (massive, very slow)
SIMD: One instruction, many data lanes
5. Debugging Performance — Finding the Real Cause
SYMPTOM: "Latency spikes"
↓
MEASUREMENT: "p99 jumps from 5 ms to 80 ms"
↓
PROFILE: "Stack shows lock contention"
↓
ROOT CAUSE: "Mutex held during disk write"
Core Concept Analysis
The Measurement Stack
┌───────────────┐ perf stat / perf record
│ Counters │ → CPU cycles, cache misses, branch misses
└───────┬───────┘
│
┌───────┴───────┐ Flamegraphs
│ Samples │ → Which code paths consume time
└───────┬───────┘
│
┌───────┴───────┐ Traces
│ Events │ → Syscalls, scheduling, I/O timing
└───────────────┘
The Cache and Memory Model
Time Cost (approximate)
┌─────────────────────────────┐
│ Register ~1 cycle │
│ L1 Cache ~4 cycles │
│ L2 Cache ~12 cycles │
│ L3 Cache ~40 cycles │
│ DRAM 100+ cycles │
└─────────────────────────────┘
SIMD and Vectorization
Scalar: A1 A2 A3 A4 -> one per instruction
SIMD: [A1 A2 A3 A4] -> one instruction, 4 lanes
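To make the lane picture concrete, here is a minimal C sketch (illustrative only; the function name and the use of restrict are choices, not requirements) of the kind of loop a compiler can map onto SIMD lanes:

```c
#include <stddef.h>

/* Scalar element-wise add: one float per iteration.
 * The data is contiguous and the iterations are independent, so an
 * optimizing compiler can map this loop onto SIMD lanes and process
 * 4, 8, or 16 elements per instruction, depending on the target ISA. */
void add_arrays(const float *restrict a, const float *restrict b,
                float *restrict out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```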
Debugging and Post-Mortem Analysis
Crash or stall -> Core dump -> GDB analysis -> Identify hot or blocked threads
Concept Summary Table
| Concept Cluster | What You Must Internalize |
|---|---|
| Measurement Discipline | You need repeatable benchmarks and correct statistics before trusting any optimization. |
| Profiling and Tracing | Sampling shows where time is spent; tracing shows why and when it happened. |
| CPU and Cache Behavior | Performance depends on cache locality, branch prediction, and memory access patterns. |
| SIMD and Vectorization | Wide data paths can multiply throughput, but only with aligned, predictable data. |
| Latency Engineering | Tail latency is a system property, not just a code property. |
| Debugging for Performance | GDB and core dumps can reveal blocked threads, stalls, and deadlocks. |
Deep Dive Reading By Concept
Concept 1: Measurement and Benchmarking
| Book | Chapter/Section | What You’ll Learn | Priority |
|---|---|---|---|
| “Systems Performance” by Brendan Gregg | Ch. 2: Methodology | How to structure a performance investigation | Essential |
| “High Performance Python” by Gorelick and Ozsvald | Ch. 1: Benchmarking | Measuring accurately and avoiding false conclusions | Recommended |
Concept 2: Profiling and Flamegraphs
| Book | Chapter/Section | What You’ll Learn | Priority |
|---|---|---|---|
| “Systems Performance” by Brendan Gregg | Ch. 6: CPUs | CPU profiling methods and tools | Essential |
| “Performance Analysis and Tuning on Modern CPUs” by Bakhvalov | Ch. 1-3: Microarchitecture | How microarchitecture affects profiling results | Recommended |
Concept 3: Cache and Memory
| Book | Chapter/Section | What You’ll Learn | Priority |
|---|---|---|---|
| “Computer Systems: A Programmer’s Perspective” by Bryant and O’Hallaron | Ch. 6: Memory Hierarchy | Cache behavior and memory access patterns | Essential |
| “What Every Programmer Should Know About Memory” by Drepper | Sections 2-6 | Cache levels, bandwidth, and latency | Recommended |
Concept 4: SIMD and Vectorization
| Book | Chapter/Section | What You’ll Learn | Priority |
|---|---|---|---|
| “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson | Ch. 4: Data-Level Parallelism | Vector units and throughput | Recommended |
| “Optimizing Software in C++” by Agner Fog | Ch. 10: Vectorization | SIMD principles and pitfalls | Essential |
Concept 5: Latency Engineering
| Book | Chapter/Section | What You’ll Learn | Priority |
|---|---|---|---|
| “Designing Data-Intensive Applications” by Kleppmann | Ch. 8: The Trouble with Distributed Systems | Tail latency and queuing | Recommended |
| “Site Reliability Engineering” by Beyer et al. | Ch. 4: Service Level Objectives | Latency targets and error budgets | Recommended |
Project List
Projects are ordered from foundational profiling to advanced hardware and latency optimization.
Project 1: Performance Baseline Lab
- File: PERFORMANCE_ENGINEERING_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Benchmarking and Measurement
- Software or Tool: perf, time, taskset
- Main Book: “Systems Performance” by Brendan Gregg
What you’ll build: A repeatable benchmarking harness that measures runtime, variance, and CPU usage for a set of micro-tasks.
Why it teaches performance engineering: Without a reliable baseline, every optimization is a guess. This project forces you to build measurement discipline first.
Core challenges you’ll face:
- Designing repeatable workloads (maps to measurement methodology)
- Handling variance and noise (maps to statistics and sampling)
- Recording and comparing runs (maps to regression detection)
Key Concepts
- Benchmarking noise: “Systems Performance” - Brendan Gregg
- Measurement bias: “High Performance Python” - Gorelick and Ozsvald
- Repeatability: “The Practice of Programming” - Kernighan and Pike
Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Basic C, command-line usage.
Real World Outcome
You will have a CLI tool that runs a defined workload N times, reports min/median/p99 duration, and stores results in a simple text report. When you run it, you will see a clear comparison between baseline and changed versions.
Example Output:
$ ./perf_lab run --workload "memcopy" --iters 50
Workload: memcopy
Runs: 50
Min: 12.4 ms
Median: 13.1 ms
p99: 14.9 ms
CPU cycles: 3.2e9
Cache misses: 1.1e7
Saved report: reports/memcopy_2025-01-10.txt
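A minimal sketch of the timing core behind numbers like these, assuming a placeholder run_workload() that you supply and a fixed iteration count; counter collection, the report format, and file output are left to your design:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 50

/* Placeholder: replace with the workload under test. */
static void run_workload(void) { /* ... */ }

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    double samples[ITERS];

    for (int i = 0; i < ITERS; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run_workload();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        samples[i] = (t1.tv_sec - t0.tv_sec) * 1e3 +
                     (t1.tv_nsec - t0.tv_nsec) / 1e6;   /* milliseconds */
    }

    /* Sort so min/median/p99 are simple index lookups.
     * With only 50 runs, p99 is effectively the largest sample. */
    qsort(samples, ITERS, sizeof(double), cmp_double);
    printf("Min:    %.1f ms\n", samples[0]);
    printf("Median: %.1f ms\n", samples[ITERS / 2]);
    printf("p99:    %.1f ms\n", samples[(ITERS * 99) / 100]);
    return 0;
}
```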
The Core Question You’re Answering
“How do I know if a change actually made performance better?”
Before you write any code, sit with this question. Most optimization efforts fail because there is no trustworthy baseline or measurement process.
Concepts You Must Understand First
Stop and research these before coding:
- Benchmarking discipline
- What makes a benchmark repeatable?
- How does CPU frequency scaling affect results?
- Why does warm-up matter?
- Book Reference: “Systems Performance” Ch. 2 - Brendan Gregg
- Variance and noise
- What are outliers and how do they distort averages?
- Why is p99 more informative than the mean for latency?
- Book Reference: “High Performance Python” Ch. 1 - Gorelick and Ozsvald
Questions to Guide Your Design
Before implementing, think through these:
- Workload definition
- How will you isolate CPU-bound vs memory-bound tasks?
- What inputs represent realistic sizes?
- How will you keep input stable across runs?
- Metrics and reporting
- Which metrics are most meaningful for your workload?
- How will you store historical baselines?
Thinking Exercise
The Baseline Trap
Before coding, describe a scenario where an optimization appears faster but is actually noise. Write down three sources of variability (scheduler, CPU turbo, cache warmth) and how each could mislead your conclusion.
Describe: "Before" vs "After" results and why they are not comparable.
List: 3 variability sources and how you would control them.
The Interview Questions They’ll Ask
Prepare to answer these:
- “How do you design a trustworthy benchmark?”
- “Why is median better than mean for latency?”
- “What is p99 and why does it matter?”
- “How can CPU frequency scaling distort a benchmark?”
- “How do you detect regression over time?”
Hints in Layers
Hint 1 (Starting Point): Pick two micro-workloads: one CPU-bound (simple arithmetic loop), one memory-bound (array scan). Only then add metrics.
Hint 2 (Next Level): Run each workload multiple times with the CPU pinned to a single core. Compare median and p99.
Hint 3 (Technical Details): Record both wall-clock time and CPU counters. Store results with timestamps for comparison.
Hint 4 (Tools/Debugging): Use perf stat to validate your counters, then compare them to your stored reports.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Benchmarking methodology | “Systems Performance” by Brendan Gregg | Ch. 2 |
| Measurement pitfalls | “High Performance Python” by Gorelick and Ozsvald | Ch. 1 |
| Debugging experiments | “The Practice of Programming” by Kernighan and Pike | Ch. 5 |
Project 2: perf + Flamegraph Investigator
- File: PERFORMANCE_ENGINEERING_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Profiling and Flamegraphs
- Software or Tool: perf, flamegraph scripts
- Main Book: “Systems Performance” by Brendan Gregg
What you’ll build: A repeatable workflow that profiles a workload, generates flamegraphs, and annotates hotspots with causes.
Why it teaches performance engineering: Flamegraphs force you to attribute time to exact code paths, not guesses.
Core challenges you’ll face:
- Collecting reliable samples (maps to profiling fundamentals)
- Interpreting flamegraph width vs depth (maps to call stack attribution)
- Separating CPU time from I/O wait (maps to tracing vs sampling)
Key Concepts
- Profiling methodology: “Systems Performance” - Brendan Gregg
- Stack sampling: “Performance Analysis and Tuning on Modern CPUs” - Bakhvalov
- Flamegraph interpretation: Brendan Gregg blog (conceptual reference)
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Project 1 complete, basic Linux tooling.
Real World Outcome
You will produce a flamegraph report for a real workload and a short narrative of the top three hotspots and their likely causes. You will also have a repeatable command sequence for re-running the analysis after changes.
Example Output:
$ ./profile_run.sh
Profile captured: profiles/run_2025-01-10.data
Flamegraph: reports/run_2025-01-10.svg
Top hotspots:
1) parse_input -> 38%
2) hash_lookup -> 21%
3) serialize_output -> 14%
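If you want a self-contained practice target before profiling something real, a sketch like the one below works; the function names deliberately mirror the hypothetical hotspots above, and the loop counts are arbitrary. Build it with symbols and frame pointers (for example -O2 -g -fno-omit-frame-pointer) so perf can resolve call stacks:

```c
/* Deliberately lopsided profiling target.
 * Build: cc -O2 -g -fno-omit-frame-pointer hotspots.c -o hotspots
 * Then sample it and render a flamegraph from the collected stacks. */
#include <stdio.h>

static volatile unsigned long sink;   /* defeats dead-code elimination */

static void parse_input(void)      { for (unsigned long i = 0; i < 400000000UL; i++) sink += i; }
static void hash_lookup(void)      { for (unsigned long i = 0; i < 200000000UL; i++) sink ^= i * 2654435761UL; }
static void serialize_output(void) { for (unsigned long i = 0; i < 100000000UL; i++) sink += i >> 3; }

int main(void)
{
    parse_input();
    hash_lookup();
    serialize_output();
    printf("%lu\n", sink);
    return 0;
}
```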
The Core Question You’re Answering
“Which specific functions are consuming the most CPU time, and why?”
Concepts You Must Understand First
Stop and research these before coding:
- Sampling vs tracing
- What does a sample represent?
- Why can sampling miss short functions?
- Book Reference: “Systems Performance” Ch. 6 - Brendan Gregg
- Call stacks and attribution
- How do stack frames map to time?
- What does a wide flamegraph bar mean?
- Book Reference: “Performance Analysis and Tuning on Modern CPUs” Ch. 1 - Bakhvalov
Questions to Guide Your Design
Before implementing, think through these:
- Profiling configuration
- What sample rate balances overhead vs accuracy?
- How will you ensure symbols are available for stacks?
- Interpreting results
- How will you differentiate hotspots that are expected from unexpected?
Thinking Exercise
The Misleading Hotspot
Describe a case where a function appears hot in a flamegraph, but optimizing it would not reduce total latency. Write down how you would verify whether it is truly the bottleneck.
Explain why a hotspot might be a symptom rather than a cause.
List two additional signals you would check.
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is a flamegraph and what does width represent?”
- “How do you choose a sampling rate for perf?”
- “Why can a function appear hot but not be the real bottleneck?”
- “How do you separate CPU-bound work from I/O wait?”
- “What are the limitations of sampling profilers?”
Hints in Layers
Hint 1 (Starting Point): Profile a single workload from Project 1 to keep scope small.
Hint 2 (Next Level): Run multiple profiles and compare the stability of hotspots.
Hint 3 (Technical Details): Use a separate step to symbolize and render flamegraphs so you can re-run without re-profiling.
Hint 4 (Tools/Debugging): Validate results with a second profiler to cross-check hotspots.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| CPU profiling | “Systems Performance” by Brendan Gregg | Ch. 6 |
| Microarchitecture basics | “Performance Analysis and Tuning on Modern CPUs” by Bakhvalov | Ch. 1-3 |
| Profiling practices | “Optimizing Software in C++” by Agner Fog | Ch. 2 |
Project 3: GDB + Core Dump Performance Autopsy
- File: PERFORMANCE_ENGINEERING_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust, Go
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Debugging and Post-Mortem Analysis
- Software or Tool: GDB, core dumps
- Main Book: “The Linux Programming Interface” by Michael Kerrisk
What you’ll build: A repeatable crash-and-stall investigation workflow using core dumps and GDB to identify blocked threads and hot loops.
Why it teaches performance engineering: Post-mortem analysis is essential when performance issues only happen in production.
Core challenges you’ll face:
- Capturing useful core dumps (maps to process state capture)
- Reading thread backtraces (maps to stack and scheduler awareness)
- Correlating stack state with latency symptoms (maps to root-cause analysis)
Key Concepts
- Core dump generation: “The Linux Programming Interface” - Kerrisk
- GDB backtrace reasoning: “Debugging with GDB” - FSF
- Thread states: “Operating Systems: Three Easy Pieces” - Arpaci-Dusseau
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Project 1 complete, basic GDB usage.
Real World Outcome
You will have a written post-mortem that identifies the exact thread and function responsible for a synthetic stall. You will also have a checklist for capturing core dumps and analyzing them in GDB.
Example Output:
$ ./autopsy.sh --core core.1234
Loaded core: core.1234
Threads: 8
Hot thread: TID 7
Top frame: lock_acquire
Evidence: 3 threads waiting on same mutex
Conclusion: contention in request queue
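One way to manufacture the stall is a sketch like this (thread count and sleep duration are arbitrary); while it runs, snapshot a core with gcore and walk the threads in GDB with thread apply all bt:

```c
/* Synthetic stall: one thread holds a mutex while "doing I/O",
 * the others pile up waiting.
 * Build:  cc -g -pthread stall.c -o stall
 * While it runs: `gcore <pid>`, then `gdb ./stall core.<pid>`
 * and `thread apply all bt`. Enable cores first if needed:
 * `ulimit -c unlimited`. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        sleep(2);                 /* stand-in for a slow disk write under the lock */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    printf("pid %d: threads contending on one mutex\n", getpid());
    pause();                      /* keep the process alive for core capture */
    return 0;
}
```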
The Core Question You’re Answering
“How do I explain a performance incident when I cannot reproduce it?”
Concepts You Must Understand First
Stop and research these before coding:
- Core dumps and process state
- What is captured in a core dump?
- How do you map memory to symbols?
- Book Reference: “The Linux Programming Interface” Ch. 22 - Kerrisk
- Thread scheduling and blocking
- How do you identify blocked threads?
- What does a waiting thread look like in backtrace?
- Book Reference: “Operating Systems: Three Easy Pieces” Ch. 26 - Arpaci-Dusseau
Questions to Guide Your Design
Before implementing, think through these:
- Reproducible stall scenario
- What controllable condition will cause blocking?
- How will you capture a core at the right moment?
- Analysis workflow
- Which thread clues indicate a hot loop vs a lock wait?
Thinking Exercise
The Frozen System
Write a narrative of how you would distinguish between a CPU-bound loop and a thread deadlock using only a core dump and backtraces.
List the signals you would look for in stack traces and thread states.
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is a core dump and how do you use it?”
- “How can you diagnose lock contention from a core dump?”
- “How do you identify a hot loop in GDB?”
- “What makes performance bugs hard to reproduce?”
- “How do you capture a core dump safely in production?”
Hints in Layers
Hint 1 (Starting Point): Create a workload that deliberately blocks on a lock and capture a core at peak stall.
Hint 2 (Next Level): Compare thread backtraces to see which thread is holding the lock and which are waiting.
Hint 3 (Technical Details): Use thread backtrace summaries to identify hot frames and blocked frames.
Hint 4 (Tools/Debugging): Write a short checklist for core capture, symbol loading, and triage steps.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Core dumps | “The Linux Programming Interface” by Kerrisk | Ch. 22 |
| Debugging practice | “Debugging with GDB” by FSF | Ch. 7 |
| Thread states | “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau | Ch. 26 |
Project 4: Cache Locality Visualizer
- File: PERFORMANCE_ENGINEERING_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust, Zig
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: CPU Caches and Memory Access
- Software or Tool: perf, cachegrind (optional)
- Main Book: “Computer Systems: A Programmer’s Perspective”
What you’ll build: A set of experiments that visualize how data layout and access patterns change cache miss rates and runtime.
Why it teaches performance engineering: Cache behavior is often the biggest performance lever in real systems.
Core challenges you’ll face:
- Designing access patterns (maps to spatial and temporal locality)
- Measuring cache miss rates (maps to hardware counters)
- Connecting layout decisions to runtime impact (maps to memory hierarchy)
Key Concepts
- Memory hierarchy: “CS:APP” - Bryant and O’Hallaron
- Cache effects: “What Every Programmer Should Know About Memory” - Drepper
- Data layout: “Optimizing Software in C++” - Agner Fog
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Projects 1-2 complete, basic understanding of caches.
Real World Outcome
You will produce a report with side-by-side timing and cache-miss data for different data layouts and access patterns, plus a short explanation of which pattern wins and why.
Example Output:
$ ./cache_lab report
Pattern A (row-major): 120 ms, L1 miss rate 2%
Pattern B (column-major): 940 ms, L1 miss rate 38%
Conclusion: stride access destroys cache locality
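A minimal sketch of the two traversal patterns, assuming a square array sized to overflow the last-level cache; time each loop separately with your Project 1 harness or perf stat:

```c
#include <stdio.h>

#define N 4096   /* 4096 x 4096 ints = 64 MiB, larger than most L3 caches */

static int grid[N][N];

int main(void)
{
    long long sum = 0;

    /* Fill the array so the summing loops below cannot be folded away. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            grid[i][j] = i + j;

    /* Pattern A: row-major walk matches the memory layout,
     * so consecutive accesses share cache lines. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += grid[i][j];

    /* Pattern B: column-major walk strides N ints between accesses,
     * so almost every access touches a different cache line. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += grid[i][j];

    printf("%lld\n", sum);   /* keep the result observable */
    return 0;
}
```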
The Core Question You’re Answering
“Why does the same algorithm run 8x slower just by changing data layout?”
Concepts You Must Understand First
Stop and research these before coding:
- Cache hierarchy
- What is the cost difference between L1 and DRAM?
- How do cache lines work?
- Book Reference: “CS:APP” Ch. 6 - Bryant and O’Hallaron
- Locality
- What is spatial vs temporal locality?
- How does stride size affect misses?
- Book Reference: Drepper, Sections 2-3
Questions to Guide Your Design
Before implementing, think through these:
- Experiment design
- How will you isolate access patterns from computation cost?
- What data sizes cross cache boundaries?
- Measurement
- Which counters best indicate cache pressure?
Thinking Exercise
The Cache Line Walk
Draw a grid of memory addresses for a 2D array. Mark which cache lines are touched when traversing row-major vs column-major. Explain why one pattern reuses cache lines and the other does not.
Sketch: memory layout of a 4x4 array and show access order.
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is a cache line and why does it matter?”
- “Explain spatial vs temporal locality.”
- “Why is column-major traversal slow in C, where 2D arrays are stored row-major?”
- “How do you measure cache misses with perf?”
- “How can data layout dominate algorithmic complexity?”
Hints in Layers
Hint 1 (Starting Point): Start with a 2D array scan and compare two traversal orders.
Hint 2 (Next Level): Scale the array so it crosses L1, then L2, then L3 sizes.
Hint 3 (Technical Details): Record cache miss counters alongside runtime for each test.
Hint 4 (Tools/Debugging): Use perf stat with cache events to validate your assumptions.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Memory hierarchy | “Computer Systems: A Programmer’s Perspective” | Ch. 6 |
| Cache behavior | “What Every Programmer Should Know About Memory” | Sections 2-4 |
| Data layout | “Optimizing Software in C++” by Agner Fog | Ch. 5 |
Project 5: Branch Predictor and Pipeline Lab
- File: PERFORMANCE_ENGINEERING_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust, Zig
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: CPU Microarchitecture
- Software or Tool: perf, cpu topology tools
- Main Book: “Computer Architecture: A Quantitative Approach”
What you’ll build: A set of controlled experiments that show branch misprediction penalties and pipeline effects.
Why it teaches performance engineering: Branch mispredictions are invisible in code review but dominate CPU time.
Core challenges you’ll face:
- Creating predictable vs unpredictable branches (maps to branch prediction)
- Measuring misprediction rates (maps to hardware counters)
- Relating pipeline stalls to latency (maps to CPU execution)
Key Concepts
- Branch prediction: “Computer Architecture” - Hennessy and Patterson
- Pipeline stalls: “Performance Analysis and Tuning on Modern CPUs” - Bakhvalov
- CPU counters: “Systems Performance” - Brendan Gregg
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Projects 1-2 complete, basic CPU architecture.
Real World Outcome
You will produce a report that shows how predictable branches run faster and how mispredictions increase cycles per instruction.
Example Output:
$ ./branch_lab report
Predictable branch: 1.2 cycles/instruction, 2% mispredict
Unpredictable branch: 3.8 cycles/instruction, 41% mispredict
Conclusion: branch predictability dominates throughput
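A common way to build the two loops is to vary the data, not the code: the sketch below (array size, threshold, and pass count are arbitrary) runs the same branch over random or sorted data. Note that at high optimization levels the compiler may replace the branch with a conditional move, so check the generated code or lower the optimization level if the effect disappears:

```c
#include <stdio.h>
#include <stdlib.h>

#define N      (1 << 16)   /* 64K ints = 256 KiB, small enough to stay cached */
#define PASSES 2000

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
    static int data[N];
    long long sum = 0;

    srand(42);
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    /* Run with any argument (e.g. `./branch_lab sorted`) to sort the data
     * and make the branch below almost perfectly predictable. */
    if (argc > 1)
        qsort(data, N, sizeof(int), cmp_int);

    for (int p = 0; p < PASSES; p++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)          /* the branch under test */
                sum += data[i];

    printf("%lld\n", sum);
    return 0;
}
```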
The Core Question You’re Answering
“Why does a tiny conditional statement slow down an entire loop?”
Concepts You Must Understand First
Stop and research these before coding:
- Branch prediction
- What does the CPU predict and when?
- What happens on misprediction?
- Book Reference: “Computer Architecture” Ch. 3 - Hennessy and Patterson
- Pipeline execution
- How do pipeline stalls reduce throughput?
- Book Reference: “Performance Analysis and Tuning on Modern CPUs” Ch. 2 - Bakhvalov
Questions to Guide Your Design
Before implementing, think through these:
- Experiment design
- How will you control input distributions to change predictability?
- How will you isolate branch cost from memory cost?
- Metrics
- Which counters best indicate misprediction impact?
Thinking Exercise
The Pipeline Flush
Describe the sequence of events when a branch is mispredicted. Explain how many stages are wasted and why that reduces instruction throughput.
Write a step-by-step narrative of a misprediction penalty.
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is branch prediction and why does it matter?”
- “What is a pipeline stall?”
- “How does misprediction affect CPI?”
- “How do you measure branch misses in perf?”
- “When does branch prediction not matter?”
Hints in Layers
Hint 1 (Starting Point): Create two loops with identical work but different branch predictability.
Hint 2 (Next Level): Use perf to compare branch miss rates and CPI.
Hint 3 (Technical Details): Keep data in cache to avoid confounding memory effects.
Hint 4 (Tools/Debugging): Validate with multiple runs to ensure stability.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Branch prediction | “Computer Architecture” by Hennessy and Patterson | Ch. 3 |
| Pipeline stalls | “Performance Analysis and Tuning on Modern CPUs” by Bakhvalov | Ch. 2 |
| CPU counters | “Systems Performance” by Brendan Gregg | Ch. 6 |
Project 6: SIMD Throughput Explorer
- File: PERFORMANCE_ENGINEERING_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: C++, Rust, Zig
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: SIMD and Vectorization
- Software or Tool: perf, compiler vectorization reports
- Main Book: “Optimizing Software in C++” by Agner Fog
What you’ll build: A set of experiments showing scalar vs SIMD throughput for numeric workloads.
Why it teaches performance engineering: SIMD is one of the most powerful (and misunderstood) sources of speedup.
Core challenges you’ll face:
- Structuring data for vectorization (maps to alignment and layout)
- Measuring speedup correctly (maps to measurement discipline)
- Handling edge cases where SIMD fails (maps to control flow divergence)
Key Concepts
- SIMD fundamentals: “Optimizing Software in C++” - Agner Fog
- Vector units: “Computer Architecture” - Hennessy and Patterson
- Alignment: “CS:APP” - Bryant and O’Hallaron
Difficulty: Expert. Time estimate: 1 month+. Prerequisites: Projects 1, 2, and 4 complete; basic CPU architecture.
Real World Outcome
You will produce a side-by-side report showing the throughput difference between scalar and vectorized workloads, with evidence of alignment and data layout effects.
Example Output:
$ ./simd_lab report
Scalar: 1.0x baseline
SIMD: 3.7x speedup
Aligned data: 4.2x speedup
Unaligned data: 2.1x speedup
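As a starting sketch, here is a scalar loop next to an explicit 4-lane SSE version (x86-64 only; the function names are made up, and n is assumed to be a multiple of 4 for brevity). Keep in mind that the scalar version is often auto-vectorized at -O2/-O3, so compare compiler vectorization reports rather than just source code:

```c
/* Scalar vs explicit 4-lane SSE for an element-wise multiply-add. */
#include <immintrin.h>
#include <stddef.h>

void saxpy_scalar(float a, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

void saxpy_sse(float a, const float *x, float *y, size_t n)
{
    __m128 va = _mm_set1_ps(a);
    for (size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(&x[i]);      /* unaligned load: works anywhere */
        __m128 vy = _mm_loadu_ps(&y[i]);
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);
        _mm_storeu_ps(&y[i], vy);
    }
}
```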
The Core Question You’re Answering
“When does SIMD actually speed things up, and when does it not?”
Concepts You Must Understand First
Stop and research these before coding:
- Vectorization basics
- What is a SIMD lane?
- What data sizes match vector width?
- Book Reference: “Optimizing Software in C++” Ch. 10 - Fog
- Alignment and layout
- Why does alignment matter for vector loads?
- Book Reference: “CS:APP” Ch. 6 - Bryant and O’Hallaron
Questions to Guide Your Design
Before implementing, think through these:
- Data preparation
- How will you ensure contiguous, aligned data?
- Measurement
- How will you isolate compute from memory bottlenecks?
Thinking Exercise
The SIMD Mismatch
Describe a dataset where SIMD would not help due to irregular memory access. Explain how you would detect that in a profiler.
List the symptoms of non-vectorizable workloads.
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is SIMD and why does it help?”
- “How does alignment affect SIMD performance?”
- “What workloads are good candidates for vectorization?”
- “How do you validate a claimed SIMD speedup?”
- “Why might SIMD not help even if the CPU supports it?”
Hints in Layers
Hint 1 (Starting Point): Start with a simple numeric loop and compare scalar vs vectorized versions conceptually.
Hint 2 (Next Level): Measure with large enough data to avoid timing noise.
Hint 3 (Technical Details): Separate aligned vs unaligned data experiments.
Hint 4 (Tools/Debugging): Use compiler reports to confirm vectorization decisions.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| SIMD fundamentals | “Optimizing Software in C++” by Agner Fog | Ch. 10 |
| Vector units | “Computer Architecture” by Hennessy and Patterson | Ch. 4 |
| Alignment | “Computer Systems: A Programmer’s Perspective” | Ch. 6 |
Project 7: Latency Budget and Tail Latency Simulator
- File: PERFORMANCE_ENGINEERING_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Go, Rust, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Latency Engineering
- Software or Tool: perf, tracing tools
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A workload simulator that produces latency distributions and demonstrates how small delays create large p99 spikes.
Why it teaches performance engineering: It connects micro-level delays to system-level tail latency.
Core challenges you’ll face:
- Modeling queueing delay (maps to latency distributions)
- Measuring p95/p99 (maps to statistics)
- Correlating spikes with system events (maps to tracing)
Key Concepts
- Tail latency: “Designing Data-Intensive Applications” - Kleppmann
- Queuing effects: “Systems Performance” - Gregg
- Latency metrics: “Site Reliability Engineering” - Beyer et al.
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Project 1 complete, basic statistics.
Real World Outcome
You will produce a report with latency histograms and a clear explanation of how queueing amplifies minor slowdowns.
Example Output:
$ ./latency_sim report
Median: 4.8 ms
p95: 9.2 ms
p99: 41.7 ms
Spike cause: queue depth burst during periodic I/O
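A toy single-server queue is enough to reproduce the effect; in the sketch below every number (arrival spacing, service time, burst shape) is made up, and the point is only that a short burst leaves a long latency tail:

```c
/* Toy single-server queue: steady arrivals plus a periodic burst.
 * Latency = time waiting behind earlier requests + service time. */
#include <stdio.h>
#include <stdlib.h>

#define REQS 100000

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    static double lat[REQS];
    double now = 0.0, server_free = 0.0;
    const double service_ms = 4.0;          /* each request takes 4 ms of work */

    for (int i = 0; i < REQS; i++) {
        /* Steady arrivals every 5 ms, plus a burst of back-to-back
         * arrivals every 1000th request (e.g. a periodic flush). */
        now += (i % 1000 < 20) ? 0.5 : 5.0;

        double start = (now > server_free) ? now : server_free;
        server_free = start + service_ms;
        lat[i] = server_free - now;          /* queueing delay + service */
    }

    qsort(lat, REQS, sizeof(double), cmp_double);
    printf("Median: %.1f ms\n", lat[REQS / 2]);
    printf("p95:    %.1f ms\n", lat[(int)(REQS * 0.95)]);
    printf("p99:    %.1f ms\n", lat[(int)(REQS * 0.99)]);
    return 0;
}
```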
The Core Question You’re Answering
“Why does p99 explode even when average latency looks fine?”
Concepts You Must Understand First
Stop and research these before coding:
- Latency percentiles
- Why are percentiles more useful than averages?
- Book Reference: “Site Reliability Engineering” Ch. 4 - Beyer et al.
- Queueing theory basics
- What happens as utilization approaches 100%?
- Book Reference: “Systems Performance” Ch. 2 - Gregg
Questions to Guide Your Design
Before implementing, think through these:
- Workload shaping
- How will you generate bursts of load?
- How will you log queue depth over time?
- Visualization
- How will you present the latency distribution clearly?
Thinking Exercise
The Slow Tail
Describe a system where 1 percent of requests take 10x longer. Explain how that affects user experience and system capacity.
Write a short narrative with numbers (median, p95, p99).
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is tail latency and why is it important?”
- “Why does queueing cause nonlinear latency growth?”
- “How do you measure and report p99?”
- “What are common sources of latency spikes?”
- “How do you reduce tail latency without overprovisioning?”
Hints in Layers
Hint 1 (Starting Point): Generate a steady workload and measure p50 and p99.
Hint 2 (Next Level): Introduce periodic bursts and observe the p99 jump.
Hint 3 (Technical Details): Record queue depth alongside latency metrics for correlation.
Hint 4 (Tools/Debugging): Use tracing to identify which stage introduces the long tail.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Tail latency | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
| Queueing effects | “Systems Performance” by Brendan Gregg | Ch. 2 |
| SLOs and latency | “Site Reliability Engineering” by Beyer et al. | Ch. 4 |
Project 8: Lock Contention and Concurrency Profiler
- File: PERFORMANCE_ENGINEERING_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Go, Rust, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 3: Advanced
- Knowledge Area: Concurrency and Contention
- Software or Tool: perf, tracing tools
- Main Book: “The Art of Multiprocessor Programming” by Herlihy and Shavit
What you’ll build: A contention diagnostic that measures lock hold time, wait time, and impact on throughput.
Why it teaches performance engineering: Concurrency issues are a leading cause of real-world latency spikes.
Core challenges you’ll face:
- Measuring lock hold time (maps to concurrency profiling)
- Visualizing contention hotspots (maps to performance attribution)
- Relating throughput drops to lock behavior (maps to system metrics)
Key Concepts
- Locks and contention: “The Art of Multiprocessor Programming”
- Profiling contention: “Systems Performance” - Gregg
- Scheduling overhead: “Operating Systems: Three Easy Pieces”
Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Projects 1-2 complete, basic concurrency.
Real World Outcome
You will produce a report that shows which locks are most contended and how they impact throughput and latency.
Example Output:
$ ./lock_prof report
Lock A: hold time 2.4 ms, wait time 8.1 ms
Lock B: hold time 0.2 ms, wait time 0.5 ms
Throughput drop: 35% under contention
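The core measurement is two timestamps around the lock: the sketch below separates wait time from hold time for a single mutex (thread count, loop counts, and the spin-loop "work" are arbitrary; per-lock aggregation and percentile reporting are left to you):

```c
/* Build: cc -g -pthread lock_prof.c -o lock_prof */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static double total_wait_ms, total_hold_ms;   /* updated while holding the lock */

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        double t0 = now_ms();
        pthread_mutex_lock(&lock);
        double t1 = now_ms();                  /* t1 - t0 = wait time */

        for (volatile int spin = 0; spin < 100000; spin++)
            ;                                  /* critical section "work" */

        double t2 = now_ms();                  /* t2 - t1 = hold time */
        total_wait_ms += t1 - t0;
        total_hold_ms += t2 - t1;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[8];
    for (int i = 0; i < 8; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 8; i++) pthread_join(t[i], NULL);
    printf("total wait: %.1f ms, total hold: %.1f ms\n",
           total_wait_ms, total_hold_ms);
    return 0;
}
```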
The Core Question You’re Answering
“Which locks are killing performance, and why?”
Concepts You Must Understand First
Stop and research these before coding:
- Lock contention
- What is the difference between hold time and wait time?
- Book Reference: “The Art of Multiprocessor Programming” Ch. 2
- Scheduling effects
- How do context switches affect latency?
- Book Reference: “Operating Systems: Three Easy Pieces” Ch. 26
Questions to Guide Your Design
Before implementing, think through these:
- Metrics
- How will you capture lock wait time accurately?
- Visualization
- How will you compare lock hotspots across runs?
Thinking Exercise
The Contention Map
Describe how you would visualize a system where 20 threads compete for the same lock. Explain how that would show up in throughput and latency metrics.
Create a simple map: threads -> lock -> wait time.
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is lock contention and how do you measure it?”
- “How does contention affect tail latency?”
- “What is the difference between throughput and latency in contention scenarios?”
- “How can you reduce contention without removing correctness?”
- “What are alternatives to heavy locks?”
Hints in Layers
Hint 1 (Starting Point): Start with a single shared lock and scale thread count.
Hint 2 (Next Level): Measure wait time and throughput at each thread count.
Hint 3 (Technical Details): Separate lock hold time from wait time in your metrics.
Hint 4 (Tools/Debugging): Use tracing to confirm where threads are blocked.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Locks and contention | “The Art of Multiprocessor Programming” | Ch. 2 |
| Profiling systems | “Systems Performance” by Brendan Gregg | Ch. 6 |
| Scheduling | “Operating Systems: Three Easy Pieces” | Ch. 26 |
Project 9: System Call and I/O Latency Profiler
- File: PERFORMANCE_ENGINEERING_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Go, Rust, Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: I/O and Syscalls
- Software or Tool: perf, strace, bpftrace
- Main Book: “The Linux Programming Interface” by Michael Kerrisk
What you’ll build: A profiler that captures syscall latency distributions and identifies slow I/O paths.
Why it teaches performance engineering: Many latency spikes come from slow syscalls and blocking I/O.
Core challenges you’ll face:
- Capturing syscall timing (maps to tracing)
- Associating syscalls with call sites (maps to profiling)
- Interpreting I/O delays (maps to system behavior)
Key Concepts
- Syscall overhead: “The Linux Programming Interface” - Kerrisk
- Tracing methodology: “Systems Performance” - Gregg
- I/O latency: “Operating Systems: Three Easy Pieces”
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Project 1 complete, basic Linux tooling.
Real World Outcome
You will have a report showing which syscalls are slow, their latency distribution, and how often they appear in real workloads.
Example Output:
$ ./syscall_prof report
Top slow syscalls:
1) read: p99 18.2 ms
2) fsync: p99 42.7 ms
3) open: p99 6.3 ms
Conclusion: fsync spikes dominate tail latency
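Tracing tools give you the full picture, but you can get a first feel for the distribution from user space; the sketch below times write()+fsync() pairs against a scratch file (file name, block size, and operation count are arbitrary):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define OPS 200

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    static double lat[OPS];
    char buf[4096];
    memset(buf, 'x', sizeof(buf));

    int fd = open("syscall_lab.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < OPS; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (write(fd, buf, sizeof(buf)) < 0) perror("write");
        if (fsync(fd) < 0) perror("fsync");      /* the usual tail-latency culprit */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        lat[i] = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }
    close(fd);

    qsort(lat, OPS, sizeof(double), cmp_double);
    printf("write+fsync  p50: %.2f ms  p95: %.2f ms  p99: %.2f ms\n",
           lat[OPS / 2], lat[(OPS * 95) / 100], lat[(OPS * 99) / 100]);
    return 0;
}
```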
The Core Question You’re Answering
“Which system calls are responsible for I/O latency spikes?”
Concepts You Must Understand First
Stop and research these before coding:
- Syscall mechanics
- What happens during a syscall transition?
- Book Reference: “The Linux Programming Interface” Ch. 3 - Kerrisk
- I/O stack
- Why does disk or network I/O dominate latency?
- Book Reference: “Operating Systems: Three Easy Pieces” Ch. 36
Questions to Guide Your Design
Before implementing, think through these:
- Trace capture
- How will you collect syscall timing with low overhead?
- Reporting
- How will you summarize p50, p95, p99?
Thinking Exercise
The Slow Disk
Describe how a single slow disk operation can inflate tail latency. Explain how you would prove it using syscall timing data.
Write a short narrative using p95 and p99 measurements.
The Interview Questions They’ll Ask
Prepare to answer these:
- “What is a syscall and why is it expensive?”
- “How do you measure syscall latency?”
- “Why does fsync cause latency spikes?”
- “How do you separate CPU time from I/O wait?”
- “What is the risk of tracing at scale?”
Hints in Layers
Hint 1 (Starting Point): Trace a single process and capture only read/write calls.
Hint 2 (Next Level): Add latency percentiles to your report.
Hint 3 (Technical Details): Correlate slow syscalls with timestamps from your workload.
Hint 4 (Tools/Debugging): Validate findings with strace or a second tracing tool.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Syscalls | “The Linux Programming Interface” by Kerrisk | Ch. 3 |
| Tracing | “Systems Performance” by Brendan Gregg | Ch. 4 |
| I/O behavior | “Operating Systems: Three Easy Pieces” | Ch. 36 |
Project 10: End-to-End Performance Regression Dashboard
- File: PERFORMANCE_ENGINEERING_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Go, Rust, Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Performance Engineering Systems
- Software or Tool: perf, flamegraphs, tracing tools
- Main Book: “Systems Performance” by Brendan Gregg
What you’ll build: A dashboard that tracks performance metrics across builds and highlights regressions with evidence.
Why it teaches performance engineering: It forces you to automate profiling and make results actionable for real teams.
Core challenges you’ll face:
- Automating profiling runs (maps to measurement discipline)
- Detecting statistically significant regressions (maps to benchmarking)
- Presenting root cause clues (maps to flamegraph analysis)
Key Concepts
- Regression detection: “Systems Performance” - Gregg
- Statistical testing: “High Performance Python” - Gorelick and Ozsvald
- Profiling workflow: “Performance Analysis and Tuning on Modern CPUs” - Bakhvalov
Difficulty: Advanced. Time estimate: 1 month+. Prerequisites: Projects 1-3 complete, basic data visualization.
Real World Outcome
You will have a dashboard-like report that shows performance trends across versions, flags regressions, and links to the profiling evidence that explains them.
Example Output:
$ ./perf_dashboard report
Version: v1.4.2 -> v1.4.3
Regression: +18% median latency
Hotspot shift: parse_input increased from 22% to 41%
Evidence: reports/v1.4.3_flamegraph.svg
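The comparison step can start very small; this sketch (file format, threshold, and sample cap are all assumptions) reads baseline and current run times, compares medians, and exits nonzero when the change exceeds a noise threshold so CI can flag it:

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Reads one millisecond value per line and returns the median. */
static double median_from_file(const char *path)
{
    double v[1024];
    int n = 0;
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); exit(1); }
    while (n < 1024 && fscanf(f, "%lf", &v[n]) == 1)
        n++;
    fclose(f);
    if (n == 0) { fprintf(stderr, "%s: no samples\n", path); exit(1); }
    qsort(v, n, sizeof(double), cmp_double);
    return v[n / 2];
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s baseline.txt current.txt\n", argv[0]);
        return 2;
    }
    double base  = median_from_file(argv[1]);
    double cur   = median_from_file(argv[2]);
    double delta = 100.0 * (cur - base) / base;

    /* A 5 percent threshold is arbitrary; tune it to the noise you measured. */
    printf("median: %.1f ms -> %.1f ms (%+.1f%%)\n", base, cur, delta);
    if (delta > 5.0) {
        printf("REGRESSION flagged\n");
        return 1;
    }
    return 0;
}
```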
The Core Question You’re Answering
“How do I prevent performance regressions from silently reaching production?”
Concepts You Must Understand First
Stop and research these before coding:
- Regression detection
- What qualifies as a statistically significant slowdown?
- Book Reference: “High Performance Python” Ch. 1 - Gorelick and Ozsvald
- Automated profiling
- How do you integrate profiling into CI without excessive overhead?
- Book Reference: “Systems Performance” Ch. 2 - Gregg
Questions to Guide Your Design
Before implementing, think through these:
- Signal vs noise
- How will you avoid false positives?
- Evidence collection
- How will you attach profiling data to regressions?
Thinking Exercise
The False Regression
Describe a case where a performance regression alert triggers, but the underlying cause is measurement noise. Explain how you would verify and dismiss it.
List two signals that confirm it is noise.
The Interview Questions They’ll Ask
Prepare to answer these:
- “How do you detect performance regressions in CI?”
- “What is the difference between a regression and natural variance?”
- “How do you attach evidence to a performance alert?”
- “How do you avoid alert fatigue in performance monitoring?”
- “What is a safe rollback plan for performance issues?”
Hints in Layers
Hint 1 (Starting Point): Start with a single workload and record its baseline in a file.
Hint 2 (Next Level): Add a comparison step that flags changes beyond a threshold.
Hint 3 (Technical Details): Attach a flamegraph or perf report to each flagged run.
Hint 4 (Tools/Debugging): Validate regressions by repeating runs on pinned CPU cores.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Regression workflows | “Systems Performance” by Brendan Gregg | Ch. 2 |
| Benchmarking statistics | “High Performance Python” by Gorelick and Ozsvald | Ch. 1 |
| Profiling tooling | “Performance Analysis and Tuning on Modern CPUs” by Bakhvalov | Ch. 1-3 |
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| Performance Baseline Lab | Beginner | Weekend | Medium | Medium |
| perf + Flamegraph Investigator | Intermediate | 1-2 weeks | High | High |
| GDB + Core Dump Performance Autopsy | Intermediate | 1-2 weeks | High | Medium |
| Cache Locality Visualizer | Advanced | 1-2 weeks | High | High |
| Branch Predictor and Pipeline Lab | Advanced | 1-2 weeks | High | Medium |
| SIMD Throughput Explorer | Expert | 1 month+ | Very High | High |
| Latency Budget and Tail Latency Simulator | Intermediate | 1-2 weeks | High | High |
| Lock Contention and Concurrency Profiler | Advanced | 1-2 weeks | High | Medium |
| System Call and I/O Latency Profiler | Intermediate | 1-2 weeks | High | Medium |
| End-to-End Performance Regression Dashboard | Advanced | 1 month+ | Very High | High |
Recommendation
Start with Project 1 (Performance Baseline Lab) to build measurement discipline, then move to Project 2 (perf + Flamegraph Investigator) to learn attribution. After that, choose based on your interest:
- Hardware focus: Projects 4-6
- Debugging focus: Project 3
- Latency focus: Projects 7-9
Final Overall Project
Project: Full-Stack Performance Engineering Field Manual
- File: PERFORMANCE_ENGINEERING_PROJECTS.md
- Main Programming Language: C
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: Performance Engineering Systems
- Software or Tool: perf, flamegraphs, tracing tools, GDB
- Main Book: “Systems Performance” by Brendan Gregg
What you’ll build: A comprehensive performance toolkit that benchmarks, profiles, and produces a structured report with root-cause hypotheses and next actions.
Why it teaches performance engineering: It integrates measurement, profiling, hardware reasoning, and debugging into a single repeatable workflow.
Core challenges you’ll face:
- Designing a consistent experiment methodology
- Automating profiling and tracing capture
- Producing a clear narrative from raw metrics
Key Concepts
- Methodology: “Systems Performance” - Gregg
- CPU behavior: “Performance Analysis and Tuning on Modern CPUs” - Bakhvalov
- Debugging: “The Linux Programming Interface” - Kerrisk
Difficulty: Expert. Time estimate: 1 month+. Prerequisites: Projects 1-9 complete.
Summary
| Project | Focus | Outcome |
|---|---|---|
| Performance Baseline Lab | Measurement discipline | Reliable baselines and variance control |
| perf + Flamegraph Investigator | Profiling | Attribution of CPU hotspots |
| GDB + Core Dump Performance Autopsy | Debugging | Post-mortem performance analysis |
| Cache Locality Visualizer | Cache behavior | Evidence of locality effects |
| Branch Predictor and Pipeline Lab | CPU pipeline | Measured misprediction penalties |
| SIMD Throughput Explorer | Vectorization | Proven SIMD speedups and limits |
| Latency Budget and Tail Latency Simulator | Tail latency | Latency distribution understanding |
| Lock Contention and Concurrency Profiler | Concurrency | Contention hotspots and mitigation |
| System Call and I/O Latency Profiler | I/O latency | Slow syscall identification |
| End-to-End Performance Regression Dashboard | Regression detection | Automated performance monitoring |
| Full-Stack Performance Engineering Field Manual | Integration | A complete performance workflow |