
PERFORMANCE ENGINEERING PROJECTS

Sprint 2: Data & Invariants — Project Recommendations

Goal

After completing these projects, you will understand performance engineering as a full-stack discipline: how programs actually consume CPU cycles, memory bandwidth, cache capacity, and I/O time, and how to turn that understanding into measurable latency and throughput improvements.

You will internalize:

  • How to profile reality: sampling vs tracing, perf events, flamegraphs, and how to interpret them without guessing
  • How to debug performance bugs: GDB, core dumps, and post-mortem analysis of hot paths and stalls
  • How hardware really behaves: cache hierarchies, branch prediction, SIMD lanes, and why micro-optimizations sometimes work and often do not
  • How to optimize latency: tail latency, queuing effects, lock contention, and jitter sources
  • How to validate improvements: benchmarking discipline, statistical rigor, and regression detection

By the end, you will be able to explain why a system is slow, prove it with data, and implement targeted changes that predictably move real performance metrics.


Foundational Concepts: The Five Pillars

1. Measurement First — You Cannot Optimize What You Cannot Measure

┌─────────────────────────────────────────────────────────┐
│  PERFORMANCE ENGINEERING STARTS WITH MEASUREMENT       │
│                                                         │
│  You must know:                                         │
│  • What is slow                                          │
│  • Where time is spent                                   │
│  • Which resource is saturated                           │
│  • How variability changes your conclusions              │
└─────────────────────────────────────────────────────────┘

2. The Performance Triangle — CPU, Memory, I/O

        ┌───────────┐
        │    CPU    │  Compute cycles, pipeline stalls
        └─────┬─────┘
              │
              │
┌─────────────┼─────────────┐
│           MEMORY          │  Cache misses, bandwidth
└─────────────┼─────────────┘
              │
              │
        ┌─────┴─────┐
        │    I/O    │  Disk, network, syscalls
        └───────────┘

3. The Latency Stack — Why Small Delays Add Up

Request Latency =
  Queueing delay
+ Scheduling delay
+ CPU execution time
+ Cache miss penalties
+ Syscall overhead
+ I/O wait
+ Lock contention

4. Hardware Reality — Caches, Pipelines, SIMD

CPU Core
  ├─ L1 Cache (tiny, fastest)
  ├─ L2 Cache (larger, slower)
  ├─ L3 Cache (shared, much slower)
  └─ DRAM (massive, very slow)

SIMD: One instruction, many data lanes

5. Debugging Performance — Finding the Real Cause

SYMPTOM: "Latency spikes"
  ↓
MEASUREMENT: "p99 jumps from 5 ms to 80 ms"
  ↓
PROFILE: "Stack shows lock contention"
  ↓
ROOT CAUSE: "Mutex held during disk write"

Core Concept Analysis

The Measurement Stack

┌───────────────┐   perf stat / perf record
│  Counters     │ → CPU cycles, cache misses, branch misses
└───────┬───────┘
        │
┌───────┴───────┐   Flamegraphs
│   Samples     │ → Which code paths consume time
└───────┬───────┘
        │
┌───────┴───────┐   Traces
│    Events     │ → Syscalls, scheduling, I/O timing
└───────────────┘

The Cache and Memory Model

Time Cost (approximate)
┌─────────────────────────────┐
│ Register        ~1 cycle    │
│ L1 Cache        ~4 cycles   │
│ L2 Cache        ~12 cycles  │
│ L3 Cache        ~40 cycles  │
│ DRAM            100+ cycles │
└─────────────────────────────┘

SIMD and Vectorization

Scalar:  A1 A2 A3 A4 -> one per instruction
SIMD:    [A1 A2 A3 A4] -> one instruction, 4 lanes

Debugging and Post-Mortem Analysis

Crash or stall -> Core dump -> GDB analysis -> Identify hot or blocked threads

Concept Summary Table

| Concept Cluster | What You Must Internalize |
| --- | --- |
| Measurement Discipline | You need repeatable benchmarks and correct statistics before trusting any optimization. |
| Profiling and Tracing | Sampling shows where time is spent; tracing shows why and when it happened. |
| CPU and Cache Behavior | Performance depends on cache locality, branch prediction, and memory access patterns. |
| SIMD and Vectorization | Wide data paths can multiply throughput, but only with aligned, predictable data. |
| Latency Engineering | Tail latency is a system property, not just a code property. |
| Debugging for Performance | GDB and core dumps can reveal blocked threads, stalls, and deadlocks. |

Deep Dive Reading By Concept

Concept 1: Measurement and Benchmarking

| Book | Chapter/Section | What You’ll Learn | Priority |
| --- | --- | --- | --- |
| “Systems Performance” by Brendan Gregg | Ch. 2: Methodology | How to structure a performance investigation | Essential |
| “High Performance Python” by Gorelick and Ozsvald | Ch. 1: Benchmarking | Measuring accurately and avoiding false conclusions | Recommended |

Concept 2: Profiling and Flamegraphs

| Book | Chapter/Section | What You’ll Learn | Priority |
| --- | --- | --- | --- |
| “Systems Performance” by Brendan Gregg | Ch. 6: CPUs | CPU profiling methods and tools | Essential |
| “Performance Analysis and Tuning on Modern CPUs” by Bakhvalov | Ch. 1-3: Microarchitecture | How microarchitecture affects profiling results | Recommended |

Concept 3: Cache and Memory

| Book | Chapter/Section | What You’ll Learn | Priority |
| --- | --- | --- | --- |
| “Computer Systems: A Programmer’s Perspective” by Bryant and O’Hallaron | Ch. 6: Memory Hierarchy | Cache behavior and memory access patterns | Essential |
| “What Every Programmer Should Know About Memory” by Drepper | Sections 2-6 | Cache levels, bandwidth, and latency | Recommended |

Concept 4: SIMD and Vectorization

| Book | Chapter/Section | What You’ll Learn | Priority |
| --- | --- | --- | --- |
| “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson | Ch. 4: Data-Level Parallelism | Vector units and throughput | Recommended |
| “Optimizing Software in C++” by Agner Fog | Ch. 10: Vectorization | SIMD principles and pitfalls | Essential |

Concept 5: Latency Engineering

| Book | Chapter/Section | What You’ll Learn | Priority |
| --- | --- | --- | --- |
| “Designing Data-Intensive Applications” by Kleppmann | Ch. 8: The Trouble with Distributed Systems | Tail latency and queuing | Recommended |
| “Site Reliability Engineering” by Beyer et al. | Ch. 4: Service Level Objectives | Latency targets and error budgets | Recommended |

Project List

Projects are ordered from foundational profiling to advanced hardware and latency optimization.


Project 1: Performance Baseline Lab

  • File: PERFORMANCE_ENGINEERING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Benchmarking and Measurement
  • Software or Tool: perf, time, taskset
  • Main Book: “Systems Performance” by Brendan Gregg

What you’ll build: A repeatable benchmarking harness that measures runtime, variance, and CPU usage for a set of micro-tasks.

Why it teaches performance engineering: Without a reliable baseline, every optimization is a guess. This project forces you to build measurement discipline first.

Core challenges you’ll face:

  • Designing repeatable workloads (maps to measurement methodology)
  • Handling variance and noise (maps to statistics and sampling)
  • Recording and comparing runs (maps to regression detection)

Key Concepts

  • Benchmarking noise: “Systems Performance” - Brendan Gregg
  • Measurement bias: “High Performance Python” - Gorelick and Ozsvald
  • Repeatability: “The Practice of Programming” - Kernighan and Pike

Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Basic C, command-line usage.


Real World Outcome

You will have a CLI tool that runs a defined workload N times, reports min/median/p99 duration, and stores results in a simple text report. When you run it, you will see a clear comparison between baseline and changed versions.

Example Output:

$ ./perf_lab run --workload "memcopy" --iters 50
Workload: memcopy
Runs: 50
Min: 12.4 ms
Median: 13.1 ms
p99: 14.9 ms
CPU cycles: 3.2e9
Cache misses: 1.1e7
Saved report: reports/memcopy_2025-01-10.txt
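
A minimal sketch of the timing core, assuming a Linux/glibc environment with clock_gettime; run_workload() is a placeholder you would swap for your own micro-task, and the CLI flags shown above are illustrative, not implemented here.

```c
/* bench.c - minimal repeated-timing sketch (POSIX clock_gettime). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void run_workload(void) {
    volatile unsigned long sum = 0;            /* volatile keeps the loop from being optimized away */
    for (unsigned long i = 0; i < 10UL * 1000 * 1000; i++) sum += i;
}

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void) {
    enum { ITERS = 50 };
    double samples[ITERS];

    run_workload();                            /* one warm-up run before measuring */
    for (int i = 0; i < ITERS; i++) {
        double start = now_ms();
        run_workload();
        samples[i] = now_ms() - start;
    }
    qsort(samples, ITERS, sizeof samples[0], cmp_double);
    printf("min:    %.2f ms\n", samples[0]);
    printf("median: %.2f ms\n", samples[ITERS / 2]);
    printf("p99:    %.2f ms\n", samples[(int)(ITERS * 0.99)]);  /* with 50 runs this is effectively the worst run */
    return 0;
}
```

Build with gcc -O2 bench.c -o bench and pin it with taskset -c 0 ./bench to reduce scheduler noise before comparing reports.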

The Core Question You’re Answering

“How do I know if a change actually made performance better?”

Before you write any code, sit with this question. Most optimization efforts fail because there is no trustworthy baseline or measurement process.


Concepts You Must Understand First

Stop and research these before coding:

  1. Benchmarking discipline
    • What makes a benchmark repeatable?
    • How does CPU frequency scaling affect results?
    • Why does warm-up matter?
    • Book Reference: “Systems Performance” Ch. 2 - Brendan Gregg
  2. Variance and noise
    • What are outliers and how do they distort averages?
    • Why is p99 more informative than the mean for latency?
    • Book Reference: “High Performance Python” Ch. 1 - Gorelick and Ozsvald

Questions to Guide Your Design

Before implementing, think through these:

  1. Workload definition
    • How will you isolate CPU-bound vs memory-bound tasks?
    • What inputs represent realistic sizes?
    • How will you keep input stable across runs?
  2. Metrics and reporting
    • Which metrics are most meaningful for your workload?
    • How will you store historical baselines?

Thinking Exercise

The Baseline Trap

Before coding, describe a scenario where an optimization appears faster but is actually noise. Write down three sources of variability (scheduler, CPU turbo, cache warmth) and how each could mislead your conclusion.

Describe: "Before" vs "After" results and why they are not comparable.
List: 3 variability sources and how you would control them.

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do you design a trustworthy benchmark?”
  2. “Why is median better than mean for latency?”
  3. “What is p99 and why does it matter?”
  4. “How can CPU frequency scaling distort a benchmark?”
  5. “How do you detect regression over time?”

Hints in Layers

Hint 1: Starting Point Pick two micro-workloads: one CPU-bound (simple arithmetic loop), one memory-bound (array scan). Only then add metrics.

Hint 2: Next Level Run each workload multiple times with CPU pinned to a single core. Compare median and p99.

Hint 3: Technical Details Record both wall-clock time and CPU counters. Store results with timestamps for comparison.

Hint 4: Tools/Debugging Use perf stat to validate your counters, then compare them to your stored reports.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Benchmarking methodology | “Systems Performance” by Brendan Gregg | Ch. 2 |
| Measurement pitfalls | “High Performance Python” by Gorelick and Ozsvald | Ch. 1 |
| Debugging experiments | “The Practice of Programming” by Kernighan and Pike | Ch. 5 |

Project 2: perf + Flamegraph Investigator

  • File: PERFORMANCE_ENGINEERING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Profiling and Flamegraphs
  • Software or Tool: perf, flamegraph scripts
  • Main Book: “Systems Performance” by Brendan Gregg

What you’ll build: A repeatable workflow that profiles a workload, generates flamegraphs, and annotates hotspots with causes.

Why it teaches performance engineering: Flamegraphs force you to attribute time to exact code paths, not guesses.

Core challenges you’ll face:

  • Collecting reliable samples (maps to profiling fundamentals)
  • Interpreting flamegraph width vs depth (maps to call stack attribution)
  • Separating CPU time from I/O wait (maps to tracing vs sampling)

Key Concepts

  • Profiling methodology: “Systems Performance” - Brendan Gregg
  • Stack sampling: “Performance Analysis and Tuning on Modern CPUs” - Bakhvalov
  • Flamegraph interpretation: Brendan Gregg blog (conceptual reference)

Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Project 1 complete, basic Linux tooling.


Real World Outcome

You will produce a flamegraph report for a real workload and a short narrative of the top three hotspots and their likely causes. You will also have a repeatable command sequence for re-running the analysis after changes.

Example Output:

$ ./profile_run.sh
Profile captured: profiles/run_2025-01-10.data
Flamegraph: reports/run_2025-01-10.svg
Top hotspots:
1) parse_input -> 38%
2) hash_lookup -> 21%
3) serialize_output -> 14%
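
One way to practice attribution is to profile a workload whose time split you already know. Below is a sketch (the function names heavy_path and light_path are invented for illustration) that does roughly three units of work in one path for every one unit in another; if the flamegraph does not show roughly a 3:1 width ratio, suspect missing symbols or truncated stacks rather than a real surprise.

```c
/* known_split.c - a workload with a roughly known 3:1 CPU split,
 * handy for sanity-checking what a sampling profiler attributes.
 * Build with frame pointers so stacks unwind cleanly:
 *   gcc -O2 -g -fno-omit-frame-pointer known_split.c -o known_split */
#include <stdio.h>

static volatile unsigned long sink;            /* defeats dead-code elimination */

static void spin(unsigned long iters) {
    unsigned long s = 0;
    for (unsigned long i = 0; i < iters; i++) s += i * i;
    sink = s;
}

static void heavy_path(void) { spin(30UL * 1000 * 1000); }   /* ~3x the work */
static void light_path(void) { spin(10UL * 1000 * 1000); }

int main(void) {
    for (int i = 0; i < 200; i++) {
        heavy_path();
        light_path();
    }
    printf("done, sink=%lu\n", sink);
    return 0;
}
```

A typical capture would be perf record -g ./known_split followed by the flamegraph scripts you install; the exact script names depend on your tooling.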

The Core Question You’re Answering

“Which specific functions are consuming the most CPU time, and why?”


Concepts You Must Understand First

Stop and research these before coding:

  1. Sampling vs tracing
    • What does a sample represent?
    • Why can sampling miss short functions?
    • Book Reference: “Systems Performance” Ch. 6 - Brendan Gregg
  2. Call stacks and attribution
    • How do stack frames map to time?
    • What does a wide flamegraph bar mean?
    • Book Reference: “Performance Analysis and Tuning on Modern CPUs” Ch. 1 - Bakhvalov

Questions to Guide Your Design

Before implementing, think through these:

  1. Profiling configuration
    • What sample rate balances overhead vs accuracy?
    • How will you ensure symbols are available for stacks?
  2. Interpreting results
    • How will you differentiate hotspots that are expected from unexpected?

Thinking Exercise

The Misleading Hotspot

Describe a case where a function appears hot in a flamegraph, but optimizing it would not reduce total latency. Write down how you would verify whether it is truly the bottleneck.

Explain why a hotspot might be a symptom rather than a cause.
List two additional signals you would check.

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is a flamegraph and what does width represent?”
  2. “How do you choose a sampling rate for perf?”
  3. “Why can a function appear hot but not be the real bottleneck?”
  4. “How do you separate CPU-bound work from I/O wait?”
  5. “What are the limitations of sampling profilers?”

Hints in Layers

Hint 1: Starting Point Profile a single workload from Project 1 to keep scope small.

Hint 2: Next Level Run multiple profiles and compare stability of hotspots.

Hint 3: Technical Details Use a separate step to symbolize and render flamegraphs so you can re-run without re-profiling.

Hint 4: Tools/Debugging Validate results with a second profiler to cross-check hotspots.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| CPU profiling | “Systems Performance” by Brendan Gregg | Ch. 6 |
| Microarchitecture basics | “Performance Analysis and Tuning on Modern CPUs” by Bakhvalov | Ch. 1-3 |
| Profiling practices | “Optimizing Software in C++” by Agner Fog | Ch. 2 |

Project 3: GDB + Core Dump Performance Autopsy

  • File: PERFORMANCE_ENGINEERING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Debugging and Post-Mortem Analysis
  • Software or Tool: GDB, core dumps
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you’ll build: A repeatable crash-and-stall investigation workflow using core dumps and GDB to identify blocked threads and hot loops.

Why it teaches performance engineering: Post-mortem analysis is essential when performance issues only happen in production.

Core challenges you’ll face:

  • Capturing useful core dumps (maps to process state capture)
  • Reading thread backtraces (maps to stack and scheduler awareness)
  • Correlating stack state with latency symptoms (maps to root-cause analysis)

Key Concepts

  • Core dump generation: “The Linux Programming Interface” - Kerrisk
  • GDB backtrace reasoning: “Debugging with GDB” - FSF
  • Thread states: “Operating Systems: Three Easy Pieces” - Arpaci-Dusseau

Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Project 1 complete, basic GDB usage.


Real World Outcome

You will have a written post-mortem that identifies the exact thread and function responsible for a synthetic stall. You will also have a checklist for capturing core dumps and analyzing them in GDB.

Example Output:

$ ./autopsy.sh --core core.1234
Loaded core: core.1234
Threads: 8
Hot thread: TID 7
Top frame: lock_acquire
Evidence: 3 threads waiting on same mutex
Conclusion: contention in request queue
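
A small synthetic stall you can autopsy, assuming Linux with pthreads and core dumps enabled (e.g. ulimit -c unlimited); the program name and thread count are arbitrary.

```c
/* stall_demo.c - synthetic stall: worker threads block on a mutex the main
 * thread holds; abort() then produces a core file to examine in GDB.
 * Build: gcc -O0 -g -pthread stall_demo.c -o stall_demo */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);      /* these threads pile up waiting here */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t tids[4];
    pthread_mutex_lock(&lock);      /* main thread holds the lock... */
    for (int i = 0; i < 4; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    sleep(2);                       /* ...long enough for the workers to block */
    abort();                        /* dump core while the stall is visible */
}
```

Loading the resulting core into GDB and running thread apply all bt should show the worker threads parked inside pthread_mutex_lock while the main thread still owns the mutex, which is exactly the evidence pattern the example report above describes.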

The Core Question You’re Answering

“How do I explain a performance incident when I cannot reproduce it?”


Concepts You Must Understand First

Stop and research these before coding:

  1. Core dumps and process state
    • What is captured in a core dump?
    • How do you map memory to symbols?
    • Book Reference: “The Linux Programming Interface” Ch. 20 - Kerrisk
  2. Thread scheduling and blocking
    • How do you identify blocked threads?
    • What does a waiting thread look like in backtrace?
    • Book Reference: “Operating Systems: Three Easy Pieces” Ch. 26 - Arpaci-Dusseau

Questions to Guide Your Design

Before implementing, think through these:

  1. Reproducible stall scenario
    • What controllable condition will cause blocking?
    • How will you capture a core at the right moment?
  2. Analysis workflow
    • Which thread clues indicate a hot loop vs a lock wait?

Thinking Exercise

The Frozen System

Write a narrative of how you would distinguish between a CPU-bound loop and a thread deadlock using only a core dump and backtraces.

List the signals you would look for in stack traces and thread states.

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is a core dump and how do you use it?”
  2. “How can you diagnose lock contention from a core dump?”
  3. “How do you identify a hot loop in GDB?”
  4. “What makes performance bugs hard to reproduce?”
  5. “How do you capture a core dump safely in production?”

Hints in Layers

Hint 1: Starting Point Create a workload that deliberately blocks on a lock and capture a core at peak stall.

Hint 2: Next Level Compare thread backtraces to see which thread is holding the lock and which are waiting.

Hint 3: Technical Details Use thread backtrace summaries to identify hot frames and blocked frames.

Hint 4: Tools/Debugging Write a short checklist for core capture, symbol loading, and triage steps.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Core dumps | “The Linux Programming Interface” by Kerrisk | Ch. 20 |
| Debugging practice | “Debugging with GDB” by FSF | Ch. 7 |
| Thread states | “Operating Systems: Three Easy Pieces” by Arpaci-Dusseau | Ch. 26 |

Project 4: Cache Locality Visualizer

  • File: PERFORMANCE_ENGINEERING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust, Zig
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: CPU Caches and Memory Access
  • Software or Tool: perf, cachegrind (optional)
  • Main Book: “Computer Systems: A Programmer’s Perspective”

What you’ll build: A set of experiments that visualize how data layout and access patterns change cache miss rates and runtime.

Why it teaches performance engineering: Cache behavior is often the biggest performance lever in real systems.

Core challenges you’ll face:

  • Designing access patterns (maps to spatial and temporal locality)
  • Measuring cache miss rates (maps to hardware counters)
  • Connecting layout decisions to runtime impact (maps to memory hierarchy)

Key Concepts

  • Memory hierarchy: “CS:APP” - Bryant and O’Hallaron
  • Cache effects: “What Every Programmer Should Know About Memory” - Drepper
  • Data layout: “Optimizing Software in C++” - Agner Fog

Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Projects 1-2 complete, basic understanding of caches.


Real World Outcome

You will produce a report with side-by-side timing and cache-miss data for different data layouts and access patterns, plus a short explanation of which pattern wins and why.

Example Output:

$ ./cache_lab report
Pattern A (row-major): 120 ms, L1 miss rate 2%
Pattern B (column-major): 940 ms, L1 miss rate 38%
Conclusion: stride access destroys cache locality
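
A sketch of the simplest version of this experiment, assuming gcc and perf are available; the array size and the perf event names are illustrative and should be tuned to your machine's cache sizes.

```c
/* cache_walk.c - same work, two traversal orders over a row-major matrix.
 * Build: gcc -O2 cache_walk.c -o cache_walk
 * Compare: perf stat -e cache-misses,cycles ./cache_walk
 *          perf stat -e cache-misses,cycles ./cache_walk col */
#include <stdio.h>
#include <string.h>

#define N 4096                       /* 4096*4096 ints = 64 MiB, larger than most L3 caches */
static int a[N][N];

int main(int argc, char **argv) {
    long long sum = 0;
    int col = argc > 1 && strcmp(argv[1], "col") == 0;

    for (int i = 0; i < N; i++)      /* touch every element once so pages are mapped */
        for (int j = 0; j < N; j++)
            a[i][j] = i + j;

    if (col) {                       /* column-major walk: stride of N ints per access */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];
    } else {                         /* row-major walk: consecutive addresses, cache-line friendly */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
    }
    printf("%s-major sum=%lld\n", col ? "column" : "row", sum);
    return 0;
}
```

The runtime gap between the two invocations should track the cache-miss gap perf reports, which is the core evidence the project asks for.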

The Core Question You’re Answering

“Why does the same algorithm run 8x slower just by changing data layout?”


Concepts You Must Understand First

Stop and research these before coding:

  1. Cache hierarchy
    • What is the cost difference between L1 and DRAM?
    • How do cache lines work?
    • Book Reference: “CS:APP” Ch. 6 - Bryant and O’Hallaron
  2. Locality
    • What is spatial vs temporal locality?
    • How does stride size affect misses?
    • Book Reference: Drepper, Sections 2-3

Questions to Guide Your Design

Before implementing, think through these:

  1. Experiment design
    • How will you isolate access patterns from computation cost?
    • What data sizes cross cache boundaries?
  2. Measurement
    • Which counters best indicate cache pressure?

Thinking Exercise

The Cache Line Walk

Draw a grid of memory addresses for a 2D array. Mark which cache lines are touched when traversing row-major vs column-major. Explain why one pattern reuses cache lines and the other does not.

Sketch: memory layout of a 4x4 array and show access order.

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is a cache line and why does it matter?”
  2. “Explain spatial vs temporal locality.”
  3. “Why is column-major traversal slow in C, where 2D arrays are stored row-major?”
  4. “How do you measure cache misses with perf?”
  5. “How can data layout dominate algorithmic complexity?”

Hints in Layers

Hint 1: Starting Point Start with a 2D array scan and compare two traversal orders.

Hint 2: Next Level Scale the array so it crosses L1, then L2, then L3 sizes.

Hint 3: Technical Details Record cache miss counters alongside runtime for each test.

Hint 4: Tools/Debugging Use perf stat with cache events to validate your assumptions.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Memory hierarchy | “Computer Systems: A Programmer’s Perspective” | Ch. 6 |
| Cache behavior | “What Every Programmer Should Know About Memory” | Sections 2-4 |
| Data layout | “Optimizing Software in C++” by Agner Fog | Ch. 5 |

Project 5: Branch Predictor and Pipeline Lab

  • File: PERFORMANCE_ENGINEERING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust, Zig
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: CPU Microarchitecture
  • Software or Tool: perf, cpu topology tools
  • Main Book: “Computer Architecture: A Quantitative Approach”

What you’ll build: A set of controlled experiments that show branch misprediction penalties and pipeline effects.

Why it teaches performance engineering: Branch mispredictions are invisible in code review but dominate CPU time.

Core challenges you’ll face:

  • Creating predictable vs unpredictable branches (maps to branch prediction)
  • Measuring misprediction rates (maps to hardware counters)
  • Relating pipeline stalls to latency (maps to CPU execution)

Key Concepts

  • Branch prediction: “Computer Architecture” - Hennessy and Patterson
  • Pipeline stalls: “Performance Analysis and Tuning on Modern CPUs” - Bakhvalov
  • CPU counters: “Systems Performance” - Brendan Gregg

Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Projects 1-2 complete, basic CPU architecture.


Real World Outcome

You will produce a report that shows how predictable branches run faster and how mispredictions increase cycles per instruction.

Example Output:

$ ./branch_lab report
Predictable branch: 1.2 cycles/instruction, 2% mispredict
Unpredictable branch: 3.8 cycles/instruction, 41% mispredict
Conclusion: branch predictability dominates throughput
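
A minimal sketch of the two-loops idea, assuming gcc on Linux; the data values and pass counts are arbitrary, and the working set is kept small so memory effects do not dominate the branch effect.

```c
/* branch_lab.c - identical work, different branch predictability.
 * Build: gcc -O1 branch_lab.c -o branch_lab
 *        (-O2/-O3 may replace the branch with a conditional move, hiding the effect)
 * Compare: perf stat -e branches,branch-misses ./branch_lab sorted
 *          perf stat -e branches,branch-misses ./branch_lab random */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N (1 << 16)                              /* 64K ints: small enough to stay cached */

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv) {
    int *data = malloc(N * sizeof *data);
    long long sum = 0;

    srand(42);                                   /* fixed seed: repeatable runs */
    for (int i = 0; i < N; i++) data[i] = rand() % 256;
    if (argc > 1 && strcmp(argv[1], "sorted") == 0)
        qsort(data, N, sizeof *data, cmp_int);   /* sorted input -> predictable branch */

    for (int pass = 0; pass < 2000; pass++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)                  /* ~50% taken; ordering decides predictability */
                sum += data[i];

    printf("sum=%lld\n", sum);
    free(data);
    return 0;
}
```

The sorted and random runs do exactly the same arithmetic; any cycle and CPI difference is attributable to branch misses.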

The Core Question You’re Answering

“Why does a tiny conditional statement slow down an entire loop?”


Concepts You Must Understand First

Stop and research these before coding:

  1. Branch prediction
    • What does the CPU predict and when?
    • What happens on misprediction?
    • Book Reference: “Computer Architecture” Ch. 3 - Hennessy and Patterson
  2. Pipeline execution
    • How do pipeline stalls reduce throughput?
    • Book Reference: “Performance Analysis and Tuning on Modern CPUs” Ch. 2 - Bakhvalov

Questions to Guide Your Design

Before implementing, think through these:

  1. Experiment design
    • How will you control input distributions to change predictability?
    • How will you isolate branch cost from memory cost?
  2. Metrics
    • Which counters best indicate misprediction impact?

Thinking Exercise

The Pipeline Flush

Describe the sequence of events when a branch is mispredicted. Explain how many stages are wasted and why that reduces instruction throughput.

Write a step-by-step narrative of a misprediction penalty.

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is branch prediction and why does it matter?”
  2. “What is a pipeline stall?”
  3. “How does misprediction affect CPI?”
  4. “How do you measure branch misses in perf?”
  5. “When does branch prediction not matter?”

Hints in Layers

Hint 1: Starting Point Create two loops with identical work but different branch predictability.

Hint 2: Next Level Use perf to compare branch miss rates and CPI.

Hint 3: Technical Details Keep data in cache to avoid confounding memory effects.

Hint 4: Tools/Debugging Validate with multiple runs to ensure stability.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Branch prediction | “Computer Architecture” by Hennessy and Patterson | Ch. 3 |
| Pipeline stalls | “Performance Analysis and Tuning on Modern CPUs” by Bakhvalov | Ch. 2 |
| CPU counters | “Systems Performance” by Brendan Gregg | Ch. 6 |

Project 6: SIMD Throughput Explorer

  • File: PERFORMANCE_ENGINEERING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: C++, Rust, Zig
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: SIMD and Vectorization
  • Software or Tool: perf, compiler vectorization reports
  • Main Book: “Optimizing Software in C++” by Agner Fog

What you’ll build: A set of experiments showing scalar vs SIMD throughput for numeric workloads.

Why it teaches performance engineering: SIMD is one of the most powerful (and misunderstood) sources of speedup.

Core challenges you’ll face:

  • Structuring data for vectorization (maps to alignment and layout)
  • Measuring speedup correctly (maps to measurement discipline)
  • Handling edge cases where SIMD fails (maps to control flow divergence)

Key Concepts

  • SIMD fundamentals: “Optimizing Software in C++” - Agner Fog
  • Vector units: “Computer Architecture” - Hennessy and Patterson
  • Alignment: “CS:APP” - Bryant and O’Hallaron

Difficulty: Expert. Time estimate: 1 month+. Prerequisites: Projects 1, 2, and 4 complete; basic CPU architecture.


Real World Outcome

You will produce a side-by-side report showing the throughput difference between scalar and vectorized workloads, with evidence of alignment and data layout effects.

Example Output:

$ ./simd_lab report
Scalar: 1.0x baseline
SIMD:   3.7x speedup
Aligned data: 4.2x speedup
Unaligned data: 2.1x speedup
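
A sketch that relies on compiler auto-vectorization rather than hand-written intrinsics; the flags shown are GCC-specific and the buffer sizes are illustrative. Timing the two binaries with the Project 1 harness gives the scalar-vs-SIMD comparison.

```c
/* simd_lab.c - a kernel the compiler can auto-vectorize (SAXPY-style).
 * Scalar build:     gcc -O2 -fno-tree-vectorize simd_lab.c -o scalar
 * Vectorized build: gcc -O3 -march=native -fopt-info-vec simd_lab.c -o simd
 * The -fopt-info-vec report tells you whether the loop really vectorized. */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)
#define REPS 500

int main(void) {
    /* 64-byte alignment matches the widest common vectors; smaller vectors need less */
    float *x = aligned_alloc(64, N * sizeof *x);
    float *y = aligned_alloc(64, N * sizeof *y);
    if (!x || !y) return 1;
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            y[i] = 2.0f * x[i] + y[i];       /* independent lanes: ideal for SIMD */

    printf("y[0]=%f y[N-1]=%f\n", y[0], y[N - 1]);   /* keep the result live */
    free(x); free(y);
    return 0;
}
```

Whether and how the loop vectorizes depends on the compiler version and flags, so always confirm with the vectorization report before attributing a speedup to SIMD.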

The Core Question You’re Answering

“When does SIMD actually speed things up, and when does it not?”


Concepts You Must Understand First

Stop and research these before coding:

  1. Vectorization basics
    • What is a SIMD lane?
    • What data sizes match vector width?
    • Book Reference: “Optimizing Software in C++” Ch. 10 - Fog
  2. Alignment and layout
    • Why does alignment matter for vector loads?
    • Book Reference: “CS:APP” Ch. 6 - Bryant and O’Hallaron

Questions to Guide Your Design

Before implementing, think through these:

  1. Data preparation
    • How will you ensure contiguous, aligned data?
  2. Measurement
    • How will you isolate compute from memory bottlenecks?

Thinking Exercise

The SIMD Mismatch

Describe a dataset where SIMD would not help due to irregular memory access. Explain how you would detect that in a profiler.

List the symptoms of non-vectorizable workloads.

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is SIMD and why does it help?”
  2. “How does alignment affect SIMD performance?”
  3. “What workloads are good candidates for vectorization?”
  4. “How do you validate a claimed SIMD speedup?”
  5. “Why might SIMD not help even if the CPU supports it?”

Hints in Layers

Hint 1: Starting Point Start with a simple numeric loop and compare scalar vs vectorized versions conceptually.

Hint 2: Next Level Measure with large enough data to avoid timing noise.

Hint 3: Technical Details Separate aligned vs unaligned data experiments.

Hint 4: Tools/Debugging Use compiler reports to confirm vectorization decisions.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| SIMD fundamentals | “Optimizing Software in C++” by Agner Fog | Ch. 10 |
| Vector units | “Computer Architecture” by Hennessy and Patterson | Ch. 4 |
| Alignment | “Computer Systems: A Programmer’s Perspective” | Ch. 6 |

Project 7: Latency Budget and Tail Latency Simulator

  • File: PERFORMANCE_ENGINEERING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Go, Rust, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Latency Engineering
  • Software or Tool: perf, tracing tools
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A workload simulator that produces latency distributions and demonstrates how small delays create large p99 spikes.

Why it teaches performance engineering: It connects micro-level delays to system-level tail latency.

Core challenges you’ll face:

  • Modeling queueing delay (maps to latency distributions)
  • Measuring p95/p99 (maps to statistics)
  • Correlating spikes with system events (maps to tracing)

Key Concepts

  • Tail latency: “Designing Data-Intensive Applications” - Kleppmann
  • Queuing effects: “Systems Performance” - Gregg
  • Latency metrics: “Site Reliability Engineering” - Beyer et al.

Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Project 1 complete, basic statistics.


Real World Outcome

You will produce a report with latency histograms and a clear explanation of how queueing amplifies minor slowdowns.

Example Output:

$ ./latency_sim report
Median: 4.8 ms
p95: 9.2 ms
p99: 41.7 ms
Spike cause: queue depth burst during periodic I/O
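
A toy single-server queue simulation, a sketch under simplified assumptions (fixed arrival gap, fixed service time, one periodic slow event); it is enough to reproduce the "median fine, p99 exploded" shape before you build the real simulator.

```c
/* latency_sim.c - toy queue: steady arrivals plus a rare slow service event,
 * showing how a backlog inflates the tail while the median stays flat.
 * Build: gcc -O2 latency_sim.c -o latency_sim */
#include <stdio.h>
#include <stdlib.h>

#define REQS 100000

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void) {
    static double lat[REQS];
    double arrival_gap = 1.0;        /* ms between arrivals */
    double service = 0.9;            /* ms of normal service: ~90% utilization */
    double clock = 0.0, free_at = 0.0;

    for (int i = 0; i < REQS; i++) {
        clock += arrival_gap;
        double work = service;
        if (i % 5000 == 0) work = 50.0;                      /* rare slow event (e.g. an fsync) */
        double start = clock > free_at ? clock : free_at;    /* wait if the server is busy */
        free_at = start + work;
        lat[i] = free_at - clock;                            /* queueing delay + service time */
    }
    qsort(lat, REQS, sizeof lat[0], cmp_double);
    printf("p50: %.2f ms  p95: %.2f ms  p99: %.2f ms\n",
           lat[REQS / 2], lat[(int)(REQS * 0.95)], lat[(int)(REQS * 0.99)]);
    return 0;
}
```

Because the server only has 0.1 ms of slack per request, one 50 ms event delays hundreds of later requests, which is the queueing amplification this project is about.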

The Core Question You’re Answering

“Why does p99 explode even when average latency looks fine?”


Concepts You Must Understand First

Stop and research these before coding:

  1. Latency percentiles
    • Why are percentiles more useful than averages?
    • Book Reference: “Site Reliability Engineering” Ch. 4 - Beyer et al.
  2. Queueing theory basics
    • What happens as utilization approaches 100%?
    • Book Reference: “Systems Performance” Ch. 2 - Gregg

Questions to Guide Your Design

Before implementing, think through these:

  1. Workload shaping
    • How will you generate bursts of load?
    • How will you log queue depth over time?
  2. Visualization
    • How will you present the latency distribution clearly?

Thinking Exercise

The Slow Tail

Describe a system where 1 percent of requests take 10x longer. Explain how that affects user experience and system capacity.

Write a short narrative with numbers (median, p95, p99).

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is tail latency and why is it important?”
  2. “Why does queueing cause nonlinear latency growth?”
  3. “How do you measure and report p99?”
  4. “What are common sources of latency spikes?”
  5. “How do you reduce tail latency without overprovisioning?”

Hints in Layers

Hint 1: Starting Point Generate a steady workload and measure p50 and p99.

Hint 2: Next Level Introduce periodic bursts and observe the p99 jump.

Hint 3: Technical Details Record queue depth alongside latency metrics for correlation.

Hint 4: Tools/Debugging Use tracing to identify which stage introduces the long tail.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Tail latency | “Designing Data-Intensive Applications” by Kleppmann | Ch. 8 |
| Queueing effects | “Systems Performance” by Brendan Gregg | Ch. 2 |
| SLOs and latency | “Site Reliability Engineering” by Beyer et al. | Ch. 4 |

Project 8: Lock Contention and Concurrency Profiler

  • File: PERFORMANCE_ENGINEERING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Go, Rust, C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Concurrency and Contention
  • Software or Tool: perf, tracing tools
  • Main Book: “The Art of Multiprocessor Programming” by Herlihy and Shavit

What you’ll build: A contention diagnostic that measures lock hold time, wait time, and impact on throughput.

Why it teaches performance engineering: Concurrency issues are a leading cause of real-world latency spikes.

Core challenges you’ll face:

  • Measuring lock hold time (maps to concurrency profiling)
  • Visualizing contention hotspots (maps to performance attribution)
  • Relating throughput drops to lock behavior (maps to system metrics)

Key Concepts

  • Locks and contention: “The Art of Multiprocessor Programming”
  • Profiling contention: “Systems Performance” - Gregg
  • Scheduling overhead: “Operating Systems: Three Easy Pieces”

Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Projects 1-2 complete, basic concurrency.


Real World Outcome

You will produce a report that shows which locks are most contended and how they impact throughput and latency.

Example Output:

$ ./lock_prof report
Lock A: hold time 2.4 ms, wait time 8.1 ms
Lock B: hold time 0.2 ms, wait time 0.5 ms
Throughput drop: 35% under contention
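
A minimal sketch of the measurement idea, assuming pthreads: time how long each pthread_mutex_lock call blocks, which separates wait time from the hold time you impose inside the critical section. Thread and iteration counts are arbitrary.

```c
/* lock_prof.c - per-thread wait time on one contended mutex.
 * Build: gcc -O2 -pthread lock_prof.c -o lock_prof */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define THREADS 8
#define ITERS   2000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static double total_wait_ms[THREADS];

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

static void spin_us(int us) {                    /* simulated work while holding the lock */
    double end = now_ms() + us / 1000.0;
    while (now_ms() < end) { }
}

static void *worker(void *arg) {
    long id = (long)arg;
    for (int i = 0; i < ITERS; i++) {
        double t0 = now_ms();
        pthread_mutex_lock(&lock);
        total_wait_ms[id] += now_ms() - t0;      /* wait time = time spent blocked */
        spin_us(100);                            /* hold time: ~100 us of "work" */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[THREADS];
    for (long i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);
    for (int i = 0; i < THREADS; i++)
        printf("thread %d: avg wait %.3f ms per acquisition\n", i, total_wait_ms[i] / ITERS);
    return 0;
}
```

Re-running with fewer threads, or a shorter hold time, shows how wait time and throughput respond to contention, which is the relationship the report needs to demonstrate.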

The Core Question You’re Answering

“Which locks are killing performance, and why?”


Concepts You Must Understand First

Stop and research these before coding:

  1. Lock contention
    • What is the difference between hold time and wait time?
    • Book Reference: “The Art of Multiprocessor Programming” Ch. 2
  2. Scheduling effects
    • How do context switches affect latency?
    • Book Reference: “Operating Systems: Three Easy Pieces” Ch. 26

Questions to Guide Your Design

Before implementing, think through these:

  1. Metrics
    • How will you capture lock wait time accurately?
  2. Visualization
    • How will you compare lock hotspots across runs?

Thinking Exercise

The Contention Map

Describe how you would visualize a system where 20 threads compete for the same lock. Explain how that would show up in throughput and latency metrics.

Create a simple map: threads -> lock -> wait time.

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is lock contention and how do you measure it?”
  2. “How does contention affect tail latency?”
  3. “What is the difference between throughput and latency in contention scenarios?”
  4. “How can you reduce contention without removing correctness?”
  5. “What are alternatives to heavy locks?”

Hints in Layers

Hint 1: Starting Point Start with a single shared lock and scale thread count.

Hint 2: Next Level Measure wait time and throughput at each thread count.

Hint 3: Technical Details Separate lock hold time from wait time in your metrics.

Hint 4: Tools/Debugging Use tracing to confirm where threads are blocked.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Locks and contention | “The Art of Multiprocessor Programming” | Ch. 2 |
| Profiling systems | “Systems Performance” by Brendan Gregg | Ch. 6 |
| Scheduling | “Operating Systems: Three Easy Pieces” | Ch. 26 |

Project 9: System Call and I/O Latency Profiler

  • File: PERFORMANCE_ENGINEERING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Go, Rust, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: I/O and Syscalls
  • Software or Tool: perf, strace, bpftrace
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you’ll build: A profiler that captures syscall latency distributions and identifies slow I/O paths.

Why it teaches performance engineering: Many latency spikes come from slow syscalls and blocking I/O.

Core challenges you’ll face:

  • Capturing syscall timing (maps to tracing)
  • Associating syscalls with call sites (maps to profiling)
  • Interpreting I/O delays (maps to system behavior)

Key Concepts

  • Syscall overhead: “The Linux Programming Interface” - Kerrisk
  • Tracing methodology: “Systems Performance” - Gregg
  • I/O latency: “Operating Systems: Three Easy Pieces”

Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Project 1 complete, basic Linux tooling.


Real World Outcome

You will have a report showing which syscalls are slow, their latency distribution, and how often they appear in real workloads.

Example Output:

$ ./syscall_prof report
Top slow syscalls:
1) read: p99 18.2 ms
2) fsync: p99 42.7 ms
3) open: p99 6.3 ms
Conclusion: fsync spikes dominate tail latency
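
A sketch that times write()+fsync() pairs directly from user space with clock_gettime, assuming a POSIX system; it will not attribute latency across a whole process the way strace or bpftrace can, but it reproduces the fsync spikes the example report describes.

```c
/* syscall_lat.c - times write()+fsync() calls and reports percentiles.
 * Build: gcc -O2 syscall_lat.c -o syscall_lat */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define OPS 500

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void) {
    char buf[4096];
    double lat[OPS];
    int fd = open("syscall_lat.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    memset(buf, 'x', sizeof buf);

    for (int i = 0; i < OPS; i++) {
        double t0 = now_ms();
        if (write(fd, buf, sizeof buf) < 0) perror("write");
        fsync(fd);                           /* forces the data to the device: the slow part */
        lat[i] = now_ms() - t0;
    }
    qsort(lat, OPS, sizeof lat[0], cmp_double);
    printf("write+fsync  p50: %.2f ms  p99: %.2f ms\n",
           lat[OPS / 2], lat[(int)(OPS * 0.99)]);
    close(fd);
    unlink("syscall_lat.tmp");
    return 0;
}
```

Cross-checking these user-space timings against a tracer's per-syscall numbers is a good way to validate both tools, as Hint 4 suggests.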

The Core Question You’re Answering

“Which system calls are responsible for I/O latency spikes?”


Concepts You Must Understand First

Stop and research these before coding:

  1. Syscall mechanics
    • What happens during a syscall transition?
    • Book Reference: “The Linux Programming Interface” Ch. 3 - Kerrisk
  2. I/O stack
    • Why does disk or network I/O dominate latency?
    • Book Reference: “Operating Systems: Three Easy Pieces” Ch. 36

Questions to Guide Your Design

Before implementing, think through these:

  1. Trace capture
    • How will you collect syscall timing with low overhead?
  2. Reporting
    • How will you summarize p50, p95, p99?

Thinking Exercise

The Slow Disk

Describe how a single slow disk operation can inflate tail latency. Explain how you would prove it using syscall timing data.

Write a short narrative using p95 and p99 measurements.

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is a syscall and why is it expensive?”
  2. “How do you measure syscall latency?”
  3. “Why does fsync cause latency spikes?”
  4. “How do you separate CPU time from I/O wait?”
  5. “What is the risk of tracing at scale?”

Hints in Layers

Hint 1: Starting Point Trace a single process and capture only read/write calls.

Hint 2: Next Level Add latency percentiles to your report.

Hint 3: Technical Details Correlate slow syscalls with timestamps from your workload.

Hint 4: Tools/Debugging Validate findings with strace or a second tracing tool.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Syscalls | “The Linux Programming Interface” by Kerrisk | Ch. 3 |
| Tracing | “Systems Performance” by Brendan Gregg | Ch. 7 |
| I/O behavior | “Operating Systems: Three Easy Pieces” | Ch. 36 |

Project 10: End-to-End Performance Regression Dashboard

  • File: PERFORMANCE_ENGINEERING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Go, Rust, Python
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Performance Engineering Systems
  • Software or Tool: perf, flamegraphs, tracing tools
  • Main Book: “Systems Performance” by Brendan Gregg

What you’ll build: A dashboard that tracks performance metrics across builds and highlights regressions with evidence.

Why it teaches performance engineering: It forces you to automate profiling and make results actionable for real teams.

Core challenges you’ll face:

  • Automating profiling runs (maps to measurement discipline)
  • Detecting statistically significant regressions (maps to benchmarking)
  • Presenting root cause clues (maps to flamegraph analysis)

Key Concepts

  • Regression detection: “Systems Performance” - Gregg
  • Statistical testing: “High Performance Python” - Gorelick and Ozsvald
  • Profiling workflow: “Performance Analysis and Tuning on Modern CPUs” - Bakhvalov

Difficulty: Advanced. Time estimate: 1 month+. Prerequisites: Projects 1-3 complete, basic data visualization.


Real World Outcome

You will have a dashboard-like report that shows performance trends across versions, flags regressions, and links to the profiling evidence that explains them.

Example Output:

$ ./perf_dashboard report
Version: v1.4.2 -> v1.4.3
Regression: +18% median latency
Hotspot shift: parse_input increased from 22% to 41%
Evidence: reports/v1.4.3_flamegraph.svg
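
A minimal sketch of the regression gate, assuming per-run latency samples stored one per line in plain text files; the file format and the 10% threshold are arbitrary choices, not a statistical test, but they are enough for a first CI check.

```c
/* regress_check.c - flags a regression when the new median exceeds the
 * baseline median by more than a fixed threshold.
 * Build: gcc -O2 regress_check.c -o regress_check
 * Usage: ./regress_check baseline.txt current.txt   (one latency in ms per line) */
#include <stdio.h>
#include <stdlib.h>

#define MAX_SAMPLES 10000
#define THRESHOLD 1.10                      /* flag if the new median is >10% worse */

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static double median_of_file(const char *path) {
    static double v[MAX_SAMPLES];
    int n = 0;
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); exit(2); }
    while (n < MAX_SAMPLES && fscanf(f, "%lf", &v[n]) == 1) n++;
    fclose(f);
    if (n == 0) { fprintf(stderr, "%s: no samples\n", path); exit(2); }
    qsort(v, n, sizeof v[0], cmp_double);
    return v[n / 2];
}

int main(int argc, char **argv) {
    if (argc != 3) { fprintf(stderr, "usage: %s baseline current\n", argv[0]); return 2; }
    double base = median_of_file(argv[1]);
    double cur  = median_of_file(argv[2]);
    printf("baseline median %.2f ms, current median %.2f ms (%+.1f%%)\n",
           base, cur, 100.0 * (cur / base - 1.0));
    if (cur > base * THRESHOLD) { printf("REGRESSION flagged\n"); return 1; }
    printf("within threshold\n");
    return 0;
}
```

Because it exits nonzero when a regression is flagged, this gate is easy to wire into a CI step, with the flamegraph and perf evidence attached by the surrounding scripts.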

The Core Question You’re Answering

“How do I prevent performance regressions from silently reaching production?”


Concepts You Must Understand First

Stop and research these before coding:

  1. Regression detection
    • What qualifies as a statistically significant slowdown?
    • Book Reference: “High Performance Python” Ch. 1 - Gorelick and Ozsvald
  2. Automated profiling
    • How do you integrate profiling into CI without excessive overhead?
    • Book Reference: “Systems Performance” Ch. 2 - Gregg

Questions to Guide Your Design

Before implementing, think through these:

  1. Signal vs noise
    • How will you avoid false positives?
  2. Evidence collection
    • How will you attach profiling data to regressions?

Thinking Exercise

The False Regression

Describe a case where a performance regression alert triggers, but the underlying cause is measurement noise. Explain how you would verify and dismiss it.

List two signals that confirm it is noise.

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do you detect performance regressions in CI?”
  2. “What is the difference between a regression and natural variance?”
  3. “How do you attach evidence to a performance alert?”
  4. “How do you avoid alert fatigue in performance monitoring?”
  5. “What is a safe rollback plan for performance issues?”

Hints in Layers

Hint 1: Starting Point Start with a single workload and record its baseline in a file.

Hint 2: Next Level Add a comparison step that flags changes beyond a threshold.

Hint 3: Technical Details Attach a flamegraph or perf report to each flagged run.

Hint 4: Tools/Debugging Validate regressions by repeating runs on pinned CPU cores.


Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Regression workflows | “Systems Performance” by Brendan Gregg | Ch. 2 |
| Benchmarking statistics | “High Performance Python” by Gorelick and Ozsvald | Ch. 1 |
| Profiling tooling | “Performance Analysis and Tuning on Modern CPUs” by Bakhvalov | Ch. 1-3 |

Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
| --- | --- | --- | --- | --- |
| Performance Baseline Lab | Beginner | Weekend | Medium | Medium |
| perf + Flamegraph Investigator | Intermediate | 1-2 weeks | High | High |
| GDB + Core Dump Performance Autopsy | Intermediate | 1-2 weeks | High | Medium |
| Cache Locality Visualizer | Advanced | 1-2 weeks | High | High |
| Branch Predictor and Pipeline Lab | Advanced | 1-2 weeks | High | Medium |
| SIMD Throughput Explorer | Expert | 1 month+ | Very High | High |
| Latency Budget and Tail Latency Simulator | Intermediate | 1-2 weeks | High | High |
| Lock Contention and Concurrency Profiler | Advanced | 1-2 weeks | High | Medium |
| System Call and I/O Latency Profiler | Intermediate | 1-2 weeks | High | Medium |
| End-to-End Performance Regression Dashboard | Advanced | 1 month+ | Very High | High |

Recommendation

Start with Project 1 (Performance Baseline Lab) to build measurement discipline, then move to Project 2 (perf + Flamegraph Investigator) to learn attribution. After that, choose based on your interest:

  • Hardware focus: Projects 4-6
  • Debugging focus: Project 3
  • Latency focus: Projects 7-9

Final Overall Project

Project: Full-Stack Performance Engineering Field Manual

  • File: PERFORMANCE_ENGINEERING_PROJECTS.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Performance Engineering Systems
  • Software or Tool: perf, flamegraphs, tracing tools, GDB
  • Main Book: “Systems Performance” by Brendan Gregg

What you’ll build: A comprehensive performance toolkit that benchmarks, profiles, and produces a structured report with root-cause hypotheses and next actions.

Why it teaches performance engineering: It integrates measurement, profiling, hardware reasoning, and debugging into a single repeatable workflow.

Core challenges you’ll face:

  • Designing a consistent experiment methodology
  • Automating profiling and tracing capture
  • Producing a clear narrative from raw metrics

Key Concepts

  • Methodology: “Systems Performance” - Gregg
  • CPU behavior: “Performance Analysis and Tuning on Modern CPUs” - Bakhvalov
  • Debugging: “The Linux Programming Interface” - Kerrisk

Difficulty: Expert. Time estimate: 1 month+. Prerequisites: Projects 1-9 complete.


Summary

| Project | Focus | Outcome |
| --- | --- | --- |
| Performance Baseline Lab | Measurement discipline | Reliable baselines and variance control |
| perf + Flamegraph Investigator | Profiling | Attribution of CPU hotspots |
| GDB + Core Dump Performance Autopsy | Debugging | Post-mortem performance analysis |
| Cache Locality Visualizer | Cache behavior | Evidence of locality effects |
| Branch Predictor and Pipeline Lab | CPU pipeline | Measured misprediction penalties |
| SIMD Throughput Explorer | Vectorization | Proven SIMD speedups and limits |
| Latency Budget and Tail Latency Simulator | Tail latency | Latency distribution understanding |
| Lock Contention and Concurrency Profiler | Concurrency | Contention hotspots and mitigation |
| System Call and I/O Latency Profiler | I/O latency | Slow syscall identification |
| End-to-End Performance Regression Dashboard | Regression detection | Automated performance monitoring |
| Full-Stack Performance Engineering Field Manual | Integration | A complete performance workflow |