Project 11: Full-Stack Performance Engineering Field Manual
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Expert |
| Time Estimate | 1 month+ |
| Primary Language | C |
| Alternative Languages | Rust, Go, C++ |
| Knowledge Area | Performance Engineering Systems |
| Tools Required | perf, flamegraphs, tracing tools, GDB |
| Primary Reference | "Systems Performance" by Brendan Gregg |
Learning Objectives
By completing this project, you will be able to:
- Execute a complete performance investigation from symptom to root cause
- Apply systematic methodology for any performance problem
- Produce actionable reports for stakeholders at different levels
- Build a reusable toolkit for future investigations
- Train others in performance engineering practices
- Make defensible optimization decisions with evidence
Deep Theoretical Foundation
The USE Method
Brendan Gregg's USE method provides systematic coverage:
- Utilization: How busy is the resource?
- Saturation: How much work is queued?
- Errors: Are there any errors?
Apply to each resource:
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | CPU% per core | Run queue length | N/A |
| Memory | Memory used % | Swapping rate | OOM events |
| Disk I/O | I/O busy % | I/O queue length | Device errors |
| Network | Bandwidth used % | Socket backlog | Dropped packets |
| Mutex | Hold time % | Waiter count | Deadlocks |
The RED Method
For services (request-driven systems):
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Time per request (latency distribution)
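A minimal sketch of how the toolkit's `red_check.py` could compute these three signals from raw request records; the record layout (timestamp, status code, latency) and the nearest-rank percentile are assumptions, not part of the spec:

```python
#!/usr/bin/env python3
# red_sketch.py - RED metrics from request records (illustrative sketch).
# Each record is assumed to be (timestamp_sec, status_code, latency_ms).
import math

def red_metrics(records, window_sec):
    """Compute Rate, Errors, and Duration percentiles over one window."""
    latencies = sorted(r[2] for r in records)
    errors = sum(1 for r in records if r[1] >= 500)

    def pct(p):
        # Nearest-rank percentile; adequate for a sketch, crude for tiny samples.
        if not latencies:
            return 0.0
        rank = math.ceil(len(latencies) * p / 100)
        return latencies[min(rank, len(latencies)) - 1]

    return {
        'rate_rps': len(records) / window_sec,
        'error_rps': errors / window_sec,
        'p50_ms': pct(50),
        'p99_ms': pct(99),
    }

if __name__ == '__main__':
    sample = [(0, 200, 12.0), (1, 200, 15.5), (2, 500, 230.0), (3, 200, 18.2)]
    print(red_metrics(sample, window_sec=60))
```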
Performance Investigation Workflow
                 ┌─────────────────────┐
                 │  Symptom Reported   │
                 │  "System is slow"   │
                 └──────────┬──────────┘
                            │
                            ▼
                 ┌─────────────────────┐
                 │ Quantify the Problem│
                 │ Baseline vs Current │
                 └──────────┬──────────┘
                            │
              ┌─────────────┴─────────────┐
              │                           │
              ▼                           ▼
   ┌─────────────────────┐     ┌─────────────────────┐
   │   Global Analysis   │     │  Resource Analysis  │
   │  (USE/RED metrics)  │     │ (bottleneck hunting)│
   └──────────┬──────────┘     └──────────┬──────────┘
              │                           │
              └─────────────┬─────────────┘
                            │
                            ▼
                 ┌─────────────────────┐
                 │     Drill Down      │
                 │   Profile + Trace   │
                 └──────────┬──────────┘
                            │
                            ▼
                 ┌─────────────────────┐
                 │     Root Cause      │
                 │   Code/Config/HW    │
                 └──────────┬──────────┘
                            │
                            ▼
                 ┌─────────────────────┐
                 │   Fix + Validate    │
                 │  Prove improvement  │
                 └─────────────────────┘
The Performance Checklist
Before any investigation:
- [ ] What is the symptom? (latency, throughput, resource usage)
- [ ] When did it start? (time correlation with changes)
- [ ] How is it measured? (metrics, methodology)
- [ ] What is "normal"? (baseline comparison)
- [ ] What has changed? (code, config, traffic, hardware)
- [ ] Who is affected? (all users, subset, specific operations)
- [ ] What is the business impact? (SLA, revenue, experience)
Complete Project Specification
What You're Building
A comprehensive performance toolkit called perf_manual that:
- Executes systematic methodology through guided workflows
- Collects all relevant data (metrics, profiles, traces)
- Produces structured reports for technical and non-technical audiences
- Stores investigation artifacts for future reference
- Tracks optimization experiments with before/after data
Functional Requirements
# Investigation workflow
perf_manual investigate --workload <name> --symptom "<description>"
perf_manual use-check --target <pid|service>
perf_manual red-check --endpoint <url> --duration <sec>
# Data collection
perf_manual collect --suite <name> --output <dir>
perf_manual profile --pid <pid> --duration <sec>
perf_manual trace --events <list> --pid <pid>
# Analysis
perf_manual analyze --data <dir> --output <report.md>
perf_manual diff --before <dir> --after <dir>
perf_manual root-cause --hypothesis "<text>" --evidence <dir>
# Reporting
perf_manual report --format <technical|executive> --output <file>
perf_manual present --slides --data <dir>
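A hypothetical session showing how these commands could chain together; the workload name, PID, and paths are placeholders:

```bash
# Hypothetical investigation session; names and paths are illustrative only
perf_manual investigate --workload checkout --symptom "p99 latency tripled since Monday"
perf_manual use-check --target 4242
perf_manual collect --suite full --output investigations/2025-01-20_checkout/raw_data
perf_manual analyze --data investigations/2025-01-20_checkout/raw_data --output report.md
perf_manual report --format executive --output summary.md
```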
Investigation Report Structure
# Performance Investigation Report
## Executive Summary
- **Issue**: Request latency increased 3x over past week
- **Impact**: 5% of users affected, estimated $50K/day revenue impact
- **Root Cause**: Database query N+1 pattern in new feature
- **Fix**: Query optimization, estimated 80% latency reduction
- **Status**: Fix deployed, monitoring for 48 hours
## Timeline
- 2025-01-20 10:00: Latency SLO breach detected
- 2025-01-20 10:15: Investigation started
- 2025-01-20 12:30: Root cause identified
- 2025-01-20 15:00: Fix implemented in staging
- 2025-01-21 09:00: Fix deployed to production
## Technical Details
### Symptom Quantification
- Baseline (Jan 13-19): p99 = 50ms
- Current (Jan 20): p99 = 180ms
- Change: +260% (3.6x increase)
### Resource Analysis (USE Method)
| Resource | Utilization | Saturation | Errors |
|----------|-------------|------------|--------|
| CPU | 45% | Low | None |
| Memory | 62% | None | None |
| Disk | 12% | None | None |
| Database | 89% | HIGH | Timeouts |
Database saturation identified as primary bottleneck.
### Profile Evidence
[Flamegraph: before/after comparison]
Hotspot shift:
- `db_query()`: 15% → 62% of CPU time
- New function `fetch_related()` appears at 45%
### Root Cause
N+1 query pattern in `fetch_related()`:
- Main query returns 100 items
- Each item triggers 1 additional query
- 101 queries instead of 2 queries (JOIN)
### Fix Validation
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| p99 | 180ms | 42ms | -77% |
| DB queries/req | 101 | 2 | -98% |
| DB CPU | 89% | 25% | -64% |
## Recommendations
1. [DONE] Optimize query with JOIN
2. [PLANNED] Add N+1 query detection to CI
3. [PLANNED] Database connection pool tuning
## Appendix
- Full flamegraph: attached
- Query plans: attached
- Benchmark data: attached
Solution Architecture
Toolkit Structure
perf_manual/
├── bin/
│   └── perf_manual              # Main CLI
├── lib/
│   ├── collectors/
│   │   ├── cpu.sh               # CPU metrics collection
│   │   ├── memory.sh            # Memory metrics
│   │   ├── disk.sh              # Disk I/O
│   │   ├── network.sh           # Network
│   │   └── application.sh       # Custom app metrics
│   ├── profilers/
│   │   ├── flamegraph.sh        # perf + flamegraph
│   │   ├── offcpu.sh            # Off-CPU profiling
│   │   └── memory.sh            # Memory profiling
│   ├── tracers/
│   │   ├── syscall.sh           # Syscall tracing
│   │   ├── lock.sh              # Lock tracing
│   │   └── io.sh                # I/O tracing
│   └── analyzers/
│       ├── use_check.py         # USE method analysis
│       ├── red_check.py         # RED method analysis
│       ├── regression.py        # Statistical analysis
│       └── root_cause.py        # Evidence correlation
├── templates/
│   ├── technical_report.md      # Technical report template
│   ├── executive_report.md      # Executive summary template
│   └── checklist.md             # Investigation checklist
└── investigations/              # Stored investigations
    └── YYYY-MM-DD_<name>/
        ├── metadata.yaml
        ├── raw_data/
        ├── analysis/
        └── report.md
Data Collection Framework
#!/bin/bash
# collect_all.sh - Comprehensive data collection
OUTPUT_DIR=$1
DURATION=$2
PID=$3
mkdir -p "$OUTPUT_DIR"/{cpu,memory,disk,network,profile,trace}
# CPU metrics
mpstat -P ALL 1 $DURATION > "$OUTPUT_DIR/cpu/mpstat.txt" &
# Memory metrics
vmstat 1 $DURATION > "$OUTPUT_DIR/memory/vmstat.txt" &
# Disk I/O
iostat -xz 1 $DURATION > "$OUTPUT_DIR/disk/iostat.txt" &
# Network
sar -n DEV 1 $DURATION > "$OUTPUT_DIR/network/sar_dev.txt" &
# CPU profile (flamegraph)
perf record -F 99 -g -p $PID -o "$OUTPUT_DIR/profile/perf.data" -- sleep $DURATION &
# Syscall trace (run for the same window, capped at 10k lines)
timeout "$DURATION" perf trace -p "$PID" 2>&1 | head -10000 > "$OUTPUT_DIR/trace/syscalls.txt" &
# Wait for all collectors
wait
# Generate flamegraph
perf script -i "$OUTPUT_DIR/profile/perf.data" | \
stackcollapse-perf.pl | \
flamegraph.pl > "$OUTPUT_DIR/profile/flamegraph.svg"
echo "Collection complete: $OUTPUT_DIR"
USE Method Checker
#!/usr/bin/env python3
# use_check.py - Automated USE method analysis
import os
import subprocess

def check_cpu():
    """CPU utilization, saturation, errors"""
    # Utilization: 100 - idle%, taken from the last field of mpstat's summary line
    result = subprocess.run(
        ['mpstat', '1', '1'], capture_output=True, text=True
    )
    idle = float(result.stdout.strip().splitlines()[-1].split()[-1])
    busy_pct = 100.0 - idle
    # Saturation: 1-minute load average relative to CPU count
    with open('/proc/loadavg') as f:
        load = float(f.read().split()[0])
    saturation = load / os.cpu_count()
    return {
        'resource': 'CPU',
        'utilization': f'{busy_pct:.1f}%',
        'saturation': 'HIGH' if saturation > 1.0 else 'LOW',
        'errors': 'N/A'
    }

def check_memory():
    """Memory utilization, saturation, errors"""
    # Utilization: memory used %
    with open('/proc/meminfo') as f:
        meminfo = dict(line.split(':') for line in f)
    total = int(meminfo['MemTotal'].strip().split()[0])
    avail = int(meminfo['MemAvailable'].strip().split()[0])
    used_pct = (1 - avail / total) * 100
    # Saturation: swap-out activity (cumulative counter; diff two samples for a rate)
    with open('/proc/vmstat') as f:
        vmstat = dict(line.split() for line in f)
    swap_out = int(vmstat.get('pswpout', 0))
    # Errors: OOM kills (the oom_kill counter exists on kernels >= 4.13)
    oom_count = int(vmstat.get('oom_kill', 0))
    return {
        'resource': 'Memory',
        'utilization': f'{used_pct:.1f}%',
        'saturation': 'HIGH' if swap_out > 0 else 'LOW',
        'errors': oom_count
    }

def check_disk():
    """Disk I/O utilization, saturation, errors"""
    # Second iostat report reflects current activity (the first is since boot)
    result = subprocess.run(
        ['iostat', '-xz', '1', '2'], capture_output=True, text=True
    )
    lines = result.stdout.strip().splitlines()
    # Locate the last device table; its header names the %util and queue columns
    header_idx = max(i for i, l in enumerate(lines) if l.startswith('Device'))
    header = lines[header_idx].split()
    util_col = header.index('%util')
    queue_col = next((header.index(c) for c in ('aqu-sz', 'avgqu-sz') if c in header), None)
    max_util = max_queue = 0.0
    for line in lines[header_idx + 1:]:
        fields = line.split()
        if not fields:
            continue
        max_util = max(max_util, float(fields[util_col]))
        if queue_col is not None:
            max_queue = max(max_queue, float(fields[queue_col]))
    return {
        'resource': 'Disk',
        'utilization': f'{max_util:.1f}%',
        'saturation': 'HIGH' if max_queue > 1 else 'LOW',
        'errors': 'N/A'  # device errors: check kernel logs or smartctl
    }

def check_network():
    """Network errors and drops from /proc/net/dev (utilization needs link speed)"""
    errors = drops = 0
    with open('/proc/net/dev') as f:
        for line in f.readlines()[2:]:
            _, data = line.split(':', 1)
            fields = data.split()
            errors += int(fields[2]) + int(fields[10])  # rx_errs + tx_errs
            drops += int(fields[3]) + int(fields[11])   # rx_drop + tx_drop
    return {
        'resource': 'Network',
        'utilization': 'N/A',
        'saturation': 'HIGH' if drops > 0 else 'LOW',
        'errors': errors
    }

def run_use_check():
    """Run complete USE check"""
    results = [
        check_cpu(),
        check_memory(),
        check_disk(),
        check_network(),
    ]
    print("\n=== USE Method Analysis ===\n")
    print(f"{'Resource':<12} {'Utilization':<15} {'Saturation':<12} {'Errors':<10}")
    print("-" * 50)
    for r in results:
        has_errors = r['errors'] not in (0, 'N/A')
        status = 'WARN' if r['saturation'] == 'HIGH' or has_errors else 'OK'
        print(f"{r['resource']:<12} {r['utilization']:<15} {r['saturation']:<12} {str(r['errors']):<10} {status}")

if __name__ == '__main__':
    run_use_check()
Phased Implementation Guide
Phase 1: Data Collection Framework (Week 1)
Goal: Comprehensive, reproducible data collection.
Steps:
- Create modular collection scripts per resource
- Implement unified collection runner
- Design output directory structure
- Add metadata capture (timestamp, config, version); see the sketch after this list
- Test with sample workload
Validation: All collectors run, data parseable.
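The metadata-capture step above could be as small as this sketch; the metadata.yaml field names are assumptions:

```bash
#!/bin/bash
# capture_metadata.sh - record investigation context (sketch; fields are illustrative)
OUTPUT_DIR=$1
cat > "$OUTPUT_DIR/metadata.yaml" <<EOF
started_at: $(date -u +%Y-%m-%dT%H:%M:%SZ)
hostname: $(hostname)
kernel: $(uname -r)
cpus: $(nproc)
toolkit_version: $(git rev-parse --short HEAD 2>/dev/null || echo unknown)
EOF
```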
Phase 2: USE/RED Automation (Week 2)
Goal: Automated methodology checklists.
Steps:
- Implement USE checker for all resources
- Implement RED checker for HTTP endpoints
- Create severity classification (see the sketch after this list)
- Generate checklist reports
- Identify bottlenecks automatically
Validation: Correctly identifies simulated bottlenecks.
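One possible severity classification over the USE results produced by `use_check.py`; the thresholds are placeholders to tune against your own baselines:

```python
# severity.py - classify a USE result dict (sketch; thresholds are illustrative)
def classify(result):
    """Map one USE result to NONE / WARNING / CRITICAL."""
    util = result['utilization']
    util_pct = float(util.rstrip('%')) if util.endswith('%') else 0.0
    has_errors = result['errors'] not in (0, 'N/A')
    if has_errors or (result['saturation'] == 'HIGH' and util_pct > 80):
        return 'CRITICAL'
    if result['saturation'] == 'HIGH' or util_pct > 70:
        return 'WARNING'
    return 'NONE'
```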
Phase 3: Profile Integration (Week 3)
Goal: Unified profiling with evidence collection.
Steps:
- Integrate CPU profiling (flamegraph)
- Add off-CPU profiling
- Implement differential flamegraphs (see the sketch after this list)
- Add memory profiling
- Link profiles to investigation
Validation: Profiles captured and accessible.
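The differential-flamegraph step can lean on Brendan Gregg's FlameGraph scripts; a sketch assuming two perf.data files from a before and an after collection:

```bash
# diff_flame.sh - differential flamegraph between two collections (sketch)
BEFORE=$1   # e.g. before/profile/perf.data
AFTER=$2    # e.g. after/profile/perf.data
perf script -i "$BEFORE" | stackcollapse-perf.pl > before.folded
perf script -i "$AFTER"  | stackcollapse-perf.pl > after.folded
# difffolded.pl ships with the FlameGraph repository
difffolded.pl before.folded after.folded | flamegraph.pl > diff_flamegraph.svg
```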
Phase 4: Analysis and Reporting (Week 4)
Goal: Structured analysis with actionable reports.
Steps:
- Create report templates (technical, executive)
- Implement automated analysis summary
- Add root cause hypothesis tracking
- Generate before/after comparisons (see the sketch after this list)
- Build evidence linkage
Validation: Report accurately summarizes investigation.
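At its core, the diff command is a metric-by-metric comparison. A sketch, assuming each run exposes a flat dict of numeric metrics:

```python
# metric_diff.py - before/after comparison (sketch; the metric names are illustrative)
def diff_metrics(before, after):
    """Return (name, before, after, percent change) for metrics present in both runs."""
    rows = []
    for name in sorted(before.keys() & after.keys()):
        b, a = before[name], after[name]
        change = (a - b) / b * 100 if b else float('inf')
        rows.append((name, b, a, change))
    return rows

if __name__ == '__main__':
    before = {'p99_ms': 180, 'db_queries_per_req': 101, 'db_cpu_pct': 89}
    after = {'p99_ms': 42, 'db_queries_per_req': 2, 'db_cpu_pct': 25}
    for name, b, a, change in diff_metrics(before, after):
        print(f'{name:<22} {b:>8} {a:>8} {change:>+7.1f}%')
```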
Phase 5: Investigation Management (Week 5+)
Goal: Persistent investigation tracking.
Steps:
- Create investigation storage structure
- Implement timeline tracking
- Add hypothesis/evidence correlation
- Build search/retrieval for past investigations (see the sketch after this list)
- Export to knowledge base
Validation: Can retrieve and reference past investigations.
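Retrieval can stay simple at first: scan stored metadata and reports for a keyword. A sketch, assuming the investigations/ layout shown earlier:

```python
# find_investigation.py - keyword search over past investigations (sketch)
import sys
from pathlib import Path

def search(keyword, root='investigations'):
    """Yield investigation directories whose metadata or report mention the keyword."""
    for inv in sorted(Path(root).glob('*_*')):
        for name in ('metadata.yaml', 'report.md'):
            path = inv / name
            if path.exists() and keyword.lower() in path.read_text(errors='ignore').lower():
                yield inv
                break

if __name__ == '__main__':
    for hit in search(sys.argv[1]):
        print(hit)
```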
Testing Strategy
Synthetic Scenarios
1. CPU Bottleneck
// Infinite loop consuming CPU
while(1) { volatile int x = 0; for(int i=0;i<10000000;i++) x++; }
Expected: USE shows high CPU utilization.
2. Memory Pressure
// Allocate until swapping
while(1) { void *p = malloc(1024*1024); if (!p) break; memset(p, 1, 1024*1024); }  /* touch pages so they are actually committed */
Expected: USE shows memory saturation (swapping).
3. I/O Bottleneck
// Synchronous writes
while(1) { write(fd, data, 4096); fsync(fd); }
Expected: USE shows disk saturation.
4. Contention
// All threads fighting for one lock
pthread_mutex_lock(&global_lock);
work();
pthread_mutex_unlock(&global_lock);
Expected: Profile shows lock contention.
End-to-End Tests
- Complete investigation: Symptom → Root cause → Fix
- Report generation: Data → Technical report → Executive summary
- Evidence linking: Profile → Hypothesis → Validation
Common Pitfalls and Debugging
Pitfall 1: Analysis Paralysis
Symptom: Collecting data forever, never concluding.
Solution: Time-box each phase:
- 30 min: Quantify symptom
- 1 hour: USE/RED check
- 2 hours: Profile and drill down
- If no progress, escalate or document uncertainty
Pitfall 2: Confirmation Bias
Symptom: Finding evidence for pre-existing belief.
Solution:
- Collect data before forming hypothesis
- Actively seek disconfirming evidence
- Have peer review findings
Pitfall 3: Fixing Symptoms Not Causes
Symptom: Problem recurs after the "fix".
Solution:
- Ask "why" five times (root cause analysis)
- Verify fix addresses root cause
- Monitor for regression
Pitfall 4: Missing the Forest for Trees
Symptom: Optimizing 1% of runtime.
Solution:
- Always start with highest-impact opportunity
- Calculate the theoretical maximum improvement (see the worked example below)
- Stop when returns diminish
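The theoretical maximum comes from Amdahl's law: if a hotspot accounts for fraction p of runtime and you speed it up by factor s, overall speedup is bounded by 1 / ((1 - p) + p/s). A quick sanity check before investing effort:

```python
# amdahl.py - upper bound on overall speedup from optimizing a single hotspot
def max_speedup(hotspot_fraction, hotspot_speedup):
    return 1 / ((1 - hotspot_fraction) + hotspot_fraction / hotspot_speedup)

# A hotspot that is 1% of runtime can never buy more than ~1.01x overall,
# while a 40% hotspot made 4x faster yields roughly 1.43x.
print(max_speedup(0.01, float('inf')))  # ~1.0101
print(max_speedup(0.40, 4))             # ~1.4286
```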
Extensions and Challenges
Extension 1: Automated Incident Response
Integrate with monitoring:
- Alert triggers investigation
- Automatic data collection
- Initial analysis before human involvement
Extension 2: Knowledge Base
Build searchable repository:
- Past investigations indexed
- Similar symptoms matched
- Solutions suggested
Extension 3: Performance Review Process
Integrate with development:
- Pre-merge performance check
- Post-deploy monitoring
- Quarterly performance review
Challenge: Real Incident
Apply full methodology to production incident:
- Time-bounded (SLA pressure)
- Incomplete information
- Stakeholder communication
- Post-mortem documentation
Real-World Connections
Industry Methodologies
- Google's Data-Driven Approach: Everything measured, decisions justified
- Facebook's Performance Culture: Continuous profiling, efficiency focus
- Netflix's Chaos Engineering: Proactive performance testing
- Amazon's Two-Pizza Teams: Ownership of service performance
Career Application
This project demonstrates:
- Systematic problem-solving
- Communication at multiple levels
- Evidence-based decision making
- Tool building and automation
Self-Assessment Checklist
Before considering this project complete, verify:
- You can execute complete investigation methodology
- USE/RED checks work automatically
- Data collection is comprehensive and reproducible
- Reports are clear and actionable
- Past investigations are searchable
- You've practiced on real or realistic scenarios
- Others can use your toolkit
Resources
Essential Reading
- "Systems Performance" by Brendan Gregg (entire book)
- "Performance Analysis and Tuning on Modern CPUs" by Bakhvalov
- "The Linux Programming Interface" by Kerrisk
Methodology
- USE Method: Brendan Gregg's blog
- RED Method: Tom Wilkie's talk
- Google SRE Book: Chapter 26 (Data Processing Pipelines)
Tools
- perf: Linux performance tool
- bpftrace: eBPF tracing
- Flamegraph: Visualization
- Grafana: Dashboarding
Communication
- "Thinking, Fast and Slow" by Kahneman (decision making)
- "The Pyramid Principle" by Minto (structured communication)