Project 11: Full-Stack Performance Engineering Field Manual
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Expert |
| Time Estimate | 1 month+ |
| Primary Language | C |
| Alternative Languages | Rust, Go, C++ |
| Knowledge Area | Performance Engineering Systems |
| Tools Required | perf, flamegraphs, tracing tools, GDB |
| Primary Reference | “Systems Performance” by Brendan Gregg |
Learning Objectives
By completing this project, you will be able to:
- Execute a complete performance investigation from symptom to root cause
- Apply a systematic methodology to any performance problem
- Produce actionable reports for stakeholders at different levels
- Build a reusable toolkit for future investigations
- Train others in performance engineering practices
- Make defensible optimization decisions with evidence
Deep Theoretical Foundation
The USE Method
Brendan Gregg’s USE method provides systematic coverage:
- Utilization: How busy is the resource?
- Saturation: How much work is queued?
- Errors: Are there any errors?
Apply to each resource:
┌─────────────────────────────────────────────────────────────────┐
│ Resource │ Utilization │ Saturation │ Errors │
├─────────────┼──────────────────┼────────────────┼───────────────┤
│ CPU │ CPU% per core │ Run queue len │ N/A │
│ Memory │ Memory used % │ Swapping rate │ OOM events │
│ Disk I/O │ I/O busy % │ I/O queue len │ Device errors │
│ Network │ Bandwidth used % │ Socket backlog │ Dropped pkts │
│ Mutex │ Hold time % │ Waiter count │ Deadlocks │
└─────────────┴──────────────────┴────────────────┴───────────────┘
The RED Method
For services (request-driven systems):
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Time per request (latency distribution)
Performance Investigation Workflow
┌─────────────────────┐
│ Symptom Reported │
│ "System is slow" │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Quantify the Problem│
│ Baseline vs Current │
└──────────┬──────────┘
│
▼
┌──────────────────┴──────────────────┐
│ │
▼ ▼
┌───────────────────────┐ ┌───────────────────────┐
│ Global Analysis │ │ Resource Analysis │
│ (USE/RED metrics) │ │ (bottleneck hunting) │
└───────────┬───────────┘ └───────────┬───────────┘
│ │
└──────────────────┬──────────────────┘
│
▼
┌─────────────────────┐
│ Drill Down │
│ Profile + Trace │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Root Cause │
│ Code/Config/HW │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Fix + Validate │
│ Prove improvement │
└─────────────────────┘
The Performance Checklist
Before any investigation:
□ What is the symptom? (latency, throughput, resource usage)
□ When did it start? (time correlation with changes)
□ How is it measured? (metrics, methodology)
□ What is "normal"? (baseline comparison)
□ What has changed? (code, config, traffic, hardware)
□ Who is affected? (all users, subset, specific operations)
□ What is the business impact? (SLA, revenue, experience)
Complete Project Specification
What You’re Building
A comprehensive performance toolkit called perf_manual that:
- Executes systematic methodology through guided workflows
- Collects all relevant data (metrics, profiles, traces)
- Produces structured reports for technical and non-technical audiences
- Stores investigation artifacts for future reference
- Tracks optimization experiments with before/after data
Functional Requirements
# Investigation workflow
perf_manual investigate --workload <name> --symptom "<description>"
perf_manual use-check --target <pid|service>
perf_manual red-check --endpoint <url> --duration <sec>
# Data collection
perf_manual collect --suite <name> --output <dir>
perf_manual profile --pid <pid> --duration <sec>
perf_manual trace --events <list> --pid <pid>
# Analysis
perf_manual analyze --data <dir> --output <report.md>
perf_manual diff --before <dir> --after <dir>
perf_manual root-cause --hypothesis "<text>" --evidence <dir>
# Reporting
perf_manual report --format <technical|executive> --output <file>
perf_manual present --slides --data <dir>
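A minimal sketch of the CLI entry point, assuming Python's argparse for subcommand dispatch; only three of the subcommands are wired up, and the handler bodies are placeholders to be implemented:
#!/usr/bin/env python3
# perf_manual - CLI skeleton (sketch; handler bodies are placeholders)
import argparse

def main():
    parser = argparse.ArgumentParser(prog='perf_manual')
    sub = parser.add_subparsers(dest='command', required=True)
    p = sub.add_parser('investigate', help='guided investigation workflow')
    p.add_argument('--workload', required=True)
    p.add_argument('--symptom', required=True)
    p = sub.add_parser('use-check', help='USE method analysis')
    p.add_argument('--target', required=True, help='pid or service name')
    p = sub.add_parser('collect', help='run a data collection suite')
    p.add_argument('--suite', required=True)
    p.add_argument('--output', required=True)
    args = parser.parse_args()
    # Dispatch to real handlers here (e.g., a use-check runner)
    print(f'would dispatch: {args.command}')

if __name__ == '__main__':
    main()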
Investigation Report Structure
# Performance Investigation Report
## Executive Summary
- **Issue**: Request latency increased 3x over past week
- **Impact**: 5% of users affected, estimated $50K/day revenue impact
- **Root Cause**: Database query N+1 pattern in new feature
- **Fix**: Query optimization, estimated 80% latency reduction
- **Status**: Fix deployed, monitoring for 48 hours
## Timeline
- 2025-01-20 10:00: Latency SLO breach detected
- 2025-01-20 10:15: Investigation started
- 2025-01-20 12:30: Root cause identified
- 2025-01-20 15:00: Fix implemented in staging
- 2025-01-21 09:00: Fix deployed to production
## Technical Details
### Symptom Quantification
- Baseline (Jan 13-19): p99 = 50ms
- Current (Jan 20): p99 = 180ms
- Change: +260% (3.6x increase)
### Resource Analysis (USE Method)
| Resource | Utilization | Saturation | Errors |
|----------|-------------|------------|--------|
| CPU | 45% | Low | None |
| Memory | 62% | None | None |
| Disk | 12% | None | None |
| Database | 89% | HIGH | Timeouts |
Database saturation identified as primary bottleneck.
### Profile Evidence
[Flamegraph: before/after comparison]
Hotspot shift:
- `db_query()`: 15% → 62% of CPU time
- New function `fetch_related()` appears at 45%
### Root Cause
N+1 query pattern in `fetch_related()`:
- Main query returns 100 items
- Each item triggers 1 additional query
- 101 queries instead of 2 queries (JOIN)
### Fix Validation
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| p99 | 180ms | 42ms | -77% |
| DB queries/req | 101 | 2 | -98% |
| DB CPU | 89% | 25% | -64 pp |
## Recommendations
1. [DONE] Optimize query with JOIN
2. [PLANNED] Add N+1 query detection to CI
3. [PLANNED] Database connection pool tuning
## Appendix
- Full flamegraph: attached
- Query plans: attached
- Benchmark data: attached
Solution Architecture
Toolkit Structure
perf_manual/
├── bin/
│ └── perf_manual # Main CLI
├── lib/
│ ├── collectors/
│ │ ├── cpu.sh # CPU metrics collection
│ │ ├── memory.sh # Memory metrics
│ │ ├── disk.sh # Disk I/O
│ │ ├── network.sh # Network
│ │ └── application.sh # Custom app metrics
│ ├── profilers/
│ │ ├── flamegraph.sh # perf + flamegraph
│ │ ├── offcpu.sh # Off-CPU profiling
│ │ └── memory.sh # Memory profiling
│ ├── tracers/
│ │ ├── syscall.sh # Syscall tracing
│ │ ├── lock.sh # Lock tracing
│ │ └── io.sh # I/O tracing
│ └── analyzers/
│ ├── use_check.py # USE method analysis
│ ├── red_check.py # RED method analysis
│ ├── regression.py # Statistical analysis
│ └── root_cause.py # Evidence correlation
├── templates/
│ ├── technical_report.md # Technical report template
│ ├── executive_report.md # Executive summary template
│ └── checklist.md # Investigation checklist
└── investigations/ # Stored investigations
└── YYYY-MM-DD_<name>/
├── metadata.yaml
├── raw_data/
├── analysis/
└── report.md
Data Collection Framework
#!/bin/bash
# collect_all.sh - Comprehensive data collection
OUTPUT_DIR=${1:?usage: collect_all.sh <output_dir> <duration_sec> <pid>}
DURATION=${2:?missing duration (seconds)}
PID=${3:?missing target pid}
mkdir -p "$OUTPUT_DIR"/{cpu,memory,disk,network,profile,trace}
# CPU metrics
mpstat -P ALL 1 "$DURATION" > "$OUTPUT_DIR/cpu/mpstat.txt" &
# Memory metrics
vmstat 1 "$DURATION" > "$OUTPUT_DIR/memory/vmstat.txt" &
# Disk I/O
iostat -xz 1 "$DURATION" > "$OUTPUT_DIR/disk/iostat.txt" &
# Network
sar -n DEV 1 "$DURATION" > "$OUTPUT_DIR/network/sar_dev.txt" &
# CPU profile (flamegraph input)
perf record -F 99 -g -p "$PID" -o "$OUTPUT_DIR/profile/perf.data" -- sleep "$DURATION" &
# Syscall trace, bounded with timeout (perf trace's --duration flag filters
# per-event latency in ms, not total runtime, so it cannot limit the capture)
timeout "$DURATION" perf trace -p "$PID" 2>&1 | head -10000 > "$OUTPUT_DIR/trace/syscalls.txt" &
# Wait for all collectors
wait
# Generate flamegraph (stackcollapse-perf.pl and flamegraph.pl are from
# Brendan Gregg's FlameGraph repository and must be on PATH)
perf script -i "$OUTPUT_DIR/profile/perf.data" | \
stackcollapse-perf.pl | \
flamegraph.pl > "$OUTPUT_DIR/profile/flamegraph.svg"
echo "Collection complete: $OUTPUT_DIR"
USE Method Checker
#!/usr/bin/env python3
# use_check.py - Automated USE method analysis
import os
import subprocess

def check_cpu():
    """CPU utilization, saturation, errors"""
    # Utilization: busy % = 100 - %idle, from mpstat's final "Average" line
    result = subprocess.run(
        ['mpstat', '1', '1'], capture_output=True, text=True
    )
    idle_pct = float(result.stdout.strip().splitlines()[-1].split()[-1])
    busy_pct = 100.0 - idle_pct
    # Saturation: 1-minute load average relative to CPU count
    with open('/proc/loadavg') as f:
        load = float(f.read().split()[0])
    saturation = load / os.cpu_count()
    return {
        'resource': 'CPU',
        'utilization': f'{busy_pct:.1f}%',
        'saturation': 'HIGH' if saturation > 1.0 else 'LOW',
        'errors': 'N/A'
    }

def check_memory():
    """Memory utilization, saturation, errors"""
    # Utilization: memory used % from /proc/meminfo
    with open('/proc/meminfo') as f:
        meminfo = dict(line.split(':') for line in f)
    total = int(meminfo['MemTotal'].strip().split()[0])
    avail = int(meminfo['MemAvailable'].strip().split()[0])
    used_pct = (1 - avail / total) * 100
    # Saturation: any swap-out activity indicates memory pressure
    with open('/proc/vmstat') as f:
        vmstat = dict(line.split() for line in f)
    swap_out = int(vmstat.get('pswpout', 0))
    # Errors: OOM kills (left as an exercise: count "Out of memory"
    # lines in `dmesg` or `journalctl -k`)
    oom_count = 0
    return {
        'resource': 'Memory',
        'utilization': f'{used_pct:.1f}%',
        'saturation': 'HIGH' if swap_out > 0 else 'LOW',
        'errors': oom_count
    }

def check_disk():
    """Disk I/O utilization, saturation, errors"""
    # Two reports: the first covers time since boot, the second is current
    result = subprocess.run(
        ['iostat', '-xz', '1', '2'], capture_output=True, text=True
    )
    max_util = 0.0
    avg_queue = 0.0
    # Parse only the last report; %util is the last column and aqu-sz the
    # one before it (column layout varies across sysstat versions)
    for line in result.stdout.split('Device')[-1].splitlines():
        fields = line.split()
        if fields and fields[0].startswith(('sd', 'nvme', 'vd')):
            max_util = max(max_util, float(fields[-1]))
            avg_queue = max(avg_queue, float(fields[-2]))
    return {
        'resource': 'Disk',
        'utilization': f'{max_util:.1f}%',
        'saturation': 'HIGH' if avg_queue > 1 else 'LOW',
        'errors': 'N/A'  # device errors: check kernel logs or smartctl
    }

def check_network():
    """Network utilization, saturation, errors"""
    # Errors: receive-drop counters from /proc/net/dev; utilization needs
    # the link speed, so it is not computed here (see sar -n DEV)
    drops = 0
    with open('/proc/net/dev') as f:
        for line in f.readlines()[2:]:  # skip the two header lines
            drops += int(line.split()[4])  # rx drop column
    return {
        'resource': 'Network',
        'utilization': 'N/A',
        'saturation': 'N/A',
        'errors': drops
    }

def run_use_check():
    """Run complete USE check"""
    results = [
        check_cpu(),
        check_memory(),
        check_disk(),
        check_network(),
    ]
    print("\n=== USE Method Analysis ===\n")
    print(f"{'Resource':<12} {'Utilization':<15} {'Saturation':<12} {'Errors':<10}")
    print("-" * 50)
    for r in results:
        has_errors = r['errors'] not in ('N/A', 0)
        status = '⚠️' if r['saturation'] == 'HIGH' or has_errors else '✓'
        print(f"{r['resource']:<12} {r['utilization']:<15} "
              f"{r['saturation']:<12} {str(r['errors']):<10} {status}")

if __name__ == '__main__':
    run_use_check()
Phased Implementation Guide
Phase 1: Data Collection Framework (Week 1)
Goal: Comprehensive, reproducible data collection.
Steps:
- Create modular collection scripts per resource
- Implement unified collection runner
- Design output directory structure
- Add metadata capture (timestamp, config, version; see the sketch below)
- Test with sample workload
Validation: All collectors run, data parseable.
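As a sketch of the metadata-capture step, a small bootstrap script can create the investigation layout from the Toolkit Structure section; the field names (workload, host, kernel) are illustrative, not a fixed schema:
#!/usr/bin/env python3
# new_investigation.py - bootstrap an investigation directory (sketch)
import datetime
import os
import platform
import sys

def new_investigation(name, workload):
    date = datetime.date.today().isoformat()
    root = os.path.join('investigations', f'{date}_{name}')
    for sub in ('raw_data', 'analysis'):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    # Minimal YAML written by hand to avoid a PyYAML dependency
    with open(os.path.join(root, 'metadata.yaml'), 'w') as f:
        f.write(f'name: {name}\n')
        f.write(f'workload: {workload}\n')
        f.write(f'started: {datetime.datetime.now().isoformat()}\n')
        f.write(f'host: {platform.node()}\n')
        f.write(f'kernel: {platform.release()}\n')
    return root

if __name__ == '__main__':
    print(new_investigation(sys.argv[1], sys.argv[2]))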
Phase 2: USE/RED Automation (Week 2)
Goal: Automated methodology checklists.
Steps:
- Implement USE checker for all resources
- Implement RED checker for HTTP endpoints (see the sketch below)
- Create severity classification
- Generate checklist reports
- Identify bottlenecks automatically
Validation: Correctly identifies simulated bottlenecks.
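A minimal RED checker sketch using only the standard library. It probes the endpoint serially, so it measures synthetic request latency rather than real traffic; the URL and duration in the example are assumptions:
#!/usr/bin/env python3
# red_check.py - minimal RED check against one HTTP endpoint (sketch)
import statistics
import time
import urllib.error
import urllib.request

def red_check(url, duration_sec):
    latencies, errors, count = [], 0, 0
    deadline = time.monotonic() + duration_sec
    while time.monotonic() < deadline:
        start = time.monotonic()
        try:
            urllib.request.urlopen(url, timeout=5).read()
        except (urllib.error.URLError, OSError):
            errors += 1  # HTTP 4xx/5xx raise HTTPError, a URLError subclass
        latencies.append((time.monotonic() - start) * 1000)
        count += 1
    rate = count / duration_sec
    if len(latencies) >= 2:
        p99 = statistics.quantiles(latencies, n=100)[98]
    else:
        p99 = latencies[0] if latencies else 0.0
    print(f'Rate: {rate:.1f} req/s  '
          f'Errors: {errors / duration_sec:.2f}/s  p99: {p99:.1f} ms')

if __name__ == '__main__':
    red_check('http://localhost:8080/health', duration_sec=10)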
Phase 3: Profile Integration (Week 3)
Goal: Unified profiling with evidence collection.
Steps:
- Integrate CPU profiling (flamegraph)
- Add off-CPU profiling
- Implement differential flamegraph (see the sketch below)
- Add memory profiling
- Link profiles to investigation
Validation: Profiles captured and accessible.
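A sketch of the differential-flamegraph step, assuming the FlameGraph scripts (stackcollapse-perf.pl, difffolded.pl, flamegraph.pl) from Brendan Gregg's FlameGraph repository are on PATH:
#!/usr/bin/env python3
# diff_flamegraph.py - differential flamegraph from two perf.data files
import subprocess

def fold(perf_data, out_folded):
    # Collapse perf script output into "stack count" lines
    script = subprocess.run(['perf', 'script', '-i', perf_data],
                            capture_output=True, text=True, check=True)
    folded = subprocess.run(['stackcollapse-perf.pl'], input=script.stdout,
                            capture_output=True, text=True, check=True)
    with open(out_folded, 'w') as f:
        f.write(folded.stdout)

def diff_flamegraph(before, after, out_svg):
    fold(before, 'before.folded')
    fold(after, 'after.folded')
    # difffolded.pl merges both counts; flamegraph.pl shades growth red
    diff = subprocess.run(['difffolded.pl', 'before.folded', 'after.folded'],
                          capture_output=True, text=True, check=True)
    svg = subprocess.run(['flamegraph.pl'], input=diff.stdout,
                         capture_output=True, text=True, check=True)
    with open(out_svg, 'w') as f:
        f.write(svg.stdout)

if __name__ == '__main__':
    diff_flamegraph('before/perf.data', 'after/perf.data', 'diff.svg')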
Phase 4: Analysis and Reporting (Week 4)
Goal: Structured analysis with actionable reports.
Steps:
- Create report templates (technical, executive; see the sketch below)
- Implement automated analysis summary
- Add root cause hypothesis tracking
- Generate before/after comparisons
- Build evidence linkage
Validation: Report accurately summarizes investigation.
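Report generation can start as simple template filling; a sketch using Python's string.Template, where placeholder names such as $issue and $root_cause are illustrative and would need to match the templates:
#!/usr/bin/env python3
# render_report.py - fill a report template with findings (sketch)
from string import Template

def render_report(template_path, output_path, findings):
    with open(template_path) as f:
        template = Template(f.read())
    with open(output_path, 'w') as f:
        # safe_substitute leaves unknown placeholders intact instead of raising
        f.write(template.safe_substitute(findings))

if __name__ == '__main__':
    render_report('templates/technical_report.md', 'report.md', {
        'issue': 'p99 latency regression',
        'root_cause': 'N+1 query pattern in fetch_related()',
        'fix': 'replace per-item queries with a JOIN',
    })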
Phase 5: Investigation Management (Week 5+)
Goal: Persistent investigation tracking.
Steps:
- Create investigation storage structure
- Implement timeline tracking
- Add hypothesis/evidence correlation
- Build search/retrieval for past investigations (see the sketch below)
- Export to knowledge base
Validation: Can retrieve and reference past investigations.
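A naive keyword search over stored investigations is enough to start; a fuller toolkit might index by symptom tags (sketch):
#!/usr/bin/env python3
# search_investigations.py - keyword search over past investigations (sketch)
import os
import sys

def search(keyword, root='investigations'):
    for name in sorted(os.listdir(root)):
        for fname in ('metadata.yaml', 'report.md'):
            path = os.path.join(root, name, fname)
            if not os.path.exists(path):
                continue
            with open(path) as f:
                if keyword.lower() in f.read().lower():
                    print(name)
                    break  # one hit per investigation is enough

if __name__ == '__main__':
    search(sys.argv[1])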
Testing Strategy
Synthetic Scenarios
1. CPU Bottleneck
// Infinite loop consuming CPU
while (1) { volatile int x = 0; for (int i = 0; i < 10000000; i++) x++; }
Expected: USE shows high CPU utilization.
2. Memory Pressure
// Allocate and touch pages until swapping (untouched pages are never backed)
while (1) { char *p = malloc(1024*1024); if (p) memset(p, 1, 1024*1024); }
Expected: USE shows memory saturation (swapping).
3. I/O Bottleneck
// Synchronous writes
while(1) { write(fd, data, 4096); fsync(fd); }
Expected: USE shows disk saturation.
4. Contention
// All threads fighting for one lock
pthread_mutex_lock(&global_lock);
work();
pthread_mutex_unlock(&global_lock);
Expected: Profile shows lock contention.
End-to-End Tests
- Complete investigation: Symptom → Root cause → Fix
- Report generation: Data → Technical report → Executive summary
- Evidence linking: Profile → Hypothesis → Validation
Common Pitfalls and Debugging
Pitfall 1: Analysis Paralysis
Symptom: Collecting data forever, never concluding.
Solution: Time-box each phase:
- 30 min: Quantify symptom
- 1 hour: USE/RED check
- 2 hours: Profile and drill down
- If no progress, escalate or document uncertainty
Pitfall 2: Confirmation Bias
Symptom: Finding evidence for pre-existing belief.
Solution:
- Collect data before forming hypothesis
- Actively seek disconfirming evidence
- Have peer review findings
Pitfall 3: Fixing Symptoms, Not Causes
Symptom: Problem recurs after “fix”.
Solution:
- Ask “why” 5 times (root cause analysis)
- Verify fix addresses root cause
- Monitor for regression
Pitfall 4: Missing the Forest for the Trees
Symptom: Optimizing 1% of runtime.
Solution:
- Always start with highest-impact opportunity
- Calculate the theoretical maximum improvement (see Amdahl's law below)
- Stop when diminishing returns
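Amdahl's law makes that ceiling concrete: speeding up a fraction p of total runtime by a factor s yields an overall speedup of 1 / ((1 - p) + p / s). Optimizing 1% of runtime (p = 0.01) caps the gain at about 1.01x even as s approaches infinity, which is why profiling must precede optimizing.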
Extensions and Challenges
Extension 1: Automated Incident Response
Integrate with monitoring:
- Alert triggers investigation
- Automatic data collection
- Initial analysis before human involvement
Extension 2: Knowledge Base
Build searchable repository:
- Past investigations indexed
- Similar symptoms matched
- Solutions suggested
Extension 3: Performance Review Process
Integrate with development:
- Pre-merge performance check
- Post-deploy monitoring
- Quarterly performance review
Challenge: Real Incident
Apply full methodology to production incident:
- Time-bounded (SLA pressure)
- Incomplete information
- Stakeholder communication
- Post-mortem documentation
Real-World Connections
Industry Methodologies
- Google’s Data-Driven Approach: Everything measured, decisions justified
- Facebook’s Performance Culture: Continuous profiling, efficiency focus
- Netflix’s Chaos Engineering: Proactive performance testing
- Amazon’s Two-Pizza Teams: Ownership of service performance
Career Application
This project demonstrates:
- Systematic problem-solving
- Communication at multiple levels
- Evidence-based decision making
- Tool building and automation
Self-Assessment Checklist
Before considering this project complete, verify:
- You can execute complete investigation methodology
- USE/RED checks work automatically
- Data collection is comprehensive and reproducible
- Reports are clear and actionable
- Past investigations are searchable
- You’ve practiced on real or realistic scenarios
- Others can use your toolkit
Resources
Essential Reading
- “Systems Performance” by Brendan Gregg (entire book)
- “Performance Analysis and Tuning on Modern CPUs” by Bakhvalov
- “The Linux Programming Interface” by Kerrisk
Methodology
- USE Method: Brendan Gregg’s blog
- RED Method: Tom Wilkie’s talk
- Google SRE Book: Chapter 25 (Data Processing Pipelines)
Tools
- perf: Linux performance tool
- bpftrace: eBPF tracing
- Flamegraph: Visualization
- Grafana: Dashboarding
Communication
- “Thinking, Fast and Slow” by Kahneman (decision making)
- “The Pyramid Principle” by Minto (structured communication)