Project 11: Full-Stack Performance Engineering Field Manual
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Expert |
| Time Estimate | 1 month+ |
| Primary Language | C |
| Alternative Languages | Rust, Go, C++ |
| Knowledge Area | Performance Engineering Systems |
| Tools Required | perf, flamegraphs, tracing tools, GDB |
| Primary Reference | “Systems Performance” by Brendan Gregg |
Learning Objectives
By completing this project, you will be able to:
- Execute a complete performance investigation from symptom to root cause
- Apply a systematic methodology to any performance problem
- Produce actionable reports for stakeholders at different levels
- Build a reusable toolkit for future investigations
- Train others in performance engineering practices
- Make defensible optimization decisions with evidence
Deep Theoretical Foundation
The USE Method
Brendan Gregg’s USE method provides systematic coverage:
- Utilization: How busy is the resource?
- Saturation: How much work is queued?
- Errors: Are there any errors?
Apply to each resource:
┌─────────────────────────────────────────────────────────────────┐
│ Resource │ Utilization │ Saturation │ Errors │
├─────────────┼──────────────────┼────────────────┼───────────────┤
│ CPU │ CPU% per core │ Run queue len │ N/A │
│ Memory │ Memory used % │ Swapping rate │ OOM events │
│ Disk I/O │ I/O busy % │ I/O queue len │ Device errors │
│ Network │ Bandwidth used % │ Socket backlog │ Dropped pkts │
│ Mutex │ Hold time % │ Waiter count │ Deadlocks │
└─────────────┴──────────────────┴────────────────┴───────────────┘
The RED Method
For services (request-driven systems):
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Time per request (latency distribution)
Performance Investigation Workflow
┌─────────────────────┐
│ Symptom Reported │
│ "System is slow" │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Quantify the Problem│
│ Baseline vs Current │
└──────────┬──────────┘
│
▼
┌──────────────────┴──────────────────┐
│ │
▼ ▼
┌───────────────────────┐ ┌───────────────────────┐
│ Global Analysis │ │ Resource Analysis │
│ (USE/RED metrics) │ │ (bottleneck hunting) │
└───────────┬───────────┘ └───────────┬───────────┘
│ │
└──────────────────┬──────────────────┘
│
▼
┌─────────────────────┐
│ Drill Down │
│ Profile + Trace │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Root Cause │
│ Code/Config/HW │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Fix + Validate │
│ Prove improvement │
└─────────────────────┘
The Performance Checklist
Before any investigation:
□ What is the symptom? (latency, throughput, resource usage)
□ When did it start? (time correlation with changes)
□ How is it measured? (metrics, methodology)
□ What is "normal"? (baseline comparison)
□ What has changed? (code, config, traffic, hardware)
□ Who is affected? (all users, subset, specific operations)
□ What is the business impact? (SLA, revenue, experience)
Complete Project Specification
What You’re Building
A comprehensive performance toolkit called perf_manual that:
- Executes systematic methodology through guided workflows
- Collects all relevant data (metrics, profiles, traces)
- Produces structured reports for technical and non-technical audiences
- Stores investigation artifacts for future reference
- Tracks optimization experiments with before/after data
Functional Requirements
# Investigation workflow
perf_manual investigate --workload <name> --symptom "<description>"
perf_manual use-check --target <pid|service>
perf_manual red-check --endpoint <url> --duration <sec>
# Data collection
perf_manual collect --suite <name> --output <dir>
perf_manual profile --pid <pid> --duration <sec>
perf_manual trace --events <list> --pid <pid>
# Analysis
perf_manual analyze --data <dir> --output <report.md>
perf_manual diff --before <dir> --after <dir>
perf_manual root-cause --hypothesis "<text>" --evidence <dir>
# Reporting
perf_manual report --format <technical|executive> --output <file>
perf_manual present --slides --data <dir>
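A minimal sketch of the CLI entry point, assuming Python's argparse for subcommand dispatch; only three of the subcommands are wired up, and the handler bodies are placeholders to be implemented:
#!/usr/bin/env python3
# perf_manual - CLI skeleton (sketch; handler bodies are placeholders)
import argparse

def main():
    parser = argparse.ArgumentParser(prog='perf_manual')
    sub = parser.add_subparsers(dest='command', required=True)
    p = sub.add_parser('investigate', help='guided investigation workflow')
    p.add_argument('--workload', required=True)
    p.add_argument('--symptom', required=True)
    p = sub.add_parser('use-check', help='USE method analysis')
    p.add_argument('--target', required=True, help='pid or service name')
    p = sub.add_parser('collect', help='run a data collection suite')
    p.add_argument('--suite', required=True)
    p.add_argument('--output', required=True)
    args = parser.parse_args()
    # Dispatch to real handlers here (e.g., a use-check runner)
    print(f'would dispatch: {args.command}')

if __name__ == '__main__':
    main()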
Investigation Report Structure
# Performance Investigation Report
## Executive Summary
- **Issue**: Request latency increased 3x over past week
- **Impact**: 5% of users affected, estimated $50K/day revenue impact
- **Root Cause**: Database query N+1 pattern in new feature
- **Fix**: Query optimization, estimated 80% latency reduction
- **Status**: Fix deployed, monitoring for 48 hours
## Timeline
- 2025-01-20 10:00: Latency SLO breach detected
- 2025-01-20 10:15: Investigation started
- 2025-01-20 12:30: Root cause identified
- 2025-01-20 15:00: Fix implemented in staging
- 2025-01-21 09:00: Fix deployed to production
## Technical Details
### Symptom Quantification
- Baseline (Jan 13-19): p99 = 50ms
- Current (Jan 20): p99 = 180ms
- Change: +260% (3.6x increase)
### Resource Analysis (USE Method)
| Resource | Utilization | Saturation | Errors |
|----------|-------------|------------|--------|
| CPU | 45% | Low | None |
| Memory | 62% | None | None |
| Disk | 12% | None | None |
| Database | 89% | HIGH | Timeouts |
Database saturation identified as primary bottleneck.
### Profile Evidence
[Flamegraph: before/after comparison]
Hotspot shift:
- `db_query()`: 15% → 62% of CPU time
- New function `fetch_related()` appears at 45%
### Root Cause
N+1 query pattern in `fetch_related()`:
- Main query returns 100 items
- Each item triggers 1 additional query
- 101 queries instead of 2 queries (JOIN)
### Fix Validation
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| p99 | 180ms | 42ms | -77% |
| DB queries/req | 101 | 2 | -98% |
| DB CPU | 89% | 25% | -64 pp |
## Recommendations
1. [DONE] Optimize query with JOIN
2. [PLANNED] Add N+1 query detection to CI
3. [PLANNED] Database connection pool tuning
## Appendix
- Full flamegraph: attached
- Query plans: attached
- Benchmark data: attached
Solution Architecture
Toolkit Structure
perf_manual/
├── bin/
│ └── perf_manual # Main CLI
├── lib/
│ ├── collectors/
│ │ ├── cpu.sh # CPU metrics collection
│ │ ├── memory.sh # Memory metrics
│ │ ├── disk.sh # Disk I/O
│ │ ├── network.sh # Network
│ │ └── application.sh # Custom app metrics
│ ├── profilers/
│ │ ├── flamegraph.sh # perf + flamegraph
│ │ ├── offcpu.sh # Off-CPU profiling
│ │ └── memory.sh # Memory profiling
│ ├── tracers/
│ │ ├── syscall.sh # Syscall tracing
│ │ ├── lock.sh # Lock tracing
│ │ └── io.sh # I/O tracing
│ └── analyzers/
│ ├── use_check.py # USE method analysis
│ ├── red_check.py # RED method analysis
│ ├── regression.py # Statistical analysis
│ └── root_cause.py # Evidence correlation
├── templates/
│ ├── technical_report.md # Technical report template
│ ├── executive_report.md # Executive summary template
│ └── checklist.md # Investigation checklist
└── investigations/ # Stored investigations
└── YYYY-MM-DD_<name>/
├── metadata.yaml
├── raw_data/
├── analysis/
└── report.md
Data Collection Framework
#!/bin/bash
# collect_all.sh - Comprehensive data collection
OUTPUT_DIR=${1:?usage: collect_all.sh <output_dir> <duration_sec> <pid>}
DURATION=${2:?missing duration (seconds)}
PID=${3:?missing target pid}
mkdir -p "$OUTPUT_DIR"/{cpu,memory,disk,network,profile,trace}
# CPU metrics
mpstat -P ALL 1 "$DURATION" > "$OUTPUT_DIR/cpu/mpstat.txt" &
# Memory metrics
vmstat 1 "$DURATION" > "$OUTPUT_DIR/memory/vmstat.txt" &
# Disk I/O
iostat -xz 1 "$DURATION" > "$OUTPUT_DIR/disk/iostat.txt" &
# Network
sar -n DEV 1 "$DURATION" > "$OUTPUT_DIR/network/sar_dev.txt" &
# CPU profile (flamegraph input)
perf record -F 99 -g -p "$PID" -o "$OUTPUT_DIR/profile/perf.data" -- sleep "$DURATION" &
# Syscall trace, bounded with timeout (perf trace's --duration flag filters
# per-event latency in ms, not total runtime, so it cannot limit the capture)
timeout "$DURATION" perf trace -p "$PID" 2>&1 | head -10000 > "$OUTPUT_DIR/trace/syscalls.txt" &
# Wait for all collectors
wait
# Generate flamegraph (stackcollapse-perf.pl and flamegraph.pl are from
# Brendan Gregg's FlameGraph repository and must be on PATH)
perf script -i "$OUTPUT_DIR/profile/perf.data" | \
stackcollapse-perf.pl | \
flamegraph.pl > "$OUTPUT_DIR/profile/flamegraph.svg"
echo "Collection complete: $OUTPUT_DIR"
USE Method Checker
#!/usr/bin/env python3
# use_check.py - Automated USE method analysis
import os
import subprocess

def check_cpu():
    """CPU utilization, saturation, errors"""
    # Utilization: busy % = 100 - %idle, from mpstat's final "Average" line
    result = subprocess.run(
        ['mpstat', '1', '1'], capture_output=True, text=True
    )
    idle_pct = float(result.stdout.strip().splitlines()[-1].split()[-1])
    busy_pct = 100.0 - idle_pct
    # Saturation: 1-minute load average relative to CPU count
    with open('/proc/loadavg') as f:
        load = float(f.read().split()[0])
    saturation = load / os.cpu_count()
    return {
        'resource': 'CPU',
        'utilization': f'{busy_pct:.1f}%',
        'saturation': 'HIGH' if saturation > 1.0 else 'LOW',
        'errors': 'N/A'
    }

def check_memory():
    """Memory utilization, saturation, errors"""
    # Utilization: memory used % from /proc/meminfo
    with open('/proc/meminfo') as f:
        meminfo = dict(line.split(':') for line in f)
    total = int(meminfo['MemTotal'].strip().split()[0])
    avail = int(meminfo['MemAvailable'].strip().split()[0])
    used_pct = (1 - avail / total) * 100
    # Saturation: any swap-out activity indicates memory pressure
    with open('/proc/vmstat') as f:
        vmstat = dict(line.split() for line in f)
    swap_out = int(vmstat.get('pswpout', 0))
    # Errors: OOM kills (left as an exercise: count "Out of memory"
    # lines in `dmesg` or `journalctl -k`)
    oom_count = 0
    return {
        'resource': 'Memory',
        'utilization': f'{used_pct:.1f}%',
        'saturation': 'HIGH' if swap_out > 0 else 'LOW',
        'errors': oom_count
    }

def check_disk():
    """Disk I/O utilization, saturation, errors"""
    # Two reports: the first covers time since boot, the second is current
    result = subprocess.run(
        ['iostat', '-xz', '1', '2'], capture_output=True, text=True
    )
    max_util = 0.0
    avg_queue = 0.0
    # Parse only the last report; %util is the last column and aqu-sz the
    # one before it (column layout varies across sysstat versions)
    for line in result.stdout.split('Device')[-1].splitlines():
        fields = line.split()
        if fields and fields[0].startswith(('sd', 'nvme', 'vd')):
            max_util = max(max_util, float(fields[-1]))
            avg_queue = max(avg_queue, float(fields[-2]))
    return {
        'resource': 'Disk',
        'utilization': f'{max_util:.1f}%',
        'saturation': 'HIGH' if avg_queue > 1 else 'LOW',
        'errors': 'N/A'  # device errors: check kernel logs or smartctl
    }

def check_network():
    """Network utilization, saturation, errors"""
    # Errors: receive-drop counters from /proc/net/dev; utilization needs
    # the link speed, so it is not computed here (see sar -n DEV)
    drops = 0
    with open('/proc/net/dev') as f:
        for line in f.readlines()[2:]:  # skip the two header lines
            drops += int(line.split()[4])  # rx drop column
    return {
        'resource': 'Network',
        'utilization': 'N/A',
        'saturation': 'N/A',
        'errors': drops
    }

def run_use_check():
    """Run complete USE check"""
    results = [
        check_cpu(),
        check_memory(),
        check_disk(),
        check_network(),
    ]
    print("\n=== USE Method Analysis ===\n")
    print(f"{'Resource':<12} {'Utilization':<15} {'Saturation':<12} {'Errors':<10}")
    print("-" * 50)
    for r in results:
        has_errors = r['errors'] not in ('N/A', 0)
        status = '⚠️' if r['saturation'] == 'HIGH' or has_errors else '✓'
        print(f"{r['resource']:<12} {r['utilization']:<15} "
              f"{r['saturation']:<12} {str(r['errors']):<10} {status}")

if __name__ == '__main__':
    run_use_check()
Phased Implementation Guide
Phase 1: Data Collection Framework (Week 1)
Goal: Comprehensive, reproducible data collection.
Steps:
- Create modular collection scripts per resource
- Implement unified collection runner
- Design output directory structure
- Add metadata capture (timestamp, config, version; see the sketch below)
- Test with sample workload
Validation: All collectors run, data parseable.
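As a sketch of the metadata-capture step, a small bootstrap script can create the investigation layout from the Toolkit Structure section; the field names (workload, host, kernel) are illustrative, not a fixed schema:
#!/usr/bin/env python3
# new_investigation.py - bootstrap an investigation directory (sketch)
import datetime
import os
import platform
import sys

def new_investigation(name, workload):
    date = datetime.date.today().isoformat()
    root = os.path.join('investigations', f'{date}_{name}')
    for sub in ('raw_data', 'analysis'):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    # Minimal YAML written by hand to avoid a PyYAML dependency
    with open(os.path.join(root, 'metadata.yaml'), 'w') as f:
        f.write(f'name: {name}\n')
        f.write(f'workload: {workload}\n')
        f.write(f'started: {datetime.datetime.now().isoformat()}\n')
        f.write(f'host: {platform.node()}\n')
        f.write(f'kernel: {platform.release()}\n')
    return root

if __name__ == '__main__':
    print(new_investigation(sys.argv[1], sys.argv[2]))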
Phase 2: USE/RED Automation (Week 2)
Goal: Automated methodology checklists.
Steps:
- Implement USE checker for all resources
- Implement RED checker for HTTP endpoints (see the sketch below)
- Create severity classification
- Generate checklist reports
- Identify bottlenecks automatically
Validation: Correctly identifies simulated bottlenecks.
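A minimal RED checker sketch using only the standard library. It probes the endpoint serially, so it measures synthetic request latency rather than real traffic; the URL and duration in the example are assumptions:
#!/usr/bin/env python3
# red_check.py - minimal RED check against one HTTP endpoint (sketch)
import statistics
import time
import urllib.error
import urllib.request

def red_check(url, duration_sec):
    latencies, errors, count = [], 0, 0
    deadline = time.monotonic() + duration_sec
    while time.monotonic() < deadline:
        start = time.monotonic()
        try:
            urllib.request.urlopen(url, timeout=5).read()
        except (urllib.error.URLError, OSError):
            errors += 1  # HTTP 4xx/5xx raise HTTPError, a URLError subclass
        latencies.append((time.monotonic() - start) * 1000)
        count += 1
    rate = count / duration_sec
    if len(latencies) >= 2:
        p99 = statistics.quantiles(latencies, n=100)[98]
    else:
        p99 = latencies[0] if latencies else 0.0
    print(f'Rate: {rate:.1f} req/s  '
          f'Errors: {errors / duration_sec:.2f}/s  p99: {p99:.1f} ms')

if __name__ == '__main__':
    red_check('http://localhost:8080/health', duration_sec=10)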
Phase 3: Profile Integration (Week 3)
Goal: Unified profiling with evidence collection.
Steps:
- Integrate CPU profiling (flamegraph)
- Add off-CPU profiling
- Implement differential flamegraph (see the sketch below)
- Add memory profiling
- Link profiles to investigation
Validation: Profiles captured and accessible.
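A sketch of the differential-flamegraph step, assuming the FlameGraph scripts (stackcollapse-perf.pl, difffolded.pl, flamegraph.pl) from Brendan Gregg's FlameGraph repository are on PATH:
#!/usr/bin/env python3
# diff_flamegraph.py - differential flamegraph from two perf.data files
import subprocess

def fold(perf_data, out_folded):
    # Collapse perf script output into "stack count" lines
    script = subprocess.run(['perf', 'script', '-i', perf_data],
                            capture_output=True, text=True, check=True)
    folded = subprocess.run(['stackcollapse-perf.pl'], input=script.stdout,
                            capture_output=True, text=True, check=True)
    with open(out_folded, 'w') as f:
        f.write(folded.stdout)

def diff_flamegraph(before, after, out_svg):
    fold(before, 'before.folded')
    fold(after, 'after.folded')
    # difffolded.pl merges both counts; flamegraph.pl shades growth red
    diff = subprocess.run(['difffolded.pl', 'before.folded', 'after.folded'],
                          capture_output=True, text=True, check=True)
    svg = subprocess.run(['flamegraph.pl'], input=diff.stdout,
                         capture_output=True, text=True, check=True)
    with open(out_svg, 'w') as f:
        f.write(svg.stdout)

if __name__ == '__main__':
    diff_flamegraph('before/perf.data', 'after/perf.data', 'diff.svg')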
Phase 4: Analysis and Reporting (Week 4)
Goal: Structured analysis with actionable reports.
Steps:
- Create report templates (technical, executive; see the sketch below)
- Implement automated analysis summary
- Add root cause hypothesis tracking
- Generate before/after comparisons
- Build evidence linkage
Validation: Report accurately summarizes investigation.
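Report generation can start as simple template filling; a sketch using Python's string.Template, where placeholder names such as $issue and $root_cause are illustrative and would need to match the templates:
#!/usr/bin/env python3
# render_report.py - fill a report template with findings (sketch)
from string import Template

def render_report(template_path, output_path, findings):
    with open(template_path) as f:
        template = Template(f.read())
    with open(output_path, 'w') as f:
        # safe_substitute leaves unknown placeholders intact instead of raising
        f.write(template.safe_substitute(findings))

if __name__ == '__main__':
    render_report('templates/technical_report.md', 'report.md', {
        'issue': 'p99 latency regression',
        'root_cause': 'N+1 query pattern in fetch_related()',
        'fix': 'replace per-item queries with a JOIN',
    })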
Phase 5: Investigation Management (Week 5+)
Goal: Persistent investigation tracking.
Steps:
- Create investigation storage structure
- Implement timeline tracking
- Add hypothesis/evidence correlation
- Build search/retrieval for past investigations (see the sketch below)
- Export to knowledge base
Validation: Can retrieve and reference past investigations.
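A naive keyword search over stored investigations is enough to start; a fuller toolkit might index by symptom tags (sketch):
#!/usr/bin/env python3
# search_investigations.py - keyword search over past investigations (sketch)
import os
import sys

def search(keyword, root='investigations'):
    for name in sorted(os.listdir(root)):
        for fname in ('metadata.yaml', 'report.md'):
            path = os.path.join(root, name, fname)
            if not os.path.exists(path):
                continue
            with open(path) as f:
                if keyword.lower() in f.read().lower():
                    print(name)
                    break  # one hit per investigation is enough

if __name__ == '__main__':
    search(sys.argv[1])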
Testing Strategy
Synthetic Scenarios
1. CPU Bottleneck
// Infinite loop consuming CPU
while (1) { volatile int x = 0; for (int i = 0; i < 10000000; i++) x++; }
Expected: USE shows high CPU utilization.
2. Memory Pressure
// Allocate and touch pages until swapping (untouched pages are never backed)
while (1) { char *p = malloc(1024*1024); if (p) memset(p, 1, 1024*1024); }
Expected: USE shows memory saturation (swapping).
3. I/O Bottleneck
// Synchronous writes
while(1) { write(fd, data, 4096); fsync(fd); }
Expected: USE shows disk saturation.
4. Contention
// All threads fighting for one lock
pthread_mutex_lock(&global_lock);
work();
pthread_mutex_unlock(&global_lock);
Expected: Profile shows lock contention.
End-to-End Tests
- Complete investigation: Symptom → Root cause → Fix
- Report generation: Data → Technical report → Executive summary
- Evidence linking: Profile → Hypothesis → Validation
Common Pitfalls and Debugging
Pitfall 1: Analysis Paralysis
Symptom: Collecting data forever, never concluding.
Solution: Time-box each phase:
- 30 min: Quantify symptom
- 1 hour: USE/RED check
- 2 hours: Profile and drill down
- If no progress, escalate or document uncertainty
Pitfall 2: Confirmation Bias
Symptom: Finding evidence for pre-existing belief.
Solution:
- Collect data before forming hypothesis
- Actively seek disconfirming evidence
- Have peer review findings
Pitfall 3: Fixing Symptoms, Not Causes
Symptom: Problem recurs after “fix”.
Solution:
- Ask “why” 5 times (root cause analysis)
- Verify fix addresses root cause
- Monitor for regression
Pitfall 4: Missing the Forest for the Trees
Symptom: Optimizing 1% of runtime.
Solution:
- Always start with highest-impact opportunity
- Calculate the theoretical maximum improvement (see Amdahl's law below)
- Stop when diminishing returns
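Amdahl's law makes that ceiling concrete: speeding up a fraction p of total runtime by a factor s yields an overall speedup of 1 / ((1 - p) + p / s). Optimizing 1% of runtime (p = 0.01) caps the gain at about 1.01x even as s approaches infinity, which is why profiling must precede optimizing.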
Extensions and Challenges
Extension 1: Automated Incident Response
Integrate with monitoring:
- Alert triggers investigation
- Automatic data collection
- Initial analysis before human involvement
Extension 2: Knowledge Base
Build searchable repository:
- Past investigations indexed
- Similar symptoms matched
- Solutions suggested
Extension 3: Performance Review Process
Integrate with development:
- Pre-merge performance check
- Post-deploy monitoring
- Quarterly performance review
Challenge: Real Incident
Apply full methodology to production incident:
- Time-bounded (SLA pressure)
- Incomplete information
- Stakeholder communication
- Post-mortem documentation
Real-World Connections
Industry Methodologies
- Google’s Data-Driven Approach: Everything measured, decisions justified
- Facebook’s Performance Culture: Continuous profiling, efficiency focus
- Netflix’s Chaos Engineering: Proactive performance testing
- Amazon’s Two-Pizza Teams: Ownership of service performance
Career Application
This project demonstrates:
- Systematic problem-solving
- Communication at multiple levels
- Evidence-based decision making
- Tool building and automation
Self-Assessment Checklist
Before considering this project complete, verify:
- You can execute complete investigation methodology
- USE/RED checks work automatically
- Data collection is comprehensive and reproducible
- Reports are clear and actionable
- Past investigations are searchable
- You’ve practiced on real or realistic scenarios
- Others can use your toolkit
Resources
Essential Reading
- “Systems Performance” by Brendan Gregg (entire book)
- “Performance Analysis and Tuning on Modern CPUs” by Bakhvalov
- “The Linux Programming Interface” by Kerrisk
Methodology
- USE Method: Brendan Gregg’s blog
- RED Method: Tom Wilkie’s talk
- Google SRE Book: Chapter 25 (Data Processing Pipelines)
Tools
- perf: Linux performance tool
- bpftrace: eBPF tracing
- Flamegraph: Visualization
- Grafana: Dashboarding
Communication
- “Thinking, Fast and Slow” by Kahneman (decision making)
- “The Pyramid Principle” by Minto (structured communication)