Project 11: Full-Stack Performance Engineering Field Manual

Project Overview

Attribute              Details
---------------------  -----------------------------------------
Difficulty             Expert
Time Estimate          1 month+
Primary Language       C
Alternative Languages  Rust, Go, C++
Knowledge Area         Performance Engineering Systems
Tools Required         perf, flamegraphs, tracing tools, GDB
Primary Reference      "Systems Performance" by Brendan Gregg

Learning Objectives

By completing this project, you will be able to:

  1. Execute a complete performance investigation from symptom to root cause
  2. Apply systematic methodology for any performance problem
  3. Produce actionable reports for stakeholders at different levels
  4. Build a reusable toolkit for future investigations
  5. Train others in performance engineering practices
  6. Make defensible optimization decisions with evidence

Deep Theoretical Foundation

The USE Method

Brendan Gregg's USE method provides systematic coverage of every system resource:

Utilization: How busy is the resource?
Saturation: How much work is queued?
Errors: Are there any errors?

Apply to each resource:

┌─────────────┬──────────────────┬────────────────┬───────────────┐
│ Resource    │ Utilization      │ Saturation     │ Errors        │
├─────────────┼──────────────────┼────────────────┼───────────────┤
│ CPU         │ CPU% per core    │ Run queue len  │ N/A           │
│ Memory      │ Memory used %    │ Swapping rate  │ OOM events    │
│ Disk I/O    │ I/O busy %       │ I/O queue len  │ Device errors │
│ Network     │ Bandwidth used % │ Socket backlog │ Dropped pkts  │
│ Mutex       │ Hold time %      │ Waiter count   │ Deadlocks     │
└─────────────┴──────────────────┴────────────────┴───────────────┘

The RED Method

For services (request-driven systems):

Rate: Requests per second
Errors: Failed requests per second
Duration: Time per request (latency distribution)
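
These three signals can be computed directly from raw request records. A minimal sketch, assuming access-log-style (status, latency) tuples collected over a known window; the helper name and record shape are illustrative, not part of the toolkit yet:

#!/usr/bin/env python3
# red_check.py - minimal RED summary over (status, latency_ms) records
import statistics

def red_summary(requests, window_sec):
    """requests: list of (http_status, latency_ms) seen during window_sec."""
    latencies = [lat for _, lat in requests]
    errors = sum(1 for status, _ in requests if status >= 500)
    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        'rate_rps': len(requests) / window_sec,   # Rate
        'error_rps': errors / window_sec,         # Errors
        'p50_ms': cuts[49],                       # Duration: median
        'p99_ms': cuts[98],                       # Duration: tail
    }

if __name__ == '__main__':
    sample = [(200, 12.0), (200, 15.5), (500, 220.0), (200, 18.2)]
    print(red_summary(sample, window_sec=1))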

Performance Investigation Workflow

                    ┌─────────────────────┐
                    │ Symptom Reported    │
                    │ "System is slow"    │
                    └──────────┬──────────┘
                               │
                               ▼
                    ┌─────────────────────┐
                    │ Quantify the Problem│
                    │ Baseline vs Current │
                    └──────────┬──────────┘
                               │
                               ▼
            ┌──────────────────┴───────────────────┐
            │                                      │
            ▼                                      ▼
┌───────────────────────┐          ┌───────────────────────┐
│ Global Analysis       │          │ Resource Analysis     │
│ (USE/RED metrics)     │          │ (bottleneck hunting)  │
└───────────┬───────────┘          └───────────┬───────────┘
            │                                  │
            └─────────────────┬────────────────┘
                              │
                              ▼
                    ┌─────────────────────┐
                    │ Drill Down          │
                    │ Profile + Trace     │
                    └──────────┬──────────┘
                               │
                               ▼
                    ┌─────────────────────┐
                    │ Root Cause          │
                    │ Code/Config/HW      │
                    └──────────┬──────────┘
                               │
                               ▼
                    ┌─────────────────────┐
                    │ Fix + Validate      │
                    │ Prove improvement   │
                    └─────────────────────┘

The Performance Checklist

Before any investigation:

□ What is the symptom? (latency, throughput, resource usage)
□ When did it start? (time correlation with changes)
□ How is it measured? (metrics, methodology)
□ What is "normal"? (baseline comparison)
□ What has changed? (code, config, traffic, hardware)
□ Who is affected? (all users, subset, specific operations)
□ What is the business impact? (SLA, revenue, experience)

Complete Project Specification

What You're Building

A comprehensive performance toolkit called perf_manual that:

  1. Executes systematic methodology through guided workflows
  2. Collects all relevant data (metrics, profiles, traces)
  3. Produces structured reports for technical and non-technical audiences
  4. Stores investigation artifacts for future reference
  5. Tracks optimization experiments with before/after data

Functional Requirements

# Investigation workflow
perf_manual investigate --workload <name> --symptom "<description>"
perf_manual use-check --target <pid|service>
perf_manual red-check --endpoint <url> --duration <sec>

# Data collection
perf_manual collect --suite <name> --output <dir>
perf_manual profile --pid <pid> --duration <sec>
perf_manual trace --events <list> --pid <pid>

# Analysis
perf_manual analyze --data <dir> --output <report.md>
perf_manual diff --before <dir> --after <dir>
perf_manual root-cause --hypothesis "<text>" --evidence <dir>

# Reporting
perf_manual report --format <technical|executive> --output <file>
perf_manual present --slides --data <dir>
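
One way to wire these subcommands is a thin dispatcher in bin/perf_manual that hands off to the lib/ scripts shown in the architecture below; a minimal sketch (the subcommand-to-script mapping is illustrative):

#!/bin/bash
# bin/perf_manual - subcommand dispatcher (sketch)
LIBDIR="$(dirname "$0")/../lib"

cmd=${1:-}
[ $# -gt 0 ] && shift
case "$cmd" in
    collect)   "$LIBDIR/collectors/collect_all.sh" "$@" ;;
    use-check) python3 "$LIBDIR/analyzers/use_check.py" "$@" ;;
    red-check) python3 "$LIBDIR/analyzers/red_check.py" "$@" ;;
    *) echo "usage: perf_manual <command> [options]" >&2; exit 1 ;;
esac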

Investigation Report Structure

# Performance Investigation Report

## Executive Summary
- **Issue**: Request latency increased 3x over past week
- **Impact**: 5% of users affected, estimated $50K/day revenue impact
- **Root Cause**: Database query N+1 pattern in new feature
- **Fix**: Query optimization, estimated 80% latency reduction
- **Status**: Fix deployed, monitoring for 48 hours

## Timeline
- 2025-01-20 10:00: Latency SLO breach detected
- 2025-01-20 10:15: Investigation started
- 2025-01-20 12:30: Root cause identified
- 2025-01-20 15:00: Fix implemented in staging
- 2025-01-21 09:00: Fix deployed to production

## Technical Details

### Symptom Quantification
- Baseline (Jan 13-19): p99 = 50ms
- Current (Jan 20): p99 = 180ms
- Change: +260% (3.6x increase)

### Resource Analysis (USE Method)
| Resource | Utilization | Saturation | Errors |
|----------|-------------|------------|--------|
| CPU      | 45%         | Low        | None   |
| Memory   | 62%         | None       | None   |
| Disk     | 12%         | None       | None   |
| Database | 89%         | HIGH       | Timeouts |

Database saturation identified as primary bottleneck.

### Profile Evidence
[Flamegraph: before/after comparison]

Hotspot shift:
- `db_query()`: 15% → 62% of CPU time
- New function `fetch_related()` appears at 45%

### Root Cause
N+1 query pattern in `fetch_related()`:
- Main query returns 100 items
- Each item triggers 1 additional query
- 101 queries instead of 2 queries (JOIN)

### Fix Validation
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| p99    | 180ms  | 42ms  | -77%   |
| DB queries/req | 101 | 2 | -98% |
| DB CPU | 89% | 25% | -72% |

## Recommendations
1. [DONE] Optimize query with JOIN
2. [PLANNED] Add N+1 query detection to CI
3. [PLANNED] Database connection pool tuning

## Appendix
- Full flamegraph: attached
- Query plans: attached
- Benchmark data: attached

Solution Architecture

Toolkit Structure

perf_manual/
├── bin/
│   └── perf_manual           # Main CLI
├── lib/
│   ├── collectors/
│   │   ├── cpu.sh            # CPU metrics collection
│   │   ├── memory.sh         # Memory metrics
│   │   ├── disk.sh           # Disk I/O
│   │   ├── network.sh        # Network
│   │   └── application.sh    # Custom app metrics
│   ├── profilers/
│   │   ├── flamegraph.sh     # perf + flamegraph
│   │   ├── offcpu.sh         # Off-CPU profiling
│   │   └── memory.sh         # Memory profiling
│   ├── tracers/
│   │   ├── syscall.sh        # Syscall tracing
│   │   ├── lock.sh           # Lock tracing
│   │   └── io.sh             # I/O tracing
│   └── analyzers/
│       ├── use_check.py      # USE method analysis
│       ├── red_check.py      # RED method analysis
│       ├── regression.py     # Statistical analysis
│       └── root_cause.py     # Evidence correlation
├── templates/
│   ├── technical_report.md   # Technical report template
│   ├── executive_report.md   # Executive summary template
│   └── checklist.md          # Investigation checklist
└── investigations/           # Stored investigations
    └── YYYY-MM-DD_<name>/
        ├── metadata.yaml
        ├── raw_data/
        ├── analysis/
        └── report.md
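
The metadata.yaml schema is yours to define; a minimal sketch of what each investigation might record (all field names here are illustrative):

# investigations/2025-01-20_latency/metadata.yaml (sketch)
investigation: 2025-01-20_latency
symptom: "p99 latency 3.6x above baseline"
started: 2025-01-20T10:15:00Z
host: app-01
workload: checkout-service
collection:
  duration_sec: 30
  tools: [mpstat, vmstat, iostat, sar, perf]
hypotheses:
  - text: "N+1 query pattern in new feature"
    status: confirmed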

Data Collection Framework

#!/bin/bash
# collect_all.sh - Comprehensive data collection

OUTPUT_DIR=$1
DURATION=$2
PID=$3

mkdir -p "$OUTPUT_DIR"/{cpu,memory,disk,network,profile,trace}

# CPU metrics
mpstat -P ALL 1 "$DURATION" > "$OUTPUT_DIR/cpu/mpstat.txt" &

# Memory metrics
vmstat 1 "$DURATION" > "$OUTPUT_DIR/memory/vmstat.txt" &

# Disk I/O
iostat -xz 1 "$DURATION" > "$OUTPUT_DIR/disk/iostat.txt" &

# Network
sar -n DEV 1 "$DURATION" > "$OUTPUT_DIR/network/sar_dev.txt" &

# CPU profile (flamegraph)
perf record -F 99 -g -p "$PID" -o "$OUTPUT_DIR/profile/perf.data" -- sleep "$DURATION" &

# Syscall trace, bounded by timeout and capped to keep output manageable
timeout "$DURATION" perf trace -p "$PID" 2>&1 | head -n 10000 > "$OUTPUT_DIR/trace/syscalls.txt" &

# Wait for all collectors
wait

# Generate flamegraph (assumes Brendan Gregg's FlameGraph scripts are on PATH)
perf script -i "$OUTPUT_DIR/profile/perf.data" | \
    stackcollapse-perf.pl | \
    flamegraph.pl > "$OUTPUT_DIR/profile/flamegraph.svg"

echo "Collection complete: $OUTPUT_DIR"

USE Method Checker

#!/usr/bin/env python3
# use_check.py - Automated USE method analysis

import os
import subprocess
import time

def check_cpu():
    """CPU utilization, saturation, errors."""
    # Utilization: busy % = 100 - idle %, from mpstat's "Average: all" line
    result = subprocess.run(
        ['mpstat', '1', '1'], capture_output=True, text=True,
        env={**os.environ, 'LC_ALL': 'C'}  # C locale so decimals parse
    )
    busy_pct = 'N/A'
    for line in result.stdout.splitlines():
        fields = line.split()
        if len(fields) > 2 and fields[0] == 'Average:' and fields[1] == 'all':
            busy_pct = f'{100 - float(fields[-1]):.1f}%'  # %idle is last column

    # Saturation: 1-minute load average relative to CPU count
    with open('/proc/loadavg') as f:
        load = float(f.read().split()[0])
    saturation = load / os.cpu_count()

    return {
        'resource': 'CPU',
        'utilization': busy_pct,
        'saturation': 'HIGH' if saturation > 1.0 else 'LOW',
        'errors': 'N/A'
    }

def check_memory():
    """Memory utilization, saturation, errors."""
    # Utilization: used % derived from MemTotal and MemAvailable
    with open('/proc/meminfo') as f:
        meminfo = dict(line.split(':', 1) for line in f)
    total = int(meminfo['MemTotal'].split()[0])
    avail = int(meminfo['MemAvailable'].split()[0])
    used_pct = (1 - avail / total) * 100

    # Saturation: pages swapped out during a 1-second sample
    # (pswpout in /proc/vmstat is cumulative, so diff two readings)
    def pswpout():
        with open('/proc/vmstat') as f:
            return int(dict(line.split() for line in f).get('pswpout', 0))

    before = pswpout()
    time.sleep(1)
    swap_out = pswpout() - before

    # Errors: OOM kills; counting them needs dmesg/journal access,
    # so this is left as a stub
    oom_count = 0

    return {
        'resource': 'Memory',
        'utilization': f'{used_pct:.1f}%',
        'saturation': 'HIGH' if swap_out > 0 else 'LOW',
        'errors': oom_count
    }

def check_disk():
    """Disk I/O utilization, saturation, errors."""
    result = subprocess.run(
        ['iostat', '-xz', '1', '2'], capture_output=True, text=True,
        env={**os.environ, 'LC_ALL': 'C'}
    )
    # Column names vary across sysstat versions, so map fields by header;
    # the final report reflects current activity (the first is since boot)
    max_util = 0.0
    max_queue = 0.0
    header = []
    for line in result.stdout.splitlines():
        fields = line.split()
        if fields and fields[0].startswith('Device'):
            header = fields
        elif header and len(fields) == len(header):
            row = dict(zip(header, fields))
            max_util = max(max_util, float(row.get('%util', 0)))
            max_queue = max(max_queue,
                            float(row.get('aqu-sz', row.get('avgqu-sz', 0))))

    return {
        'resource': 'Disk',
        'utilization': f'{max_util:.1f}%',
        'saturation': 'HIGH' if max_queue > 1 else 'LOW',
        'errors': 'N/A'  # iostat has no error column; check dmesg/SMART
    }

def check_network():
    """Network utilization, saturation, errors."""
    # Utilization needs the link speed, which /proc does not expose
    # uniformly; report drops/errors from /proc/net/dev (cumulative)
    drops = errs = 0
    with open('/proc/net/dev') as f:
        for line in f.readlines()[2:]:  # skip the two header lines
            counters = line.partition(':')[2].split()
            # rx: bytes packets errs drop ... tx: bytes packets errs drop ...
            errs += int(counters[2]) + int(counters[10])
            drops += int(counters[3]) + int(counters[11])

    return {
        'resource': 'Network',
        'utilization': 'N/A',
        'saturation': 'HIGH' if drops > 0 else 'LOW',
        'errors': errs
    }

def run_use_check():
    """Run complete USE check."""
    results = [
        check_cpu(),
        check_memory(),
        check_disk(),
        check_network(),
    ]

    print("\n=== USE Method Analysis ===\n")
    print(f"{'Resource':<12} {'Utilization':<15} {'Saturation':<12} {'Errors':<10}")
    print("-" * 50)
    for r in results:
        flagged = r['saturation'] == 'HIGH' or r['errors'] not in (0, 'N/A')
        status = '⚠' if flagged else '✓'
        print(f"{r['resource']:<12} {r['utilization']:<15} "
              f"{r['saturation']:<12} {str(r['errors']):<10} {status}")

if __name__ == '__main__':
    run_use_check()

Phased Implementation Guide

Phase 1: Data Collection Framework (Week 1)

Goal: Comprehensive, reproducible data collection.

Steps:

  1. Create modular collection scripts per resource
  2. Implement unified collection runner
  3. Design output directory structure
  4. Add metadata capture (timestamp, config, version)
  5. Test with sample workload

Validation: All collectors run, data parseable.

Phase 2: USE/RED Automation (Week 2)

Goal: Automated methodology checklists.

Steps:

  1. Implement USE checker for all resources
  2. Implement RED checker for HTTP endpoints
  3. Create severity classification
  4. Generate checklist reports
  5. Identify bottlenecks automatically

Validation: Correctly identifies simulated bottlenecks.

Phase 3: Profile Integration (Week 3)

Goal: Unified profiling with evidence collection.

Steps:

  1. Integrate CPU profiling (flamegraph)
  2. Add off-CPU profiling
  3. Implement differential flamegraph
  4. Add memory profiling
  5. Link profiles to investigation

Validation: Profiles captured and accessible.
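
For step 3, Brendan Gregg's FlameGraph repository ships a difffolded.pl script alongside flamegraph.pl; a sketch of the differential pipeline, assuming those scripts are on PATH and before/after perf.data files exist:

# Collapse before/after profiles into folded stacks
perf script -i before/perf.data | stackcollapse-perf.pl > before.folded
perf script -i after/perf.data  | stackcollapse-perf.pl > after.folded

# Frame widths come from the second profile; color shows growth vs. shrinkage
difffolded.pl before.folded after.folded | flamegraph.pl > diff.svg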

Phase 4: Analysis and Reporting (Week 4)

Goal: Structured analysis with actionable reports.

Steps:

  1. Create report templates (technical, executive)
  2. Implement automated analysis summary
  3. Add root cause hypothesis tracking
  4. Generate before/after comparisons
  5. Build evidence linkage

Validation: Report accurately summarizes investigation.
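
For step 4, a before/after comparison can start as simple percent changes over key metrics; a minimal sketch using the numbers from the sample report above (the metric names are illustrative):

#!/usr/bin/env python3
# diff_metrics.py - before/after comparison for fix validation

def pct_change(before, after):
    """Percent change per metric; negative means the metric dropped."""
    return {k: (after[k] - before[k]) / before[k] * 100
            for k in before if k in after}

if __name__ == '__main__':
    before = {'p99_ms': 180, 'db_queries_per_req': 101, 'db_cpu_pct': 89}
    after  = {'p99_ms': 42,  'db_queries_per_req': 2,   'db_cpu_pct': 25}
    for metric, change in pct_change(before, after).items():
        print(f'{metric}: {change:+.0f}%')  # p99_ms: -77%, etc.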

Phase 5: Investigation Management (Week 5+)

Goal: Persistent investigation tracking.

Steps:

  1. Create investigation storage structure
  2. Implement timeline tracking
  3. Add hypothesis/evidence correlation
  4. Build search/retrieval for past investigations
  5. Export to knowledge base

Validation: Can retrieve and reference past investigations.
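
For step 4, even a flat-file keyword scan over investigations/ is a workable start; a sketch, assuming the storage layout described above:

#!/usr/bin/env python3
# search_investigations.py - find past investigations mentioning a keyword
from pathlib import Path

def search(root, keyword):
    """Yield investigation dirs whose metadata or report mentions keyword."""
    for inv in sorted(Path(root).iterdir()):
        for name in ('metadata.yaml', 'report.md'):
            doc = inv / name
            if doc.is_file() and keyword.lower() in doc.read_text(errors='ignore').lower():
                yield inv.name
                break

if __name__ == '__main__':
    for hit in search('investigations', 'N+1'):
        print(hit)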


Testing Strategy

Synthetic Scenarios

1. CPU Bottleneck

// Infinite loop consuming CPU
while(1) { volatile int x = 0; for(int i=0;i<10000000;i++) x++; }

Expected: USE shows high CPU utilization.

2. Memory Pressure

// Allocate and touch pages until the system starts swapping
// (pages that are never written may not be backed by real memory)
while(1) { memset(malloc(1024*1024), 1, 1024*1024); }

Expected: USE shows memory saturation (swapping).

3. I/O Bottleneck

// Synchronous writes
while(1) { write(fd, data, 4096); fsync(fd); }

Expected: USE shows disk saturation.

4. Contention

// All threads fighting for one lock
pthread_mutex_lock(&global_lock);
work();
pthread_mutex_unlock(&global_lock);

Expected: Profile shows lock contention.

End-to-End Tests

  1. Complete investigation: Symptom → Root cause → Fix
  2. Report generation: Data → Technical report → Executive summary
  3. Evidence linking: Profile → Hypothesis → Validation

Common Pitfalls and Debugging

Pitfall 1: Analysis Paralysis

Symptom: Collecting data forever, never concluding.

Solution: Time-box each phase:

  • 30 min: Quantify symptom
  • 1 hour: USE/RED check
  • 2 hours: Profile and drill down
  • If no progress, escalate or document uncertainty

Pitfall 2: Confirmation Bias

Symptom: Finding evidence for pre-existing belief.

Solution:

  • Collect data before forming hypothesis
  • Actively seek disconfirming evidence
  • Have peer review findings

Pitfall 3: Fixing Symptoms Not Causes

Symptom: Problem recurs after the "fix".

Solution:

  • Ask "why" five times (root cause analysis)
  • Verify fix addresses root cause
  • Monitor for regression

Pitfall 4: Missing the Forest for the Trees

Symptom: Optimizing 1% of runtime.

Solution:

  • Always start with the highest-impact opportunity
  • Calculate the theoretical maximum improvement (see the sketch below)
  • Stop at the point of diminishing returns
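
The theoretical maximum is Amdahl's Law: if a hotspot accounts for fraction p of total runtime, eliminating it entirely caps the overall speedup at 1/(1-p), so a 1% hotspot can never buy more than about 1.01x:

# Upper bound on speedup from optimizing a fraction p of runtime
def max_speedup(p):
    return 1 / (1 - p)

print(f'{max_speedup(0.01):.2f}x')  # 1% hotspot  -> at most 1.01x
print(f'{max_speedup(0.62):.2f}x')  # 62% hotspot -> at most 2.63x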

Extensions and Challenges

Extension 1: Automated Incident Response

Integrate with monitoring:

  • Alert triggers investigation
  • Automatic data collection
  • Initial analysis before human involvement

Extension 2: Knowledge Base

Build searchable repository:

  • Past investigations indexed
  • Similar symptoms matched
  • Solutions suggested

Extension 3: Performance Review Process

Integrate with development:

  • Pre-merge performance check
  • Post-deploy monitoring
  • Quarterly performance review

Challenge: Real Incident

Apply full methodology to production incident:

  • Time-bounded (SLA pressure)
  • Incomplete information
  • Stakeholder communication
  • Post-mortem documentation

Real-World Connections

Industry Methodologies

  1. Google's Data-Driven Approach: Everything measured, decisions justified
  2. Facebook's Performance Culture: Continuous profiling, efficiency focus
  3. Netflix's Chaos Engineering: Proactive performance testing
  4. Amazon's Two-Pizza Teams: Ownership of service performance

Career Application

This project demonstrates:

  • Systematic problem-solving
  • Communication at multiple levels
  • Evidence-based decision making
  • Tool building and automation

Self-Assessment Checklist

Before considering this project complete, verify:

  • You can execute complete investigation methodology
  • USE/RED checks work automatically
  • Data collection is comprehensive and reproducible
  • Reports are clear and actionable
  • Past investigations are searchable
  • You've practiced on real or realistic scenarios
  • Others can use your toolkit

Resources

Essential Reading

  • "Systems Performance" by Brendan Gregg (entire book)
  • "Performance Analysis and Tuning on Modern CPUs" by Bakhvalov
  • "The Linux Programming Interface" by Kerrisk

Methodology

  • USE Method: Brendan Gregg's blog
  • RED Method: Tom Wilkie's talk
  • Google SRE Book: "Data Processing Pipelines" chapter

Tools

  • perf: Linux performance tool
  • bpftrace: eBPF tracing
  • Flamegraph: Visualization
  • Grafana: Dashboarding

Communication

  • "Thinking, Fast and Slow" by Kahneman (decision making)
  • "The Pyramid Principle" by Minto (structured communication)