Project 10: End-to-End Performance Regression Dashboard

Project Overview

  Difficulty: Advanced
  Time Estimate: 1 month+
  Primary Language: C
  Alternative Languages: Go, Rust, Python
  Knowledge Area: Performance Engineering Systems
  Tools Required: perf, flamegraphs, tracing tools, CI/CD
  Primary Reference: "Systems Performance" by Brendan Gregg

Learning Objectives

By completing this project, you will be able to:

  1. Automate performance benchmarking as part of CI/CD pipelines
  2. Detect statistically significant regressions and distinguish them from noise
  3. Attach profiling evidence to regression alerts
  4. Build performance trend dashboards with historical data
  5. Design alerting strategies that avoid false positives
  6. Implement performance budgets as deployment gates

Deep Theoretical Foundation

Why Regression Detection Is Hard

Performance is inherently noisy. Every benchmark run varies due to:

  • CPU frequency fluctuations
  • Scheduler decisions
  • Background processes
  • Cache state
  • Memory allocation patterns

A "5% regression" might be:

  1. Real code change that hurt performance
  2. Measurement noise
  3. Infrastructure change (different machine)
  4. Background activity on CI runner

Without statistical rigor, you'll either:

  • Alert on every run (alert fatigue)
  • Miss real regressions (silent degradation)

Statistical Foundations

Hypothesis Testing for Regression

Null hypothesis: No performance difference between versions.
Alternative hypothesis: Version B is slower than Version A.

H₀: μ_A = μ_B
H₁: μ_A < μ_B (one-tailed, looking for regressions)

Collect samples: A = [a₁, a₂, ..., aₙ], B = [b₁, b₂, ..., bₘ]
Perform test: Mann-Whitney U or Welch's t-test
If p-value < α (typically 0.05): reject H₀, flag regression

Mann-Whitney U Test

Non-parametric test that works with non-normal distributions (common for latency):

  • Compares ranks rather than values
  • Robust to outliers
  • Doesn't assume equal variance

Effect Size (Cohen's d)

Statistical significance isn't enough; we need practical significance:

d = (μ_B - μ_A) / pooled_std

Interpretation (Cohen's conventional cutoffs):
  d < 0.2    Negligible effect (probably not worth investigating)
  d 0.2-0.5  Small effect
  d 0.5-0.8  Medium effect
  d > 0.8    Large effect

A 1% change might be statistically significant with enough samples, but not worth acting on.
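
This is easy to see with synthetic numbers. Below is a minimal sketch (illustrative values, NumPy/SciPy assumed available): with 5,000 samples per side, a ~1% shift produces a vanishingly small p-value, yet Cohen's d comes out around 0.2 (small).

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
baseline = rng.normal(loc=5.00, scale=0.25, size=5000)   # latency in ms
current  = rng.normal(loc=5.05, scale=0.25, size=5000)   # ~1% slower on average

# One-tailed Mann-Whitney U: is baseline stochastically smaller than current?
_, p_value = stats.mannwhitneyu(baseline, current, alternative='less')

# Cohen's d with a pooled standard deviation
pooled_std = np.sqrt((np.var(baseline, ddof=1) + np.var(current, ddof=1)) / 2)
cohens_d = (np.mean(current) - np.mean(baseline)) / pooled_std

# Expect a tiny p-value (statistically significant) but d near 0.2 (small)
print(f"p-value = {p_value:.2e}, Cohen's d = {cohens_d:.2f}")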

Regression Detection Criteria

A robust regression detector requires multiple signals:

┌────────────────────────────────────────────────────────────────┐
│ Regression Detection Checklist                                 │
├────────────────────────────────────────────────────────────────┤
│ ☑ Magnitude: Change > threshold (e.g., 5% for latency)         │
│ ☑ Significance: p-value < 0.05 (Mann-Whitney U)                │
│ ☑ Consistency: Appears in 3+ consecutive runs                  │
│ ☑ Baseline stable: Baseline CV < 15%                           │
│ ☐ Environment: No infra changes during measurement             │
├────────────────────────────────────────────────────────────────┤
│ Confidence: 4/5 criteria → HIGH, 3/5 → MEDIUM, <3 → LOW        │
└────────────────────────────────────────────────────────────────┘
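
A minimal sketch of turning the checklist into a confidence score; the function name confidence_level and the criteria keys are illustrative, and the mapping mirrors the table above.

def confidence_level(criteria: dict) -> str:
    """Map checklist booleans to a confidence level (4/5 -> HIGH, 3/5 -> MEDIUM).

    criteria example (keys are illustrative):
        {'magnitude': True, 'significance': True, 'consistency': True,
         'baseline_stable': True, 'environment_clean': False}
    """
    met = sum(bool(v) for v in criteria.values())
    if met >= 4:
        return 'HIGH'
    if met == 3:
        return 'MEDIUM'
    return 'LOW'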

CI/CD Integration Patterns

Pattern 1: Per-Commit Benchmarks (Fast Feedback)

Commit → Build → Quick benchmark (30s) → Report
Pros: Fast feedback, catches obvious regressions
Cons: High noise, limited coverage

Pattern 2: Nightly Performance Builds (Thorough)

Nightly → Full benchmark suite (1h+) → Statistical analysis → Alert
Pros: Low noise, comprehensive
Cons: Delayed feedback (next day)

Pattern 3: Pre-Merge Gates (Blocking)

PR → Merge blocked until perf tests pass → Manual override available
Pros: Prevents regressions reaching main
Cons: Slows development, requires high confidence

Recommended Approach: Layered

Per-commit:  Smoke test (catch obvious 50%+ regressions)
PR:          Targeted benchmarks (affected components only)
Nightly:     Full suite with statistical analysis
Weekly:      Long-running stress tests
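
One way to wire up the layered approach is to pick the benchmark suite from the CI trigger. This is a sketch assuming GitHub Actions (GITHUB_EVENT_NAME is a standard variable there); the suite names are placeholders.

import os

SUITES = {
    'push':              'smoke',     # per-commit: catch obvious 50%+ regressions
    'pull_request':      'targeted',  # PR: affected components only
    'schedule':          'full',      # nightly: full suite with statistical analysis
    'workflow_dispatch': 'stress',    # manual/weekly: long-running stress tests
}

def select_suite() -> str:
    """Return the suite name for the current CI trigger (default: smoke)."""
    event = os.environ.get('GITHUB_EVENT_NAME', 'push')
    return SUITES.get(event, 'smoke')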

Complete Project Specification

What Youโ€™re Building

A performance regression system called perf_dashboard that:

  1. Runs benchmarks with proper methodology (warmup, iterations, pinning)
  2. Stores results in time-series database with metadata
  3. Detects regressions using statistical tests
  4. Generates evidence (flamegraphs, counter diffs) for regressions
  5. Visualizes trends with historical context
  6. Integrates with CI/CD (GitHub Actions, Jenkins, etc.)

Functional Requirements

perf_dashboard run --suite <name> --commit <sha> --output <dir>
perf_dashboard compare --baseline <sha> --current <sha>
perf_dashboard analyze --range <sha1>..<sha2> --output <report.md>
perf_dashboard alert --threshold <pct> --confidence <level>
perf_dashboard dashboard --serve --port 8080
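
A sketch of what this command-line surface could look like with argparse; the subcommands and flags follow the requirements above, while dispatch to real handlers is left as a placeholder.

import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog='perf_dashboard')
    sub = parser.add_subparsers(dest='command', required=True)

    run = sub.add_parser('run', help='execute a benchmark suite')
    run.add_argument('--suite', required=True)
    run.add_argument('--commit', required=True)
    run.add_argument('--output', required=True)

    compare = sub.add_parser('compare', help='compare two commits')
    compare.add_argument('--baseline', required=True)
    compare.add_argument('--current', required=True)

    analyze = sub.add_parser('analyze', help='analyze a commit range')
    analyze.add_argument('--range', required=True)
    analyze.add_argument('--output', required=True)

    alert = sub.add_parser('alert', help='evaluate alerting thresholds')
    alert.add_argument('--threshold', type=float, required=True)
    alert.add_argument('--confidence', choices=['LOW', 'MEDIUM', 'HIGH'])

    dash = sub.add_parser('dashboard', help='serve the web dashboard')
    dash.add_argument('--serve', action='store_true')
    dash.add_argument('--port', type=int, default=8080)
    return parser

if __name__ == '__main__':
    args = build_parser().parse_args()
    print(args)  # dispatch to the real handlers here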

Example Output

Performance Regression Report
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
Comparison: v1.4.2 (abc123) โ†’ v1.4.3 (def456)
Date: 2025-01-27

Summary:
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Status: REGRESSION DETECTED (HIGH CONFIDENCE)

Benchmark Results:
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Benchmark           Baseline    Current     Change    Status
                    (median)    (median)
request_latency     5.2 ms      6.1 ms      +17.3%    โš  REGRESSED
throughput          12,450/s    10,890/s    -12.5%    โš  REGRESSED
memory_usage        142 MB      144 MB      +1.4%     โœ“ OK
startup_time        1.2 s       1.2 s       +0.8%     โœ“ OK

Statistical Analysis (request_latency):
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Baseline samples: 50 runs, CV = 3.2%
Current samples: 50 runs, CV = 4.1%
p-value (Mann-Whitney): 0.00003 (significant at ฮฑ=0.05)
Effect size (Cohen's d): 1.24 (large)
Confidence: HIGH (5/5 criteria met)

Root Cause Evidence:
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Flamegraph comparison attached: diff_flamegraph.svg

Hotspot changes:
  parse_input():      22% โ†’ 41%  (+19% โš )
  validate():         15% โ†’ 14%  (-1%)
  serialize():        18% โ†’ 12%  (-6%)

New hotspot detected: parse_input()
  โ†’ json_decode() added in commit def456
  โ†’ 1.2ms additional latency per request

Commits in range:
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  def456  Add JSON parsing for new API format
  bca789  Update logging (no perf impact expected)

Recommendation:
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
1. Investigate json_decode() in parse_input()
2. Consider lazy parsing or caching parsed results
3. If regression acceptable, document and update baseline

Artifacts:
  - Flamegraph diff: reports/def456/diff_flamegraph.svg
  - Raw data: reports/def456/benchmark_data.json
  - perf profiles: reports/def456/perf.data

Solution Architecture

System Design

┌──────────────────────────────────────────────────────────────────────┐
│                        CI/CD Pipeline                                │
│  (GitHub Actions / Jenkins / GitLab CI)                              │
└───────────────────────────────┬──────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────────┐
│                      Benchmark Runner                                │
│  - Workload execution                                                │
│  - Environment control (CPU pinning, frequency)                      │
│  - Data collection (timing, counters, profiles)                      │
└───────────────────────────────┬──────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────────┐
│                      Data Storage                                    │
│  - SQLite/PostgreSQL for results                                     │
│  - File storage for profiles/flamegraphs                             │
│  - Git SHA → Results mapping                                         │
└───────────────────────────────┬──────────────────────────────────────┘
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
          ▼                     ▼                     ▼
 ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
 │ Statistical     │   │ Flamegraph      │   │ Dashboard       │
 │ Analyzer        │   │ Generator       │   │ Server          │
 │                 │   │                 │   │                 │
 │ - Regression    │   │ - Capture       │   │ - Trend charts  │
 │   detection     │   │ - Diff          │   │ - Alert history │
 │ - Confidence    │   │ - Annotate      │   │ - Drill-down    │
 └────────┬────────┘   └────────┬────────┘   └────────┬────────┘
          │                     │                     │
          └─────────────────────┼─────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────────┐
│                      Alert System                                    │
│  - Slack/Email notifications                                         │
│  - GitHub PR comments                                                │
│  - JIRA ticket creation                                              │
└──────────────────────────────────────────────────────────────────────┘

Data Model

-- Benchmark runs
CREATE TABLE benchmark_runs (
    id INTEGER PRIMARY KEY,
    commit_sha TEXT NOT NULL,
    branch TEXT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    machine_id TEXT,
    config_hash TEXT,  -- Detect config changes
    UNIQUE(commit_sha, machine_id, config_hash)
);

-- Individual measurements
CREATE TABLE measurements (
    id INTEGER PRIMARY KEY,
    run_id INTEGER REFERENCES benchmark_runs(id),
    benchmark_name TEXT NOT NULL,
    iteration INTEGER,
    latency_ns INTEGER,
    cpu_cycles INTEGER,
    instructions INTEGER,
    cache_misses INTEGER
);

-- Computed statistics
CREATE TABLE statistics (
    run_id INTEGER REFERENCES benchmark_runs(id),
    benchmark_name TEXT,
    sample_count INTEGER,
    median_ns INTEGER,
    p95_ns INTEGER,
    p99_ns INTEGER,
    cv_percent REAL,
    PRIMARY KEY (run_id, benchmark_name)
);

-- Regression alerts
CREATE TABLE alerts (
    id INTEGER PRIMARY KEY,
    detected_at DATETIME,
    baseline_sha TEXT,
    regression_sha TEXT,
    benchmark_name TEXT,
    change_percent REAL,
    confidence TEXT,  -- HIGH, MEDIUM, LOW
    status TEXT,  -- OPEN, ACKNOWLEDGED, RESOLVED, FALSE_POSITIVE
    evidence_path TEXT
);
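
A minimal sketch of reading samples back out of this schema for a comparison; the helper name, database path, and commit SHAs below are illustrative.

import sqlite3

def fetch_samples(db_path: str, commit_sha: str, benchmark: str) -> list:
    """Return latency_ns samples for one benchmark at one commit."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            """
            SELECT m.latency_ns
            FROM measurements m
            JOIN benchmark_runs r ON r.id = m.run_id
            WHERE r.commit_sha = ? AND m.benchmark_name = ?
            """,
            (commit_sha, benchmark),
        ).fetchall()
        return [latency for (latency,) in rows]
    finally:
        conn.close()

# baseline = fetch_samples('perf.db', 'abc123', 'request_latency')
# current  = fetch_samples('perf.db', 'def456', 'request_latency')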

Key Components

Benchmark Runner

#include <stdint.h>

/* measurement_t, pin_to_cpu(), set_cpu_governor(), read_tsc() and
 * tsc_to_ns() are harness helpers defined elsewhere in the project. */

typedef struct {
    const char *name;
    void (*setup)(void);
    uint64_t (*run)(void);  // Benchmark body; return a value so the work isn't optimized away
    void (*teardown)(void);
    int warmup_iterations;
    int measure_iterations;
} benchmark_t;

void run_benchmark(benchmark_t *bench, measurement_t *results) {
    // Environment control: pin to an isolated core, lock the frequency governor
    pin_to_cpu(2);
    set_cpu_governor("performance");

    // Setup
    if (bench->setup) bench->setup();

    // Warmup: populate caches and stabilize the system before measuring
    for (int i = 0; i < bench->warmup_iterations; i++) {
        bench->run();
    }

    // Measure: time each iteration with the TSC
    for (int i = 0; i < bench->measure_iterations; i++) {
        uint64_t start = read_tsc();
        uint64_t result = bench->run();
        uint64_t end = read_tsc();

        (void)result;  // keep the benchmark call from being optimized away
        results[i].latency_ns = tsc_to_ns(end - start);
        results[i].iteration = i;
    }

    // Teardown
    if (bench->teardown) bench->teardown();
}

Statistical Analyzer

import scipy.stats as stats
import numpy as np

def detect_regression(baseline, current, threshold=0.05, alpha=0.05):
    """
    Detect whether `current` is a regression relative to `baseline`.
    Returns: (is_regression, confidence, details), where details is a
    reason string or, for detected regressions, a dict of statistics.
    """
    baseline = np.asarray(baseline)
    current = np.asarray(current)

    # Check baseline stability (coefficient of variation)
    baseline_cv = np.std(baseline) / np.mean(baseline)
    if baseline_cv > 0.15:
        return False, 'LOW', 'Baseline too noisy'

    # Magnitude check: is the median shift above the practical threshold?
    baseline_median = np.median(baseline)
    current_median = np.median(current)
    change_pct = (current_median - baseline_median) / baseline_median

    if abs(change_pct) < threshold:
        return False, 'LOW', f'Change {change_pct:.1%} below threshold'

    # Statistical significance: one-tailed Mann-Whitney U
    # (baseline stochastically smaller than current, i.e. current is slower)
    _, p_value = stats.mannwhitneyu(
        baseline, current, alternative='less'
    )

    if p_value > alpha:
        return False, 'MEDIUM', f'p={p_value:.4f} not significant'

    # Effect size (Cohen's d, using means to match the formula above)
    pooled_std = np.sqrt((np.var(baseline) + np.var(current)) / 2)
    cohens_d = (np.mean(current) - np.mean(baseline)) / pooled_std

    # Confidence based on how many criteria are met
    criteria_met = sum([
        abs(change_pct) > threshold,
        p_value < alpha,
        abs(cohens_d) > 0.5,
        baseline_cv < 0.1,
    ])

    confidence = 'HIGH' if criteria_met >= 4 else 'MEDIUM'

    return True, confidence, {
        'change_pct': change_pct,
        'p_value': p_value,
        'cohens_d': cohens_d,
        'baseline_cv': baseline_cv,
    }
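
A quick way to sanity-check detect_regression is to feed it synthetic samples with a known ~10% shift; the numbers below are illustrative and should come out as a HIGH-confidence regression.

import numpy as np

rng = np.random.default_rng(0)
baseline = rng.normal(loc=5.2, scale=0.15, size=50)   # ms, CV around 3%
current  = rng.normal(loc=5.7, scale=0.18, size=50)   # roughly 10% slower

is_regression, confidence, details = detect_regression(baseline, current)
print(is_regression, confidence, details)
# Expected: True with HIGH confidence and a large Cohen's d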

Phased Implementation Guide

Phase 1: Benchmark Harness (Week 1)

Goal: Reliable benchmark execution with data collection.

Steps:

  1. Implement benchmark runner with environment control
  2. Create data collection for timing and counters
  3. Design storage format (SQLite + files)
  4. Build CLI for running benchmarks
  5. Verify reproducibility (CV < 5%)

Validation: Same benchmark, same CV across runs.

Phase 2: Statistical Analysis (Week 2)

Goal: Detect regressions with statistical rigor.

Steps:

  1. Implement Mann-Whitney U test
  2. Add effect size calculation
  3. Define regression criteria
  4. Build comparison CLI
  5. Test with known regressions

Validation: Detects synthetic 10% regression.

Phase 3: Evidence Collection (Week 3)

Goal: Attach profiling data to regressions.

Steps:

  1. Capture flamegraphs for each run
  2. Implement differential flamegraph
  3. Collect hardware counter data
  4. Link evidence to regression alerts
  5. Generate human-readable reports

Validation: Report shows correct hotspot change.
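
A hedged sketch of the differential-flamegraph step, shelling out to Brendan Gregg's FlameGraph scripts (stackcollapse-perf.pl, difffolded.pl, flamegraph.pl); the checkout path and file names are assumptions about your setup.

import subprocess
from pathlib import Path

FLAMEGRAPH_DIR = Path('/opt/FlameGraph')  # assumed location of the FlameGraph checkout

def collapse(perf_data: str, folded_out: str) -> None:
    """perf.data -> folded stacks (perf script | stackcollapse-perf.pl)."""
    script = subprocess.run(
        ['perf', 'script', '-i', perf_data],
        check=True, capture_output=True, text=True,
    )
    folded = subprocess.run(
        [str(FLAMEGRAPH_DIR / 'stackcollapse-perf.pl')],
        input=script.stdout, check=True, capture_output=True, text=True,
    )
    Path(folded_out).write_text(folded.stdout)

def diff_flamegraph(baseline_folded: str, current_folded: str, svg_out: str) -> None:
    """Folded stacks from two runs -> differential flamegraph SVG."""
    diff = subprocess.run(
        [str(FLAMEGRAPH_DIR / 'difffolded.pl'), baseline_folded, current_folded],
        check=True, capture_output=True, text=True,
    )
    svg = subprocess.run(
        [str(FLAMEGRAPH_DIR / 'flamegraph.pl')],
        input=diff.stdout, check=True, capture_output=True, text=True,
    )
    Path(svg_out).write_text(svg.stdout)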

Phase 4: CI/CD Integration (Week 4)

Goal: Automated regression detection in pipeline.

Steps:

  1. Create GitHub Action workflow
  2. Implement baseline management (rolling, release)
  3. Add PR comment with results
  4. Configure alerting thresholds
  5. Test with real repository

Validation: PR blocked by synthetic regression.
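
For the PR-comment step, here is a hedged sketch using the GitHub REST API's issue-comments endpoint via requests; the repository name, token variable, and report path are assumptions.

import os
import requests

def post_pr_comment(repo: str, pr_number: int, body: str) -> None:
    """repo is 'owner/name'; the token is read from the GITHUB_TOKEN env var."""
    url = f'https://api.github.com/repos/{repo}/issues/{pr_number}/comments'
    resp = requests.post(
        url,
        headers={
            'Authorization': f"Bearer {os.environ['GITHUB_TOKEN']}",
            'Accept': 'application/vnd.github+json',
        },
        json={'body': body},
    )
    resp.raise_for_status()

# post_pr_comment('my-org/my-service', 42, open('report.md').read())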

Phase 5: Dashboard (Week 5)

Goal: Visual interface for performance history.

Steps:

  1. Build web dashboard (simple HTML or React)
  2. Display trend charts over time
  3. Show alert history and status
  4. Add drill-down to individual runs
  5. Implement search/filter

Validation: Can navigate performance history easily.
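
A minimal sketch of a trend endpoint for the dashboard, using Flask on top of the statistics table from the data model; Flask is just one option (the spec allows plain HTML or React), and the database path and route are illustrative.

import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)
DB_PATH = 'perf.db'  # assumed database location

@app.route('/api/trend/<benchmark>')
def trend(benchmark):
    """Median and p99 latency per commit for one benchmark, oldest first."""
    conn = sqlite3.connect(DB_PATH)
    try:
        rows = conn.execute(
            """
            SELECT r.commit_sha, r.timestamp, s.median_ns, s.p99_ns
            FROM statistics s
            JOIN benchmark_runs r ON r.id = s.run_id
            WHERE s.benchmark_name = ?
            ORDER BY r.timestamp
            """,
            (benchmark,),
        ).fetchall()
    finally:
        conn.close()
    return jsonify([
        {'commit': sha, 'timestamp': ts, 'median_ns': med, 'p99_ns': p99}
        for sha, ts, med, p99 in rows
    ])

if __name__ == '__main__':
    app.run(port=8080)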


Testing Strategy

Synthetic Regressions

  1. 10% latency increase: Should detect with HIGH confidence
  2. 5% latency increase: Should detect with MEDIUM confidence
  3. 2% latency increase: Should NOT alert (below threshold)
  4. Baseline noise spike: Should NOT alert (baseline unstable)

False Positive Testing

  1. Environment change: Different machine, same code
  2. Background activity: CI runner under load
  3. Cold cache: First run after restart
  4. Flaky benchmark: High inherent variance

Integration Tests

  1. Full pipeline: Commit โ†’ Benchmark โ†’ Alert
  2. PR workflow: Open PR โ†’ Check โ†’ Comment
  3. Dashboard: Data โ†’ Visualization โ†’ Drill-down

Common Pitfalls and Debugging

Pitfall 1: Alert Fatigue

Symptom: Too many alerts, team ignores them.

Cause: Threshold too low or noise not controlled.

Solution:

  • Increase magnitude threshold (5% → 10%)
  • Require HIGH confidence for alerts
  • Add human review before escalation
  • Track false positive rate and tune

Pitfall 2: Missing Real Regressions

Symptom: Performance degrades unnoticed.

Cause: Threshold too high or coverage gaps.

Solution:

  • Monitor trend over time (gradual degradation)
  • Ensure critical paths have benchmarks
  • Review resolved alerts periodically
  • Compare release-to-release, not just commit-to-commit

Pitfall 3: Baseline Drift

Symptom: Baseline keeps moving, hard to compare.

Cause: Using rolling baseline that includes regressions.

Solution:

  • Use release versions as baselines
  • Require explicit baseline updates
  • Alert on baseline changes
  • Keep historical baselines for comparison

Pitfall 4: Infrastructure Variability

Symptom: Same code shows different performance.

Cause: Shared CI runners with variable load.

Solution:

  • Use dedicated performance machines
  • Pin to specific CPU cores
  • Lock CPU frequency
  • Run benchmarks in isolation
  • Record machine ID with results

Extensions and Challenges

Extension 1: Automatic Bisection

When a regression is detected (see the sketch after this list):

  1. Binary search through commits in range
  2. Find exact commit that caused regression
  3. Annotate commit with performance impact
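
A sketch of the bisection logic, under the assumption that the regression, once introduced, persists in every later commit; checkout_and_build() and measure_median() are hypothetical project-specific helpers.

def bisect_regression(commits, baseline_median, threshold=0.05):
    """commits: oldest-to-newest list of SHAs between the good and bad builds.
    Returns the first commit whose median exceeds the baseline by > threshold."""
    lo, hi = 0, len(commits) - 1
    first_bad = None
    while lo <= hi:
        mid = (lo + hi) // 2
        checkout_and_build(commits[mid])             # hypothetical helper
        median = measure_median('request_latency')   # hypothetical helper
        if (median - baseline_median) / baseline_median > threshold:
            first_bad = commits[mid]   # regression present: look earlier
            hi = mid - 1
        else:
            lo = mid + 1               # still fast: regression is later
    return first_bad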

Extension 2: Performance Budgets

Define budgets per endpoint/component:

budgets:
  api_latency_p99:
    threshold: 100ms
    action: block_deploy
  startup_time:
    threshold: 5s
    action: warn
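
A sketch of how such budgets might be enforced in CI: compare measured values against each budget and fail the job when a blocking budget is exceeded. The BUDGETS dict mirrors the YAML above; the units and measured values are illustrative.

import sys

BUDGETS = {
    'api_latency_p99': {'threshold_ms': 100, 'action': 'block_deploy'},
    'startup_time':    {'threshold_ms': 5000, 'action': 'warn'},
}

def check_budgets(measured_ms: dict) -> None:
    blocking_failure = False
    for name, budget in BUDGETS.items():
        value = measured_ms.get(name)
        if value is None or value <= budget['threshold_ms']:
            continue
        print(f"{name}: {value} ms exceeds budget {budget['threshold_ms']} ms "
              f"({budget['action']})")
        if budget['action'] == 'block_deploy':
            blocking_failure = True
    if blocking_failure:
        sys.exit(1)  # non-zero exit fails the CI job and gates the deploy

# check_budgets({'api_latency_p99': 112, 'startup_time': 4200})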

Extension 3: A/B Performance Testing

Compare two versions with statistical rigor:

  • Deploy both simultaneously
  • Route traffic proportionally
  • Measure real-world performance difference

Challenge: Multi-Dimensional Regression

Detect regressions across multiple metrics:

  • Latency improved but memory increased
  • Throughput same but CPU usage doubled
  • Trade-off analysis and alerting

Real-World Connections

Industry Systems

  1. Google's PerfGate: Automated performance testing in CI
  2. Facebook's Jevons: Efficiency regression detection
  3. Netflix's Flamescope: Performance investigation tooling
  4. GitHub Actions: CI/CD integration patterns

Best Practices

  • Start with most critical benchmarks
  • Require HIGH confidence before blocking
  • Always attach evidence to alerts
  • Track false positive rate
  • Review and tune regularly

Self-Assessment Checklist

Before considering this project complete, verify:

  • Benchmarks run with controlled environment
  • Statistical tests correctly detect 10% regression
  • False positives are rare (< 5% of alerts)
  • Evidence (flamegraphs) attached to alerts
  • CI/CD integration works end-to-end
  • Dashboard shows historical trends
  • Documentation explains how to respond to alerts

Resources

Essential Reading

  • "Systems Performance" by Gregg, Chapter 2
  • "High Performance Python" by Gorelick & Ozsvald, Chapter 1
  • "Statistics for Experimenters" by Box, Hunter & Hunter

Tools

  • Continuous benchmarking: Bencher, Codspeed
  • Flamegraph diff: FlameGraph repo's difffolded.pl
  • Statistical testing: scipy.stats, R
  • Visualization: Grafana, custom dashboards

Reference

  • GitHub Actions performance testing examples
  • Googleโ€™s testing blog on performance
  • Netflix tech blog on performance