Project 10: End-to-End Performance Regression Dashboard
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1 month+ |
| Primary Language | C |
| Alternative Languages | Go, Rust, Python |
| Knowledge Area | Performance Engineering Systems |
| Tools Required | perf, flamegraphs, tracing tools, CI/CD |
| Primary Reference | “Systems Performance” by Brendan Gregg |
Learning Objectives
By completing this project, you will be able to:
- Automate performance benchmarking as part of CI/CD pipelines
- Detect statistically significant regressions and distinguish them from measurement noise
- Attach profiling evidence to regression alerts
- Build performance trend dashboards with historical data
- Design alerting strategies that avoid false positives
- Implement performance budgets as deployment gates
Deep Theoretical Foundation
Why Regression Detection Is Hard
Performance is inherently noisy. Every benchmark run varies due to:
- CPU frequency fluctuations
- Scheduler decisions
- Background processes
- Cache state
- Memory allocation patterns
A “5% regression” might be:
- Real code change that hurt performance
- Measurement noise
- Infrastructure change (different machine)
- Background activity on CI runner
Without statistical rigor, you’ll either:
- Alert on every run (alert fatigue)
- Miss real regressions (silent degradation)
Statistical Foundations
Hypothesis Testing for Regression
Null hypothesis: no performance difference between versions.
Alternative hypothesis: version B is slower than version A.
H₀: μ_A = μ_B
H₁: μ_A < μ_B (one-tailed, looking for regressions)
Collect samples: A = [a₁, a₂, ..., aₙ], B = [b₁, b₂, ..., bₘ]
Perform test: Mann-Whitney U or Welch's t-test
If p-value < α (typically 0.05): Reject H₀, flag regression
Mann-Whitney U Test
Non-parametric test that works with non-normal distributions (common for latency):
- Compares ranks rather than values
- Robust to outliers
- Doesn’t assume equal variance
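The rank-based idea can be sketched with the standard library alone; the function names here are illustrative, the normal approximation needs roughly 20+ samples per side, and in practice `scipy.stats.mannwhitneyu` (used later in the analyzer) is the right tool:

```python
from statistics import NormalDist

def _average_ranks(values):
    """1-based ranks; tied values share the average of their rank range."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def mann_whitney_one_tailed(baseline, current):
    """p-value for H1 'baseline is stochastically less than current'
    (i.e., current is slower), normal approximation, no tie correction."""
    n1, n2 = len(baseline), len(current)
    ranks = _average_ranks(list(baseline) + list(current))
    r1 = sum(ranks[:n1])                 # rank sum of the baseline sample
    u1 = r1 - n1 * (n1 + 1) / 2          # U statistic for the baseline
    mu = n1 * n2 / 2
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u1 - mu) / sigma
    return u1, NormalDist().cdf(z)       # small p => flag regression
```

When every current sample is slower than every baseline sample, U hits its minimum and the p-value collapses toward zero; identical distributions give p ≈ 0.5.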
Effect Size (Cohen’s d)
Statistical significance isn’t enough—we need practical significance:
d = (μ_B - μ_A) / pooled_std
Interpretation:
d < 0.2      Negligible effect (probably not worth investigating)
d 0.2–0.5    Small effect
d 0.5–0.8    Medium effect
d > 0.8      Large effect
A 1% change might be statistically significant with enough samples, but not worth acting on.
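The formula above is a two-liner; the sample latencies below are made up to show a large and a negligible effect side by side:

```python
from statistics import mean, stdev

def cohens_d(baseline, current):
    """Effect size: mean difference in units of pooled standard deviation."""
    pooled_std = ((stdev(baseline) ** 2 + stdev(current) ** 2) / 2) ** 0.5
    return (mean(current) - mean(baseline)) / pooled_std

a = [98, 99, 100, 101, 102]             # baseline latencies (ms)
b = [104, 105, 106, 107, 108]           # clearly slower: d well above 0.8
c = [98.2, 99.2, 100.2, 101.2, 102.2]   # 0.2 ms shift: d below 0.2
```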
Regression Detection Criteria
A robust regression detector requires multiple signals:
┌──────────────────────────────────────────────────────────────┐
│                Regression Detection Checklist                │
├──────────────────────────────────────────────────────────────┤
│ ☑ Magnitude: Change > threshold (e.g., 5% for latency)       │
│ ☑ Significance: p-value < 0.05 (Mann-Whitney U)              │
│ ☑ Consistency: Appears in 3+ consecutive runs                │
│ ☑ Baseline stable: Baseline CV < 15%                         │
│ ☐ Environment: No infra changes during measurement           │
├──────────────────────────────────────────────────────────────┤
│ Confidence: 4/5 criteria → HIGH, 3/5 → MEDIUM, <3 → LOW      │
└──────────────────────────────────────────────────────────────┘
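Scoring the checklist is deliberately boring code; a minimal sketch (function name illustrative), using the cut-offs from the box above:

```python
def checklist_confidence(magnitude, significance, consistency,
                         baseline_stable, env_clean):
    """Map the five boolean checklist signals to a confidence label."""
    met = sum([magnitude, significance, consistency,
               baseline_stable, env_clean])
    if met >= 4:
        return "HIGH"
    if met == 3:
        return "MEDIUM"
    return "LOW"
```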
CI/CD Integration Patterns
Pattern 1: Per-Commit Benchmarks (Fast Feedback)
Commit → Build → Quick benchmark (30s) → Report
Pros: Fast feedback, catches obvious regressions
Cons: High noise, limited coverage
Pattern 2: Nightly Performance Builds (Thorough)
Nightly → Full benchmark suite (1h+) → Statistical analysis → Alert
Pros: Low noise, comprehensive
Cons: Delayed feedback (next day)
Pattern 3: Pre-Merge Gates (Blocking)
PR → Merge blocked until perf tests pass → Manual override available
Pros: Prevents regressions reaching main
Cons: Slows development, requires high confidence
Recommended Approach: Layered
Per-commit: Smoke test (catch obvious 50%+ regressions)
PR: Targeted benchmarks (affected components only)
Nightly: Full suite with statistical analysis
Weekly: Long-running stress tests
Complete Project Specification
What You’re Building
A performance regression system called perf_dashboard that:
- Runs benchmarks with proper methodology (warmup, iterations, pinning)
- Stores results in time-series database with metadata
- Detects regressions using statistical tests
- Generates evidence (flamegraphs, counter diffs) for regressions
- Visualizes trends with historical context
- Integrates with CI/CD (GitHub Actions, Jenkins, etc.)
Functional Requirements
perf_dashboard run --suite <name> --commit <sha> --output <dir>
perf_dashboard compare --baseline <sha> --current <sha>
perf_dashboard analyze --range <sha1>..<sha2> --output <report.md>
perf_dashboard alert --threshold <pct> --confidence <level>
perf_dashboard dashboard --serve --port 8080
Example Output
Performance Regression Report
═══════════════════════════════════════════════════
Comparison: v1.4.2 (abc123) → v1.4.3 (def456)
Date: 2025-01-27
Summary:
───────────────────────────────────────────────────
Status: REGRESSION DETECTED (HIGH CONFIDENCE)
Benchmark Results:
───────────────────────────────────────────────────
Benchmark          Baseline    Current     Change    Status
                   (median)    (median)
request_latency    5.2 ms      6.1 ms      +17.3%    ⚠ REGRESSED
throughput         12,450/s    10,890/s    -12.5%    ⚠ REGRESSED
memory_usage       142 MB      144 MB      +1.4%     ✓ OK
startup_time       1.2 s       1.2 s       +0.8%     ✓ OK
Statistical Analysis (request_latency):
───────────────────────────────────────────────────
Baseline samples: 50 runs, CV = 3.2%
Current samples: 50 runs, CV = 4.1%
p-value (Mann-Whitney): 0.00003 (significant at α=0.05)
Effect size (Cohen's d): 1.24 (large)
Confidence: HIGH (5/5 criteria met)
Root Cause Evidence:
───────────────────────────────────────────────────
Flamegraph comparison attached: diff_flamegraph.svg
Hotspot changes:
parse_input(): 22% → 41% (+19% ⚠)
validate(): 15% → 14% (-1%)
serialize(): 18% → 12% (-6%)
New hotspot detected: parse_input()
→ json_decode() added in commit def456
→ 1.2ms additional latency per request
Commits in range:
───────────────────────────────────────────────────
def456 Add JSON parsing for new API format
bca789 Update logging (no perf impact expected)
Recommendation:
───────────────────────────────────────────────────
1. Investigate json_decode() in parse_input()
2. Consider lazy parsing or caching parsed results
3. If regression acceptable, document and update baseline
Artifacts:
- Flamegraph diff: reports/def456/diff_flamegraph.svg
- Raw data: reports/def456/benchmark_data.json
- perf profiles: reports/def456/perf.data
Solution Architecture
System Design
┌─────────────────────────────────────────────────────────────────────┐
│                           CI/CD Pipeline                            │
│               (GitHub Actions / Jenkins / GitLab CI)                │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          Benchmark Runner                           │
│  - Workload execution                                               │
│  - Environment control (CPU pinning, frequency)                     │
│  - Data collection (timing, counters, profiles)                     │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                            Data Storage                             │
│  - SQLite/PostgreSQL for results                                    │
│  - File storage for profiles/flamegraphs                            │
│  - Git SHA → Results mapping                                        │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
         ┌─────────────────────┼─────────────────────┐
         │                     │                     │
         ▼                     ▼                     ▼
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│   Statistical   │   │   Flamegraph    │   │    Dashboard    │
│    Analyzer     │   │    Generator    │   │     Server      │
│                 │   │                 │   │                 │
│ - Regression    │   │ - Capture       │   │ - Trend charts  │
│   detection     │   │ - Diff          │   │ - Alert history │
│ - Confidence    │   │ - Annotate      │   │ - Drill-down    │
└────────┬────────┘   └────────┬────────┘   └────────┬────────┘
         │                     │                     │
         └─────────────────────┼─────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                            Alert System                             │
│  - Slack/Email notifications                                        │
│  - GitHub PR comments                                               │
│  - JIRA ticket creation                                             │
└─────────────────────────────────────────────────────────────────────┘
Data Model
-- Benchmark runs
CREATE TABLE benchmark_runs (
    id          INTEGER PRIMARY KEY,
    commit_sha  TEXT NOT NULL,
    branch      TEXT,
    timestamp   DATETIME DEFAULT CURRENT_TIMESTAMP,
    machine_id  TEXT,
    config_hash TEXT,  -- Detect config changes
    UNIQUE(commit_sha, machine_id, config_hash)
);

-- Individual measurements
CREATE TABLE measurements (
    id             INTEGER PRIMARY KEY,
    run_id         INTEGER REFERENCES benchmark_runs(id),
    benchmark_name TEXT NOT NULL,
    iteration      INTEGER,
    latency_ns     INTEGER,
    cpu_cycles     INTEGER,
    instructions   INTEGER,
    cache_misses   INTEGER
);

-- Computed statistics
CREATE TABLE statistics (
    run_id         INTEGER REFERENCES benchmark_runs(id),
    benchmark_name TEXT,
    sample_count   INTEGER,
    median_ns      INTEGER,
    p95_ns         INTEGER,
    p99_ns         INTEGER,
    cv_percent     REAL,
    PRIMARY KEY (run_id, benchmark_name)
);

-- Regression alerts
CREATE TABLE alerts (
    id             INTEGER PRIMARY KEY,
    detected_at    DATETIME,
    baseline_sha   TEXT,
    regression_sha TEXT,
    benchmark_name TEXT,
    change_percent REAL,
    confidence     TEXT,  -- HIGH, MEDIUM, LOW
    status         TEXT,  -- OPEN, ACKNOWLEDGED, RESOLVED, FALSE_POSITIVE
    evidence_path  TEXT
);
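A quick way to sanity-check the schema is to load a subset into an in-memory SQLite database; the SHA, machine ID, and latency values below are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE benchmark_runs (
    id          INTEGER PRIMARY KEY,
    commit_sha  TEXT NOT NULL,
    machine_id  TEXT,
    config_hash TEXT,
    UNIQUE(commit_sha, machine_id, config_hash)
);
CREATE TABLE measurements (
    id             INTEGER PRIMARY KEY,
    run_id         INTEGER REFERENCES benchmark_runs(id),
    benchmark_name TEXT NOT NULL,
    iteration      INTEGER,
    latency_ns     INTEGER
);
""")

# Record one run with five measurements (values illustrative)
cur = conn.execute(
    "INSERT INTO benchmark_runs (commit_sha, machine_id, config_hash) "
    "VALUES (?, ?, ?)", ("def456", "perf-box-1", "cfg-a"))
run_id = cur.lastrowid
conn.executemany(
    "INSERT INTO measurements (run_id, benchmark_name, iteration, latency_ns) "
    "VALUES (?, ?, ?, ?)",
    [(run_id, "request_latency", i, 5_200_000 + 10_000 * i) for i in range(5)])
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
```

The `UNIQUE(commit_sha, machine_id, config_hash)` constraint is what lets re-runs on the same machine and config upsert rather than silently duplicate.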
Key Components
Benchmark Runner
#include <stdint.h>

// pin_to_cpu, set_cpu_governor, read_tsc, tsc_to_ns are project helpers
typedef struct {
    const char *name;
    void (*setup)(void);
    void (*run)(void);        // One iteration of the workload
    void (*teardown)(void);
    int warmup_iterations;
    int measure_iterations;
} benchmark_t;

void run_benchmark(benchmark_t *bench, measurement_t *results) {
    // Environment control
    pin_to_cpu(2);
    set_cpu_governor("performance");

    // Setup
    if (bench->setup) bench->setup();

    // Warmup: prime caches, branch predictors, allocator state
    for (int i = 0; i < bench->warmup_iterations; i++) {
        bench->run();
    }

    // Measure: time each iteration with the TSC
    for (int i = 0; i < bench->measure_iterations; i++) {
        uint64_t start = read_tsc();
        bench->run();
        uint64_t end = read_tsc();
        results[i].latency_ns = tsc_to_ns(end - start);
        results[i].iteration = i;
    }

    // Teardown
    if (bench->teardown) bench->teardown();
}
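The same warmup-then-measure structure translates directly to other languages. A Python sketch using `time.perf_counter_ns` (the core number is illustrative, and CPU pinning via `os.sched_setaffinity` is Linux-only):

```python
import os
import time

def run_benchmark(fn, warmup_iterations=100, measure_iterations=1000, cpu=2):
    """Warm up, then time each iteration; returns latencies in ns."""
    # Environment control: pin to one core where the OS supports it
    if hasattr(os, "sched_setaffinity"):
        try:
            os.sched_setaffinity(0, {cpu})
        except OSError:
            pass  # requested core may not exist on this machine

    for _ in range(warmup_iterations):   # prime caches, branch predictors
        fn()

    samples = []
    for _ in range(measure_iterations):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return samples
```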
Statistical Analyzer
import scipy.stats as stats
import numpy as np

def detect_regression(baseline, current, threshold=0.05, alpha=0.05):
    """
    Detect whether `current` is a regression from `baseline`.
    Returns: (is_regression, confidence, details)
    """
    # Check baseline stability
    baseline_cv = np.std(baseline) / np.mean(baseline)
    if baseline_cv > 0.15:
        return False, 'LOW', 'Baseline too noisy'

    # Magnitude check (medians are robust to outliers)
    baseline_median = np.median(baseline)
    current_median = np.median(current)
    change_pct = (current_median - baseline_median) / baseline_median
    if abs(change_pct) < threshold:
        return False, 'LOW', f'Change {change_pct:.1%} below threshold'

    # Statistical significance (Mann-Whitney U, one-tailed:
    # baseline stochastically less than current means a regression)
    statistic, p_value = stats.mannwhitneyu(
        baseline, current, alternative='less'
    )
    if p_value > alpha:
        return False, 'MEDIUM', f'p={p_value:.4f} not significant'

    # Effect size (Cohen's d is defined on means, not medians)
    pooled_std = np.sqrt((np.var(baseline) + np.var(current)) / 2)
    cohens_d = (np.mean(current) - np.mean(baseline)) / pooled_std

    # Confidence based on multiple criteria
    criteria_met = sum([
        abs(change_pct) > threshold,
        p_value < alpha,
        abs(cohens_d) > 0.5,
        baseline_cv < 0.1,
    ])
    confidence = 'HIGH' if criteria_met >= 4 else 'MEDIUM'

    return True, confidence, {
        'change_pct': change_pct,
        'p_value': p_value,
        'cohens_d': cohens_d,
        'baseline_cv': baseline_cv,
    }
Phased Implementation Guide
Phase 1: Benchmark Harness (Week 1)
Goal: Reliable benchmark execution with data collection.
Steps:
- Implement benchmark runner with environment control
- Create data collection for timing and counters
- Design storage format (SQLite + files)
- Build CLI for running benchmarks
- Verify reproducibility (CV < 5%)
Validation: repeated runs of the same benchmark produce consistent results (CV < 5%).
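The CV check worth wiring into the harness is a one-liner (the sample values below are made up):

```python
from statistics import mean, stdev

def cv_percent(samples):
    """Coefficient of variation: sample stddev as a percentage of the mean."""
    return 100.0 * stdev(samples) / mean(samples)

quiet = [100, 101, 99, 100, 100]   # tight cluster: reproducible
noisy = [100, 150, 60, 120, 90]    # wild swings: fix the environment first
```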
Phase 2: Statistical Analysis (Week 2)
Goal: Detect regressions with statistical rigor.
Steps:
- Implement Mann-Whitney U test
- Add effect size calculation
- Define regression criteria
- Build comparison CLI
- Test with known regressions
Validation: Detects synthetic 10% regression.
Phase 3: Evidence Collection (Week 3)
Goal: Attach profiling data to regressions.
Steps:
- Capture flamegraphs for each run
- Implement differential flamegraph
- Collect hardware counter data
- Link evidence to regression alerts
- Generate human-readable reports
Validation: Report shows correct hotspot change.
Phase 4: CI/CD Integration (Week 4)
Goal: Automated regression detection in pipeline.
Steps:
- Create GitHub Action workflow
- Implement baseline management (rolling, release)
- Add PR comment with results
- Configure alerting thresholds
- Test with real repository
Validation: PR blocked by synthetic regression.
Phase 5: Dashboard and Trends (Week 5+)
Goal: Visual interface for performance history.
Steps:
- Build web dashboard (simple HTML or React)
- Display trend charts over time
- Show alert history and status
- Add drill-down to individual runs
- Implement search/filter
Validation: Can navigate performance history easily.
Testing Strategy
Synthetic Regressions
- 10% latency increase: Should detect with HIGH confidence
- 5% latency increase: Should detect with MEDIUM confidence
- 2% latency increase: Should NOT alert (below threshold)
- Baseline noise spike: Should NOT alert (baseline unstable)
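One way to produce these test cases is to scale a noisy baseline by a known factor; all parameters here are illustrative, and seeding makes the test data deterministic:

```python
import random

def synthetic_pair(n=50, base_ms=5.2, noise=0.03, regression=0.10, seed=1):
    """Baseline and 'regressed' samples with ~3% noise and a known shift."""
    rng = random.Random(seed)  # seeded, so test runs are reproducible
    baseline = [base_ms * (1 + rng.gauss(0, noise)) for _ in range(n)]
    current = [base_ms * (1 + regression) * (1 + rng.gauss(0, noise))
               for _ in range(n)]
    return baseline, current
```

A detector that cannot recover the injected 10% shift from this data (within noise) is not ready for real commits.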
False Positive Testing
- Environment change: Different machine, same code
- Background activity: CI runner under load
- Cold cache: First run after restart
- Flaky benchmark: High inherent variance
Integration Tests
- Full pipeline: Commit → Benchmark → Alert
- PR workflow: Open PR → Check → Comment
- Dashboard: Data → Visualization → Drill-down
Common Pitfalls and Debugging
Pitfall 1: Alert Fatigue
Symptom: Too many alerts, team ignores them.
Cause: Threshold too low or noise not controlled.
Solution:
- Increase magnitude threshold (5% → 10%)
- Require HIGH confidence for alerts
- Add human review before escalation
- Track false positive rate and tune
Pitfall 2: Missing Real Regressions
Symptom: Performance degrades unnoticed.
Cause: Threshold too high or coverage gaps.
Solution:
- Monitor trend over time (gradual degradation)
- Ensure critical paths have benchmarks
- Review resolved alerts periodically
- Compare release-to-release, not just commit-to-commit
Pitfall 3: Baseline Drift
Symptom: Baseline keeps moving, hard to compare.
Cause: Using rolling baseline that includes regressions.
Solution:
- Use release versions as baselines
- Require explicit baseline updates
- Alert on baseline changes
- Keep historical baselines for comparison
Pitfall 4: Infrastructure Variability
Symptom: Same code shows different performance.
Cause: Shared CI runners with variable load.
Solution:
- Use dedicated performance machines
- Pin to specific CPU cores
- Lock CPU frequency
- Run benchmarks in isolation
- Record machine ID with results
Extensions and Challenges
Extension 1: Automatic Bisection
When regression detected:
- Binary search through commits in range
- Find exact commit that caused regression
- Annotate commit with performance impact
Extension 2: Performance Budgets
Define budgets per endpoint/component:
budgets:
  api_latency_p99:
    threshold: 100ms
    action: block_deploy
  startup_time:
    threshold: 5s
    action: warn
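A deployment gate over such a config reduces to a table lookup. A sketch with the budget values mirrored from above (`check_budgets` and the millisecond normalization are hypothetical choices):

```python
BUDGETS = {
    "api_latency_p99": {"threshold_ms": 100, "action": "block_deploy"},
    "startup_time":    {"threshold_ms": 5000, "action": "warn"},
}

def check_budgets(measured_ms):
    """Split over-budget metrics into deploy blockers and warnings."""
    blockers, warnings = [], []
    for metric, value in measured_ms.items():
        budget = BUDGETS.get(metric)
        if budget is None or value <= budget["threshold_ms"]:
            continue
        target = blockers if budget["action"] == "block_deploy" else warnings
        target.append(metric)
    return blockers, warnings
```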
Extension 3: A/B Performance Testing
Compare two versions with statistical rigor:
- Deploy both simultaneously
- Route traffic proportionally
- Measure real-world performance difference
Challenge: Multi-Dimensional Regression
Detect regressions across multiple metrics:
- Latency improved but memory increased
- Throughput same but CPU usage doubled
- Trade-off analysis and alerting
Real-World Connections
Industry Systems
- Google’s PerfGate: Automated performance testing in CI
- Facebook’s Jevons: Efficiency regression detection
- Netflix’s Flamescope: Performance investigation tooling
- GitHub Actions: CI/CD integration patterns
Best Practices
- Start with most critical benchmarks
- Require HIGH confidence before blocking
- Always attach evidence to alerts
- Track false positive rate
- Review and tune regularly
Self-Assessment Checklist
Before considering this project complete, verify:
- Benchmarks run with controlled environment
- Statistical tests correctly detect 10% regression
- False positives are rare (< 5% of alerts)
- Evidence (flamegraphs) attached to alerts
- CI/CD integration works end-to-end
- Dashboard shows historical trends
- Documentation explains how to respond to alerts
Resources
Essential Reading
- “Systems Performance” by Gregg, Chapter 2
- “High Performance Python” by Gorelick & Ozsvald, Chapter 1
- “Statistics for Experimenters” by Box, Hunter & Hunter
Tools
- Continuous benchmarking: Bencher, Codspeed
- Flamegraph diff: FlameGraph repo’s difffolded.pl
- Statistical testing: scipy.stats, R
- Visualization: Grafana, custom dashboards
Reference
- GitHub Actions performance testing examples
- Google’s testing blog on performance
- Netflix tech blog on performance