Project 10: End-to-End Performance Regression Dashboard
Project Overview
| Attribute | Details |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1 month+ |
| Primary Language | C |
| Alternative Languages | Go, Rust, Python |
| Knowledge Area | Performance Engineering Systems |
| Tools Required | perf, flamegraphs, tracing tools, CI/CD |
| Primary Reference | "Systems Performance" by Brendan Gregg |
Learning Objectives
By completing this project, you will be able to:
- Automate performance benchmarking as part of CI/CD pipelines
- Detect statistically significant regressions and distinguish them from measurement noise
- Attach profiling evidence to regression alerts
- Build performance trend dashboards with historical data
- Design alerting strategies that avoid false positives
- Implement performance budgets as deployment gates
Deep Theoretical Foundation
Why Regression Detection Is Hard
Performance is inherently noisy. Every benchmark run varies due to:
- CPU frequency fluctuations
- Scheduler decisions
- Background processes
- Cache state
- Memory allocation patterns
A "5% regression" might be:
- Real code change that hurt performance
- Measurement noise
- Infrastructure change (different machine)
- Background activity on CI runner
Without statistical rigor, you'll either:
- Alert on every run (alert fatigue)
- Miss real regressions (silent degradation)
Statistical Foundations
Hypothesis Testing for Regression
Null hypothesis: there is no performance difference between versions.
Alternative hypothesis: version B is slower than version A.
H₀: μ_A = μ_B
H₁: μ_A < μ_B (one-tailed, looking for regressions)
Collect samples: A = [a₁, a₂, ..., aₙ], B = [b₁, b₂, ..., bₘ]
Perform test: Mann-Whitney U or Welch's t-test
If p-value < α (typically 0.05): reject H₀ and flag a regression
Mann-Whitney U Test
Non-parametric test that works with non-normal distributions (common for latency):
- Compares ranks rather than values
- Robust to outliers
- Doesn't assume equal variance
Effect Size (Cohen's d)
Statistical significance isn't enough; we also need practical significance:
d = (μ_B - μ_A) / pooled_std
Interpretation (Cohen's conventional thresholds):
- d < 0.2: negligible to small effect (probably not worth investigating)
- 0.2 ≤ d < 0.5: small effect
- 0.5 ≤ d < 0.8: medium effect
- d ≥ 0.8: large effect
A 1% change might be statistically significant with enough samples, but not worth acting on.
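To make both checks concrete, here is a minimal sketch using scipy and numpy; the latency samples are synthetic values invented for illustration, not measurements from the project:

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
baseline = rng.normal(5.2, 0.15, size=50)   # synthetic baseline latencies (ms)
current = rng.normal(6.1, 0.20, size=50)    # synthetic "regressed" latencies (ms)

# One-tailed test: is baseline stochastically smaller than current (i.e. is current slower)?
stat, p_value = mannwhitneyu(baseline, current, alternative='less')

# Cohen's d with a pooled standard deviation
pooled_std = np.sqrt((np.var(baseline) + np.var(current)) / 2)
d = (np.mean(current) - np.mean(baseline)) / pooled_std

print(f"p-value = {p_value:.2e}, Cohen's d = {d:.2f}")
# A tiny p-value alone is not enough to act; d tells you whether the shift is large enough to matter.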
Regression Detection Criteria
A robust regression detector requires multiple signals:
┌──────────────────────────────────────────────────────────┐
│ Regression Detection Checklist                           │
├──────────────────────────────────────────────────────────┤
│ ☐ Magnitude: Change > threshold (e.g., 5% for latency)   │
│ ☐ Significance: p-value < 0.05 (Mann-Whitney U)          │
│ ☐ Consistency: Appears in 3+ consecutive runs            │
│ ☐ Baseline stable: Baseline CV < 15%                     │
│ ☐ Environment: No infra changes during measurement       │
├──────────────────────────────────────────────────────────┤
│ Confidence: 4/5 criteria → HIGH, 3/5 → MEDIUM, <3 → LOW  │
└──────────────────────────────────────────────────────────┘
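One way to encode this checklist is a simple criteria counter. The sketch below is illustrative only; the ComparisonResult container and its field names are hypothetical, not part of the specification:

from dataclasses import dataclass

@dataclass
class ComparisonResult:        # hypothetical container for one benchmark comparison
    change_pct: float          # relative change, e.g. 0.07 for +7%
    p_value: float
    consecutive_runs: int      # consecutive runs in which the change appeared
    baseline_cv: float         # coefficient of variation of the baseline
    infra_changed: bool        # did the CI environment change during measurement?

def confidence(r: ComparisonResult, threshold: float = 0.05, alpha: float = 0.05) -> str:
    criteria = [
        abs(r.change_pct) > threshold,   # magnitude
        r.p_value < alpha,               # significance
        r.consecutive_runs >= 3,         # consistency
        r.baseline_cv < 0.15,            # baseline stability
        not r.infra_changed,             # stable environment
    ]
    met = sum(criteria)
    if met >= 4:
        return 'HIGH'
    if met == 3:
        return 'MEDIUM'
    return 'LOW'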
CI/CD Integration Patterns
Pattern 1: Per-Commit Benchmarks (Fast Feedback)
Commit → Build → Quick benchmark (30s) → Report
Pros: Fast feedback, catches obvious regressions
Cons: High noise, limited coverage
Pattern 2: Nightly Performance Builds (Thorough)
Nightly → Full benchmark suite (1h+) → Statistical analysis → Alert
Pros: Low noise, comprehensive
Cons: Delayed feedback (next day)
Pattern 3: Pre-Merge Gates (Blocking)
PR → Merge blocked until perf tests pass → Manual override available
Pros: Prevents regressions reaching main
Cons: Slows development, requires high confidence
Recommended Approach: Layered
- Per-commit: Smoke test (catch obvious 50%+ regressions)
- PR: Targeted benchmarks (affected components only)
- Nightly: Full suite with statistical analysis
- Weekly: Long-running stress tests
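If the runner orchestration is scripted in Python, this layered policy can be encoded as a small trigger-to-suite mapping; the suite names below are placeholders for your own benchmark registry:

def select_suites(trigger: str, changed_components=None) -> list:
    """Map a CI trigger to the benchmark scope described above.
    Suite names are placeholders; adapt them to your own benchmark registry."""
    if trigger == 'commit':
        return ['smoke']                                   # ~30s, catches 50%+ regressions
    if trigger == 'pull_request':
        return ['targeted/' + c for c in (changed_components or [])]
    if trigger == 'nightly':
        return ['full']                                    # full suite + statistical analysis
    if trigger == 'weekly':
        return ['stress']                                  # long-running stress tests
    raise ValueError(f'unknown trigger: {trigger}')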
Complete Project Specification
What Youโre Building
A performance regression system called perf_dashboard that:
- Runs benchmarks with proper methodology (warmup, iterations, pinning)
- Stores results in time-series database with metadata
- Detects regressions using statistical tests
- Generates evidence (flamegraphs, counter diffs) for regressions
- Visualizes trends with historical context
- Integrates with CI/CD (GitHub Actions, Jenkins, etc.)
Functional Requirements
perf_dashboard run --suite <name> --commit <sha> --output <dir>
perf_dashboard compare --baseline <sha> --current <sha>
perf_dashboard analyze --range <sha1>..<sha2> --output <report.md>
perf_dashboard alert --threshold <pct> --confidence <level>
perf_dashboard dashboard --serve --port 8080
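One plausible way to wire up these subcommands, if the CLI layer is written in Python rather than C, is argparse with subparsers; handler bodies are omitted and the flags simply mirror the requirements above:

import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog='perf_dashboard')
    sub = parser.add_subparsers(dest='command', required=True)

    run = sub.add_parser('run', help='execute a benchmark suite')
    run.add_argument('--suite', required=True)
    run.add_argument('--commit', required=True)
    run.add_argument('--output', required=True)

    compare = sub.add_parser('compare', help='compare two commits')
    compare.add_argument('--baseline', required=True)
    compare.add_argument('--current', required=True)

    analyze = sub.add_parser('analyze', help='analyze a commit range')
    analyze.add_argument('--range', required=True)         # e.g. sha1..sha2
    analyze.add_argument('--output', required=True)

    alert = sub.add_parser('alert', help='raise alerts above a threshold')
    alert.add_argument('--threshold', type=float, required=True)
    alert.add_argument('--confidence', choices=['LOW', 'MEDIUM', 'HIGH'], default='HIGH')

    dashboard = sub.add_parser('dashboard', help='serve the web dashboard')
    dashboard.add_argument('--serve', action='store_true')
    dashboard.add_argument('--port', type=int, default=8080)

    return parser

if __name__ == '__main__':
    args = build_parser().parse_args()
    print(args)   # dispatch to the real handlers here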
Example Output
Performance Regression Report
───────────────────────────────────────────────────
Comparison: v1.4.2 (abc123) → v1.4.3 (def456)
Date: 2025-01-27
Summary:
───────────────────────────────────────────────────
Status: REGRESSION DETECTED (HIGH CONFIDENCE)
Benchmark Results:
───────────────────────────────────────────────────
Benchmark         Baseline    Current     Change    Status
                  (median)    (median)
request_latency   5.2 ms      6.1 ms      +17.3%    ✗ REGRESSED
throughput        12,450/s    10,890/s    -12.5%    ✗ REGRESSED
memory_usage      142 MB      144 MB      +1.4%     ✓ OK
startup_time      1.2 s       1.2 s       +0.8%     ✓ OK
Statistical Analysis (request_latency):
───────────────────────────────────────────────────
Baseline samples: 50 runs, CV = 3.2%
Current samples:  50 runs, CV = 4.1%
p-value (Mann-Whitney): 0.00003 (significant at α=0.05)
Effect size (Cohen's d): 1.24 (large)
Confidence: HIGH (5/5 criteria met)
Root Cause Evidence:
───────────────────────────────────────────────────
Flamegraph comparison attached: diff_flamegraph.svg
Hotspot changes:
  parse_input():  22% → 41%  (+19%)
  validate():     15% → 14%  (-1%)
  serialize():    18% → 12%  (-6%)
New hotspot detected: parse_input()
  → json_decode() added in commit def456
  → 1.2 ms additional latency per request
Commits in range:
───────────────────────────────────────────────────
def456  Add JSON parsing for new API format
bca789  Update logging (no perf impact expected)
Recommendation:
───────────────────────────────────────────────────
1. Investigate json_decode() in parse_input()
2. Consider lazy parsing or caching parsed results
3. If regression acceptable, document and update baseline
Artifacts:
- Flamegraph diff: reports/def456/diff_flamegraph.svg
- Raw data: reports/def456/benchmark_data.json
- perf profiles: reports/def456/perf.data
Solution Architecture
System Design
┌───────────────────────────────────────────────────────────────┐
│                        CI/CD Pipeline                         │
│            (GitHub Actions / Jenkins / GitLab CI)             │
└──────────────────────────────┬────────────────────────────────┘
                               │
                               ▼
┌───────────────────────────────────────────────────────────────┐
│                       Benchmark Runner                        │
│ - Workload execution                                          │
│ - Environment control (CPU pinning, frequency)                │
│ - Data collection (timing, counters, profiles)                │
└──────────────────────────────┬────────────────────────────────┘
                               │
                               ▼
┌───────────────────────────────────────────────────────────────┐
│                         Data Storage                          │
│ - SQLite/PostgreSQL for results                               │
│ - File storage for profiles/flamegraphs                       │
│ - Git SHA → Results mapping                                   │
└──────────────────────────────┬────────────────────────────────┘
                               │
         ┌─────────────────────┼─────────────────────┐
         │                     │                     │
         ▼                     ▼                     ▼
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ Statistical     │   │ Flamegraph      │   │ Dashboard       │
│ Analyzer        │   │ Generator       │   │ Server          │
│                 │   │                 │   │                 │
│ - Regression    │   │ - Capture       │   │ - Trend charts  │
│   detection     │   │ - Diff          │   │ - Alert history │
│ - Confidence    │   │ - Annotate      │   │ - Drill-down    │
└────────┬────────┘   └────────┬────────┘   └────────┬────────┘
         │                     │                     │
         └─────────────────────┼─────────────────────┘
                               │
                               ▼
┌───────────────────────────────────────────────────────────────┐
│                         Alert System                          │
│ - Slack/Email notifications                                   │
│ - GitHub PR comments                                          │
│ - JIRA ticket creation                                        │
└───────────────────────────────────────────────────────────────┘
Data Model
-- Benchmark runs
CREATE TABLE benchmark_runs (
    id INTEGER PRIMARY KEY,
    commit_sha TEXT NOT NULL,
    branch TEXT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    machine_id TEXT,
    config_hash TEXT,                 -- Detect config changes
    UNIQUE(commit_sha, machine_id, config_hash)
);

-- Individual measurements
CREATE TABLE measurements (
    id INTEGER PRIMARY KEY,
    run_id INTEGER REFERENCES benchmark_runs(id),
    benchmark_name TEXT NOT NULL,
    iteration INTEGER,
    latency_ns INTEGER,
    cpu_cycles INTEGER,
    instructions INTEGER,
    cache_misses INTEGER
);

-- Computed statistics
CREATE TABLE statistics (
    run_id INTEGER REFERENCES benchmark_runs(id),
    benchmark_name TEXT,
    sample_count INTEGER,
    median_ns INTEGER,
    p95_ns INTEGER,
    p99_ns INTEGER,
    cv_percent REAL,
    PRIMARY KEY (run_id, benchmark_name)
);

-- Regression alerts
CREATE TABLE alerts (
    id INTEGER PRIMARY KEY,
    detected_at DATETIME,
    baseline_sha TEXT,
    regression_sha TEXT,
    benchmark_name TEXT,
    change_percent REAL,
    confidence TEXT,                  -- HIGH, MEDIUM, LOW
    status TEXT,                      -- OPEN, ACKNOWLEDGED, RESOLVED, FALSE_POSITIVE
    evidence_path TEXT
);
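As a sanity check that the schema supports the write path, here is a small sqlite3 sketch that records one run, its raw samples, and the derived statistics row; it assumes the tables above have already been created:

import sqlite3
import statistics as st

def record_run(db: sqlite3.Connection, commit_sha: str, machine_id: str,
               benchmark_name: str, latencies_ns: list) -> None:
    """Insert one run, its raw measurements, and the derived statistics row."""
    cur = db.execute(
        "INSERT INTO benchmark_runs (commit_sha, machine_id, config_hash) VALUES (?, ?, ?)",
        (commit_sha, machine_id, "default"),
    )
    run_id = cur.lastrowid
    db.executemany(
        "INSERT INTO measurements (run_id, benchmark_name, iteration, latency_ns) "
        "VALUES (?, ?, ?, ?)",
        [(run_id, benchmark_name, i, ns) for i, ns in enumerate(latencies_ns)],
    )
    lat = sorted(latencies_ns)
    cv_percent = 100.0 * st.pstdev(lat) / st.mean(lat)
    db.execute(
        "INSERT INTO statistics (run_id, benchmark_name, sample_count, median_ns, "
        "p95_ns, p99_ns, cv_percent) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (run_id, benchmark_name, len(lat), int(st.median(lat)),
         lat[int(0.95 * (len(lat) - 1))], lat[int(0.99 * (len(lat) - 1))], cv_percent),
    )
    db.commit()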
Key Components
Benchmark Runner
typedef struct {
    const char *name;
    void (*setup)(void);
    uint64_t (*run)(void);      // Benchmark body; return value keeps the work from being optimized away
    void (*teardown)(void);
    int warmup_iterations;
    int measure_iterations;
} benchmark_t;

// pin_to_cpu, set_cpu_governor, read_tsc and tsc_to_ns are small helpers defined
// elsewhere in the harness; measurement_t holds one per-iteration sample.
void run_benchmark(benchmark_t *bench, measurement_t *results) {
    // Environment control: fixed core and frequency for reproducible timing
    pin_to_cpu(2);
    set_cpu_governor("performance");

    // Setup
    if (bench->setup) bench->setup();

    // Warmup: populate caches and stabilize lazy initialization before measuring
    for (int i = 0; i < bench->warmup_iterations; i++) {
        bench->run();
    }

    // Measure: one timestamped sample per iteration
    for (int i = 0; i < bench->measure_iterations; i++) {
        uint64_t start = read_tsc();
        uint64_t result = bench->run();
        uint64_t end = read_tsc();
        (void)result;  // consume the value so the compiler cannot elide the call
        results[i].latency_ns = tsc_to_ns(end - start);
        results[i].iteration = i;
    }

    // Teardown
    if (bench->teardown) bench->teardown();
}
Statistical Analyzer
import scipy.stats as stats
import numpy as np

def detect_regression(baseline, current, threshold=0.05, alpha=0.05):
    """
    Detect whether `current` is a regression relative to `baseline`.
    Both are arrays of latency samples (lower is better).
    Returns: (is_regression, confidence, details)
    """
    # Check baseline stability (coefficient of variation)
    baseline_cv = np.std(baseline) / np.mean(baseline)
    if baseline_cv > 0.15:
        return False, 'LOW', 'Baseline too noisy'

    # Magnitude check
    baseline_median = np.median(baseline)
    current_median = np.median(current)
    change_pct = (current_median - baseline_median) / baseline_median
    if abs(change_pct) < threshold:
        return False, 'LOW', f'Change {change_pct:.1%} below threshold'

    # Statistical significance: one-tailed Mann-Whitney U
    # (is baseline stochastically smaller than current, i.e. is current slower?)
    statistic, p_value = stats.mannwhitneyu(
        baseline, current, alternative='less'
    )
    if p_value > alpha:
        return False, 'MEDIUM', f'p={p_value:.4f} not significant'

    # Effect size (Cohen's d, using medians as a robust location estimate)
    pooled_std = np.sqrt((np.var(baseline) + np.var(current)) / 2)
    cohens_d = (current_median - baseline_median) / pooled_std

    # Confidence based on multiple criteria
    criteria_met = sum([
        abs(change_pct) > threshold,
        p_value < alpha,
        abs(cohens_d) > 0.5,
        baseline_cv < 0.1,
    ])
    confidence = 'HIGH' if criteria_met >= 4 else 'MEDIUM'

    return True, confidence, {
        'change_pct': change_pct,
        'p_value': p_value,
        'cohens_d': cohens_d,
        'baseline_cv': baseline_cv,
    }
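A quick way to exercise detect_regression is to feed it synthetic samples with a known slowdown; the numbers below are invented for illustration:

import numpy as np

rng = np.random.default_rng(0)
baseline = rng.normal(5.0e6, 0.15e6, size=50)   # ~5.0 ms in ns, ~3% CV
current = baseline * 1.10                        # inject a 10% slowdown

is_reg, level, details = detect_regression(baseline, current)
print(is_reg, level, details)
# Expected: True with HIGH confidence, a tiny p-value, and a large Cohen's d.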
Phased Implementation Guide
Phase 1: Benchmark Harness (Week 1)
Goal: Reliable benchmark execution with data collection.
Steps:
- Implement benchmark runner with environment control
- Create data collection for timing and counters
- Design storage format (SQLite + files)
- Build CLI for running benchmarks
- Verify reproducibility (CV < 5%)
Validation: Repeated runs of the same benchmark report a consistent CV.
Phase 2: Statistical Analysis (Week 2)
Goal: Detect regressions with statistical rigor.
Steps:
- Implement Mann-Whitney U test
- Add effect size calculation
- Define regression criteria
- Build comparison CLI
- Test with known regressions
Validation: Detects synthetic 10% regression.
Phase 3: Evidence Collection (Week 3)
Goal: Attach profiling data to regressions.
Steps:
- Capture flamegraphs for each run
- Implement differential flamegraph
- Collect hardware counter data
- Link evidence to regression alerts
- Generate human-readable reports
Validation: Report shows correct hotspot change.
Phase 4: CI/CD Integration (Week 4)
Goal: Automated regression detection in pipeline.
Steps:
- Create GitHub Action workflow
- Implement baseline management (rolling, release)
- Add PR comment with results
- Configure alerting thresholds
- Test with real repository
Validation: PR blocked by synthetic regression.
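For the PR-comment step, one option (if the glue code is Python rather than a ready-made Action) is to call GitHub's REST API directly; owner, repo and the report text are placeholders, and the token is expected in the environment (e.g. the GITHUB_TOKEN provided to Actions jobs):

import os
import requests

def post_pr_comment(owner: str, repo: str, pr_number: int, report_md: str) -> None:
    """Post the regression report as a comment on a pull request."""
    url = f"https://api.github.com/repos/{owner}/{repo}/issues/{pr_number}/comments"
    resp = requests.post(
        url,
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"body": report_md},
        timeout=30,
    )
    resp.raise_for_status()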
Phase 5: Dashboard and Trends (Week 5+)
Goal: Visual interface for performance history.
Steps:
- Build web dashboard (simple HTML or React)
- Display trend charts over time
- Show alert history and status
- Add drill-down to individual runs
- Implement search/filter
Validation: Can navigate performance history easily.
Testing Strategy
Synthetic Regressions
- 10% latency increase: Should detect with HIGH confidence
- 5% latency increase: Should detect with MEDIUM confidence
- 2% latency increase: Should NOT alert (below threshold)
- Baseline noise spike: Should NOT alert (baseline unstable)
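These synthetic cases can be produced by perturbing a recorded baseline rather than hand-crafting data; a sketch (the percentages mirror the list above):

import numpy as np

def make_synthetic_case(baseline, slowdown_pct: float,
                        extra_noise_pct: float = 0.0, seed: int = 0):
    """Return a 'current' sample set slowed down by slowdown_pct,
    optionally with extra multiplicative noise to mimic an unstable environment."""
    rng = np.random.default_rng(seed)
    current = np.asarray(baseline, dtype=float) * (1.0 + slowdown_pct)
    if extra_noise_pct:
        current = current * rng.normal(1.0, extra_noise_pct, size=len(current))
    return current

# make_synthetic_case(baseline, 0.10) should trigger a HIGH-confidence alert;
# make_synthetic_case(baseline, 0.02) should stay below the alert threshold.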
False Positive Testing
- Environment change: Different machine, same code
- Background activity: CI runner under load
- Cold cache: First run after restart
- Flaky benchmark: High inherent variance
Integration Tests
- Full pipeline: Commit → Benchmark → Alert
- PR workflow: Open PR → Check → Comment
- Dashboard: Data → Visualization → Drill-down
Common Pitfalls and Debugging
Pitfall 1: Alert Fatigue
Symptom: Too many alerts, team ignores them.
Cause: Threshold too low or noise not controlled.
Solution:
- Increase magnitude threshold (5% → 10%)
- Require HIGH confidence for alerts
- Add human review before escalation
- Track false positive rate and tune
Pitfall 2: Missing Real Regressions
Symptom: Performance degrades unnoticed.
Cause: Threshold too high or coverage gaps.
Solution:
- Monitor trend over time (gradual degradation)
- Ensure critical paths have benchmarks
- Review resolved alerts periodically
- Compare release-to-release, not just commit-to-commit
Pitfall 3: Baseline Drift
Symptom: Baseline keeps moving, hard to compare.
Cause: Using rolling baseline that includes regressions.
Solution:
- Use release versions as baselines
- Require explicit baseline updates
- Alert on baseline changes
- Keep historical baselines for comparison
Pitfall 4: Infrastructure Variability
Symptom: Same code shows different performance.
Cause: Shared CI runners with variable load.
Solution:
- Use dedicated performance machines
- Pin to specific CPU cores
- Lock CPU frequency
- Run benchmarks in isolation
- Record machine ID with results
Extensions and Challenges
Extension 1: Automatic Bisection
When regression detected:
- Binary search through commits in range
- Find exact commit that caused regression
- Annotate commit with performance impact
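A sketch of the bisection logic, assuming a callable that builds and benchmarks a given commit and reports whether it regresses against the known-good baseline (git rev-list is used to enumerate the range):

import subprocess

def commits_between(good: str, bad: str) -> list:
    """Oldest-to-newest commits in (good, bad], via git rev-list."""
    out = subprocess.run(
        ["git", "rev-list", "--reverse", f"{good}..{bad}"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()

def bisect_regression(good: str, bad: str, is_regressed) -> str:
    """Binary-search the range for the first regressing commit.
    is_regressed(sha) must check out, build, benchmark and compare against the baseline."""
    commits = commits_between(good, bad)
    lo, hi = 0, len(commits) - 1
    first_bad = commits[-1]            # the known-bad commit, until a better answer is found
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_regressed(commits[mid]):
            first_bad = commits[mid]
            hi = mid - 1               # look for an earlier offender
        else:
            lo = mid + 1
    return first_bad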
Extension 2: Performance Budgets
Define budgets per endpoint/component:
budgets:
  api_latency_p99:
    threshold: 100ms
    action: block_deploy
  startup_time:
    threshold: 5s
    action: warn
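A minimal budget gate that consumes a config like the one above might look as follows; PyYAML is assumed, the unit parsing is deliberately simple, and metric names must match your benchmark output:

import yaml

UNIT_NS = {"ms": 1_000_000, "s": 1_000_000_000}   # "ms" is checked before "s"

def parse_threshold_ns(text: str) -> int:
    """Parse values like '100ms' or '5s' into nanoseconds."""
    for unit, factor in UNIT_NS.items():
        if text.endswith(unit):
            return int(float(text[:-len(unit)]) * factor)
    raise ValueError(f"unsupported threshold: {text}")

def check_budgets(config_path: str, measured_ns: dict) -> list:
    """Return (metric, action) pairs for every exceeded budget.
    measured_ns maps metric name -> measured value in nanoseconds."""
    with open(config_path) as f:
        budgets = yaml.safe_load(f)["budgets"]
    violations = []
    for metric, rule in budgets.items():
        if measured_ns.get(metric, 0) > parse_threshold_ns(rule["threshold"]):
            violations.append((metric, rule["action"]))
    return violations

# A 'block_deploy' violation fails the pipeline; 'warn' only annotates the build.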
Extension 3: A/B Performance Testing
Compare two versions with statistical rigor:
- Deploy both simultaneously
- Route traffic proportionally
- Measure real-world performance difference
Challenge: Multi-Dimensional Regression
Detect regressions across multiple metrics:
- Latency improved but memory increased
- Throughput same but CPU usage doubled
- Trade-off analysis and alerting
Real-World Connections
Industry Systems
- Google's PerfGate: Automated performance testing in CI
- Facebook's Jevons: Efficiency regression detection
- Netflix's FlameScope: Performance investigation tooling
- GitHub Actions: CI/CD integration patterns
Best Practices
- Start with most critical benchmarks
- Require HIGH confidence before blocking
- Always attach evidence to alerts
- Track false positive rate
- Review and tune regularly
Self-Assessment Checklist
Before considering this project complete, verify:
- Benchmarks run with controlled environment
- Statistical tests correctly detect 10% regression
- False positives are rare (< 5% of alerts)
- Evidence (flamegraphs) attached to alerts
- CI/CD integration works end-to-end
- Dashboard shows historical trends
- Documentation explains how to respond to alerts
Resources
Essential Reading
- "Systems Performance" by Gregg, Chapter 2
- "High Performance Python" by Gorelick & Ozsvald, Chapter 1
- "Statistics for Experimenters" by Box, Hunter & Hunter
Tools
- Continuous benchmarking: Bencher, Codspeed
- Flamegraph diff: the FlameGraph repo's difffolded.pl
- Statistical testing: scipy.stats, R
- Visualization: Grafana, custom dashboards
Reference
- GitHub Actions performance testing examples
- Google's testing blog on performance
- Netflix tech blog on performance