Project 10: End-to-End Performance Regression Dashboard

Project Overview

  Difficulty: Advanced
  Time Estimate: 1 month+
  Primary Language: C
  Alternative Languages: Go, Rust, Python
  Knowledge Area: Performance Engineering Systems
  Tools Required: perf, flamegraphs, tracing tools, CI/CD
  Primary Reference: "Systems Performance" by Brendan Gregg

Learning Objectives

By completing this project, you will be able to:

  1. Automate performance benchmarking as part of CI/CD pipelines
  2. Detect statistically significant regressions and distinguish them from noise
  3. Attach profiling evidence to regression alerts
  4. Build performance trend dashboards with historical data
  5. Design alerting strategies that avoid false positives
  6. Implement performance budgets as deployment gates

Deep Theoretical Foundation

Why Regression Detection Is Hard

Performance is inherently noisy. Every benchmark run varies due to:

  • CPU frequency fluctuations
  • Scheduler decisions
  • Background processes
  • Cache state
  • Memory allocation patterns

A "5% regression" might be:

  1. Real code change that hurt performance
  2. Measurement noise
  3. Infrastructure change (different machine)
  4. Background activity on CI runner

Without statistical rigor, you'll either:

  • Alert on every run (alert fatigue)
  • Miss real regressions (silent degradation)

Statistical Foundations

Hypothesis Testing for Regression

Null hypothesis: No performance difference between versions.
Alternative hypothesis: Version B is slower than Version A.

H₀: μ_A = μ_B
H₁: μ_A < μ_B (one-tailed, looking for regressions)

Collect samples: A = [a₁, a₂, ..., aₙ], B = [b₁, b₂, ..., bₘ]
Perform test: Mann-Whitney U or Welch's t-test
If p-value < α (typically 0.05): reject H₀, flag regression

Mann-Whitney U Test

Non-parametric test that works with non-normal distributions (common for latency):

  • Compares ranks rather than values
  • Robust to outliers
  • Doesn't assume equal variance

Effect Size (Cohen's d)

Statistical significance isn't enough; we need practical significance:

d = (μ_B - μ_A) / pooled_std

Interpretation (Cohen's conventional cutoffs):
  d < 0.2    Negligible effect (probably not worth investigating)
  d 0.2-0.5  Small effect
  d 0.5-0.8  Medium effect
  d > 0.8    Large effect

A 1% change might be statistically significant with enough samples, but not worth acting on.
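
This is easy to see with synthetic numbers. Below is a minimal sketch (illustrative values, NumPy/SciPy assumed available): with 5,000 samples per side, a ~1% shift produces a vanishingly small p-value, yet Cohen's d comes out around 0.2 (small).

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
baseline = rng.normal(loc=5.00, scale=0.25, size=5000)   # latency in ms
current  = rng.normal(loc=5.05, scale=0.25, size=5000)   # ~1% slower on average

# One-tailed Mann-Whitney U: is baseline stochastically smaller than current?
_, p_value = stats.mannwhitneyu(baseline, current, alternative='less')

# Cohen's d with a pooled standard deviation
pooled_std = np.sqrt((np.var(baseline, ddof=1) + np.var(current, ddof=1)) / 2)
cohens_d = (np.mean(current) - np.mean(baseline)) / pooled_std

# Expect a tiny p-value (statistically significant) but d near 0.2 (small)
print(f"p-value = {p_value:.2e}, Cohen's d = {cohens_d:.2f}")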

Regression Detection Criteria

A robust regression detector requires multiple signals:

┌────────────────────────────────────────────────────────────────┐
│ Regression Detection Checklist                                 │
├────────────────────────────────────────────────────────────────┤
│ ☑ Magnitude: Change > threshold (e.g., 5% for latency)         │
│ ☑ Significance: p-value < 0.05 (Mann-Whitney U)                │
│ ☑ Consistency: Appears in 3+ consecutive runs                  │
│ ☑ Baseline stable: Baseline CV < 15%                           │
│ ☐ Environment: No infra changes during measurement             │
├────────────────────────────────────────────────────────────────┤
│ Confidence: 4/5 criteria → HIGH, 3/5 → MEDIUM, <3 → LOW        │
└────────────────────────────────────────────────────────────────┘
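
A minimal sketch of turning the checklist into a confidence score; the function name confidence_level and the criteria keys are illustrative, and the mapping mirrors the table above.

def confidence_level(criteria: dict) -> str:
    """Map checklist booleans to a confidence level (4/5 -> HIGH, 3/5 -> MEDIUM).

    criteria example (keys are illustrative):
        {'magnitude': True, 'significance': True, 'consistency': True,
         'baseline_stable': True, 'environment_clean': False}
    """
    met = sum(bool(v) for v in criteria.values())
    if met >= 4:
        return 'HIGH'
    if met == 3:
        return 'MEDIUM'
    return 'LOW'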

CI/CD Integration Patterns

Pattern 1: Per-Commit Benchmarks (Fast Feedback)

Commit → Build → Quick benchmark (30s) → Report
Pros: Fast feedback, catches obvious regressions
Cons: High noise, limited coverage

Pattern 2: Nightly Performance Builds (Thorough)

Nightly → Full benchmark suite (1h+) → Statistical analysis → Alert
Pros: Low noise, comprehensive
Cons: Delayed feedback (next day)

Pattern 3: Pre-Merge Gates (Blocking)

PR → Merge blocked until perf tests pass → Manual override available
Pros: Prevents regressions reaching main
Cons: Slows development, requires high confidence

Recommended Approach: Layered

Per-commit:  Smoke test (catch obvious 50%+ regressions)
PR:          Targeted benchmarks (affected components only)
Nightly:     Full suite with statistical analysis
Weekly:      Long-running stress tests
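
One way to wire up the layered approach is to pick the benchmark suite from the CI trigger. This is a sketch assuming GitHub Actions (GITHUB_EVENT_NAME is a standard variable there); the suite names are placeholders.

import os

SUITES = {
    'push':              'smoke',     # per-commit: catch obvious 50%+ regressions
    'pull_request':      'targeted',  # PR: affected components only
    'schedule':          'full',      # nightly: full suite with statistical analysis
    'workflow_dispatch': 'stress',    # manual/weekly: long-running stress tests
}

def select_suite() -> str:
    """Return the suite name for the current CI trigger (default: smoke)."""
    event = os.environ.get('GITHUB_EVENT_NAME', 'push')
    return SUITES.get(event, 'smoke')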

Complete Project Specification

What Youโ€™re Building

A performance regression system called perf_dashboard that:

  1. Runs benchmarks with proper methodology (warmup, iterations, pinning)
  2. Stores results in time-series database with metadata
  3. Detects regressions using statistical tests
  4. Generates evidence (flamegraphs, counter diffs) for regressions
  5. Visualizes trends with historical context
  6. Integrates with CI/CD (GitHub Actions, Jenkins, etc.)

Functional Requirements

perf_dashboard run --suite <name> --commit <sha> --output <dir>
perf_dashboard compare --baseline <sha> --current <sha>
perf_dashboard analyze --range <sha1>..<sha2> --output <report.md>
perf_dashboard alert --threshold <pct> --confidence <level>
perf_dashboard dashboard --serve --port 8080
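
A sketch of what this command-line surface could look like with argparse; the subcommands and flags follow the requirements above, while dispatch to real handlers is left as a placeholder.

import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog='perf_dashboard')
    sub = parser.add_subparsers(dest='command', required=True)

    run = sub.add_parser('run', help='execute a benchmark suite')
    run.add_argument('--suite', required=True)
    run.add_argument('--commit', required=True)
    run.add_argument('--output', required=True)

    compare = sub.add_parser('compare', help='compare two commits')
    compare.add_argument('--baseline', required=True)
    compare.add_argument('--current', required=True)

    analyze = sub.add_parser('analyze', help='analyze a commit range')
    analyze.add_argument('--range', required=True)
    analyze.add_argument('--output', required=True)

    alert = sub.add_parser('alert', help='evaluate alerting thresholds')
    alert.add_argument('--threshold', type=float, required=True)
    alert.add_argument('--confidence', choices=['LOW', 'MEDIUM', 'HIGH'])

    dash = sub.add_parser('dashboard', help='serve the web dashboard')
    dash.add_argument('--serve', action='store_true')
    dash.add_argument('--port', type=int, default=8080)
    return parser

if __name__ == '__main__':
    args = build_parser().parse_args()
    print(args)  # dispatch to the real handlers here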

Example Output

Performance Regression Report
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
Comparison: v1.4.2 (abc123) โ†’ v1.4.3 (def456)
Date: 2025-01-27

Summary:
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Status: REGRESSION DETECTED (HIGH CONFIDENCE)

Benchmark Results:
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Benchmark           Baseline    Current     Change    Status
                    (median)    (median)
request_latency     5.2 ms      6.1 ms      +17.3%    โš  REGRESSED
throughput          12,450/s    10,890/s    -12.5%    โš  REGRESSED
memory_usage        142 MB      144 MB      +1.4%     โœ“ OK
startup_time        1.2 s       1.2 s       +0.8%     โœ“ OK

Statistical Analysis (request_latency):
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Baseline samples: 50 runs, CV = 3.2%
Current samples: 50 runs, CV = 4.1%
p-value (Mann-Whitney): 0.00003 (significant at ฮฑ=0.05)
Effect size (Cohen's d): 1.24 (large)
Confidence: HIGH (5/5 criteria met)

Root Cause Evidence:
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Flamegraph comparison attached: diff_flamegraph.svg

Hotspot changes:
  parse_input():      22% โ†’ 41%  (+19% โš )
  validate():         15% โ†’ 14%  (-1%)
  serialize():        18% โ†’ 12%  (-6%)

New hotspot detected: parse_input()
  โ†’ json_decode() added in commit def456
  โ†’ 1.2ms additional latency per request

Commits in range:
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  def456  Add JSON parsing for new API format
  bca789  Update logging (no perf impact expected)

Recommendation:
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
1. Investigate json_decode() in parse_input()
2. Consider lazy parsing or caching parsed results
3. If regression acceptable, document and update baseline

Artifacts:
  - Flamegraph diff: reports/def456/diff_flamegraph.svg
  - Raw data: reports/def456/benchmark_data.json
  - perf profiles: reports/def456/perf.data

Solution Architecture

System Design

┌──────────────────────────────────────────────────────────────────────┐
│                        CI/CD Pipeline                                │
│  (GitHub Actions / Jenkins / GitLab CI)                              │
└───────────────────────────────┬──────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────────┐
│                      Benchmark Runner                                │
│  - Workload execution                                                │
│  - Environment control (CPU pinning, frequency)                      │
│  - Data collection (timing, counters, profiles)                      │
└───────────────────────────────┬──────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────────┐
│                      Data Storage                                    │
│  - SQLite/PostgreSQL for results                                     │
│  - File storage for profiles/flamegraphs                             │
│  - Git SHA → Results mapping                                         │
└───────────────────────────────┬──────────────────────────────────────┘
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
          ▼                     ▼                     ▼
 ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
 │ Statistical     │   │ Flamegraph      │   │ Dashboard       │
 │ Analyzer        │   │ Generator       │   │ Server          │
 │                 │   │                 │   │                 │
 │ - Regression    │   │ - Capture       │   │ - Trend charts  │
 │   detection     │   │ - Diff          │   │ - Alert history │
 │ - Confidence    │   │ - Annotate      │   │ - Drill-down    │
 └────────┬────────┘   └────────┬────────┘   └────────┬────────┘
          │                     │                     │
          └─────────────────────┼─────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────────┐
│                      Alert System                                    │
│  - Slack/Email notifications                                         │
│  - GitHub PR comments                                                │
│  - JIRA ticket creation                                              │
└──────────────────────────────────────────────────────────────────────┘

Data Model

-- Benchmark runs
CREATE TABLE benchmark_runs (
    id INTEGER PRIMARY KEY,
    commit_sha TEXT NOT NULL,
    branch TEXT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    machine_id TEXT,
    config_hash TEXT,  -- Detect config changes
    UNIQUE(commit_sha, machine_id, config_hash)
);

-- Individual measurements
CREATE TABLE measurements (
    id INTEGER PRIMARY KEY,
    run_id INTEGER REFERENCES benchmark_runs(id),
    benchmark_name TEXT NOT NULL,
    iteration INTEGER,
    latency_ns INTEGER,
    cpu_cycles INTEGER,
    instructions INTEGER,
    cache_misses INTEGER
);

-- Computed statistics
CREATE TABLE statistics (
    run_id INTEGER REFERENCES benchmark_runs(id),
    benchmark_name TEXT,
    sample_count INTEGER,
    median_ns INTEGER,
    p95_ns INTEGER,
    p99_ns INTEGER,
    cv_percent REAL,
    PRIMARY KEY (run_id, benchmark_name)
);

-- Regression alerts
CREATE TABLE alerts (
    id INTEGER PRIMARY KEY,
    detected_at DATETIME,
    baseline_sha TEXT,
    regression_sha TEXT,
    benchmark_name TEXT,
    change_percent REAL,
    confidence TEXT,  -- HIGH, MEDIUM, LOW
    status TEXT,  -- OPEN, ACKNOWLEDGED, RESOLVED, FALSE_POSITIVE
    evidence_path TEXT
);
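
A minimal sketch of reading samples back out of this schema for a comparison; the helper name, database path, and commit SHAs below are illustrative.

import sqlite3

def fetch_samples(db_path: str, commit_sha: str, benchmark: str) -> list:
    """Return latency_ns samples for one benchmark at one commit."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            """
            SELECT m.latency_ns
            FROM measurements m
            JOIN benchmark_runs r ON r.id = m.run_id
            WHERE r.commit_sha = ? AND m.benchmark_name = ?
            """,
            (commit_sha, benchmark),
        ).fetchall()
        return [latency for (latency,) in rows]
    finally:
        conn.close()

# baseline = fetch_samples('perf.db', 'abc123', 'request_latency')
# current  = fetch_samples('perf.db', 'def456', 'request_latency')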

Key Components

Benchmark Runner

#include <stdint.h>

/* measurement_t, pin_to_cpu(), set_cpu_governor(), read_tsc() and
 * tsc_to_ns() are harness helpers defined elsewhere in the project. */

typedef struct {
    const char *name;
    void (*setup)(void);
    uint64_t (*run)(void);  // Benchmark body; return a value so the work isn't optimized away
    void (*teardown)(void);
    int warmup_iterations;
    int measure_iterations;
} benchmark_t;

void run_benchmark(benchmark_t *bench, measurement_t *results) {
    // Environment control: pin to an isolated core, lock the frequency governor
    pin_to_cpu(2);
    set_cpu_governor("performance");

    // Setup
    if (bench->setup) bench->setup();

    // Warmup: populate caches and stabilize the system before measuring
    for (int i = 0; i < bench->warmup_iterations; i++) {
        bench->run();
    }

    // Measure: time each iteration with the TSC
    for (int i = 0; i < bench->measure_iterations; i++) {
        uint64_t start = read_tsc();
        uint64_t result = bench->run();
        uint64_t end = read_tsc();

        (void)result;  // keep the benchmark call from being optimized away
        results[i].latency_ns = tsc_to_ns(end - start);
        results[i].iteration = i;
    }

    // Teardown
    if (bench->teardown) bench->teardown();
}

Statistical Analyzer

import scipy.stats as stats
import numpy as np

def detect_regression(baseline, current, threshold=0.05, alpha=0.05):
    """
    Detect whether `current` is a regression relative to `baseline`.
    Returns: (is_regression, confidence, details), where details is a
    reason string or, for detected regressions, a dict of statistics.
    """
    baseline = np.asarray(baseline)
    current = np.asarray(current)

    # Check baseline stability (coefficient of variation)
    baseline_cv = np.std(baseline) / np.mean(baseline)
    if baseline_cv > 0.15:
        return False, 'LOW', 'Baseline too noisy'

    # Magnitude check: is the median shift above the practical threshold?
    baseline_median = np.median(baseline)
    current_median = np.median(current)
    change_pct = (current_median - baseline_median) / baseline_median

    if abs(change_pct) < threshold:
        return False, 'LOW', f'Change {change_pct:.1%} below threshold'

    # Statistical significance: one-tailed Mann-Whitney U
    # (baseline stochastically smaller than current, i.e. current is slower)
    _, p_value = stats.mannwhitneyu(
        baseline, current, alternative='less'
    )

    if p_value > alpha:
        return False, 'MEDIUM', f'p={p_value:.4f} not significant'

    # Effect size (Cohen's d, using means to match the formula above)
    pooled_std = np.sqrt((np.var(baseline) + np.var(current)) / 2)
    cohens_d = (np.mean(current) - np.mean(baseline)) / pooled_std

    # Confidence based on how many criteria are met
    criteria_met = sum([
        abs(change_pct) > threshold,
        p_value < alpha,
        abs(cohens_d) > 0.5,
        baseline_cv < 0.1,
    ])

    confidence = 'HIGH' if criteria_met >= 4 else 'MEDIUM'

    return True, confidence, {
        'change_pct': change_pct,
        'p_value': p_value,
        'cohens_d': cohens_d,
        'baseline_cv': baseline_cv,
    }
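
A quick way to sanity-check detect_regression is to feed it synthetic samples with a known ~10% shift; the numbers below are illustrative and should come out as a HIGH-confidence regression.

import numpy as np

rng = np.random.default_rng(0)
baseline = rng.normal(loc=5.2, scale=0.15, size=50)   # ms, CV around 3%
current  = rng.normal(loc=5.7, scale=0.18, size=50)   # roughly 10% slower

is_regression, confidence, details = detect_regression(baseline, current)
print(is_regression, confidence, details)
# Expected: True with HIGH confidence and a large Cohen's d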

Phased Implementation Guide

Phase 1: Benchmark Harness (Week 1)

Goal: Reliable benchmark execution with data collection.

Steps:

  1. Implement benchmark runner with environment control
  2. Create data collection for timing and counters
  3. Design storage format (SQLite + files)
  4. Build CLI for running benchmarks
  5. Verify reproducibility (CV < 5%)

Validation: Same benchmark, same CV across runs.

Phase 2: Statistical Analysis (Week 2)

Goal: Detect regressions with statistical rigor.

Steps:

  1. Implement Mann-Whitney U test
  2. Add effect size calculation
  3. Define regression criteria
  4. Build comparison CLI
  5. Test with known regressions

Validation: Detects synthetic 10% regression.

Phase 3: Evidence Collection (Week 3)

Goal: Attach profiling data to regressions.

Steps:

  1. Capture flamegraphs for each run
  2. Implement differential flamegraph
  3. Collect hardware counter data
  4. Link evidence to regression alerts
  5. Generate human-readable reports

Validation: Report shows correct hotspot change.
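
A hedged sketch of the differential-flamegraph step, shelling out to Brendan Gregg's FlameGraph scripts (stackcollapse-perf.pl, difffolded.pl, flamegraph.pl); the checkout path and file names are assumptions about your setup.

import subprocess
from pathlib import Path

FLAMEGRAPH_DIR = Path('/opt/FlameGraph')  # assumed location of the FlameGraph checkout

def collapse(perf_data: str, folded_out: str) -> None:
    """perf.data -> folded stacks (perf script | stackcollapse-perf.pl)."""
    script = subprocess.run(
        ['perf', 'script', '-i', perf_data],
        check=True, capture_output=True, text=True,
    )
    folded = subprocess.run(
        [str(FLAMEGRAPH_DIR / 'stackcollapse-perf.pl')],
        input=script.stdout, check=True, capture_output=True, text=True,
    )
    Path(folded_out).write_text(folded.stdout)

def diff_flamegraph(baseline_folded: str, current_folded: str, svg_out: str) -> None:
    """Folded stacks from two runs -> differential flamegraph SVG."""
    diff = subprocess.run(
        [str(FLAMEGRAPH_DIR / 'difffolded.pl'), baseline_folded, current_folded],
        check=True, capture_output=True, text=True,
    )
    svg = subprocess.run(
        [str(FLAMEGRAPH_DIR / 'flamegraph.pl')],
        input=diff.stdout, check=True, capture_output=True, text=True,
    )
    Path(svg_out).write_text(svg.stdout)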

Phase 4: CI/CD Integration (Week 4)

Goal: Automated regression detection in pipeline.

Steps:

  1. Create GitHub Action workflow
  2. Implement baseline management (rolling, release)
  3. Add PR comment with results
  4. Configure alerting thresholds
  5. Test with real repository

Validation: PR blocked by synthetic regression.
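
For the PR-comment step, here is a hedged sketch using the GitHub REST API's issue-comments endpoint via requests; the repository name, token variable, and report path are assumptions.

import os
import requests

def post_pr_comment(repo: str, pr_number: int, body: str) -> None:
    """repo is 'owner/name'; the token is read from the GITHUB_TOKEN env var."""
    url = f'https://api.github.com/repos/{repo}/issues/{pr_number}/comments'
    resp = requests.post(
        url,
        headers={
            'Authorization': f"Bearer {os.environ['GITHUB_TOKEN']}",
            'Accept': 'application/vnd.github+json',
        },
        json={'body': body},
    )
    resp.raise_for_status()

# post_pr_comment('my-org/my-service', 42, open('report.md').read())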

Phase 5: Dashboard (Week 5)

Goal: Visual interface for performance history.

Steps:

  1. Build web dashboard (simple HTML or React)
  2. Display trend charts over time
  3. Show alert history and status
  4. Add drill-down to individual runs
  5. Implement search/filter

Validation: Can navigate performance history easily.
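
A minimal sketch of a trend endpoint for the dashboard, using Flask on top of the statistics table from the data model; Flask is just one option (the spec allows plain HTML or React), and the database path and route are illustrative.

import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)
DB_PATH = 'perf.db'  # assumed database location

@app.route('/api/trend/<benchmark>')
def trend(benchmark):
    """Median and p99 latency per commit for one benchmark, oldest first."""
    conn = sqlite3.connect(DB_PATH)
    try:
        rows = conn.execute(
            """
            SELECT r.commit_sha, r.timestamp, s.median_ns, s.p99_ns
            FROM statistics s
            JOIN benchmark_runs r ON r.id = s.run_id
            WHERE s.benchmark_name = ?
            ORDER BY r.timestamp
            """,
            (benchmark,),
        ).fetchall()
    finally:
        conn.close()
    return jsonify([
        {'commit': sha, 'timestamp': ts, 'median_ns': med, 'p99_ns': p99}
        for sha, ts, med, p99 in rows
    ])

if __name__ == '__main__':
    app.run(port=8080)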


Testing Strategy

Synthetic Regressions

  1. 10% latency increase: Should detect with HIGH confidence
  2. 5% latency increase: Should detect with MEDIUM confidence
  3. 2% latency increase: Should NOT alert (below threshold)
  4. Baseline noise spike: Should NOT alert (baseline unstable)

False Positive Testing

  1. Environment change: Different machine, same code
  2. Background activity: CI runner under load
  3. Cold cache: First run after restart
  4. Flaky benchmark: High inherent variance

Integration Tests

  1. Full pipeline: Commit โ†’ Benchmark โ†’ Alert
  2. PR workflow: Open PR โ†’ Check โ†’ Comment
  3. Dashboard: Data โ†’ Visualization โ†’ Drill-down

Common Pitfalls and Debugging

Pitfall 1: Alert Fatigue

Symptom: Too many alerts, team ignores them.

Cause: Threshold too low or noise not controlled.

Solution:

  • Increase magnitude threshold (5% → 10%)
  • Require HIGH confidence for alerts
  • Add human review before escalation
  • Track false positive rate and tune

Pitfall 2: Missing Real Regressions

Symptom: Performance degrades unnoticed.

Cause: Threshold too high or coverage gaps.

Solution:

  • Monitor trend over time (gradual degradation)
  • Ensure critical paths have benchmarks
  • Review resolved alerts periodically
  • Compare release-to-release, not just commit-to-commit

Pitfall 3: Baseline Drift

Symptom: Baseline keeps moving, hard to compare.

Cause: Using rolling baseline that includes regressions.

Solution:

  • Use release versions as baselines
  • Require explicit baseline updates
  • Alert on baseline changes
  • Keep historical baselines for comparison

Pitfall 4: Infrastructure Variability

Symptom: Same code shows different performance.

Cause: Shared CI runners with variable load.

Solution:

  • Use dedicated performance machines
  • Pin to specific CPU cores
  • Lock CPU frequency
  • Run benchmarks in isolation
  • Record machine ID with results

Extensions and Challenges

Extension 1: Automatic Bisection

When a regression is detected (see the sketch after this list):

  1. Binary search through commits in range
  2. Find exact commit that caused regression
  3. Annotate commit with performance impact
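
A sketch of the bisection logic, under the assumption that the regression, once introduced, persists in every later commit; checkout_and_build() and measure_median() are hypothetical project-specific helpers.

def bisect_regression(commits, baseline_median, threshold=0.05):
    """commits: oldest-to-newest list of SHAs between the good and bad builds.
    Returns the first commit whose median exceeds the baseline by > threshold."""
    lo, hi = 0, len(commits) - 1
    first_bad = None
    while lo <= hi:
        mid = (lo + hi) // 2
        checkout_and_build(commits[mid])             # hypothetical helper
        median = measure_median('request_latency')   # hypothetical helper
        if (median - baseline_median) / baseline_median > threshold:
            first_bad = commits[mid]   # regression present: look earlier
            hi = mid - 1
        else:
            lo = mid + 1               # still fast: regression is later
    return first_bad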

Extension 2: Performance Budgets

Define budgets per endpoint/component:

budgets:
  api_latency_p99:
    threshold: 100ms
    action: block_deploy
  startup_time:
    threshold: 5s
    action: warn
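
A sketch of how such budgets might be enforced in CI: compare measured values against each budget and fail the job when a blocking budget is exceeded. The BUDGETS dict mirrors the YAML above; the units and measured values are illustrative.

import sys

BUDGETS = {
    'api_latency_p99': {'threshold_ms': 100, 'action': 'block_deploy'},
    'startup_time':    {'threshold_ms': 5000, 'action': 'warn'},
}

def check_budgets(measured_ms: dict) -> None:
    blocking_failure = False
    for name, budget in BUDGETS.items():
        value = measured_ms.get(name)
        if value is None or value <= budget['threshold_ms']:
            continue
        print(f"{name}: {value} ms exceeds budget {budget['threshold_ms']} ms "
              f"({budget['action']})")
        if budget['action'] == 'block_deploy':
            blocking_failure = True
    if blocking_failure:
        sys.exit(1)  # non-zero exit fails the CI job and gates the deploy

# check_budgets({'api_latency_p99': 112, 'startup_time': 4200})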

Extension 3: A/B Performance Testing

Compare two versions with statistical rigor:

  • Deploy both simultaneously
  • Route traffic proportionally
  • Measure real-world performance difference

Challenge: Multi-Dimensional Regression

Detect regressions across multiple metrics:

  • Latency improved but memory increased
  • Throughput same but CPU usage doubled
  • Trade-off analysis and alerting

Real-World Connections

Industry Systems

  1. Google's PerfGate: Automated performance testing in CI
  2. Facebook's Jevons: Efficiency regression detection
  3. Netflix's Flamescope: Performance investigation tooling
  4. GitHub Actions: CI/CD integration patterns

Best Practices

  • Start with most critical benchmarks
  • Require HIGH confidence before blocking
  • Always attach evidence to alerts
  • Track false positive rate
  • Review and tune regularly

Self-Assessment Checklist

Before considering this project complete, verify:

  • Benchmarks run with controlled environment
  • Statistical tests correctly detect 10% regression
  • False positives are rare (< 5% of alerts)
  • Evidence (flamegraphs) attached to alerts
  • CI/CD integration works end-to-end
  • Dashboard shows historical trends
  • Documentation explains how to respond to alerts

Resources

Essential Reading

  • "Systems Performance" by Gregg, Chapter 2
  • "High Performance Python" by Gorelick & Ozsvald, Chapter 1
  • "Statistics for Experimenters" by Box, Hunter & Hunter

Tools

  • Continuous benchmarking: Bencher, Codspeed
  • Flamegraph diff: FlameGraph repo's difffolded.pl
  • Statistical testing: scipy.stats, R
  • Visualization: Grafana, custom dashboards

Reference

  • GitHub Actions performance testing examples
  • Googleโ€™s testing blog on performance
  • Netflix tech blog on performance