Project 7: Service Level Expectation (SLE) Agreement

Create formalized “Service Level Expectations” between teams with live dashboards tracking compliance.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1 week (15-20 hours) |
| Primary Language | Markdown / Prometheus |
| Alternative Languages | Python, Terraform |
| Prerequisites | Basic monitoring concepts, metrics |
| Key Topics | SLOs, SLIs, SLAs, Error Budgets |

1. Learning Objectives

By completing this project, you will:

  1. Design internal service contracts between teams
  2. Define meaningful SLIs (Service Level Indicators)
  3. Set realistic SLO targets (Service Level Objectives)
  4. Implement live dashboards tracking compliance
  5. Use error budgets to balance reliability and velocity

2. Theoretical Foundation

2.1 Core Concepts

The SLI/SLO/SLA Hierarchy

┌─────────────────────────────────────────────────────────────────┐
│                         SLA                                     │
│           (Service Level Agreement - Contract)                  │
│    "If we miss this, there are financial/legal consequences"   │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                         SLO                                     │
│           (Service Level Objective - Target)                    │
│    "We aim to achieve this level of reliability"               │
│    Example: 99.9% of requests succeed within 200ms             │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                         SLI                                     │
│           (Service Level Indicator - Measurement)               │
│    "How we measure the thing we care about"                    │
│    Example: (Successful requests / Total requests) * 100       │
└─────────────────────────────────────────────────────────────────┘

Service Level Expectations (SLE) - Internal Focus

SLEs are like SLOs but for internal team-to-team relationships:

EXTERNAL SLA                    INTERNAL SLE
(Company ↔ Customer)            (Team ↔ Team)

┌─────────────────────┐        ┌─────────────────────┐
│ 99.95% uptime       │        │ Ticket response:    │
│ or refund           │        │ < 4 hours           │
│                     │        │                     │
│ Legal contract      │        │ Operating agreement │
│ Financial penalty   │        │ Trust & reputation  │
└─────────────────────┘        └─────────────────────┘

Error Budgets

The concept that allows teams to balance reliability with velocity:

100% reliability = 0% error budget = no room for risk

99.9% SLO = 0.1% error budget = 43 minutes of downtime/month allowed

┌──────────────────────────────────────────────────────────────────┐
│                      30-DAY ERROR BUDGET                         │
│                                                                  │
│ Budget: 43 minutes                                              │
│ Used:   ████████████░░░░░░░░░░░░░░░░░░░░ 28 min (65%)          │
│ Status: 🟢 Safe to ship new features                            │
│                                                                  │
│ If budget exhausted:                                            │
│ Status: 🔴 Focus on reliability, no new features                │
└──────────────────────────────────────────────────────────────────┘
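The arithmetic above generalizes to any target and window. A minimal Python sketch (the function names and the 80% warning threshold are illustrative, not part of the project spec):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total allowed downtime (minutes) for a given SLO over a window."""
    return (1 - slo_target) * window_days * 24 * 60

def budget_status(used_minutes: float, slo_target: float, window_days: int = 30) -> str:
    """Traffic-light status based on the fraction of budget consumed."""
    total = error_budget_minutes(slo_target, window_days)
    used_pct = used_minutes / total * 100
    if used_pct < 80:
        return "green"   # safe to ship new features
    if used_pct < 100:
        return "yellow"  # slow down, watch the burn rate
    return "red"         # budget exhausted: reliability work only

# 99.9% over 30 days allows 43.2 minutes of downtime
print(error_budget_minutes(0.999))  # → 43.2 (approximately)
print(budget_status(28, 0.999))     # → green (65% used)
```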

The Four Golden Signals

Google SRE’s recommended metrics for any service:

| Signal | What It Measures | Example SLI |
|---|---|---|
| Latency | Time to serve a request | 95th percentile response time |
| Traffic | Demand on the system | Requests per second |
| Errors | Rate of failed requests | % of 5xx responses |
| Saturation | How “full” the service is | CPU utilization, queue depth |
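Each signal maps to a short query. The PromQL below assumes standard client-library and node_exporter metric names (`http_requests_total`, `http_request_duration_seconds_bucket`, `node_cpu_seconds_total`); adjust to your own instrumentation:

```promql
# Latency: 95th percentile response time
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Saturation: CPU utilization (1 minus idle fraction)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
```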

2.2 Why This Matters

Most team frustrations come from mismatched expectations:

  • Team A expects an answer in 10 minutes
  • Team B thinks 2 days is reasonable
  • Result: Friction, resentment, escalations

SLEs make expectations explicit:

  • “We will respond to tickets within 4 business hours”
  • “We will review PRs within 1 business day”
  • “Our API will be available 99.9% of the time”

2.3 Historical Context

  • ITIL SLAs (1980s): Formalized service agreements in IT
  • Google SRE Book (2016): Introduced SLO/SLI framework to mainstream
  • Error Budgets (2016): Revolutionary concept linking reliability targets to development velocity

2.4 Common Misconceptions

| Misconception | Reality |
|---|---|
| “100% is the right target” | 100% is impossible and paralyzes development |
| “More 9s is always better” | Each 9 is 10x harder and more expensive |
| “SLOs are set once” | SLOs evolve based on customer needs and capability |
| “Missing SLO = failure” | Error budgets exist for a reason |

3. Project Specification

3.1 What You Will Build

  1. SLE Agreement Template: Document format for team-to-team contracts
  2. SLI Definitions: Concrete measurements for each commitment
  3. Monitoring Dashboard: Live view of SLE compliance
  4. Error Budget Tracker: Shows remaining budget and trend

3.2 Functional Requirements

  1. Agreement Structure
    • Provider team and consumer team
    • Services covered
    • SLIs with measurement methodology
    • SLO targets with time windows
  2. Measurement
    • Automated collection of SLI data
    • Rolling windows (7-day, 30-day)
    • Alert on approaching threshold
  3. Dashboard
    • Current compliance status
    • Historical trend
    • Error budget remaining
    • Breakdown by SLI
  4. Reporting
    • Weekly summary email
    • Monthly review document
    • Action items when SLO missed

3.3 Non-Functional Requirements

  • Metrics collection must be automated (no manual entry)
  • Dashboard must update at least every 5 minutes
  • Agreement documents must be version-controlled

3.4 Example Usage / Output

SLE Agreement Document:

# Service Level Expectation: Platform Team → Application Teams

## Overview
This SLE defines the expectations for services provided by the Platform
Team to all application teams at Acme Corp.

**Provider**: Platform Team
**Consumers**: All application teams
**Effective Date**: 2025-01-01
**Review Cadence**: Quarterly

---

## Covered Services

### 1. Kubernetes Cluster (Production)

#### SLI: Availability
- **Definition**: Percentage of minutes where the Kubernetes API server
  responds successfully to health checks
- **Measurement**: Synthetic probe every 30 seconds
- **Formula**: (Successful probes / Total probes) * 100

#### SLO: 99.9% Availability
- **Target**: 99.9% over 30-day rolling window
- **Error Budget**: 43 minutes/month
- **Violation Action**: Platform team halts feature work, focuses on reliability

---

### 2. CI/CD Pipelines

#### SLI: Build Queue Time
- **Definition**: Time from job submission to job start
- **Measurement**: Jenkins metrics
- **Formula**: P95 of queue wait time

#### SLO: P95 < 5 minutes
- **Target**: 95th percentile queue time under 5 minutes
- **Violation Action**: Platform team adds build capacity

---

### 3. Ticket Response

#### SLI: Time to First Response
- **Definition**: Time from ticket creation to first human response
- **Measurement**: Jira workflow timestamps
- **Formula**: P90 of response times during business hours

#### SLO: P90 < 4 business hours
- **Target**: 90% of tickets get first response within 4 business hours
- **Violation Action**: Review staffing and priorities

---

## Escalation Path

If SLO is violated for 2 consecutive weeks:
1. Platform Team Lead notified
2. Joint review meeting with affected teams
3. Action plan created within 5 business days

---

## Review and Amendments

- SLEs reviewed quarterly
- Changes require agreement from both parties
- Historical data preserved for trend analysis

Dashboard Output (Grafana):

┌────────────────────────────────────────────────────────────────────┐
│                   PLATFORM SLE DASHBOARD                           │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  KUBERNETES AVAILABILITY (30-day)                                  │
│  ┌────────────────────────────────────────────┐                   │
│  │ Current: 99.94%  │  Target: 99.9%  │ 🟢   │                   │
│  ├────────────────────────────────────────────┤                   │
│  │ Error Budget: 43 min                       │                   │
│  │ Used:         24 min (56%)                 │                   │
│  │ Remaining:    19 min                       │                   │
│  └────────────────────────────────────────────┘                   │
│                                                                    │
│  CI/CD QUEUE TIME (P95)                                           │
│  ┌────────────────────────────────────────────┐                   │
│  │ Current: 3.2 min │  Target: 5 min  │ 🟢   │                   │
│  └────────────────────────────────────────────┘                   │
│                                                                    │
│  TICKET RESPONSE (P90)                                            │
│  ┌────────────────────────────────────────────┐                   │
│  │ Current: 4.8 hrs │  Target: 4 hrs  │ 🔴   │                   │
│  │ VIOLATION - Action required                │                   │
│  └────────────────────────────────────────────┘                   │
│                                                                    │
│  7-DAY TREND                                                      │
│  Mon   Tue   Wed   Thu   Fri   Sat   Sun                         │
│   🟢    🟢    🟢    🟡    🔴    🟢    🟢                          │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

3.5 Real World Outcome

After implementing SLEs:

  • Teams have clear expectations (no more “we always argue about response times”)
  • Platform team can prove their value (dashboard shows 99.9% uptime)
  • Violations are addressed systematically (not in angry Slack threads)
  • Error budgets allow velocity while maintaining reliability

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                        SLE SYSTEM                               │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  AGREEMENT    │     │  MEASUREMENT  │     │  VISUALIZATION│
│  DOCUMENTS    │     │               │     │               │
│               │     │  Prometheus   │     │  Grafana      │
│  Markdown in  │     │  Jira API     │     │  Dashboard    │
│  Git repo     │     │  Custom       │     │               │
└───────────────┘     └───────────────┘     └───────────────┘
                              │
                              ▼
                    ┌───────────────────┐
                    │  ALERTING         │
                    │                   │
                    │  - Slack          │
                    │  - PagerDuty      │
                    │  - Email          │
                    └───────────────────┘

4.2 Key Components

  1. Agreement Registry: Markdown files defining SLEs
  2. Metrics Collection: Prometheus for technical, Jira for process
  3. SLO Calculator: Computes compliance from raw metrics
  4. Dashboard: Grafana panels showing status
  5. Alerter: Notifies on budget burn rate

4.3 Data Structures

# sle-definition.yaml
sle:
  id: platform-to-apps-k8s
  provider: team-platform
  consumers:
    - team-checkout
    - team-payments
    - all  # Or list specific teams

  services:
    - name: kubernetes-production
      slis:
        - name: availability
          type: ratio
          good_events: sum(rate(probe_success[5m]))
          total_events: sum(rate(probe_total[5m]))
          target: 0.999
          window: 30d

        - name: latency
          type: percentile
          metric: kubernetes_api_latency_seconds
          percentile: 95
          target: 0.5  # 500ms
          window: 7d

  escalation:
    - threshold: 50  # % of error budget consumed
      action: alert_team_lead
    - threshold: 80
      action: alert_director
    - threshold: 100
      action: incident_declared

# Prometheus queries for SLIs

# Availability SLI
(
  sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100

# Latency SLI (P95)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# Error Budget Remaining (fraction of budget left, for a 99.9% SLO)
# Budget consumed = observed error ratio / allowed error ratio
1 - (
  (
    sum(rate(http_requests_total{status=~"5.."}[30d]))
    /
    sum(rate(http_requests_total[30d]))
  )
  /
  (1 - 0.999)  # Allowed error ratio for the SLO target
)
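Before wiring definitions into dashboards, it helps to validate them. A minimal sketch, assuming the `sle-definition.yaml` above has been parsed (e.g. with PyYAML) into a plain dict; the field names follow that example, and the function name is illustrative:

```python
def validate_sle(sle: dict) -> list[str]:
    """Return a list of problems found in an SLE definition dict."""
    errors = []
    # Top-level fields every agreement needs
    for required in ("id", "provider", "consumers", "services"):
        if required not in sle:
            errors.append(f"missing required field: {required}")
    # Per-SLI sanity checks
    for service in sle.get("services", []):
        for sli in service.get("slis", []):
            name = sli.get("name", "<unnamed>")
            if sli.get("type") not in ("ratio", "percentile"):
                errors.append(f"{name}: unknown type {sli.get('type')!r}")
            target = sli.get("target")
            if sli.get("type") == "ratio" and not (0 < (target or 0) <= 1):
                errors.append(f"{name}: ratio target must be in (0, 1]")
    return errors

sle = {
    "id": "platform-to-apps-k8s",
    "provider": "team-platform",
    "consumers": ["all"],
    "services": [{"name": "kubernetes-production", "slis": [
        {"name": "availability", "type": "ratio", "target": 0.999},
        {"name": "latency", "type": "percentile", "target": 0.5},
    ]}],
}
print(validate_sle(sle))  # → [] — definition is well-formed
```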

4.4 Algorithm Overview

# Runnable sketch; in practice window_data comes from your metrics
# backend, already filtered to the SLO window.
from dataclasses import dataclass

@dataclass
class SLO:
    id: str
    type: str                # "ratio" or "percentile"
    target: float            # e.g. 0.999, or 0.5 (seconds) for latency
    percentile: float = 95.0
    window_minutes: int = 30 * 24 * 60

@dataclass
class Compliance:
    slo_id: str
    actual: float
    target: float
    compliant: bool
    error_budget_remaining: float
    error_budget_percent: float

def percentile_of(values, p):
    # Nearest-rank percentile over raw samples
    ordered = sorted(values)
    k = min(len(ordered) - 1, max(0, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

def calculate_slo_compliance(window_data, slo, bad_minutes):
    # bad_minutes: minutes in the window where the SLI was out of spec
    if slo.type == "ratio":
        # Availability-style SLI
        good = sum(d["good_events"] for d in window_data)
        total = sum(d["total_events"] for d in window_data)
        actual = good / total if total else 1.0
        compliant = actual >= slo.target

    elif slo.type == "percentile":
        # Latency-style SLI
        actual = percentile_of([d["value"] for d in window_data], slo.percentile)
        compliant = actual <= slo.target

    else:
        raise ValueError(f"unknown SLO type: {slo.type}")

    # Error budget (meaningful for ratio-style targets)
    total_budget = (1 - slo.target) * slo.window_minutes
    remaining_budget = total_budget - bad_minutes

    return Compliance(
        slo_id=slo.id,
        actual=actual,
        target=slo.target,
        compliant=compliant,
        error_budget_remaining=remaining_budget,
        error_budget_percent=(remaining_budget / total_budget) * 100,
    )

5. Implementation Guide

5.1 Development Environment Setup

# For local development with Prometheus/Grafana
docker-compose up -d prometheus grafana

# Or use existing monitoring stack
# Just need write access to Grafana

5.2 Project Structure

sle-agreements/
├── agreements/
│   ├── platform-to-apps.md
│   ├── data-to-analytics.md
│   └── template.md
├── definitions/
│   ├── platform-slis.yaml
│   └── data-slis.yaml
├── dashboards/
│   ├── platform-sle.json    # Grafana dashboard
│   └── data-sle.json
├── alerts/
│   ├── platform-alerts.yaml # Prometheus alerting rules
│   └── data-alerts.yaml
└── reports/
    ├── 2025-01-weekly.md
    └── 2025-01-monthly.md

5.3 The Core Question You’re Answering

“What is the ‘Contract’ between our teams, and how do we know if we’re breaking it?”

Most team frustrations come from mismatched expectations. An SLE makes the operating model explicit and measurable.

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. The Four Golden Signals
    • What are Latency, Traffic, Errors, Saturation?
    • Which are most important for your services?
    • Book Reference: “SRE Book” Ch. 6
  2. Error Budgets
    • How do you calculate remaining budget?
    • What happens when budget is exhausted?
    • Book Reference: “SRE Book” Ch. 3
  3. Percentiles vs. Averages
    • Why is P99 more meaningful than average latency?
    • Reference: Any observability guide

5.5 Questions to Guide Your Design

Before implementing, think through these:

User Focus

  • Who is your “user” in this context?
  • What does that user actually care about? (Uptime? Latency? Response time?)
  • How will you know if the SLE matters to them?

Measurement

  • Can you measure the SLI automatically?
  • What’s the data source? (Prometheus, Datadog, Jira?)
  • What’s the measurement frequency?

Consequences

  • What happens if the SLE is missed?
  • Who gets notified?
  • Does a manager get paged? Does the team change priorities?

5.6 Thinking Exercise

The “No-Phone” Week

Imagine your team is forbidden from using Slack or Zoom for one week. You can only communicate via documented SLEs and tickets.

Questions:

  1. Does your SLE define what happens when a ticket is “Urgent”?
  2. Does it define where to find documentation?
  3. If the other team “fails” their SLE, do you have an escalation path without a Zoom call?

Write down:

  • 3 situations where you’d normally Slack someone
  • For each, what SLE commitment would replace that Slack message?

5.7 Hints in Layers

Hint 1: Start with the Pain. Ask: “What is the one thing we always argue about with Team X?” That’s your first SLI.

Hint 2: Define Availability Precisely. “Availability” is vague. Be specific:

  • “The API returns 200 for /health endpoint”
  • “Response time is under 500ms”
  • “No 5xx errors”

Hint 3: Set Aspirational but Achievable Targets. Don’t aim for 100%. Aim for what’s sufficient. If a developer can wait 4 hours for a PR review, 95% in < 4 hours is fine.

Hint 4: Make the Dashboard Visible. Put the SLE dashboard on a TV screen. Make it impossible to ignore.

5.8 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is the difference between an SLO and an SLA?”
    • SLO = internal target. SLA = external contract with consequences.
  2. “How do you handle a situation where a team is consistently missing their SLOs?”
    • Investigate root cause, adjust target or invest in reliability, use error budget policy
  3. “Explain the concept of an Error Budget.”
    • The acceptable amount of unreliability. Allows velocity while maintaining reliability.
  4. “What are the Four Golden Signals?”
    • Latency, Traffic, Errors, Saturation
  5. “How do you choose the right SLI for a non-technical team (like HR)?”
    • Focus on what the customer cares about (response time, accuracy, availability)

5.9 Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| SLOs | “SRE Book” (Google) | Ch. 4: Service Level Objectives |
| Monitoring | “SRE Book” (Google) | Ch. 6: Monitoring Distributed Systems |
| Error Budgets | “SRE Workbook” | Ch. 2: Implementing SLOs |

5.10 Implementation Phases

Phase 1: Agreement Design (3-4 hours)

  1. Identify provider and consumer teams
  2. List services covered
  3. Define 2-3 SLIs per service
  4. Set initial SLO targets

Phase 2: Measurement Setup (4-5 hours)

  1. Write Prometheus queries for each SLI
  2. Verify data is being collected
  3. Test calculations manually

Phase 3: Dashboard Build (3-4 hours)

  1. Create Grafana dashboard
  2. Add panels for each SLI
  3. Add error budget visualization
  4. Add 7-day trend

Phase 4: Alerting (2-3 hours)

  1. Set up alerts for budget burn rate
  2. Configure Slack/email notifications
  3. Test alerting
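A burn-rate alert in Prometheus rule syntax can look like the sketch below. The metric names, alert name, and the 14.4× threshold (a burn rate that exhausts a 30-day budget in roughly two days, following the multi-window pattern from the SRE Workbook) are assumptions to adapt to your own SLIs:

```yaml
groups:
  - name: platform-sle-burn-rate
    rules:
      - alert: HighErrorBudgetBurn
        # Error ratio over the last hour, compared against 14.4x the
        # 0.1% allowed error ratio of a 99.9% SLO.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning ~14x faster than sustainable"
```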

5.11 Key Implementation Decisions

| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| SLI source | Prometheus | Datadog | Use what you have |
| Dashboard | Grafana | Custom | Grafana (faster) |
| Alerting | Prometheus Alertmanager | PagerDuty | Alertmanager for internal |
| Window | 7-day rolling | Calendar month | 30-day rolling |

6. Testing Strategy

SLI Validation

# Verify query returns expected data type
# Should be between 0 and 1 for ratios
(sum(rate(http_requests_total{status=~"2.."}[5m])) /
 sum(rate(http_requests_total[5m])))

Dashboard Testing

  • Verify all panels load without errors
  • Verify colors change at correct thresholds
  • Test with synthetic data at edge cases

Alert Testing

  • Manually trigger alert conditions
  • Verify notifications are received
  • Verify alert routing is correct
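Prometheus ships a unit-testing tool for rules (`promtool test rules`) that lets you verify alerts fire on synthetic series without touching production. A sketch, assuming an alert named `HighErrorBudgetBurn` with label `severity: page` is defined in `alerts/platform-alerts.yaml`:

```yaml
# alerts/platform-alerts_test.yaml
rule_files:
  - platform-alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Every request fails: burn rate far above any sane threshold
      - series: 'http_requests_total{status="500"}'
        values: '0+60x120'   # counter rising 1/sec for 2 hours
    alert_rule_test:
      - eval_time: 1h
        alertname: HighErrorBudgetBurn
        exp_alerts:
          - exp_labels:
              severity: page
```

Run it with `promtool test rules alerts/platform-alerts_test.yaml`.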

7. Common Pitfalls & Debugging

| Problem | Symptom | Root Cause | Fix |
|---|---|---|---|
| SLI always at 100% | Dashboard shows unrealistic perfection | Measuring the wrong thing | Verify the query captures failures |
| No data in dashboard | Empty panels | Query syntax error or no data | Check Prometheus targets |
| Alert fatigue | Too many notifications | Thresholds too tight | Adjust to realistic targets |
| SLE ignored | Teams don’t care | No consequences | Add to team reviews, make visible |

8. Extensions & Challenges

Extension 1: Composite SLOs

Combine multiple SLIs into a single “service health” score.
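One simple composite is to score each SLI by the fraction of its error budget remaining and take the minimum, on the view that a service is only as healthy as its worst SLI. A hedged Python sketch (the weakest-link model and the function names are illustrative; targets must be below 1.0):

```python
def sli_score(actual: float, target: float) -> float:
    """Fraction of error budget remaining for a ratio SLI, clamped to [0, 1]."""
    budget = 1 - target              # e.g. 0.001 for a 99.9% target
    consumed = max(0.0, 1 - actual)  # observed failure fraction
    return max(0.0, min(1.0, 1 - consumed / budget))

def composite_health(slis: dict[str, tuple[float, float]]) -> float:
    """Overall health = worst individual SLI score (weakest-link model)."""
    return min(sli_score(actual, target) for actual, target in slis.values())

health = composite_health({
    "availability": (0.9994, 0.999),  # ~40% of budget left
    "queue_time":   (0.99, 0.95),     # ~80% of budget left
})
print(round(health, 2))  # → 0.4
```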

Extension 2: SLO Burndown

Show projected SLO compliance based on current burn rate.
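Projecting exhaustion is straight arithmetic on the current burn rate. A minimal sketch (function name is illustrative):

```python
def projected_exhaustion_days(budget_minutes: float,
                              used_minutes: float,
                              elapsed_days: float):
    """Days from now until the error budget runs out at the current
    average burn rate; None if nothing has been burned yet."""
    if used_minutes <= 0:
        return None                  # no burn observed, nothing to project
    remaining = budget_minutes - used_minutes
    if remaining <= 0:
        return 0.0                   # budget already exhausted
    burn_per_day = used_minutes / elapsed_days
    return remaining / burn_per_day

# 43.2-minute budget, 28 minutes burned in the first 18 days:
days = projected_exhaustion_days(43.2, 28, 18)
print(round(days, 1))  # → 9.8 days until exhaustion at this rate
```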

Extension 3: Automated Reporting

Generate weekly email with SLE status for all agreements.

Extension 4: SLE Library

Create reusable SLE templates for common patterns.


9. Real-World Connections

How Big Tech Does This:

  • Google: Publishes external SLOs for Cloud products
  • AWS: Service Health Dashboard with SLA credits
  • Datadog: SLO tracking feature built into platform

Tools:


10. Resources

SRE Resources

Prometheus


11. Self-Assessment Checklist

Before considering this project complete, verify:

  • I can explain SLI, SLO, SLA, and Error Budget
  • Agreement document covers at least 3 SLIs
  • Each SLI has a measurable, automated query
  • Dashboard shows current compliance and trend
  • Error budget visualization is present
  • At least one alert is configured
  • Both provider and consumer teams have reviewed

12. Submission / Completion Criteria

This project is complete when you have:

  1. SLE Agreement markdown document
  2. SLI definitions in YAML format
  3. Prometheus queries for each SLI
  4. Grafana dashboard showing compliance
  5. Alerting rules for budget burn
  6. Sign-off from both provider and consumer teams

Previous Project: P06: Cognitive Load Survey & Heatmap Next Project: P08: Dependency Spaghetti Visualizer