Project 7: Service Level Expectation (SLE) Agreement
Create formalized “Service Level Expectations” between teams with live dashboards tracking compliance.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1 Week (15-20 hours) |
| Primary Language | Markdown / Prometheus |
| Alternative Languages | Python, Terraform |
| Prerequisites | Basic monitoring concepts, metrics |
| Key Topics | SLOs, SLIs, SLAs, Error Budgets |
1. Learning Objectives
By completing this project, you will:
- Design internal service contracts between teams
- Define meaningful SLIs (Service Level Indicators)
- Set realistic SLO targets (Service Level Objectives)
- Implement live dashboards tracking compliance
- Use error budgets to balance reliability and velocity
2. Theoretical Foundation
2.1 Core Concepts
The SLI/SLO/SLA Hierarchy
┌─────────────────────────────────────────────────────────────────┐
│ SLA │
│ (Service Level Agreement - Contract) │
│ "If we miss this, there are financial/legal consequences" │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ SLO │
│ (Service Level Objective - Target) │
│ "We aim to achieve this level of reliability" │
│ Example: 99.9% of requests succeed within 200ms │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ SLI │
│ (Service Level Indicator - Measurement) │
│ "How we measure the thing we care about" │
│ Example: (Successful requests / Total requests) * 100 │
└─────────────────────────────────────────────────────────────────┘
Service Level Expectations (SLE) - Internal Focus
SLEs are like SLOs but for internal team-to-team relationships:
EXTERNAL SLA INTERNAL SLE
(Company ↔ Customer) (Team ↔ Team)
┌─────────────────────┐ ┌─────────────────────┐
│ 99.95% uptime │ │ Ticket response: │
│ or refund │ │ < 4 hours │
│ │ │ │
│ Legal contract │ │ Operating agreement │
│ Financial penalty │ │ Trust & reputation │
└─────────────────────┘ └─────────────────────┘
Error Budgets
The concept that allows teams to balance reliability with velocity:
100% reliability = 0% error budget = no room for risk
99.9% SLO = 0.1% error budget = 43 minutes of downtime/month allowed
┌──────────────────────────────────────────────────────────────────┐
│ 30-DAY ERROR BUDGET │
│ │
│ Budget: 43 minutes │
│ Used: ████████████░░░░░░░░░░░░░░░░░░░░ 28 min (65%) │
│ Status: 🟢 Safe to ship new features │
│ │
│ If budget exhausted: │
│ Status: 🔴 Focus on reliability, no new features │
└──────────────────────────────────────────────────────────────────┘
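To sanity-check these numbers yourself, the arithmetic fits in a few lines of Python (a minimal sketch; the values mirror the box above):

slo_target = 0.999
window_minutes = 30 * 24 * 60                   # 43,200 minutes in 30 days
budget = (1 - slo_target) * window_minutes      # allowed "bad" minutes
print(f"Error budget: {budget:.1f} min/month")  # 43.2 (rounded to 43 above)
used = 28
print(f"Used: {used / budget:.0%}")             # 65%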
The Four Golden Signals
Google SRE’s recommended metrics for any service:
| Signal | What It Measures | Example SLI |
|---|---|---|
| Latency | Time to serve a request | 95th percentile response time |
| Traffic | Demand on the system | Requests per second |
| Errors | Rate of failed requests | % of 5xx responses |
| Saturation | How “full” the service is | CPU utilization, queue depth |
2.2 Why This Matters
Most team frustrations come from mismatched expectations:
- Team A expects an answer in 10 minutes
- Team B thinks 2 days is reasonable
- Result: Friction, resentment, escalations
SLEs make expectations explicit:
- “We will respond to tickets within 4 business hours”
- “We will review PRs within 1 business day”
- “Our API will be available 99.9% of the time”
2.3 Historical Context
- ITIL SLAs (1980s): Formalized service agreements in IT
- Google SRE Book (2016): Brought the SLO/SLI framework to the mainstream
- Error Budgets (2016): Linked reliability targets to development velocity
2.4 Common Misconceptions
| Misconception | Reality |
|---|---|
| “100% is the right target” | 100% is impossible and paralyzes development |
| “More 9s is always better” | Each 9 is 10x harder and more expensive |
| “SLOs are set once” | SLOs evolve based on customer needs and capability |
| “Missing SLO = failure” | Error budgets exist for a reason |
3. Project Specification
3.1 What You Will Build
- SLE Agreement Template: Document format for team-to-team contracts
- SLI Definitions: Concrete measurements for each commitment
- Monitoring Dashboard: Live view of SLE compliance
- Error Budget Tracker: Shows remaining budget and trend
3.2 Functional Requirements
- Agreement Structure
- Provider team and consumer team
- Services covered
- SLIs with measurement methodology
- SLO targets with time windows
- Measurement
- Automated collection of SLI data
- Rolling windows (7-day, 30-day)
- Alert on approaching threshold
- Dashboard
- Current compliance status
- Historical trend
- Error budget remaining
- Breakdown by SLI
- Reporting
- Weekly summary email
- Monthly review document
- Action items when SLO missed
3.3 Non-Functional Requirements
- Metrics collection must be automated (no manual entry)
- Dashboard must update at least every 5 minutes
- Agreement documents must be version-controlled
3.4 Example Usage / Output
SLE Agreement Document:
# Service Level Expectation: Platform Team → Application Teams
## Overview
This SLE defines the expectations for services provided by the Platform
Team to all application teams at Acme Corp.
**Provider**: Platform Team
**Consumers**: All application teams
**Effective Date**: 2025-01-01
**Review Cadence**: Quarterly
---
## Covered Services
### 1. Kubernetes Cluster (Production)
#### SLI: Availability
- **Definition**: Percentage of minutes where the Kubernetes API server
responds successfully to health checks
- **Measurement**: Synthetic probe every 30 seconds
- **Formula**: (Successful probes / Total probes) * 100
#### SLO: 99.9% Availability
- **Target**: 99.9% over 30-day rolling window
- **Error Budget**: 43 minutes/month
- **Violation Action**: Platform team halts feature work, focuses on reliability
---
### 2. CI/CD Pipelines
#### SLI: Build Queue Time
- **Definition**: Time from job submission to job start
- **Measurement**: Jenkins metrics
- **Formula**: P95 of queue wait time
#### SLO: P95 < 5 minutes
- **Target**: 95th percentile queue time under 5 minutes
- **Violation Action**: Platform team adds build capacity
---
### 3. Ticket Response
#### SLI: Time to First Response
- **Definition**: Time from ticket creation to first human response
- **Measurement**: Jira workflow timestamps
- **Formula**: P90 of response times during business hours
#### SLO: P90 < 4 business hours
- **Target**: 90% of tickets get first response within 4 business hours
- **Violation Action**: Review staffing and priorities
---
## Escalation Path
If SLO is violated for 2 consecutive weeks:
1. Platform Team Lead notified
2. Joint review meeting with affected teams
3. Action plan created within 5 business days
---
## Review and Amendments
- SLEs reviewed quarterly
- Changes require agreement from both parties
- Historical data preserved for trend analysis
Dashboard Output (Grafana):
┌────────────────────────────────────────────────────────────────────┐
│ PLATFORM SLE DASHBOARD │
├────────────────────────────────────────────────────────────────────┤
│ │
│ KUBERNETES AVAILABILITY (30-day) │
│ ┌────────────────────────────────────────────┐ │
│ │ Current: 99.94% │ Target: 99.9% │ 🟢 │ │
│ ├────────────────────────────────────────────┤ │
│ │ Error Budget: 43 min │ │
│ │ Used: 24 min (56%) │ │
│ │ Remaining: 19 min │ │
│ └────────────────────────────────────────────┘ │
│ │
│ CI/CD QUEUE TIME (P95) │
│ ┌────────────────────────────────────────────┐ │
│ │ Current: 3.2 min │ Target: 5 min │ 🟢 │ │
│ └────────────────────────────────────────────┘ │
│ │
│ TICKET RESPONSE (P90) │
│ ┌────────────────────────────────────────────┐ │
│ │ Current: 4.8 hrs │ Target: 4 hrs │ 🔴 │ │
│ │ VIOLATION - Action required │ │
│ └────────────────────────────────────────────┘ │
│ │
│ 7-DAY TREND │
│ Mon Tue Wed Thu Fri Sat Sun │
│ 🟢 🟢 🟢 🟡 🔴 🟢 🟢 │
│ │
└────────────────────────────────────────────────────────────────────┘
3.5 Real World Outcome
After implementing SLEs:
- Teams have clear expectations (no more “we always argue about response times”)
- Platform team can prove their value (dashboard shows 99.9% uptime)
- Violations are addressed systematically (not in angry Slack threads)
- Error budgets allow velocity while maintaining reliability
4. Solution Architecture
4.1 High-Level Design
┌─────────────────────────────────────────────────────────────────┐
│ SLE SYSTEM │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ AGREEMENT │ │ MEASUREMENT │ │ VISUALIZATION│
│ DOCUMENTS │ │ │ │ │
│ │ │ Prometheus │ │ Grafana │
│ Markdown in │ │ Jira API │ │ Dashboard │
│ Git repo │ │ Custom │ │ │
└───────────────┘ └───────────────┘ └───────────────┘
│
▼
┌───────────────────┐
│ ALERTING │
│ │
│ - Slack │
│ - PagerDuty │
│ - Email │
└───────────────────┘
4.2 Key Components
- Agreement Registry: Markdown files defining SLEs
- Metrics Collection: Prometheus for technical, Jira for process
- SLO Calculator: Computes compliance from raw metrics
- Dashboard: Grafana panels showing status
- Alerter: Notifies on budget burn rate
4.3 Data Structures
# sle-definition.yaml
sle:
  id: platform-to-apps-k8s
  provider: team-platform
  consumers:              # or a single "all" entry to cover every team
    - team-checkout
    - team-payments
  services:
    - name: kubernetes-production
      slis:
        - name: availability
          type: ratio
          good_events: sum(rate(probe_success[5m]))
          total_events: sum(rate(probe_total[5m]))
          target: 0.999
          window: 30d
        - name: latency
          type: percentile
          metric: kubernetes_api_latency_seconds
          percentile: 95
          target: 0.5     # 500ms
          window: 7d
  escalation:
    - threshold: 50       # % of error budget consumed
      action: alert_team_lead
    - threshold: 80
      action: alert_director
    - threshold: 100
      action: incident_declared
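Since these definitions are version-controlled, it is worth validating them in CI before they merge. A minimal sketch, assuming PyYAML is installed and the schema above (the required-key set is an illustration, not a standard):

# validate_sle.py - reject malformed SLE definitions before they merge
import yaml

REQUIRED_SLI_KEYS = {"name", "type", "target", "window"}

with open("sle-definition.yaml") as f:
    sle = yaml.safe_load(f)["sle"]

for service in sle["services"]:
    for sli in service["slis"]:
        missing = REQUIRED_SLI_KEYS - sli.keys()
        if missing:
            raise ValueError(f"{service['name']}/{sli.get('name')}: missing {missing}")
print(f"{sle['id']}: {len(sle['services'])} service(s) OK")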
# Prometheus queries for SLIs
# Availability SLI
(
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) * 100
# Latency SLI (P95)
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# Error Budget Remaining (fraction of the 30-day request budget left)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  /
  (sum(increase(http_requests_total[30d])) * (1 - 0.999))
)
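Requirement 3.3 says collection must be automated. One hedged sketch of doing that through Prometheus's HTTP query API (assumes the `requests` library and a server at localhost:9090; adjust the URL and metric names to your stack):

# Pull the availability SLI from the Prometheus HTTP API.
import requests

QUERY = """(
  sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100"""

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": QUERY},
    timeout=10,
)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    availability = float(result[0]["value"][1])  # value is [timestamp, value]
    print(f"Availability SLI: {availability:.3f}%")
else:
    print("Query returned no series - check metric names and scrape targets")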
4.4 Algorithm Overview
from dataclasses import dataclass

@dataclass
class SLO:
    id: str
    type: str               # "ratio" or "percentile"
    target: float           # e.g. 0.999 (ratio) or 0.5 seconds (percentile)
    window_minutes: int     # e.g. 30 * 24 * 60
    percentile: float = 95.0

@dataclass
class Compliance:
    slo_id: str
    actual: float
    target: float
    compliant: bool
    error_budget_remaining: float   # in minutes
    error_budget_percent: float

def calculate_slo_compliance(window_data: dict, slo: SLO) -> Compliance:
    """window_data: samples already restricted to the SLO window."""
    if slo.type == "ratio":
        # Availability-style SLI: fraction of good events
        good = sum(window_data["good_events"])
        total = sum(window_data["total_events"])
        actual = good / total
        compliant = actual >= slo.target
        budget_fraction = 1 - slo.target              # e.g. 0.001 for 99.9%
        bad_fraction = 1 - actual
    elif slo.type == "percentile":
        # Latency-style SLI: Nth percentile must stay under the target
        values = sorted(window_data["values"])
        rank = max(0, round(slo.percentile / 100 * len(values)) - 1)
        actual = values[rank]
        compliant = actual <= slo.target
        budget_fraction = 1 - slo.percentile / 100    # e.g. 0.05 for P95
        bad_fraction = sum(v > slo.target for v in values) / len(values)
    else:
        raise ValueError(f"unknown SLO type: {slo.type}")

    # Error budget: "bad" minutes the window allows vs. minutes already used
    total_budget = budget_fraction * slo.window_minutes
    remaining = (budget_fraction - bad_fraction) * slo.window_minutes
    return Compliance(slo.id, actual, slo.target, compliant,
                      remaining, remaining / total_budget * 100)
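For example, feeding the calculator synthetic counts (values are illustrative):

slo = SLO(id="k8s-availability", type="ratio", target=0.999,
          window_minutes=30 * 24 * 60)
data = {"good_events": [99_950], "total_events": [100_000]}   # 99.95% good
result = calculate_slo_compliance(data, slo)
print(result.compliant)                     # True (99.95% >= 99.9%)
print(round(result.error_budget_percent))   # 50   (half the budget left)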
5. Implementation Guide
5.1 Development Environment Setup
# For local development with Prometheus/Grafana
docker-compose up -d prometheus grafana
# Or use existing monitoring stack
# Just need write access to Grafana
5.2 Project Structure
sle-agreements/
├── agreements/
│ ├── platform-to-apps.md
│ ├── data-to-analytics.md
│ └── template.md
├── definitions/
│ ├── platform-slis.yaml
│ └── data-slis.yaml
├── dashboards/
│ ├── platform-sle.json # Grafana dashboard
│ └── data-sle.json
├── alerts/
│ ├── platform-alerts.yaml # Prometheus alerting rules
│ └── data-alerts.yaml
└── reports/
├── 2025-01-weekly.md
└── 2025-01-monthly.md
5.3 The Core Question You’re Answering
“What is the ‘Contract’ between our teams, and how do we know if we’re breaking it?”
Most team frustrations come from mismatched expectations. An SLE makes the operating model explicit and measurable.
5.4 Concepts You Must Understand First
Stop and research these before coding:
- The Four Golden Signals
- What are Latency, Traffic, Errors, Saturation?
- Which are most important for your services?
- Book Reference: “SRE Book” Ch. 6
- Error Budgets
- How do you calculate remaining budget?
- What happens when budget is exhausted?
- Book Reference: “SRE Book” Ch. 3
- Percentiles vs. Averages
- Why is P99 more meaningful than average latency?
- Reference: Any observability guide
5.5 Questions to Guide Your Design
Before implementing, think through these:
User Focus
- Who is your “user” in this context?
- What does that user actually care about? (Uptime? Latency? Response time?)
- How will you know if the SLE matters to them?
Measurement
- Can you measure the SLI automatically?
- What’s the data source? (Prometheus, Datadog, Jira?)
- What’s the measurement frequency?
Consequences
- What happens if the SLE is missed?
- Who gets notified?
- Does a manager get paged? Does the team change priorities?
5.6 Thinking Exercise
The “No-Phone” Week
Imagine your team is forbidden from using Slack or Zoom for one week. You can only communicate via documented SLEs and tickets.
Questions:
- Does your SLE define what happens when a ticket is “Urgent”?
- Does it define where to find documentation?
- If the other team “fails” their SLE, do you have an escalation path without a Zoom call?
Write down:
- 3 situations where you’d normally Slack someone
- For each, what SLE commitment would replace that Slack message?
5.7 Hints in Layers
Hint 1: Start with the Pain. Ask: “What is the one thing we always argue about with Team X?” That’s your first SLI.
Hint 2: Define Availability Precisely. “Availability” is vague. Be specific:
- “The API returns 200 for /health endpoint”
- “Response time is under 500ms”
- “No 5xx errors”
Hint 3: Set Aspirational but Achievable Targets. Don’t aim for 100%; aim for what’s sufficient. If a developer can wait 4 hours for a PR review, 95% in < 4 hours is fine.
Hint 4: Make the Dashboard Visible. Put the SLE dashboard on a TV screen. Make it impossible to ignore.
5.8 The Interview Questions They’ll Ask
Prepare to answer these:
- “What is the difference between an SLO and an SLA?”
- SLO = internal target. SLA = external contract with consequences.
- “How do you handle a situation where a team is consistently missing their SLOs?”
- Investigate root cause, adjust target or invest in reliability, use error budget policy
- “Explain the concept of an Error Budget.”
- The acceptable amount of unreliability. Allows velocity while maintaining reliability.
- “What are the Four Golden Signals?”
- Latency, Traffic, Errors, Saturation
- “How do you choose the right SLI for a non-technical team (like HR)?”
- Focus on what the customer cares about (response time, accuracy, availability)
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| SLOs | “SRE Book” (Google) | Ch. 4: Service Level Objectives |
| Monitoring | “SRE Book” (Google) | Ch. 6: Monitoring Distributed Systems |
| Error Budgets | “SRE Workbook” | Ch. 2: Implementing SLOs |
5.10 Implementation Phases
Phase 1: Agreement Design (3-4 hours)
- Identify provider and consumer teams
- List services covered
- Define 2-3 SLIs per service
- Set initial SLO targets
Phase 2: Measurement Setup (4-5 hours)
- Write Prometheus queries for each SLI
- Verify data is being collected
- Test calculations manually
Phase 3: Dashboard Build (3-4 hours)
- Create Grafana dashboard
- Add panels for each SLI
- Add error budget visualization
- Add 7-day trend
Phase 4: Alerting (2-3 hours)
- Set up alerts for budget burn rate (see the burn-rate sketch below)
- Configure Slack/email notifications
- Test alerting
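A hedged sketch of the burn-rate logic behind those alerts. The 14.4x fast-burn threshold follows the SRE Workbook's example policy (2% of a 30-day budget spent in one hour); tune the thresholds to your own SLO:

def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """1.0 = budget lasts exactly the SLO window; higher = burning faster."""
    return bad_fraction / (1 - slo_target)

observed = 0.0072   # e.g. 0.72% of requests failed in the last hour
rate = burn_rate(observed, slo_target=0.999)
if rate >= 14.4:    # fast burn: page immediately
    print(f"PAGE: {rate:.1f}x burn - budget gone in ~{30 * 24 / rate:.0f}h")
elif rate >= 6:     # slow burn: open a ticket
    print(f"TICKET: {rate:.1f}x burn")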
5.11 Key Implementation Decisions
| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| SLI source | Prometheus | Datadog | Use what you have |
| Dashboard | Grafana | Custom | Grafana (faster) |
| Alerting | Prometheus Alertmanager | PagerDuty | Alertmanager for internal |
| Window | Rolling (7/30-day) | Calendar month | 30-day rolling |
6. Testing Strategy
SLI Validation
# Verify query returns expected data type
# Should be between 0 and 1 for ratios
(sum(rate(http_requests_total{status=~"2.."}[5m])) /
sum(rate(http_requests_total[5m])))
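The calculator itself deserves the same treatment. A pytest sketch over the §4.4 function at edge cases (assumes it is importable from a hypothetical local module named `sle`):

# test_sle.py - edge cases for the SLO calculator (run with pytest)
from sle import SLO, calculate_slo_compliance

WINDOW = 30 * 24 * 60

def test_perfect_month_keeps_full_budget():
    slo = SLO(id="t", type="ratio", target=0.999, window_minutes=WINDOW)
    out = calculate_slo_compliance({"good_events": [1000], "total_events": [1000]}, slo)
    assert out.compliant and out.error_budget_percent == 100

def test_exactly_on_target_is_compliant():
    slo = SLO(id="t", type="ratio", target=0.999, window_minutes=WINDOW)
    out = calculate_slo_compliance({"good_events": [999], "total_events": [1000]}, slo)
    assert out.compliant and round(out.error_budget_percent) == 0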
Dashboard Testing
- Verify all panels load without errors
- Verify colors change at correct thresholds
- Test with synthetic data at edge cases
Alert Testing
- Manually trigger alert conditions
- Verify notifications are received
- Verify alert routing is correct
7. Common Pitfalls & Debugging
| Problem | Symptom | Root Cause | Fix |
|---|---|---|---|
| SLI always at 100% | Dashboard shows unrealistic perfection | Measuring wrong thing | Verify query captures failures |
| No data in dashboard | Empty panels | Query syntax error or no data | Check Prometheus targets |
| Alert fatigue | Too many notifications | Thresholds too tight | Adjust to realistic targets |
| SLE ignored | Teams don’t care | No consequences | Add to team reviews, make visible |
8. Extensions & Challenges
Extension 1: Composite SLOs
Combine multiple SLIs into a single “service health” score.
Extension 2: SLO Burndown
Show projected SLO compliance based on current burn rate.
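A minimal projection sketch (numbers illustrative): extrapolate the current daily burn to the end of the window.

budget_min, used_min = 43.2, 28.0
days_elapsed, window_days = 20, 30
projected = used_min + (used_min / days_elapsed) * (window_days - days_elapsed)
print(f"Projected: {projected:.0f}/{budget_min:.0f} min "
      f"({'OVER budget' if projected > budget_min else 'within budget'})")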
Extension 3: Automated Reporting
Generate weekly email with SLE status for all agreements.
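One hedged starting point: render the summary as Markdown from compliance results, then commit it to reports/ or pipe it into your mailer.

results = [   # (sli, actual, target, compliant) - e.g. from calculate_slo_compliance
    ("kubernetes availability %", 99.94, 99.9, True),
    ("ticket response P90 (hrs)", 4.8, 4.0, False),
]
lines = ["# Weekly SLE Report", "", "| SLI | Actual | Target | Status |", "|---|---|---|---|"]
for name, actual, target, ok in results:
    lines.append(f"| {name} | {actual} | {target} | {'🟢' if ok else '🔴 violation'} |")
print("\n".join(lines))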
Extension 4: SLE Library
Create reusable SLE templates for common patterns.
9. Real-World Connections
How Big Tech Does This:
- Google: Publishes external SLOs for Cloud products
- AWS: Service Health Dashboard with SLA credits
- Datadog: SLO tracking feature built into platform
Tools:
- Prometheus + Grafana
- Nobl9 - SLO platform
- Datadog SLOs
10. Resources
SRE Resources
- Google SRE Book - Free online
- SRE Workbook
- The Art of SLOs
Prometheus
- Prometheus documentation (prometheus.io/docs)
Related Projects
- P02: Team Service Interface - Define team as service
- P04: Escalation Logic Tree - When SLOs are violated
11. Self-Assessment Checklist
Before considering this project complete, verify:
- I can explain SLI, SLO, SLA, and Error Budget
- Agreement document covers at least 3 SLIs
- Each SLI has a measurable, automated query
- Dashboard shows current compliance and trend
- Error budget visualization is present
- At least one alert is configured
- Both provider and consumer teams have reviewed
12. Submission / Completion Criteria
This project is complete when you have:
- SLE Agreement markdown document
- SLI definitions in YAML format
- Prometheus queries for each SLI
- Grafana dashboard showing compliance
- Alerting rules for budget burn
- Sign-off from both provider and consumer teams
Previous Project: P06: Cognitive Load Survey & Heatmap Next Project: P08: Dependency Spaghetti Visualizer