Project 7: Service Level Expectation (SLE) Agreement
Create formalized “Service Level Expectations” between teams with live dashboards tracking compliance.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1 Week (15-20 hours) |
| Primary Language | Markdown / Prometheus |
| Alternative Languages | Python, Terraform |
| Prerequisites | Basic monitoring concepts, metrics |
| Key Topics | SLOs, SLIs, SLAs, Error Budgets |
1. Learning Objectives
By completing this project, you will:
- Design internal service contracts between teams
- Define meaningful SLIs (Service Level Indicators)
- Set realistic SLO targets (Service Level Objectives)
- Implement live dashboards tracking compliance
- Use error budgets to balance reliability and velocity
2. Theoretical Foundation
2.1 Core Concepts
The SLI/SLO/SLA Hierarchy
┌─────────────────────────────────────────────────────────────────┐
│ SLA │
│ (Service Level Agreement - Contract) │
│ "If we miss this, there are financial/legal consequences" │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ SLO │
│ (Service Level Objective - Target) │
│ "We aim to achieve this level of reliability" │
│ Example: 99.9% of requests succeed within 200ms │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ SLI │
│ (Service Level Indicator - Measurement) │
│ "How we measure the thing we care about" │
│ Example: (Successful requests / Total requests) * 100 │
└─────────────────────────────────────────────────────────────────┘
Service Level Expectations (SLE) - Internal Focus
SLEs are like SLOs but for internal team-to-team relationships:
EXTERNAL SLA INTERNAL SLE
(Company ↔ Customer) (Team ↔ Team)
┌─────────────────────┐ ┌─────────────────────┐
│ 99.95% uptime │ │ Ticket response: │
│ or refund │ │ < 4 hours │
│ │ │ │
│ Legal contract │ │ Operating agreement │
│ Financial penalty │ │ Trust & reputation │
└─────────────────────┘ └─────────────────────┘
Error Budgets
The concept that allows teams to balance reliability with velocity:
100% reliability = 0% error budget = no room for risk
99.9% SLO = 0.1% error budget = 43 minutes of downtime/month allowed
┌──────────────────────────────────────────────────────────────────┐
│ 30-DAY ERROR BUDGET │
│ │
│ Budget: 43 minutes │
│ Used: ████████████░░░░░░░░░░░░░░░░░░░░ 28 min (65%) │
│ Status: 🟢 Safe to ship new features │
│ │
│ If budget exhausted: │
│ Status: 🔴 Focus on reliability, no new features │
└──────────────────────────────────────────────────────────────────┘
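To sanity-check these numbers yourself, the arithmetic fits in a few lines of Python (a minimal sketch; the values mirror the box above):

slo_target = 0.999
window_minutes = 30 * 24 * 60                   # 43,200 minutes in 30 days
budget = (1 - slo_target) * window_minutes      # allowed "bad" minutes
print(f"Error budget: {budget:.1f} min/month")  # 43.2 (rounded to 43 above)
used = 28
print(f"Used: {used / budget:.0%}")             # 65%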
The Four Golden Signals
Google SRE’s recommended metrics for any service:
| Signal | What It Measures | Example SLI |
|---|---|---|
| Latency | Time to serve a request | 95th percentile response time |
| Traffic | Demand on the system | Requests per second |
| Errors | Rate of failed requests | % of 5xx responses |
| Saturation | How “full” the service is | CPU utilization, queue depth |
2.2 Why This Matters
Most team frustrations come from mismatched expectations:
- Team A expects an answer in 10 minutes
- Team B thinks 2 days is reasonable
- Result: Friction, resentment, escalations
SLEs make expectations explicit:
- “We will respond to tickets within 4 business hours”
- “We will review PRs within 1 business day”
- “Our API will be available 99.9% of the time”
2.3 Historical Context
- ITIL SLAs (1980s): Formalized service agreements in IT
- Google SRE Book (2016): Brought the SLO/SLI framework to the mainstream
- Error Budgets (2016): Linked reliability targets to development velocity
2.4 Common Misconceptions
| Misconception | Reality |
|---|---|
| “100% is the right target” | 100% is impossible and paralyzes development |
| “More 9s is always better” | Each 9 is 10x harder and more expensive |
| “SLOs are set once” | SLOs evolve based on customer needs and capability |
| “Missing SLO = failure” | Error budgets exist for a reason |
3. Project Specification
3.1 What You Will Build
- SLE Agreement Template: Document format for team-to-team contracts
- SLI Definitions: Concrete measurements for each commitment
- Monitoring Dashboard: Live view of SLE compliance
- Error Budget Tracker: Shows remaining budget and trend
3.2 Functional Requirements
- Agreement Structure
- Provider team and consumer team
- Services covered
- SLIs with measurement methodology
- SLO targets with time windows
- Measurement
- Automated collection of SLI data
- Rolling windows (7-day, 30-day)
- Alert on approaching threshold
- Dashboard
- Current compliance status
- Historical trend
- Error budget remaining
- Breakdown by SLI
- Reporting
- Weekly summary email
- Monthly review document
- Action items when SLO missed
3.3 Non-Functional Requirements
- Metrics collection must be automated (no manual entry)
- Dashboard must update at least every 5 minutes
- Agreement documents must be version-controlled
3.4 Example Usage / Output
SLE Agreement Document:
# Service Level Expectation: Platform Team → Application Teams
## Overview
This SLE defines the expectations for services provided by the Platform
Team to all application teams at Acme Corp.
**Provider**: Platform Team
**Consumers**: All application teams
**Effective Date**: 2025-01-01
**Review Cadence**: Quarterly
---
## Covered Services
### 1. Kubernetes Cluster (Production)
#### SLI: Availability
- **Definition**: Percentage of minutes where the Kubernetes API server
responds successfully to health checks
- **Measurement**: Synthetic probe every 30 seconds
- **Formula**: (Successful probes / Total probes) * 100
#### SLO: 99.9% Availability
- **Target**: 99.9% over 30-day rolling window
- **Error Budget**: 43 minutes/month
- **Violation Action**: Platform team halts feature work, focuses on reliability
---
### 2. CI/CD Pipelines
#### SLI: Build Queue Time
- **Definition**: Time from job submission to job start
- **Measurement**: Jenkins metrics
- **Formula**: P95 of queue wait time
#### SLO: P95 < 5 minutes
- **Target**: 95th percentile queue time under 5 minutes
- **Violation Action**: Platform team adds build capacity
---
### 3. Ticket Response
#### SLI: Time to First Response
- **Definition**: Time from ticket creation to first human response
- **Measurement**: Jira workflow timestamps
- **Formula**: P90 of response times during business hours
#### SLO: P90 < 4 business hours
- **Target**: 90% of tickets get first response within 4 business hours
- **Violation Action**: Review staffing and priorities
---
## Escalation Path
If SLO is violated for 2 consecutive weeks:
1. Platform Team Lead notified
2. Joint review meeting with affected teams
3. Action plan created within 5 business days
---
## Review and Amendments
- SLEs reviewed quarterly
- Changes require agreement from both parties
- Historical data preserved for trend analysis
Dashboard Output (Grafana):
┌────────────────────────────────────────────────────────────────────┐
│ PLATFORM SLE DASHBOARD │
├────────────────────────────────────────────────────────────────────┤
│ │
│ KUBERNETES AVAILABILITY (30-day) │
│ ┌────────────────────────────────────────────┐ │
│ │ Current: 99.94% │ Target: 99.9% │ 🟢 │ │
│ ├────────────────────────────────────────────┤ │
│ │ Error Budget: 43 min │ │
│ │ Used: 24 min (56%) │ │
│ │ Remaining: 19 min │ │
│ └────────────────────────────────────────────┘ │
│ │
│ CI/CD QUEUE TIME (P95) │
│ ┌────────────────────────────────────────────┐ │
│ │ Current: 3.2 min │ Target: 5 min │ 🟢 │ │
│ └────────────────────────────────────────────┘ │
│ │
│ TICKET RESPONSE (P90) │
│ ┌────────────────────────────────────────────┐ │
│ │ Current: 4.8 hrs │ Target: 4 hrs │ 🔴 │ │
│ │ VIOLATION - Action required │ │
│ └────────────────────────────────────────────┘ │
│ │
│ 7-DAY TREND │
│ Mon Tue Wed Thu Fri Sat Sun │
│ 🟢 🟢 🟢 🟡 🔴 🟢 🟢 │
│ │
└────────────────────────────────────────────────────────────────────┘
3.5 Real World Outcome
After implementing SLEs:
- Teams have clear expectations (no more “we always argue about response times”)
- Platform team can prove their value (dashboard shows 99.9% uptime)
- Violations are addressed systematically (not in angry Slack threads)
- Error budgets allow velocity while maintaining reliability
4. Solution Architecture
4.1 High-Level Design
┌─────────────────────────────────────────────────────────────────┐
│ SLE SYSTEM │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ AGREEMENT │ │ MEASUREMENT │ │ VISUALIZATION│
│ DOCUMENTS │ │ │ │ │
│ │ │ Prometheus │ │ Grafana │
│ Markdown in │ │ Jira API │ │ Dashboard │
│ Git repo │ │ Custom │ │ │
└───────────────┘ └───────────────┘ └───────────────┘
│
▼
┌───────────────────┐
│ ALERTING │
│ │
│ - Slack │
│ - PagerDuty │
│ - Email │
└───────────────────┘
4.2 Key Components
- Agreement Registry: Markdown files defining SLEs
- Metrics Collection: Prometheus for technical, Jira for process
- SLO Calculator: Computes compliance from raw metrics
- Dashboard: Grafana panels showing status
- Alerter: Notifies on budget burn rate
4.3 Data Structures
# sle-definition.yaml
sle:
  id: platform-to-apps-k8s
  provider: team-platform
  consumers:              # or a single "all" entry to cover every team
    - team-checkout
    - team-payments
  services:
    - name: kubernetes-production
      slis:
        - name: availability
          type: ratio
          good_events: sum(rate(probe_success[5m]))
          total_events: sum(rate(probe_total[5m]))
          target: 0.999
          window: 30d
        - name: latency
          type: percentile
          metric: kubernetes_api_latency_seconds
          percentile: 95
          target: 0.5     # 500ms
          window: 7d
  escalation:
    - threshold: 50       # % of error budget consumed
      action: alert_team_lead
    - threshold: 80
      action: alert_director
    - threshold: 100
      action: incident_declared
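Since these definitions are version-controlled, it is worth validating them in CI before they merge. A minimal sketch, assuming PyYAML is installed and the schema above (the required-key set is an illustration, not a standard):

# validate_sle.py - reject malformed SLE definitions before they merge
import yaml

REQUIRED_SLI_KEYS = {"name", "type", "target", "window"}

with open("sle-definition.yaml") as f:
    sle = yaml.safe_load(f)["sle"]

for service in sle["services"]:
    for sli in service["slis"]:
        missing = REQUIRED_SLI_KEYS - sli.keys()
        if missing:
            raise ValueError(f"{service['name']}/{sli.get('name')}: missing {missing}")
print(f"{sle['id']}: {len(sle['services'])} service(s) OK")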
# Prometheus queries for SLIs
# Availability SLI
(
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) * 100
# Latency SLI (P95)
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# Error Budget Remaining (fraction of the 30-day request budget left)
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  /
  (sum(increase(http_requests_total[30d])) * (1 - 0.999))
)
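Requirement 3.3 says collection must be automated. One hedged sketch of doing that through Prometheus's HTTP query API (assumes the `requests` library and a server at localhost:9090; adjust the URL and metric names to your stack):

# Pull the availability SLI from the Prometheus HTTP API.
import requests

QUERY = """(
  sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100"""

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": QUERY},
    timeout=10,
)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    availability = float(result[0]["value"][1])  # value is [timestamp, value]
    print(f"Availability SLI: {availability:.3f}%")
else:
    print("Query returned no series - check metric names and scrape targets")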
4.4 Algorithm Overview
from dataclasses import dataclass

@dataclass
class SLO:
    id: str
    type: str               # "ratio" or "percentile"
    target: float           # e.g. 0.999 (ratio) or 0.5 seconds (percentile)
    window_minutes: int     # e.g. 30 * 24 * 60
    percentile: float = 95.0

@dataclass
class Compliance:
    slo_id: str
    actual: float
    target: float
    compliant: bool
    error_budget_remaining: float   # in minutes
    error_budget_percent: float

def calculate_slo_compliance(window_data: dict, slo: SLO) -> Compliance:
    """window_data: samples already restricted to the SLO window."""
    if slo.type == "ratio":
        # Availability-style SLI: fraction of good events
        good = sum(window_data["good_events"])
        total = sum(window_data["total_events"])
        actual = good / total
        compliant = actual >= slo.target
        budget_fraction = 1 - slo.target              # e.g. 0.001 for 99.9%
        bad_fraction = 1 - actual
    elif slo.type == "percentile":
        # Latency-style SLI: Nth percentile must stay under the target
        values = sorted(window_data["values"])
        rank = max(0, round(slo.percentile / 100 * len(values)) - 1)
        actual = values[rank]
        compliant = actual <= slo.target
        budget_fraction = 1 - slo.percentile / 100    # e.g. 0.05 for P95
        bad_fraction = sum(v > slo.target for v in values) / len(values)
    else:
        raise ValueError(f"unknown SLO type: {slo.type}")

    # Error budget: "bad" minutes the window allows vs. minutes already used
    total_budget = budget_fraction * slo.window_minutes
    remaining = (budget_fraction - bad_fraction) * slo.window_minutes
    return Compliance(slo.id, actual, slo.target, compliant,
                      remaining, remaining / total_budget * 100)
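For example, feeding the calculator synthetic counts (values are illustrative):

slo = SLO(id="k8s-availability", type="ratio", target=0.999,
          window_minutes=30 * 24 * 60)
data = {"good_events": [99_950], "total_events": [100_000]}   # 99.95% good
result = calculate_slo_compliance(data, slo)
print(result.compliant)                     # True (99.95% >= 99.9%)
print(round(result.error_budget_percent))   # 50   (half the budget left)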
5. Implementation Guide
5.1 Development Environment Setup
# For local development with Prometheus/Grafana
docker-compose up -d prometheus grafana
# Or use existing monitoring stack
# Just need write access to Grafana
5.2 Project Structure
sle-agreements/
├── agreements/
│ ├── platform-to-apps.md
│ ├── data-to-analytics.md
│ └── template.md
├── definitions/
│ ├── platform-slis.yaml
│ └── data-slis.yaml
├── dashboards/
│ ├── platform-sle.json # Grafana dashboard
│ └── data-sle.json
├── alerts/
│ ├── platform-alerts.yaml # Prometheus alerting rules
│ └── data-alerts.yaml
└── reports/
├── 2025-01-weekly.md
└── 2025-01-monthly.md
5.3 The Core Question You’re Answering
“What is the ‘Contract’ between our teams, and how do we know if we’re breaking it?”
Most team frustrations come from mismatched expectations. An SLE makes the operating model explicit and measurable.
5.4 Concepts You Must Understand First
Stop and research these before coding:
- The Four Golden Signals
- What are Latency, Traffic, Errors, Saturation?
- Which are most important for your services?
- Book Reference: “SRE Book” Ch. 6
- Error Budgets
- How do you calculate remaining budget?
- What happens when budget is exhausted?
- Book Reference: “SRE Book” Ch. 3
- Percentiles vs. Averages
- Why is P99 more meaningful than average latency?
- Reference: Any observability guide
5.5 Questions to Guide Your Design
Before implementing, think through these:
User Focus
- Who is your “user” in this context?
- What does that user actually care about? (Uptime? Latency? Response time?)
- How will you know if the SLE matters to them?
Measurement
- Can you measure the SLI automatically?
- What’s the data source? (Prometheus, Datadog, Jira?)
- What’s the measurement frequency?
Consequences
- What happens if the SLE is missed?
- Who gets notified?
- Does a manager get paged? Does the team change priorities?
5.6 Thinking Exercise
The “No-Phone” Week
Imagine your team is forbidden from using Slack or Zoom for one week. You can only communicate via documented SLEs and tickets.
Questions:
- Does your SLE define what happens when a ticket is “Urgent”?
- Does it define where to find documentation?
- If the other team “fails” their SLE, do you have an escalation path without a Zoom call?
Write down:
- 3 situations where you’d normally Slack someone
- For each, what SLE commitment would replace that Slack message?
5.7 Hints in Layers
Hint 1: Start with the Pain. Ask: “What is the one thing we always argue about with Team X?” That’s your first SLI.
Hint 2: Define Availability Precisely. “Availability” is vague. Be specific:
- “The API returns 200 for /health endpoint”
- “Response time is under 500ms”
- “No 5xx errors”
Hint 3: Set Aspirational but Achievable Targets. Don’t aim for 100%; aim for what’s sufficient. If a developer can wait 4 hours for a PR review, 95% in < 4 hours is fine.
Hint 4: Make the Dashboard Visible. Put the SLE dashboard on a TV screen. Make it impossible to ignore.
5.8 The Interview Questions They’ll Ask
Prepare to answer these:
- “What is the difference between an SLO and an SLA?”
- SLO = internal target. SLA = external contract with consequences.
- “How do you handle a situation where a team is consistently missing their SLOs?”
- Investigate root cause, adjust target or invest in reliability, use error budget policy
- “Explain the concept of an Error Budget.”
- The acceptable amount of unreliability. Allows velocity while maintaining reliability.
- “What are the Four Golden Signals?”
- Latency, Traffic, Errors, Saturation
- “How do you choose the right SLI for a non-technical team (like HR)?”
- Focus on what the customer cares about (response time, accuracy, availability)
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| SLOs | “SRE Book” (Google) | Ch. 4: Service Level Objectives |
| Monitoring | “SRE Book” (Google) | Ch. 6: Monitoring Distributed Systems |
| Error Budgets | “SRE Workbook” | Ch. 2: Implementing SLOs |
5.10 Implementation Phases
Phase 1: Agreement Design (3-4 hours)
- Identify provider and consumer teams
- List services covered
- Define 2-3 SLIs per service
- Set initial SLO targets
Phase 2: Measurement Setup (4-5 hours)
- Write Prometheus queries for each SLI
- Verify data is being collected
- Test calculations manually
Phase 3: Dashboard Build (3-4 hours)
- Create Grafana dashboard
- Add panels for each SLI
- Add error budget visualization
- Add 7-day trend
Phase 4: Alerting (2-3 hours)
- Set up alerts for budget burn rate (see the burn-rate sketch below)
- Configure Slack/email notifications
- Test alerting
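A hedged sketch of the burn-rate logic behind those alerts. The 14.4x fast-burn threshold follows the SRE Workbook's example policy (2% of a 30-day budget spent in one hour); tune the thresholds to your own SLO:

def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """1.0 = budget lasts exactly the SLO window; higher = burning faster."""
    return bad_fraction / (1 - slo_target)

observed = 0.0072   # e.g. 0.72% of requests failed in the last hour
rate = burn_rate(observed, slo_target=0.999)
if rate >= 14.4:    # fast burn: page immediately
    print(f"PAGE: {rate:.1f}x burn - budget gone in ~{30 * 24 / rate:.0f}h")
elif rate >= 6:     # slow burn: open a ticket
    print(f"TICKET: {rate:.1f}x burn")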
5.11 Key Implementation Decisions
| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| SLI source | Prometheus | Datadog | Use what you have |
| Dashboard | Grafana | Custom | Grafana (faster) |
| Alerting | Prometheus Alertmanager | PagerDuty | Alertmanager for internal |
| Window | Rolling (7/30-day) | Calendar month | 30-day rolling |
6. Testing Strategy
SLI Validation
# Verify query returns expected data type
# Should be between 0 and 1 for ratios
(sum(rate(http_requests_total{status=~"2.."}[5m])) /
sum(rate(http_requests_total[5m])))
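The calculator itself deserves the same treatment. A pytest sketch over the §4.4 function at edge cases (assumes it is importable from a hypothetical local module named `sle`):

# test_sle.py - edge cases for the SLO calculator (run with pytest)
from sle import SLO, calculate_slo_compliance

WINDOW = 30 * 24 * 60

def test_perfect_month_keeps_full_budget():
    slo = SLO(id="t", type="ratio", target=0.999, window_minutes=WINDOW)
    out = calculate_slo_compliance({"good_events": [1000], "total_events": [1000]}, slo)
    assert out.compliant and out.error_budget_percent == 100

def test_exactly_on_target_is_compliant():
    slo = SLO(id="t", type="ratio", target=0.999, window_minutes=WINDOW)
    out = calculate_slo_compliance({"good_events": [999], "total_events": [1000]}, slo)
    assert out.compliant and round(out.error_budget_percent) == 0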
Dashboard Testing
- Verify all panels load without errors
- Verify colors change at correct thresholds
- Test with synthetic data at edge cases
Alert Testing
- Manually trigger alert conditions
- Verify notifications are received
- Verify alert routing is correct
7. Common Pitfalls & Debugging
| Problem | Symptom | Root Cause | Fix |
|---|---|---|---|
| SLI always at 100% | Dashboard shows unrealistic perfection | Measuring wrong thing | Verify query captures failures |
| No data in dashboard | Empty panels | Query syntax error or no data | Check Prometheus targets |
| Alert fatigue | Too many notifications | Thresholds too tight | Adjust to realistic targets |
| SLE ignored | Teams don’t care | No consequences | Add to team reviews, make visible |
8. Extensions & Challenges
Extension 1: Composite SLOs
Combine multiple SLIs into a single “service health” score.
Extension 2: SLO Burndown
Show projected SLO compliance based on current burn rate.
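A minimal projection sketch (numbers illustrative): extrapolate the current daily burn to the end of the window.

budget_min, used_min = 43.2, 28.0
days_elapsed, window_days = 20, 30
projected = used_min + (used_min / days_elapsed) * (window_days - days_elapsed)
print(f"Projected: {projected:.0f}/{budget_min:.0f} min "
      f"({'OVER budget' if projected > budget_min else 'within budget'})")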
Extension 3: Automated Reporting
Generate weekly email with SLE status for all agreements.
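One hedged starting point: render the summary as Markdown from compliance results, then commit it to reports/ or pipe it into your mailer.

results = [   # (sli, actual, target, compliant) - e.g. from calculate_slo_compliance
    ("kubernetes availability %", 99.94, 99.9, True),
    ("ticket response P90 (hrs)", 4.8, 4.0, False),
]
lines = ["# Weekly SLE Report", "", "| SLI | Actual | Target | Status |", "|---|---|---|---|"]
for name, actual, target, ok in results:
    lines.append(f"| {name} | {actual} | {target} | {'🟢' if ok else '🔴 violation'} |")
print("\n".join(lines))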
Extension 4: SLE Library
Create reusable SLE templates for common patterns.
9. Real-World Connections
How Big Tech Does This:
- Google: Publishes external SLOs for Cloud products
- AWS: Service Health Dashboard with SLA credits
- Datadog: SLO tracking feature built into platform
Tools:
- Prometheus + Grafana
- Nobl9 - SLO platform
- Datadog SLOs
10. Resources
SRE Resources
- Google SRE Book - Free online
- SRE Workbook
- The Art of SLOs
Prometheus
- Prometheus documentation (prometheus.io/docs)
Related Projects
- P02: Team Service Interface - Define team as service
- P04: Escalation Logic Tree - When SLOs are violated
11. Self-Assessment Checklist
Before considering this project complete, verify:
- I can explain SLI, SLO, SLA, and Error Budget
- Agreement document covers at least 3 SLIs
- Each SLI has a measurable, automated query
- Dashboard shows current compliance and trend
- Error budget visualization is present
- At least one alert is configured
- Both provider and consumer teams have reviewed
12. Submission / Completion Criteria
This project is complete when you have:
- SLE Agreement markdown document
- SLI definitions in YAML format
- Prometheus queries for each SLI
- Grafana dashboard showing compliance
- Alerting rules for budget burn
- Sign-off from both provider and consumer teams
Previous Project: P06: Cognitive Load Survey & Heatmap Next Project: P08: Dependency Spaghetti Visualizer