Project 4: The Escalation Logic Tree (Incident Design)
Build a programmable escalation engine that determines WHO gets paged based on service boundaries, time of day, and failure type.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 1 Week (15-20 hours) |
| Primary Language | Python |
| Alternative Languages | JavaScript, Go |
| Prerequisites | Basic programming, understanding of on-call |
| Key Topics | Incident Response, SRE, Rule Engines |
1. Learning Objectives
By completing this project, you will:
- Design escalation paths that reflect organizational boundaries
- Implement a decision tree for routing alerts to correct teams
- Handle edge cases (timeouts, dependencies, cross-team outages)
- Reduce Mean Time to Acknowledge (MTTA) through automation
- Model the human side of incident response in code
2. Theoretical Foundation
2.1 Core Concepts
The Escalation Problem
MANUAL ESCALATION AUTOMATED ESCALATION
┌─────────────────────────────┐ ┌─────────────────────────────┐
│ Alert fires │ │ Alert fires │
│ → Who owns this? │ │ → Lookup service owner │
│ → Slack around asking │ │ → Check time/severity │
│ → Page someone who guesses │ │ → Page correct on-call │
│ → They redirect to actual │ │ → Notify dependencies │
│ owner │ │ → Done (2 minutes) │
│ → 30 minutes later... │ │ │
└─────────────────────────────┘ └─────────────────────────────┘
Escalation Hierarchy
┌─────────────────────────┐
│ INCIDENT COMMANDER │
│ (Cross-team P1s) │
└───────────┬─────────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Service Owner │ │ Dependency │ │ Communications│
│ (Primary) │ │ Owners │ │ Lead │
│ │ │ (Informed) │ │ │
└───────────────┘ └───────────────┘ └───────────────┘
│
┌───────┴───────┐
▼ ▼
┌────────┐ ┌────────┐
│ Primary│ │ Backup │
│ On-call│ │ On-call│
└────────┘ └────────┘
Key Metrics
| Metric | Definition | Target |
|---|---|---|
| MTTA | Mean Time to Acknowledge | < 5 minutes |
| MTTR | Mean Time to Resolve | Service dependent |
| Reassignment Rate | % of alerts reassigned | < 5% |
| False Positive Rate | Alerts that weren’t real | < 10% |
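To make these targets measurable, here is a minimal sketch of computing MTTA and reassignment rate from a batch of incident records. The field names (created_time, ack_time, reassigned) are assumptions for illustration, not part of this spec.
from datetime import datetime

def mtta_minutes(incidents: list[dict]) -> float:
    """Mean Time to Acknowledge, in minutes, over a batch of incident records."""
    deltas = [
        (datetime.fromisoformat(i["ack_time"]) - datetime.fromisoformat(i["created_time"])).total_seconds() / 60
        for i in incidents
        if i.get("ack_time")
    ]
    return sum(deltas) / len(deltas) if deltas else 0.0

def reassignment_rate(incidents: list[dict]) -> float:
    """Fraction of incidents that had to be rerouted to a different team."""
    if not incidents:
        return 0.0
    return sum(1 for i in incidents if i.get("reassigned")) / len(incidents)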
2.2 Why This Matters
Escalation is where team interfaces are tested under pressure. When systems fail:
- Clear boundaries mean fast resolution
- Unclear boundaries mean chaos and blame
The cost of slow escalation:
- 1 minute of downtime for a tier-1 service = $10,000+ (typical e-commerce)
- 30 minutes of searching for the owner = $300,000 in lost revenue
2.3 Historical Context
- ITIL Incident Management (1980s): Formalized IT operations
- PagerDuty/OpsGenie (2010s): On-call automation at scale
- SRE Movement (2016): “Site Reliability Engineering” book formalized practices
2.4 Common Misconceptions
| Misconception | Reality |
|---|---|
| “More alerts = more safety” | Too many alerts cause fatigue; people ignore them |
| “Everyone should be paged” | Page the right person, not everyone |
| “Escalation = punishment” | Escalation is process, not blame |
| “We’ll figure it out in the moment” | 3 AM is not the time to design process |
3. Project Specification
3.1 What You Will Build
A decision-tree simulator where you input an incident event and it outputs:
- Who to page (with specific contact info)
- Who to notify (dependencies, stakeholders)
- The reasoning behind the decision
3.2 Functional Requirements
- Service Metadata Input
- Service ID, owning team, criticality tier
- Dependencies (what this service calls)
- On-call schedule link
- Decision Rules
- Severity-based routing (P1 vs P4)
- Time-based routing (business hours vs after hours)
- Dependency-aware notification
- Escalation Tiers
- Primary: First responder
- Secondary: If primary doesn’t respond in X minutes
- Incident Commander: If P1 crosses team boundaries
- Output
- Specific person/schedule to page
- Slack channels to notify
- Runbook links to include in alert
3.3 Non-Functional Requirements
- Decision must complete in < 100ms
- Rules must be version-controlled (YAML)
- Must be testable with synthetic events
- Must log all decisions for post-mortem analysis
3.4 Example Usage / Output
Input Event:
{
"service": "payment-gateway",
"error_type": "5xx_errors",
"severity": "P1",
"timestamp": "2025-01-15T03:30:00Z",
"region": "us-east-1",
"metrics": {
"error_rate": 0.45,
"latency_p99": 5200
}
}
Output:
$ ./escalate --event event.json
=== ESCALATION DECISION ===
Timestamp: 2025-01-15 03:30:00 UTC (AFTER HOURS)
Service: payment-gateway
Severity: P1 (Critical)
[STEP 1] Lookup Owner
→ Service Owner: team-payments
→ Status: Active
[STEP 2] Severity Check
→ P1 detected → Notify Primary On-call immediately
→ Primary On-call: @jane-doe (PagerDuty: payments-oncall)
[STEP 3] Time Check
→ After hours (03:30 UTC) → Include backup
→ Backup On-call: @bob-smith
[STEP 4] Dependency Check
→ payment-gateway depends on: auth-service, user-db
→ Notifying downstream owners as INFORMED:
- auth-service → team-identity (#auth-oncall)
- user-db → team-data (#data-oncall)
[STEP 5] Criticality Check
→ Service is Tier-1 → Notify Incident Commander pool
→ IC Pool: #incident-commanders
=== ACTIONS ===
PAGING:
1. PagerDuty → payments-oncall (Primary: @jane-doe)
NOTIFYING (Slack):
1. #incidents-active → New P1: payment-gateway 5xx errors
2. #payments-oncall → Your service, please join incident
3. #auth-oncall → FYI: payment-gateway (depends on you) is down
4. #incident-commanders → P1 Alert: Tier-1 service down
ATTACHING:
- Runbook: https://wiki.example.com/runbooks/payment-5xx
- Dashboard: https://grafana.example.com/d/payments
=== REASONING LOG ===
[03:30:00] Event received: P1 on payment-gateway
[03:30:00] Owner lookup: team-payments (ACTIVE)
[03:30:00] Severity P1 + After Hours → Aggressive escalation
[03:30:00] Tier-1 service → IC notification required
[03:30:00] 2 dependencies found → Notify as INFORMED
3.5 Real World Outcome
After implementing this system:
- MTTA drops from 15 minutes to < 2 minutes
- Reassignment rate drops from 30% to < 5%
- On-call engineers get context (runbook, dashboard) in the alert
4. Solution Architecture
4.1 High-Level Design
┌─────────────────────────────────────────────────────────────────┐
│ ESCALATION ENGINE │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Event Input │ │ Service │ │ Rules Engine │
│ (Alert) │ │ Registry │ │ (Decision │
│ │ │ (Metadata) │ │ Logic) │
└───────────────┘ └───────────────┘ └───────────────┘
│ │ │
└─────────────────────┼─────────────────────┘
│
▼
┌───────────────────┐
│ Decision Engine │
│ │
│ 1. Lookup owner │
│ 2. Apply rules │
│ 3. Build actions │
└─────────┬─────────┘
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ PagerDuty │ │ Slack │ │ Audit Log │
│ API │ │ Webhook │ │ (Decisions) │
└───────────────┘ └───────────────┘ └───────────────┘
4.2 Key Components
- Event Parser: Normalizes incoming alert data
- Service Registry: Metadata about services (owner, tier, deps)
- Rules Engine: Evaluates conditions and selects actions
- Action Executor: Sends pages, posts to Slack
- Audit Logger: Records all decisions for review
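A minimal sketch of the data passed between these components, assuming dataclasses in models.py; fields beyond those shown in the example event are illustrative.
from dataclasses import dataclass, field

@dataclass
class Event:
    service_id: str
    severity: str          # "P1".."P4"
    error_type: str
    timestamp: str         # ISO 8601, e.g. "2025-01-15T03:30:00Z"
    region: str = ""
    metrics: dict = field(default_factory=dict)

@dataclass
class EscalationPlan:
    pages: list = field(default_factory=list)          # schedules/people to page
    notifications: list = field(default_factory=list)  # Slack channels to notify
    runbook: str = ""
    dashboard: str = ""
    reasoning: list = field(default_factory=list)      # human-readable decision trace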
4.3 Data Structures
# services.yaml
services:
- id: payment-gateway
name: Payment Gateway
owner: team-payments
tier: 1 # 1 = critical, 2 = important, 3 = normal
oncall_schedule: payments-oncall
slack_channel: "#payments-alerts"
dependencies:
- auth-service
- user-db
runbook: https://wiki.example.com/runbooks/payment
dashboard: https://grafana.example.com/d/payments
# rules.yaml
rules:
- name: p1-after-hours
conditions:
- severity: P1
- time: outside_business_hours
actions:
- page: primary_oncall
- page: backup_oncall
- notify: incident_commanders
- notify: dependencies_informed
- name: p1-business-hours
conditions:
- severity: P1
- time: business_hours
actions:
- page: primary_oncall
- notify: team_channel
- notify: dependencies_informed
- name: tier1-always
conditions:
- tier: 1
actions:
- notify: incident_commanders
- escalation_timeout: 5m → secondary_oncall
# teams.yaml
teams:
- id: team-payments
name: Payments Team
primary_oncall: payments-oncall # PagerDuty schedule ID
backup_oncall: payments-backup
slack: "#team-payments"
lead: "@payments-lead"
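A minimal loader sketch for these files (e.g. in registry.py), assuming PyYAML and the layout above; the function name is illustrative.
import yaml

def load_registry(services_path="data/services.yaml", teams_path="data/teams.yaml"):
    """Load service and team metadata into dicts keyed by id for O(1) lookup."""
    with open(services_path) as f:
        services = {s["id"]: s for s in yaml.safe_load(f)["services"]}
    with open(teams_path) as f:
        teams = {t["id"]: t for t in yaml.safe_load(f)["teams"]}
    return services, teams

# Usage:
# services, teams = load_registry()
# services["payment-gateway"]["owner"]  -> "team-payments"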
4.4 Algorithm Overview
def escalate(event: Event) -> EscalationPlan:
# Step 1: Get service metadata
service = registry.lookup(event.service_id)
if not service:
return default_escalation(event)
# Step 2: Get owning team
team = registry.get_team(service.owner)
if team.status != "active":
team = registry.get_team(team.merged_into)
# Step 3: Evaluate rules
context = build_context(event, service, team)
matched_rules = rules_engine.evaluate(context)
# Step 4: Build action plan
actions = []
for rule in matched_rules:
actions.extend(rule.actions)
# Step 5: Resolve actions to specific targets
plan = EscalationPlan()
for action in actions:
if action.type == "page":
schedule = resolve_schedule(action.target, team)
plan.pages.append(schedule)
elif action.type == "notify":
channel = resolve_channel(action.target, service, team)
plan.notifications.append(channel)
# Step 6: Add context
plan.runbook = service.runbook
plan.dashboard = service.dashboard
# Step 7: Log decision
audit_log.record(event, plan, matched_rules)
return plan
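The rules_engine.evaluate call above can be as simple as matching each rule's conditions against a context dict built from the event and service metadata. A sketch, assuming the rules.yaml format shown earlier:
def matches(rule: dict, context: dict) -> bool:
    """True if every condition in the rule matches the event context."""
    for condition in rule.get("conditions", []):
        for key, expected in condition.items():
            if context.get(key) != expected:
                return False
    return True

def evaluate(rules: list, context: dict) -> list:
    """Return all matching rules; ordering in rules.yaml doubles as priority."""
    return [r for r in rules if matches(r, context)]

# context is built from the event + service metadata, e.g.:
# {"severity": "P1", "time": "outside_business_hours", "tier": 1}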
5. Implementation Guide
5.1 Development Environment Setup
# Create project
mkdir escalation-engine && cd escalation-engine
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install pyyaml click requests pytz
# For testing PagerDuty/Slack integration
pip install python-dotenv
5.2 Project Structure
escalation-engine/
├── data/
│ ├── services.yaml
│ ├── teams.yaml
│ └── rules.yaml
├── src/
│ ├── __init__.py
│ ├── models.py # Event, Service, Team, Action
│ ├── registry.py # Service/Team lookup
│ ├── rules.py # Rules engine
│ ├── engine.py # Main escalation logic
│ ├── actions.py # PagerDuty/Slack integration
│ └── cli.py # Command-line interface
├── tests/
│ ├── test_rules.py
│ ├── test_engine.py
│ └── fixtures/
│ └── events/
│ ├── p1_after_hours.json
│ └── p3_business_hours.json
└── escalate # Entry point
5.3 The Core Question You’re Answering
“In a crisis, does the system know how to find the right human without a human having to look it up?”
Manual escalation during a P1 incident is a failure of the operating model. The model should have pre-defined paths for every foreseeable failure.
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Mean Time to Acknowledge (MTTA)
- How does escalation logic impact this metric?
- What’s the industry benchmark for MTTA?
- Book Reference: “Accelerate” by Nicole Forsgren
- The Incident Commander Role
- When does a team-level issue become an org-level incident?
- What’s the difference between IC and first responder?
- Book Reference: “The Site Reliability Workbook” Ch. 9
- Alert Fatigue
- What happens when on-call gets too many alerts?
- How do you design escalation to prevent burnout?
- Reference: PagerDuty operations guides
5.5 Questions to Guide Your Design
Before implementing, think through these:
Dependencies
- If Service A depends on Service B, and A is failing, should B’s owner be paged?
- What if B is also failing? Do you page both or just B?
- How do you avoid “Cascade Paging” (paging everyone)?
Time and Context
- Does the escalation change at 2 PM Tuesday vs. 2 AM Sunday?
- What if the primary responder doesn’t answer in 15 minutes?
- How do you handle holidays and PTO?
Failure Modes
- What if PagerDuty is down?
- What if the service metadata is missing?
- What if the owning team no longer exists?
5.6 Thinking Exercise
The “Blame Game” Simulation
Trace a failure in a shared component (like a Load Balancer).
Scenario:
- Load Balancer starts dropping connections
- App Team A sees errors, pages LB team
- LB Team says “It’s your app generating bad requests”
- App Team A says “No, it’s the LB”
- 45 minutes later, still arguing
Questions:
- Who is responsible for the Load Balancer?
- If the app team pages the LB team, and the LB team says “it’s your app,” who breaks the tie?
- Does your escalation logic include a “Final Arbiter” (like a CTO or Architect)?
- How would you model this in your rules?
5.7 Hints in Layers
Hint 1: Map the Metadata
Every service needs at minimum:
service:
owner_team_id: string
criticality_tier: 1 | 2 | 3
oncall_schedule: string
Hint 2: Define Rules as Data
Rules should be YAML, not hardcoded:
rule:
conditions:
- severity: P1
- time: outside_business_hours
actions:
- page: primary
- page: backup
Hint 3: Handle Dependencies
Add depends_on to service metadata:
service:
depends_on:
- service_id: auth-service
notify_on_failure: true # INFORM, don't PAGE
Hint 4: Build Safety Rules
What if metadata is missing?
rule:
name: fallback-unknown-service
conditions:
- service_metadata: missing
actions:
- page: platform_oncall
- notify: "#unknown-service-alerts"
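Tying Hints 3 and 4 together, a sketch of resolving dependencies into INFORMED Slack notifications (never pages), with the unknown-service channel as the safety net; the helper and channel names are assumptions.
def dependency_notifications(service: dict, services: dict, teams: dict) -> list:
    """Resolve each dependency to its owning team's Slack channel (notify, never page)."""
    channels = []
    for dep_id in service.get("dependencies", []):
        dep = services.get(dep_id)
        if dep is None:
            channels.append("#unknown-service-alerts")  # safety net from Hint 4
            continue
        owner = teams.get(dep["owner"], {})
        channels.append(owner.get("slack", "#unknown-service-alerts"))
    return channels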
5.8 The Interview Questions They’ll Ask
Prepare to answer these:
- “How do you design an on-call rotation that doesn’t burn people out?”
- Follow-the-sun, secondary backups, max pages per shift, compensation
- “What is the difference between an alert and an incident?”
- Alert = signal. Incident = declared event requiring coordination.
- “Explain the ‘Secondary’ escalation layer.”
- Backup responder if primary doesn’t acknowledge within timeout
- “How do you handle ‘Silent Failures’ where no one gets paged?”
- Canary alerts, synthetic monitoring, dead-man switches
- “What are the common pitfalls of automated escalation?”
- Over-paging, alert fatigue, wrong ownership metadata, missing fallbacks
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Incident Management | “SRE Book” (Google) | Ch. 14: Managing Incidents |
| Response Strategies | “The Site Reliability Workbook” | Ch. 9: Incident Response |
| On-Call Best Practices | “Accelerate” | Ch. 7 |
5.10 Implementation Phases
Phase 1: Data Model (3-4 hours)
- Define Service, Team, Rule, Event dataclasses
- Create sample services.yaml and teams.yaml
- Write loader functions
Phase 2: Rules Engine (4-5 hours)
- Implement condition matching
- Implement action resolution
- Handle rule priority/ordering
Phase 3: Decision Engine (3-4 hours)
- Combine registry + rules
- Build EscalationPlan output
- Add reasoning log
Phase 4: CLI & Testing (3-4 hours)
- Create CLI with click
- Add test events
- Verify output format
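A minimal cli.py sketch for Phase 4, assuming click and the engine.escalate entry point from the algorithm overview; the event field mapping and output formatting are illustrative.
import json
import click
from src import engine, models

@click.command()
@click.option("--event", "event_path", required=True, type=click.Path(exists=True),
              help="Path to a JSON event fixture")
def escalate(event_path):
    """Read an alert event and print the escalation plan with its reasoning log."""
    with open(event_path) as f:
        raw = json.load(f)
    event = models.Event(
        service_id=raw["service"],
        severity=raw["severity"],
        error_type=raw.get("error_type", ""),
        timestamp=raw["timestamp"],
        region=raw.get("region", ""),
        metrics=raw.get("metrics", {}),
    )
    plan = engine.escalate(event)
    click.echo("=== ESCALATION DECISION ===")
    click.echo(f"Paging: {plan.pages}")
    click.echo(f"Notifying: {plan.notifications}")
    for line in plan.reasoning:
        click.echo(line)

if __name__ == "__main__":
    escalate()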
5.11 Key Implementation Decisions
| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| Rule format | YAML | Python DSL | YAML (non-dev editable) |
| Time handling | UTC only | Timezone-aware | Timezone-aware (follow team TZ) |
| Dependency notification | Always notify | Only when implicated in the failure | Notify only dependencies related to the failure |
| PagerDuty integration | Real API | Mock | Mock first, real integration later |
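For the timezone-aware option, a sketch of a business-hours check using pytz (already in the dependency list); the 09:00-18:00 weekday window and the default team timezone are assumptions.
from datetime import datetime
import pytz

def time_bucket(timestamp_utc: str, team_tz: str = "America/New_York") -> str:
    """Classify an ISO-8601 UTC timestamp as business_hours or outside_business_hours."""
    utc_dt = datetime.fromisoformat(timestamp_utc.replace("Z", "+00:00"))
    local = utc_dt.astimezone(pytz.timezone(team_tz))
    if local.weekday() < 5 and 9 <= local.hour < 18:
        return "business_hours"
    return "outside_business_hours"

# time_bucket("2025-01-15T03:30:00Z")  -> "outside_business_hours"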
6. Testing Strategy
Unit Tests
def test_p1_after_hours_pages_backup():
event = Event(service="pay", severity="P1", time="03:00 UTC")
plan = engine.escalate(event)
assert len(plan.pages) == 2 # primary + backup
def test_missing_service_uses_fallback():
event = Event(service="unknown", severity="P1")
plan = engine.escalate(event)
assert "platform-oncall" in plan.pages
def test_dependency_notified_not_paged():
event = Event(service="checkout", severity="P1")
plan = engine.escalate(event)
# checkout depends on payments
assert "payments" in plan.notifications
assert "payments" not in plan.pages
Integration Tests
- Load real YAML files
- Process 100 sample events
- Verify all produce valid plans
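A sketch of that integration test, assuming pytest and the tests/fixtures/events layout shown earlier.
import json
from pathlib import Path
from src import engine, models

def test_all_fixture_events_produce_valid_plans():
    for path in Path("tests/fixtures/events").glob("*.json"):
        raw = json.loads(path.read_text())
        event = models.Event(service_id=raw["service"], severity=raw["severity"],
                             error_type=raw.get("error_type", ""), timestamp=raw["timestamp"])
        plan = engine.escalate(event)
        # Every event must page at least one schedule, even if only via the fallback rule.
        assert plan.pages, f"{path.name} produced an empty paging list"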
Chaos Tests
- Remove service from registry mid-run
- Provide malformed event
- Simulate PagerDuty timeout
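For the PagerDuty-timeout case, a sketch using unittest.mock to force the HTTP call to time out; it assumes the paging call lives in src/actions.py as page_pagerduty (see Extension 1).
from unittest.mock import patch
import pytest
import requests
from src import actions

def test_pagerduty_timeout_is_surfaced_not_swallowed():
    # Simulate the PagerDuty API timing out; the action layer should surface the
    # failure so the engine can fall back to Slack-only escalation.
    with patch("src.actions.requests.post", side_effect=requests.exceptions.Timeout):
        with pytest.raises(requests.exceptions.Timeout):
            actions.page_pagerduty("PD-SERVICE-ID", "P1: payment-gateway 5xx errors")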
7. Common Pitfalls & Debugging
| Problem | Symptom | Root Cause | Fix |
|---|---|---|---|
| Wrong person paged | Reassignment rate > 20% | Stale ownership data | Automate registry sync |
| Everyone paged | 10+ pages for one incident | Cascade from dependencies | Add “related_incident” dedup |
| No one paged | Silent failures | Missing fallback rules | Add catch-all rule |
| Slow escalation | MTTA > 10 min | Complex rule evaluation | Optimize rule engine, add caching |
8. Extensions & Challenges
Extension 1: PagerDuty Integration
Actually page people using the PagerDuty REST API.
import requests

def page_pagerduty(service_id: str, message: str):
    # Incidents are created against a PagerDuty *service*, not a schedule; routing to
    # the on-call responder is handled by that service's escalation policy.
    response = requests.post(
        "https://api.pagerduty.com/incidents",
        json={
            "incident": {
                "type": "incident",
                "title": message,
                "service": {"id": service_id, "type": "service_reference"},
            }
        },
        headers={
            "Authorization": f"Token token={API_KEY}",
            "From": REQUESTER_EMAIL,  # PagerDuty requires a valid user email here
            "Content-Type": "application/json",
        },
        timeout=10,
    )
    response.raise_for_status()
Extension 2: Slack Bot
Post escalation decisions to Slack in real-time.
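A minimal sketch using a Slack incoming webhook; the environment variable name is an assumption, and richer Block Kit formatting is left as an exercise.
import os
import requests

def post_to_slack(channel_hint: str, text: str) -> None:
    """Post an escalation decision to Slack via an incoming webhook."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # one webhook per channel in the simple setup
    requests.post(webhook_url, json={"text": f"{channel_hint}: {text}"}, timeout=5)

# post_to_slack("#incidents-active", "New P1: payment-gateway 5xx errors — paging payments-oncall")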
Extension 3: Escalation Timeout
If primary doesn’t acknowledge in 5 minutes, auto-escalate to secondary.
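One illustrative way to sketch this without a scheduler is a background timer that re-checks acknowledgement state; in practice this check usually lives in the paging provider's escalation policy, so the helper below is purely a simulation aid.
import threading

def schedule_secondary_escalation(incident_id: str, is_acknowledged, page_secondary, timeout_s: int = 300):
    """After timeout_s seconds, page the secondary on-call if the primary has not acknowledged."""
    def check():
        if not is_acknowledged(incident_id):
            page_secondary(incident_id)
    timer = threading.Timer(timeout_s, check)
    timer.daemon = True
    timer.start()
    return timer  # caller can cancel() once the incident is acknowledged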
Extension 4: Post-Incident Analysis
Generate report showing all escalation decisions for a time period.
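Assuming the audit logger writes one JSON object per decision (JSONL) with timestamp, service, and severity fields, a report sketch:
import json
from collections import Counter
from datetime import datetime

def escalation_report(audit_log_path: str, since_iso: str) -> dict:
    """Summarize escalation decisions recorded after a given ISO-8601 timestamp."""
    since = datetime.fromisoformat(since_iso.replace("Z", "+00:00"))
    by_service, by_severity, total = Counter(), Counter(), 0
    with open(audit_log_path) as f:
        for line in f:
            record = json.loads(line)
            if datetime.fromisoformat(record["timestamp"].replace("Z", "+00:00")) < since:
                continue
            total += 1
            by_service[record["service"]] += 1
            by_severity[record["severity"]] += 1
    return {"total": total, "by_service": dict(by_service), "by_severity": dict(by_severity)}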
9. Real-World Connections
How Big Tech Does This:
- PagerDuty: Event Intelligence for automatic routing
- Google: Cascading on-call with automatic escalation
- Netflix: “PagerDuty + Slack + custom routing”
Related Tools:
- Opsgenie (Atlassian): similar routing rules, commercial
- Grafana OnCall: open-source on-call management
10. Resources
PagerDuty
Articles
Related Projects
- P03: Ownership Boundary Mapper - Define who owns what
- P10: Incident Battle Cards - Crisis protocols
11. Self-Assessment Checklist
Before considering this project complete, verify:
- I can explain MTTA and MTTR
- services.yaml has at least 5 services with dependencies
- rules.yaml covers P1-P4 for business/after hours
- Engine handles missing service gracefully (fallback)
- Output includes reasoning log
- I’ve tested with at least 10 different event scenarios
- All pages include runbook and dashboard links
12. Submission / Completion Criteria
This project is complete when you have:
- services.yaml with 5+ services including dependencies
- teams.yaml with 3+ teams and on-call schedules
- rules.yaml with 8+ rules covering severity/time combinations
- CLI tool that processes events and outputs escalation plan
- Test suite with 10+ event fixtures
- Reasoning log showing decision trace
Previous Project: P03: Ownership Boundary Mapper Next Project: P05: Platform-as-a-Product Blueprint