Project 4: The Escalation Logic Tree (Incident Design)

Build a programmable escalation engine that determines WHO gets paged based on service boundaries, time of day, and failure type.

Quick Reference

Attribute             | Value
----------------------|---------------------------------------------
Difficulty            | Intermediate
Time Estimate         | 1 Week (15-20 hours)
Primary Language      | Python
Alternative Languages | JavaScript, Go
Prerequisites         | Basic programming, understanding of on-call
Key Topics            | Incident Response, SRE, Rule Engines

1. Learning Objectives

By completing this project, you will:

  1. Design escalation paths that reflect organizational boundaries
  2. Implement a decision tree for routing alerts to correct teams
  3. Handle edge cases (timeouts, dependencies, cross-team outages)
  4. Reduce Mean Time to Acknowledge (MTTA) through automation
  5. Model the human side of incident response in code

2. Theoretical Foundation

2.1 Core Concepts

The Escalation Problem

MANUAL ESCALATION                   AUTOMATED ESCALATION
┌─────────────────────────────┐    ┌─────────────────────────────┐
│ Alert fires                 │    │ Alert fires                 │
│ → Who owns this?            │    │ → Lookup service owner      │
│ → Slack around asking       │    │ → Check time/severity       │
│ → Page someone who guesses  │    │ → Page correct on-call      │
│ → They redirect to actual   │    │ → Notify dependencies       │
│   owner                     │    │ → Done (2 minutes)          │
│ → 30 minutes later...       │    │                             │
└─────────────────────────────┘    └─────────────────────────────┘

Escalation Hierarchy

                    ┌─────────────────────────┐
                    │    INCIDENT COMMANDER   │
                    │    (Cross-team P1s)     │
                    └───────────┬─────────────┘
                                │
            ┌───────────────────┼───────────────────┐
            │                   │                   │
            ▼                   ▼                   ▼
    ┌───────────────┐   ┌───────────────┐   ┌───────────────┐
    │ Service Owner │   │ Dependency    │   │ Communications│
    │ (Primary)     │   │ Owners        │   │ Lead          │
    │               │   │ (Informed)    │   │               │
    └───────────────┘   └───────────────┘   └───────────────┘
            │
    ┌───────┴───────┐
    ▼               ▼
┌────────┐    ┌────────┐
│ Primary│    │ Backup │
│ On-call│    │ On-call│
└────────┘    └────────┘

Key Metrics

Metric              | Definition               | Target
--------------------|--------------------------|------------------
MTTA                | Mean Time to Acknowledge | < 5 minutes
MTTR                | Mean Time to Resolve     | Service dependent
Reassignment Rate   | % of alerts reassigned   | < 5%
False Positive Rate | Alerts that weren’t real | < 10%
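
MTTA falls straight out of timestamps the engine already records. A minimal sketch of computing it, assuming each audit-log record carries the fired and acknowledged times (field names are illustrative, not part of this spec):

from datetime import datetime

# Illustrative audit-log records; in practice these come from the engine's decision log.
alerts = [
    {"fired_at": "2025-01-15T03:30:00Z", "acked_at": "2025-01-15T03:33:10Z"},
    {"fired_at": "2025-01-15T09:12:00Z", "acked_at": "2025-01-15T09:14:30Z"},
]

def _parse(ts: str) -> datetime:
    # Accept the trailing "Z" used in the example events.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def mtta_minutes(records) -> float:
    """Mean Time to Acknowledge, in minutes, over acknowledged alerts."""
    deltas = [
        (_parse(r["acked_at"]) - _parse(r["fired_at"])).total_seconds()
        for r in records if r.get("acked_at")
    ]
    return sum(deltas) / len(deltas) / 60 if deltas else 0.0

print(f"MTTA: {mtta_minutes(alerts):.1f} min")  # MTTA: 2.8 min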

2.2 Why This Matters

Escalation is where team interfaces are tested under pressure. When systems fail:

  • Clear boundaries mean fast resolution
  • Unclear boundaries mean chaos and blame

The cost of slow escalation:

  • 1 minute of downtime for a tier-1 service = $10,000+ (typical e-commerce)
  • 30 minutes spent searching for the owner = $300,000 in lost revenue at that rate

2.3 Historical Context

  • ITIL Incident Management (1980s): Formalized IT operations
  • PagerDuty/OpsGenie (2010s): On-call automation at scale
  • SRE Movement (2016): “Site Reliability Engineering” book formalized practices

2.4 Common Misconceptions

Misconception                       | Reality
------------------------------------|---------------------------------------------------
“More alerts = more safety”         | Too many alerts cause fatigue; people ignore them
“Everyone should be paged”          | Page the right person, not everyone
“Escalation = punishment”           | Escalation is process, not blame
“We’ll figure it out in the moment” | 3 AM is not the time to design process

3. Project Specification

3.1 What You Will Build

A decision-tree simulator where you input an incident event and it outputs:

  1. Who to page (with specific contact info)
  2. Who to notify (dependencies, stakeholders)
  3. The reasoning behind the decision

3.2 Functional Requirements

  1. Service Metadata Input
    • Service ID, owning team, criticality tier
    • Dependencies (what this service calls)
    • On-call schedule link
  2. Decision Rules
    • Severity-based routing (P1 vs P4)
    • Time-based routing (business hours vs after hours)
    • Dependency-aware notification
  3. Escalation Tiers
    • Primary: First responder
    • Secondary: If primary doesn’t respond in X minutes
    • Incident Commander: If P1 crosses team boundaries
  4. Output
    • Specific person/schedule to page
    • Slack channels to notify
    • Runbook links to include in alert

3.3 Non-Functional Requirements

  • Decision must complete in < 100ms
  • Rules must be version-controlled (YAML)
  • Must be testable with synthetic events
  • Must log all decisions for post-mortem analysis
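
For the last requirement, one possible shape for a single decision record, written here as a Python dict (field names are illustrative, not prescribed by the spec):

# One audit-log record per escalation decision.
decision_record = {
    "event_id": "evt-20250115-0001",
    "service": "payment-gateway",
    "severity": "P1",
    "received_at": "2025-01-15T03:30:00Z",
    "matched_rules": ["p1-after-hours", "tier1-always"],
    "pages": ["payments-oncall"],
    "notifications": ["#incidents-active", "#incident-commanders"],
    "decision_ms": 12,  # evidence for the < 100ms requirement
}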

3.4 Example Usage / Output

Input Event:

{
  "service": "payment-gateway",
  "error_type": "5xx_errors",
  "severity": "P1",
  "timestamp": "2025-01-15T03:30:00Z",
  "region": "us-east-1",
  "metrics": {
    "error_rate": 0.45,
    "latency_p99": 5200
  }
}

Output:

$ ./escalate --event event.json

=== ESCALATION DECISION ===

Timestamp: 2025-01-15 03:30:00 UTC (AFTER HOURS)
Service: payment-gateway
Severity: P1 (Critical)

[STEP 1] Lookup Owner
  → Service Owner: team-payments
  → Status: Active

[STEP 2] Severity Check
  → P1 detected → Notify Primary On-call immediately
  → Primary On-call: @jane-doe (PagerDuty: payments-oncall)

[STEP 3] Time Check
  → After hours (03:30 UTC) → Include backup
  → Backup On-call: @bob-smith

[STEP 4] Dependency Check
  → payment-gateway depends on: auth-service, user-db
  → Notifying downstream owners as INFORMED:
    - auth-service → team-identity (#auth-oncall)
    - user-db → team-data (#data-oncall)

[STEP 5] Criticality Check
  → Service is Tier-1 → Notify Incident Commander pool
  → IC Pool: #incident-commanders

=== ACTIONS ===

PAGING:
  1. PagerDuty → payments-oncall (Primary: @jane-doe)

NOTIFYING (Slack):
  1. #incidents-active → New P1: payment-gateway 5xx errors
  2. #payments-oncall → Your service, please join incident
  3. #auth-oncall → FYI: payment-gateway (depends on you) is down
  4. #incident-commanders → P1 Alert: Tier-1 service down

ATTACHING:
  - Runbook: https://wiki.example.com/runbooks/payment-5xx
  - Dashboard: https://grafana.example.com/d/payments

=== REASONING LOG ===
[03:30:00] Event received: P1 on payment-gateway
[03:30:00] Owner lookup: team-payments (ACTIVE)
[03:30:00] Severity P1 + After Hours → Aggressive escalation
[03:30:00] Tier-1 service → IC notification required
[03:30:00] 2 dependencies found → Notify as INFORMED

3.5 Real World Outcome

After implementing this system:

  • MTTA drops from 15 minutes to < 2 minutes
  • Reassignment rate drops from 30% to < 5%
  • On-call engineers get context (runbook, dashboard) in the alert

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                     ESCALATION ENGINE                           │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  Event Input  │     │ Service       │     │ Rules Engine  │
│  (Alert)      │     │ Registry      │     │ (Decision     │
│               │     │ (Metadata)    │     │  Logic)       │
└───────────────┘     └───────────────┘     └───────────────┘
        │                     │                     │
        └─────────────────────┼─────────────────────┘
                              │
                              ▼
                    ┌───────────────────┐
                    │  Decision Engine  │
                    │                   │
                    │  1. Lookup owner  │
                    │  2. Apply rules   │
                    │  3. Build actions │
                    └─────────┬─────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  PagerDuty    │     │  Slack        │     │  Audit Log    │
│  API          │     │  Webhook      │     │  (Decisions)  │
└───────────────┘     └───────────────┘     └───────────────┘

4.2 Key Components

  1. Event Parser: Normalizes incoming alert data (see the sketch after this list)
  2. Service Registry: Metadata about services (owner, tier, deps)
  3. Rules Engine: Evaluates conditions and selects actions
  4. Action Executor: Sends pages, posts to Slack
  5. Audit Logger: Records all decisions for review
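
For the Event Parser, a minimal sketch that normalizes the raw alert payload from Section 3.4 into an internal Event (the field names follow event.service_id as used in Section 4.4; everything else is an assumption):

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Event:
    service_id: str
    error_type: str
    severity: str            # "P1" .. "P4"
    timestamp: datetime
    region: str = ""
    metrics: dict = field(default_factory=dict)

def parse_event(raw: dict) -> Event:
    """Normalize a raw alert payload (see Section 3.4) into an Event."""
    return Event(
        service_id=raw["service"],
        error_type=raw.get("error_type", "unknown"),
        severity=raw.get("severity", "P3"),
        timestamp=datetime.fromisoformat(raw["timestamp"].replace("Z", "+00:00")),
        region=raw.get("region", ""),
        metrics=raw.get("metrics", {}),
    )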

4.3 Data Structures

# services.yaml
services:
  - id: payment-gateway
    name: Payment Gateway
    owner: team-payments
    tier: 1  # 1 = critical, 2 = important, 3 = normal
    oncall_schedule: payments-oncall
    slack_channel: "#payments-alerts"
    dependencies:
      - auth-service
      - user-db
    runbook: https://wiki.example.com/runbooks/payment
    dashboard: https://grafana.example.com/d/payments

# rules.yaml
rules:
  - name: p1-after-hours
    conditions:
      - severity: P1
      - time: outside_business_hours
    actions:
      - page: primary_oncall
      - page: backup_oncall
      - notify: incident_commanders
      - notify: dependencies_informed

  - name: p1-business-hours
    conditions:
      - severity: P1
      - time: business_hours
    actions:
      - page: primary_oncall
      - notify: team_channel
      - notify: dependencies_informed

  - name: tier1-always
    conditions:
      - tier: 1
    actions:
      - notify: incident_commanders
      - escalation_timeout: 5m → secondary_oncall

# teams.yaml
teams:
  - id: team-payments
    name: Payments Team
    primary_oncall: payments-oncall  # PagerDuty schedule ID
    backup_oncall: payments-backup
    slack: "#team-payments"
    lead: "@payments-lead"

4.4 Algorithm Overview

def escalate(event: Event) -> EscalationPlan:
    # Step 1: Get service metadata
    service = registry.lookup(event.service_id)
    if not service:
        return default_escalation(event)

    # Step 2: Get owning team
    team = registry.get_team(service.owner)
    if team.status != "active":
        team = registry.get_team(team.merged_into)

    # Step 3: Evaluate rules
    context = build_context(event, service, team)
    matched_rules = rules_engine.evaluate(context)

    # Step 4: Build action plan
    actions = []
    for rule in matched_rules:
        actions.extend(rule.actions)

    # Step 5: Resolve actions to specific targets
    plan = EscalationPlan()
    for action in actions:
        if action.type == "page":
            schedule = resolve_schedule(action.target, team)
            plan.pages.append(schedule)
        elif action.type == "notify":
            channel = resolve_channel(action.target, service, team)
            plan.notifications.append(channel)

    # Step 6: Add context
    plan.runbook = service.runbook
    plan.dashboard = service.dashboard

    # Step 7: Log decision
    audit_log.record(event, plan, matched_rules)

    return plan
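
build_context and the resolve_* helpers are deliberately abstract above. One plausible shape for the context, assuming the rules engine matches flat key/value conditions (classify_time is a hypothetical helper; see the timezone sketch in Section 5.11):

def build_context(event, service, team) -> dict:
    """Flatten event + metadata into the keys that rule conditions refer to."""
    return {
        "severity": event.severity,                    # "P1" .. "P4"
        "tier": service.tier,                          # 1 = critical
        "time": classify_time(event.timestamp, team),  # "business_hours" or "outside_business_hours"
        "team_status": team.status,                    # "active", "merged", ...
    }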

5. Implementation Guide

5.1 Development Environment Setup

# Create project
mkdir escalation-engine && cd escalation-engine
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install pyyaml click requests pytz

# For testing PagerDuty/Slack integration
pip install python-dotenv

5.2 Project Structure

escalation-engine/
├── data/
│   ├── services.yaml
│   ├── teams.yaml
│   └── rules.yaml
├── src/
│   ├── __init__.py
│   ├── models.py       # Event, Service, Team, Action
│   ├── registry.py     # Service/Team lookup
│   ├── rules.py        # Rules engine
│   ├── engine.py       # Main escalation logic
│   ├── actions.py      # PagerDuty/Slack integration
│   └── cli.py          # Command-line interface
├── tests/
│   ├── test_rules.py
│   ├── test_engine.py
│   └── fixtures/
│       └── events/
│           ├── p1_after_hours.json
│           └── p3_business_hours.json
└── escalate           # Entry point

5.3 The Core Question You’re Answering

“In a crisis, does the system know how to find the right human without a human having to look it up?”

Manual escalation during a P1 incident is a failure of the operating model. The model should have predefined paths for every foreseeable failure.

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Mean Time to Acknowledge (MTTA)
    • How does escalation logic impact this metric?
    • What’s the industry benchmark for MTTA?
    • Book Reference: “Accelerate” by Nicole Forsgren
  2. The Incident Commander Role
    • When does a team-level issue become an org-level incident?
    • What’s the difference between IC and first responder?
    • Book Reference: “The Site Reliability Workbook” Ch. 9
  3. Alert Fatigue
    • What happens when on-call gets too many alerts?
    • How do you design escalation to prevent burnout?
    • Reference: PagerDuty operations guides

5.5 Questions to Guide Your Design

Before implementing, think through these:

Dependencies

  • If Service A depends on Service B, and A is failing, should B’s owner be paged?
  • What if B is also failing? Do you page both or just B?
  • How do you avoid “Cascade Paging” (paging everyone)?

Time and Context

  • Does the escalation change at 2 PM Tuesday vs. 2 AM Sunday?
  • What if the primary responder doesn’t answer in 15 minutes?
  • How do you handle holidays and PTO?

Failure Modes

  • What if PagerDuty is down?
  • What if the service metadata is missing?
  • What if the owning team no longer exists?

5.6 Thinking Exercise

The “Blame Game” Simulation

Trace a failure in a shared component (like a Load Balancer).

Scenario:

  1. Load Balancer starts dropping connections
  2. App Team A sees errors, pages LB team
  3. LB Team says “It’s your app generating bad requests”
  4. App Team A says “No, it’s the LB”
  5. 45 minutes later, still arguing

Questions:

  • Who is responsible for the Load Balancer?
  • If the app team pages the LB team, and the LB team says “it’s your app,” who breaks the tie?
  • Does your escalation logic include a “Final Arbiter” (like a CTO or Architect)?
  • How would you model this in your rules?

5.7 Hints in Layers

Hint 1: Map the Metadata

Every service needs at minimum:

service:
  owner_team_id: string
  criticality_tier: 1 | 2 | 3
  oncall_schedule: string

Hint 2: Define Rules as Data

Rules should be YAML, not hardcoded:

rule:
  conditions:
    - severity: P1
    - time: outside_business_hours
  actions:
    - page: primary
    - page: backup
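
A minimal sketch of evaluating conditions expressed this way against the flat context from Section 4.4 (exact-match semantics are an assumption; you may want operators such as thresholds later):

def rule_matches(rule: dict, context: dict) -> bool:
    """A rule matches when every condition key equals the corresponding context value."""
    for condition in rule.get("conditions", []):
        for key, expected in condition.items():
            if context.get(key) != expected:
                return False
    return True

def evaluate(rules: list, context: dict) -> list:
    """Return all matching rules; the engine merges their actions."""
    return [r for r in rules if rule_matches(r, context)]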

Hint 3: Handle Dependencies

Add depends_on to service metadata:

service:
  depends_on:
    - service_id: auth-service
      notify_on_failure: true  # INFORM, don't PAGE

Hint 4: Build Safety Rules

What if metadata is missing?

rule:
  name: fallback-unknown-service
  conditions:
    - service_metadata: missing
  actions:
    - page: platform_oncall
    - notify: "#unknown-service-alerts"

5.8 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do you design an on-call rotation that doesn’t burn people out?”
    • Follow-the-sun, secondary backups, max pages per shift, compensation
  2. “What is the difference between an alert and an incident?”
    • Alert = signal. Incident = declared event requiring coordination.
  3. “Explain the ‘Secondary’ escalation layer.”
    • Backup responder if primary doesn’t acknowledge within timeout
  4. “How do you handle ‘Silent Failures’ where no one gets paged?”
    • Canary alerts, synthetic monitoring, dead-man switches
  5. “What are the common pitfalls of automated escalation?”
    • Over-paging, alert fatigue, wrong ownership metadata, missing fallbacks

5.9 Books That Will Help

Topic                  | Book                            | Chapter
-----------------------|---------------------------------|----------------------------
Incident Management    | “SRE Book” (Google)             | Ch. 14: Managing Incidents
Response Strategies    | “The Site Reliability Workbook” | Ch. 9: Incident Response
On-Call Best Practices | “Accelerate”                    | Ch. 7

5.10 Implementation Phases

Phase 1: Data Model (3-4 hours)

  1. Define Service, Team, Rule, Event dataclasses
  2. Create sample services.yaml and teams.yaml
  3. Write loader functions

Phase 2: Rules Engine (4-5 hours)

  1. Implement condition matching
  2. Implement action resolution
  3. Handle rule priority/ordering

Phase 3: Decision Engine (3-4 hours)

  1. Combine registry + rules
  2. Build EscalationPlan output
  3. Add reasoning log
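
The reasoning log can be as simple as timestamped strings collected alongside the plan; the format below mirrors the sample output in Section 3.4 (class and method names are illustrative):

from datetime import datetime, timezone

class ReasoningLog:
    def __init__(self):
        self.lines = []

    def add(self, message: str):
        ts = datetime.now(timezone.utc).strftime("%H:%M:%S")
        self.lines.append(f"[{ts}] {message}")

    def render(self) -> str:
        return "\n".join(self.lines)

# Usage inside the engine:
#   log.add("Severity P1 + After Hours → Aggressive escalation")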

Phase 4: CLI & Testing (3-4 hours)

  1. Create CLI with click
  2. Add test events
  3. Verify output format
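
A minimal click entry point, assuming the module layout from Section 5.2 (parse_event and plan.render() are hypothetical helpers, not part of the spec):

import json
import click

from src.engine import escalate
from src.models import parse_event   # hypothetical helper

@click.command()
@click.option("--event", "event_path", required=True, type=click.Path(exists=True),
              help="Path to a JSON event file (see tests/fixtures/events/).")
def main(event_path):
    """Print the escalation plan for one event."""
    with open(event_path) as f:
        event = parse_event(json.load(f))
    plan = escalate(event)
    click.echo(plan.render())   # assumes the plan can render itself as text

if __name__ == "__main__":
    main()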

5.11 Key Implementation Decisions

Decision                | Option A      | Option B        | Recommendation
------------------------|---------------|-----------------|------------------------------------
Rule format             | YAML          | Python DSL      | YAML (non-dev editable)
Time handling           | UTC only      | Timezone-aware  | Timezone-aware (follow team TZ)
Dependency notification | Always notify | Only if healthy | Only if related to failure
PagerDuty integration   | Real API      | Mock            | Mock first, real integration later
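
For the time-handling decision, a sketch of a timezone-aware business-hours check using pytz (already in the pip install list); the team timezone and the 09:00-18:00 weekday window are assumptions:

from datetime import datetime
import pytz

def is_business_hours(ts_utc: datetime, team_tz: str = "America/New_York") -> bool:
    """True if the UTC timestamp falls on a weekday between 09:00 and 18:00 in the team's timezone."""
    local = ts_utc.astimezone(pytz.timezone(team_tz))
    return local.weekday() < 5 and 9 <= local.hour < 18

# 03:30 UTC on 2025-01-15 is 22:30 the previous evening in New York → after hours.
ts = datetime(2025, 1, 15, 3, 30, tzinfo=pytz.utc)
print(is_business_hours(ts))  # False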

6. Testing Strategy

Unit Tests

# Assuming the module layout from Section 5.2.
from src import engine
from src.models import Event

def test_p1_after_hours_pages_backup():
    event = Event(service="pay", severity="P1", time="03:00 UTC")
    plan = engine.escalate(event)
    assert len(plan.pages) == 2  # primary + backup

def test_missing_service_uses_fallback():
    event = Event(service="unknown", severity="P1")
    plan = engine.escalate(event)
    assert "platform-oncall" in plan.pages

def test_dependency_notified_not_paged():
    event = Event(service="checkout", severity="P1")
    plan = engine.escalate(event)
    # checkout depends on payments
    assert "payments" in plan.notifications
    assert "payments" not in plan.pages

Integration Tests

  • Load real YAML files
  • Process 100 sample events
  • Verify all produce valid plans
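
A sketch of that fixture-driven check with pytest (paths follow the project structure in Section 5.2; parse_event is a hypothetical helper and the assertions are examples of what “valid” might mean):

import json
from pathlib import Path
import pytest

from src.engine import escalate
from src.models import parse_event   # hypothetical helper

FIXTURES = sorted(Path("tests/fixtures/events").glob("*.json"))

@pytest.mark.parametrize("fixture", FIXTURES, ids=lambda p: p.name)
def test_every_fixture_produces_a_valid_plan(fixture):
    event = parse_event(json.loads(fixture.read_text()))
    plan = escalate(event)
    assert plan.pages, "every event must page at least one schedule"
    assert plan.runbook or plan.notifications  # some context must be attached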

Chaos Tests

  • Remove service from registry mid-run
  • Provide malformed event
  • Simulate PagerDuty timeout

7. Common Pitfalls & Debugging

Problem            | Symptom                    | Root Cause                | Fix
-------------------|----------------------------|---------------------------|----------------------------------
Wrong person paged | Reassignment rate > 20%    | Stale ownership data      | Automate registry sync
Everyone paged     | 10+ pages for one incident | Cascade from dependencies | Add “related_incident” dedup
No one paged       | Silent failures            | Missing fallback rules    | Add catch-all rule
Slow escalation    | MTTA > 10 min              | Complex rule evaluation   | Optimize rule engine, add caching
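
For the “Everyone paged” row, a minimal in-memory dedup sketch: suppress repeat pages for the same (service, error_type) signature inside a short window (the 10-minute window and the key choice are assumptions):

import time

_recent_pages = {}  # (service_id, error_type) -> timestamp of last page
DEDUP_WINDOW_SECONDS = 600

def should_page(service_id: str, error_type: str) -> bool:
    """Skip a page if we already paged for this signature inside the window."""
    key = (service_id, error_type)
    now = time.time()
    last = _recent_pages.get(key)
    if last is not None and now - last < DEDUP_WINDOW_SECONDS:
        return False
    _recent_pages[key] = now
    return True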

8. Extensions & Challenges

Extension 1: PagerDuty Integration

Actually page people using the PagerDuty REST API.

import os
import requests

API_KEY = os.environ["PAGERDUTY_API_KEY"]        # REST API token
FROM_EMAIL = os.environ["PAGERDUTY_FROM_EMAIL"]  # the API requires a From header (a valid user's email)

def page_pagerduty(service_id: str, message: str):
    # Incidents are created against a service ID (not a schedule ID);
    # the service's escalation policy determines who actually gets paged.
    response = requests.post(
        "https://api.pagerduty.com/incidents",
        json={
            "incident": {
                "type": "incident",
                "title": message,
                "service": {"id": service_id, "type": "service_reference"},
            }
        },
        headers={
            "Authorization": f"Token token={API_KEY}",
            "Content-Type": "application/json",
            "From": FROM_EMAIL,
        },
        timeout=10,
    )
    response.raise_for_status()

Extension 2: Slack Bot

Post escalation decisions to Slack in real-time.

Extension 3: Escalation Timeout

If primary doesn’t acknowledge in 5 minutes, auto-escalate to secondary.
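
A minimal sketch using threading.Timer; in practice you would often lean on PagerDuty's own escalation policies for this, and the ack-check and paging callbacks here are assumptions:

import threading

def schedule_auto_escalation(incident_id: str, is_acknowledged, page_secondary,
                             timeout_seconds: int = 300):
    """After timeout_seconds, page the secondary unless the incident was acknowledged."""
    def check():
        if not is_acknowledged(incident_id):
            page_secondary(incident_id)

    timer = threading.Timer(timeout_seconds, check)
    timer.daemon = True   # don't block process exit
    timer.start()
    return timer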

Extension 4: Post-Incident Analysis

Generate report showing all escalation decisions for a time period.


9. Real-World Connections

How Big Tech Does This:

  • PagerDuty: Event Intelligence for automatic routing
  • Google: Cascading on-call with automatic escalation
  • Netflix: “PagerDuty + Slack + custom routing”

Other Tools:

  • Opsgenie (commercial): Similar routing rules
  • Grafana OnCall: Open-source on-call management

10. Resources

  • PagerDuty incident response and operations guides (see Section 5.4)

11. Self-Assessment Checklist

Before considering this project complete, verify:

  • I can explain MTTA and MTTR
  • services.yaml has at least 5 services with dependencies
  • rules.yaml covers P1-P4 for business/after hours
  • Engine handles missing service gracefully (fallback)
  • Output includes reasoning log
  • I’ve tested with at least 10 different event scenarios
  • All pages include runbook and dashboard links

12. Submission / Completion Criteria

This project is complete when you have:

  1. services.yaml with 5+ services including dependencies
  2. teams.yaml with 3+ teams and on-call schedules
  3. rules.yaml with 8+ rules covering severity/time combinations
  4. CLI tool that processes events and outputs escalation plan
  5. Test suite with 10+ event fixtures
  6. Reasoning log showing decision trace

Previous Project: P03: Ownership Boundary Mapper
Next Project: P05: Platform-as-a-Product Blueprint