Project 4: The Escalation Logic Tree (Incident Design)

Build a programmable escalation engine that determines WHO gets paged based on service boundaries, time of day, and failure type.

Quick Reference

Attribute             | Value
----------------------|---------------------------------------------
Difficulty            | Intermediate
Time Estimate         | 1 Week (15-20 hours)
Primary Language      | Python
Alternative Languages | JavaScript, Go
Prerequisites         | Basic programming, understanding of on-call
Key Topics            | Incident Response, SRE, Rule Engines

1. Learning Objectives

By completing this project, you will:

  1. Design escalation paths that reflect organizational boundaries
  2. Implement a decision tree for routing alerts to correct teams
  3. Handle edge cases (timeouts, dependencies, cross-team outages)
  4. Reduce Mean Time to Acknowledge (MTTA) through automation
  5. Model the human side of incident response in code

2. Theoretical Foundation

2.1 Core Concepts

The Escalation Problem

MANUAL ESCALATION                   AUTOMATED ESCALATION
┌─────────────────────────────┐    ┌─────────────────────────────┐
│ Alert fires                 │    │ Alert fires                 │
│ → Who owns this?            │    │ → Lookup service owner      │
│ → Slack around asking       │    │ → Check time/severity       │
│ → Page someone who guesses  │    │ → Page correct on-call      │
│ → They redirect to actual   │    │ → Notify dependencies       │
│   owner                     │    │ → Done (2 minutes)          │
│ → 30 minutes later...       │    │                             │
└─────────────────────────────┘    └─────────────────────────────┘

Escalation Hierarchy

                    ┌─────────────────────────┐
                    │    INCIDENT COMMANDER   │
                    │    (Cross-team P1s)     │
                    └───────────┬─────────────┘
                                │
            ┌───────────────────┼───────────────────┐
            │                   │                   │
            ▼                   ▼                   ▼
    ┌───────────────┐   ┌───────────────┐   ┌───────────────┐
    │ Service Owner │   │ Dependency    │   │ Communications│
    │ (Primary)     │   │ Owners        │   │ Lead          │
    │               │   │ (Informed)    │   │               │
    └───────────────┘   └───────────────┘   └───────────────┘
            │
    ┌───────┴───────┐
    ▼               ▼
┌────────┐    ┌────────┐
│ Primary│    │ Backup │
│ On-call│    │ On-call│
└────────┘    └────────┘

Key Metrics

Metric              | Definition               | Target
--------------------|--------------------------|------------------
MTTA                | Mean Time to Acknowledge | < 5 minutes
MTTR                | Mean Time to Resolve     | Service dependent
Reassignment Rate   | % of alerts reassigned   | < 5%
False Positive Rate | Alerts that weren’t real | < 10%
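
MTTA falls straight out of timestamps the engine already records. A minimal sketch of computing it, assuming each audit-log record carries the fired and acknowledged times (field names are illustrative, not part of this spec):

from datetime import datetime

# Illustrative audit-log records; in practice these come from the engine's decision log.
alerts = [
    {"fired_at": "2025-01-15T03:30:00Z", "acked_at": "2025-01-15T03:33:10Z"},
    {"fired_at": "2025-01-15T09:12:00Z", "acked_at": "2025-01-15T09:14:30Z"},
]

def _parse(ts: str) -> datetime:
    # Accept the trailing "Z" used in the example events.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def mtta_minutes(records) -> float:
    """Mean Time to Acknowledge, in minutes, over acknowledged alerts."""
    deltas = [
        (_parse(r["acked_at"]) - _parse(r["fired_at"])).total_seconds()
        for r in records if r.get("acked_at")
    ]
    return sum(deltas) / len(deltas) / 60 if deltas else 0.0

print(f"MTTA: {mtta_minutes(alerts):.1f} min")  # MTTA: 2.8 min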

2.2 Why This Matters

Escalation is where team interfaces are tested under pressure. When systems fail:

  • Clear boundaries mean fast resolution
  • Unclear boundaries mean chaos and blame

The cost of slow escalation:

  • 1 minute of downtime for a tier-1 service = $10,000+ (typical e-commerce)
  • 30 minutes spent searching for the owner = $300,000 in lost revenue at that rate

2.3 Historical Context

  • ITIL Incident Management (1980s): Formalized IT operations
  • PagerDuty/OpsGenie (2010s): On-call automation at scale
  • SRE Movement (2016): “Site Reliability Engineering” book formalized practices

2.4 Common Misconceptions

Misconception                       | Reality
------------------------------------|---------------------------------------------------
“More alerts = more safety”         | Too many alerts cause fatigue; people ignore them
“Everyone should be paged”          | Page the right person, not everyone
“Escalation = punishment”           | Escalation is process, not blame
“We’ll figure it out in the moment” | 3 AM is not the time to design process

3. Project Specification

3.1 What You Will Build

A decision-tree simulator where you input an incident event and it outputs:

  1. Who to page (with specific contact info)
  2. Who to notify (dependencies, stakeholders)
  3. The reasoning behind the decision

3.2 Functional Requirements

  1. Service Metadata Input
    • Service ID, owning team, criticality tier
    • Dependencies (what this service calls)
    • On-call schedule link
  2. Decision Rules
    • Severity-based routing (P1 vs P4)
    • Time-based routing (business hours vs after hours)
    • Dependency-aware notification
  3. Escalation Tiers
    • Primary: First responder
    • Secondary: If primary doesn’t respond in X minutes
    • Incident Commander: If P1 crosses team boundaries
  4. Output
    • Specific person/schedule to page
    • Slack channels to notify
    • Runbook links to include in alert

3.3 Non-Functional Requirements

  • Decision must complete in < 100ms
  • Rules must be version-controlled (YAML)
  • Must be testable with synthetic events
  • Must log all decisions for post-mortem analysis
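
For the last requirement, one possible shape for a single decision record, written here as a Python dict (field names are illustrative, not prescribed by the spec):

# One audit-log record per escalation decision.
decision_record = {
    "event_id": "evt-20250115-0001",
    "service": "payment-gateway",
    "severity": "P1",
    "received_at": "2025-01-15T03:30:00Z",
    "matched_rules": ["p1-after-hours", "tier1-always"],
    "pages": ["payments-oncall"],
    "notifications": ["#incidents-active", "#incident-commanders"],
    "decision_ms": 12,  # evidence for the < 100ms requirement
}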

3.4 Example Usage / Output

Input Event:

{
  "service": "payment-gateway",
  "error_type": "5xx_errors",
  "severity": "P1",
  "timestamp": "2025-01-15T03:30:00Z",
  "region": "us-east-1",
  "metrics": {
    "error_rate": 0.45,
    "latency_p99": 5200
  }
}

Output:

$ ./escalate --event event.json

=== ESCALATION DECISION ===

Timestamp: 2025-01-15 03:30:00 UTC (AFTER HOURS)
Service: payment-gateway
Severity: P1 (Critical)

[STEP 1] Lookup Owner
  → Service Owner: team-payments
  → Status: Active

[STEP 2] Severity Check
  → P1 detected → Notify Primary On-call immediately
  → Primary On-call: @jane-doe (PagerDuty: payments-oncall)

[STEP 3] Time Check
  → After hours (03:30 UTC) → Include backup
  → Backup On-call: @bob-smith

[STEP 4] Dependency Check
  → payment-gateway depends on: auth-service, user-db
  → Notifying downstream owners as INFORMED:
    - auth-service → team-identity (#auth-oncall)
    - user-db → team-data (#data-oncall)

[STEP 5] Criticality Check
  → Service is Tier-1 → Notify Incident Commander pool
  → IC Pool: #incident-commanders

=== ACTIONS ===

PAGING:
  1. PagerDuty → payments-oncall (Primary: @jane-doe)

NOTIFYING (Slack):
  1. #incidents-active → New P1: payment-gateway 5xx errors
  2. #payments-oncall → Your service, please join incident
  3. #auth-oncall → FYI: payment-gateway (depends on you) is down
  4. #incident-commanders → P1 Alert: Tier-1 service down

ATTACHING:
  - Runbook: https://wiki.example.com/runbooks/payment-5xx
  - Dashboard: https://grafana.example.com/d/payments

=== REASONING LOG ===
[03:30:00] Event received: P1 on payment-gateway
[03:30:00] Owner lookup: team-payments (ACTIVE)
[03:30:00] Severity P1 + After Hours → Aggressive escalation
[03:30:00] Tier-1 service → IC notification required
[03:30:00] 2 dependencies found → Notify as INFORMED

3.5 Real World Outcome

After implementing this system:

  • MTTA drops from 15 minutes to < 2 minutes
  • Reassignment rate drops from 30% to < 5%
  • On-call engineers get context (runbook, dashboard) in the alert

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                     ESCALATION ENGINE                           │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  Event Input  │     │ Service       │     │ Rules Engine  │
│  (Alert)      │     │ Registry      │     │ (Decision     │
│               │     │ (Metadata)    │     │  Logic)       │
└───────────────┘     └───────────────┘     └───────────────┘
        │                     │                     │
        └─────────────────────┼─────────────────────┘
                              │
                              ▼
                    ┌───────────────────┐
                    │  Decision Engine  │
                    │                   │
                    │  1. Lookup owner  │
                    │  2. Apply rules   │
                    │  3. Build actions │
                    └─────────┬─────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  PagerDuty    │     │  Slack        │     │  Audit Log    │
│  API          │     │  Webhook      │     │  (Decisions)  │
└───────────────┘     └───────────────┘     └───────────────┘

4.2 Key Components

  1. Event Parser: Normalizes incoming alert data (see the sketch after this list)
  2. Service Registry: Metadata about services (owner, tier, deps)
  3. Rules Engine: Evaluates conditions and selects actions
  4. Action Executor: Sends pages, posts to Slack
  5. Audit Logger: Records all decisions for review
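
For the Event Parser, a minimal sketch that normalizes the raw alert payload from Section 3.4 into an internal Event (the field names follow event.service_id as used in Section 4.4; everything else is an assumption):

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Event:
    service_id: str
    error_type: str
    severity: str            # "P1" .. "P4"
    timestamp: datetime
    region: str = ""
    metrics: dict = field(default_factory=dict)

def parse_event(raw: dict) -> Event:
    """Normalize a raw alert payload (see Section 3.4) into an Event."""
    return Event(
        service_id=raw["service"],
        error_type=raw.get("error_type", "unknown"),
        severity=raw.get("severity", "P3"),
        timestamp=datetime.fromisoformat(raw["timestamp"].replace("Z", "+00:00")),
        region=raw.get("region", ""),
        metrics=raw.get("metrics", {}),
    )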

4.3 Data Structures

# services.yaml
services:
  - id: payment-gateway
    name: Payment Gateway
    owner: team-payments
    tier: 1  # 1 = critical, 2 = important, 3 = normal
    oncall_schedule: payments-oncall
    slack_channel: "#payments-alerts"
    dependencies:
      - auth-service
      - user-db
    runbook: https://wiki.example.com/runbooks/payment
    dashboard: https://grafana.example.com/d/payments

# rules.yaml
rules:
  - name: p1-after-hours
    conditions:
      - severity: P1
      - time: outside_business_hours
    actions:
      - page: primary_oncall
      - page: backup_oncall
      - notify: incident_commanders
      - notify: dependencies_informed

  - name: p1-business-hours
    conditions:
      - severity: P1
      - time: business_hours
    actions:
      - page: primary_oncall
      - notify: team_channel
      - notify: dependencies_informed

  - name: tier1-always
    conditions:
      - tier: 1
    actions:
      - notify: incident_commanders
      - escalation_timeout: 5m → secondary_oncall

# teams.yaml
teams:
  - id: team-payments
    name: Payments Team
    primary_oncall: payments-oncall  # PagerDuty schedule ID
    backup_oncall: payments-backup
    slack: "#team-payments"
    lead: "@payments-lead"

4.4 Algorithm Overview

def escalate(event: Event) -> EscalationPlan:
    # Step 1: Get service metadata
    service = registry.lookup(event.service_id)
    if not service:
        return default_escalation(event)

    # Step 2: Get owning team
    team = registry.get_team(service.owner)
    if team.status != "active":
        team = registry.get_team(team.merged_into)

    # Step 3: Evaluate rules
    context = build_context(event, service, team)
    matched_rules = rules_engine.evaluate(context)

    # Step 4: Build action plan
    actions = []
    for rule in matched_rules:
        actions.extend(rule.actions)

    # Step 5: Resolve actions to specific targets
    plan = EscalationPlan()
    for action in actions:
        if action.type == "page":
            schedule = resolve_schedule(action.target, team)
            plan.pages.append(schedule)
        elif action.type == "notify":
            channel = resolve_channel(action.target, service, team)
            plan.notifications.append(channel)

    # Step 6: Add context
    plan.runbook = service.runbook
    plan.dashboard = service.dashboard

    # Step 7: Log decision
    audit_log.record(event, plan, matched_rules)

    return plan
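
build_context and the resolve_* helpers are deliberately abstract above. One plausible shape for the context, assuming the rules engine matches flat key/value conditions (classify_time is a hypothetical helper; see the timezone sketch in Section 5.11):

def build_context(event, service, team) -> dict:
    """Flatten event + metadata into the keys that rule conditions refer to."""
    return {
        "severity": event.severity,                    # "P1" .. "P4"
        "tier": service.tier,                          # 1 = critical
        "time": classify_time(event.timestamp, team),  # "business_hours" or "outside_business_hours"
        "team_status": team.status,                    # "active", "merged", ...
    }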

5. Implementation Guide

5.1 Development Environment Setup

# Create project
mkdir escalation-engine && cd escalation-engine
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install pyyaml click requests pytz

# For testing PagerDuty/Slack integration
pip install python-dotenv

5.2 Project Structure

escalation-engine/
├── data/
│   ├── services.yaml
│   ├── teams.yaml
│   └── rules.yaml
├── src/
│   ├── __init__.py
│   ├── models.py       # Event, Service, Team, Action
│   ├── registry.py     # Service/Team lookup
│   ├── rules.py        # Rules engine
│   ├── engine.py       # Main escalation logic
│   ├── actions.py      # PagerDuty/Slack integration
│   └── cli.py          # Command-line interface
├── tests/
│   ├── test_rules.py
│   ├── test_engine.py
│   └── fixtures/
│       └── events/
│           ├── p1_after_hours.json
│           └── p3_business_hours.json
└── escalate           # Entry point

5.3 The Core Question You’re Answering

“In a crisis, does the system know how to find the right human without a human having to look it up?”

Manual escalation during a P1 incident is a failure of the operating model. The model should have predefined paths for every foreseeable failure.

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Mean Time to Acknowledge (MTTA)
    • How does escalation logic impact this metric?
    • What’s the industry benchmark for MTTA?
    • Book Reference: “Accelerate” by Nicole Forsgren
  2. The Incident Commander Role
    • When does a team-level issue become an org-level incident?
    • What’s the difference between IC and first responder?
    • Book Reference: “The Site Reliability Workbook” Ch. 9
  3. Alert Fatigue
    • What happens when on-call gets too many alerts?
    • How do you design escalation to prevent burnout?
    • Reference: PagerDuty operations guides

5.5 Questions to Guide Your Design

Before implementing, think through these:

Dependencies

  • If Service A depends on Service B, and A is failing, should B’s owner be paged?
  • What if B is also failing? Do you page both or just B?
  • How do you avoid “Cascade Paging” (paging everyone)?

Time and Context

  • Does the escalation change at 2 PM Tuesday vs. 2 AM Sunday?
  • What if the primary responder doesn’t answer in 15 minutes?
  • How do you handle holidays and PTO?

Failure Modes

  • What if PagerDuty is down?
  • What if the service metadata is missing?
  • What if the owning team no longer exists?

5.6 Thinking Exercise

The “Blame Game” Simulation

Trace a failure in a shared component (like a Load Balancer).

Scenario:

  1. Load Balancer starts dropping connections
  2. App Team A sees errors, pages LB team
  3. LB Team says “It’s your app generating bad requests”
  4. App Team A says “No, it’s the LB”
  5. 45 minutes later, still arguing

Questions:

  • Who is responsible for the Load Balancer?
  • If the app team pages the LB team, and the LB team says “it’s your app,” who breaks the tie?
  • Does your escalation logic include a “Final Arbiter” (like a CTO or Architect)?
  • How would you model this in your rules?

5.7 Hints in Layers

Hint 1: Map the Metadata

Every service needs at minimum:

service:
  owner_team_id: string
  criticality_tier: 1 | 2 | 3
  oncall_schedule: string

Hint 2: Define Rules as Data

Rules should be YAML, not hardcoded:

rule:
  conditions:
    - severity: P1
    - time: outside_business_hours
  actions:
    - page: primary
    - page: backup
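
A minimal sketch of evaluating conditions expressed this way against the flat context from Section 4.4 (exact-match semantics are an assumption; you may want operators such as thresholds later):

def rule_matches(rule: dict, context: dict) -> bool:
    """A rule matches when every condition key equals the corresponding context value."""
    for condition in rule.get("conditions", []):
        for key, expected in condition.items():
            if context.get(key) != expected:
                return False
    return True

def evaluate(rules: list, context: dict) -> list:
    """Return all matching rules; the engine merges their actions."""
    return [r for r in rules if rule_matches(r, context)]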

Hint 3: Handle Dependencies

Add depends_on to service metadata:

service:
  depends_on:
    - service_id: auth-service
      notify_on_failure: true  # INFORM, don't PAGE

Hint 4: Build Safety Rules

What if metadata is missing?

rule:
  name: fallback-unknown-service
  conditions:
    - service_metadata: missing
  actions:
    - page: platform_oncall
    - notify: "#unknown-service-alerts"

5.8 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do you design an on-call rotation that doesn’t burn people out?”
    • Follow-the-sun, secondary backups, max pages per shift, compensation
  2. “What is the difference between an alert and an incident?”
    • Alert = signal. Incident = declared event requiring coordination.
  3. “Explain the ‘Secondary’ escalation layer.”
    • Backup responder if primary doesn’t acknowledge within timeout
  4. “How do you handle ‘Silent Failures’ where no one gets paged?”
    • Canary alerts, synthetic monitoring, dead-man switches
  5. “What are the common pitfalls of automated escalation?”
    • Over-paging, alert fatigue, wrong ownership metadata, missing fallbacks

5.9 Books That Will Help

Topic                  | Book                            | Chapter
-----------------------|---------------------------------|----------------------------
Incident Management    | “SRE Book” (Google)             | Ch. 14: Managing Incidents
Response Strategies    | “The Site Reliability Workbook” | Ch. 9: Incident Response
On-Call Best Practices | “Accelerate”                    | Ch. 7

5.10 Implementation Phases

Phase 1: Data Model (3-4 hours)

  1. Define Service, Team, Rule, Event dataclasses
  2. Create sample services.yaml and teams.yaml
  3. Write loader functions

Phase 2: Rules Engine (4-5 hours)

  1. Implement condition matching
  2. Implement action resolution
  3. Handle rule priority/ordering

Phase 3: Decision Engine (3-4 hours)

  1. Combine registry + rules
  2. Build EscalationPlan output
  3. Add reasoning log
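
The reasoning log can be as simple as timestamped strings collected alongside the plan; the format below mirrors the sample output in Section 3.4 (class and method names are illustrative):

from datetime import datetime, timezone

class ReasoningLog:
    def __init__(self):
        self.lines = []

    def add(self, message: str):
        ts = datetime.now(timezone.utc).strftime("%H:%M:%S")
        self.lines.append(f"[{ts}] {message}")

    def render(self) -> str:
        return "\n".join(self.lines)

# Usage inside the engine:
#   log.add("Severity P1 + After Hours → Aggressive escalation")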

Phase 4: CLI & Testing (3-4 hours)

  1. Create CLI with click
  2. Add test events
  3. Verify output format
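
A minimal click entry point, assuming the module layout from Section 5.2 (parse_event and plan.render() are hypothetical helpers, not part of the spec):

import json
import click

from src.engine import escalate
from src.models import parse_event   # hypothetical helper

@click.command()
@click.option("--event", "event_path", required=True, type=click.Path(exists=True),
              help="Path to a JSON event file (see tests/fixtures/events/).")
def main(event_path):
    """Print the escalation plan for one event."""
    with open(event_path) as f:
        event = parse_event(json.load(f))
    plan = escalate(event)
    click.echo(plan.render())   # assumes the plan can render itself as text

if __name__ == "__main__":
    main()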

5.11 Key Implementation Decisions

Decision                | Option A      | Option B        | Recommendation
------------------------|---------------|-----------------|------------------------------------
Rule format             | YAML          | Python DSL      | YAML (non-dev editable)
Time handling           | UTC only      | Timezone-aware  | Timezone-aware (follow team TZ)
Dependency notification | Always notify | Only if healthy | Only if related to failure
PagerDuty integration   | Real API      | Mock            | Mock first, real integration later
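
For the time-handling decision, a sketch of a timezone-aware business-hours check using pytz (already in the pip install list); the team timezone and the 09:00-18:00 weekday window are assumptions:

from datetime import datetime
import pytz

def is_business_hours(ts_utc: datetime, team_tz: str = "America/New_York") -> bool:
    """True if the UTC timestamp falls on a weekday between 09:00 and 18:00 in the team's timezone."""
    local = ts_utc.astimezone(pytz.timezone(team_tz))
    return local.weekday() < 5 and 9 <= local.hour < 18

# 03:30 UTC on 2025-01-15 is 22:30 the previous evening in New York → after hours.
ts = datetime(2025, 1, 15, 3, 30, tzinfo=pytz.utc)
print(is_business_hours(ts))  # False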

6. Testing Strategy

Unit Tests

# Assuming the module layout from Section 5.2.
from src import engine
from src.models import Event

def test_p1_after_hours_pages_backup():
    event = Event(service="pay", severity="P1", time="03:00 UTC")
    plan = engine.escalate(event)
    assert len(plan.pages) == 2  # primary + backup

def test_missing_service_uses_fallback():
    event = Event(service="unknown", severity="P1")
    plan = engine.escalate(event)
    assert "platform-oncall" in plan.pages

def test_dependency_notified_not_paged():
    event = Event(service="checkout", severity="P1")
    plan = engine.escalate(event)
    # checkout depends on payments
    assert "payments" in plan.notifications
    assert "payments" not in plan.pages

Integration Tests

  • Load real YAML files
  • Process 100 sample events
  • Verify all produce valid plans
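
A sketch of that fixture-driven check with pytest (paths follow the project structure in Section 5.2; parse_event is a hypothetical helper and the assertions are examples of what “valid” might mean):

import json
from pathlib import Path
import pytest

from src.engine import escalate
from src.models import parse_event   # hypothetical helper

FIXTURES = sorted(Path("tests/fixtures/events").glob("*.json"))

@pytest.mark.parametrize("fixture", FIXTURES, ids=lambda p: p.name)
def test_every_fixture_produces_a_valid_plan(fixture):
    event = parse_event(json.loads(fixture.read_text()))
    plan = escalate(event)
    assert plan.pages, "every event must page at least one schedule"
    assert plan.runbook or plan.notifications  # some context must be attached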

Chaos Tests

  • Remove service from registry mid-run
  • Provide malformed event
  • Simulate PagerDuty timeout

7. Common Pitfalls & Debugging

Problem            | Symptom                    | Root Cause                | Fix
-------------------|----------------------------|---------------------------|----------------------------------
Wrong person paged | Reassignment rate > 20%    | Stale ownership data      | Automate registry sync
Everyone paged     | 10+ pages for one incident | Cascade from dependencies | Add “related_incident” dedup
No one paged       | Silent failures            | Missing fallback rules    | Add catch-all rule
Slow escalation    | MTTA > 10 min              | Complex rule evaluation   | Optimize rule engine, add caching
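
For the “Everyone paged” row, a minimal in-memory dedup sketch: suppress repeat pages for the same (service, error_type) signature inside a short window (the 10-minute window and the key choice are assumptions):

import time

_recent_pages = {}  # (service_id, error_type) -> timestamp of last page
DEDUP_WINDOW_SECONDS = 600

def should_page(service_id: str, error_type: str) -> bool:
    """Skip a page if we already paged for this signature inside the window."""
    key = (service_id, error_type)
    now = time.time()
    last = _recent_pages.get(key)
    if last is not None and now - last < DEDUP_WINDOW_SECONDS:
        return False
    _recent_pages[key] = now
    return True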

8. Extensions & Challenges

Extension 1: PagerDuty Integration

Actually page people using the PagerDuty REST API.

import os
import requests

API_KEY = os.environ["PAGERDUTY_API_KEY"]        # REST API token
FROM_EMAIL = os.environ["PAGERDUTY_FROM_EMAIL"]  # the API requires a From header (a valid user's email)

def page_pagerduty(service_id: str, message: str):
    # Incidents are created against a service ID (not a schedule ID);
    # the service's escalation policy determines who actually gets paged.
    response = requests.post(
        "https://api.pagerduty.com/incidents",
        json={
            "incident": {
                "type": "incident",
                "title": message,
                "service": {"id": service_id, "type": "service_reference"},
            }
        },
        headers={
            "Authorization": f"Token token={API_KEY}",
            "Content-Type": "application/json",
            "From": FROM_EMAIL,
        },
        timeout=10,
    )
    response.raise_for_status()

Extension 2: Slack Bot

Post escalation decisions to Slack in real-time.

Extension 3: Escalation Timeout

If primary doesn’t acknowledge in 5 minutes, auto-escalate to secondary.
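
A minimal sketch using threading.Timer; in practice you would often lean on PagerDuty's own escalation policies for this, and the ack-check and paging callbacks here are assumptions:

import threading

def schedule_auto_escalation(incident_id: str, is_acknowledged, page_secondary,
                             timeout_seconds: int = 300):
    """After timeout_seconds, page the secondary unless the incident was acknowledged."""
    def check():
        if not is_acknowledged(incident_id):
            page_secondary(incident_id)

    timer = threading.Timer(timeout_seconds, check)
    timer.daemon = True   # don't block process exit
    timer.start()
    return timer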

Extension 4: Post-Incident Analysis

Generate report showing all escalation decisions for a time period.


9. Real-World Connections

How Big Tech Does This:

  • PagerDuty: Event Intelligence for automatic routing
  • Google: Cascading on-call with automatic escalation
  • Netflix: “PagerDuty + Slack + custom routing”

Other Tools:

  • Opsgenie (commercial): Similar routing rules
  • Grafana OnCall: Open-source on-call management

10. Resources

  • PagerDuty incident response and operations guides (see Section 5.4)

11. Self-Assessment Checklist

Before considering this project complete, verify:

  • I can explain MTTA and MTTR
  • services.yaml has at least 5 services with dependencies
  • rules.yaml covers P1-P4 for business/after hours
  • Engine handles missing service gracefully (fallback)
  • Output includes reasoning log
  • I’ve tested with at least 10 different event scenarios
  • All pages include runbook and dashboard links

12. Submission / Completion Criteria

This project is complete when you have:

  1. services.yaml with 5+ services including dependencies
  2. teams.yaml with 3+ teams and on-call schedules
  3. rules.yaml with 8+ rules covering severity/time combinations
  4. CLI tool that processes events and outputs escalation plan
  5. Test suite with 10+ event fixtures
  6. Reasoning log showing decision trace

Previous Project: P03: Ownership Boundary Mapper
Next Project: P05: Platform-as-a-Product Blueprint