Project 10: Incident Response Battle Cards

Create highly condensed 1-page guides for specific incident scenarios that define exactly who leads, who to notify, and the first diagnostic steps.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Beginner |
| Time Estimate | Weekend (8-12 hours) |
| Primary Language | Markdown / HTML |
| Alternative Languages | JavaScript (for interactive cards) |
| Prerequisites | Basic incident management understanding |
| Key Topics | Incident Command System, OODA Loop, Crisis Communication |

1. Learning Objectives

By completing this project, you will:

  1. Codify incident response patterns into reusable templates
  2. Define clear roles for crisis situations
  3. Reduce cognitive load during high-stress moments
  4. Create a scalable system for documenting runbooks
  5. Enable consistent response regardless of who is on-call

2. Theoretical Foundation

2.1 Core Concepts

The Incident Response Problem

WITHOUT BATTLE CARDS                WITH BATTLE CARDS
┌─────────────────────────────┐    ┌─────────────────────────────┐
│ 3:00 AM: Alert fires        │    │ 3:00 AM: Alert fires        │
│ 3:05 AM: "What do I do?"   │    │ 3:02 AM: Pull battle card   │
│ 3:10 AM: Searching Slack   │    │ 3:03 AM: Follow Step 1      │
│ 3:20 AM: "Who do I page?"  │    │ 3:05 AM: Page correct team  │
│ 3:35 AM: Wrong team paged  │    │ 3:10 AM: Issue identified   │
│ 4:00 AM: Still searching   │    │ 3:20 AM: Incident resolved  │
│ 4:30 AM: Finally start fix │    │                             │
└─────────────────────────────┘    └─────────────────────────────┘

The OODA Loop

Military decision-making framework applied to incidents:

        ┌───────────────────────────────────────────────┐
        │                                               │
        ▼                                               │
   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
   │ OBSERVE │───►│ ORIENT  │───►│ DECIDE  │───►│   ACT   │
   │         │    │         │    │         │    │         │
   │ What's  │    │ What    │    │ What    │    │ Execute │
   │ happen- │    │ does it │    │ should  │    │ the     │
   │ ing?    │    │ mean?   │    │ I do?   │    │ action  │
   └─────────┘    └─────────┘    └─────────┘    └─────────┘
        ▲                                               │
        │                                               │
        └───────────────────────────────────────────────┘

Battle Cards accelerate the ORIENT and DECIDE phases: the card already states what the symptoms usually mean and which action to take first.

Incident Command System (ICS)

Standardized roles during incidents:

| Role | Responsibility |
|------|----------------|
| Incident Commander (IC) | Coordinates response, makes decisions |
| Communications Lead | Updates stakeholders, manages channels |
| Operations Lead | Executes technical fixes |
| Scribe | Documents timeline and actions |

                    ┌─────────────────────┐
                    │ INCIDENT COMMANDER  │
                    │ (Decision Maker)    │
                    └─────────┬───────────┘
                              │
          ┌───────────────────┼───────────────────┐
          │                   │                   │
          ▼                   ▼                   ▼
   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
   │    COMMS    │     │   OPS LEAD  │     │   SCRIBE    │
   │   (Updates) │     │ (Fix it)    │     │  (Document) │
   └─────────────┘     └─────────────┘     └─────────────┘

2.2 Why This Matters

At 3 AM, under stress:

  • Memory fails
  • Judgment is impaired
  • Procedures are forgotten
  • Panic spreads

Battle Cards provide:

  • Immediate starting point
  • Reduced decision fatigue
  • Consistent response quality
  • Lower Mean Time to Resolve (MTTR)

2.3 Historical Context

  • Incident Command System: Developed for firefighting (1970s)
  • Aviation Checklists: Proven to prevent errors
  • DevOps Runbooks: Digital evolution of checklists
  • Battle Cards: Compact, scannable format for high-stress situations

2.4 Common Misconceptions

| Misconception | Reality |
|---------------|---------|
| “We know what to do” | You won’t at 3 AM after 4 hours of sleep |
| “Every incident is unique” | 80% follow predictable patterns |
| “Documentation slows us down” | Searching slows you down more |
| “Senior devs don’t need this” | Senior devs are the source of knowledge, so capture it |

3. Project Specification

3.1 What You Will Build

  1. Battle Card Template: Standardized format for incident scenarios
  2. Card Collection: 5-10 cards for common incidents
  3. Quick Access System: Physical deck or digital lookup
  4. Update Process: How cards get revised after incidents

3.2 Functional Requirements

  1. Card Format
    • Scenario title (what’s happening)
    • Primary lead (who runs point)
    • First 3 diagnostic steps
    • Notification list
    • Escalation criteria
    • Key links (dashboards, runbooks)
  2. Accessibility
    • Accessible in < 30 seconds
    • Works offline (if possible)
    • Scannable at a glance
    • No login required during incident
  3. Maintenance
    • Review after every major incident
    • Quarterly audit for accuracy
    • Clear ownership of each card

3.3 Non-Functional Requirements

  • Each card must fit on one page (or one screen)
  • Text must be readable under stress (large fonts, clear hierarchy)
  • Cards must be available in incident Slack channel
  • Must support version history

3.4 Example Usage / Output

Battle Card Example:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃  🔴 BATTLE CARD: Checkout Service 5xx Errors                    ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃                                                                 ┃
┃  SCENARIO                                                       ┃
┃  Checkout service returning HTTP 5xx errors to customers.       ┃
┃  Customers cannot complete purchases.                           ┃
┃                                                                 ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃  ROLES                                                          ┃
┃  ─────                                                          ┃
┃  Primary Lead:  Checkout Team On-call                           ┃
┃  Comms Lead:    SRE Lead (or IC on rotation)                    ┃
┃  Escalation:    If no response in 10 min → Page Manager         ┃
┃                                                                 ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃  FIRST 3 STEPS (Do these in order)                              ┃
┃  ─────────────────────────────────                              ┃
┃  1. Check RDS Latency                                           ┃
┃     📊 grafana.internal/d/checkout-rds                          ┃
┃     ✓ Normal: < 50ms  ⚠️ High: 50-200ms  🔴 Critical: > 200ms   ┃
┃                                                                 ┃
┃  2. Verify Redis Connection Pool                                ┃
┃     📊 grafana.internal/d/checkout-redis                        ┃
┃     ✓ Available: > 50  ⚠️ Low: 10-50  🔴 Exhausted: < 10        ┃
┃                                                                 ┃
┃  3. Check Recent Deployments                                    ┃
┃     📊 argocd.internal/applications/checkout                    ┃
┃     → If deployment in last 2 hours: ROLLBACK FIRST             ┃
┃                                                                 ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃  IF DB LATENCY > 200ms                                          ┃
┃  ────────────────────                                           ┃
┃  → Scale checkout pods to 10:                                   ┃
┃    kubectl scale deploy/checkout --replicas=10                  ┃
┃  → If no improvement: Failover to RDS replica                   ┃
┃    📖 wiki.internal/runbooks/rds-failover                       ┃
┃                                                                 ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃  NOTIFY                                                         ┃
┃  ──────                                                         ┃
┃  At minute 0:    Post to #incidents-active                      ┃
┃  At minute 10:   Update in #checkout-oncall                     ┃
┃  At minute 30:   Escalate to Engineering Manager if unresolved  ┃
┃  At minute 60:   Notify VP Engineering                          ┃
┃                                                                 ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃  LINKS                                                          ┃
┃  ─────                                                          ┃
┃  🔗 Full Runbook:     wiki.internal/runbooks/checkout-5xx       ┃
┃  📊 Dashboard:        grafana.internal/d/checkout-overview      ┃
┃  📞 PagerDuty:        app.pagerduty.com/services/checkout       ┃
┃  💬 Slack:            #checkout-oncall                          ┃
┃                                                                 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
                        Last Updated: 2025-01-15
                        Owner: @checkout-team-lead

Card Collection Index:

# Incident Battle Cards Index

## Critical (Tier 1 Services)

| ID | Scenario | Primary Lead | Est. Impact |
|----|----------|--------------|-------------|
| BC-001 | [Checkout 5xx Errors](./bc-001-checkout-5xx.md) | Checkout On-call | Revenue stop |
| BC-002 | [Payment Gateway Down](./bc-002-payment-down.md) | Payments On-call | Revenue stop |
| BC-003 | [Database Failover](./bc-003-db-failover.md) | DBA On-call | All services |
| BC-004 | [CDN Outage](./bc-004-cdn-outage.md) | Platform On-call | All frontend |

## Important (Tier 2 Services)

| ID | Scenario | Primary Lead | Est. Impact |
|----|----------|--------------|-------------|
| BC-010 | [Search Degraded](./bc-010-search-slow.md) | Search On-call | UX impact |
| BC-011 | [Email Delays](./bc-011-email-delays.md) | Comms On-call | Customer comms |

## Security

| ID | Scenario | Primary Lead | Est. Impact |
|----|----------|--------------|-------------|
| BC-050 | [Credential Leak](./bc-050-cred-leak.md) | Security On-call | Data breach |
| BC-051 | [DDoS Attack](./bc-051-ddos.md) | Platform On-call | Availability |
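
An index like the one above tends to drift from the card files if it is maintained by hand. As a rough sketch, it could be generated instead. The code assumes each card file is named `bc-NNN-*.md` and starts with a `# ` title line; the function name `build_index` is hypothetical, and the Primary Lead and Est. Impact columns are left out because they would need extra metadata in each card.

```python
import re
from pathlib import Path

def build_index(cards_dir: Path) -> str:
    """Build a Markdown table with one row per card file under cards_dir."""
    lines = ["| ID | Scenario |", "|----|----------|"]
    for path in sorted(cards_dir.rglob("bc-*.md"), key=lambda p: p.name):
        text_lines = path.read_text(encoding="utf-8").splitlines()
        # Assumption: the first line is the card title, e.g. "# Checkout 5xx Errors".
        title = text_lines[0].lstrip("#").strip() if text_lines else path.stem
        id_match = re.match(r"bc-\d+", path.name)
        card_id = (id_match.group(0) if id_match else path.stem).upper()
        rel = path.relative_to(cards_dir).as_posix()
        lines.append(f"| {card_id} | [{title}](./{rel}) |")
    return "\n".join(lines)
```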

3.5 Real World Outcome

After implementing Battle Cards:

  • On-call engineers have confidence starting incident response
  • MTTR decreases because the right actions happen first
  • Fewer escalations due to “I don’t know what to do”
  • New team members can respond effectively sooner

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                    BATTLE CARD SYSTEM                           │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  CARD REPO    │     │  QUICK ACCESS │     │  REVIEW CYCLE │
│  (Git)        │     │               │     │               │
│               │     │ - Slack bot   │     │ - Post-       │
│  Markdown     │     │ - Wiki page   │     │   incident    │
│  templates    │     │ - Mobile app  │     │ - Quarterly   │
└───────────────┘     └───────────────┘     └───────────────┘

4.2 Key Components

  1. Card Template: Standardized Markdown format
  2. Card Repository: Git repo with all cards
  3. Access Mechanism: Slack bot or pinned messages
  4. Review Process: Update after incidents

4.3 Data Structures

# battle-card-schema.yaml
card:
  id: string              # BC-001
  title: string           # Checkout 5xx Errors
  severity: tier-1 | tier-2 | tier-3
  scenario:
    description: string   # What's happening
    symptoms: list        # How you know it's this

  roles:
    primary_lead: string  # Team or rotation
    comms_lead: string
    escalation_path: list

  first_steps:
    - number: 1
      action: string
      link: url
      thresholds:
        normal: string
        warning: string
        critical: string

  conditional_actions:
    - condition: string
      actions: list

  notification_schedule:
    - minute: 0
      action: string
      channel: string
    - minute: 10
      action: string
      channel: string

  links:
    runbook: url
    dashboard: url
    pagerduty: url
    slack_channel: string

  metadata:
    owner: string
    last_updated: date
    last_used: date
    review_due: date
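
If cards are stored as structured data following the schema above, a small check can reject malformed cards before they are published. This is a sketch under assumptions: the card is already loaded into a plain `dict` (the YAML parsing step is not shown), `conditional_actions` is treated as optional, and the function name `check_card` is made up for illustration.

```python
ALLOWED_SEVERITIES = {"tier-1", "tier-2", "tier-3"}
REQUIRED_KEYS = {
    "id", "title", "severity", "scenario", "roles",
    "first_steps", "notification_schedule", "links", "metadata",
}

def check_card(card: dict) -> list[str]:
    """Return schema violations for one card dict (empty list means valid)."""
    errors = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - card.keys())]

    if card.get("severity") not in ALLOWED_SEVERITIES:
        errors.append(f"invalid severity: {card.get('severity')!r}")

    # Notifications should be listed in the order they fire.
    minutes = [entry["minute"] for entry in card.get("notification_schedule", [])]
    if minutes != sorted(minutes):
        errors.append("notification_schedule is not in ascending minute order")

    # Every card needs a named owner.
    if not card.get("metadata", {}).get("owner"):
        errors.append("metadata.owner is empty")
    return errors
```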

4.4 Algorithm Overview

Card Retrieval:

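from dataclasses import dataclass

@dataclass
class BattleCard:
    id: str
    scenario: str          # Free-text scenario description
    specificity: int = 0   # Higher means a narrower, more specific card

def extract_keywords(incident_type: str) -> list[str]:
    # Naive tokenizer: lowercase the text and split on whitespace
    return incident_type.lower().split()

def matches(card: BattleCard, incident_type: str) -> bool:
    # A card matches when any incident keyword appears in its scenario text
    scenario = card.scenario.lower()
    return any(kw in scenario for kw in extract_keywords(incident_type))

def get_battle_card(incident_type: str, cards: list[BattleCard],
                    default: BattleCard) -> BattleCard:
    # 1. Match incident to card
    matching = [c for c in cards if matches(c, incident_type)]

    if not matching:
        return default  # Generic incident response

    # 2. Return most specific match
    return max(matching, key=lambda c: c.specificity)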

5. Implementation Guide

5.1 Development Environment Setup

No special setup required—these are Markdown documents.

Optional tools:

  • Markdown editor (VS Code, Obsidian)
  • Static site generator (MkDocs) for nice rendering
  • Slack app for quick access

5.2 Project Structure

battle-cards/
├── cards/
│   ├── tier-1/
│   │   ├── bc-001-checkout-5xx.md
│   │   ├── bc-002-payment-down.md
│   │   └── bc-003-db-failover.md
│   ├── tier-2/
│   │   └── bc-010-search-slow.md
│   └── security/
│       └── bc-050-cred-leak.md
├── templates/
│   └── battle-card-template.md
├── index.md               # Card directory
└── review-log.md          # When cards were reviewed

5.3 The Core Question You’re Answering

“Can your team handle a P0 incident without the ‘Senior Dev’ being awake?”

If your operating model depends on a single person’s intuition, it is not scalable. Battle Cards turn intuition into a repeatable process.

5.4 Concepts You Must Understand First

Stop and research these before writing:

  1. The Incident Command System (ICS)
    • What are the standard roles in an incident?
    • Book Reference: “The Site Reliability Workbook” Ch. 9
  2. The OODA Loop
    • How do you speed up decision-making during a crisis?
    • Reference: John Boyd’s OODA loop briefings and related military strategy literature
  3. Checklist Manifesto
    • Why do simple checklists save lives?
    • Book Reference: “The Checklist Manifesto” by Atul Gawande

5.5 Questions to Guide Your Design

Before writing, think through these:

Simplicity

  • Can a tired person read this at 3 AM?
  • Are the links easy to click on mobile?
  • Is there too much text?

Interaction

  • When does the card tell you to stop and call another team?
  • How is that “handover” defined?
  • What if multiple cards might apply?

Maintenance

  • Who updates the card after an incident?
  • How do you ensure cards don’t go stale?
  • How do you test that links still work?

5.6 Thinking Exercise

The “Blank Screen” Drill

Imagine you’re on-call and see a dashboard where every chart is red. You have 3 minutes to decide who to page.

Questions:

  1. Does your Battle Card help narrow down the source?
  2. Does it tell you who to notify before you start the fix?
  3. If the fix takes 2 hours, does the card have a schedule for updates?

Write down:

  • The first thing you’d look at
  • The first person you’d contact
  • When you’d send the first status update

5.7 Hints in Layers

Hint 1: Pick the Top 5 Incidents
Don’t write cards for everything. Start with the 5 incidents that happen most often or cause the most damage.

Hint 2: Follow the 3-Step Rule
Every card should have exactly 3 “Immediate Actions.” More than 3 overwhelms people under stress.

Hint 3: Explicitly Define Notification
The card must say: “At minute 10, post to #incidents. At minute 30, page the CTO.”

Hint 4: Use a Template
Every card should have the same layout. People need to know where to look for “Links” or “Contacts” without thinking.
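
The timed notifications from Hint 3 can also be driven by a small helper, so nobody has to watch the clock during an incident. The sketch below is illustrative only: it copies the schedule from the example checkout card, and the names `SCHEDULE` and `due_notifications` are hypothetical.

```python
from datetime import datetime, timedelta

# (minute, action) pairs taken from the example checkout card.
SCHEDULE = [
    (0, "Post to #incidents-active"),
    (10, "Update in #checkout-oncall"),
    (30, "Escalate to Engineering Manager if unresolved"),
    (60, "Notify VP Engineering"),
]

def due_notifications(started_at: datetime, now: datetime,
                      already_sent: set[int]) -> list[tuple[int, str]]:
    """Return schedule entries whose minute mark has passed and are not yet sent."""
    elapsed_minutes = (now - started_at) / timedelta(minutes=1)
    return [
        (minute, action)
        for minute, action in SCHEDULE
        if minute <= elapsed_minutes and minute not in already_sent
    ]
```

A bot or the Scribe could call this every minute and record each sent minute mark in `already_sent`.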

5.8 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do you organize an incident response team?”
    • ICS roles: Incident Commander, Comms Lead, Ops Lead, Scribe
  2. “What is the role of a ‘Communications Lead’?”
    • Keep stakeholders informed, manage external comms, shield responders from interruptions
  3. “How do you conduct a blameless post-mortem?”
    • Focus on systems not people, find contributing factors, generate action items
  4. “Why is it important to have pre-defined Battle Cards?”
    • Reduces cognitive load, ensures consistent response, enables anyone to respond
  5. “How do you measure the effectiveness of your incident response?”
    • MTTA, MTTR, incident count, severity distribution, post-mortem action completion
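
For the last question, it helps to be able to show how the numbers are derived. A minimal sketch follows, assuming MTTA is measured from alert to acknowledgement and MTTR from alert to resolution (teams define these differently). The incident records are made-up sample data.

```python
from datetime import datetime
from statistics import mean

def mean_minutes(incidents: list[dict], start_key: str, end_key: str) -> float:
    """Average number of minutes between two timestamps across incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )

# Made-up sample data for illustration only.
incidents = [
    {"alerted": datetime(2025, 1, 15, 3, 0),
     "acked": datetime(2025, 1, 15, 3, 2),
     "resolved": datetime(2025, 1, 15, 3, 20)},
    {"alerted": datetime(2025, 1, 20, 14, 0),
     "acked": datetime(2025, 1, 20, 14, 6),
     "resolved": datetime(2025, 1, 20, 15, 0)},
]

mtta = mean_minutes(incidents, "alerted", "acked")      # 4.0 minutes
mttr = mean_minutes(incidents, "alerted", "resolved")   # 40.0 minutes
```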

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Incident Response | “The Site Reliability Workbook” | Ch. 9: Incident Response |
| Post-mortems | “SRE Book” (Google) | Ch. 15: Postmortem Culture |
| Checklists | “The Checklist Manifesto” | All |

5.10 Implementation Phases

Phase 1: Template Design (2-3 hours)

  1. Design card format (see example above)
  2. Test with 2-3 engineers for readability
  3. Finalize template

Phase 2: Card Creation (4-5 hours)

  1. Identify top 5 incident types
  2. Interview subject matter experts
  3. Write cards for each

Phase 3: Access Setup (1-2 hours)

  1. Pin cards in Slack incident channel
  2. Add to on-call rotation wiki
  3. Test retrieval speed

Phase 4: Review Process (1-2 hours)

  1. Add card review to post-mortem template
  2. Schedule quarterly card audit
  3. Assign owners to each card
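
The quarterly audit in Phase 4 is easier to keep up if overdue cards are flagged automatically. One possible sketch, assuming each card's metadata has been loaded into a dict with `id` and `last_updated` fields, and treating "quarterly" as 90 days (an assumption, not a rule from this guide):

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # Assumed stand-in for "quarterly"

def overdue_cards(cards: list[dict], today: date) -> list[str]:
    """Return IDs of cards not updated within the review interval, oldest first."""
    stale = [c for c in cards if today - c["last_updated"] > REVIEW_INTERVAL]
    return [c["id"] for c in sorted(stale, key=lambda c: c["last_updated"])]
```

The resulting list can go straight to the card owners as the agenda for the audit.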

5.11 Key Implementation Decisions

| Decision | Option A | Option B | Recommendation |
|----------|----------|----------|----------------|
| Format | Markdown | HTML | Markdown (easy to edit) |
| Access | Slack pinned | Wiki page | Both (redundancy) |
| Structure | Flat list | By severity | By severity (faster triage) |
| Updates | Ad-hoc | Post-incident | Post-incident (systematic) |

6. Testing Strategy

Readability Tests

  • Show card to on-call engineer for 30 seconds
  • Ask them to explain the first 3 steps
  • If they can’t, simplify the card

Link Tests

  • Click every link in every card monthly
  • Automate with a script if possible
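
A link check along these lines could be scripted as follows. This is a sketch under stated assumptions: links appear as bare hosts on `.internal` or `.com` domains, as in the example card, bare hosts are reachable over `https://`, and the servers answer `HEAD` requests. The names `extract_links` and `is_reachable` are hypothetical.

```python
import re
import urllib.request

# Assumption: link-like strings look like "grafana.internal/d/checkout-rds".
LINK_PATTERN = re.compile(r"\b(?:https?://)?[\w.-]+\.(?:internal|com)/[^\s)\]┃]*")

def extract_links(card_text: str) -> list[str]:
    """Find link-like strings and normalize them to URLs, without duplicates."""
    links = []
    for raw in LINK_PATTERN.findall(card_text):
        url = raw if raw.startswith("http") else f"https://{raw}"
        if url not in links:
            links.append(url)
    return links

def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request succeeds without an HTTP or network error."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # URLError, HTTPError, and timeouts are all OSError subclasses.
        return False
```

Looping `is_reachable` over `extract_links` for every card and reporting the failures gives a simple stale-link report.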

Drill Tests

  • Run tabletop exercises using cards
  • Time how long it takes to start response
  • Identify gaps in cards

7. Common Pitfalls & Debugging

| Problem | Symptom | Root Cause | Fix |
|---------|---------|------------|-----|
| Too much text | People don’t read it | Tried to cover every case | Focus on first 3 actions only |
| Can’t find card | Searching during incident | Poor organization | Pin in Slack, use consistent naming |
| Stale links | Dashboards 404 | No maintenance process | Add link check to quarterly review |
| Wrong actions | Card causes harm | Written without validation | Test with dry runs, review after use |

8. Extensions & Challenges

Extension 1: Interactive Cards

Build a web app where clicking through the card records your actions (audit trail).

Extension 2: Slack Bot

/battlecard checkout-5xx returns the card directly in Slack.
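
The lookup behind such a command can stay very small. The sketch below covers only the step that maps the command text to a card file. It assumes the bot framework passes in the text typed after `/battlecard` and that cards follow the `bc-NNN-<slug>.md` naming used in this project; `resolve_card` is a hypothetical name and the Slack integration itself is not shown.

```python
import re
from pathlib import Path
from typing import Optional

def resolve_card(command_text: str, cards_dir: Path) -> Optional[Path]:
    """Map slash-command text such as 'checkout-5xx' to a card file, if any."""
    slug = command_text.strip().lower()
    if not re.fullmatch(r"[a-z0-9-]+", slug):
        return None  # Reject input that could escape the cards directory
    matches = sorted(cards_dir.rglob(f"bc-*-{slug}.md"))
    return matches[0] if matches else None
```

Because the pattern matches on the end of the file name, a shorter slug such as `5xx` would also resolve to the checkout card. Whether that is desirable depends on how many cards share a suffix.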

Extension 3: Card Analytics

Track which cards are used most, which have the best outcomes.

Extension 4: AI-Assisted Triage

Based on alert content, suggest which battle card to use.


9. Real-World Connections

Examples from Industry:

  • PagerDuty: Incident response documentation best practices
  • Atlassian: Incident Management Playbook
  • Google: “Wheel of Misfortune” training with runbooks

Tools:


10. Resources

Incident Management

Templates


11. Self-Assessment Checklist

Before considering this project complete, verify:

  • I can explain the OODA Loop
  • Template is readable at 2 AM (tested with colleague)
  • At least 5 cards are created for top incidents
  • Cards are accessible from Slack in < 30 seconds
  • Each card has an owner
  • Review process is defined
  • At least one card has been used in a drill or a real incident

12. Submission / Completion Criteria

This project is complete when you have:

  1. Template that fits on one page
  2. 5+ Battle Cards for common incidents
  3. Index page listing all cards
  4. Access mechanism (Slack pin, wiki)
  5. Review process documented
  6. Test run of at least one card in a drill or real incident
