Project 10: Incident Response Battle Cards

Create highly condensed 1-page guides for specific incident scenarios that define exactly who leads, who to notify, and the first diagnostic steps.

Quick Reference

Attribute	Value
Difficulty	Beginner
Time Estimate	Weekend (8-12 hours)
Primary Language	Markdown / HTML
Alternative Languages	JavaScript (for interactive cards)
Prerequisites	Basic incident management understanding
Key Topics	Incident Command System, OODA Loop, Crisis Communication

1. Learning Objectives

By completing this project, you will:

Codify incident response patterns into reusable templates
Define clear roles for crisis situations
Reduce cognitive load during high-stress moments
Create a scalable system for documenting runbooks
Enable consistent response regardless of who is on-call

2. Theoretical Foundation

2.1 Core Concepts

The Incident Response Problem

WITHOUT BATTLE CARDS                WITH BATTLE CARDS
┌─────────────────────────────┐    ┌─────────────────────────────┐
│ 3:00 AM: Alert fires        │    │ 3:00 AM: Alert fires        │
│ 3:05 AM: "What do I do?"   │    │ 3:02 AM: Pull battle card   │
│ 3:10 AM: Searching Slack   │    │ 3:03 AM: Follow Step 1      │
│ 3:20 AM: "Who do I page?"  │    │ 3:05 AM: Page correct team  │
│ 3:35 AM: Wrong team paged  │    │ 3:10 AM: Issue identified   │
│ 4:00 AM: Still searching   │    │ 3:20 AM: Incident resolved  │
│ 4:30 AM: Finally start fix │    │                             │
└─────────────────────────────┘    └─────────────────────────────┘

The OODA Loop

Military decision-making framework applied to incidents:

        ┌───────────────────────────────────────────────┐
        │                                               │
        ▼                                               │
   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
   │ OBSERVE │───►│ ORIENT  │───►│ DECIDE  │───►│   ACT   │
   │         │    │         │    │         │    │         │
   │ What's  │    │ What    │    │ What    │    │ Execute │
   │ happen- │    │ does it │    │ should  │    │ the     │
   │ ing?    │    │ mean?   │    │ I do?   │    │ action  │
   └─────────┘    └─────────┘    └─────────┘    └─────────┘
        ▲                                               │
        │                                               │
        └───────────────────────────────────────────────┘

Battle Cards accelerate the ORIENT and DECIDE phases: the card already states what a symptom means and which action to take.

Incident Command System (ICS)

Standardized roles during incidents:

Role	Responsibility
Incident Commander (IC)	Coordinates response, makes decisions
Communications Lead	Updates stakeholders, manages channels
Operations Lead	Executes technical fixes
Scribe	Documents timeline and actions

                    ┌─────────────────────┐
                    │ INCIDENT COMMANDER  │
                    │ (Decision Maker)    │
                    └─────────┬───────────┘
                              │
          ┌───────────────────┼───────────────────┐
          │                   │                   │
          ▼                   ▼                   ▼
   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
   │    COMMS    │     │   OPS LEAD  │     │   SCRIBE    │
   │   (Updates) │     │ (Fix it)    │     │  (Document) │
   └─────────────┘     └─────────────┘     └─────────────┘

2.2 Why This Matters

At 3 AM, under stress:

Memory fails
Judgment is impaired
Procedures are forgotten
Panic spreads

Battle Cards provide:

Immediate starting point
Reduced decision fatigue
Consistent response quality
Lower Mean Time to Resolve (MTTR)

2.3 Historical Context

Incident Command System: Developed for firefighting (1970s)
Aviation Checklists: Proven to prevent errors
DevOps Runbooks: Digital evolution of checklists
Battle Cards: Compact, scannable format for high-stress situations

2.4 Common Misconceptions

Misconception	Reality
“We know what to do”	You won’t at 3 AM after 4 hours of sleep
“Every incident is unique”	80% follow predictable patterns
“Documentation slows us down”	Searching slows you down more
“Senior devs don’t need this”	Senior devs are the source of knowledge—capture it

3. Project Specification

3.1 What You Will Build

Battle Card Template: Standardized format for incident scenarios
Card Collection: 5-10 cards for common incidents
Quick Access System: Physical deck or digital lookup
Update Process: How cards get revised after incidents

3.2 Functional Requirements

Card Format
- Scenario title (what’s happening)
- Primary lead (who runs point)
- First 3 diagnostic steps
- Notification list
- Escalation criteria
- Key links (dashboards, runbooks)
Accessibility
- Accessible in < 30 seconds
- Works offline (if possible)
- Scannable at a glance
- No login required during incident
Maintenance
- Review after every major incident
- Quarterly audit for accuracy
- Clear ownership of each card

3.3 Non-Functional Requirements

Each card must fit on one page (or one screen)
Text must be readable under stress (large fonts, clear hierarchy)
Cards must be available in incident Slack channel
Must support version history

3.4 Example Usage / Output

Battle Card Example:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃  🔴 BATTLE CARD: Checkout Service 5xx Errors                    ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃                                                                 ┃
┃  SCENARIO                                                       ┃
┃  Checkout service returning HTTP 5xx errors to customers.       ┃
┃  Customers cannot complete purchases.                           ┃
┃                                                                 ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃  ROLES                                                          ┃
┃  ─────                                                          ┃
┃  Primary Lead:  Checkout Team On-call                           ┃
┃  Comms Lead:    SRE Lead (or IC on rotation)                    ┃
┃  Escalation:    If no response in 10 min → Page Manager         ┃
┃                                                                 ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃  FIRST 3 STEPS (Do these in order)                              ┃
┃  ─────────────────────────────────                              ┃
┃  1. Check RDS Latency                                           ┃
┃     📊 grafana.internal/d/checkout-rds                          ┃
┃     ✓ Normal: < 50ms  ⚠️ High: 50-200ms  🔴 Critical: > 200ms   ┃
┃                                                                 ┃
┃  2. Verify Redis Connection Pool                                ┃
┃     📊 grafana.internal/d/checkout-redis                        ┃
┃     ✓ Available: > 50  ⚠️ Low: 10-50  🔴 Exhausted: < 10        ┃
┃                                                                 ┃
┃  3. Check Recent Deployments                                    ┃
┃     📊 argocd.internal/applications/checkout                    ┃
┃     → If deployment in last 2 hours: ROLLBACK FIRST             ┃
┃                                                                 ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃  IF DB LATENCY > 200ms                                          ┃
┃  ────────────────────                                           ┃
┃  → Scale checkout pods to 10:                                   ┃
┃    kubectl scale deploy/checkout --replicas=10                  ┃
┃  → If no improvement: Failover to RDS replica                   ┃
┃    📖 wiki.internal/runbooks/rds-failover                       ┃
┃                                                                 ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃  NOTIFY                                                         ┃
┃  ──────                                                         ┃
┃  At minute 0:    Post to #incidents-active                      ┃
┃  At minute 10:   Update in #checkout-oncall                     ┃
┃  At minute 30:   Escalate to Engineering Manager if unresolved  ┃
┃  At minute 60:   Notify VP Engineering                          ┃
┃                                                                 ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃  LINKS                                                          ┃
┃  ─────                                                          ┃
┃  🔗 Full Runbook:     wiki.internal/runbooks/checkout-5xx       ┃
┃  📊 Dashboard:        grafana.internal/d/checkout-overview      ┃
┃  📞 PagerDuty:        app.pagerduty.com/services/checkout       ┃
┃  💬 Slack:            #checkout-oncall                          ┃
┃                                                                 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
                        Last Updated: 2025-01-15
                        Owner: @checkout-team-lead
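
The threshold bands on steps 1 and 2 are simple enough to encode as data, which helps if the same numbers later feed an alert rule or a dashboard annotation. The sketch below is one possible way to do that in Python. `classify` and its `higher_is_worse` parameter are hypothetical names, and the boundary handling (exactly 50 ms or exactly 200 ms counts as High) is an assumption read off the card rather than something the card states.

```python
def classify(value, warning_at, critical_at, higher_is_worse=True):
    """Map a metric reading to 'normal', 'warning', or 'critical'.

    For metrics where low values are the problem (for example free pool
    connections), pass higher_is_worse=False; the comparison is mirrored
    by negating all three numbers.
    """
    if not higher_is_worse:
        value, warning_at, critical_at = -value, -warning_at, -critical_at
    if value > critical_at:
        return "critical"
    if value >= warning_at:
        return "warning"
    return "normal"


# Step 1: RDS latency in ms (Normal < 50, High 50-200, Critical > 200)
print(classify(35, warning_at=50, critical_at=200))
print(classify(250, warning_at=50, critical_at=200))

# Step 2: available Redis connections (Available > 50, Low 10-50, Exhausted < 10)
print(classify(8, warning_at=50, critical_at=10, higher_is_worse=False))
```

Keeping the bands as plain numbers next to the card makes it harder for the card and the alerting configuration to drift apart.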

Card Collection Index:

# Incident Battle Cards Index

## Critical (Tier 1 Services)

| ID | Scenario | Primary Lead | Est. Impact |
|----|----------|--------------|-------------|
| BC-001 | [Checkout 5xx Errors](./bc-001-checkout-5xx.md) | Checkout On-call | Revenue stop |
| BC-002 | [Payment Gateway Down](./bc-002-payment-down.md) | Payments On-call | Revenue stop |
| BC-003 | [Database Failover](./bc-003-db-failover.md) | DBA On-call | All services |
| BC-004 | [CDN Outage](./bc-004-cdn-outage.md) | Platform On-call | All frontend |

## Important (Tier 2 Services)

| ID | Scenario | Primary Lead | Est. Impact |
|----|----------|--------------|-------------|
| BC-010 | [Search Degraded](./bc-010-search-slow.md) | Search On-call | UX impact |
| BC-011 | [Email Delays](./bc-011-email-delays.md) | Comms On-call | Customer comms |

## Security

| ID | Scenario | Primary Lead | Est. Impact |
|----|----------|--------------|-------------|
| BC-050 | [Credential Leak](./bc-050-cred-leak.md) | Security On-call | Data breach |
| BC-051 | [DDoS Attack](./bc-051-ddos.md) | Platform On-call | Availability |
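
Keeping an index like this in sync by hand is easy to forget, so one option is to generate each section from card metadata. The following is a minimal sketch, assuming each card is available as a dict with `id`, `title`, `file`, `lead`, and `impact` keys. Those key names and the `render_index_section` function are hypothetical and chosen only for this example.

```python
def render_index_section(heading, cards):
    """Render one index section as a Markdown heading plus a table."""
    lines = [
        f"## {heading}",
        "",
        "| ID | Scenario | Primary Lead | Est. Impact |",
        "|----|----------|--------------|-------------|",
    ]
    # Sort by ID so the table order is stable between runs.
    for card in sorted(cards, key=lambda c: c["id"]):
        link = f"[{card['title']}](./{card['file']})"
        lines.append(f"| {card['id']} | {link} | {card['lead']} | {card['impact']} |")
    return "\n".join(lines)


tier2 = [
    {"id": "BC-011", "title": "Email Delays", "file": "bc-011-email-delays.md",
     "lead": "Comms On-call", "impact": "Customer comms"},
    {"id": "BC-010", "title": "Search Degraded", "file": "bc-010-search-slow.md",
     "lead": "Search On-call", "impact": "UX impact"},
]
print(render_index_section("Important (Tier 2 Services)", tier2))
```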

3.5 Real World Outcome

After implementing Battle Cards:

On-call engineers start incident response with confidence
MTTR decreases because the right actions happen first
Fewer escalations due to “I don’t know what to do”
New team members can respond effectively sooner

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                    BATTLE CARD SYSTEM                           │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  CARD REPO    │     │  QUICK ACCESS │     │  REVIEW CYCLE │
│  (Git)        │     │               │     │               │
│               │     │ - Slack bot   │     │ - Post-       │
│  Markdown     │     │ - Wiki page   │     │   incident    │
│  templates    │     │ - Mobile app  │     │ - Quarterly   │
└───────────────┘     └───────────────┘     └───────────────┘

4.2 Key Components

Card Template: Standardized Markdown format
Card Repository: Git repo with all cards
Access Mechanism: Slack bot or pinned messages
Review Process: Update after incidents

4.3 Data Structures

# battle-card-schema.yaml
card:
  id: string              # BC-001
  title: string           # Checkout 5xx Errors
  severity: tier-1 | tier-2 | tier-3
  scenario:
    description: string   # What's happening
    symptoms: list        # How you know it's this

  roles:
    primary_lead: string  # Team or rotation
    comms_lead: string
    escalation_path: list

  first_steps:
    - number: 1
      action: string
      link: url
      thresholds:
        normal: string
        warning: string
        critical: string

  conditional_actions:
    - condition: string
      actions: list

  notification_schedule:
    - minute: 0
      action: string
      channel: string
    - minute: 10
      action: string
      channel: string

  links:
    runbook: url
    dashboard: url
    pagerduty: url
    slack_channel: string

  metadata:
    owner: string
    last_updated: date
    last_used: date
    review_due: date
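
A schema is only useful if cards are checked against it. The sketch below shows one way to lint a card, assuming it has already been loaded into a plain Python dict (for example by a YAML parser). `validate_card` is a hypothetical helper, and the specific checks are a reading of the requirements in section 3.2, not an exhaustive validator.

```python
REQUIRED_FIELDS = (
    "id", "title", "severity", "scenario", "roles",
    "first_steps", "notification_schedule", "links", "metadata",
)
ALLOWED_SEVERITIES = {"tier-1", "tier-2", "tier-3"}


def validate_card(card):
    """Return a list of problems found in a card dict; empty means it passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in card]

    if "severity" in card and card["severity"] not in ALLOWED_SEVERITIES:
        problems.append(f"unknown severity: {card['severity']}")

    # The card format calls for exactly three first steps, numbered 1 to 3.
    steps = card.get("first_steps", [])
    if [step.get("number") for step in steps] != [1, 2, 3]:
        problems.append("first_steps must be numbered 1, 2, 3")

    # Notifications should be listed in the order they fire.
    minutes = [entry.get("minute") for entry in card.get("notification_schedule", [])]
    if not all(isinstance(m, int) for m in minutes):
        problems.append("every notification needs an integer minute")
    elif minutes != sorted(minutes):
        problems.append("notification_schedule must be in ascending minute order")

    # Every card needs a clear owner.
    if not card.get("metadata", {}).get("owner"):
        problems.append("metadata.owner is required")

    return problems
```

Running a check like this in CI on every change to the card repository catches a card with a missing owner or a fourth "first step" before it is needed at 3 AM.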

4.4 Algorithm Overview

Card Retrieval:

def get_battle_card(incident_type: str) -> BattleCard:
    # 1. Match incident to card
    cards = load_all_cards()
    matching = [c for c in cards if matches(c, incident_type)]

    if not matching:
        return default_card()  # Generic incident response

    # 2. Return the most specific match (highest specificity score)
    return max(matching, key=lambda c: c.specificity)

def matches(card: BattleCard, incident_type: str) -> bool:
    # Match when any keyword from the incident type appears in the scenario text
    keywords = extract_keywords(incident_type)
    return any(kw in card.scenario for kw in keywords)
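
The retrieval pseudocode above leaves `load_all_cards`, `extract_keywords`, and `specificity` undefined. The following is one self-contained way to fill them in, assuming cards are held in memory and that "most specific" means "shares the most keywords with the alert text". It compares whole words rather than substrings, which is a deliberate change from the pseudocode. The `BattleCard` fields and the stopword list are illustrative assumptions.

```python
import re
from dataclasses import dataclass

# Small illustrative stopword list; extend it for real alert text.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "on", "in", "and"}


@dataclass(frozen=True)
class BattleCard:
    card_id: str
    title: str
    scenario: str


def extract_keywords(text):
    """Lowercase the text and return its words, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def match_score(card, incident_text):
    """Count the keywords shared by the card and the incident description."""
    card_words = extract_keywords(f"{card.title} {card.scenario}")
    return len(card_words & extract_keywords(incident_text))


def get_battle_card(incident_text, cards, default):
    """Return the best-scoring card, or the default card when nothing matches."""
    best = max(cards, key=lambda c: match_score(c, incident_text), default=None)
    if best is None or match_score(best, incident_text) == 0:
        return default
    return best


CARDS = [
    BattleCard("BC-001", "Checkout 5xx Errors",
               "Checkout service returning HTTP 500 errors to customers"),
    BattleCard("BC-010", "Search Degraded", "Search results are slow or incomplete"),
]
DEFAULT = BattleCard("BC-000", "Generic Incident", "Unknown failure")

print(get_battle_card("checkout returning 500 errors", CARDS, DEFAULT).card_id)
```

Keyword overlap is crude, and ties go to whichever card comes first in the list. For a small deck that is usually acceptable, because the responder still reads the card title before acting.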

5. Implementation Guide

5.1 Development Environment Setup

No special setup is required; the cards are plain Markdown documents.

Optional tools:

Markdown editor (VS Code, Obsidian)
Static site generator (MkDocs) for nice rendering
Slack app for quick access

5.2 Project Structure

battle-cards/
├── cards/
│   ├── tier-1/
│   │   ├── bc-001-checkout-5xx.md
│   │   ├── bc-002-payment-down.md
│   │   └── bc-003-db-failover.md
│   ├── tier-2/
│   │   └── bc-010-search-slow.md
│   └── security/
│       └── bc-050-cred-leak.md
├── templates/
│   └── battle-card-template.md
├── index.md               # Card directory
└── review-log.md          # When cards were reviewed

5.3 The Core Question You’re Answering

“Can your team handle a P0 incident without the ‘Senior Dev’ being awake?”

If your operating model depends on a single person’s intuition, it is not scalable. Battle Cards turn intuition into a repeatable process.

5.4 Concepts You Must Understand First

Stop and research these before writing:

The Incident Command System (ICS)
- What are the standard roles in an incident?
- Book Reference: “The Site Reliability Workbook” Ch. 9
The OODA Loop
- How do you speed up decision-making during a crisis?
- Reference: Military strategy literature
Checklist Manifesto
- Why do simple checklists save lives?
- Book Reference: “The Checklist Manifesto” by Atul Gawande

5.5 Questions to Guide Your Design

Before writing, think through these:

Simplicity

Can a tired person read this at 3 AM?
Are the links easy to click on mobile?
Is there too much text?

Interaction

When does the card tell you to stop and call another team?
How is that “handover” defined?
What if multiple cards might apply?

Maintenance

Who updates the card after an incident?
How do you ensure cards don’t go stale?
How do you test that links still work?

5.6 Thinking Exercise

The “Blank Screen” Drill

Imagine you’re on-call and see a dashboard where every chart is red. You have 3 minutes to decide who to page.

Questions:

Does your Battle Card help narrow down the source?
Does it tell you who to notify before you start the fix?
If the fix takes 2 hours, does the card have a schedule for updates?

Write down:

The first thing you’d look at
The first person you’d contact
When you’d send the first status update

5.7 Hints in Layers

Hint 1: Pick the Top 5 Incidents. Don’t write cards for everything. Start with the 5 incidents that happen most often or cause the most damage.

Hint 2: Follow the 3-Step Rule. Every card should have exactly 3 “Immediate Actions.” More than 3 overwhelms people under stress.

Hint 3: Explicitly Define Notification. The card must say: “At minute 10, post to #incidents. At minute 30, page the CTO.”

Hint 4: Use a Template. Every card should have the same layout. People need to know where to look for “Links” or “Contacts” without thinking.
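
Hint 3 asks for an explicit notification schedule. One way to make that schedule checkable during an incident is to store it as data and ask which updates are due. The sketch below uses the NOTIFY section of the example card in section 3.4. The `Notification` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    minute: int   # minutes after the incident was declared
    action: str   # what to do at that point


# Schedule copied from the NOTIFY section of the example card.
CHECKOUT_5XX_SCHEDULE = [
    Notification(0, "Post to #incidents-active"),
    Notification(10, "Update in #checkout-oncall"),
    Notification(30, "Escalate to Engineering Manager if unresolved"),
    Notification(60, "Notify VP Engineering"),
]


def due_notifications(schedule, elapsed_minutes, already_sent=()):
    """Return notifications that should have gone out by now but have not."""
    return [
        n for n in sorted(schedule, key=lambda n: n.minute)
        if n.minute <= elapsed_minutes and n.minute not in already_sent
    ]


def next_notification(schedule, elapsed_minutes):
    """Return the next upcoming notification, or None if the schedule is done."""
    upcoming = [n for n in schedule if n.minute > elapsed_minutes]
    return min(upcoming, key=lambda n: n.minute) if upcoming else None


# Twelve minutes in, with the minute-0 post already made:
for n in due_notifications(CHECKOUT_5XX_SCHEDULE, 12, already_sent={0}):
    print(f"DUE (minute {n.minute}): {n.action}")
```

A scribe or a bot can call `next_notification` to remind the Comms Lead when the next update is expected.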

5.8 The Interview Questions They’ll Ask

Prepare to answer these:

“How do you organize an incident response team?”
- ICS roles: Incident Commander, Comms Lead, Ops Lead, Scribe
“What is the role of a ‘Communications Lead’?”
- Keep stakeholders informed, manage external comms, shield responders from interruptions
“How do you conduct a blameless post-mortem?”
- Focus on systems not people, find contributing factors, generate action items
“Why is it important to have pre-defined Battle Cards?”
- Reduces cognitive load, ensures consistent response, enables anyone to respond
“How do you measure the effectiveness of your incident response?”
- MTTA, MTTR, incident count, severity distribution, post-mortem action completion
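
MTTA and MTTR are averages over incident timestamps. A small sketch of the arithmetic follows, assuming each incident records when the alert fired, when someone acknowledged it, and when it was resolved. The dict keys and sample data are hypothetical, and both intervals are measured from the alert time. Some teams start the MTTR clock at acknowledgement instead, so state which definition you use.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start, end):
    """Elapsed minutes between two datetimes."""
    return (end - start).total_seconds() / 60


def incident_metrics(incidents):
    """Return mean time to acknowledge and mean time to resolve, in minutes."""
    return {
        "mtta_minutes": mean(
            [minutes_between(i["alerted"], i["acknowledged"]) for i in incidents]
        ),
        "mttr_minutes": mean(
            [minutes_between(i["alerted"], i["resolved"]) for i in incidents]
        ),
    }


# Hypothetical sample data for illustration only.
incidents = [
    {"alerted": datetime(2025, 1, 15, 3, 0),
     "acknowledged": datetime(2025, 1, 15, 3, 2),
     "resolved": datetime(2025, 1, 15, 3, 20)},
    {"alerted": datetime(2025, 1, 20, 14, 0),
     "acknowledged": datetime(2025, 1, 20, 14, 6),
     "resolved": datetime(2025, 1, 20, 15, 0)},
]
print(incident_metrics(incidents))  # {'mtta_minutes': 4.0, 'mttr_minutes': 40.0}
```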

5.9 Books That Will Help

Topic	Book	Chapter
Incident Response	“The Site Reliability Workbook”	Ch. 9: Incident Response
Post-mortems	“SRE Book” (Google)	Ch. 15: Postmortem Culture
Checklists	“The Checklist Manifesto”	All

5.10 Implementation Phases

Phase 1: Template Design (2-3 hours)

Design card format (see example above)
Test with 2-3 engineers for readability
Finalize template

Phase 2: Card Creation (4-5 hours)

Identify top 5 incident types
Interview subject matter experts
Write cards for each

Phase 3: Access Setup (1-2 hours)

Pin cards in Slack incident channel
Add to on-call rotation wiki
Test retrieval speed

Phase 4: Review Process (1-2 hours)

Add card review to post-mortem template
Schedule quarterly card audit
Assign owners to each card
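
The quarterly audit is easier to keep up if overdue cards are listed automatically. A minimal sketch, assuming each card's `owner` and `review_due` metadata have been flattened into a dict alongside its `id`. `cards_due_for_review`, the sample dates, and the `@search-lead` owner are hypothetical.

```python
from datetime import date


def cards_due_for_review(cards, today):
    """Return (card_id, owner, days_overdue) for each card at or past review_due.

    The most overdue card comes first.
    """
    due = [
        (card["id"], card["owner"], (today - card["review_due"]).days)
        for card in cards
        if card["review_due"] <= today
    ]
    return sorted(due, key=lambda row: row[2], reverse=True)


cards = [
    {"id": "BC-001", "owner": "@checkout-team-lead", "review_due": date(2025, 4, 15)},
    {"id": "BC-010", "owner": "@search-lead", "review_due": date(2025, 7, 30)},
]
for card_id, owner, days in cards_due_for_review(cards, today=date(2025, 6, 1)):
    print(f"{card_id} is {days} days overdue, owner {owner}")
```

Passing `today` in as an argument, rather than calling `date.today()` inside the function, keeps the check easy to test.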

5.11 Key Implementation Decisions

Decision	Option A	Option B	Recommendation
Format	Markdown	HTML	Markdown (easy to edit)
Access	Slack pinned	Wiki page	Both (redundancy)
Structure	Flat list	By severity	By severity (faster triage)
Updates	Ad-hoc	Post-incident	Post-incident (systematic)

6. Testing Strategy

Readability Tests

Show card to on-call engineer for 30 seconds
Ask them to explain the first 3 steps
If they can’t, simplify the card

Link Tests

Click every link in every card monthly
Automate with a script if possible
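
A link check along these lines could be scripted with the Python standard library. The sketch assumes links are written in the cards as full `http://` or `https://` URLs (the example card omits the scheme, so those would need to be normalized first). The function names are hypothetical, and the reachability check is passed in as a parameter so it can be swapped out in tests or behind a VPN.

```python
import re
import urllib.request

URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def extract_links(markdown_text):
    """Return the unique http(s) URLs in a card, in order of first appearance."""
    urls = [u.rstrip(".,;") for u in URL_PATTERN.findall(markdown_text)]
    return list(dict.fromkeys(urls))


def is_reachable(url, timeout=5.0):
    """Return True if a HEAD request to the URL gets a non-error response."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # URLError, HTTPError, and timeouts are all OSError subclasses.
        return False


def find_broken_links(markdown_text, check=is_reachable):
    """Return every link in the card for which the check fails."""
    return [url for url in extract_links(markdown_text) if not check(url)]
```

Some internal tools reject HEAD requests or require authentication, so treat a failure as a prompt to look at the link by hand, not as proof that it is dead.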

Drill Tests

Run tabletop exercises using cards
Time how long it takes to start response
Identify gaps in cards

7. Common Pitfalls & Debugging

Problem	Symptom	Root Cause	Fix
Too much text	People don’t read it	Tried to cover every case	Focus on first 3 actions only
Can’t find card	Searching during incident	Poor organization	Pin in Slack, use consistent naming
Stale links	Dashboards 404	No maintenance process	Add link check to quarterly review
Wrong actions	Card causes harm	Written without validation	Test with dry runs, review after use

8. Extensions & Challenges

Extension 1: Interactive Cards

Build a web app where clicking through the card records your actions (audit trail).

Extension 2: Slack Bot

/battlecard checkout-5xx returns the card directly in Slack.
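
The Slack wiring depends on which bot framework you use, so it is left out here. The part that stays the same is turning the command text into a card, sketched below under the assumption that cards follow the `bc-NNN-slug.md` naming from section 5.2. `find_card_file` and `handle_battlecard_command` are hypothetical helper names.

```python
import re
from pathlib import Path


def find_card_file(cards_root, query):
    """Return the first card file whose name contains the query slug, or None."""
    # Keep only characters that can appear in a card filename, so user input
    # cannot inject glob wildcards or path separators.
    slug = re.sub(r"[^a-z0-9-]", "", query.strip().lower().replace(" ", "-"))
    if not slug:
        return None
    matches = sorted(Path(cards_root).rglob(f"bc-*{slug}*.md"))
    return matches[0] if matches else None


def handle_battlecard_command(cards_root, text):
    """Return the card body for a `/battlecard <slug>` request, or a fallback."""
    card_file = find_card_file(cards_root, text)
    if card_file is None:
        return f"No battle card matches '{text.strip()}'. Check index.md for the full list."
    return card_file.read_text(encoding="utf-8")
```

The bot handler would pass the text after `/battlecard` to `handle_battlecard_command` and post the returned string to the channel.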

Extension 3: Card Analytics

Track which cards are used most and which have the best outcomes.
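
As a starting point, usage can be summarized from a simple log of which card was pulled and how long the incident took to resolve. This is a sketch under the assumption that such a log exists as `(card_id, minutes_to_resolve)` pairs. The log format and the `card_usage_report` name are hypothetical, and time to resolve is only one possible stand-in for "outcome".

```python
from collections import defaultdict
from statistics import mean


def card_usage_report(usage_log):
    """Summarize a log of (card_id, minutes_to_resolve) pairs.

    Returns (card_id, times_used, mean_minutes_to_resolve) rows,
    most-used card first, ties broken by card ID.
    """
    by_card = defaultdict(list)
    for card_id, minutes in usage_log:
        by_card[card_id].append(minutes)
    report = [(cid, len(times), mean(times)) for cid, times in by_card.items()]
    return sorted(report, key=lambda row: (-row[1], row[0]))


# Hypothetical log entries for illustration only.
log = [("BC-001", 20), ("BC-010", 45), ("BC-001", 30)]
for card_id, uses, avg in card_usage_report(log):
    print(f"{card_id}: used {uses}x, mean resolve time {avg} min")
```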

Extension 4: AI-Assisted Triage

Based on alert content, suggest which battle card to use.

9. Real-World Connections

Examples from Industry:

PagerDuty: Incident response documentation best practices
Atlassian: Incident Management Playbook
Google: “Wheel of Misfortune” training with runbooks

Tools:

PagerDuty Runbook Automation
Rootly - Incident management with runbooks
Blameless - Incident retrospectives

10. Resources

Incident Management

Templates

P04: Escalation Logic Tree - Automated escalation
P03: Ownership Mapper - Who to page

11. Self-Assessment Checklist

Before considering this project complete, verify:

I can explain the OODA Loop
Template is readable at 3 AM (tested with colleague)
At least 5 cards are created for top incidents
Cards are accessible from Slack in < 30 seconds
Each card has an owner
Review process is defined
At least one card has been used in a real incident

12. Submission / Completion Criteria

This project is complete when you have:

Template that fits on one page
5+ Battle Cards for common incidents
Index page listing all cards
Access mechanism (Slack pin, wiki)
Review process documented
Test run of at least one card in a drill or real incident

Previous Project: P09: Operational Readiness Review System
Next Project: P11: Internal Service Catalog