Project 10: Incident Response Battle Cards
Create highly condensed 1-page guides for specific incident scenarios that define exactly who leads, who to notify, and the first diagnostic steps.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Beginner |
| Time Estimate | Weekend (8-12 hours) |
| Primary Language | Markdown / HTML |
| Alternative Languages | JavaScript (for interactive cards) |
| Prerequisites | Basic incident management understanding |
| Key Topics | Incident Command System, OODA Loop, Crisis Communication |
1. Learning Objectives
By completing this project, you will:
- Codify incident response patterns into reusable templates
- Define clear roles for crisis situations
- Reduce cognitive load during high-stress moments
- Create a scalable system for documenting runbooks
- Enable consistent response regardless of who is on-call
2. Theoretical Foundation
2.1 Core Concepts
The Incident Response Problem
WITHOUT BATTLE CARDS WITH BATTLE CARDS
┌─────────────────────────────┐ ┌─────────────────────────────┐
│ 3:00 AM: Alert fires │ │ 3:00 AM: Alert fires │
│ 3:05 AM: "What do I do?" │ │ 3:02 AM: Pull battle card │
│ 3:10 AM: Searching Slack │ │ 3:03 AM: Follow Step 1 │
│ 3:20 AM: "Who do I page?" │ │ 3:05 AM: Page correct team │
│ 3:35 AM: Wrong team paged │ │ 3:10 AM: Issue identified │
│ 4:00 AM: Still searching │ │ 3:20 AM: Incident resolved │
│ 4:30 AM: Finally start fix │ │ │
└─────────────────────────────┘ └─────────────────────────────┘
The OODA Loop
Military decision-making framework applied to incidents:
┌───────────────────────────────────────────────┐
│ │
▼ │
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ OBSERVE │───►│ ORIENT │───►│ DECIDE │───►│ ACT │
│ │ │ │ │ │ │ │
│ What's │ │ What │ │ What │ │ Execute │
│ happen- │ │ does it │ │ should │ │ the │
│ ing? │ │ mean? │ │ I do? │ │ action │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
▲ │
│ │
└───────────────────────────────────────────────┘
Battle Cards accelerate the ORIENT and DECIDE phases: the card already states what the symptoms mean and what to do first.
Incident Command System (ICS)
Standardized roles during incidents:
| Role | Responsibility |
|---|---|
| Incident Commander (IC) | Coordinates response, makes decisions |
| Communications Lead | Updates stakeholders, manages channels |
| Operations Lead | Executes technical fixes |
| Scribe | Documents timeline and actions |
┌─────────────────────┐
│ INCIDENT COMMANDER │
│ (Decision Maker) │
└─────────┬───────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ COMMS │ │ OPS LEAD │ │ SCRIBE │
│ (Updates) │ │ (Fix it) │ │ (Document) │
└─────────────┘ └─────────────┘ └─────────────┘
2.2 Why This Matters
At 3 AM, under stress:
- Memory fails
- Judgment is impaired
- Procedures are forgotten
- Panic spreads
Battle Cards provide:
- Immediate starting point
- Reduced decision fatigue
- Consistent response quality
- Lower Mean Time to Resolve (MTTR)
2.3 Historical Context
- Incident Command System: Developed for firefighting (1970s)
- Aviation Checklists: Proven to prevent errors
- DevOps Runbooks: Digital evolution of checklists
- Battle Cards: Compact, scannable format for high-stress situations
2.4 Common Misconceptions
| Misconception | Reality |
|---|---|
| “We know what to do” | You won’t at 3 AM after 4 hours of sleep |
| “Every incident is unique” | Most incidents follow a small number of predictable patterns |
| “Documentation slows us down” | Searching slows you down more |
| “Senior devs don’t need this” | Senior devs are the source of knowledge—capture it |
3. Project Specification
3.1 What You Will Build
- Battle Card Template: Standardized format for incident scenarios
- Card Collection: 5-10 cards for common incidents
- Quick Access System: Physical deck or digital lookup
- Update Process: How cards get revised after incidents
3.2 Functional Requirements
- Card Format
- Scenario title (what’s happening)
- Primary lead (who runs point)
- First 3 diagnostic steps
- Notification list
- Escalation criteria
- Key links (dashboards, runbooks)
- Accessibility
- Accessible in < 30 seconds
- Works offline (if possible)
- Scannable at a glance
- No login required during incident
- Maintenance
- Review after every major incident
- Quarterly audit for accuracy
- Clear ownership of each card
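The maintenance requirements above (a quarterly audit and a clear owner per card) can be checked mechanically. The following is a minimal sketch, assuming each card's metadata has already been loaded into a small record; the `CardMeta` type, the 90-day window, and the `@dba-lead` owner are illustrative assumptions, not part of the project specification.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CardMeta:
    card_id: str
    owner: str
    last_updated: date


def stale_cards(cards, today, max_age_days=90):
    """Return cards not updated within max_age_days, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    overdue = [c for c in cards if c.last_updated < cutoff]
    return sorted(overdue, key=lambda c: c.last_updated)


# Illustrative data: BC-001 matches the example card, BC-003 is made up.
cards = [
    CardMeta("BC-001", "@checkout-team-lead", date(2025, 1, 15)),
    CardMeta("BC-003", "@dba-lead", date(2024, 6, 1)),
]

for card in stale_cards(cards, today=date(2025, 3, 1)):
    print(f"{card.card_id} needs review (owner: {card.owner})")
```

Running a check like this before the quarterly audit gives each owner a short list instead of asking everyone to re-read every card.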
3.3 Non-Functional Requirements
- Each card must fit on one page (or one screen)
- Text must be readable under stress (large fonts, clear hierarchy)
- Cards must be available in the incident Slack channel
- Must support version history
3.4 Example Usage / Output
Battle Card Example:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ 🔴 BATTLE CARD: Checkout Service 5xx Errors ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ ┃
┃ SCENARIO ┃
┃ Checkout service returning HTTP 5xx errors to customers. ┃
┃ Customers cannot complete purchases. ┃
┃ ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ ROLES ┃
┃ ───── ┃
┃ Primary Lead: Checkout Team On-call ┃
┃ Comms Lead: SRE Lead (or IC on rotation) ┃
┃ Escalation: If no response in 10 min → Page Manager ┃
┃ ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ FIRST 3 STEPS (Do these in order) ┃
┃ ───────────────────────────────── ┃
┃ 1. Check RDS Latency ┃
┃ 📊 grafana.internal/d/checkout-rds ┃
┃ ✓ Normal: < 50ms ⚠️ High: 50-200ms 🔴 Critical: > 200ms ┃
┃ ┃
┃ 2. Verify Redis Connection Pool ┃
┃ 📊 grafana.internal/d/checkout-redis ┃
┃ ✓ Available: > 50 ⚠️ Low: 10-50 🔴 Exhausted: < 10 ┃
┃ ┃
┃ 3. Check Recent Deployments ┃
┃ 📊 argocd.internal/applications/checkout ┃
┃ → If deployment in last 2 hours: ROLLBACK FIRST ┃
┃ ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ IF DB LATENCY > 200ms ┃
┃ ──────────────────── ┃
┃ → Scale checkout pods to 10: ┃
┃ kubectl scale deploy/checkout --replicas=10 ┃
┃ → If no improvement: Failover to RDS replica ┃
┃ 📖 wiki.internal/runbooks/rds-failover ┃
┃ ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ NOTIFY ┃
┃ ────── ┃
┃ At minute 0: Post to #incidents-active ┃
┃ At minute 10: Update in #checkout-oncall ┃
┃ At minute 30: Escalate to Engineering Manager if unresolved ┃
┃ At minute 60: Notify VP Engineering ┃
┃ ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ LINKS ┃
┃ ───── ┃
┃ 🔗 Full Runbook: wiki.internal/runbooks/checkout-5xx ┃
┃ 📊 Dashboard: grafana.internal/d/checkout-overview ┃
┃ 📞 PagerDuty: app.pagerduty.com/services/checkout ┃
┃ 💬 Slack: #checkout-oncall ┃
┃ ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Last Updated: 2025-01-15
Owner: @checkout-team-lead
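The NOTIFY section of the card is effectively a small timetable. As a rough sketch of how a bot or timer could use it, the snippet below copies the four entries from the example card and answers two questions: which notifications should already have gone out, and which one is next. The function names are hypothetical.

```python
# Schedule copied from the example card: (minute mark, action).
NOTIFY_SCHEDULE = [
    (0, "Post to #incidents-active"),
    (10, "Update in #checkout-oncall"),
    (30, "Escalate to Engineering Manager if unresolved"),
    (60, "Notify VP Engineering"),
]


def due_notifications(elapsed_minutes, schedule=NOTIFY_SCHEDULE):
    """Actions whose minute mark has already been reached."""
    return [action for minute, action in schedule if minute <= elapsed_minutes]


def next_notification(elapsed_minutes, schedule=NOTIFY_SCHEDULE):
    """The next upcoming (minute, action) pair, or None if the schedule is exhausted."""
    upcoming = [(m, a) for m, a in schedule if m > elapsed_minutes]
    return min(upcoming) if upcoming else None


# Twelve minutes into the incident:
print(due_notifications(12))
print(next_notification(12))
```

Keeping the schedule as data rather than prose makes it easy to reuse the same logic for every card.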
Card Collection Index:
# Incident Battle Cards Index
## Critical (Tier 1 Services)
| ID | Scenario | Primary Lead | Est. Impact |
|----|----------|--------------|-------------|
| BC-001 | [Checkout 5xx Errors](./bc-001-checkout-5xx.md) | Checkout On-call | Revenue stop |
| BC-002 | [Payment Gateway Down](./bc-002-payment-down.md) | Payments On-call | Revenue stop |
| BC-003 | [Database Failover](./bc-003-db-failover.md) | DBA On-call | All services |
| BC-004 | [CDN Outage](./bc-004-cdn-outage.md) | Platform On-call | All frontend |
## Important (Tier 2 Services)
| ID | Scenario | Primary Lead | Est. Impact |
|----|----------|--------------|-------------|
| BC-010 | [Search Degraded](./bc-010-search-slow.md) | Search On-call | UX impact |
| BC-011 | [Email Delays](./bc-011-email-delays.md) | Comms On-call | Customer comms |
## Security
| ID | Scenario | Primary Lead | Est. Impact |
|----|----------|--------------|-------------|
| BC-050 | [Credential Leak](./bc-050-cred-leak.md) | Security On-call | Data breach |
| BC-051 | [DDoS Attack](./bc-051-ddos.md) | Platform On-call | Availability |
3.5 Real World Outcome
After implementing Battle Cards:
- On-call engineers have confidence starting incident response
- MTTR decreases because the right actions happen first
- Fewer escalations due to “I don’t know what to do”
- New team members can respond effectively sooner
4. Solution Architecture
4.1 High-Level Design
┌─────────────────────────────────────────────────────────────────┐
│ BATTLE CARD SYSTEM │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ CARD REPO │ │ QUICK ACCESS │ │ REVIEW CYCLE │
│ (Git) │ │ │ │ │
│ │ │ - Slack bot │ │ - Post- │
│ Markdown │ │ - Wiki page │ │ incident │
│ templates │ │ - Mobile app │ │ - Quarterly │
└───────────────┘ └───────────────┘ └───────────────┘
4.2 Key Components
- Card Template: Standardized Markdown format
- Card Repository: Git repo with all cards
- Access Mechanism: Slack bot or pinned messages
- Review Process: Update after incidents
4.3 Data Structures
# battle-card-schema.yaml
card:
  id: string                 # BC-001
  title: string              # Checkout 5xx Errors
  severity: tier-1 | tier-2 | tier-3
  scenario:
    description: string      # What's happening
    symptoms: list           # How you know it's this
  roles:
    primary_lead: string     # Team or rotation
    comms_lead: string
    escalation_path: list
  first_steps:
    - number: 1
      action: string
      link: url
      thresholds:
        normal: string
        warning: string
        critical: string
  conditional_actions:
    - condition: string
      actions: list
  notification_schedule:
    - minute: 0
      action: string
      channel: string
    - minute: 10
      action: string
      channel: string
  links:
    runbook: url
    dashboard: url
    pagerduty: url
    slack_channel: string
  metadata:
    owner: string
    last_updated: date
    last_used: date
    review_due: date
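A schema is only useful if something enforces it. The sketch below shows one possible validator, assuming a card has already been parsed into a plain Python dict (the mapping under `card:`). The `validate_card` function and the decision to treat `conditional_actions` as optional are assumptions for illustration; the three-step check mirrors the "first 3 diagnostic steps" requirement.

```python
REQUIRED_SECTIONS = (
    "id", "title", "severity", "scenario", "roles",
    "first_steps", "notification_schedule", "links", "metadata",
)
ALLOWED_SEVERITIES = {"tier-1", "tier-2", "tier-3"}


def validate_card(card: dict) -> list:
    """Return a list of human-readable problems; empty means the card looks valid."""
    problems = [f"missing section: {name}" for name in REQUIRED_SECTIONS if name not in card]
    if card.get("severity") not in ALLOWED_SEVERITIES:
        problems.append(f"invalid severity: {card.get('severity')!r}")
    steps = card.get("first_steps") or []
    if len(steps) != 3:
        problems.append(f"expected 3 first steps, found {len(steps)}")
    if not (card.get("metadata") or {}).get("owner"):
        problems.append("card has no owner")
    return problems


example = {
    "id": "BC-001",
    "title": "Checkout 5xx Errors",
    "severity": "tier-1",
    "scenario": {"description": "Checkout service returning 5xx errors"},
    "roles": {"primary_lead": "Checkout Team On-call"},
    "first_steps": [
        {"number": 1, "action": "Check RDS Latency"},
        {"number": 2, "action": "Verify Redis Connection Pool"},
        {"number": 3, "action": "Check Recent Deployments"},
    ],
    "notification_schedule": [
        {"minute": 0, "action": "Post", "channel": "#incidents-active"},
    ],
    "links": {"runbook": "wiki.internal/runbooks/checkout-5xx"},
    "metadata": {"owner": "@checkout-team-lead"},
}

print(validate_card(example))  # an empty list means no problems were found
```

A check like this fits naturally into a pre-merge hook on the card repository, so a card without an owner never reaches the on-call engineer.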
4.4 Algorithm Overview
Card Retrieval:
# Sketch: BattleCard, load_all_cards(), default_card(), and extract_keywords()
# are assumed to be defined elsewhere in the project.
def get_battle_card(incident_type: str) -> BattleCard:
    # 1. Match incident to card
    cards = load_all_cards()
    matching = [c for c in cards if matches(c, incident_type)]
    if not matching:
        return default_card()  # Generic incident response
    # 2. Return the most specific match
    return max(matching, key=lambda c: c.specificity)

def matches(card: BattleCard, incident_type: str) -> bool:
    # Match by keywords in the scenario text (assumes card.scenario is a string)
    keywords = extract_keywords(incident_type)
    return any(kw in card.scenario for kw in keywords)
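To try the retrieval logic end to end, here is a self-contained variant. It is a sketch under several assumptions: the `BattleCard` fields, the stopword list, the `BC-000` fallback card, and the `specificity` scores are all invented for illustration. It also compares whole words rather than substrings, which avoids accidental matches such as "db" inside another word.

```python
import re
from dataclasses import dataclass


@dataclass
class BattleCard:
    card_id: str
    title: str
    scenario: str
    specificity: int  # higher means the card targets a narrower scenario


STOPWORDS = {"the", "a", "an", "is", "are", "on", "in", "to", "of", "and"}


def extract_keywords(text):
    """Lowercase the text and keep the alphanumeric words that are not stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def matches(card, incident_type):
    """A card matches when it shares at least one keyword with the incident text."""
    return bool(extract_keywords(incident_type) & extract_keywords(card.scenario))


# Hypothetical fallback card for incidents that match nothing.
DEFAULT_CARD = BattleCard("BC-000", "Generic Incident", "unknown incident", 0)


def get_battle_card(incident_type, cards):
    matching = [c for c in cards if matches(c, incident_type)]
    if not matching:
        return DEFAULT_CARD
    return max(matching, key=lambda c: c.specificity)


CARDS = [
    BattleCard("BC-001", "Checkout 5xx Errors", "checkout service returning 5xx errors", 2),
    BattleCard("BC-003", "Database Failover", "database failover required", 1),
]

print(get_battle_card("Checkout errors spiking", CARDS).card_id)
print(get_battle_card("DNS resolution failing", CARDS).card_id)
```

Passing the card list in as an argument, instead of loading it inside the function, keeps the lookup easy to test in a drill.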
5. Implementation Guide
5.1 Development Environment Setup
No special setup required—these are Markdown documents.
Optional tools:
- Markdown editor (VS Code, Obsidian)
- Static site generator (MkDocs) for nice rendering
- Slack app for quick access
5.2 Project Structure
battle-cards/
├── cards/
│ ├── tier-1/
│ │ ├── bc-001-checkout-5xx.md
│ │ ├── bc-002-payment-down.md
│ │ └── bc-003-db-failover.md
│ ├── tier-2/
│ │ └── bc-010-search-slow.md
│ └── security/
│ └── bc-050-cred-leak.md
├── templates/
│ └── battle-card-template.md
├── index.md # Card directory
└── review-log.md # When cards were reviewed
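With a consistent naming scheme, `index.md` does not have to be maintained by hand. The sketch below derives a simple index from a list of file paths, assuming every card follows the `bc-NNN-slug.md` pattern shown above. The `build_index` function and its two-column table are illustrative, and columns such as Primary Lead would still have to come from each card's metadata.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

CARD_NAME = re.compile(r"^bc-(\d{3})-(.+)\.md$")


def build_index(paths):
    """Group card files by parent folder and render a Markdown index."""
    groups = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        found = CARD_NAME.match(path.name)
        if found is None:
            continue  # skip files that don't follow the naming convention
        number, slug = found.groups()
        groups[path.parent.name].append((f"BC-{number}", slug, raw))

    lines = ["# Incident Battle Cards Index"]
    for group in sorted(groups):
        lines.append(f"## {group}")
        lines.append("| ID | Card |")
        lines.append("|----|------|")
        for card_id, slug, raw in sorted(groups[group]):
            lines.append(f"| {card_id} | [{slug}](./{raw}) |")
    return "\n".join(lines)


paths = [
    "cards/tier-1/bc-001-checkout-5xx.md",
    "cards/tier-1/bc-002-payment-down.md",
    "cards/security/bc-050-cred-leak.md",
    "templates/battle-card-template.md",
]
print(build_index(paths))
```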
5.3 The Core Question You’re Answering
“Can your team handle a P0 incident without the ‘Senior Dev’ being awake?”
If your operating model depends on a single person’s intuition, it is not scalable. Battle Cards turn intuition into a repeatable process.
5.4 Concepts You Must Understand First
Stop and research these before writing:
- The Incident Command System (ICS)
- What are the standard roles in an incident?
- Book Reference: “The Site Reliability Workbook” Ch. 9
- The OODA Loop
- How do you speed up decision-making during a crisis?
- Reference: Military strategy literature
- Checklist Manifesto
- Why do simple checklists save lives?
- Book Reference: “The Checklist Manifesto” by Atul Gawande
5.5 Questions to Guide Your Design
Before writing, think through these:
Simplicity
- Can a tired person read this at 3 AM?
- Are the links easy to click on mobile?
- Is there too much text?
Interaction
- When does the card tell you to stop and call another team?
- How is that “handover” defined?
- What if multiple cards might apply?
Maintenance
- Who updates the card after an incident?
- How do you ensure cards don’t go stale?
- How do you test that links still work?
5.6 Thinking Exercise
The “Blank Screen” Drill
Imagine you’re on-call and see a dashboard where every chart is red. You have 3 minutes to decide who to page.
Questions:
- Does your Battle Card help narrow down the source?
- Does it tell you who to notify before you start the fix?
- If the fix takes 2 hours, does the card have a schedule for updates?
Write down:
- The first thing you’d look at
- The first person you’d contact
- When you’d send the first status update
5.7 Hints in Layers
Hint 1: Pick the Top 5 Incidents
Don’t write cards for everything. Start with the 5 incidents that happen most often or cause the most damage.
Hint 2: Follow the 3-Step Rule
Every card should have exactly 3 “Immediate Actions.” More than 3 overwhelms people under stress.
Hint 3: Explicitly Define Notification
The card must say: “At minute 10, post to #incidents. At minute 30, page the CTO.”
Hint 4: Use a Template
Every card should have the same layout. People need to know where to look for “Links” or “Contacts” without thinking.
5.8 The Interview Questions They’ll Ask
Prepare to answer these:
- “How do you organize an incident response team?”
- ICS roles: Incident Commander, Comms Lead, Ops Lead, Scribe
- “What is the role of a ‘Communications Lead’?”
- Keep stakeholders informed, manage external comms, shield responders from interruptions
- “How do you conduct a blameless post-mortem?”
- Focus on systems not people, find contributing factors, generate action items
- “Why is it important to have pre-defined Battle Cards?”
- Reduces cognitive load, ensures consistent response, enables anyone to respond
- “How do you measure the effectiveness of your incident response?”
- MTTA, MTTR, incident count, severity distribution, post-mortem action completion
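MTTA and MTTR from the last answer are simple averages over incident timestamps. The sketch below assumes MTTA is measured from alert to acknowledgement and MTTR from alert to resolution; teams define these differently, so treat the definitions, the `Incident` type, and the sample timestamps as illustrative.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    alerted_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def minutes_between(start, end):
    return (end - start).total_seconds() / 60


def mtta(incidents):
    """Mean time to acknowledge, in minutes."""
    return mean(minutes_between(i.alerted_at, i.acknowledged_at) for i in incidents)


def mttr(incidents):
    """Mean time to resolve, in minutes, measured from the alert."""
    return mean(minutes_between(i.alerted_at, i.resolved_at) for i in incidents)


# Made-up sample incidents.
incidents = [
    Incident(datetime(2025, 1, 15, 3, 0), datetime(2025, 1, 15, 3, 2), datetime(2025, 1, 15, 3, 20)),
    Incident(datetime(2025, 1, 20, 14, 0), datetime(2025, 1, 20, 14, 6), datetime(2025, 1, 20, 14, 50)),
]

print(f"MTTA: {mtta(incidents):.1f} min, MTTR: {mttr(incidents):.1f} min")
```

Comparing these numbers before and after introducing Battle Cards is one way to check whether the cards actually help.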
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Incident Response | “The Site Reliability Workbook” | Ch. 9: Incident Response |
| Post-mortems | “SRE Book” (Google) | Ch. 15: Postmortem Culture |
| Checklists | “The Checklist Manifesto” | All |
5.10 Implementation Phases
Phase 1: Template Design (2-3 hours)
- Design card format (see example above)
- Test with 2-3 engineers for readability
- Finalize template
Phase 2: Card Creation (4-5 hours)
- Identify top 5 incident types
- Interview subject matter experts
- Write cards for each
Phase 3: Access Setup (1-2 hours)
- Pin cards in Slack incident channel
- Add to on-call rotation wiki
- Test retrieval speed
Phase 4: Review Process (1-2 hours)
- Add card review to post-mortem template
- Schedule quarterly card audit
- Assign owners to each card
5.11 Key Implementation Decisions
| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| Format | Markdown | HTML | Markdown (easy to edit) |
| Access | Slack pinned | Wiki page | Both (redundancy) |
| Structure | Flat list | By severity | By severity (faster triage) |
| Updates | Ad-hoc | Post-incident | Post-incident (systematic) |
6. Testing Strategy
Readability Tests
- Show card to on-call engineer for 30 seconds
- Ask them to explain the first 3 steps
- If they can’t, simplify the card
Link Tests
- Click every link in every card monthly
- Automate with a script if possible
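One way to automate the link test is sketched below. It assumes links are written as Markdown links with an explicit `http://` or `https://` scheme (the example card shows bare hostnames, which this pattern would not catch) and uses only the standard library. The `check` parameter is an assumption added so the logic can be tested without network access.

```python
import http.client
import re
import urllib.request

# Matches [text](http://...) or [text](https://...) and captures the URL.
MD_LINK = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")


def extract_links(markdown):
    """Return the unique absolute URLs found in Markdown links, sorted."""
    return sorted(set(MD_LINK.findall(markdown)))


def is_reachable(url, timeout=5.0):
    """Send a HEAD request; any error or HTTP error status counts as unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        return False


def broken_links(markdown, check=is_reachable):
    """Return every link in the card for which the check fails."""
    return [url for url in extract_links(markdown) if not check(url)]


sample = (
    "See [dashboard](https://grafana.internal/d/checkout-overview) and "
    "[runbook](https://wiki.internal/runbooks/checkout-5xx)."
)
print(extract_links(sample))
```

Some servers reject HEAD requests or require authentication, so a failed check is a prompt to look at the link, not proof that it is dead.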
Drill Tests
- Run tabletop exercises using cards
- Time how long it takes to start response
- Identify gaps in cards
7. Common Pitfalls & Debugging
| Problem | Symptom | Root Cause | Fix |
|---|---|---|---|
| Too much text | People don’t read it | Tried to cover every case | Focus on first 3 actions only |
| Can’t find card | Searching during incident | Poor organization | Pin in Slack, use consistent naming |
| Stale links | Dashboards 404 | No maintenance process | Add link check to quarterly review |
| Wrong actions | Card causes harm | Written without validation | Test with dry runs, review after use |
8. Extensions & Challenges
Extension 1: Interactive Cards
Build a web app where clicking through the card records your actions (audit trail).
Extension 2: Slack Bot
/battlecard checkout-5xx returns the card directly in Slack.
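The lookup behind such a command can be very small. The sketch below covers only the file resolution step: it assumes the bot receives the text after the command name (for example `checkout-5xx`) and that cards follow the `bc-NNN-slug.md` naming used in this project. The `resolve_card` function is hypothetical, and the Slack integration itself is not shown.

```python
from pathlib import Path


def resolve_card(command_text, cards_dir):
    """Map the text after /battlecard to a card file, or None if nothing matches."""
    slug = command_text.strip().lower()
    # Accept only letters, digits, and hyphens so the text cannot escape cards_dir.
    if not slug or not all(ch.isalnum() or ch == "-" for ch in slug):
        return None
    found = sorted(Path(cards_dir).rglob(f"bc-*-{slug}.md"))
    return found[0] if found else None


card = resolve_card("checkout-5xx", "battle-cards/cards")
print(card if card else "No card found; fall back to the generic incident card.")
```

Rejecting anything other than letters, digits, and hyphens keeps user input from being used as a path or glob pattern.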
Extension 3: Card Analytics
Track which cards are used most, which have the best outcomes.
Extension 4: AI-Assisted Triage
Based on alert content, suggest which battle card to use.
9. Real-World Connections
Examples from Industry:
- PagerDuty: Incident response documentation best practices
- Atlassian: Incident Management Playbook
- Google: “Wheel of Misfortune” training with runbooks
Tools:
- PagerDuty Runbook Automation
- Rootly - Incident management with runbooks
- Blameless - Incident retrospectives
10. Resources
Related Projects
- P04: Escalation Logic Tree - Automated escalation
- P03: Ownership Mapper - Who to page
11. Self-Assessment Checklist
Before considering this project complete, verify:
- I can explain the OODA Loop
- Template is readable at 3 AM (tested with colleague)
- At least 5 cards are created for top incidents
- Cards are accessible from Slack in < 30 seconds
- Each card has an owner
- Review process is defined
- At least one card has been used in a drill or a real incident
12. Submission / Completion Criteria
This project is complete when you have:
- Template that fits on one page
- 5+ Battle Cards for common incidents
- Index page listing all cards
- Access mechanism (Slack pin, wiki)
- Review process documented
- Test run of at least one card in a drill or real incident